Initiative: Platform Observability Modernization

The Bet

Hypothesis: We believe that modernizing the observability stack — standardizing log shipping to a single Fluent Bit pipeline, implementing tiered storage (hot/warm/cold), adopting OpenTelemetry for distributed tracing, and moving to metrics-first monitoring with Prometheus/Micrometer — will reduce logging infrastructure costs by 70% and enable sub-minute incident detection across the entire platform.

Team Metric: Logging infrastructure cost and incident detection time

Current Baseline:
- Log shipping: 5 different mechanisms across K8s, Elastic Beanstalk, Chef, Lambda, and S3
- Distributed tracing: none (MDC configured in log4j2 but never populated)
- Metrics: limited, no Prometheus/Micrometer instrumentation
- Storage: single tier in Sumo Logic (all logs same retention/cost)
- Incident detection: relies on Sumo Logic scheduled queries → Lambda → email alerts

Target:
- Log shipping: 1 unified mechanism (Fluent Bit everywhere)
- Distributed tracing: OpenTelemetry across frontend → backend → services
- Metrics: Prometheus/Micrometer for all critical paths
- Storage: 3-tier (hot 7d / warm 30d / cold 1yr), targeting a 70% cost reduction
- Incident detection: real-time metrics alerts + trace-based debugging

Measurement: Monitor Sumo Logic ingestion costs, count of shipping mechanisms, trace coverage percentage, alert-to-diagnosis time.

Why This Bet?

Research across 150 duettoresearch organization repositories revealed a fragmented observability architecture that has grown organically over time. Five different log shipping mechanisms coexist — Fluent Bit DaemonSets on K8s, native Sumo Logic collectors on Elastic Beanstalk, Chef-installed collectors on legacy infrastructure, a custom Python library for Lambda/Glue, and S3 bucket scanning. This fragmentation creates operational overhead, inconsistent log formats, and difficulty in cross-service correlation.

The application teams are addressing their immediate observability needs (see I-2026-GC-002 for pricing/groups logging improvements). This initiative builds the long-term infrastructure foundation that those application-level improvements will benefit from.

Evidence/Signal

5 shipping mechanisms discovered: K8s Fluent Bit (ops-k8s), EB native collectors (7+ service repos), Chef cookbook (chef-cookbooks/sumologic-collector), Lambda library (datapipelines/de_libs/common-sumo), S3 scanning (openspace-infra/sumo)
No distributed tracing: log4j2 pattern includes traceId=%X{trace_id};spanId=%X{span_id} but MDC is never populated — wasted configuration
OpenTelemetry hint exists: -Ddd.trace.otel.enabled=true (a Datadog Java tracer property that turns on its OpenTelemetry API support) appears in configs but is not fully leveraged
Single-tier storage: All logs have same retention and cost regardless of value
Frontend observability gap: No Web Vitals, no RUM, no session replay

Strategic Alignment

OKR: Reduce infrastructure operational costs and improve platform reliability
Company Priority: Scalable infrastructure that supports growth without proportional cost increase

Current State: Log Pipeline Architecture

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                         │
├──────────────────────┬──────────────────────────────────────┤
│ duetto (backend)     │ duetto-frontend                      │
│ log4j2 → rolling     │ Client: Sentry (LoggerUtil)          │
│ files + stdout       │ Server: Pino → stdout                │
└──────────┬───────────┴───────────┬──────────────────────────┘
           │                       │
┌──────────┴───────────────────────┴──────────────────────────┐
│      5 PARALLEL SHIPPING MECHANISMS (current)               │
├─────────────┬──────────────┬────────────┬───────────────────┤
│ ① Fluent Bit│ ② Sumo       │ ③ Chef     │ ④ de-sumo-log    │
│   (K8s/EKS) │ Collector(EB)│ Cookbook    │   (Lambda/Glue)  │
│   → Kinesis │ on EC2       │ Installed  │   → HTTP          │
│   → Firehose│              │ Collector  │                   │
│             │              │            │  ⑤ S3 Scanning    │
└─────────────┴──────────────┴────────────┴───────────────────┘
                          │
                    ┌─────┴─────┐
                    │SUMO LOGIC │  ← single tier, all logs
                    └─────┬─────┘
                          │
                    Lambda Monitor → Email Alerts

Key Repos:
- K8s shipping: ops-k8s/modules/kubernetes/helm-charts/fluentbit/
- EB collectors: .ebextensions/sumo_logic.config in hotel-domain, group-domain, rate-management-domain, etc.
- Chef cookbook: chef-cookbooks/sumologic-collector/
- Lambda library: datapipelines/de_libs/common-sumo/
- S3 scanning: openspace-infra/sumo/

Discovery Plan

Time Box

Max Duration: 8 weeks (research + proof of concepts)
Max Experiments: 3
Decision Date: 2026-04-23

Planned Experiments

E-2026-OBS-001: OpenTelemetry POC — Instrument one service with OTel SDK, validate trace propagation through Fluent Bit to a trace backend (Jaeger/Tempo). Measure overhead and trace completeness. (Weeks 1-3)
E-2026-OBS-002: Tiered storage pilot — Configure Sumo Logic partitions with differentiated retention + set up S3 archival with Athena queries for warm tier. Measure cost reduction and query performance. (Weeks 3-5)
E-2026-OBS-003: Unified Fluent Bit migration — Migrate one EB service from native Sumo collector to Fluent Bit sidecar. Validate log completeness and operational simplicity. (Weeks 5-7)

Recommendations (Long-Term Platform Changes)

L1: Metrics-First Observability (Prometheus/Micrometer)

Instrument critical paths with metrics instead of relying on log analysis:

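import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.stereotype.Component;

@Component
public class PricingMetrics {
    private final MeterRegistry registry;

    // Constructor injection, so the final field is always assigned.
    public PricingMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordOptimization(OptimizationContext ctx) {
        // Tags are passed as key/value pairs when the meter is looked up;
        // assumes the getters return Strings and getDuration() a java.time.Duration.
        // A per-hotel tag can create many time series, so watch cardinality.
        registry.timer("pricing.optimization.duration",
                "hotel", ctx.getHotelId(),
                "status", ctx.getStatus())
            .record(ctx.getDuration());

        registry.counter("pricing.rules.applied", "type", ctx.getRuleType())
            .increment(ctx.getRulesCount());
    }
}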

Where: New instrumentation across duetto monolith, domain services
Impact: Real-time alerting, Grafana dashboards, no Sumo Logic query cost
Effort: 10-15 days
Dependencies: Prometheus infrastructure (or hosted solution)

L2: Tiered Storage Strategy

Tier	Retention	Storage	Cost/GB	Use Case
Hot	7 days	Sumo Logic	~$3/GB	Active debugging, real-time queries
Warm	7-30 days	S3 Standard + Athena	~$0.023/GB	Incident investigation, ad-hoc queries
Cold	30+ days	S3 Glacier	~$0.004/GB	Compliance, audit, long-term archive

Impact: ~70% cost reduction on log storage
Effort: 5-8 days
Dependencies: S3 lifecycle policies, Athena table definitions
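
The tier boundaries and per-GB figures in the table can be turned into a small calculation. The following is a minimal sketch, not the planned implementation: the class and method names are hypothetical, the day-7 and day-30 boundaries are assumed to belong to the hotter tier, and the table's per-GB figures are treated as directly comparable even though Sumo Logic and S3 may bill on different bases.

```java
import java.util.EnumMap;
import java.util.Map;

/** Sketch of tier routing and cost arithmetic; all volumes are hypothetical. */
final class TieredStorageSketch {

    /** Tiers with the approximate per-GB costs from the table above. */
    enum Tier {
        HOT(3.0), WARM(0.023), COLD(0.004);

        final double costPerGb;

        Tier(double costPerGb) {
            this.costPerGb = costPerGb;
        }
    }

    /** Assumed boundaries: hot through day 7, warm through day 30, cold afterwards. */
    static Tier tierFor(int ageDays) {
        if (ageDays < 0) {
            throw new IllegalArgumentException("ageDays must be non-negative");
        }
        if (ageDays <= 7) {
            return Tier.HOT;
        }
        return ageDays <= 30 ? Tier.WARM : Tier.COLD;
    }

    /** Total cost for the given volume (GB) held in each tier. */
    static double cost(Map<Tier, Double> gbPerTier) {
        double total = 0.0;
        for (Map.Entry<Tier, Double> entry : gbPerTier.entrySet()) {
            total += entry.getKey().costPerGb * entry.getValue();
        }
        return total;
    }

    /** Fractional saving versus holding the same total volume entirely in the hot tier. */
    static double savingVersusSingleTier(Map<Tier, Double> gbPerTier) {
        double totalGb = 0.0;
        for (double gb : gbPerTier.values()) {
            totalGb += gb;
        }
        if (totalGb == 0.0) {
            return 0.0; // nothing stored, nothing saved
        }
        return 1.0 - cost(gbPerTier) / (totalGb * Tier.HOT.costPerGb);
    }

    public static void main(String[] args) {
        Map<Tier, Double> volumes = new EnumMap<>(Tier.class);
        volumes.put(Tier.HOT, 30.0);  // hypothetical volume split
        volumes.put(Tier.WARM, 70.0);
        System.out.printf("saving: %.1f%%%n", 100 * savingVersusSingleTier(volumes));
    }
}
```

With the hypothetical split of 30 GB hot and 70 GB warm, the saving comes out at roughly 69%. The real figure depends on the actual volume per tier, which the pilot (E-2026-OBS-002) is meant to measure.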

L3: Distributed Tracing (OpenTelemetry)

End-to-end request tracing from frontend to all backend services:

Leverage existing OTel hint (-Ddd.trace.otel.enabled=true)
Instrument frontend (Sentry Performance or OTel JS SDK)
Deploy Jaeger or Grafana Tempo for trace storage
Correlate traces with logs and metrics
Where: All services, frontend, infrastructure
Impact: Complete request flow visibility, sub-minute root cause identification
Effort: 15-20 days
Dependencies: Trace backend infrastructure, I-2026-GC-002 delivers MDC/correlation foundation
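
To make the correlation step concrete, here is a minimal sketch of how a service could read the W3C `traceparent` header (the propagation format OpenTelemetry uses by default) and derive the `trace_id` and `span_id` values that the existing log4j2 pattern expects. It uses only the JDK. The class and method names are hypothetical, and in practice the OTel SDK or agent would handle this instead of hand-written code. The sketch accepts only version `00` headers, which is a simplification.

```java
import java.security.SecureRandom;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative W3C Trace Context handling; not a replacement for the OTel SDK. */
final class TraceContextSketch {

    // Layout: version "00", 32-hex trace-id, 16-hex parent-id, 2-hex flags.
    private static final Pattern TRACEPARENT =
        Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns trace_id/span_id for a valid header, or an empty map otherwise. */
    static Map<String, String> extract(String traceparent) {
        Map<String, String> context = new LinkedHashMap<>();
        if (traceparent == null) {
            return context;
        }
        Matcher m = TRACEPARENT.matcher(traceparent.trim());
        if (!m.matches()) {
            return context;
        }
        // All-zero ids are invalid per the W3C specification.
        if (m.group(1).matches("0+") || m.group(2).matches("0+")) {
            return context;
        }
        context.put("trace_id", m.group(1)); // keys match the log4j2 pattern's %X{...}
        context.put("span_id", m.group(2));
        return context;
    }

    /** Builds the header for an outgoing call: same trace id, fresh span id. */
    static String childHeader(String traceId) {
        return "00-" + traceId + "-" + randomHex(8) + "-01"; // "01" marks the trace as sampled
    }

    private static String randomHex(int byteCount) {
        byte[] buffer = new byte[byteCount];
        RANDOM.nextBytes(buffer);
        StringBuilder hex = new StringBuilder(byteCount * 2);
        for (byte b : buffer) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        String incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
        Map<String, String> context = extract(incoming);
        System.out.println(context);
        System.out.println(childHeader(context.get("trace_id")));
    }
}
```

The map returned by `extract` holds the values that would be placed into the logging context at request entry, so that every log line in the request carries the same trace id as the spans sent to the trace backend.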

L4: Standardize Log Shipping

Consolidate the 5 mechanisms into 1 primary shipper (Fluent Bit everywhere), retaining S3 scanning only as an archival backup:

Current Mechanism	Migration Path
① Fluent Bit (K8s)	Keep — already the standard
② Native Sumo Collector (EB)	Replace with Fluent Bit sidecar
③ Chef Cookbook Collector	Replace with Fluent Bit on hosts
④ de-sumo-log (Lambda)	Replace with CloudWatch → Fluent Bit subscription
⑤ S3 Scanning	Keep as archival backup only

Impact: Single config to manage, consistent log format, reduced operational burden
Effort: 8-10 days
Dependencies: EB → K8s migration roadmap may make some obsolete
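
When a service is migrated with the old and new shippers running in parallel, log completeness can be checked by comparing what each pipeline delivered. The sketch below assumes every log event carries a unique identifier that survives both pipelines; that identifier, and the class and method names, are hypothetical and not something the source confirms exists today. The 0.1% threshold mirrors the kill criterion for log loss.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Compares event ids delivered by the old and new shippers during a parallel run. */
final class LogCompletenessSketch {

    /** Percentage of events seen by the old pipeline that the new pipeline did not deliver. */
    static double lossPercent(Set<String> oldPipelineIds, Set<String> newPipelineIds) {
        if (oldPipelineIds.isEmpty()) {
            return 0.0; // no reference events, so no measurable loss
        }
        Set<String> missing = new HashSet<>(oldPipelineIds);
        missing.removeAll(newPipelineIds);
        return 100.0 * missing.size() / oldPipelineIds.size();
    }

    /** True when the measured loss stays at or below the 0.1% kill threshold. */
    static boolean withinLossBudget(double lossPercent) {
        return lossPercent <= 0.1;
    }

    public static void main(String[] args) {
        Set<String> oldIds = new HashSet<>(Arrays.asList("e1", "e2", "e3", "e4"));
        Set<String> newIds = new HashSet<>(Arrays.asList("e1", "e2", "e3"));
        double loss = lossPercent(oldIds, newIds);
        System.out.println("loss % = " + loss + ", within budget = " + withinLossBudget(loss));
    }
}
```

The comparison is one-directional on purpose: events that only the new pipeline delivers are not counted as loss, since the question for the migration is whether anything the native collector captured goes missing.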

L5: Frontend Session Replay

Add visual debugging capability for frontend issues:

Options: Sentry Session Replay (already have Sentry), LogRocket, or FullStory
Impact: Visual reproduction of user issues without asking for steps
Effort: 3-5 days
Dependencies: Sentry plan upgrade (if using Sentry Replay)

Exit Criteria

Validate If:

OpenTelemetry POC successfully traces a request end-to-end with <5% overhead
Tiered storage achieves >50% cost reduction in pilot
Fluent Bit migration maintains 100% log completeness vs native collector
At least 2 recommendations demonstrate measurable improvement

Kill If:

OpenTelemetry overhead exceeds 10% on critical paths
Tiered storage queries in Athena are too slow for incident response (>30s)
Fluent Bit migration causes log loss >0.1%
After 8 weeks, no clear path to 50%+ cost reduction
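
The validate and kill thresholds can be read as a simple decision rule. The sketch below is one possible encoding under stated assumptions: any single kill condition is treated as sufficient, all validate conditions are treated as required, and "no clear path to 50%+ cost reduction" is approximated as the measured reduction being below 50% once the time box has elapsed. The class, enum, and parameter names are hypothetical.

```java
/** One possible encoding of the exit criteria; thresholds are copied from the lists above. */
final class ExitCriteriaSketch {

    enum Decision { VALIDATE, KILL, CONTINUE }

    static Decision decide(boolean tracedEndToEnd,
                           double otelOverheadPct,
                           double costReductionPct,
                           double athenaQuerySeconds,
                           double logLossPct,
                           int recommendationsImproved,
                           boolean timeBoxElapsed) {
        // Assumption: any one kill condition ends the initiative.
        boolean kill = otelOverheadPct > 10.0
            || athenaQuerySeconds > 30.0
            || logLossPct > 0.1
            || (timeBoxElapsed && costReductionPct < 50.0);
        if (kill) {
            return Decision.KILL;
        }
        // Assumption: every validate condition must hold together.
        boolean validate = tracedEndToEnd
            && otelOverheadPct < 5.0
            && costReductionPct > 50.0
            && logLossPct == 0.0
            && recommendationsImproved >= 2;
        return validate ? Decision.VALIDATE : Decision.CONTINUE;
    }

    public static void main(String[] args) {
        // Hypothetical experiment results.
        System.out.println(decide(true, 3.0, 60.0, 5.0, 0.0, 2, false));
        System.out.println(decide(true, 12.0, 60.0, 5.0, 0.0, 2, false));
    }
}
```

Results that fall between the two lists, such as 7% tracing overhead, return `CONTINUE` here, which makes the gap between the validate and kill thresholds explicit.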

Cost-Benefit Analysis

Recommendation	Effort (Days)	Annual Cost Impact	Operational Impact
L1: Metrics-First	10-15	-$3,600/yr (reduced Sumo queries)	Real-time alerting
L2: Tiered Storage	5-8	-$42,000/yr (70% storage reduction)	Slightly slower warm queries
L3: OpenTelemetry	15-20	+$1,200/yr (trace backend)	End-to-end request tracing
L4: Standardize Shipping	8-10	-$2,400/yr (ops overhead)	Simplified operations
L5: Session Replay	3-5	+$6,000/yr (tooling)	Visual debugging
Total	41-58	-$40,800/yr net savings	Modern observability
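
The totals row can be re-derived from the individual rows. The short sketch below does that arithmetic; the class name is hypothetical and the figures are copied from the table. It also computes the storage baseline implied by the L2 row, on the assumption that the $42,000 saving and the 70% reduction refer to the same annual spend. That baseline is inferred here and is not stated in the source.

```java
/** Re-derives the totals row of the cost-benefit table. */
final class CostBenefitSketch {

    // Annual cost impact in USD for L1..L5 (negative values are savings).
    static final int[] ANNUAL_IMPACT_USD = {-3_600, -42_000, 1_200, -2_400, 6_000};
    // Effort ranges in days for L1..L5.
    static final int[] EFFORT_MIN_DAYS = {10, 5, 15, 8, 3};
    static final int[] EFFORT_MAX_DAYS = {15, 8, 20, 10, 5};

    static int sum(int[] values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    /** Annual storage spend implied by a saving that equals the given percentage of it. */
    static int impliedBaselineUsd(int annualSavingUsd, int reductionPercent) {
        return annualSavingUsd * 100 / reductionPercent;
    }

    public static void main(String[] args) {
        System.out.println("net annual impact (USD): " + sum(ANNUAL_IMPACT_USD));
        System.out.println("effort (days): " + sum(EFFORT_MIN_DAYS) + "-" + sum(EFFORT_MAX_DAYS));
        System.out.println("implied storage baseline (USD/yr): " + impliedBaselineUsd(42_000, 70));
    }
}
```

The sums reproduce the table's totals of 41-58 days and $40,800 in net annual savings, and the implied storage baseline is $60,000 per year.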

Related Initiatives

I-2026-GC-002 (applications/gamechanger): Pricing & Rate-Management Observability. Delivers the short/mid-term application-level logging improvements (smart sampling, structured logging, Sentry enrichment, correlation IDs). This initiative (I-2026-OBS-001) builds the long-term infrastructure foundation.
I-2026-ING-001 (engineering-platform/ingestion): Frontdoor — may benefit from standardized log shipping (L4)

Decision Log

Date	Decision	Rationale	Next Step
2026-02-25	Mapped complete Sumo Logic pipeline across 150 repos	5 shipping mechanisms discovered — fragmentation is a cost and ops problem	Quantify cost impact
2026-02-26	Split from app-level initiative (I-2026-GC-002)	App teams own short/mid-term code changes; platform owns long-term infrastructure	Scope platform experiments
2026-02-26	Created initiative I-2026-OBS-001 focused on L1-L5	Long-term modernization requires dedicated platform engineering effort	Begin E-2026-OBS-001 (OTel POC)

Risk Assessment

Risk	Severity	Likelihood	Mitigation
OpenTelemetry overhead too high for pricing critical path	High	Low	Profile in staging; use sampling if needed
Tiered storage Athena queries too slow for incidents	Medium	Medium	Keep 7-day hot tier for active debugging
Fluent Bit migration causes log gaps during cutover	Medium	Low	Run parallel (old + new) during migration
Team bandwidth insufficient for platform changes	Medium	High	Sequence after I-2026-GC-002 delivers app-level wins
Sumo Logic contract limits partition/budget features	Low	Medium	Evaluate alternatives if needed (Grafana Loki)

Research Artifacts

Unified Strategy: duetto/docs/research/duetto-logging-observability-strategy-2026-02-25.md
Sumo Logic Shipper Discovery: docs/research/sumo-logic-log-shipper-discovery-2026-02-25.md (local research — not in repo)
Sumo Logic Integration Analysis: duetto/docs/research/sumo-logic-sentry-integration-analysis-2026-02-25.md
Logging Strategy Analysis: duetto/docs/research/logging-strategy-analysis-2026-02-25.md

Team & Stakeholders

Initiative Owner: Antonio (Platform Engineering)
Engineering: Platform Engineering / SRE team
Stakeholders: All application teams (beneficiaries of improved infrastructure)
Repos Affected: ops-k8s, chef-cookbooks, datapipelines, openspace-infra, duetto (instrumentation)

Outcome

Status: discovery
Key Learning: [To be completed after experiments]
Next Step: Begin experiment E-2026-OBS-001 (OpenTelemetry POC), sequenced after I-2026-GC-002 delivers the application-level foundation
