initiative discovery

Platform Observability Modernization: Unified Logging Infrastructure

Antonio Updated 2026-03-11 engineering-platform infrastructure
observability infrastructure sumo-logic opentelemetry sre q1-2026

Initiative: Platform Observability Modernization

The Bet

Hypothesis: We believe that modernizing the observability stack — standardizing log shipping to a single Fluent Bit pipeline, implementing tiered storage (hot/warm/cold), adopting OpenTelemetry for distributed tracing, and moving to metrics-first monitoring with Prometheus/Micrometer — will reduce logging infrastructure costs by 70% and enable sub-minute incident detection across the entire platform.

Team Metric: Logging infrastructure cost and incident detection time

Current Baseline:

  • Log shipping: 5 different mechanisms across K8s, Elastic Beanstalk, Chef, Lambda, and S3
  • Distributed tracing: none (MDC configured in log4j2 but never populated)
  • Metrics: limited, no Prometheus/Micrometer instrumentation
  • Storage: single tier in Sumo Logic (all logs same retention/cost)
  • Incident detection: relies on Sumo Logic scheduled queries → Lambda → email alerts

Target:

  • Log shipping: 1 unified mechanism (Fluent Bit everywhere)
  • Distributed tracing: OpenTelemetry across frontend → backend → services
  • Metrics: Prometheus/Micrometer for all critical paths
  • Storage: 3-tier (hot 7d / warm 30d / cold 1yr) for a 70% cost reduction
  • Incident detection: real-time metrics alerts + trace-based debugging

Measurement: Monitor Sumo Logic ingestion costs, count of shipping mechanisms, trace coverage percentage, alert-to-diagnosis time.

Why This Bet?

Research across 150 duettoresearch organization repositories revealed a fragmented observability architecture that has grown organically over time. Five different log shipping mechanisms coexist — Fluent Bit DaemonSets on K8s, native Sumo Logic collectors on Elastic Beanstalk, Chef-installed collectors on legacy infrastructure, a custom Python library for Lambda/Glue, and S3 bucket scanning. This fragmentation creates operational overhead, inconsistent log formats, and difficulty in cross-service correlation.

The application teams are addressing their immediate observability needs (see I-2026-GC-002 for pricing/groups logging improvements). This initiative builds the long-term infrastructure foundation that those application-level improvements will benefit from.

Evidence/Signal

  • 5 shipping mechanisms discovered: K8s Fluent Bit (ops-k8s), EB native collectors (7+ service repos), Chef cookbook (chef-cookbooks/sumologic-collector), Lambda library (datapipelines/de_libs/common-sumo), S3 scanning (openspace-infra/sumo)
  • No distributed tracing: log4j2 pattern includes traceId=%X{trace_id};spanId=%X{span_id} but MDC is never populated — wasted configuration
  • OpenTelemetry hint exists: the JVM flag -Ddd.trace.otel.enabled=true (a Datadog Java tracer property) is set in configs, but it is not fully leveraged
  • Single-tier storage: All logs have same retention and cost regardless of value
  • Frontend observability gap: No Web Vitals, no RUM, no session replay

Strategic Alignment

  • OKR: Reduce infrastructure operational costs and improve platform reliability
  • Company Priority: Scalable infrastructure that supports growth without proportional cost increase

Current State: Log Pipeline Architecture

┌─────────────────────────────────────────────────────────────┐
│                    APPLICATION LAYER                         │
├──────────────────────┬──────────────────────────────────────┤
│ duetto (backend)     │ duetto-frontend                      │
│ log4j2 → rolling     │ Client: Sentry (LoggerUtil)          │
│ files + stdout       │ Server: Pino → stdout                │
└──────────┬───────────┴───────────┬──────────────────────────┘
           │                       │
┌──────────┴───────────────────────┴──────────────────────────┐
│      5 PARALLEL SHIPPING MECHANISMS (current)               │
├─────────────┬──────────────┬────────────┬───────────────────┤
│ ① Fluent Bit│ ② Sumo       │ ③ Chef     │ ④ de-sumo-log    │
│   (K8s/EKS) │ Collector(EB)│ Cookbook    │   (Lambda/Glue)  │
│   → Kinesis │ on EC2       │ Installed  │   → HTTP          │
│   → Firehose│              │ Collector  │                   │
│             │              │            │  ⑤ S3 Scanning    │
└─────────────┴──────────────┴────────────┴───────────────────┘
                          │
                    ┌─────┴─────┐
                    │SUMO LOGIC │  ← single tier, all logs
                    └─────┬─────┘
                          │
                    Lambda Monitor → Email Alerts

Key Repos:

  • K8s shipping: ops-k8s/modules/kubernetes/helm-charts/fluentbit/
  • EB collectors: .ebextensions/sumo_logic.config in hotel-domain, group-domain, rate-management-domain, etc.
  • Chef cookbook: chef-cookbooks/sumologic-collector/
  • Lambda library: datapipelines/de_libs/common-sumo/
  • S3 scanning: openspace-infra/sumo/

Discovery Plan

Time Box

  • Max Duration: 8 weeks (research + proof of concepts)
  • Max Experiments: 3
  • Decision Date: 2026-04-23

Planned Experiments

  1. E-2026-OBS-001: OpenTelemetry POC — Instrument one service with OTel SDK, validate trace propagation through Fluent Bit to a trace backend (Jaeger/Tempo). Measure overhead and trace completeness. (Weeks 1-3)
  2. E-2026-OBS-002: Tiered storage pilot — Configure Sumo Logic partitions with differentiated retention + set up S3 archival with Athena queries for warm tier. Measure cost reduction and query performance. (Weeks 3-5)
  3. E-2026-OBS-003: Unified Fluent Bit migration — Migrate one EB service from native Sumo collector to Fluent Bit sidecar. Validate log completeness and operational simplicity. (Weeks 5-7)

Recommendations (Long-Term Platform Changes)

L1: Metrics-First Observability (Prometheus/Micrometer)

Instrument critical paths with metrics instead of relying on log analysis:

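import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class PricingMetrics {
    private final MeterRegistry registry;

    // Spring injects the application's MeterRegistry through the constructor.
    public PricingMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public void recordOptimization(OptimizationContext ctx) {
        // register() returns the already-registered meter when name and tags match.
        Timer.builder("pricing.optimization.duration")
            .tag("hotel", ctx.getHotelId())
            .tag("status", ctx.getStatus())
            .register(registry)
            .record(ctx.getDuration());

        Counter.builder("pricing.rules.applied")
            .tag("type", ctx.getRuleType())
            .register(registry)
            .increment(ctx.getRulesCount());
    }
}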
  • Where: New instrumentation across duetto monolith, domain services
  • Impact: Real-time alerting, Grafana dashboards, no Sumo Logic query cost
  • Effort: 10-15 days
  • Dependencies: Prometheus infrastructure (or hosted solution)

L2: Tiered Storage Strategy

| Tier | Retention | Storage | Cost/GB | Use Case |
|------|-----------|---------|---------|----------|
| Hot | 7 days | Sumo Logic | ~$3/GB | Active debugging, real-time queries |
| Warm | 7-30 days | S3 Standard + Athena | ~$0.023/GB | Incident investigation, ad-hoc queries |
| Cold | 30+ days | S3 Glacier | ~$0.004/GB | Compliance, audit, long-term archive |

  • Impact: ~70% cost reduction on log storage
  • Effort: 5-8 days
  • Dependencies: S3 lifecycle policies, Athena table definitions
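
The retention boundaries in the tier table can be expressed as a small routing rule. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes both boundaries are inclusive (a record exactly 7 days old is still hot, one exactly 30 days old is still warm), which the table does not specify.

```java
import java.time.Duration;

class LogTierRouter {
    /** Storage tiers from the tier table. */
    enum Tier { HOT, WARM, COLD }

    // Boundaries taken from the table: hot up to 7 days, warm up to 30 days.
    // Treating both limits as inclusive is an assumption of this sketch.
    static final Duration HOT_LIMIT = Duration.ofDays(7);
    static final Duration WARM_LIMIT = Duration.ofDays(30);

    /** Returns the tier a log record of the given age would live in. */
    static Tier tierFor(Duration age) {
        if (age.isNegative()) {
            throw new IllegalArgumentException("age must not be negative");
        }
        if (age.compareTo(HOT_LIMIT) <= 0) {
            return Tier.HOT;
        }
        if (age.compareTo(WARM_LIMIT) <= 0) {
            return Tier.WARM;
        }
        return Tier.COLD;
    }

    public static void main(String[] args) {
        // Print the tier for a few sample ages.
        for (int days : new int[] {2, 7, 12, 30, 90}) {
            System.out.println(days + "d -> " + tierFor(Duration.ofDays(days)));
        }
    }
}
```

In practice the warm-to-cold transition would more likely be handled by the S3 lifecycle policies listed under Dependencies than by application code; the rule above only makes the intended boundaries explicit.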

L3: Distributed Tracing (OpenTelemetry)

End-to-end request tracing from frontend to all backend services:

  • Leverage existing OTel hint (-Ddd.trace.otel.enabled=true)
  • Instrument frontend (Sentry Performance or OTel JS SDK)
  • Deploy Jaeger or Grafana Tempo for trace storage
  • Correlate traces with logs and metrics

  • Where: All services, frontend, infrastructure
  • Impact: Complete request flow visibility, sub-minute root cause identification
  • Effort: 15-20 days
  • Dependencies: Trace backend infrastructure, I-2026-GC-002 delivers MDC/correlation foundation
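
To make the correlation step concrete, the sketch below shows what populating the two MDC keys from the existing log4j2 pattern (trace_id and span_id) could look like, and how the same values map onto a W3C traceparent header. It is a plain-JDK illustration under stated assumptions: a Map stands in for the log4j2 thread context, the class and method names are hypothetical, and in a real rollout the OpenTelemetry SDK or agent would generate and propagate these identifiers.

```java
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

class TraceContextSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    /** Returns the given number of random bytes as lowercase hex (2 chars per byte). */
    static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    /** Builds the two entries that the existing log4j2 pattern reads via %X{...}. */
    static Map<String, String> newContext() {
        Map<String, String> mdc = new HashMap<>();
        mdc.put("trace_id", randomHex(16)); // 16 bytes -> 32 hex chars
        mdc.put("span_id", randomHex(8));   // 8 bytes -> 16 hex chars
        return mdc;
    }

    /** Renders a W3C traceparent value: version-traceid-parentid-flags. */
    static String traceparent(Map<String, String> mdc, boolean sampled) {
        return "00-" + mdc.get("trace_id") + "-" + mdc.get("span_id")
            + "-" + (sampled ? "01" : "00");
    }

    /** Mirrors the log fragment produced by traceId=%X{trace_id};spanId=%X{span_id}. */
    static String logFragment(Map<String, String> mdc) {
        return "traceId=" + mdc.get("trace_id") + ";spanId=" + mdc.get("span_id");
    }

    public static void main(String[] args) {
        Map<String, String> mdc = newContext();
        System.out.println(logFragment(mdc));
        System.out.println("traceparent: " + traceparent(mdc, true));
    }
}
```

Because the log line and the outgoing header carry the same identifiers, a log search on a traceId can be joined to the matching trace in the trace backend.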

L4: Standardize Log Shipping

Consolidate 5 mechanisms to 1 (Fluent Bit everywhere):

| Current Mechanism | Migration Path |
|-------------------|----------------|
| ① Fluent Bit (K8s) | Keep: already the standard |
| ② Native Sumo Collector (EB) | Replace with Fluent Bit sidecar |
| ③ Chef Cookbook Collector | Replace with Fluent Bit on hosts |
| ④ de-sumo-log (Lambda) | Replace with CloudWatch → Fluent Bit subscription |
| ⑤ S3 Scanning | Keep as archival backup only |

  • Impact: Single config to manage, consistent log format, reduced operational burden
  • Effort: 8-10 days
  • Dependencies: EB → K8s migration roadmap may make some obsolete

L5: Frontend Session Replay

Add visual debugging capability for frontend issues:

  • Options: Sentry Session Replay (already have Sentry), LogRocket, or FullStory
  • Impact: Visual reproduction of user issues without asking for steps
  • Effort: 3-5 days
  • Dependencies: Sentry plan upgrade (if using Sentry Replay)

Exit Criteria

Validate If:

  • OpenTelemetry POC successfully traces a request end-to-end with <5% overhead
  • Tiered storage achieves >50% cost reduction in pilot
  • Fluent Bit migration maintains 100% log completeness vs native collector
  • At least 2 recommendations demonstrate measurable improvement

Kill If:

  • OpenTelemetry overhead exceeds 10% on critical paths
  • Tiered storage queries in Athena are too slow for incident response (>30s)
  • Fluent Bit migration causes log loss >0.1%
  • After 8 weeks, no clear path to 50%+ cost reduction
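
The overhead and log-loss thresholds above can be checked mechanically once the experiments produce numbers. The following is a minimal sketch, not a prescribed evaluation method: the class, method, and enum names are hypothetical, and the INCONCLUSIVE verdict is an assumption that covers the ranges the criteria leave open (overhead from 5% to 10%, and non-zero loss up to 0.1%).

```java
class ExitCriteriaCheck {
    enum Verdict { VALIDATE, INCONCLUSIVE, KILL }

    /** Relative overhead in percent: (instrumented - baseline) / baseline * 100. */
    static double overheadPercent(double baselineMs, double instrumentedMs) {
        if (baselineMs <= 0) {
            throw new IllegalArgumentException("baseline must be positive");
        }
        return (instrumentedMs - baselineMs) / baselineMs * 100.0;
    }

    /** Percentage of reference records that the candidate shipper did not deliver. */
    static double logLossPercent(long referenceCount, long candidateCount) {
        if (referenceCount <= 0) {
            throw new IllegalArgumentException("reference count must be positive");
        }
        long missing = Math.max(0, referenceCount - candidateCount);
        return missing * 100.0 / referenceCount;
    }

    /** Validate below 5% overhead, kill above 10%; the range between is left open. */
    static Verdict otelVerdict(double overheadPct) {
        if (overheadPct > 10.0) {
            return Verdict.KILL;
        }
        return overheadPct < 5.0 ? Verdict.VALIDATE : Verdict.INCONCLUSIVE;
    }

    /** Validate at 100% completeness, kill above 0.1% loss; the range between is left open. */
    static Verdict shippingVerdict(double lossPct) {
        if (lossPct > 0.1) {
            return Verdict.KILL;
        }
        return lossPct == 0.0 ? Verdict.VALIDATE : Verdict.INCONCLUSIVE;
    }

    public static void main(String[] args) {
        // Sample inputs are made up for illustration, not measured results.
        System.out.println(otelVerdict(overheadPercent(200.0, 208.0)));
        System.out.println(shippingVerdict(logLossPercent(100_000, 99_950)));
    }
}
```

The log counts would come from running the native collector and Fluent Bit in parallel over the same window, as the risk mitigation for the cutover already proposes.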

Cost-Benefit Analysis

| Recommendation | Effort (Days) | Annual Cost Impact | Operational Impact |
|----------------|---------------|--------------------|--------------------|
| L1: Metrics-First | 10-15 | -$3,600/yr (reduced Sumo queries) | Real-time alerting |
| L2: Tiered Storage | 5-8 | -$42,000/yr (70% storage reduction) | Slightly slower warm queries |
| L3: OpenTelemetry | 15-20 | +$1,200/yr (trace backend) | Revolutionary debugging |
| L4: Standardize Shipping | 8-10 | -$2,400/yr (ops overhead) | Simplified operations |
| L5: Session Replay | 3-5 | +$6,000/yr (tooling) | Visual debugging |
| Total | 41-58 | -$40,800/yr net savings | Modern observability |
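
The totals row can be reproduced from the individual rows. The sketch below sums the effort ranges and annual cost impacts exactly as listed in the table; the class and record names are hypothetical. It also derives the storage baseline implied by the L2 row, which is an inference that holds only if the 70% and $42,000 figures refer to the same spend.

```java
import java.util.List;

class CostBenefitTotals {
    /** One table row; a negative annualUsd means savings. */
    record Row(String name, int minDays, int maxDays, int annualUsd) {}

    static final List<Row> ROWS = List.of(
        new Row("L1: Metrics-First", 10, 15, -3_600),
        new Row("L2: Tiered Storage", 5, 8, -42_000),
        new Row("L3: OpenTelemetry", 15, 20, 1_200),
        new Row("L4: Standardize Shipping", 8, 10, -2_400),
        new Row("L5: Session Replay", 3, 5, 6_000));

    static int minEffortDays() {
        return ROWS.stream().mapToInt(Row::minDays).sum();
    }

    static int maxEffortDays() {
        return ROWS.stream().mapToInt(Row::maxDays).sum();
    }

    static int netAnnualUsd() {
        return ROWS.stream().mapToInt(Row::annualUsd).sum();
    }

    /** Inferred, not stated in the table: the spend for which $42,000 is a 70% cut. */
    static long impliedStorageBaselineUsd() {
        return Math.round(42_000 / 0.70);
    }

    public static void main(String[] args) {
        System.out.printf("Effort: %d-%d days%n", minEffortDays(), maxEffortDays());
        System.out.printf("Net annual impact: %d USD%n", netAnnualUsd());
        System.out.printf("Implied storage baseline: %d USD/yr%n", impliedStorageBaselineUsd());
    }
}
```

The sums match the totals row: 41-58 days of effort and a net -$40,800 per year.
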

Related Initiatives

  • I-2026-GC-002 (applications/gamechanger): Pricing & Rate-Management Observability. Delivers the short/mid-term application-level logging improvements (smart sampling, structured logging, Sentry enrichment, correlation IDs). This initiative (I-2026-OBS-001) builds the long-term infrastructure foundation.
  • I-2026-ING-001 (engineering-platform/ingestion): Frontdoor. May benefit from standardized log shipping (L4).

Decision Log

| Date | Decision | Rationale | Next Step |
|------|----------|-----------|-----------|
| 2026-02-25 | Mapped complete Sumo Logic pipeline across 150 repos | 5 shipping mechanisms discovered; fragmentation is a cost and ops problem | Quantify cost impact |
| 2026-02-26 | Split from app-level initiative (I-2026-GC-002) | App teams own short/mid-term code changes; platform owns long-term infrastructure | Scope platform experiments |
| 2026-02-26 | Created initiative I-2026-OBS-001 focused on L1-L5 | Long-term modernization requires dedicated platform engineering effort | Begin E-2026-OBS-001 (OTel POC) |

Risk Assessment

| Risk | Severity | Likelihood | Mitigation |
|------|----------|------------|------------|
| OpenTelemetry overhead too high for pricing critical path | High | Low | Profile in staging; use sampling if needed |
| Tiered storage Athena queries too slow for incidents | Medium | Medium | Keep 7-day hot tier for active debugging |
| Fluent Bit migration causes log gaps during cutover | Medium | Low | Run parallel (old + new) during migration |
| Team bandwidth insufficient for platform changes | Medium | High | Sequence after I-2026-GC-002 delivers app-level wins |
| Sumo Logic contract limits partition/budget features | Low | Medium | Evaluate alternatives if needed (Grafana Loki) |

Research Artifacts

  • Unified Strategy: duetto/docs/research/duetto-logging-observability-strategy-2026-02-25.md
  • Sumo Logic Shipper Discovery: docs/research/sumo-logic-log-shipper-discovery-2026-02-25.md (local research — not in repo)
  • Sumo Logic Integration Analysis: duetto/docs/research/sumo-logic-sentry-integration-analysis-2026-02-25.md
  • Logging Strategy Analysis: duetto/docs/research/logging-strategy-analysis-2026-02-25.md

Team & Stakeholders

  • Initiative Owner: Antonio (Platform Engineering)
  • Engineering: Platform Engineering / SRE team
  • Stakeholders: All application teams (beneficiaries of improved infrastructure)
  • Repos Affected: ops-k8s, chef-cookbooks, datapipelines, openspace-infra, duetto (instrumentation)

Outcome

Status: discovery

Key Learning: [To be completed after experiments]

Next Step: Begin experiment E-2026-OBS-001 (OpenTelemetry POC), sequenced after I-2026-GC-002 delivers the application-level foundation