Platform Observability Modernization: Unified Logging Infrastructure
Initiative: Platform Observability Modernization
The Bet
Hypothesis: We believe that modernizing the observability stack — standardizing log shipping to a single Fluent Bit pipeline, implementing tiered storage (hot/warm/cold), adopting OpenTelemetry for distributed tracing, and moving to metrics-first monitoring with Prometheus/Micrometer — will reduce logging infrastructure costs by 70% and enable sub-minute incident detection across the entire platform.
Team Metric: Logging infrastructure cost and incident detection time
Current Baseline:
- Log shipping: 5 different mechanisms across K8s, Elastic Beanstalk, Chef, Lambda, and S3
- Distributed tracing: none (MDC configured in log4j2 but never populated)
- Metrics: limited, no Prometheus/Micrometer instrumentation
- Storage: single tier in Sumo Logic (all logs same retention/cost)
- Incident detection: relies on Sumo Logic scheduled queries → Lambda → email alerts
Target:
- Log shipping: 1 unified mechanism (Fluent Bit everywhere)
- Distributed tracing: OpenTelemetry across frontend → backend → services
- Metrics: Prometheus/Micrometer for all critical paths
- Storage: 3-tier (hot 7d / warm 30d / cold 1yr), targeting a 70% cost reduction
- Incident detection: real-time metrics alerts + trace-based debugging
Measurement: Monitor Sumo Logic ingestion costs, count of shipping mechanisms, trace coverage percentage, alert-to-diagnosis time.
Why This Bet?
Research across 150 duettoresearch organization repositories revealed a fragmented observability architecture that has grown organically over time. Five different log shipping mechanisms coexist — Fluent Bit DaemonSets on K8s, native Sumo Logic collectors on Elastic Beanstalk, Chef-installed collectors on legacy infrastructure, a custom Python library for Lambda/Glue, and S3 bucket scanning. This fragmentation creates operational overhead, inconsistent log formats, and difficulty in cross-service correlation.
The application teams are addressing their immediate observability needs (see I-2026-GC-002 for pricing/groups logging improvements). This initiative builds the long-term infrastructure foundation that those application-level improvements will benefit from.
Evidence/Signal
- 5 shipping mechanisms discovered: K8s Fluent Bit (ops-k8s), EB native collectors (7+ service repos), Chef cookbook (chef-cookbooks/sumologic-collector), Lambda library (datapipelines/de_libs/common-sumo), S3 scanning (openspace-infra/sumo)
- No distributed tracing: log4j2 pattern includes `traceId=%X{trace_id};spanId=%X{span_id}` but MDC is never populated, so the configuration is wasted
- OpenTelemetry hint exists: `-Ddd.trace.otel.enabled=true` found in configs but not fully leveraged
- Single-tier storage: All logs have same retention and cost regardless of value
- Frontend observability gap: No Web Vitals, no RUM, no session replay
Strategic Alignment
- OKR: Reduce infrastructure operational costs and improve platform reliability
- Company Priority: Scalable infrastructure that supports growth without proportional cost increase
Current State: Log Pipeline Architecture
┌─────────────────────────────────────────────────────────────┐
│ APPLICATION LAYER │
├──────────────────────┬──────────────────────────────────────┤
│ duetto (backend) │ duetto-frontend │
│ log4j2 → rolling │ Client: Sentry (LoggerUtil) │
│ files + stdout │ Server: Pino → stdout │
└──────────┬───────────┴───────────┬──────────────────────────┘
│ │
┌──────────┴───────────────────────┴──────────────────────────┐
│ 5 PARALLEL SHIPPING MECHANISMS (current) │
├─────────────┬──────────────┬────────────┬───────────────────┤
│ ① Fluent Bit│ ② Sumo │ ③ Chef │ ④ de-sumo-log │
│ (K8s/EKS) │ Collector(EB)│ Cookbook │ (Lambda/Glue) │
│ → Kinesis │ on EC2 │ Installed │ → HTTP │
│ → Firehose│ │ Collector │ │
│ │ │ │ ⑤ S3 Scanning │
└─────────────┴──────────────┴────────────┴───────────────────┘
│
┌─────┴─────┐
│SUMO LOGIC │ ← single tier, all logs
└─────┬─────┘
│
Lambda Monitor → Email Alerts
Key Repos:
- K8s shipping: ops-k8s/modules/kubernetes/helm-charts/fluentbit/
- EB collectors: .ebextensions/sumo_logic.config in hotel-domain, group-domain, rate-management-domain, etc.
- Chef cookbook: chef-cookbooks/sumologic-collector/
- Lambda library: datapipelines/de_libs/common-sumo/
- S3 scanning: openspace-infra/sumo/
Discovery Plan
Time Box
- Max Duration: 8 weeks (research + proof of concepts)
- Max Experiments: 3
- Decision Date: 2026-04-23
Planned Experiments
- E-2026-OBS-001: OpenTelemetry POC — Instrument one service with OTel SDK, validate trace propagation through Fluent Bit to a trace backend (Jaeger/Tempo). Measure overhead and trace completeness. (Weeks 1-3)
- E-2026-OBS-002: Tiered storage pilot — Configure Sumo Logic partitions with differentiated retention + set up S3 archival with Athena queries for warm tier. Measure cost reduction and query performance. (Weeks 3-5)
- E-2026-OBS-003: Unified Fluent Bit migration — Migrate one EB service from native Sumo collector to Fluent Bit sidecar. Validate log completeness and operational simplicity. (Weeks 5-7)
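One way the log completeness check in E-2026-OBS-003 could be approached is to run both collectors in parallel and compare which events each one delivered. The sketch below assumes every log event carries a unique identifier that survives shipping; the class name, method name, and the idea of comparing by event ID are illustrative assumptions, since the plan does not specify how completeness will be measured.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: compares event IDs delivered by two shippers during a parallel run. */
final class LogCompleteness {

    /**
     * Percentage of baseline events (native collector) that the candidate
     * shipper (Fluent Bit) failed to deliver.
     */
    static double lossPercent(Set<String> baselineIds, Set<String> candidateIds) {
        if (baselineIds.isEmpty()) {
            return 0.0; // nothing was expected, so nothing was lost
        }
        // Copy first so the caller's set is not modified.
        Set<String> missing = new HashSet<>(baselineIds);
        missing.removeAll(candidateIds);
        return 100.0 * missing.size() / baselineIds.size();
    }
}
```

The resulting percentage can be compared directly against the loss thresholds in the exit criteria. Events that appear only in the candidate set are ignored here; a fuller check might report those as well.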
Recommendations (Long-Term Platform Changes)
L1: Metrics-First Observability (Prometheus/Micrometer)
Instrument critical paths with metrics instead of relying on log analysis:
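import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class PricingMetrics {
    private final MeterRegistry registry;
    // Spring injects the application's MeterRegistry.
    public PricingMetrics(MeterRegistry registry) {
        this.registry = registry;
    }
    // Assumes ctx.getDuration() returns java.time.Duration and the tag values are Strings.
    public void recordOptimization(OptimizationContext ctx) {
        Timer.builder("pricing.optimization.duration")
            .tag("hotel", ctx.getHotelId())
            .tag("status", ctx.getStatus())
            .register(registry)
            .record(ctx.getDuration());
        Counter.builder("pricing.rules.applied")
            .tag("type", ctx.getRuleType())
            .register(registry)
            .increment(ctx.getRulesCount());
    }
}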
- Where: New instrumentation across duetto monolith, domain services
- Impact: Real-time alerting, Grafana dashboards, no Sumo Logic query cost
- Effort: 10-15 days
- Dependencies: Prometheus infrastructure (or hosted solution)
L2: Tiered Storage Strategy
| Tier | Retention | Storage | Cost/GB | Use Case |
|---|---|---|---|---|
| Hot | 7 days | Sumo Logic | ~$3/GB | Active debugging, real-time queries |
| Warm | 7-30 days | S3 Standard + Athena | ~$0.023/GB | Incident investigation, ad-hoc queries |
| Cold | 30+ days | S3 Glacier | ~$0.004/GB | Compliance, audit, long-term archive |
- Impact: ~70% cost reduction on log storage
- Effort: 5-8 days
- Dependencies: S3 lifecycle policies, Athena table definitions
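The tier boundaries in the table can be made explicit as a small routing rule. This is a sketch only: in practice the transitions would be handled by Sumo Logic retention settings and S3 lifecycle policies rather than application code, and the class name is hypothetical. The table lists 7 days under both hot and warm, so the sketch assumes a log that is exactly 7 days old is still hot and one that is exactly 30 days old is still warm.

```java
import java.time.Duration;

/** Hypothetical sketch of the hot/warm/cold routing rule by log age. */
final class StorageTierPolicy {

    enum Tier { HOT, WARM, COLD }

    // Boundaries from the tier table; inclusive upper bounds are an assumption.
    private static final Duration HOT_MAX = Duration.ofDays(7);
    private static final Duration WARM_MAX = Duration.ofDays(30);

    static Tier tierFor(Duration age) {
        if (age.isNegative()) {
            throw new IllegalArgumentException("age must not be negative");
        }
        if (age.compareTo(HOT_MAX) <= 0) {
            return Tier.HOT;   // Sumo Logic: active debugging
        }
        if (age.compareTo(WARM_MAX) <= 0) {
            return Tier.WARM;  // S3 Standard + Athena: incident investigation
        }
        return Tier.COLD;      // S3 Glacier: compliance and archive
    }
}
```

Writing the rule down this way makes the boundary decisions visible, so they can be agreed on before the lifecycle policies are configured.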
L3: Distributed Tracing (OpenTelemetry)
End-to-end request tracing from frontend to all backend services:
- Leverage existing OTel hint (`-Ddd.trace.otel.enabled=true`)
- Instrument frontend (Sentry Performance or OTel JS SDK)
- Deploy Jaeger or Grafana Tempo for trace storage
- Correlate traces with logs and metrics
- Where: All services, frontend, infrastructure
- Impact: Complete request flow visibility, sub-minute root cause identification
- Effort: 15-20 days
- Dependencies: Trace backend infrastructure, I-2026-GC-002 delivers MDC/correlation foundation
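To illustrate how trace context could reach the log pattern, the sketch below extracts the trace and span IDs from a W3C Trace Context `traceparent` header and returns them under the `trace_id` and `span_id` keys that the existing log4j2 pattern reads. It assumes propagation uses the W3C header format and handles only version `00`; the class and method names are hypothetical, and in a real deployment the OpenTelemetry instrumentation or a request filter would do this work.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: maps a W3C traceparent header to logging context fields. */
final class TraceContextFields {

    // traceparent layout: version-traceid-parentid-flags, all lowercase hex.
    private static final Pattern TRACEPARENT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    /** Returns trace_id/span_id entries, or an empty map if the header is unusable. */
    static Map<String, String> fromTraceparent(String header) {
        Map<String, String> fields = new HashMap<>();
        if (header == null) {
            return fields;
        }
        Matcher m = TRACEPARENT.matcher(header.trim());
        if (!m.matches()) {
            return fields;
        }
        String traceId = m.group(1);
        String spanId = m.group(2);
        // The specification treats all-zero IDs as invalid.
        if (isAllZeros(traceId) || isAllZeros(spanId)) {
            return fields;
        }
        fields.put("trace_id", traceId);
        fields.put("span_id", spanId);
        return fields;
    }

    private static boolean isAllZeros(String hex) {
        return hex.chars().allMatch(c -> c == '0');
    }
}
```

Each returned entry could then be copied into the logging context at the start of a request (for Log4j2, with `ThreadContext.put(key, value)`) and cleared when the request finishes, so that `%X{trace_id}` and `%X{span_id}` resolve to real values.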
L4: Standardize Log Shipping
Consolidate 5 mechanisms to 1 (Fluent Bit everywhere):
| Current Mechanism | Migration Path |
|---|---|
| ① Fluent Bit (K8s) | Keep — already the standard |
| ② Native Sumo Collector (EB) | Replace with Fluent Bit sidecar |
| ③ Chef Cookbook Collector | Replace with Fluent Bit on hosts |
| ④ de-sumo-log (Lambda) | Replace with CloudWatch → Fluent Bit subscription |
| ⑤ S3 Scanning | Keep as archival backup only |
- Impact: Single config to manage, consistent log format, reduced operational burden
- Effort: 8-10 days
- Dependencies: EB → K8s migration roadmap may make some of these migration paths obsolete
L5: Frontend Session Replay
Add visual debugging capability for frontend issues:
- Options: Sentry Session Replay (already have Sentry), LogRocket, or FullStory
- Impact: Visual reproduction of user issues without asking for steps
- Effort: 3-5 days
- Dependencies: Sentry plan upgrade (if using Sentry Replay)
Exit Criteria
Validate If:
- OpenTelemetry POC successfully traces a request end-to-end with <5% overhead
- Tiered storage achieves >50% cost reduction in pilot
- Fluent Bit migration maintains 100% log completeness vs native collector
- At least 2 recommendations demonstrate measurable improvement
Kill If:
- OpenTelemetry overhead exceeds 10% on critical paths
- Tiered storage queries in Athena are too slow for incident response (>30s)
- Fluent Bit migration causes log loss >0.1%
- After 8 weeks, no clear path to 50%+ cost reduction
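The numeric thresholds above can be encoded so that experiment results are judged the same way each time. The sketch below covers only the measurable criteria. The criteria do not say what happens when a result falls between a validate threshold and a kill threshold (for example, 7% OpenTelemetry overhead), so the sketch reports those cases as `INCONCLUSIVE`; that label, along with the class and method names, is an assumption.

```java
/** Hypothetical encoding of the measurable validate/kill thresholds. */
final class ExitCriteria {

    enum Verdict { VALIDATE, KILL, INCONCLUSIVE }

    /** OpenTelemetry POC: validate below 5% overhead, kill above 10%. */
    static Verdict otelOverhead(double overheadPercent) {
        if (overheadPercent > 10.0) {
            return Verdict.KILL;
        }
        if (overheadPercent < 5.0) {
            return Verdict.VALIDATE;
        }
        return Verdict.INCONCLUSIVE; // the 5-10% band is not covered by the criteria
    }

    /** Fluent Bit migration: validate only at zero loss, kill above 0.1% loss. */
    static Verdict logLoss(double lossPercent) {
        if (lossPercent > 0.1) {
            return Verdict.KILL;
        }
        if (lossPercent == 0.0) {
            return Verdict.VALIDATE;
        }
        return Verdict.INCONCLUSIVE; // some loss, but below the kill threshold
    }

    /** Athena warm-tier queries: too slow for incident response above 30 seconds. */
    static boolean athenaTooSlow(double querySeconds) {
        return querySeconds > 30.0;
    }
}
```

Making the middle bands explicit highlights where the team would still need a judgment call at the decision date.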
Cost-Benefit Analysis
| Recommendation | Effort (Days) | Annual Cost Impact | Operational Impact |
|---|---|---|---|
| L1: Metrics-First | 10-15 | -$3,600/yr (reduced Sumo queries) | Real-time alerting |
| L2: Tiered Storage | 5-8 | -$42,000/yr (70% storage reduction) | Slightly slower warm queries |
| L3: OpenTelemetry | 15-20 | +$1,200/yr (trace backend) | End-to-end request tracing for debugging |
| L4: Standardize Shipping | 8-10 | -$2,400/yr (ops overhead) | Simplified operations |
| L5: Session Replay | 3-5 | +$6,000/yr (tooling) | Visual debugging |
| Total | 41-58 | -$40,800/yr net savings | Modern observability |
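The totals row can be checked by summing the per-recommendation figures. The sketch below uses only the numbers from the table, with savings entered as negative values and new spend as positive values; the class name is hypothetical.

```java
/** Recomputes the totals row of the cost-benefit table from its five rows. */
final class CostBenefitTotals {

    // One row per recommendation L1..L5: {minDays, maxDays, annualCostDeltaUsd}.
    static final int[][] ROWS = {
        {10, 15, -3600},   // L1: Metrics-First
        {5, 8, -42000},    // L2: Tiered Storage
        {15, 20, 1200},    // L3: OpenTelemetry
        {8, 10, -2400},    // L4: Standardize Shipping
        {3, 5, 6000},      // L5: Session Replay
    };

    /** Sums one column (0 = min days, 1 = max days, 2 = annual cost delta). */
    static int sum(int column) {
        int total = 0;
        for (int[] row : ROWS) {
            total += row[column];
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("Effort: " + sum(0) + "-" + sum(1) + " days");
        System.out.println("Net annual cost impact: $" + sum(2));
    }
}
```

Running it prints an effort range of 41-58 days and a net annual cost impact of $-40800, which matches the totals row. Almost all of the net saving comes from L2, so the overall figure depends heavily on the tiered storage pilot.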
Related Initiatives
- I-2026-GC-002 (applications/gamechanger): Pricing & Rate-Management Observability — delivers the short/mid-term application-level logging improvements (smart sampling, structured logging, Sentry enrichment, correlation IDs). This initiative (I-2026-OBS-001) builds the long-term infrastructure foundation.
- I-2026-ING-001 (engineering-platform/ingestion): Frontdoor — may benefit from standardized log shipping (L4)
Decision Log
| Date | Decision | Rationale | Next Step |
|---|---|---|---|
| 2026-02-25 | Mapped complete Sumo Logic pipeline across 150 repos | 5 shipping mechanisms discovered — fragmentation is a cost and ops problem | Quantify cost impact |
| 2026-02-26 | Split from app-level initiative (I-2026-GC-002) | App teams own short/mid-term code changes; platform owns long-term infrastructure | Scope platform experiments |
| 2026-02-26 | Created initiative I-2026-OBS-001 focused on L1-L5 | Long-term modernization requires dedicated platform engineering effort | Begin E-2026-OBS-001 (OTel POC) |
Risk Assessment
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| OpenTelemetry overhead too high for pricing critical path | High | Low | Profile in staging; use sampling if needed |
| Tiered storage Athena queries too slow for incidents | Medium | Medium | Keep 7-day hot tier for active debugging |
| Fluent Bit migration causes log gaps during cutover | Medium | Low | Run parallel (old + new) during migration |
| Team bandwidth insufficient for platform changes | Medium | High | Sequence after I-2026-GC-002 delivers app-level wins |
| Sumo Logic contract limits partition/budget features | Low | Medium | Evaluate alternatives if needed (Grafana Loki) |
Research Artifacts
- Unified Strategy: `duetto/docs/research/duetto-logging-observability-strategy-2026-02-25.md`
- Sumo Logic Shipper Discovery: `docs/research/sumo-logic-log-shipper-discovery-2026-02-25.md` (local research, not in repo)
- Sumo Logic Integration Analysis: `duetto/docs/research/sumo-logic-sentry-integration-analysis-2026-02-25.md`
- Logging Strategy Analysis: `duetto/docs/research/logging-strategy-analysis-2026-02-25.md`
Team & Stakeholders
- Initiative Owner: Antonio (Platform Engineering)
- Engineering: Platform Engineering / SRE team
- Stakeholders: All application teams (beneficiaries of improved infrastructure)
- Repos Affected: ops-k8s, chef-cookbooks, datapipelines, openspace-infra, duetto (instrumentation)
Outcome
Status: discovery
Key Learning: [To be completed after experiments]
Next Step: Begin experiment E-2026-OBS-001 (OpenTelemetry POC), sequenced after I-2026-GC-002 delivers application-level foundation