initiative discovery

Pricing & Rate-Management Observability: From Incident Blindness to Always-On Diagnostics

Antonio | Updated 2026-03-11 | applications | gamechanger
Tags: observability, logging, pricing, groups, sentry, sumo-logic, q1-2026

Initiative: Pricing & Rate-Management Observability

The Bet

Hypothesis: We believe that replacing permission-gated debug logging with smart sampling (5% baseline + 100% on errors), structured JSON logging, and enriched Sentry context will reduce pricing incident MTTR by 50% while keeping Sumo Logic ingest costs under $50/month for 100 hotels — because the current approach of gating logs behind ENABLE_PRICING_LOGGING forces teams to debug blind during the exact incidents when diagnostic data is most critical.

Team Metric: Mean Time to Resolution (MTTR) for pricing-related incidents

Current Baseline:

  • MTTR for pricing incidents: 4-8 hours of additional debugging time when logs are unavailable
  • "Insufficient logging" escalations: recurring during stuck pricing optimization scenarios
  • Observability coverage during incidents: ~0% (logs gated behind a disabled perm)

Target:

  • MTTR reduction: 50% (eliminate the 4-8 hour log-enablement delay)
  • Incident observability coverage: 100% for errors, 5% baseline sampling
  • Sumo Logic cost: <$50/month for 100 hotels
  • Zero "insufficient logging" escalations

Measurement: Track MTTR for pricing incidents before/after. Monitor Sumo Logic ingest volume via dashboards. Count escalations tagged "insufficient logging."

Why This Bet?

While investigating stuck pricing optimizations, we discovered that the most valuable diagnostic data — detailed pricerator state, hurdle rates, optimization rules applied — is hidden behind the ENABLE_PRICING_LOGGING dev-only permission. This permission is always disabled in production, creating a paradox: the logs we need most during incidents are the logs we never have.

Enabling the permission manually requires a database update (30-60 min propagation), and when enabled naively it generates ~109,200 log lines (~43.7 MB) per optimization with ~82 seconds of synchronous latency — making "always-on" untenable without architectural changes.

Research across 150 repositories mapped the complete Sumo Logic pipeline and found that smart sampling plus structured logging can deliver the observability we need at roughly 5% of the always-on cost, with near-zero performance impact.

Evidence/Signal

  • Stuck optimization debugging: Team discovered critical diagnostic data was hidden behind disabled perm — the exact data needed to diagnose the issue was unavailable
  • Recurring pattern: Every pricing incident investigation hits the same wall — no logs available
  • Three permission gates: ENABLE_PRICING_LOGGING, ENABLE_GROUP_QUOTATION_LOGGING, and ENABLE_GRAPHQL_QUERIES_LOGGING create identical blind spots
  • Frontend-backend gap: No correlation IDs between frontend Sentry errors and backend Sumo Logic logs

Strategic Alignment

  • OKR: Improve platform reliability and reduce incident response time
  • Company Priority: Pricing accuracy directly impacts customer revenue and retention

Current State

Permission-Gated Logging

Three dev-only permissions control critical debug visibility:

Permission | Code | What It Gates | Code Location
ENABLE_PRICING_LOGGING | "epl" | Pricing optimization diagnostics | PricingOptimizer.java, OptimizeResultUpdater.java
ENABLE_GROUP_QUOTATION_LOGGING | - | Group quotation displacement | RateRecsGroupQuotationDelegate.java
ENABLE_GRAPHQL_QUERIES_LOGGING | - | GraphQL query parameters | GqlController.java

When ENABLE_PRICING_LOGGING is enabled, a triple nested loop produces:

364 days × ~15 pricing targets × ~20 room types = ~109,200 log lines per optimization
~43.7 MB volume | ~82 seconds synchronous latency | ~500K string allocations
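
The figures above can be cross-checked with a few lines of arithmetic. The sketch below recomputes the line count and derives the average size and synchronous cost per line from the quoted totals. The class and method names are hypothetical, and 1 MB is taken as 1,000,000 bytes (an assumption, not stated in the source).

```java
// Hypothetical helper: cross-checks the per-optimization logging figures quoted above.
final class PricingLogVolume {
    static final int DAYS = 364;
    static final int PRICING_TARGETS = 15; // approximate, per the text
    static final int ROOM_TYPES = 20;      // approximate, per the text

    // One log line per (day, pricing target, room type) combination.
    static long linesPerOptimization() {
        return (long) DAYS * PRICING_TARGETS * ROOM_TYPES;
    }

    // Average line size implied by a total volume in MB (assumes 1 MB = 1,000,000 bytes).
    static double avgBytesPerLine(double totalMegabytes) {
        return totalMegabytes * 1_000_000 / linesPerOptimization();
    }

    // Average synchronous cost per line implied by a total latency in seconds.
    static double avgMillisPerLine(double totalSeconds) {
        return totalSeconds * 1_000 / linesPerOptimization();
    }

    public static void main(String[] args) {
        System.out.println(linesPerOptimization()); // 109200
        System.out.printf("%.0f bytes/line%n", avgBytesPerLine(43.7));
        System.out.printf("%.2f ms/line%n", avgMillisPerLine(82));
    }
}
```

Under these assumptions each line averages about 400 bytes and about 0.75 ms, so the total cost is driven by the number of lines rather than by any single expensive line.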

Sentry Integration

  • Backend: Gack system → Sentry. Enabled in prod, but Gacks capture only errors, not the pricing decision context needed for diagnosis.
  • Frontend: LoggerUtil → Sentry. Context includes user, company, and property IDs, with no correlation to backend logs.

Cost Model (Sumo Logic)

Scenario | Volume/Day | Monthly Cost
Current (perm off) | 0 | $0
Always-on, 100 hotels | ~8.7 GB/day | ~$780/month
5% sampling + errors, 100 hotels | ~450 MB/day | ~$45/month
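
As a rough consistency check on the cost model, the sketch below derives the per-GB price that each scenario implies and the daily volume that 5% sampling alone would produce. It assumes a 30-day month and 1 GB = 1,000 MB, neither of which is stated in the source, and the class and method names are hypothetical.

```java
// Hypothetical helper: sanity-checks the Sumo Logic cost model figures.
final class SumoCostModel {
    static final int DAYS_PER_MONTH = 30; // assumption

    // Price per ingested GB implied by a daily volume and a monthly bill.
    static double impliedDollarsPerGb(double gbPerDay, double dollarsPerMonth) {
        return dollarsPerMonth / (gbPerDay * DAYS_PER_MONTH);
    }

    // Daily volume in MB if only a fraction of optimizations is logged (assumes 1 GB = 1,000 MB).
    static double sampledMbPerDay(double alwaysOnGbPerDay, double sampleRate) {
        return alwaysOnGbPerDay * 1_000 * sampleRate;
    }

    public static void main(String[] args) {
        System.out.println(impliedDollarsPerGb(8.7, 780));  // always-on scenario
        System.out.println(impliedDollarsPerGb(0.45, 45));  // sampled scenario
        System.out.println(sampledMbPerDay(8.7, 0.05));     // baseline sampling only
    }
}
```

Both scenarios imply roughly $3 per ingested GB. Sampling alone gives about 435 MB/day, so if the ~450 MB/day estimate is read as sampling plus error-triggered logging, it leaves roughly 15 MB/day for the latter.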

Discovery Plan

Time Box

  • Max Duration: 6 weeks
  • Max Experiments: 3
  • Decision Date: 2026-04-09

Planned Experiments

  1. E-2026-GC-001: Smart sampling POC — Implement 5% sampling + error-always-on in PricingOptimizer on staging. Measure log volume, latency, and Sumo Logic query capability. (Weeks 1-2)
  2. E-2026-GC-002: Structured logging + async appender pilot — Convert PricingOptimizationLogger to JSON with async log4j2 appender. Measure compression and query performance. (Weeks 3-4)
  3. E-2026-GC-003: End-to-end correlation — MDC trace context in backend + X-Trace-ID from frontend. Validate full request tracing. (Weeks 5-6)

Phase 1: Tactical Wins (Weeks 1-2)

ID | Change | Where | Effort | Impact
S1 | Replace ENABLE_PRICING_LOGGING with smart sampling (5% + 100% on errors) | PricingOptimizer.java (lines 252, 376), OptimizeResultUpdater.java (lines 138, 182, 286) | 2-3 days | 100% error coverage, 5% baseline
S2 | Add Fluent Bit rate-limiting filter as safety net | ops-k8s/.../fluentbit/conf/filters.conf | 1 day | Ingest budget protection
S3 | Enrich Sentry Gacks with pricing context (hotel, dates, anomaly type) | PricingOptimizer.java, Gack system | 1-2 days | +50% error context in Sentry
S4 | Populate MDC trace context (already in log4j2 pattern, never populated) | New servlet filter | 2 days | Cross-request correlation in Sumo
S5 | Frontend-to-backend correlation IDs | ApolloClientProvider.tsx + backend controller | 2 days | End-to-end request tracing
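
For S4 and S5, the core decision in the new servlet filter is which trace ID to attach to a request: reuse a well-formed X-Trace-ID sent by the frontend, or mint a new one. The sketch below shows only that decision, using the standard library. The class name and the accepted ID format are assumptions. In the actual filter the result would presumably be placed in the logging context (for example log4j2's ThreadContext) for the duration of the request and cleared afterwards.

```java
import java.util.UUID;
import java.util.regex.Pattern;

// Hypothetical helper: picks the trace ID to use for one incoming request.
final class TraceIds {
    static final String HEADER = "X-Trace-ID";

    // Assumed format: 8 to 64 letters, digits, or hyphens.
    // Rejecting anything else keeps untrusted header content out of the logs.
    private static final Pattern VALID = Pattern.compile("[A-Za-z0-9-]{8,64}");

    // Returns the caller-supplied ID when it is well formed, otherwise a fresh UUID.
    static String resolve(String headerValue) {
        if (headerValue != null) {
            String trimmed = headerValue.trim();
            if (VALID.matcher(trimmed).matches()) {
                return trimmed;
            }
        }
        return UUID.randomUUID().toString();
    }
}
```

Validating the header before logging it is a defensive choice: the value comes from the browser and would otherwise be written verbatim into every log line of the request.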

S1: Smart Sampling Implementation

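import java.util.concurrent.ThreadLocalRandom;

// Decides whether pricing debug logs should be emitted for this optimization.
private boolean shouldLogPricingDebug(Hotel hotel, OptimizationContext ctx) {
    if (ctx.hasErrors() || ctx.hasAnomalousRates()) return true;  // Always log errors and anomalies
    if (hotel.hasTag("debug-pricing")) return true;               // Per-hotel opt-in; replaces the perm
    // Baseline sampling: defaults to 5%, overridable via the pricing.log.sampleRate system property
    double sampleRate = Double.parseDouble(System.getProperty("pricing.log.sampleRate", "0.05"));
    return ThreadLocalRandom.current().nextDouble() < sampleRate;
}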
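
One way to keep the sampling check off the hot path is to evaluate it once per optimization and reuse the result inside the nested loops, instead of re-evaluating it for every line. The sketch below is self-contained and illustrative only: SamplingGate, its constructor arguments, and the injected random source are hypothetical stand-ins, not the actual PricingOptimizer code.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

// Hypothetical sampling gate: errors and tagged hotels always log, everything else is sampled.
final class SamplingGate {
    private final double sampleRate;
    private final DoubleSupplier random; // injectable so tests can be deterministic

    SamplingGate(double sampleRate, DoubleSupplier random) {
        if (sampleRate < 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be between 0 and 1");
        }
        this.sampleRate = sampleRate;
        this.random = random;
    }

    // 5% baseline, matching the default in the snippet above.
    static SamplingGate withDefaults() {
        return new SamplingGate(0.05, () -> ThreadLocalRandom.current().nextDouble());
    }

    boolean shouldLog(boolean hasErrorsOrAnomalies, boolean hotelTaggedForDebug) {
        if (hasErrorsOrAnomalies || hotelTaggedForDebug) {
            return true;
        }
        return random.getAsDouble() < sampleRate;
    }

    // Simulates the triple nested loop and returns how many lines would be emitted.
    static int linesEmitted(SamplingGate gate, boolean hasErrors,
                            int days, int pricingTargets, int roomTypes) {
        boolean logDebug = gate.shouldLog(hasErrors, false); // decided once, outside the loops
        int emitted = 0;
        for (int d = 0; d < days; d++) {
            for (int t = 0; t < pricingTargets; t++) {
                for (int r = 0; r < roomTypes; r++) {
                    if (logDebug) {
                        emitted++; // stand-in for the real logger call
                    }
                }
            }
        }
        return emitted;
    }
}
```

With this shape an unsampled optimization pays for one random draw and a boolean check per iteration, and a sampled or failing one emits the full set of lines.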

Phase 2: Structural Improvements (Weeks 3-6)

ID | Change | Where | Effort | Impact
M1 | Structured JSON logging for pricing module | PricingOptimizationLogger.java | 3-5 days | 40% volume reduction, queryable fields in Sumo
M2 | Async logging appenders (remove 82s from critical path) | log4j2.xml | 2 days | Near-zero latency impact
M3 | Sumo Logic partitions + ingest budgets for pricing | Sumo Logic admin / Terraform | 1 day | Cost control + faster queries
M4 | Frontend Web Vitals + RUM | duetto-frontend/src/index.tsx | 3 days | Frontend performance visibility
M5 | Apply sampling pattern to other gated perms | RateRecsGroupQuotationDelegate.java, GqlController.java | 2 days | Platform-wide observability
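
To illustrate M1, the sketch below renders one pricing decision as a single-line JSON object so that each field can be queried individually in Sumo Logic. It uses only the standard library to stay self-contained; a real implementation would more likely rely on a log4j2 JSON layout or an existing JSON library. The class name and all field names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: formats one pricing log event as a single-line JSON object.
final class PricingLogLine {

    // Numbers and booleans are written bare, everything else as an escaped string.
    // Caveat: non-finite doubles (NaN, Infinity) are not handled and would produce invalid JSON.
    static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null) sb.append("null");
            else if (v instanceof Number || v instanceof Boolean) sb.append(v);
            else sb.append('"').append(escape(v.toString())).append('"');
        }
        return sb.append('}').toString();
    }

    // Escapes quotes, backslashes, and control characters so the event stays on one line.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    // Example event; LinkedHashMap keeps the fields in insertion order.
    static String example() {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("event", "pricing_decision");
        fields.put("hotelId", "H-123");
        fields.put("stayDate", "2026-03-11");
        fields.put("roomType", "KING");
        fields.put("hurdleRate", 129.5);
        fields.put("sampled", true);
        return toJson(fields);
    }
}
```

Calling PricingLogLine.example() yields {"event":"pricing_decision","hotelId":"H-123","stayDate":"2026-03-11","roomType":"KING","hurdleRate":129.5,"sampled":true}, which a JSON-aware query can filter by hotelId or stayDate without free-text parsing.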

Exit Criteria

Validate If:

  • Smart sampling provides sufficient data to diagnose a pricing incident within 30 minutes (vs current 4-8 hours)
  • Sumo Logic ingest cost stays under $50/month for 100 hotels
  • Structured JSON logs reduce Sumo Logic query time by 50%+ vs current free-text
  • At least 1 real incident is diagnosed using the new logging

Kill If:

  • Smart sampling misses critical diagnostic data in 2+ incidents
  • Sumo Logic costs exceed $200/month for 100 hotels despite sampling
  • Performance regression >5% in optimization cycle time
  • After 6 weeks, MTTR improvement is less than 20%

Cost-Benefit Analysis

Dimension | Current State | After Phase 1 | After Phase 2
MTTR (pricing incidents) | 4-8 hrs extra | -50% | -60%
Sumo Logic cost | $0 | ~$45/mo | ~$35/mo (compression)
Optimization latency | 0 (no logging) | +1-2s (sampling) | ~0 (async)
Incident coverage | 0% | 100% errors, 5% baseline | 100% errors, 5% structured
Queryability | N/A | Free-text in Sumo | JSON fields, fast queries

Related Initiatives

  • I-2026-OBS-001 (engineering-platform/infrastructure): Long-term platform observability modernization, covering OpenTelemetry, tiered storage, standardized log shipping, and metrics-first observability. This initiative (I-2026-GC-002) delivers the short/mid-term application-level changes; I-2026-OBS-001 builds the long-term infrastructure foundation.
  • I-2026-PRC-001 (analytics/pricing): Pricerator, the pricing engine whose output this initiative aims to make observable.

Decision Log

Date | Decision | Rationale | Next Step
2026-02-25 | Initiated research into pricing logging architecture | Stuck optimization debugging revealed logs gated behind disabled perm | Map complete pipeline
2026-02-25 | Mapped 5 Sumo Logic shipping mechanisms across 150 repos | Log shipper distributed across infrastructure repos, not in monorepo | Understand cost model
2026-02-26 | Split initiative: app-level (GC-002) + platform-level (OBS-001) | Short/mid-term changes are in application code owned by pricing/groups teams; long-term is platform infrastructure | Begin experiment E-2026-GC-001

Risk Assessment

Risk | Severity | Likelihood | Mitigation
Sampling misses critical diagnostic data | High | Low | 100% logging on errors/anomalies; configurable sample rate
Sumo Logic ingest budget blowout | Medium | Low | Fluent Bit rate limiter (S2) + Sumo partition budgets (M3)
Async logging causes log ordering issues | Medium | Low | Test in staging; use sequence IDs if needed
Structured logging breaks existing Sumo queries | Medium | Medium | Parallel running; backward compatibility

Research Artifacts

  • Unified Strategy: duetto/docs/research/duetto-logging-observability-strategy-2026-02-25.md
  • Pricing Logging Analysis: docs/research/enable-pricing-logging-analysis-2026-02-25.md (local research — not in repo)
  • Sumo Logic Shipper Discovery: docs/research/sumo-logic-log-shipper-discovery-2026-02-25.md (local research — not in repo)
  • Frontend Patterns: duetto-frontend/docs/research/frontend-logging-patterns-2026-02-25.md

Team & Stakeholders

  • Initiative Owner: Antonio
  • Engineering: Pricing team, Groups team, Frontend team
  • Stakeholders: SRE/Platform Engineering (for S2, M3 infrastructure changes)
  • Repos Affected: duetto (backend), duetto-frontend, ops-k8s (Fluent Bit config)

Outcome

  • Status: discovery
  • Key Learning: [To be completed after experiments]
  • Next Step: Begin experiment E-2026-GC-001 (Smart Sampling POC)