Pricing & Rate-Management Observability: From Incident Blindness to Always-On Diagnostics
Initiative: Pricing & Rate-Management Observability
The Bet
Hypothesis: We believe that replacing permission-gated debug logging with smart sampling (5% baseline + 100% on errors), structured JSON logging, and enriched Sentry context will reduce pricing incident MTTR by 50% while keeping Sumo Logic ingest costs under $50/month for 100 hotels — because the current approach of gating logs behind ENABLE_PRICING_LOGGING forces teams to debug blind during the exact incidents when diagnostic data is most critical.
Team Metric: Mean Time to Resolution (MTTR) for pricing-related incidents
Current Baseline:
- MTTR for pricing incidents: 4-8 hours additional debugging time when logs are unavailable
- "Insufficient logging" escalations: recurring during pricing optimization stuck scenarios
- Observability coverage during incidents: ~0% (logs gated behind disabled perm)
Target:
- MTTR reduction: 50% (eliminate 4-8 hour log-enablement delay)
- Incident observability coverage: 100% for errors, 5% baseline sampling
- Sumo Logic cost: <$50/month for 100 hotels
- Zero "insufficient logging" escalations
Measurement: Track MTTR for pricing incidents before/after. Monitor Sumo Logic ingest volume via dashboards. Count escalations tagged "insufficient logging."
Why This Bet?
While investigating stuck pricing optimizations, we discovered that the most valuable diagnostic data — detailed pricerator state, hurdle rates, optimization rules applied — is hidden behind the ENABLE_PRICING_LOGGING dev-only permission. This permission is always disabled in production, creating a paradox: the logs we need most during incidents are the logs we never have.
Enabling the permission manually requires a database update (30-60 min propagation), and when enabled naively it generates ~109,200 log lines (~43.7 MB) per optimization with ~82 seconds of synchronous latency — making "always-on" untenable without architectural changes.
Research across 150 repositories mapped the complete Sumo Logic pipeline and identified that smart sampling plus structured logging can deliver the observability we need at roughly 6% of the always-on cost (~$45 vs. ~$780 per month for 100 hotels) with near-zero performance impact.
Evidence/Signal
- Stuck optimization debugging: Team discovered critical diagnostic data was hidden behind disabled perm — the exact data needed to diagnose the issue was unavailable
- Recurring pattern: Every pricing incident investigation hits the same wall — no logs available
- Three permission gates: ENABLE_PRICING_LOGGING, ENABLE_GROUP_QUOTATION_LOGGING, and ENABLE_GRAPHQL_QUERIES_LOGGING create identical blind spots
- Frontend-backend gap: No correlation IDs between frontend Sentry errors and backend Sumo Logic logs
Strategic Alignment
- OKR: Improve platform reliability and reduce incident response time
- Company Priority: Pricing accuracy directly impacts customer revenue and retention
Current State
Permission-Gated Logging
Three dev-only permissions control critical debug visibility:
| Permission | Code | What It Gates | Code Location |
|---|---|---|---|
| ENABLE_PRICING_LOGGING | "epl" | Pricing optimization diagnostics | PricingOptimizer.java, OptimizeResultUpdater.java |
| ENABLE_GROUP_QUOTATION_LOGGING | - | Group quotation displacement | RateRecsGroupQuotationDelegate.java |
| ENABLE_GRAPHQL_QUERIES_LOGGING | - | GraphQL query parameters | GqlController.java |
When ENABLE_PRICING_LOGGING is enabled, a triple nested loop produces:
364 days × ~15 pricing targets × ~20 room types = ~109,200 log lines per optimization
~43.7 MB volume | ~82 seconds synchronous latency | ~500K string allocations
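The figures above can be cross-checked with a few lines of arithmetic. The sketch below is only illustrative: the class and method names are hypothetical, the inputs are the numbers quoted in this section, and the per-line size and per-line latency are derived from those totals rather than measured.

```java
// Hypothetical helper: only the input figures (364, 15, 20, 43.7 MB, 82 s) come from this document.
final class PricingLogVolume {

    // Lines emitted by the triple nested loop: days x pricing targets x room types.
    static long linesPerOptimization(int days, int pricingTargets, int roomTypes) {
        return (long) days * pricingTargets * roomTypes;
    }

    // Average bytes per line implied by a total volume, taking 1 MB = 1,000,000 bytes.
    static double impliedBytesPerLine(double totalMegabytes, long lines) {
        return totalMegabytes * 1_000_000.0 / lines;
    }

    public static void main(String[] args) {
        long lines = linesPerOptimization(364, 15, 20);
        System.out.println("lines per optimization: " + lines);
        System.out.printf("implied bytes per line: %.0f%n", impliedBytesPerLine(43.7, lines));
        // 82 seconds of synchronous latency spread over all lines, in milliseconds.
        System.out.printf("implied ms per line: %.2f%n", 82_000.0 / lines);
    }
}
```

Read this way, each line costs about 400 bytes and under a millisecond; the problem is the line count, which is what sampling targets.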
Sentry Integration
- Backend: Gack system → Sentry, enabled in prod. Gacks capture only errors, not the pricing decision context needed for diagnosis.
- Frontend: LoggerUtil → Sentry. Context: user/company/property IDs. No correlation to backend.
Cost Model (Sumo Logic)
| Scenario | Volume/Day | Monthly Cost |
|---|---|---|
| Current (perm off) | 0 | $0 |
| Always-on, 100 hotels | ~8.7 GB/day | ~$780/month |
| 5% sampling + errors, 100 hotels | ~450 MB/day | ~$45/month |
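The rows of this table can be related to the per-optimization volume with a small model. This is a sketch under assumptions, not the model used to produce the table: the class name is hypothetical, and the two optimizations per hotel per day and the roughly $3 per GB ingest price are back-derived from the table's own figures rather than stated anywhere in this document.

```java
// Hypothetical cost model. runsPerHotelPerDay and usdPerGb are inferred assumptions, not sourced figures.
final class SumoCostModel {

    // Daily ingest in GB (1 GB = 1000 MB) for a given fraction of optimizations that are logged.
    static double gbPerDay(double mbPerOptimization, int hotels,
                           double runsPerHotelPerDay, double loggedFraction) {
        return mbPerOptimization * hotels * runsPerHotelPerDay * loggedFraction / 1000.0;
    }

    // Monthly cost assuming a 30-day month and a flat per-GB price.
    static double monthlyCostUsd(double gbPerDay, double usdPerGb) {
        return gbPerDay * 30 * usdPerGb;
    }

    public static void main(String[] args) {
        double alwaysOn = gbPerDay(43.7, 100, 2.0, 1.0);
        double sampled = gbPerDay(43.7, 100, 2.0, 0.05);
        System.out.printf("always-on: %.2f GB/day, $%.0f/month%n", alwaysOn, monthlyCostUsd(alwaysOn, 3.0));
        System.out.printf("5%% sample: %.3f GB/day, $%.0f/month%n", sampled, monthlyCostUsd(sampled, 3.0));
    }
}
```

Under these assumptions the sampled baseline alone comes in slightly below the table's ~450 MB/day; the remainder would be the 100% error and anomaly traffic, whose real volume is one of the things E-2026-GC-001 is meant to measure.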
Discovery Plan
Time Box
- Max Duration: 6 weeks
- Max Experiments: 3
- Decision Date: 2026-04-09
Planned Experiments
- E-2026-GC-001: Smart sampling POC — Implement 5% sampling + error-always-on in PricingOptimizer on staging. Measure log volume, latency, and Sumo Logic query capability. (Weeks 1-2)
- E-2026-GC-002: Structured logging + async appender pilot — Convert PricingOptimizationLogger to JSON with async log4j2 appender. Measure compression and query performance. (Weeks 3-4)
- E-2026-GC-003: End-to-end correlation — MDC trace context in backend + X-Trace-ID from frontend. Validate full request tracing. (Weeks 5-6)
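To make the correlation experiment (E-2026-GC-003) concrete, here is a minimal sketch of per-request trace propagation. It is a stand-in, not the planned implementation: a real servlet filter would read the X-Trace-ID header from the request and put the value into the logging framework's MDC, whereas this version uses a plain ThreadLocal so it runs without dependencies. The class and method names are hypothetical.

```java
import java.util.UUID;
import java.util.function.Supplier;

// Hypothetical stand-in for an MDC-backed servlet filter; uses ThreadLocal to stay dependency-free.
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Trace ID bound to the current thread, or null outside a request.
    static String current() {
        return TRACE_ID.get();
    }

    // Reuse the incoming X-Trace-ID header value if present, otherwise mint a new ID.
    static String resolve(String incomingHeader) {
        return (incomingHeader == null || incomingHeader.trim().isEmpty())
                ? UUID.randomUUID().toString()
                : incomingHeader.trim();
    }

    // Bind the trace ID for the duration of the handler and always restore the previous state.
    static <T> T withTrace(String incomingHeader, Supplier<T> handler) {
        String previous = TRACE_ID.get();
        TRACE_ID.set(resolve(incomingHeader));
        try {
            return handler.get();
        } finally {
            if (previous == null) {
                TRACE_ID.remove();
            } else {
                TRACE_ID.set(previous);
            }
        }
    }

    public static void main(String[] args) {
        String seen = withTrace("fe-1234", () -> "traceId=" + current());
        System.out.println(seen);                              // prints "traceId=fe-1234"
        System.out.println("after request: " + current());     // prints "after request: null"
    }
}
```

The cleanup in the finally block matters with pooled request threads: without it, a trace ID from one request could leak into log lines of the next.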
Phase 1: Tactical Wins (Weeks 1-2)
| ID | Change | Where | Effort | Impact |
|---|---|---|---|---|
| S1 | Replace ENABLE_PRICING_LOGGING with smart sampling (5% + 100% on errors) | PricingOptimizer.java (lines 252, 376), OptimizeResultUpdater.java (lines 138, 182, 286) | 2-3 days | 100% error coverage, 5% baseline |
| S2 | Add Fluent Bit rate-limiting filter as safety net | ops-k8s/.../fluentbit/conf/filters.conf | 1 day | Ingest budget protection |
| S3 | Enrich Sentry Gacks with pricing context (hotel, dates, anomaly type) | PricingOptimizer.java, Gack system | 1-2 days | +50% error context in Sentry |
| S4 | Populate MDC trace context (already in log4j2 pattern, never populated) | New servlet filter | 2 days | Cross-request correlation in Sumo |
| S5 | Frontend-to-backend correlation IDs | ApolloClientProvider.tsx + backend controller | 2 days | End-to-end request tracing |
S1: Smart Sampling Implementation
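import java.util.concurrent.ThreadLocalRandom;

private boolean shouldLogPricingDebug(Hotel hotel, OptimizationContext ctx) {
    // Always log when the run has errors or anomalous rates
    if (ctx.hasErrors() || ctx.hasAnomalousRates()) return true;
    // Per-hotel opt-in tag; replaces the ENABLE_PRICING_LOGGING perm
    if (hotel.hasTag("debug-pricing")) return true;
    // Otherwise sample at the configured rate (default 5%)
    return ThreadLocalRandom.current().nextDouble()
            < Double.parseDouble(System.getProperty("pricing.log.sampleRate", "0.05"));
}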
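The method above depends on Hotel and OptimizationContext and on a live random source, which makes it awkward to unit-test. One possible variant, sketched here with hypothetical names, takes the two conditions as plain booleans and the random draw as an injected supplier, and falls back to the 5% default if the pricing.log.sampleRate property is malformed or out of range.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

// Hypothetical, dependency-free variant of the sampling decision; not the production class.
final class PricingLogSampler {
    private static final double DEFAULT_RATE = 0.05;

    private final double sampleRate;
    private final DoubleSupplier random; // expected to return values in [0, 1)

    PricingLogSampler(double sampleRate, DoubleSupplier random) {
        // The negated range check also rejects NaN.
        if (!(sampleRate >= 0.0 && sampleRate <= 1.0)) {
            throw new IllegalArgumentException("sampleRate must be in [0, 1]: " + sampleRate);
        }
        this.sampleRate = sampleRate;
        this.random = random;
    }

    // Reads pricing.log.sampleRate once; malformed or out-of-range values fall back to 5%.
    static PricingLogSampler fromSystemProperty() {
        double rate;
        try {
            rate = Double.parseDouble(System.getProperty("pricing.log.sampleRate", "0.05"));
        } catch (NumberFormatException e) {
            rate = DEFAULT_RATE;
        }
        if (!(rate >= 0.0 && rate <= 1.0)) {
            rate = DEFAULT_RATE;
        }
        return new PricingLogSampler(rate, () -> ThreadLocalRandom.current().nextDouble());
    }

    // Errors, anomalies, and debug-tagged hotels always log; everything else is sampled.
    boolean shouldLog(boolean hasErrorsOrAnomalies, boolean debugTagged) {
        if (hasErrorsOrAnomalies || debugTagged) {
            return true;
        }
        return random.getAsDouble() < sampleRate;
    }

    public static void main(String[] args) {
        PricingLogSampler sampler = fromSystemProperty();
        System.out.println("error run logged: " + sampler.shouldLog(true, false));
    }
}
```

Whichever form is used, the decision is probably best taken once per optimization and reused for every line in that run, so that a sampled run yields a complete trace rather than a random 5% of its lines.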
Phase 2: Structural Improvements (Weeks 3-6)
| ID | Change | Where | Effort | Impact |
|---|---|---|---|---|
| M1 | Structured JSON logging for pricing module | PricingOptimizationLogger.java | 3-5 days | 40% volume reduction, queryable fields in Sumo |
| M2 | Async logging appenders (remove 82s from critical path) | log4j2.xml | 2 days | Near-zero latency impact |
| M3 | Sumo Logic partitions + ingest budgets for pricing | Sumo Logic admin / Terraform | 1 day | Cost control + faster queries |
| M4 | Frontend Web Vitals + RUM | duetto-frontend/src/index.tsx | 3 days | Frontend performance visibility |
| M5 | Apply sampling pattern to other gated perms | RateRecsGroupQuotationDelegate.java, GqlController.java | 2 days | Platform-wide observability |
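As an illustration of what M1 could produce, the sketch below renders one pricing decision as a single JSON line with named fields. It is not the PricingOptimizationLogger design: the class name and every field name are hypothetical, and a real implementation would more likely rely on the logging framework's JSON layout or an existing JSON library than on a hand-rolled serializer.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical one-line JSON renderer; field names in main() are illustrative only.
final class PricingJsonLine {

    // Escape the characters JSON requires inside string values.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 8);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    // Numbers, booleans, and null are written raw; everything else becomes a quoted string.
    // NaN and infinite doubles are not handled and would produce invalid JSON.
    static String toJson(Map<String, ?> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, ?> e : fields.entrySet()) {
            if (!first) {
                sb.append(',');
            }
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null || v instanceof Number || v instanceof Boolean) {
                sb.append(v);
            } else {
                sb.append('"').append(escape(v.toString())).append('"');
            }
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>();
        fields.put("event", "pricing_decision");
        fields.put("hotelId", "H-001");
        fields.put("stayDate", "2026-03-01");
        fields.put("roomType", "KING");
        fields.put("hurdleRate", 129.5);
        System.out.println(toJson(fields));
    }
}
```

Emitting named fields is what makes the logs queryable by field in Sumo Logic instead of by free-text pattern; whether it also delivers the projected 40% volume reduction is something the E-2026-GC-002 pilot would need to confirm.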
Exit Criteria
Validate If:
- Smart sampling provides sufficient data to diagnose a pricing incident within 30 minutes (vs current 4-8 hours)
- Sumo Logic ingest cost stays under $50/month for 100 hotels
- Structured JSON logs reduce Sumo Logic query time by 50%+ vs current free-text
- At least 1 real incident is diagnosed using the new logging
Kill If:
- Smart sampling misses critical diagnostic data in 2+ incidents
- Sumo Logic costs exceed $200/month for 100 hotels despite sampling
- Performance regression >5% in optimization cycle time
- After 6 weeks, MTTR improvement is less than 20%
Cost-Benefit Analysis
| Dimension | Current State | After Phase 1 | After Phase 2 |
|---|---|---|---|
| MTTR (pricing incidents) | 4-8 hrs extra | -50% | -60% |
| Sumo Logic cost | $0 | ~$45/mo | ~$35/mo (compression) |
| Optimization latency | 0 (no logging) | +1-2s (sampling) | ~0 (async) |
| Incident coverage | 0% | 100% errors, 5% baseline | 100% errors, 5% structured |
| Queryability | N/A | Free-text in Sumo | JSON fields, fast queries |
Related Initiatives
- I-2026-OBS-001 (engineering-platform/infrastructure): Long-term platform observability modernization — OpenTelemetry, tiered storage, standardize log shipping, metrics-first observability. This initiative (I-2026-GC-002) delivers the short/mid-term application-level changes; I-2026-OBS-001 builds the long-term infrastructure foundation.
- I-2026-PRC-001 (analytics/pricing): Pricerator — the pricing engine whose output we're improving observability for
Decision Log
| Date | Decision | Rationale | Next Step |
|---|---|---|---|
| 2026-02-25 | Initiated research into pricing logging architecture | Stuck optimization debugging revealed logs gated behind disabled perm | Map complete pipeline |
| 2026-02-25 | Mapped 5 Sumo Logic shipping mechanisms across 150 repos | Log shipper distributed across infrastructure repos, not in monorepo | Understand cost model |
| 2026-02-26 | Split initiative: app-level (GC-002) + platform-level (OBS-001) | Short/mid-term changes are in application code owned by pricing/groups teams; long-term is platform infrastructure | Begin experiment E-2026-GC-001 |
Risk Assessment
| Risk | Severity | Likelihood | Mitigation |
|---|---|---|---|
| Sampling misses critical diagnostic data | High | Low | 100% logging on errors/anomalies; configurable sample rate |
| Sumo Logic ingest budget blowout | Medium | Low | Fluent Bit rate limiter (S2) + Sumo partition budgets (M3) |
| Async logging causes log ordering issues | Medium | Low | Test in staging; use sequence IDs if needed |
| Structured logging breaks existing Sumo queries | Medium | Medium | Parallel running; backward compatibility |
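The "sequence IDs" mitigation for async ordering could be as small as a per-run counter stamped onto each message before it is handed to the appender. This is a hypothetical sketch (class name, run ID format, and message layout are all assumptions), shown only to illustrate that original order can be recovered at query time by sorting on the counter.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-optimization sequence stamper; message layout is illustrative only.
final class LogSequence {
    private final String runId;
    private final AtomicLong next = new AtomicLong();

    LogSequence(String runId) {
        this.runId = runId;
    }

    // Prefix the message with the run ID and a monotonically increasing sequence number.
    String stamp(String message) {
        return "run=" + runId + " seq=" + next.getAndIncrement() + " " + message;
    }

    public static void main(String[] args) {
        LogSequence seq = new LogSequence("opt-42");
        System.out.println(seq.stamp("start optimization"));
        System.out.println(seq.stamp("applied hurdle rate"));
    }
}
```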
Research Artifacts
- Unified Strategy: duetto/docs/research/duetto-logging-observability-strategy-2026-02-25.md
- Pricing Logging Analysis: docs/research/enable-pricing-logging-analysis-2026-02-25.md (local research, not in repo)
- Sumo Logic Shipper Discovery: docs/research/sumo-logic-log-shipper-discovery-2026-02-25.md (local research, not in repo)
- Frontend Patterns: duetto-frontend/docs/research/frontend-logging-patterns-2026-02-25.md
Team & Stakeholders
- Initiative Owner: Antonio
- Engineering: Pricing team, Groups team, Frontend team
- Stakeholders: SRE/Platform Engineering (for S2, M3 infrastructure changes)
- Repos Affected: duetto (backend), duetto-frontend, ops-k8s (Fluent Bit config)
Outcome
- Status: discovery
- Key Learning: [To be completed after experiments]
- Next Step: Begin experiment E-2026-GC-001 (Smart Sampling POC)