Initiative: Pricing & Rate-Management Observability

The Bet

Hypothesis: We believe that replacing permission-gated debug logging with smart sampling (5% baseline + 100% on errors), structured JSON logging, and enriched Sentry context will reduce pricing incident MTTR by 50% while keeping Sumo Logic ingest costs under $50/month for 100 hotels — because the current approach of gating logs behind ENABLE_PRICING_LOGGING forces teams to debug blind during the exact incidents when diagnostic data is most critical.

Team Metric: Mean Time to Resolution (MTTR) for pricing-related incidents

Current Baseline:
- MTTR for pricing incidents: 4-8 hours of additional debugging time when logs are unavailable
- "Insufficient logging" escalations: recurring whenever a pricing optimization gets stuck
- Observability coverage during incidents: ~0% (logs gated behind a disabled permission)

Target:
- MTTR reduction: 50% (eliminate the 4-8 hours of extra debugging caused by missing logs)
- Incident observability coverage: 100% for errors, 5% baseline sampling
- Sumo Logic cost: <$50/month for 100 hotels
- Zero "insufficient logging" escalations

Measurement: Track MTTR for pricing incidents before/after. Monitor Sumo Logic ingest volume via dashboards. Count escalations tagged "insufficient logging."

Why This Bet?

While investigating stuck pricing optimizations, we discovered that the most valuable diagnostic data — detailed pricerator state, hurdle rates, optimization rules applied — is hidden behind the ENABLE_PRICING_LOGGING dev-only permission. This permission is always disabled in production, creating a paradox: the logs we need most during incidents are the logs we never have.

Enabling the permission manually requires a database update (30-60 min propagation), and when enabled naively it generates ~109,200 log lines (~43.7 MB) per optimization with ~82 seconds of synchronous latency — making "always-on" untenable without architectural changes.

Research across 150 repositories mapped the complete Sumo Logic pipeline and indicated that smart sampling plus structured logging can deliver the observability we need at roughly 5% of the always-on ingest cost, with near-zero performance impact.

Evidence/Signal

Stuck optimization debugging: Team discovered critical diagnostic data was hidden behind disabled perm — the exact data needed to diagnose the issue was unavailable
Recurring pattern: Every pricing incident investigation hits the same wall — no logs available
Three permission gates: ENABLE_PRICING_LOGGING, ENABLE_GROUP_QUOTATION_LOGGING, and ENABLE_GRAPHQL_QUERIES_LOGGING create identical blind spots
Frontend-backend gap: No correlation IDs between frontend Sentry errors and backend Sumo Logic logs

Strategic Alignment

OKR: Improve platform reliability and reduce incident response time
Company Priority: Pricing accuracy directly impacts customer revenue and retention

Current State

Permission-Gated Logging

Three dev-only permissions control critical debug visibility:

Permission	Code	What It Gates	Code Location
`ENABLE_PRICING_LOGGING`	`"epl"`	Pricing optimization diagnostics	`PricingOptimizer.java`, `OptimizeResultUpdater.java`
`ENABLE_GROUP_QUOTATION_LOGGING`	—	Group quotation displacement	`RateRecsGroupQuotationDelegate.java`
`ENABLE_GRAPHQL_QUERIES_LOGGING`	—	GraphQL query parameters	`GqlController.java`

When ENABLE_PRICING_LOGGING is enabled, a triple nested loop produces:

364 days × ~15 pricing targets × ~20 room types = ~109,200 log lines per optimization
~43.7 MB volume | ~82 seconds synchronous latency | ~500K string allocations

Sentry Integration

Backend: Gack system → Sentry. Enabled in prod. But Gacks only capture errors, not the pricing decision context needed for diagnosis.
Frontend: LoggerUtil → Sentry. Context: user/company/property IDs. No correlation to backend.

Cost Model (Sumo Logic)

Scenario	Volume/Day	Monthly Cost
Current (perm off)	0	$0
Always-on, 100 hotels	~8.7 GB/day	~$780/month
5% sampling + errors, 100 hotels	~450 MB/day	~$45/month
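
The table can be cross-checked against the per-optimization figures quoted earlier. The sketch below is a rough sanity check, not the model used to produce the table. It assumes two optimization runs per hotel per day and about $3 per GB ingested. Both values are inferred from the table (8.7 GB/day ÷ (100 hotels × 43.7 MB) ≈ 2, and $780 ÷ (8.7 GB × 30 days) ≈ $3) and are not stated in the source.

```java
public class PricingLogCostModel {
    // Figures taken from this document.
    static final double MB_PER_OPTIMIZATION = 43.7;
    static final int HOTELS = 100;
    static final double SAMPLE_RATE = 0.05;

    // Assumptions inferred from the cost table, not stated in the source.
    static final double RUNS_PER_HOTEL_PER_DAY = 2.0;
    static final double USD_PER_GB = 3.0;
    static final int DAYS_PER_MONTH = 30;

    /** Daily ingest in GB (1 GB = 1000 MB) when the given fraction of optimizations is logged. */
    static double dailyGb(double loggedFraction) {
        return MB_PER_OPTIMIZATION * HOTELS * RUNS_PER_HOTEL_PER_DAY * loggedFraction / 1000.0;
    }

    /** Monthly ingest cost in USD for the given logged fraction. */
    static double monthlyUsd(double loggedFraction) {
        return dailyGb(loggedFraction) * DAYS_PER_MONTH * USD_PER_GB;
    }

    public static void main(String[] args) {
        System.out.printf("lines per optimization: %d%n", 364 * 15 * 20);
        System.out.printf("always-on: %.2f GB/day, $%.0f/month%n", dailyGb(1.0), monthlyUsd(1.0));
        System.out.printf("5%% baseline: %.3f GB/day, $%.0f/month%n",
                dailyGb(SAMPLE_RATE), monthlyUsd(SAMPLE_RATE));
    }
}
```

Under these assumptions the always-on case comes to about 8.7 GB/day, and the 5% baseline alone to about 0.44 GB/day (roughly $39/month). The remaining gap to the table's ~450 MB/day and ~$45/month is presumably the allowance for error-triggered logging.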

Discovery Plan

Time Box

Max Duration: 6 weeks
Max Experiments: 3
Decision Date: 2026-04-09

Planned Experiments

E-2026-GC-001: Smart sampling POC — Implement 5% sampling + error-always-on in PricingOptimizer on staging. Measure log volume, latency, and Sumo Logic query capability. (Weeks 1-2)
E-2026-GC-002: Structured logging + async appender pilot — Convert PricingOptimizationLogger to JSON with async log4j2 appender. Measure compression and query performance. (Weeks 3-4)
E-2026-GC-003: End-to-end correlation — MDC trace context in backend + X-Trace-ID from frontend. Validate full request tracing. (Weeks 5-6)

Phase 1: Tactical Wins (Weeks 1-2)

ID	Change	Where	Effort	Impact
S1	Replace `ENABLE_PRICING_LOGGING` with smart sampling (5% + 100% on errors)	`PricingOptimizer.java` (lines 252, 376), `OptimizeResultUpdater.java` (lines 138, 182, 286)	2-3 days	100% error coverage, 5% baseline
S2	Add Fluent Bit rate-limiting filter as safety net	`ops-k8s/.../fluentbit/conf/filters.conf`	1 day	Ingest budget protection
S3	Enrich Sentry Gacks with pricing context (hotel, dates, anomaly type)	`PricingOptimizer.java`, Gack system	1-2 days	+50% error context in Sentry
S4	Populate MDC trace context (already in log4j2 pattern, never populated)	New servlet filter	2 days	Cross-request correlation in Sumo
S5	Frontend-to-backend correlation IDs	`ApolloClientProvider.tsx` + backend controller	2 days	End-to-end request tracing
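
For S4 and S5 in the table above, the core logic is small: take the `X-Trace-ID` sent by the frontend if it is usable, otherwise generate one, bind it to the request thread, and always clear it afterwards. The sketch below shows only that logic with the JDK alone. The class name, the validation pattern, and the `ThreadLocal` are illustrative assumptions; in the real servlet filter the `ThreadLocal` would be replaced by the logging framework's MDC (for log4j2, `ThreadContext.put` and `ThreadContext.remove`).

```java
import java.util.UUID;
import java.util.regex.Pattern;

public class TraceIds {
    // Hypothetical constraint: accept only short, log-safe IDs from clients.
    private static final Pattern SAFE = Pattern.compile("[A-Za-z0-9-]{8,64}");

    // Stand-in for the logging framework's per-thread context (MDC).
    private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

    /** Returns the caller-supplied trace ID if it looks safe, otherwise a fresh UUID. */
    public static String resolve(String headerValue) {
        if (headerValue != null && SAFE.matcher(headerValue).matches()) {
            return headerValue;
        }
        return UUID.randomUUID().toString();
    }

    /** Trace ID bound to the current thread, or null outside a traced request. */
    public static String current() {
        return CURRENT.get();
    }

    /** Runs the work with a trace ID bound to this thread and always clears it afterwards. */
    public static void withTrace(String headerValue, Runnable work) {
        CURRENT.set(resolve(headerValue));
        try {
            work.run();
        } finally {
            CURRENT.remove(); // avoid leaking the ID to the next request on a pooled thread
        }
    }
}
```

Validating the incoming header before logging it keeps client-controlled text such as embedded newlines out of the log stream. The `finally` block matters because servlet containers reuse threads across requests.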

S1: Smart Sampling Implementation

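import java.util.concurrent.ThreadLocalRandom;

// Returns true when pricing debug logs should be emitted for this optimization.
private boolean shouldLogPricingDebug(Hotel hotel, OptimizationContext ctx) {
    if (ctx.hasErrors() || ctx.hasAnomalousRates()) return true;  // Always log errors and anomalous rates
    if (hotel.hasTag("debug-pricing")) return true;               // Per-hotel opt-in; replaces the perm
    // Otherwise sample at the configured baseline rate (default 5%)
    return ThreadLocalRandom.current().nextDouble() <
        Double.parseDouble(System.getProperty("pricing.log.sampleRate", "0.05"));
}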
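
The method above parses the system property on every call and would throw `NumberFormatException` on a malformed value. A hedged variant is sketched below: it falls back to the 5% default on bad input, clamps the rate to [0, 1], and takes the random source as a parameter so the decision can be unit-tested deterministically. `PricingLogSampler` and its method names are hypothetical, and the two boolean arguments stand in for the `ctx` and `hotel` checks.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.DoubleSupplier;

public class PricingLogSampler {
    private static final double DEFAULT_RATE = 0.05;

    private final double sampleRate;
    private final DoubleSupplier random; // injectable so tests can be deterministic

    public PricingLogSampler(double sampleRate, DoubleSupplier random) {
        if (Double.isNaN(sampleRate) || sampleRate < 0.0 || sampleRate > 1.0) {
            throw new IllegalArgumentException("sampleRate must be in [0, 1]: " + sampleRate);
        }
        this.sampleRate = sampleRate;
        this.random = random;
    }

    /** Builds a sampler from -Dpricing.log.sampleRate, using the default on missing or bad input. */
    public static PricingLogSampler fromSystemProperty() {
        double rate;
        try {
            rate = Double.parseDouble(System.getProperty("pricing.log.sampleRate", "0.05"));
        } catch (NumberFormatException e) {
            rate = DEFAULT_RATE;
        }
        if (Double.isNaN(rate)) rate = DEFAULT_RATE;
        rate = Math.max(0.0, Math.min(1.0, rate)); // clamp out-of-range values
        return new PricingLogSampler(rate, () -> ThreadLocalRandom.current().nextDouble());
    }

    /** Same precedence as the method above: errors and anomalies, then the debug tag, then sampling. */
    public boolean shouldLog(boolean hasErrorsOrAnomalies, boolean hotelTaggedForDebug) {
        if (hasErrorsOrAnomalies) return true;
        if (hotelTaggedForDebug) return true;
        return random.getAsDouble() < sampleRate;
    }
}
```

One trade-off: `fromSystemProperty()` reads the rate once, so changing the property at runtime requires building a new sampler. The original method picks up changes on the next call.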

Phase 2: Structural Improvements (Weeks 3-6)

ID	Change	Where	Effort	Impact
M1	Structured JSON logging for pricing module	`PricingOptimizationLogger.java`	3-5 days	40% volume reduction, queryable fields in Sumo
M2	Async logging appenders (remove 82s from critical path)	`log4j2.xml`	2 days	Near-zero latency impact
M3	Sumo Logic partitions + ingest budgets for pricing	Sumo Logic admin / Terraform	1 day	Cost control + faster queries
M4	Frontend Web Vitals + RUM	`duetto-frontend/src/index.tsx`	3 days	Frontend performance visibility
M5	Apply sampling pattern to other gated perms	`RateRecsGroupQuotationDelegate.java`, `GqlController.java`	2 days	Platform-wide observability
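
For M1 in the table above, the goal is one JSON object per log line so that Sumo Logic can query individual fields. The sketch below illustrates the target shape using only the JDK. The field names (`event`, `hotelId`, `stayDate`, `roomType`, `hurdleRate`, `sampled`) are hypothetical examples, not the actual `PricingOptimizationLogger` schema, and in practice a JSON layout or library would likely handle the serialization.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PricingLogLine {
    /** Serializes flat fields to a single-line JSON object; numbers and booleans are emitted bare. */
    public static String toJson(Map<String, Object> fields) {
        StringBuilder sb = new StringBuilder("{");
        boolean first = true;
        for (Map.Entry<String, Object> e : fields.entrySet()) {
            if (!first) sb.append(',');
            first = false;
            sb.append('"').append(escape(e.getKey())).append("\":");
            Object v = e.getValue();
            if (v == null) {
                sb.append("null");
            } else if (v instanceof Number) {
                // NaN and Infinity are not valid JSON, so emit null for them.
                sb.append(Double.isFinite(((Number) v).doubleValue()) ? v.toString() : "null");
            } else if (v instanceof Boolean) {
                sb.append(v);
            } else {
                sb.append('"').append(escape(v.toString())).append('"');
            }
        }
        return sb.append('}').toString();
    }

    /** Escapes quotes, backslashes, and control characters so each event stays on one line. */
    private static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) sb.append(String.format("\\u%04x", (int) c));
                    else sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Object> fields = new LinkedHashMap<>(); // preserves field order
        fields.put("event", "pricing.rate_decision");
        fields.put("hotelId", "H-123");
        fields.put("stayDate", "2026-03-01");
        fields.put("roomType", "KING");
        fields.put("hurdleRate", 189.0);
        fields.put("sampled", true);
        System.out.println(toJson(fields));
    }
}
```

Keeping each event on a single line with stable field names is what makes the logs queryable by field rather than by free-text search.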

Exit Criteria

Validate If:

Smart sampling provides sufficient data to diagnose a pricing incident within 30 minutes (vs current 4-8 hours)
Sumo Logic ingest cost stays under $50/month for 100 hotels
Structured JSON logs reduce Sumo Logic query time by 50%+ vs current free-text
At least 1 real incident is diagnosed using the new logging

Kill If:

Smart sampling misses critical diagnostic data in 2+ incidents
Sumo Logic costs exceed $200/month for 100 hotels despite sampling
Performance regression >5% in optimization cycle time
After 6 weeks, MTTR improvement is less than 20%

Cost-Benefit Analysis

Dimension	Current State	After Phase 1	After Phase 2
MTTR (pricing incidents)	4-8 hrs extra	-50%	-60%
Sumo Logic cost	$0	~$45/mo	~$35/mo (compression)
Optimization latency	0 (no logging)	+1-2s (sampling)	~0 (async)
Incident coverage	0%	100% errors, 5% baseline	100% errors, 5% structured
Queryability	N/A	Free-text in Sumo	JSON fields, fast queries

Related Initiatives

I-2026-OBS-001 (engineering-platform/infrastructure): Long-term platform observability modernization (OpenTelemetry, tiered storage, standardized log shipping, metrics-first observability). This initiative (I-2026-GC-002) delivers the short/mid-term application-level changes; I-2026-OBS-001 builds the long-term infrastructure foundation.
I-2026-PRC-001 (analytics/pricing): Pricerator, the pricing engine whose output this initiative makes observable.

Decision Log

Date	Decision	Rationale	Next Step
2026-02-25	Initiated research into pricing logging architecture	Stuck optimization debugging revealed logs gated behind disabled perm	Map complete pipeline
2026-02-25	Mapped 5 Sumo Logic shipping mechanisms across 150 repos	Log shipper distributed across infrastructure repos, not in monorepo	Understand cost model
2026-02-26	Split initiative: app-level (GC-002) + platform-level (OBS-001)	Short/mid-term changes are in application code owned by pricing/groups teams; long-term is platform infrastructure	Begin experiment E-2026-GC-001

Risk Assessment

Risk	Severity	Likelihood	Mitigation
Sampling misses critical diagnostic data	High	Low	100% logging on errors/anomalies; configurable sample rate
Sumo Logic ingest budget blowout	Medium	Low	Fluent Bit rate limiter (S2) + Sumo partition budgets (M3)
Async logging causes log ordering issues	Medium	Low	Test in staging; use sequence IDs if needed
Structured logging breaks existing Sumo queries	Medium	Medium	Parallel running; backward compatibility

Research Artifacts

Unified Strategy: duetto/docs/research/duetto-logging-observability-strategy-2026-02-25.md
Pricing Logging Analysis: docs/research/enable-pricing-logging-analysis-2026-02-25.md (local research — not in repo)
Sumo Logic Shipper Discovery: docs/research/sumo-logic-log-shipper-discovery-2026-02-25.md (local research — not in repo)
Frontend Patterns: duetto-frontend/docs/research/frontend-logging-patterns-2026-02-25.md

Team & Stakeholders

Initiative Owner: Antonio
Engineering: Pricing team, Groups team, Frontend team
Stakeholders: SRE/Platform Engineering (for S2, M3 infrastructure changes)
Repos Affected: duetto (backend), duetto-frontend, ops-k8s (Fluent Bit config)

Outcome

Status: discovery
Key Learning: [To be completed after experiments]
Next Step: Begin experiment E-2026-GC-001 (Smart Sampling POC)
