Platform Modernization: MongoDB, Security, and Growth Readiness
Initiative: Platform Modernization
What This Is
Infrastructure modernization to eliminate the scaling ceilings, security gaps, and manual toil blocking the 70:28 growth target. This is not a product feature -- it's the floor that feature teams stand on.
The full plan is in the Platform Team Charter (TC-001).
Why Now
MongoDB 4.2's findAndModify() contention bug is the root cause of the three highest-impact customer-facing issues. Redis carries an unpatched critical vulnerability. A departed engineer left uncached KMS calls burning $6K+ per 11-day period. Every one of these gets worse with growth.
David Gerrard was told "60-90 days to 4.4" in December 2025. That window has passed.
| Problem | Impact | Ticket |
|---|---|---|
| Stuck optimize jobs | Customers unable to publish rates during peak | PLA-4306 / RATE-6809 |
| Import performance | Support's #1 priority | TSQ-1598 |
| Batch server ceiling | Capped at 16 since June 2025 rollback (was 24) | War room, June 2025 |
| KMS cost spike | CloudTrail $713 → $6K in 11 days; 12% stage error rate | Arek's analysis, Feb 2 |
| Redis CVE | Unpatched critical vuln on v3.2.12 in production | SRE-2987 |
| Ghost hotel onboarding | ~$10K FTE cost per deal | David Gerrard |
| Config drift | Stage crashed Jan 9 from one missing property | PR #8781 |
| 15 years of unarchived data | Inflated storage, slower queries, harder sharding design | -- |
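To illustrate the KMS row above: the cost spike is attributed to uncached KMS calls, and the usual mitigation is a short-lived in-process cache so that repeated requests for the same key do not each hit the service. The sketch below is a minimal illustration only, not the team's implementation. `TTLCache`, `fake_kms_decrypt`, the key ID, and the 300-second TTL are all hypothetical; a real version would wrap the actual KMS client call and would need a review of how long decrypted material may be held in memory.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Cache the result of an expensive call (e.g. a KMS decrypt) for a fixed time."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        # key_id -> (expiry time, cached value)
        self._entries: Dict[str, Tuple[float, bytes]] = {}
        self.misses = 0  # calls that actually reached the backing service

    def get(self, key_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key_id)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: no remote call
        value = self._fetch(key_id)
        self.misses += 1
        self._entries[key_id] = (now + self._ttl, value)
        return value


def fake_kms_decrypt(key_id: str) -> bytes:
    # Hypothetical stand-in for the real KMS call, which would be billed
    # and logged to CloudTrail on every invocation.
    return f"plaintext-for-{key_id}".encode()


cache = TTLCache(fake_kms_decrypt, ttl_seconds=300)
for _ in range(1000):
    cache.get("tenant-a")
print(cache.misses)  # 1
```

With the cache in place, 1,000 lookups for the same key produce a single backing call within the TTL window. The `misses` counter is the kind of number that could be exported to a dashboard to confirm the call volume has dropped.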
Delivery Model
This initiative uses a team charter rather than the APEX experiment → PRD pipeline. The work is a phased infrastructure upgrade with known steps, not a product hypothesis requiring validation.
- Charter: TC-001 Platform Team Charter
- Team: 5 engineers, 3.5 FTE effective (2 full-time anchors + 3 at 50%)
- Duration: 8 months (March -- October 2026)
- Owner: Shiv Yadav, Director of Engineering
Success Criteria
Pulled directly from the charter. At the end of 8 months:
- Production MongoDB on v7+ by Month 5; sharding strategy for RS2-RS9 in implementation
- Batch servers at 24+ with no CPU saturation incidents
- Zero customer-impacting stuck optimize jobs per month
- Redis CVE SRE-2987 closed; non-monolith services on managed ElastiCache
- Zero config-drift-caused outages; CI validation gate in place
- KMS/CloudTrail costs at pre-January 2026 baseline
- Ghost hotel onboarding time reduced >80%
- Data retention policy in effect; RS2-RS9 working set measurably reduced
- No single-person dependency -- every critical system (MongoDB, Redis, KMS, config, DRE) has 2+ engineers who can independently operate, troubleshoot, and recover it
Measurement: Datadog dashboards, CloudTrail cost reports, PLA-4306 recurrence rate, batch server count, SRE-2987 status
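The CI validation gate in the criteria above could take many forms. As a rough sketch, assuming configuration lives in simple `key=value` property files (an assumption, since the charter's actual config format is not described here), a gate might compare each environment's keys against a reference set and fail the build when any are missing. The file contents, environment names, and function names below are hypothetical.

```python
from typing import Dict, List, Set


def parse_properties(text: str) -> Set[str]:
    """Return the set of keys found in a simple key=value properties file."""
    keys: Set[str] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, _ = line.partition("=")
        if sep:
            keys.add(key.strip())
    return keys


def missing_keys(
    reference: Set[str], environments: Dict[str, Set[str]]
) -> Dict[str, List[str]]:
    """Map each environment to the sorted reference keys it lacks."""
    report: Dict[str, List[str]] = {}
    for name, keys in environments.items():
        gaps = sorted(reference - keys)
        if gaps:
            report[name] = gaps
    return report


# Hypothetical inputs: in CI these would be read from the repository.
reference = parse_properties("db.url=x\ncache.ttl=60\nkms.key.id=abc\n")
envs = {
    "stage": parse_properties("db.url=y\ncache.ttl=60\n"),
    "prod": parse_properties("db.url=z\ncache.ttl=60\nkms.key.id=def\n"),
}
report = missing_keys(reference, envs)
print(report)  # {'stage': ['kms.key.id']}
```

In a pipeline, a non-empty `report` would translate to a non-zero exit code so the merge is blocked before a missing property can reach an environment.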
Decision Log
| Date | Decision | Rationale |
|---|---|---|
| 2026-02-22 | Charter model, not experiment → PRD | Infrastructure upgrade with known steps; not a product hypothesis |
| 2026-02-22 | MongoDB upgrade as critical path | findAndModify() bug is root cause of top 3 customer issues; 4.4 mitigates, 7+ enables sharding |
| 2026-02-22 | Delphix for pipelined validation | Compresses 12+ month sequential upgrade to 5 months by parallelizing app validation on virtual clones |
| 2026-02-22 | 3.5 FTE team (Matthew, Yancy full-time; Ian, Ram R, Arek at 50%) | Covers MongoDB infra, monolith internals, app validation, deployment safety, data architecture |
| 2026-02-22 | DRE upgrade deferred past monolith | Haskell community driver not in official compat matrix; AL2023 VM blocker; not a hard gate for main path |
Outcome
- Status: delivery
- Key Learning: TBD (charter in execution)
- Next Step: Monthly leadership reviews per charter operating model