initiative delivery

Platform Modernization: MongoDB, Security, and Growth Readiness

Shiv Yadav | Updated 2026-03-11 | engineering-platform | monolith
Tags: mongodb, infrastructure, security, platform, h2-2026

Initiative: Platform Modernization

What This Is

Infrastructure modernization to eliminate the scaling ceilings, security gaps, and manual toil blocking the 70:28 growth target. This is not a product feature -- it's the floor that feature teams stand on.

The full plan is in the Platform Team Charter (TC-001).

Why Now

MongoDB 4.2's findAndModify() contention bug is the root cause of the three highest-impact customer-facing issues. Redis carries an unpatched critical vulnerability. A departed engineer left uncached KMS calls burning $6K+ per 11-day period. Every one of these gets worse with growth.

David Gerrard was told "60-90 days to 4.4" in December 2025. That window has passed.

Problem | Impact | Ticket / Source
Stuck optimize jobs | Customers unable to publish rates during peak | PLA-4306 / RATE-6809
Import performance | Support's #1 priority | TSQ-1598
Batch server ceiling | Capped at 16 since June 2025 rollback (was 24) | War room, June 2025
KMS cost spike | CloudTrail $713 → $6K in 11 days; 12% stage error rate | Arek's analysis, Feb 2
Redis CVE | Unpatched critical vuln on v3.2.12 in production | SRE-2987
Ghost hotel onboarding | ~$10K FTE cost per deal | David Gerrard
Config drift | Stage crashed Jan 9 from one missing property | PR #8781
15 years unarchived data | Inflated storage, slower queries, harder sharding design | --
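
The KMS cost spike above is attributed to uncached KMS calls. As a minimal sketch of the kind of fix involved, assuming decrypted keys may safely be held in memory for a short time, the wrapper below memoizes a key lookup for a fixed TTL. `CachedKeyProvider`, `fetch_key`, and the 300-second default are hypothetical names and values for illustration; they are not taken from the codebase or from any AWS SDK.

```python
import time
from typing import Callable, Dict, Tuple


class CachedKeyProvider:
    """Memoize an expensive key lookup (e.g. a KMS decrypt call) for a fixed TTL."""

    def __init__(
        self,
        fetch_key: Callable[[str], bytes],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_key = fetch_key  # hypothetical: wraps the real remote call
        self._ttl = ttl_seconds      # assumed value; set per security policy
        self._clock = clock          # injectable so tests can control time
        self._cache: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key_id: str) -> bytes:
        now = self._clock()
        hit = self._cache.get(key_id)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # served from memory, no remote call
        value = self._fetch_key(key_id)  # cache miss or expired entry
        self._cache[key_id] = (now, value)
        return value
```

With a cache like this, repeated lookups of the same key inside the TTL make one remote call instead of many. The sketch is not thread-safe and never evicts entries; a production version would need both, plus a TTL agreed with security.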

Delivery Model

This initiative uses a team charter rather than the APEX experiment → PRD pipeline. The work is a phased infrastructure upgrade with known steps, not a product hypothesis requiring validation.

Charter: TC-001 Platform Team Charter
Team: 5 engineers, 3.5 FTE effective (2 full-time anchors + 3 at 50%)
Duration: 8 months (March -- October 2026)
Owner: Shiv Yadav, Director of Engineering
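
The effective-FTE figure follows directly from the allocation split. A quick check of the arithmetic, using only the numbers stated above:

```python
# Allocation per engineer as stated in the charter summary:
# 2 full-time anchors plus 3 engineers at 50%.
allocations = [1.0, 1.0, 0.5, 0.5, 0.5]

headcount = len(allocations)
effective_fte = sum(allocations)

print(f"{headcount} engineers, {effective_fte} FTE effective")
# prints: 5 engineers, 3.5 FTE effective
```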

Success Criteria

Pulled directly from the charter. At the end of 8 months:

  1. Production MongoDB on v7+ by Month 5; sharding strategy for RS2-RS9 in implementation
  2. Batch servers at 24+ with no CPU saturation incidents
  3. Zero customer-impacting stuck optimize jobs per month
  4. Redis CVE SRE-2987 closed; non-monolith services on managed ElastiCache
  5. Zero config-drift-caused outages; CI validation gate in place
  6. KMS/CloudTrail costs at pre-January 2026 baseline
  7. Ghost hotel onboarding time reduced >80%
  8. Data retention policy in effect; RS2-RS9 working set measurably reduced
  9. No single-person dependency -- every critical system (MongoDB, Redis, KMS, config, DRE) has 2+ engineers who can independently operate, troubleshoot, and recover it

Measurement: Datadog dashboards, CloudTrail cost reports, PLA-4306 recurrence rate, batch server count, SRE-2987 status
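
Criterion 5 calls for a CI validation gate against config drift. One possible shape for such a gate, assuming each environment's configuration is a flat `key=value` properties file and one environment serves as the reference, is a check that reports keys defined in the reference but missing elsewhere. The function names and sample keys below are hypothetical; the charter's actual gate design is not described in this document.

```python
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple `key=value` lines, skipping blanks and `#` comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_keys(reference: Dict[str, str], candidate: Dict[str, str]) -> List[str]:
    """Return keys defined in the reference config but absent from the candidate."""
    return sorted(set(reference) - set(candidate))


# Hypothetical sample data; a real gate would read files from the repository.
reference_cfg = parse_properties("db.url=mongodb://prod\ncache.ttl=300\n")
stage_cfg = parse_properties("# stage overrides\ndb.url=mongodb://stage\n")

print(missing_keys(reference_cfg, stage_cfg))
# prints: ['cache.ttl']
```

In CI, a non-empty result would fail the build, so a single missing property is caught before deploy rather than at startup.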

Decision Log

Date | Decision | Rationale
2026-02-22 | Charter model, not experiment → PRD | Infrastructure upgrade with known steps; not a product hypothesis
2026-02-22 | MongoDB upgrade as critical path | findAndModify() bug is root cause of top 3 customer issues; 4.4 mitigates, 7+ enables sharding
2026-02-22 | Delphix for pipelined validation | Compresses 12+ month sequential upgrade to 5 months by parallelizing app validation on virtual clones
2026-02-22 | 3.5 FTE team (Matthew, Yancy full-time; Ian, Ram R, Arek at 50%) | Covers MongoDB infra, monolith internals, app validation, deployment safety, data architecture
2026-02-22 | DRE upgrade deferred past monolith | Haskell community driver not in official compat matrix; AL2023 VM blocker; not a hard gate for main path
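
MongoDB's documented upgrade procedure moves one major release series at a time, which is what makes the path from 4.2 to 7+ sequential. A small sketch of the resulting hop list, assuming that rule applies unchanged to this deployment; the function name is illustrative, and hop durations are deliberately not modeled because the document does not state them.

```python
from typing import List

# Major release series between 4.2 and 7.0, in upgrade order.
RELEASE_SERIES: List[str] = ["4.2", "4.4", "5.0", "6.0", "7.0"]


def upgrade_hops(current: str, target: str) -> List[str]:
    """Return the ordered hops needed to move from `current` to `target`."""
    start = RELEASE_SERIES.index(current)
    end = RELEASE_SERIES.index(target)
    if end < start:
        raise ValueError("downgrade paths are not modeled")
    pairs = zip(RELEASE_SERIES[start:end], RELEASE_SERIES[start + 1:end + 1])
    return [f"{a} -> {b}" for a, b in pairs]


print(upgrade_hops("4.2", "7.0"))
# prints: ['4.2 -> 4.4', '4.4 -> 5.0', '5.0 -> 6.0', '6.0 -> 7.0']
```

Each hop needs its own application validation pass, which is the work the Delphix decision aims to run in parallel on virtual clones instead of back to back.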

Outcome

Status: delivery
Key Learning: TBD (charter in execution)
Next Step: Monthly leadership reviews per charter operating model