Initiative: Platform Modernization

What This Is

Infrastructure modernization to eliminate the scaling ceilings, security gaps, and manual toil blocking the 70:28 growth target. This is not a product feature -- it's the floor that feature teams stand on.

The full plan is in the Platform Team Charter (TC-001).

Why Now

MongoDB 4.2's findAndModify() contention bug is the root cause of the three highest-impact customer-facing issues. Redis carries an unpatched critical vulnerability. A departed engineer left uncached KMS calls burning $6K+ per 11-day period. Every one of these gets worse with growth.

David Gerrard was told "60-90 days to 4.4" (the MongoDB 4.4 upgrade) in December 2025. That window has passed.

Problem	Impact	Ticket
Stuck optimize jobs	Customers unable to publish rates during peak	PLA-4306 / RATE-6809
Import performance	Support's #1 priority	TSQ-1598
Batch server ceiling	Capped at 16 since June 2025 rollback (was 24)	War room, June 2025
KMS cost spike	CloudTrail $713 → $6K in 11 days; 12% stage error rate	Arek's analysis, Feb 2
Redis CVE	Unpatched critical vuln on v3.2.12 in production	SRE-2987
Ghost hotel onboarding	~$10K FTE cost per deal	David Gerrard
Config drift	Stage crashed Jan 9 from one missing property	PR #8781
15 years unarchived data	Inflated storage, slower queries, harder sharding design	--
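
To illustrate the KMS row above: the cost comes from calls that go to KMS every time instead of reusing a recent result. A minimal sketch of a time-bounded cache follows, assuming the lookup can safely be reused for a few minutes. `TTLCache`, `fake_kms_fetch`, the key ID, and the 300-second TTL are all hypothetical names and values for illustration, not taken from the codebase or from Arek's analysis.

```python
import time
from typing import Callable, Dict, Tuple


class TTLCache:
    """Caches the result of an expensive lookup for a fixed number of seconds."""

    def __init__(self, fetch: Callable[[str], bytes], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # the expensive call (hypothetically, a KMS request)
        self._ttl = ttl_seconds      # how long a cached value stays valid
        self._clock = clock          # injectable so expiry can be tested without sleeping
        self._entries: Dict[str, Tuple[float, bytes]] = {}

    def get(self, key_id: str) -> bytes:
        now = self._clock()
        entry = self._entries.get(key_id)
        if entry is not None and now < entry[0]:
            return entry[1]          # still fresh: no remote call
        value = self._fetch(key_id)  # missing or expired: one remote call
        self._entries[key_id] = (now + self._ttl, value)
        return value


calls = []


def fake_kms_fetch(key_id: str) -> bytes:
    # Hypothetical stand-in for a real KMS request; records each remote call.
    calls.append(key_id)
    return b"key-material-for-" + key_id.encode()


cache = TTLCache(fake_kms_fetch, ttl_seconds=300)
for _ in range(1000):
    cache.get("tenant-a")
print(len(calls))  # 1: a thousand lookups, one remote call
```

The TTL is the trade-off knob: a longer value cuts call volume further but keeps key material in memory longer, so the right number is a security decision as much as a cost one.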

Delivery Model

This initiative uses a team charter rather than the APEX experiment → PRD pipeline. The work is a phased infrastructure upgrade with known steps, not a product hypothesis requiring validation.

Charter: TC-001 Platform Team Charter
Team: 5 engineers, 3.5 FTE effective (2 full-time anchors + 3 at 50%)
Duration: 8 months (March -- October 2026)
Owner: Shiv Yadav, Director of Engineering

Success Criteria

Pulled directly from the charter. At the end of 8 months:

Production MongoDB on v7+ by Month 5; sharding strategy for RS2-RS9 in implementation
Batch servers at 24+ with no CPU saturation incidents
Zero customer-impacting stuck optimize jobs per month
Redis CVE SRE-2987 closed; non-monolith services on managed ElastiCache
Zero config-drift-caused outages; CI validation gate in place
KMS/CloudTrail costs at pre-January 2026 baseline
Ghost hotel onboarding time reduced >80%
Data retention policy in effect; RS2-RS9 working set measurably reduced
No single-person dependency -- every critical system (MongoDB, Redis, KMS, config, DRE) has 2+ engineers who can independently operate, troubleshoot, and recover it

Measurement: Datadog dashboards, CloudTrail cost reports, PLA-4306 recurrence rate, batch server count, SRE-2987 status
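
The CI validation gate in the criteria above could take roughly the following shape. This is a sketch only, assuming configuration lives in simple key=value property files and that one environment serves as the reference set of required keys. The function names and property names (`db.url`, `cache.ttl`, `kms.region`) are hypothetical and not drawn from PR #8781.

```python
from typing import Dict, List, Set


def parse_properties(text: str) -> Dict[str, str]:
    """Parses simple key=value lines, skipping blanks and # comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def missing_keys(required: Set[str], env_props: Dict[str, str]) -> List[str]:
    """Returns the required keys that the environment does not define, sorted."""
    return sorted(required - set(env_props))


# Hypothetical reference and stage configs; stage lacks one property.
reference = parse_properties(
    "db.url=mongodb://prod\ncache.ttl=300\nkms.region=us-east-1\n"
)
stage = parse_properties(
    "# stage overrides\ndb.url=mongodb://stage\ncache.ttl=300\n"
)

problems = missing_keys(set(reference), stage)
print(problems)  # ['kms.region']
```

In a pipeline, a non-empty `problems` list would fail the build before deploy, which is the class of error behind the Jan 9 stage crash.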

Decision Log

Date	Decision	Rationale
2026-02-22	Charter model, not experiment → PRD	Infrastructure upgrade with known steps; not a product hypothesis
2026-02-22	MongoDB upgrade as critical path	`findAndModify()` bug is root cause of top 3 customer issues; 4.4 mitigates, 7+ enables sharding
2026-02-22	Delphix for pipelined validation	Compresses 12+ month sequential upgrade to 5 months by parallelizing app validation on virtual clones
2026-02-22	3.5 FTE team (Matthew, Yancy full-time; Ian, Ram R, Arek at 50%)	Covers MongoDB infra, monolith internals, app validation, deployment safety, data architecture
2026-02-22	DRE upgrade deferred past monolith	Haskell community driver not in official compat matrix; AL2023 VM blocker; not a hard gate for main path

Outcome

Status: delivery
Key Learning: TBD (charter in execution)
Next Step: Monthly leadership reviews per charter operating model