
Polaris: Unified Architecture Direction for Domain-Driven Platform Evolution

Robert Matsuoka Updated 2026-03-11

Polaris Architecture Proposal

Author: Robert Matsuoka, CTO
Date: 2026-02-28
Status: DRAFT
Audience: Senior Engineering Leadership Team (SELT)
Supersedes: North Star Architecture (2024)


Executive Summary

North Star served Duetto well from 2023 to 2024. It gave us a shared language, established domain thinking, and produced real production successes — market-domain, integrations extraction, the BIRCH API gateway layer. Its overall score of 7.2/10 was earned.

But honest evaluation reveals a pattern: North Star was strategically excellent and operationally incomplete. The gaps are not minor. Observability scored 4/10. Resilience scored 5/10. Event-driven design was absent. Domain models stayed anemic — CRUD wrappers, not domain services. We added distributed complexity without removing monolith complexity.

2026 is a decision year. We have the production infrastructure to move decisively. The question is whether we course-correct around North Star's edges or adopt a sharper frame that addresses the gaps head-on.

This document proposes Polaris: a unified architectural direction built on four pillars that correct North Star's weaknesses while preserving what worked. Polaris is not a rebrand. It is a materially different set of commitments — about interface contracts, streaming-first integration, first-class data science access, and a faster path out of the monolith.

The name is deliberate. Polaris is the fixed point navigators steer by. We have spent two years discussing where to go. This proposal defines the fixed point we orient from.

This document asks SELT for six specific commitments. They are listed in Section 6.


1. The Opportunity: Why Now, Why Polaris

What North Star Got Right

North Star established strategic clarity that Duetto lacked before 2023:

  • Domain-driven decomposition gave teams ownership boundaries
  • The strangler fig pattern prevented a catastrophic big-bang rewrite
  • BIRCH (Duetto's GraphQL API gateway — distinct from the Apollo Federation standard) provided a sensible abstraction layer
  • Three production extractions validated the direction: market-domain, integrations, intelligence-domain

The strategic score was 9/10. That assessment stands.

What North Star Left Unfinished

Catherine Daves' critique, delivered during our Q4 engineering review, was precise and correct:

The pattern that emerged: we added complexity without removing any. The new services became CRUD wrappers — proxying the database rather than modeling the domain — and we have no visibility into what's actually happening across service boundaries.

The data supports this:

| Dimension | North Star Score | Industry Standard |
|---|---|---|
| Strategic Direction | 9/10 | Achieved |
| Observability | 4/10 | 8/10 minimum |
| Resilience Patterns | 5/10 | 7/10 minimum |
| Event-Driven Integration | Not present | Table stakes |
| Domain Model Richness | Anemic | Rich required |

The observability gap is not cosmetic. Distributed systems without tracing are operationally blind. When a request spans the monolith, BIRCH, market-domain, and intelligence-domain, we have no end-to-end visibility. Every incident investigation is archaeology.

The anemic domain model problem is a design debt that compounds. CRUD wrappers are easy to build and expensive to evolve. A service that is just a database proxy with a REST facade is not a domain service — it is a deployment unit that adds latency without adding value.

Why 2026 Is the Year to Act

Three conditions converge in 2026 that did not exist in 2023:

  1. Production infrastructure exists. gcdb-stream-processor is running in production as an Anti-Corruption Layer. SaveHooks instrument 20+ entities in the monolith. Kinesis Firehose to S3 is operational. The "thin needles" are already in place.

  2. Data lake is real. Bronze/Silver/Gold layers exist in S3 with Iceberg. datapipelines (4,194 commits, active) is a mature production system. This is not a future capability — it is today's infrastructure.

  3. ML is production-grade. ml_elasticity (Andrew Crane-Droesch, 2,261 commits, DoubleML/MLflow) demonstrates what a well-structured Data Science service looks like. ML Platform (Hakim's team) provides the serving infrastructure. Together they are the existence proof that first-class ML access works at Duetto.

We are not proposing to build new infrastructure. We are proposing to use what exists, with a sharper architectural frame.


2. The Polaris Principle

Polaris has one organizing idea: domain-specific computing with a hard shell and a soft interior.

The principle is borrowed from Netflix's composable services model, adapted for Duetto's context.

Hard shell: Every domain service publishes a formal API contract. The contract specifies the interface, versioning policy, SLOs (availability, latency, error budget), and observability hooks. This shell is non-negotiable. It is the only thing consumers depend on.

Soft interior: The team owning the domain chooses its language,[^1] framework, data model, and internal structure. Nothing inside the shell is visible to consumers. The shell is the contract; the interior is an implementation detail.

This is less prescriptive than North Star on technology choices and more prescriptive on interface discipline.
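The hard-shell idea can be made concrete as data. The sketch below is illustrative only: the field names, SLO targets, and the `DomainContract`/`SLO` types are invented for this document, not an existing Duetto standard.

```python
from dataclasses import dataclass

# Hypothetical sketch of the metadata a hard-shell contract publishes.
# Field names and SLO targets are illustrative, not a Duetto standard.

@dataclass(frozen=True)
class SLO:
    availability: float        # e.g. 0.999 = three nines
    p99_latency_ms: int        # 99th-percentile latency budget
    error_budget_pct: float    # monthly error budget, in percent

@dataclass(frozen=True)
class DomainContract:
    service: str
    api_version: str           # semver; breaking changes bump major
    slo: SLO
    observability_hooks: tuple = ("otel_traces", "correlation_id")

    def is_breaking_change(self, new_version: str) -> bool:
        # A major-version bump signals a breaking interface change.
        return new_version.split(".")[0] != self.api_version.split(".")[0]

contract = DomainContract("market-domain", "2.3.1", SLO(0.999, 250, 0.1))
```

The point of encoding the contract as data is that architecture review and CI can check it mechanically: no published SLO, no production deploy.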

What Polaris Is

  • Interface-prescriptive: hard shell required, no exceptions
  • Technology-agnostic: team chooses language, framework, database
  • Event-first: streaming for cross-domain communication
  • Data lake-first: analytics and ML access via lake, not direct database queries
  • Migration-realistic: 9-12 months via Shadow Vampire, not 18-24 via strangler fig

What Polaris Is Not

  • A stack mandate (no "you must use Node.js Lambda")
  • A microservices purity play (the monolith remains until domains prove parity)
  • A rewrite (we are extracting with the existing codebase as the oracle)
  • A research project (everything proposed already exists in some form in production)

The DRE Existence Proof

DRE (dari-haskell) is already a Polaris-aligned service. The team chose Haskell. They expose a clean HTTP API. They own their MongoDB. Data flows from the Java monolith via SaveHooks → DreEnqueuer → DrePublishService. External PMS consumers (Agilysys, Opera) hit the DRE API directly and have no knowledge of the monolith.

This is the pattern. Hard shell, soft interior, owns its data, clear consumers. DRE is Polaris-aligned not because it followed a mandate — it is Polaris-aligned because good engineering instincts converged on the same principle independently. Polaris names and formalizes what DRE already demonstrates.[^2]
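The SaveHooks → DreEnqueuer → DrePublishService flow described above can be sketched in miniature. The real components are Java classes in the monolith; the Python below is a toy stand-in whose method names are invented for illustration.

```python
from collections import deque

# Toy sketch of the hook -> enqueue -> publish flow feeding DRE.
# Real components are Java; names here are illustrative only.

class DreEnqueuer:
    def __init__(self):
        self.queue = deque()

    def enqueue(self, entity_type, entity_id, payload):
        self.queue.append({"type": entity_type, "id": entity_id, "payload": payload})

class DrePublishService:
    def __init__(self, enqueuer):
        self.enqueuer = enqueuer
        self.published = []

    def drain(self):
        # Publish queued changes toward DRE; its consumers never see the monolith.
        while self.enqueuer.queue:
            self.published.append(self.enqueuer.queue.popleft())

def save_hook(enqueuer, entity_type, entity_id, payload):
    # A SaveHook fires on entity write and enqueues the change for DRE.
    enqueuer.enqueue(entity_type, entity_id, payload)

enqueuer = DreEnqueuer()
publisher = DrePublishService(enqueuer)
save_hook(enqueuer, "Rate", "r-42", {"amount": 189.0})
publisher.drain()
```

The structural property worth noticing: the monolith only pushes; DRE's consumers only pull from DRE's API. Neither side holds a synchronous dependency on the other.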


3. Four Pillars of Polaris

Pillar 1: Composable Domains

Every domain service ships with a hard shell: a published API contract that specifies the interface, versioning policy, SLOs, and observability hooks. The contract is the only dependency surface for consumers.

Inside the shell, the owning team decides everything: language, framework, internal data model, persistence choice. The shell is a promise. The interior is an implementation.

Domain data ownership is absolute. No domain shares a database with another domain. Cross-domain data access happens exclusively through the published API. This eliminates the hidden coupling that makes monolith-era systems fragile at scale.

Existing evidence:

  • DRE (dari-haskell): Haskell interior, HTTP API shell, MongoDB owned exclusively, zero shared database access
  • market-domain: clean REST API, owns its data, consumed by BIRCH via published contract
  • integrations service: 1,217 releases, mature API, independent deployment lifecycle

The pattern works. Pillar 1 makes it mandatory for all new domain services.

Pillar 2: Streaming Data Backbone

Synchronous REST calls between domains are coupling in disguise. When Domain A calls Domain B synchronously, A's reliability depends on B's availability. This is the failure mode that created the monolith we are trying to leave.

Polaris uses Kinesis as the integration primitive for cross-domain communication. Domains emit events. Consumers subscribe. There is no synchronous dependency between domains at runtime.

The data lake (S3 + Iceberg, Bronze/Silver/Gold) is the analytics source of truth. No analytics or ML workload queries domain service databases directly. Data flows to the lake; analytics and ML read from the lake.

Existing production infrastructure:

  • gcdb-stream-processor: Kinesis Anti-Corruption Layer, already running in production
  • SaveHooks: 20+ entities instrumented in the monolith, events flowing today
  • Kinesis Firehose to S3: Bronze layer pipeline, operational
  • datapipelines repo: 4,194 commits, AWS Glue + Great Expectations + Kinesis, active production system

Pillar 2 does not require building new infrastructure. It requires using existing infrastructure as the integration standard and stopping new synchronous inter-domain REST integrations.
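The decoupling Pillar 2 buys can be shown with a minimal event-bus sketch. In production the publish step would be a Kinesis `put_record` call; here an in-memory bus stands in so the key property is visible: the producing domain never calls the consuming domain, and neither knows the other exists. Event names and payloads are invented.

```python
import json
from collections import defaultdict

# In-memory stand-in for the Kinesis backbone; illustrative only.
class EventBus:
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self.subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Serialize exactly as a stream record would be, then fan out.
        record = json.dumps({"type": event_type, "payload": payload})
        for handler in self.subscribers[event_type]:
            handler(json.loads(record))

bus = EventBus()
received = []
# A consuming domain subscribes; it has no reference to the producer.
bus.subscribe("rate.updated", received.append)
# A producing domain emits; it has no reference to the consumer.
bus.publish("rate.updated", {"hotel": "h-1", "rate": 210.0})
```

If the consumer is down, the producer is unaffected — the inverse of the synchronous-REST failure mode described above.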

Pillar 3: ML and Data Science as First-Class Citizens

North Star treated ML as a downstream consumer of services. The consequence: data extraction for ML was an afterthought, and ML teams operated outside the architecture rather than within it.

Polaris inverts this. Data is extracted to the lake before new services consume it, not after. Shadow pipeline outputs (before/after monolith pairs) become ML training data automatically — the extraction mechanism and the training data are the same artifact.

The Shadow Vampire process described in Pillar 4 generates 100K+ comparison pairs per day during Phase 2. These are labeled training examples: monolith output (ground truth) and new service output (candidate). This is a training data flywheel built into the migration process itself.
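The dual-use nature of a comparison pair — validation artifact and training example at once — can be captured in a single record type. This is a hedged sketch; the field names are invented, not the schema of an existing pipeline.

```python
from dataclasses import dataclass

# Illustrative record type: one shadow comparison doubles as one
# labeled training example. Field names are hypothetical.
@dataclass
class ComparisonPair:
    inputs: dict
    monolith_output: dict   # ground truth (the oracle)
    shadow_output: dict     # candidate (the new domain service)

    @property
    def label(self) -> str:
        # Exact-match labeling; a real pipeline might use tolerances.
        return "match" if self.monolith_output == self.shadow_output else "mismatch"

pair = ComparisonPair(
    inputs={"hotel": "h-1", "date": "2026-07-04"},
    monolith_output={"rate": 189.0},
    shadow_output={"rate": 189.0},
)
```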

Two distinct teams own this space and must be treated as separate:

  • ML Platform (Hakim's team) — owns the infrastructure layer: model serving, feature stores, training infrastructure, and the shared platform that ML services run on. This is the engineering foundation.
  • Data Science (Andrew Crane-Droesch's team) — owns the research and modeling layer: experimentation frameworks, model development, and production ML services like ml_elasticity.

The ml_elasticity pattern is the architecture target for Data Science services. Andrew Crane-Droesch's work (2,261 commits, DoubleML/MLflow, rigorous experimentation framework) demonstrates what a well-structured ML service looks like in this codebase. The conventions in ml_elasticity — model registry via MLflow, experiment tracking, clear separation of training and serving — are the conventions Polaris adopts as standard for ML services. ML Platform provides the infrastructure these services run on.

Existing evidence:

  • ml_elasticity: DoubleML/MLflow, 2,261 commits, production model serving (Data Science)
  • datapipelines: Bronze/Silver/Gold lake, Great Expectations validation, production (ML Platform)
  • PaceWorkflow, ForecastWorkflow, PricingWorkflow: ML-ready pipeline patterns established

Pillar 3 does not add ML as a bolt-on. It positions the data lake and shadow pipeline as the foundation from which ML operates.

Pillar 4: Shadow Vampire Transition

The strangler fig pattern (North Star's migration approach) requires 18-24 months because it builds around the monolith incrementally. During the transition period, both systems run, integration complexity is high, and the team cannot validate correctness until traffic shifts.

Shadow Vampire is different. It runs a new domain service in parallel — receiving the same inputs, producing outputs that are compared against the monolith oracle — without shifting any traffic. The monolith remains the system of record throughout validation. Traffic shifts only after 95%+ output parity is demonstrated.

The time reduction comes from risk reduction. The 9-12 month target is achievable because the parallel validation period gives the team high confidence before any production exposure.

The thin needles are already in place. The monolith already has:

  • SaveHooks: 20+ entities emit events on write, feeding downstream services
  • gcdb-stream-processor: Kinesis Anti-Corruption Layer translating monolith events to clean domain events
  • BackfillService: historical data extraction for new services

The shadow extraction infrastructure does not need to be built from scratch. It needs to be formalized and extended to additional domains.

The rollback capability is structural. Traffic routing is the mechanism, not data migration. Rolling back means re-pointing traffic — seconds, not hours. This is the decisive advantage over strangler fig, where rollback becomes progressively more complex as the new service accumulates state.
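The shadow validation loop itself is simple enough to sketch: feed identical inputs to the monolith oracle and the shadow candidate, and measure parity before any traffic moves. The pricing functions below are stand-ins invented for illustration.

```python
# Stand-in "monolith oracle" and "shadow candidate"; both are
# hypothetical pricing functions for illustration only.
def monolith_price(occupancy: float) -> float:
    return round(100 + 200 * occupancy, 2)

def shadow_price(occupancy: float) -> float:
    # The candidate implementation under shadow validation.
    return round(100 + 200 * occupancy, 2)

def parity_rate(inputs, oracle, candidate) -> float:
    # Fraction of inputs where the shadow matches the oracle exactly.
    matches = sum(1 for x in inputs if oracle(x) == candidate(x))
    return matches / len(inputs)

samples = [i / 100 for i in range(100)]
rate = parity_rate(samples, monolith_price, shadow_price)
# Traffic may shift only once parity clears the 95% bar.
ready_to_shift = rate >= 0.95
```

Every mismatch surfaced by this loop is diagnostic signal: either a shadow bug or a latent monolith defect, both worth finding before production exposure.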


4. Polaris vs. North Star

The following table states what changes and what is preserved.

| Dimension | North Star (2024) | Polaris (2026) |
|---|---|---|
| Migration approach | Strangler Fig (18-24 months) | Shadow Vampire (9-12 months) |
| Frontend pattern | Microservice-based decomposition (one service per frontend surface) | BLAST (new in 2026; Vercel + Next.js + Clerk + Neon; composable, platform-agnostic) |
| Cross-domain integration | Synchronous REST/GraphQL | Event-driven via Kinesis (streaming first) |
| Internal technology | Node.js Lambda (prescriptive) | Team choice (hard shell required, interior free) |
| ML and Data Science | Secondary, afterthought | First-class; data lake before service extraction |
| Observability | Retrofit after extraction | Built in from day 1; OpenTelemetry required |
| Domain models | Anemic (CRUD wrappers) | Rich (behavior + data, not just persistence) |
| API contracts | Informal, ad hoc | Published, versioned, SLO-backed |
| Rollback mechanism | Complex (state in new service) | Traffic routing only (instant reversal) |
| Data ownership | Implicit | Explicit; no shared databases across domain boundaries |

What Polaris preserves from North Star:

  • Domain-driven decomposition as the organizational principle
  • Pragmatic incremental migration — no big-bang rewrite
  • The domain boundaries already established (market-domain, integrations, intelligence-domain)

What Polaris introduces that North Star did not have:

  • BLAST stack (Vercel + Next.js + Clerk + Neon) — composable, platform-agnostic frontend pattern; North Star would have decomposed frontends to microservices
  • Shadow Vampire transition (replaces Strangler Fig)
  • Event-driven streaming backbone as the integration standard
  • First-class ML and data science access via data lake

North Star's strategic direction was sound. Polaris does not repudiate that work. It addresses the operational gaps that North Star left open.


5. The Path to Polaris: 12-Month Roadmap

Phase 1 — Months 1-6: Shadow Build

The goal of Phase 1 is to have shadow extraction infrastructure running for the first target domains, with the data lake receiving real production data and the first domain services under shadow validation.

Infrastructure work:

  • Formalize ETL tap framework using existing gcdb-stream-processor and SaveHooks as the pattern
  • Extend SaveHooks coverage to target domain entities not yet instrumented
  • Validate Bronze/Silver/Gold lake completeness for target domains; close data gaps

Domain selection:

  • Identify first 2-3 domains for Polaris treatment
  • Candidates: market-domain (already partially extracted), integrations (mature API, good candidate for hard-shell formalization), rate-management (high business value, clear domain boundary)
  • Selection criteria: clear domain boundary, existing event instrumentation, business value justification

Observability:

  • Instrument ALL new services with OpenTelemetry from day 1
  • This is not retrofit work — it is a requirement for any service that enters development
  • Define correlation ID standard; enforce via PR review checklist

Frontend:

  • BLAST apps (Vision360, Lighthouse, Tour Operator) demonstrate the composable frontend pattern in production
  • Document the BLAST pattern as the standard for new frontend surfaces
  • Establish VPC peering standards for secure data access from BLAST apps

Output of Phase 1: Shadow pipeline receiving production events for target domains; data lake populated with Bronze/Silver/Gold layers for those domains; first domain services running in shadow mode.

Phase 2 — Months 7-9: Parity Validation

The goal of Phase 2 is to prove that shadow services produce outputs matching the monolith oracle at the required confidence level before any traffic is shifted.

Shadow testing:

  • Run 100K+ monolith vs. shadow comparisons per day for each target domain
  • Output parity target: 95% before proceeding to Phase 3
  • Discrepancy analysis: every mismatch is a bug in the shadow service or a discovery that the monolith behavior was incorrect (both are valuable)

ML training data:

  • Shadow comparison pairs (monolith output, shadow output, inputs) are captured as labeled training examples
  • This is automatic — the validation framework is the training data pipeline
  • ML teams operate on this data in parallel; no special extraction work required

API contracts:

  • Publish domain API contracts with versioning and SLOs during Phase 2
  • Contracts are tested against shadow service; consumers begin integrating against published contract
  • SLOs set realistic targets based on shadow performance data

Output of Phase 2: 95%+ output parity demonstrated for target domains; published API contracts; ML training data available; consumer integration testing complete.

Phase 3 — Months 10-12: Vampire Drain

The goal of Phase 3 is to transfer production traffic from the monolith to validated Polaris domain services, with the monolith remaining available as instant rollback.

Traffic routing progression:

  • 10% → 25% → 50% → 100% over four weeks per domain
  • Each step requires sustained parity at the previous level before proceeding
  • Rollback at any step: re-point traffic to monolith (no data migration required)
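The percentage-based routing above can be sketched with a stable per-key hash, so a given customer consistently lands on the same backend at any fixed percentage. This is an illustrative sketch, not the routing layer Duetto uses; `crc32` stands in for whatever consistent-hashing mechanism the real load balancer provides.

```python
import zlib

def route(key: str, shadow_pct: int) -> str:
    # crc32 is deterministic across processes, so the same key always
    # falls in the same bucket; raising shadow_pct moves whole buckets.
    bucket = zlib.crc32(key.encode()) % 100
    return "domain-service" if bucket < shadow_pct else "monolith"

# Rollback is instant: set the percentage back to zero. No data
# migration is involved, only routing.
all_monolith = all(route(f"hotel-{i}", 0) == "monolith" for i in range(50))
all_shadow = all(route(f"hotel-{i}", 100) == "domain-service" for i in range(50))
```

The key-stable bucketing matters operationally: a customer does not flap between backends within a step, so sustained-parity checks observe consistent traffic.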

Monolith retirement:

  • As each domain assumes ownership of its traffic, the corresponding monolith module is deactivated — not deleted
  • The monolith remains available as a dormant safety net
  • Module deactivation follows validation that the domain service has handled full traffic without incident for two weeks

Output of Phase 3: First 2-3 domains fully operational as Polaris services; monolith modules dormant for those domains; validated playbook for subsequent domain extractions.


6. What We Are Asking of SELT

This section is the actionable core of this document. The architectural direction described above requires six specific commitments from SELT members. These are not aspirational guidelines — they are engineering disciplines that must be enforced in every team's work from this point forward.

1. Hard-shell discipline on all new services Every new service ships with a published API contract: interface specification, versioning policy, SLOs, and observability hooks. No service enters production without a formal contract. The contract is the dependency surface. Nothing behind the contract is visible to consumers.

Owner: All SELT members, enforced in architecture review.

2. Observability is non-negotiable OpenTelemetry instrumentation and correlation IDs are required on all new services. This is a merge gate, not a guideline. Distributed tracing is not added later as a retrofit project — it is present from the first commit. The observability gap that scored 4/10 in North Star does not transfer to Polaris services.

Owner: Chris Mountford (Infrastructure), enforced in CI/CD pipeline.

3. Event-first for new cross-domain integrations No new synchronous REST calls are created between domains. New cross-domain integrations use Kinesis events. This applies to net-new integrations only — existing synchronous calls are not immediately replaced but are flagged for migration. This discipline stops the accumulation of new tight coupling.

Owner: Shiv Yadav (Architecture), enforced in design review.

4. One domain per quarter for Shadow Vampire treatment Each SELT leader nominates one domain under their ownership for Shadow Vampire extraction over the next four quarters. The nomination includes: domain boundary definition, extraction justification, and owner commitment. This is the mechanism that converts roadmap intent into domain-by-domain progress.

Owner: All SELT members; coordination through Architecture Review Board.

5. Formally retire North Star as the guiding document North Star served its purpose. Polaris supersedes it. SELT endorses this transition by treating this document as the architectural reference going forward. The North Star document is preserved in git history and archived at projects/architecture/north-star/ — it is not deleted, but it is no longer the current standard.

Owner: Robert Matsuoka; requires SELT acknowledgment.

6. Rich domain models — no new CRUD wrappers Domain services must encapsulate actual business behavior, not proxy database operations with a REST facade. If a proposed service is reducible to "create, read, update, delete on table X," it is not a domain service — it is a database with an HTTP interface. The DRE pattern is the reference: business behavior lives inside the hard shell alongside the data.

Owner: Antonio Cortes (Core Platform) and Shiv Yadav (Architecture), enforced in design review.
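The CRUD-wrapper distinction in commitment 6 is easiest to see side by side with behavior. The rate rule below is invented purely for illustration; the point is structural: the service is a domain service because `recommend()` encodes business policy, not because it persists a table.

```python
from dataclasses import dataclass

# Hypothetical domain model with behavior; the pricing rule and the
# floor-clamping policy are invented for illustration.
@dataclass
class RatePlan:
    base_rate: float
    floor: float

    def recommend(self, occupancy: float) -> float:
        # Business behavior lives with the data: a demand-scaled rate,
        # never below the contractual floor.
        candidate = self.base_rate * (0.8 + 0.4 * occupancy)
        return round(max(candidate, self.floor), 2)

# A CRUD wrapper would stop at get_rate_plan()/save_rate_plan() and push
# this logic back onto every caller; recommend() is what makes it a
# domain service rather than a database with an HTTP interface.
plan = RatePlan(base_rate=200.0, floor=180.0)
low = plan.recommend(0.1)    # demand-scaled, then clamped to the floor
high = plan.recommend(0.9)
```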



Appendix: Architecture Glossary

| Term | Definition |
|---|---|
| Hard shell | The published API contract of a domain service: interface, versioning, SLOs, observability hooks |
| Soft interior | Everything inside the shell: language, framework, data model, internal structure — team's choice |
| Shadow Vampire | Migration pattern: run new service in parallel against same inputs; compare to monolith oracle; shift traffic only after parity proven |
| Strangler Fig | Martin Fowler's migration pattern: gradually replace monolith components by routing traffic to new services incrementally (North Star's approach; 18-24 months) |
| Thin needles | Existing production instrumentation enabling Shadow Vampire: SaveHooks, gcdb-stream-processor, BackfillService |
| Bronze/Silver/Gold | Data lake tiers: Bronze (raw ingestion), Silver (cleaned/normalized), Gold (analytics-ready) |
| ACL | Anti-Corruption Layer: gcdb-stream-processor translates monolith events to clean domain events |
| BLAST | Vercel + Next.js + Clerk + Neon (Postgres); composable frontend architecture |
| DRE | dari-haskell; Haskell domain service with clean HTTP API, owns its MongoDB; canonical Polaris example |
| ml_elasticity | DoubleML/MLflow ML service (Andrew Crane-Droesch, 2,261 commits); canonical Polaris ML example |
| Output parity | Shadow service produces outputs matching monolith oracle; 95% target before traffic shift |

Appendix: Key Production Evidence

The following production systems are cited throughout this document as validation that Polaris components already exist:

| System | Evidence | Pillar |
|---|---|---|
| DRE (dari-haskell) | Haskell service, clean HTTP API, owns MongoDB, PMS consumers | Pillar 1 |
| gcdb-stream-processor | Kinesis ACL, Bronze layer feed, production | Pillar 2 |
| SaveHooks | 20+ entities instrumented, event emission | Pillars 2, 4 |
| Kinesis Firehose to S3 | Bronze layer pipeline, operational | Pillar 2 |
| datapipelines | AWS Glue + Great Expectations, 4,194 commits, active | Pillars 2, 3 |
| ml_elasticity | DoubleML/MLflow, 2,261 commits, model serving | Pillar 3 |
| BackfillService | Historical extraction for new services | Pillar 4 |
| BLAST apps | Vision360, Lighthouse, Tour Operator, VPC peering | Pillar 1 |



Document status: DRAFT. All content subject to revision. Not for external distribution. Last updated: 2026-02-28

[^1]: Within reason. Anything other than Java, TypeScript, Python, or React requires a discussion with justification and ROI; in general, technology choices are part of the proposal/review process.

[^2]: Having said that, the use of Haskell does create resourcing and knowledge-transfer/bus-factor issues. We are not looking to fix what works, but the future toolchain of DRE should be evaluated.