Quality Engineering Strategy: Best Practices for Duetto
Author: Antonio Cortés
Date: 2026-03-04
Status: DRAFT
Audience: Engineering Leadership, Engineering Managers, Tech Leads
Applies to: All engineering teams — App, Intelligence, and Platform areas
Related Team Charters:
- Quality Guild Charter (TC-006)
- Quality Engineering Team Charter (TC-007)
1. Executive Summary
This document proposes a quality engineering strategy for Duetto, informed by industry best practices from organizations like Google, Spotify, Atlassian, GitLab, Stripe, and Netflix. It addresses Duetto's current challenges — manual QA, inconsistent ways of working across teams, and the unique demands of AI-first engineering — and proposes a path forward.
Current State
| Dimension | Current Reality |
|---|---|
| QA model | Manual QA + some Automation Engineers, not embedded in teams |
| Ways of working | Inconsistent across tens of teams — no shared testing standards |
| Test automation | Some automation exists but coverage, tools, and practices vary by team |
| E2E testing | Both Playwright and Cypress in use — no consolidation strategy |
| AI-generated code | 50-70% of code is AI-generated, but QA practices haven't adapted |
| Architecture transition | Active monolith-to-microservices migration creates new testing challenges at service boundaries |
| Intelligence / ML testing | Minimal test coverage in ML repos (~100 tests across 15+ repos); existing strengths in MLflow, MyPy strict mode, Great Expectations, and pre-commit hooks — but no shared ML testing standards |
Proposed Target State
| Dimension | Target |
|---|---|
| QA model | Hybrid/Guild with two tracks: App/Platform QEs + Intelligence (ML/Data) QE + central Quality Engineering team |
| Ways of working | Shared testing standards with track-specific adaptations, enforced through CI/CD quality gates and Quality Guild |
| Test automation | Developer-owned testing with QE coaching; testing honeycomb for App/Platform, testing diamond for Intelligence (data quality, pipeline tests, model validation) |
| E2E testing | Consolidated on Playwright (App/Platform); golden file + data quality testing (Intelligence) |
| AI-generated code | Specific QA strategies for AI-generated code; mutation testing; AI-aware code review |
| Architecture transition | Contract testing (Pact) at service boundaries; parallel run testing for high-risk extractions |
2. QA Organizational Model
2.1 Why the Hybrid/Guild Model
Three QA models exist in the industry. For Duetto's context (tens of teams, lean core, monolith-to-microservices migration), the Hybrid/Guild model is the strongest fit.
| Model | How it works | Pros | Cons | Used by |
|---|---|---|---|---|
| Centralized | Separate QA team serves all product teams | Consistent standards, clear career ladder | Bottleneck, "throw it over the wall" mentality, slow feedback | Legacy enterprises |
| Embedded | QA engineers sit within product teams | Deep domain context, fast feedback, team ownership | Isolated QA, inconsistent practices, no shared infrastructure | Spotify (evolved) |
| Hybrid/Guild | Embedded in teams + central enablement guild | Best of both: context + consistency, shared infrastructure, career growth | Requires guild leadership, matrix reporting complexity | GitLab, Stripe, Atlassian |
Why not centralized: With tens of teams, a centralized QA team would be a severe bottleneck. It reinforces the "QA tests my code" anti-pattern that doesn't scale.
Why not pure embedded: Without a guild, tens of teams would each invent their own practices. QA engineers would be isolated with no career path and no shared tooling.
Why hybrid/guild: Embedded QE gets domain context for microservices testing. The central guild provides consistency, shared infrastructure, career development, and drives org-wide quality initiatives — critical during the monolith migration and AI-first transformation.
2.2 Proposed Structure for Duetto
Duetto's engineering organization spans two distinct technology domains: App/Platform (Java/Spring Boot + React/TypeScript) and Intelligence (Python ML/data pipelines). The guild model must serve both, with shared governance but domain-appropriate practices.
Phasing note: The Intelligence track is designed to be added as a second phase of the Quality Guild. Phase 1 focuses on establishing the guild with the App/Platform track (highest team count, largest test debt, Selenium/Cypress migration). Once the guild is operational and the App/Platform practices are stable (approximately months 3-6), the Intelligence track is introduced with its own embedded QE(s) and domain-specific practices. This phased approach avoids overloading the guild at inception and allows the Intelligence track to benefit from the patterns and infrastructure already established by the App/Platform track.
┌──────────────────────────────────────────────────────────────────────┐
│ Quality Guild │
│ (All QEs + interested engineers meet bi-weekly) │
│ (Guild lead coordinates standards, hiring, career growth) │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ App/Platform Track │ │ Intelligence Track │ │
│ │ │ │ (ML / Data) │ │
│ │ Embedded QEs │ │ Embedded QE(s) │ │
│ │ (3-5 across │ │ (1-2 across │ │
│ │ App + Platform │ │ Pricing, Forecast, │ │
│ │ teams) │ │ Elasticity, Data) │ │
│ │ │ │ │ │
│ │ • Test strategy │ │ • Data quality │ │
│ │ • Dev coaching │ │ • Pipeline testing │ │
│ │ • Exploratory testing│ │ • Model validation │ │
│ │ • Quality metrics │ │ • ML test coaching │ │
│ │ • Playwright E2E │ │ • Great Expectations │ │
│ │ • Pact contracts │ │ • Golden file tests │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Quality Engineering Team (2-3 people) │ │
│ │ Led by Staff/Lead QE │ │
│ │ │ │
│ │ Shared across both tracks: │ │
│ │ • CI/CD quality gates (GitHub Actions templates) │ │
│ │ • Flaky test detection & remediation systems │ │
│ │ • Quality dashboards (DataDog — per team, per track) │ │
│ │ • AI code testing tools (CodeRabbit, Augment Code config) │ │
│ │ │ │
│ │ App/Platform-specific: Intelligence-specific: │ │
│ │ • Testcontainers configs • Great Expectations infra │ │
│ │ • Playwright infra • MLflow validation gates │ │
│ │ • Pact broker management • Data drift monitoring │ │
│ │ • Test data factories • Pipeline test frameworks │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
App/Platform Track
Embedded Quality Engineers (3-5 people):
- 1 QE serves 2-4 App or Platform teams — lean, coaching-focused
- Embedded in team ceremonies (standups, planning, retros)
- Focuses on test strategy, exploratory testing, and developer coaching
- Expertise: Java/Spring Boot testing, React/TypeScript testing, Playwright E2E, Pact contracts
- Does NOT do all the testing — developers own their tests
Intelligence Track (ML / Data) — Phase 2
This track is proposed as a second phase, introduced once the Quality Guild is established with the App/Platform track (see phasing note above). During Phase 1, Intelligence teams benefit from shared guild standards (Ruff, pre-commit, coverage visibility) and Quality Engineering infrastructure, but the dedicated Intelligence QE(s) and ML-specific practices are added in Phase 2.
Embedded Quality Engineer(s) (1-2 people):
- Serves Pricing, Forecasting, Elasticity, Anomaly Detection, and Data Pipeline teams
- Different skill set than App/Platform QEs — requires Python, data engineering, and ML pipeline experience
- Focuses on data quality strategy (Great Expectations), pipeline testing, model validation gates, and golden file testing
- Coaches ML engineers on testing the testable parts — utility functions, transformations, API layers — rather than forcing the honeycomb onto ML training code
- Does NOT validate model accuracy (that's the ML engineer's domain) — instead ensures the infrastructure for validation exists (MLflow gates, champion/challenger pipelines, drift detection)
Why Intelligence needs its own QE track:
- The technology stack (Python, LightGBM, Airflow, MLflow) shares almost nothing with the App stack (Java, React, Playwright)
- Quality risks are different: data drift and model degradation vs. UI bugs and API contract violations
- Testing tools are different: Great Expectations and golden file testing vs. Playwright and Pact
- A QE trained in Playwright and Pact cannot coach an ML engineer on data validation or pipeline testing — and vice versa
- However, both tracks share governance, CI/CD infrastructure, and quality culture — hence one guild, two tracks
Quality Engineering Team (2-3 people, led by Staff/Lead QE):
The team is led by the Staff/Lead QE (L6) — the most senior technical IC in the guild. This person:
- Sets the technical vision for test automation across the organization — framework choices, infrastructure design, migration strategies (e.g., Selenium-to-Playwright)
- Designs reusable CI/CD quality gate architecture — GitHub Actions templates, quality gate tiers, pipeline optimization
- Makes tooling decisions — evaluates and selects tools (Playwright over Cypress, Pact for contracts, Great Expectations for data quality), defines integration patterns
- Acts as the technical counterpart to the Guild Lead — the Guild Lead owns people and governance, the Staff/Lead QE owns technical strategy and infrastructure
- Mentors Quality Engineering team members and embedded QEs on automation best practices
The team as a whole:
- Builds and maintains shared test infrastructure across both tracks
- Owns CI/CD quality gates and pipeline optimization
- Manages flaky test detection, remediation systems, and quality dashboards
- Develops AI-specific testing strategies and tooling (CodeRabbit, Augment Code)
- Intelligence-specific: maintains Great Expectations infrastructure, MLflow validation gates, data drift monitoring
- App/Platform-specific: maintains Testcontainers configs, Playwright infrastructure, Pact broker
Quality Guild:
- All QEs from both tracks + interested developers and ML engineers meet bi-weekly
- Guild lead (can be an Engineering Manager) coordinates standards, hiring, and career growth
- Two-track agenda: shared topics (CI/CD, AI code quality, metrics) + rotating track-specific deep dives
- Cross-pollination between tracks: App engineers learn about data quality; ML engineers learn about contract testing
- Shared playbooks where practices overlap (e.g., pytest best practices, pre-commit hooks, coverage reporting)
2.3 Ratios and Sizing
| Role | Track | Count | Ratio | Notes |
|---|---|---|---|---|
| Embedded QE (App/Platform) | App/Platform | 3-5 | 1 QE : 2-4 teams | Start with 2 in pilot, scale based on results |
| Embedded QE (ML/Data) | Intelligence | 1-2 | 1 QE : 3-5 teams | Phase 2. Requires Python + ML pipeline experience; start with 1 |
| Quality Engineering | Shared | 2-3 | Central team | Led by Staff/Lead QE + 1-2 Engineers |
| Guild Lead | Shared | 1 | Part of EM role | Coordinates both tracks, not a full-time role |
| Total QE headcount | Both | 7-11 | ~1 QE : 12-19 devs | Lean — quality is a shared responsibility |
Hiring priority and phasing: Start with App/Platform QEs in Phase 1 (largest surface area, most teams, highest test debt in Selenium/Cypress migration). The Intelligence track QE is hired in Phase 2 once the guild is operational — this person needs a rare blend of testing expertise and data/ML familiarity, so allow longer time to hire. Intelligence teams still participate in the guild from Phase 1 and benefit from shared standards and infrastructure.
This is significantly leaner than traditional QA models (1:3-5 ratios) because developers own testing. QE enables and elevates rather than replaces.
3. Role Definitions
3.1 The Quality Engineer Role (Proposed Primary Role)
The industry is converging on the Quality Engineer as the dominant quality role in modern SaaS. It replaces the traditional QA Engineer role, emphasizing prevention over detection and coaching over gatekeeping.
Philosophy shift:
- FROM: "QA finds bugs after development" (detective work)
- TO: "QE prevents bugs by improving the system" (preventive work)
What a Quality Engineer's week looks like at Duetto:
| Day | Activities |
|---|---|
| Monday | Sprint planning — reviewing stories for testability, suggesting acceptance criteria |
| Tuesday | Pair programming with developer on test strategy for new microservice |
| Wednesday | Analyzing test metrics dashboard, identifying flaky tests, reviewing CI pipeline health |
| Thursday | Exploratory testing on high-risk feature, writing up findings and risk assessment |
| Friday | Quality Guild meeting — sharing testing patterns for event-driven architecture |
3.2 Role Comparison
| Dimension | QA Engineer (traditional) | Test Automation Eng. | SDET | Quality Engineer (proposed) |
|---|---|---|---|---|
| Primary focus | Test execution & defect finding | Automated test creation | Test tooling & infrastructure | Quality strategy & enablement |
| Coding level | Light (scripting) | Moderate (test code) | Heavy (production-grade) | Moderate (tooling + automation) |
| Production code | Rarely | Never | Frequently | Sometimes |
| Manual testing | Significant | Minimal | Rare | Strategic/exploratory only |
| Developer coaching | Minimal | Some | Some (on testability) | Primary responsibility |
| Quality strategy | Limited | No | Some | Primary responsibility |
| Typical dev ratio | 1:3-5 | 1:5-8 | 1:8-15 | 1:10-20 |
3.3 Why Not SDET?
The SDET role (Software Development Engineer in Test) was pioneered by Microsoft in the early 2000s at a 1:1 SDE-to-SDET ratio. Microsoft eliminated the title in 2014, merging all SDETs into SDEs under a "Combined Engineering" model. Google similarly moved away from dedicated SETs (Software Engineers in Test) around 2016-2018.
Why it declined:
- The separate role created a "someone else will test my code" mentality
- SDETs often became test-only engineers despite being hired as software engineers
- The 1:1 ratio was expensive and unsustainable
- Modern CI/CD made developer-owned testing more practical
What replaced it: Quality Engineers (coaching model) + Platform/Infrastructure Engineers (shared tooling) + developer-owned testing. This is the model we propose for Duetto.
3.4 Engineer Testing Responsibilities
In the hybrid/guild model, developers own testing — QEs coach and enable, but do not replace engineer responsibility. Every engineer is expected to contribute across the testing honeycomb.
Unit Tests (40% of effort — owned by engineers):
- Write unit tests for all new business logic, domain models, and utility functions
- Maintain test coverage for code they modify — no PR without corresponding tests
- Use JUnit 5 + Mockito (Java) or Jest + React Testing Library (frontend)
- Focus on behavior verification, not implementation testing: test what the code does, not how it does it
- AI-generated code requires the same (or higher) testing standard — engineers must verify AI-generated tests are meaningful, not tautological
- Target: every function with branching logic or business rules has a corresponding test
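To make the "meaningful, not tautological" distinction concrete, here is a minimal sketch in Python. The `apply_seasonal_discount` function and its threshold are hypothetical, invented only for illustration; the point is the difference between a test that mirrors the implementation and one that pins the business rule to independent expected values.

```python
# Hypothetical pricing helper, used only to illustrate the testing guidance above.
def apply_seasonal_discount(rate: float, occupancy: float) -> float:
    """Discount the rate by 10% when forecast occupancy is below 40%."""
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be between 0 and 1")
    return round(rate * 0.9, 2) if occupancy < 0.4 else rate

# Tautological: re-derives the expectation from the same logic as the code,
# so it passes even if the business rule itself is wrong.
def test_discount_tautological():
    rate, occupancy = 200.0, 0.3
    expected = round(rate * 0.9, 2) if occupancy < 0.4 else rate  # mirrors the code
    assert apply_seasonal_discount(rate, occupancy) == expected

# Behavioral: pins the rule to concrete expected values, including the boundary
# and the invalid-input case.
def test_discount_behavior():
    assert apply_seasonal_discount(200.0, 0.3) == 180.0   # below threshold: discounted
    assert apply_seasonal_discount(200.0, 0.4) == 200.0   # boundary: no discount
    try:
        apply_seasonal_discount(200.0, 1.5)
        assert False, "expected ValueError for invalid occupancy"
    except ValueError:
        pass
```

Both tests pass today, but only the behavioral one would catch a wrong threshold or a wrong discount factor — the kind of check engineers should apply when reviewing AI-generated tests.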
Integration Tests (50% of effort — engineers with QE guidance):
- Write integration tests for service-to-service communication, database queries, message handling, and GraphQL resolvers
- Use Testcontainers for real infrastructure (MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ) instead of mocks
- Write Pact consumer contracts when consuming another team's API or events
- Test the full request-response cycle through your service, not just internal methods
- QEs help define the integration test strategy and identify critical boundaries; engineers execute
- For event-driven services: test the complete publish → queue → consume → side-effect path
E2E Tests (10% of effort — engineers + QE collaboration):
- Write Playwright E2E tests for critical user journeys that touch your team's features
- Focus on happy paths and high-risk scenarios — E2E tests are expensive to maintain
- Follow the Page Object pattern for maintainable test code
- QEs define which journeys need E2E coverage and review test design; engineers implement
- Run E2E tests locally before pushing — don't rely solely on CI
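The Page Object pattern referenced above can be sketched as follows. To keep the example self-contained, `FakePage` stands in for a real Playwright `page` object; the `LoginPage` class and its selectors are hypothetical, not taken from any Duetto suite. The structural idea is what matters: tests talk to a page object's methods, never to raw selectors.

```python
# Stand-in for a Playwright page; records actions instead of driving a browser.
class FakePage:
    def __init__(self):
        self.actions = []
    def goto(self, url): self.actions.append(("goto", url))
    def fill(self, selector, value): self.actions.append(("fill", selector, value))
    def click(self, selector): self.actions.append(("click", selector))

class LoginPage:
    """Page object: encapsulates selectors and interactions for one screen."""
    URL = "/login"
    def __init__(self, page):
        self.page = page
    def login(self, user, password):
        self.page.goto(self.URL)
        self.page.fill("#username", user)
        self.page.fill("#password", password)
        self.page.click("button[type=submit]")

# A test depends only on the page object's API; if a selector changes,
# only LoginPage changes, not every test that logs in.
page = FakePage()
LoginPage(page).login("qe@example.com", "secret")
assert ("click", "button[type=submit]") in page.actions
```

With real Playwright, `FakePage` is replaced by the `page` fixture and the page object code stays the same, which is the maintainability payoff of the pattern.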
General expectations for all engineers:
| Responsibility | Expectation |
|---|---|
| Test with every PR | No production code change merges without corresponding tests |
| Fix broken tests | If your change breaks a test, you fix it — not the QE |
| Flaky test ownership | If you wrote a test that becomes flaky, you own remediation (within SLA) |
| Review test quality | In code reviews, evaluate test coverage and quality, not just production code |
| Test data management | Use shared factories and fixtures; don't hardcode test data |
| AI-generated test review | Critically evaluate AI-generated tests: check for tautological assertions, missing edge cases, and implementation coupling |
3.5 Career Ladder for Quality Engineers
| Level | Title | Focus |
|---|---|---|
| L1-L2 | Junior QE | Learns testing fundamentals, assists with test strategy, basic automation |
| L3-L4 | Quality Engineer | Defines test strategy for a team, coaches developers, builds automation |
| L5 | Senior QE | Test strategy across multiple teams, drives quality initiatives, analyzes metrics |
| L6 | Staff/Lead QE | Org-wide quality strategy, defines quality engineering practices, influences engineering culture. Leads the Quality Engineering Team — owns technical vision for automation infrastructure, framework decisions, and CI/CD quality gate design |
| L7 | Principal QE / Head of Quality | Sets quality vision for the organization, industry thought leadership |
4. Testing Strategy
4.1 The Testing Honeycomb (Not the Pyramid)
The traditional testing pyramid (many unit tests, fewer integration, fewer E2E) was designed for monoliths. For Duetto's microservices migration, the testing honeycomb (Spotify model) is more appropriate:
Why the pyramid doesn't work for microservices: The majority of bugs in microservices occur at service boundaries — serialization mismatches, API contract violations, message schema drift, network timeout handling. Unit tests within a single service catch fewer of these real-world failures.
The honeycomb model:
| Layer | % of effort | What to test | Tools |
|---|---|---|---|
| E2E / UI tests | ~10% | Critical user journeys through the full stack | Playwright |
| Integration tests | ~50% | Service-to-service communication, database interactions, message handling, GraphQL resolvers | Testcontainers, Spring Boot Test, Pact |
| Unit tests | ~40% | Complex business logic (pricing algorithms, revenue calculations) | JUnit 5, Jest, React Testing Library |
Key insight: In microservices, integration tests are the most valuable. Internal logic within a well-designed microservice is often simple; the complexity lives in the interactions.
4.2 Testing by Architecture Layer
Backend (Java / Spring Boot)
| Test Type | What | Tools | Who Writes |
|---|---|---|---|
| Unit tests | Business logic, domain models, utilities | JUnit 5, Mockito | Developers |
| Integration tests | Repository operations, service interactions, message handlers | Testcontainers (MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ), Spring Boot Test | Developers + QE guidance |
| API tests | REST/GraphQL endpoints, request/response validation | MockMvc, Spring Boot Test, Apollo subgraph testing | Developers |
| Contract tests | Service boundary contracts | Pact (pact-jvm) | Developers + QE |
| Performance tests | Load, latency, throughput | k6 (with DataDog integration) | QE / Quality Engineering |
Frontend (React / TypeScript / Next.js)
| Test Type | What | Tools | Who Writes |
|---|---|---|---|
| Unit tests | Component logic, hooks, utilities, state management | Jest, React Testing Library | Developers |
| Component tests | Visual + behavioral component validation | Storybook, React Testing Library | Developers |
| Integration tests | Apollo Client queries, form flows, multi-component interactions | Jest + MockedProvider, Playwright | Developers |
| E2E tests | Critical user journeys, cross-page flows | Playwright | Developers + QE |
| Visual regression | UI consistency, design system compliance | Playwright built-in screenshots | QE / Quality Engineering |
| Accessibility | WCAG 2.1 AA compliance | axe-core, Playwright accessibility assertions | Developers + QE |
GraphQL (Apollo Federation)
| Test Type | What | Tools |
|---|---|---|
| Schema validation | Breaking change detection | Apollo rover subgraph check in CI |
| Resolver unit tests | Individual resolver logic | DGS test utilities / graphql-java, Jest + MockedProvider |
| Composition tests | Cross-subgraph query resolution | Apollo Router in test mode |
| Contract tests | Consumer-provider contracts for subgraphs | Pact (supports GraphQL) |
| Operation tests | Real query regression against test env | Apollo Studio operation checks |
Event-Driven Architecture (SQS, SNS, Kinesis, RabbitMQ)
| Test Type | What | Tools |
|---|---|---|
| Handler unit tests | Message processing logic, idempotency | JUnit 5, Jest |
| Integration tests | End-to-end message flow (publish → queue → consume → side effects) | Testcontainers (LocalStack, RabbitMQ) |
| Contract tests | Message schema compatibility | Pact message contracts |
| Ordering/failure tests | Out-of-order delivery, DLQ routing, batch failures | Testcontainers + custom scenarios |
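The idempotency requirement in the "Handler unit tests" row is worth one concrete sketch: with at-least-once delivery (SQS, RabbitMQ), the same message can arrive twice, and the handler must apply its side effect exactly once. The `PaymentHandler` class and message shape below are hypothetical, and the in-memory set stands in for a durable deduplication store.

```python
# Sketch of an idempotent message handler and the unit test that pins it down.
class PaymentHandler:
    def __init__(self):
        self.processed_ids = set()   # in production: a durable store, not memory
        self.applied = []            # recorded side effects, for the test
    def handle(self, message: dict) -> None:
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return                   # duplicate delivery: ignore silently
        self.processed_ids.add(msg_id)
        self.applied.append(message["amount"])

handler = PaymentHandler()
msg = {"id": "evt-1", "amount": 100}
handler.handle(msg)
handler.handle(msg)               # simulate at-least-once redelivery
assert handler.applied == [100]   # side effect applied exactly once
```

The same test shape extends to the Testcontainers-based integration layer: publish the same message twice through a real queue and assert the downstream side effect happened once.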
4.3 Contract Testing with Pact
Contract testing is critical during the monolith-to-microservices migration. It verifies that services agree on their API contracts without needing full integration environments.
How Pact works:
- Consumer side: Write tests describing expected interactions → Pact generates a contract JSON
- Pact Broker: Contracts are published to a central broker
- Provider side: Provider runs verification tests against the contract
- Can-I-Deploy: Before deploying, `pact-broker can-i-deploy` checks that all contracts are verified
When to introduce Pact at Duetto:
- Start writing contracts for the modules you plan to extract from the monolith next
- The monolith acts as "consumer" of the new service; React/Next.js frontends are also consumers
- Use Pact message contracts for SQS/SNS/Kinesis event schemas
- Pact supports GraphQL interactions natively
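The consumer/provider flow above can be illustrated with a deliberately simplified sketch. This is the shape of what Pact automates, not the pact-jvm or pact-python API; the service names, the request, and the response schema are all invented for illustration.

```python
# Conceptual sketch of consumer-driven contract testing.
# Step 1 (consumer side): the consumer's test produces a contract describing
# the interaction it depends on.
consumer_contract = {
    "consumer": "monolith",
    "provider": "rate-service",
    "interaction": {
        "request": {"method": "GET", "path": "/rates/hotel-42"},
        "response_schema": {"hotel_id": str, "rate": float, "currency": str},
    },
}

def provider_handles(request):
    """Stand-in for the real provider; Pact verification hits the live provider."""
    return {"hotel_id": "hotel-42", "rate": 189.0, "currency": "USD"}

def verify(contract):
    """Step 2 (provider side): replay the request, check the response shape."""
    interaction = contract["interaction"]
    response = provider_handles(interaction["request"])
    schema = interaction["response_schema"]
    return set(response) == set(schema) and all(
        isinstance(response[k], t) for k, t in schema.items()
    )

# Step 3: only if every contract verifies is the provider safe to deploy,
# which is what `pact-broker can-i-deploy` gates on.
assert verify(consumer_contract)
```

In real Pact the contract JSON is generated from the consumer's tests and published to the broker, and verification runs against the actual provider, but the agreement being checked is exactly this: the provider's response satisfies what each consumer recorded.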
4.4 E2E Framework Consolidation: Playwright
Duetto currently uses both Playwright and Cypress. We propose consolidating on Playwright.
| Dimension | Playwright | Cypress |
|---|---|---|
| Browser support | Chromium, Firefox, WebKit (Safari) | Chromium, Firefox, WebKit (experimental) |
| Language support | JS, TS, Python, Java, .NET | JS, TS only |
| Multi-tab/window | Full support | Not supported |
| Parallel execution | Built-in (worker-based) | Requires Cypress Cloud (paid) |
| CI performance | 2-3x faster in parallel scenarios | Slower (sequential default) |
| Visual regression | Built-in screenshot comparison | Via plugins (Percy, Applitools) |
| API testing | Full HTTP client (`request` context) | Limited (`cy.request`) |
| Cost | Fully open source | Core OSS, paid Cloud features |
| Next.js integration | Native via `@playwright/test` | Limited |
Why Playwright wins for Duetto:
- Java bindings — backend teams can write integration tests in familiar tools
- Native parallelization — can cut CI pipeline time by roughly 40-60%
- Next.js (BLAST) compatibility — tests server components, edge functions, middleware
- Apollo Federation testing — request context is superior for GraphQL API testing
- Cost — fully open source vs. Cypress Cloud (paid for parallelization and dashboard)
Migration strategy:
1. Stop writing new Cypress tests immediately
2. Write all new E2E tests in Playwright
3. Gradually migrate critical Cypress tests (prioritize by business value)
4. Set a deadline (~6 months) for full Cypress decommission
4.5 Testing During the Monolith Migration
The monolith-to-microservices migration creates a period where the same functionality exists in both systems. Testing strategies must account for this duality.
Strangler Fig Testing Strategy:
| Testing Layer | What | When |
|---|---|---|
| Contract tests | Define contracts before extracting functionality | Before each service extraction |
| Routing tests | Verify the strangler facade correctly directs traffic | During migration |
| Parallel run tests | Route to both old and new, compare responses | During high-risk extractions |
| Data consistency tests | Verify data migration completeness and dual-write consistency | When service takes data ownership |
| Rollback tests | Verify switching back to monolith works | Before each cutover |
Parallel run testing (for high-risk extractions):
Request --> [Router]
|
+---> [Monolith] ---> Response (returned to user)
|
+---> [New Service] ---> Response (logged, compared)
Route a shadow copy of requests to the new service, compare responses, and track the discrepancy rate. When it drops below 0.1%, switch primary. Tools: Scientist4J pattern, DataDog for tracking comparison results.
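The comparison loop at the heart of parallel run testing is small enough to sketch. Both backends below are stubs invented for illustration (the "new service" deliberately introduces a rounding difference); in a real run the monolith's response goes to the user while the new service's response is only logged and diffed, with the discrepancy rate shipped to DataDog.

```python
# Sketch of the parallel-run (shadow traffic) comparison described above.
def monolith(req):
    return {"price": req["base"] * 1.2}

def new_service(req):
    # hypothetical extraction under test; rounds differently from the monolith
    return {"price": round(req["base"] * 1.2, 2)}

def parallel_run(requests, tolerance=1e-9):
    """Send each request to both backends and return the discrepancy rate."""
    mismatches = 0
    for req in requests:
        primary = monolith(req)       # this response is served to the user
        shadow = new_service(req)     # this one is only compared and logged
        if abs(primary["price"] - shadow["price"]) > tolerance:
            mismatches += 1
    return mismatches / len(requests)

rate = parallel_run([{"base": b} for b in (100.0, 149.99, 200.0)])
# 149.99 * 1.2 rounds differently, so this extraction is not ready:
assert rate > 0.001   # above the 0.1% threshold, keep the monolith as primary
```

Switching primary only when the measured rate stays below the threshold over a sustained window (not a single sample) is what makes this safe for high-risk extractions.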
4.6 Testing Strategy for Intelligence Teams (ML/Data)
Phase 2 scope. The Intelligence-specific practices in this section are designed to be introduced as a second phase of the Quality Guild, after the App/Platform track is established (see Section 2.2). During Phase 1, Intelligence teams adopt shared standards (linting, pre-commit, coverage visibility) and benefit from Quality Engineering infrastructure. The dedicated Intelligence QE, ML-specific quality gates, and full testing diamond are introduced in Phase 2.
The Intelligence domain (Pricing, Forecasting, Elasticity, Anomaly Detection, Optimization) operates on a fundamentally different technology stack and development lifecycle than App and Platform teams. The testing honeycomb (Section 4.1) applies to the Java services within Intelligence, but the core ML training pipelines, data engineering, and algorithmic code require a distinct quality approach.
4.6.1 Intelligence Domain Technology Landscape
| Dimension | App / Platform Teams | Intelligence Teams |
|---|---|---|
| Primary language | Java 17 + TypeScript/React | Python 3.10-3.12 (dominant), some Java |
| Frameworks | Spring Boot, Next.js, Apollo GraphQL | Flask, LightGBM, Prophet, Ray, PyTorch, Optuna, PuLP/HiGHS |
| Code nature | CRUD APIs, UI components, service orchestration | ML training pipelines, optimization solvers, statistical models, feature engineering |
| Data stores | MongoDB, PostgreSQL, Redis | S3, MLflow, DynamoDB, Athena/Glue, PostgreSQL (caching) |
| Deployment | Kubernetes, continuous delivery | Docker containers, AWS Lambda, SSM-based version promotion (stage→demo→prod) |
| Orchestration | Request/response APIs | Airflow DAGs, DynamoDB config-driven jobs, Ray distributed training |
| Testing maturity | ~3,100+ tests across repos | ~100 tests across all ML repos; some repos have 3 tests for 30K+ LOC |
Key repos analyzed: pricerator (pricing engine, 82 source files, 47 test files — best-tested), forecasting (LightGBM + Prophet, 30.5K LOC, 3 tests), ml_elasticity (DoubleML, 50+ tests), ml_pricing_engine (LP/MILP optimizer, 15 tests), intelligence-domain (Java/Spring Boot with JaCoCo), datapipelines (31+ Airflow DAGs, Great Expectations).
4.6.2 Why the Standard Honeycomb Doesn't Fully Apply
The testing honeycomb (unit → integration → E2E) assumes request/response services where bugs live at boundaries. In ML systems, the primary quality risks are different:
| Risk Category | What Can Go Wrong | Standard Honeycomb Coverage |
|---|---|---|
| Model accuracy degradation | New training run produces worse predictions than previous version | Not covered — no concept of "model accuracy" in unit/integration tests |
| Data quality drift | Input data schema changes, nulls appear in critical columns, distributions shift | Partially covered — data validation is not traditional testing |
| Feature engineering bugs | Incorrect temporal joins, data leakage across train/test split, wrong aggregation windows | Partially covered — unit tests can catch some, but the bugs are subtle and domain-specific |
| Training pipeline failure | OOM during distributed training, config-driven jobs silently skip steps | Not covered — pipeline orchestration testing is distinct |
| Numerical instability | Floating-point issues in optimization solvers, edge cases in elasticity calculations | Partially covered — requires property-based and boundary testing |
| Reproducibility failure | Same config + data produces different results across runs | Not covered — requires deterministic seeding and environment locking |
4.6.3 The ML Testing Diamond
For Intelligence teams, we propose a testing diamond adapted from Google's ML testing guidelines and industry practice:
┌───────────────┐
│ Model │ ~5% of effort
│ Validation │ Accuracy, backtesting, champion/challenger
├───────────────┤
┌─┤ Pipeline ├─┐ ~15% of effort
│ │ Tests │ │ DAG correctness, config validation, idempotency
│ ├───────────────┤ │
┌─┤ │ Data │ ├─┐ ~30% of effort
│ │ │ Quality │ │ │ Schema enforcement, drift detection, expectations
│ │ ├───────────────┤ │ │
│ │ │ Unit + │ │ │ ~50% of effort
│ │ │ Integration │ │ │ Transformations, utilities, API contracts
└─┴─┴───────────────┴─┴─┘
| Layer | % Effort | What to Test | Tools | Who |
|---|---|---|---|---|
| Unit + Integration | ~50% | Data transformations, utility functions, API endpoints, service contracts | pytest, unittest, moto (AWS mocking), Testcontainers, Pact | Engineers |
| Data Quality | ~30% | Input schema validation, null/range checks, distribution drift, referential integrity | Great Expectations (already in datapipelines), Pandera, custom validators | Engineers + QE |
| Pipeline Tests | ~15% | Airflow DAG validation, config-driven job correctness, idempotency, failure recovery | pytest-docker, DAG unit tests, golden file testing | Engineers + QE |
| Model Validation | ~5% | Accuracy metrics vs baseline, backtesting on holdout data, champion/challenger comparison | MLflow (already in use), custom metrics, SHAP analysis | ML Engineers + QE |
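The Data Quality layer's checks (schema, nulls, ranges) can be shown with a hand-rolled sketch; Great Expectations and Pandera provide the same idea as declarative, reusable suites. The column names and expectations below are hypothetical, chosen only to mirror the kinds of rules an Intelligence pipeline would declare.

```python
# Hand-rolled sketch of declarative data expectations for one pipeline stage.
EXPECTATIONS = {
    "occupancy": {"type": float, "min": 0.0, "max": 1.0, "nullable": False},
    "room_type": {"type": str, "nullable": False},
}

def validate_rows(rows):
    """Return a list of (row_index, column, problem) tuples; empty means pass."""
    failures = []
    for i, row in enumerate(rows):
        for col, exp in EXPECTATIONS.items():
            value = row.get(col)
            if value is None:
                if not exp["nullable"]:
                    failures.append((i, col, "null"))
                continue
            if not isinstance(value, exp["type"]):
                failures.append((i, col, "type"))
            elif "min" in exp and not exp["min"] <= value <= exp["max"]:
                failures.append((i, col, "range"))
    return failures

rows = [
    {"occupancy": 0.85, "room_type": "deluxe"},
    {"occupancy": 1.7,  "room_type": "suite"},     # out of range
    {"occupancy": None, "room_type": "standard"},  # null in a required column
]
assert validate_rows(rows) == [(1, "occupancy", "range"), (2, "occupancy", "null")]
```

The key design choice, matching the table above, is that a non-empty failure list blocks the pipeline stage rather than merely logging, so bad data never reaches training.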
4.6.4 Testing by Intelligence Architecture Layer
ML Training Pipelines (Python — forecasting, ml_elasticity, ml_pricing_engine)
| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| Utility unit tests | Data transformations, feature engineering functions, math utilities | pytest, numpy.testing | Low (~15 tests in ml_pricing_engine, 3 in forecasting) | Every transformation function has tests |
| Data validation | Input schema checks, null detection, range validation, distribution alerts | Great Expectations, Pandera | Present in datapipelines; absent in training repos | Validation at every pipeline stage |
| Golden file tests | Known input → expected output regression tests | pytest + snapshot files | Present in group-forecast-service | Expand to all model inference paths |
| Model validation | Accuracy vs previous version, backtesting, metric tracking | MLflow metrics, custom | MLflow experiment tracking in place | Automated champion/challenger gates |
| Config validation | YAML/DynamoDB config correctness, required fields, value ranges | JSON Schema, Pydantic | Pydantic validation in pricerator | All config-driven jobs validate before execution |
| Reproducibility | Same config + data → same results | Seed locking, pytest-randomly | Partial (deterministic seeds in some models) | Deterministic training with CI reproducibility check |
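The reproducibility row above reduces to one testable property: the same config and data must produce the same result across runs. A minimal sketch, with a toy "training" step standing in for a real model fit; real pipelines would additionally pin library versions and seed numpy/LightGBM, not just the stdlib RNG.

```python
import random

def train(config: dict, data: list) -> float:
    """Toy stand-in for a training run; returns a 'metric' for the model."""
    # Seed comes from config, via a local Random instance, never global state.
    rng = random.Random(config["seed"])
    sample = rng.sample(data, k=config["sample_size"])
    return sum(sample) / len(sample)

config = {"seed": 42, "sample_size": 3}
data = [1.0, 2.0, 3.0, 4.0, 5.0]

run_a = train(config, data)
run_b = train(config, data)
assert run_a == run_b   # deterministic: a CI job can rerun and diff the results
```

The CI reproducibility check named in the table is exactly this assertion at pipeline scale: run training twice from the same config and fail the build if the outputs diverge.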
Pricing/Optimization Services (Python — pricerator, group-forecast-service)
| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| API tests | Flask endpoint validation, request/response schemas | pytest + Flask test client | Good coverage in pricerator (47 test files) | Maintain; add contract tests for consumers |
| Algorithm tests | Pricing step correctness, constraint application, rate selection | pytest, property-based testing (Hypothesis) | Unit tests exist | Expand with property-based tests for edge cases |
| Integration tests | External API client calls (Rate API, Elasticity API, Monolith) | pytest-mock, moto (S3), responses | Present with moto for S3 | Add contract tests for upstream ML services |
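To illustrate the property-based row above: the core idea, which Hypothesis automates with input generation and shrinking of failing examples, is asserting invariants over many random inputs rather than a handful of fixed cases. `apply_rate_constraints` is a hypothetical stand-in for a real pricing step:

```python
import random

# Hypothetical constraint step: clamp a recommended rate to a hotel's floor/ceiling.
def apply_rate_constraints(rate, floor, ceiling):
    return max(floor, min(rate, ceiling))

def check_constraint_properties(trials=1000, seed=42):
    """Hand-rolled property check (Hypothesis automates this pattern):
    for any inputs, the output stays within [floor, ceiling], and
    in-range rates pass through unchanged."""
    rng = random.Random(seed)
    for _ in range(trials):
        floor = rng.uniform(10, 200)
        ceiling = floor + rng.uniform(0, 500)
        rate = rng.uniform(-100, 1000)
        out = apply_rate_constraints(rate, floor, ceiling)
        assert floor <= out <= ceiling
        if floor <= rate <= ceiling:
            assert out == rate
    return True
```

With Hypothesis, the `rng.uniform` generation above becomes `@given(st.floats(...))` decorators and the library shrinks any failing input to a minimal counterexample.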
Intelligence Java Services (intelligence-domain, dynamic-optimization)
These repos align with the standard strategy (Sections 4.1-4.2):
- intelligence-domain — Spring Boot with JaCoCo, PostgreSQL, RabbitMQ, LLM enrichment (Claude)
- dynamic-optimization — Spring Boot with RabbitMQ, Awaitility for async testing, MockServer
Standard honeycomb applies: JUnit 5, Testcontainers, Pact contracts, Playwright for any UI exposure.
Data Pipelines (datapipelines)
| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| DAG validation | DAG structure, task dependencies, schedule correctness | Airflow test utilities, pytest | 31+ DAGs, no DAG-specific tests found | DAG structure tests for every pipeline |
| Data quality | Great Expectations suites per pipeline stage | Great Expectations, DataDog | Present in monitoring module | Expand to all critical pipelines; block on failures |
| Transformation tests | Spark/Glue job logic, SQL correctness | pytest, PySpark test utilities | Limited | Unit tests for all transformation functions |
4.6.5 Existing Intelligence Strengths to Preserve
The Intelligence domain already has quality practices that should be recognized and built upon:
| Practice | Where | Value |
|---|---|---|
| MLflow experiment tracking | forecasting, ml_elasticity, ml_pricing_engine | Model versioning, hyperparameter logging, SHAP analysis — the ML equivalent of a quality dashboard |
| MyPy strict mode | pricerator, ml_elasticity | Stricter type enforcement than any app team repo — prevents a class of runtime errors |
| Great Expectations | datapipelines | Data validation framework already in production — expand coverage |
| Golden file testing | group-forecast-service | Snapshot-based output validation — effective for deterministic inference |
| detect-secrets | pricerator, ml_elasticity (pre-commit) | Secret detection in pre-commit hooks — ahead of app teams |
| Ruff + pre-commit | All Python repos | Consistent linting and formatting already enforced |
| Semantic version promotion | pricerator, forecasting | stage-NEXT → demo-NEXT → prod-NOW pipeline with SSM gating |
| Hammer (pricing regression) | duetto monolith (hammer/ module) | Branch-vs-branch pricing optimization comparison with statistical aggregation — the most sophisticated existing regression testing tool in the org. Currently on-demand only (see Section 4.6.7) |
4.6.6 Intelligence-Specific Quality Gates
Supplement the standard CI/CD gates (Section 7) with ML-specific gates:
Phase 1 (align with org Phase 1):
- Ruff lint + format check (already present — standardize config across repos)
- MyPy strict mode (already on some repos — extend to all)
- pytest unit tests with coverage reporting via pytest-cov (establish baseline)
Phase 2:
- Great Expectations data quality checks as blocking gates in DAGs
- Golden file regression tests for inference endpoints
- Model accuracy comparison vs baseline (MLflow metric check) — warning, not blocking
Phase 3:
- Automated champion/challenger testing before model promotion
- Data drift detection alerts (via Great Expectations or custom monitors)
- pytest coverage threshold (start at 40% for ML repos — lower than app teams, but meaningful)
- Property-based testing (Hypothesis) for algorithmic code
4.6.7 Hammer: Pricing Regression Testing — Current State and Modernization
What Hammer is:
Hammer is an existing pricing optimization regression testing tool in the duetto monolith (hammer/ module). It is the most sophisticated quality validation tool currently in the Intelligence domain. It consists of two executables:
- `hammerOptimizer` — Runs the full pricing optimization for a set of hotels, generating rate recommendations, constrained/unconstrained forecasts, and rate sync data transactions (RSDT). Results are stored in S3 under a `{runName}/{branchName}/` prefix.
- `hammerDiffer` — Compares optimization outputs from two branches with statistical aggregation: mean squared error per hotel, rate deviation percentages, one-sided diffs (prices in one branch but not the other), and per-hotel forecast diffs. Exits with code 1 if forecast differences are detected.
Typical workflow (current — manual):
```shell
# 1. Run optimization on feature branch
gradle runHammer -PbranchName=feature-branch -PrunName=run1 ...
# 2. Run optimization on develop
gradle runHammer -PbranchName=develop -PrunName=run1 ...
# 3. Compare results
gradle runHammerDiffer -PfirstBranch=feature-branch -PsecondBranch=develop ...
```
Current limitations:
| Limitation | Impact |
|---|---|
| On-demand only | No CI/CD integration — pricing regressions can ship undetected if nobody runs Hammer |
| Manual 3-step process | Run branch A → run branch B → run differ. Easy to forget or misconfigure |
| Binary pass/fail | Any forecast diff triggers System.exit(1) — no tolerance threshold; a 0.001% rate deviation is treated the same as 50% |
| No historical tracking | Results live as text files in S3 with no dashboard, alerting, or trend analysis |
| Heavy monolith coupling | Loads the full Spring context (:api, :data, :query, :scheduler, :server) — slow startup, tightly coupled to monolith |
| No subset mode for CI | Running all active hotels is too slow for PR pipelines; no curated sample for fast feedback |
| Opaque reporting | Pipe-delimited text files in S3 — no structured output, no PR comments, no Slack notifications |
| No DataDog integration | The rest of the org uses DataDog for observability, but Hammer results are isolated |
Proposed modernization — 3 phases:
Phase 1 — Automate in CI (Months 1-3):
- GitHub Actions workflow for pricing PRs: Trigger Hammer automatically on PRs that modify pricing-related code paths (`api/src/**/pricing/**`, `api/src/**/optimizer/**`, `hammer/**`). Use path filters to avoid running on unrelated PRs.
- Representative hotel sample: Maintain a curated registry of 10-20 hotels (small, medium, large, different regions and configurations) for fast CI runs. Full hotel set reserved for nightly runs.
- Scheduled nightly run on `develop`: Compare `develop` against the last release tag. Full hotel set. Slack notification if diffs exceed threshold.
- Configurable tolerance thresholds: Replace binary pass/fail with threshold-based reporting — warn on small deviations (<0.5% mean rate deviation), block on large deviations (>2%). Allow expected changes to be annotated in the PR.
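A minimal sketch of the proposed tolerance logic, using the illustrative thresholds from the bullet above (warn over 0.5% mean rate deviation, block over 2%); the function name and signature are hypothetical, not existing Hammer code:

```python
# Sketch of threshold-based Hammer verdicts, replacing the current
# "any diff -> System.exit(1)" behavior. Thresholds are the proposal's
# illustrative numbers, tunable per pipeline.
WARN_PCT = 0.5
BLOCK_PCT = 2.0

def hammer_verdict(mean_rate_deviation_pct, expected_change=False):
    """Classify a branch-vs-branch pricing diff instead of hard-failing on any diff."""
    if expected_change:  # annotated in the PR as an intentional pricing change
        return "PASS"
    if mean_rate_deviation_pct > BLOCK_PCT:
        return "FAIL"
    if mean_rate_deviation_pct > WARN_PCT:
        return "WARN"
    return "PASS"
```

The three-way verdict maps directly onto CI: FAIL blocks the merge, WARN posts a PR comment, PASS stays quiet.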
Phase 2 — Structured reporting and observability (Months 3-6):
- Structured JSON output: Extend Hammer to produce machine-readable JSON alongside the current text format, enabling consumption by dashboards and CI tooling.
- PR comment with summary table: GitHub Actions posts a structured comment on pricing PRs:
- Hotels tested / hotels with diffs
- Mean rate deviation, largest deviation (hotel + percentage)
- Status: PASS / WARN / FAIL
- DataDog integration: Push Hammer results as custom metrics — rate deviation per hotel, number of affected hotels, MSE trends. Build a dashboard tracking pricing stability over time. Alert on trends ("pricing deviation increasing over last 5 runs").
Phase 3 — Decouple and modernize (Months 6-12):
- Containerize Hammer: Docker image with Hammer JAR + dependencies. Eliminates the need to build the full monolith to run Hammer. Runs on any CI runner or as a scheduled ECS task.
- Decouple from monolith Spring context: Extract a `PricingOptimizerService` interface so Hammer can operate with a lighter context. As pricing moves to microservices, Hammer transitions from loading code in-process to calling the pricing service HTTP endpoint.
- AI-assisted diff analysis: Feed Hammer diff output to Claude Code to generate human-readable summaries for PR reviewers — e.g., "This PR changes rates for 3 APAC hotels by an average of 2.3%, concentrated in room types rt1234 and rt5678 for Q3 dates. This is consistent with the PR description."
Target state:
| Dimension | Current | Target |
|---|---|---|
| Trigger | Manual, on-demand | Automatic on pricing PRs + nightly on develop |
| Scope | All hotels or manual selection | Curated sample for PRs, full set for nightly |
| Pass/fail | Binary (any diff = fail) | Threshold-based (warn / block with configurable tolerance) |
| Reporting | S3 text files | PR comments, DataDog dashboard, Slack alerts |
| Speed | Slow (full hotel set + Spring context) | Fast for PRs (sample + containerized), thorough overnight |
| Historical tracking | None | DataDog metrics with trend analysis |
| Monolith coupling | Full Spring context load | Containerized → eventually HTTP client to pricing service |
5. QA for AI-Generated Code
With 50-70% of Duetto's code AI-generated via Claude Code, QA strategy must explicitly address this reality.
5.1 How AI-Generated Code Changes Testing
| Traditional Code | AI-Generated Code |
|---|---|
| Developer understands every line they wrote | Developer reviews code they didn't write — attention gaps are common |
| Bugs correlate with developer skill and domain knowledge | Bugs correlate with prompt quality and review thoroughness |
| Testing verifies the developer's implementation | Testing must verify the AI's implementation AND the reviewer's understanding |
| Test quality depends on developer testing discipline | AI can generate tests too — but tautological tests (testing the implementation, not the behavior) are a known risk |
5.2 Specific Risks and Mitigations
| Risk | Mitigation |
|---|---|
| AI-generated code passes review but has subtle logic errors | Mutation testing to verify test suite effectiveness |
| AI generates tests that test the implementation, not the behavior | QE reviews test strategy, not just test code; property-based testing for complex logic |
| AI-generated code has security vulnerabilities (dependency confusion, injection, etc.) | SAST/DAST in CI, CodeRabbit security rules, dedicated security scanning |
| AI generates overly complex code when simple code would do | Complexity metrics in CI (warn on high cyclomatic complexity), code review standards |
| AI-generated tests provide false confidence (high coverage, low signal) | Mutation testing score as a quality gate, QE exploratory testing |
5.3 Mutation Testing
Mutation testing is the strongest technique for verifying test suite quality. It works by inserting small faults ("mutants") into production code and checking whether tests detect them.
Why it matters for AI-generated code:
- AI can generate tests with high line coverage that don't actually catch bugs
- Mutation testing reveals whether tests are truly effective, not just comprehensive
- Google uses mutation testing at scale (6,000+ engineers, 14,000+ code authors) via a diff-based probabilistic approach

Tools:
- PIT (pitest) — mutation testing for Java/JVM (Spring Boot, JUnit)
- Stryker — mutation testing for JavaScript/TypeScript (Jest, React)
Proposal: Start with mutation testing on critical business logic (pricing algorithms, revenue calculations) and expand. Use as a quality signal, not a hard gate initially.
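A hand-rolled illustration of the mechanism (PIT and Stryker automate the mutant generation and reporting): a high-coverage but assertion-weak test lets a mutant survive, while a behavioral test kills it. All function names here are hypothetical:

```python
# Original function and a "mutant" with one small injected fault.
def occupancy_rate(occupied, total):
    return occupied / total if total else 0.0

def occupancy_rate_mutant(occupied, total):
    return occupied / total if total else 1.0  # mutated fallback: 0.0 -> 1.0

def weak_test(fn):
    """Covers the (one-line) function fully, but never exercises the
    zero-total edge case, so the mutant passes too: it "survives"."""
    return fn(5, 10) == 0.5

def strong_test(fn):
    """Also asserts the zero-total behavior, so the mutant fails: it is "killed"."""
    return fn(5, 10) == 0.5 and fn(0, 0) == 0.0
```

A mutation score is simply the fraction of generated mutants killed; a suite where most mutants survive is giving false confidence regardless of its coverage number.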
5.4 AI Tools for QA
| Tool | What It Does | Applicability to Duetto |
|---|---|---|
| Claude Code | Generate test scaffolding, write test cases from specs, debug test failures | Primary — already in use. QE should develop testing-specific prompts and skills. |
| Diffblue Cover | AI-generated unit tests for Java | Evaluate for Spring Boot services — auto-generates JUnit tests for existing code |
| Playwright Codegen | Records browser actions and generates Playwright test code | Use for rapid E2E test creation; QE reviews and refines generated tests |
| CodeRabbit | AI code review that can enforce testing standards via path-based rules | Configure to flag PRs missing tests, enforce coverage thresholds, detect test anti-patterns |
| Augment | AI-assisted code review | Complement CodeRabbit for reviewing AI-generated code quality |
5.5 AI Code Review Tools for Testing Standards Enforcement
Two AI-powered code review tools are available in Duetto's stack. They serve complementary purposes for quality enforcement.
CodeRabbit
Automated AI code review with rule-based enforcement:
- Path-based instructions: Require tests for specific directories (e.g., `src/services/**` must have corresponding `src/__tests__/services/**`)
- Code guidelines: Reference the testing standards document so CodeRabbit checks compliance
- AST-based rules (Pro): Enforce patterns like "no console.log in production code" or "test files must use describe/it structure"
- Default protections: Automatically skips review of generated code, lock files, build artifacts
- Strengths: Excellent for systematic rule enforcement, configurable per-repo, catches structural test issues (missing test files, coverage regressions, anti-patterns)
Augment Code
AI-powered code review and development assistant with deep codebase context:
- Codebase-aware reviews: Augment indexes the full codebase to provide contextual review comments — it understands existing patterns and flags deviations
- Test quality analysis: Can identify when tests don't adequately cover the changed code, suggest missing test scenarios, and flag weak assertions
- Architecture awareness: Understands service boundaries and can flag when integration or contract tests are missing for cross-service changes
- IDE integration: Available in VS Code and JetBrains IDEs, providing real-time feedback during development (shift-left)
- Strengths: Superior contextual understanding of the codebase; better at catching logic-level issues and suggesting what should be tested rather than enforcing structural rules
Comparison and Proposed Usage
| Dimension | CodeRabbit | Augment Code |
|---|---|---|
| Primary mode | Automated PR review (CI) | IDE assistant + PR review |
| Rule enforcement | Strong — configurable path-based and AST rules | Moderate — guideline-based, not rule-based |
| Codebase context | Limited to PR diff + configured instructions | Deep — indexes full repository |
| Test gap detection | Structural (missing test files, coverage delta) | Semantic (missing scenarios, weak assertions) |
| Custom configuration | Extensive (.coderabbit.yaml, path instructions) | Moderate (team-level instructions) |
| Best for | Enforcing standards consistently at scale | Catching logic-level issues humans miss |
Proposal: Use the two tools as complements:
- CodeRabbit as the systematic enforcer — ensures every PR meets structural quality gates (test files exist, patterns followed, no anti-patterns)
- Augment Code as the intelligent reviewer — catches semantic issues like insufficient test scenarios, missing edge cases, and tests that don't match the intent of the change
- Configure CodeRabbit rules first (quick wins), then layer in Augment for deeper quality insights
6. Cross-Team Consistency
6.1 The Problem
With tens of autonomous teams, inconsistency is the default. Without intervention, each team:
- Chooses its own testing tools and patterns
- Has different coverage thresholds (or none)
- Writes tests at different levels (some heavy on unit, some on E2E, some on nothing)
- Has different CI pipeline configurations
- Handles flaky tests differently (or doesn't handle them at all)
6.2 The "Paved Road" Approach
Inspired by Stripe and Netflix: make the right thing easy, not mandatory.
Instead of mandating practices through policy, build infrastructure that makes good practices the path of least resistance:
| Paved Road | What It Means |
|---|---|
| Shared test templates | GitHub repo templates that include pre-configured test setup — App/Platform: Jest, JUnit, Playwright, Testcontainers; Intelligence: pytest, Great Expectations, golden file harness |
| CI pipeline templates | Reusable GitHub Actions workflows with quality gates built in — separate templates for Java/Spring, React/Next.js, and Python ML repos |
| Test data libraries | Shared factories and fixtures for common Duetto entities (hotels, rates, reservations, users); Intelligence: shared test data schemas and sample hotel model fixtures |
| Quality dashboard | DataDog dashboard showing quality metrics per team and per track — visibility drives behavior |
| Example repos | "Golden path" example services showing the proposed testing approach for backend, frontend, full-stack, and ML/data pipeline repos |
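As an illustration of the shared test data libraries row, a factory sketch in the Intelligence track's language; the entity fields are invented for the example, not Duetto's actual schema:

```python
import itertools

# Sketch of a shared test-data factory. Centralizing this avoids every team
# hand-rolling slightly different hotel/rate fixtures. Field names are illustrative.
_ids = itertools.count(1)

def make_hotel(**overrides):
    hotel = {
        "id": f"hotel-{next(_ids)}",
        "name": "Test Hotel",
        "region": "EMEA",
        "room_types": ["standard", "suite"],
    }
    hotel.update(overrides)  # tests override only the fields they care about
    return hotel

def make_rate(hotel, **overrides):
    rate = {
        "hotel_id": hotel["id"],
        "room_type": "standard",
        "amount": 100.0,
        "currency": "USD",
    }
    rate.update(overrides)
    return rate
```

Usage in a test is then one line, e.g. `make_hotel(region="APAC")`, which keeps test intent visible: only the field under test deviates from the defaults.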
6.3 Testing Standards Document
The Quality Guild should own a lightweight testing standards document. What to include and what to leave to teams:
Standardize (guild-owned — all teams):
- Test categorization (unit, integration, E2E, contract, data quality, pipeline — definitions and expectations)
- Minimum quality gates for CI/CD (what blocks a merge)
- Test naming conventions
- Flaky test policy (quarantine, SLA for remediation)
- Pre-commit hook standards (Ruff/ESLint, type checking, secret detection)
- Coverage reporting (visibility required; thresholds by track)

Standardize (App/Platform track):
- E2E framework choice (Playwright — not optional)
- Contract testing approach (Pact — for service boundaries)
- Testcontainers for integration tests (real infrastructure, not mocks)

Standardize (Intelligence track):
- Data quality framework (Great Expectations — for all data pipelines)
- Golden file testing for inference endpoints
- MyPy strict mode for all Python repos
- MLflow experiment tracking with minimum logged metrics
- Model promotion requires accuracy comparison vs baseline

Leave to teams:
- Internal test organization (file structure, test grouping)
- Mocking strategies (as long as they follow "test behavior not implementation")
- Which specific scenarios to test (teams know their domain best)
- Test execution speed optimization (teams manage their own pipeline budget)
- ML model architecture and hyperparameter choices
6.4 Quality Guild Cadence
| Activity | Frequency | Participants | Purpose |
|---|---|---|---|
| Guild meeting (all-hands) | Bi-weekly, 45 min | Both tracks + interested engineers | Shared topics: CI/CD, AI code quality, metrics review, cross-pollination |
| App/Platform deep dive | Monthly, 30 min | App/Platform QEs + leads | Track-specific: Playwright, Pact, Testcontainers, frontend testing |
| Intelligence deep dive | Monthly, 30 min | Intelligence QE + ML engineers | Track-specific: data quality, pipeline testing, model validation, MLflow |
| Standards review | Quarterly | Full guild | Update testing standards for both tracks based on what's working |
| Quality metrics review | Monthly | Guild lead + EM | Review dashboards by track, identify teams needing support |
| Tool evaluation | As needed | Relevant track | Evaluate new tools, make proposals |
| Onboarding | Per new QE/engineer | Relevant track QE | Testing expectations and tooling walkthrough (track-appropriate) |
7. Quality Gates in CI/CD
7.1 Progressive Adoption
Do not implement all gates at once. This creates overwhelming friction. Phase them in:
Phase 1 — Foundation (Weeks 1-4):
- Build verification (compilation, Docker image)
- Unit tests with existing coverage
- Linting/formatting (ESLint, Prettier, Checkstyle — autofix where possible)
- GraphQL schema checks (rover subgraph check)
Phase 2 — Contracts and Integration (Weeks 5-8):
- Integration tests (with Testcontainers)
- Pact contract verification
- Security scanning (Snyk, Dependabot, or Trivy — critical/high block)
- Code coverage reporting (no hard threshold yet — just visibility)

Phase 3 — Performance and Quality (Weeks 9-12):
- Performance regression testing (k6 in CI)
- Code coverage thresholds (start conservative: 60%, increase over time)
- Bundle size monitoring (frontend)
- E2E smoke tests on staging deployment

Phase 4 — Optimization (Ongoing):
- Flaky test quarantine system
- Test impact analysis (run only tests affected by changed files)
- Visual regression testing
- Mutation testing on critical paths
7.2 Gate Classification
| Tier | Behavior | Examples |
|---|---|---|
| Blocking | Must pass to merge | Build, unit tests, integration tests, linting, schema checks, contract tests, security (critical/high) |
| Warning | Report but don't block | Coverage trending down, performance regression >10%, bundle size growth, complexity increase |
| Informational | Report only | Test execution time trends, flaky test rate, dead code, accessibility audit, TODO count |
7.3 Test Execution Optimization
| Technique | How | Impact |
|---|---|---|
| Parallelization | Playwright `--shard`, JUnit 5 parallel execution, GitHub Actions matrix strategy | 2-4x faster CI |
| Selective test runs | Jest `--changedSince=main`, Gradle `--tests` with file mapping | 50-80% fewer tests on average PR |
| Fail-fast | Run unit tests first; skip integration/E2E if they fail | Faster feedback on obvious failures |
| Caching | Cache Docker images (Testcontainers), Playwright browsers, Maven/npm dependencies | 30-50% faster pipeline startup |
| Test result aggregation | Publish to DataDog for trend analysis, GitHub Actions test summary annotations | Flaky test detection, regression tracking |
8. Current State: CI/CD & Test Infrastructure Analysis
This section documents the current state of Duetto's CI/CD pipelines, static code analysis, test coverage, and flaky test handling — derived from analysis of the duetto (backend) and duetto-frontend repositories. Understanding the baseline is essential for prioritizing improvements.
8.1 Current CI/CD Pipeline Architecture
| Repository | PR Pipeline | Push (develop) Pipeline | Scheduled |
|---|---|---|---|
| duetto (backend) | Static analysis → All tests (Jest + basic + Selenium) | Static analysis → All tests → Docker build → GraphQL schema publish | Every 2 hours (with commit-change check) |
| duetto-frontend | Lint → Jest → Cypress (12 parallel containers) | Lint → Jest → Cypress → Trigger external Playwright E2E | Weekly flaky test issue creation (Mondays) |
| duetto-playwright-e2e | PR-specific tests | Regression suite (triggered by frontend push) | On-demand via workflow dispatch |
Key observations:
- Pipeline structure is sound — progressive checks with fast failures first
- Backend uses larger runners (ARM 64-core for static analysis, ubuntu-latest-m for Jest)
- Playwright tests live in a separate repository, triggered via webhook — this adds latency and reduces developer visibility
- Scheduled all-tests run (every 2 hours) with Slack notifications provides good ongoing monitoring
8.2 Static Code Analysis — Current State
| Tool | Repository | Configuration | Blocking? | Notes |
|---|---|---|---|---|
| Checkstyle 10.26.1 | Backend | 120-char line length, naming conventions, whitespace, modifier order | Yes — fails PR | Well-configured, enforces consistent Java style |
| SpotBugs 6.1.7 | Backend | Exclude filter for known false positives, HTML+XML reports | No — `ignoreFailures: true` | Reports generated but don't block merges |
| ESLint (Airbnb + TS) | Frontend | Airbnb + TypeScript config, no-only-tests (error level), deprecated import warnings | Yes — fails PR | Good configuration; prevents .only() leaks |
| Prettier 2.4.1 | Frontend | Integrated into lint-staged pre-commit hooks | Yes — pre-commit | Formatting consistency ensured |
| TypeScript compiler | Frontend | `tsc --noEmit` strict check | Yes — fails PR | Catches type errors before runtime |
| Xray / Frogbot | Backend | Security scanning via JFrog | Manual trigger | Available but not on every PR |
Gaps and proposals:
| Gap | Impact | Proposal | Priority |
|---|---|---|---|
| SpotBugs is non-blocking | Potential bugs slip through to production | Make SpotBugs blocking for new violations (allow existing baseline) | High |
| No SonarQube or equivalent | No centralized quality dashboard, no code smell tracking, no technical debt measurement | Evaluate SonarCloud (SaaS) for unified quality visibility across repos | Medium |
| No SAST/DAST security scanning on PRs | Security vulnerabilities in dependencies and code not caught early | Add Snyk or Trivy to PR pipeline for dependency scanning (critical/high = blocking) | High |
| No frontend complexity analysis | Complex components grow unchecked | Add ESLint complexity rules (max cyclomatic complexity warning) | Low |
8.3 Test Coverage — Current State
Critical finding: No code coverage thresholds are enforced in either repository.
| Dimension | Backend | Frontend |
|---|---|---|
| Coverage tool | Not configured (no JaCoCo) | Not configured in jest.config.js |
| Coverage threshold | None | None |
| Coverage reporting | None | None |
| Coverage trend tracking | None | None |
| Test count | ~2,318 Java unit tests + 176 Selenium tests | ~404 Jest specs + 175 Cypress specs + 32 Playwright tests |
Proposals:
- Immediate (Phase 1): Add coverage reporting without thresholds — visibility first
  - Backend: Add JaCoCo to Gradle (`jacocoTestReport`), publish HTML reports as CI artifacts
  - Frontend: Add the `--coverage` flag to the Jest CI command, publish lcov reports
  - Integrate with CodeCov or SonarCloud for PR-level coverage delta comments
- Phase 2: Introduce conservative thresholds (not blocking yet)
  - Start with warning-level thresholds: 50% line coverage (likely below current levels to avoid disruption)
  - Track coverage trend per PR — flag PRs that decrease coverage
- Phase 3: Enforce blocking thresholds
  - Increase to 60% line coverage (blocking)
  - Require no decrease in coverage per PR (blocking)
  - Target 80%+ for critical business logic (pricing, revenue calculations, event processing)
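A sketch of what the frontend side of this could look like in `jest.config.js`, assuming the repo's existing Jest setup. Phase 1 enables reporting only; the commented threshold block is the Phase 3 step, since Jest's `coverageThreshold` fails the run when unmet:

```javascript
// jest.config.js — sketch, not the repo's actual config.
const config = {
  collectCoverage: true,
  coverageReporters: ['lcov', 'text-summary'], // lcov feeds CodeCov/SonarCloud
  // Phase 3 (blocking) — uncomment once the baseline is established:
  // coverageThreshold: { global: { lines: 60 } },
};

module.exports = config;
```

Keeping the threshold in config (rather than a CI flag) makes the gate visible and reviewable in the same PR that raises it.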
8.4 Flaky Test Handling — Current State
Duetto has invested meaningfully in flaky test infrastructure, but the approach is fragmented across frameworks.
Backend — Custom Retry System
| Mechanism | Details |
|---|---|
| `@RetryTest` annotation | Custom JUnit 5 extension: retries test N times on specific exception types (e.g., TimeoutException, InvocationTargetException) |
| Usage | ~9 annotated tests, primarily Selenium page tests |
| `validate-flapper-fix` workflow | Manual-trigger workflow that runs a single test up to 250 times to verify a flaky fix |
| `@Disabled` tests | ~20+ Selenium tests disabled with notes like "To Be Fixed in Later Ticket" |
| Test splitting | 15 runners (basic) + 20 runners (Selenium) with line-count distribution via split-tests action |
| Slack alerts | Scheduled test failures notify via Slack webhook |
Frontend — Cypress Retry + Automated Detection
| Mechanism | Details |
|---|---|
| Cypress retries | retries=10 — each test can fail up to 10 times before final failure |
| `.xspec.ts` quarantine | 5 test files renamed to `.xspec.ts` (excluded from runs) |
| Weekly automation | GitHub Action runs every Monday: parses Cypress logs for failures, creates GitHub issue with skip instructions |
| Cypress Cloud | Records all runs for post-mortem analysis |
| `it.skip()` / `describe.skip()` | Manual skip annotations throughout the codebase |
Playwright E2E
| Mechanism | Details |
|---|---|
| Retry | 1 retry on CI only (retries: process.env.CI ? 1 : 0) |
| Workers | 1 worker on CI (stability), 4 locally (speed) |
| Reporting | Allure reports with per-team breakdown, DataDog test visibility integration |
| Traces | Retained on failure for debugging |
Gaps and proposals:
| Gap | Impact | Proposal | Priority |
|---|---|---|---|
| Cypress retries=10 is excessive | Masks genuinely flaky tests; 10 retries can add minutes to CI | Reduce to 2-3 retries; any test needing >3 retries is flaky and should be quarantined | High |
| No unified flaky test tracking | Flaky tests tracked differently per framework; no org-wide view | Build a DataDog dashboard tracking flaky rate per test suite, per team | Medium |
| ~20+ disabled Selenium tests | Unknown test debt; regression risk | Audit disabled tests: either fix, delete, or convert to Playwright. Set SLA: no test disabled >30 days without a ticket | High |
| No automatic quarantine | Manual process to skip flaky tests; relies on someone noticing | Implement automatic quarantine: if a test fails >3 times in 7 days, auto-quarantine + create ticket | Medium |
| Frontend weekly automation is reactive | Only runs Mondays; flaky tests can block PRs all week | Run flaky detection daily or on every develop push | Low |
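The proposed auto-quarantine rule ("fails more than 3 times in 7 days") can be sketched as follows. Class and method names are illustrative; in practice the failure stream would come from DataDog test visibility or CI results, and quarantining would rename/tag the test and file a ticket:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sketch of the proposed auto-quarantine policy. Illustrative names throughout.
FAILURE_LIMIT = 3          # more than this many failures...
WINDOW = timedelta(days=7) # ...within this window triggers quarantine

class FlakyTracker:
    def __init__(self):
        self._failures = defaultdict(list)  # test name -> failure timestamps

    def record_failure(self, test_name, when):
        self._failures[test_name].append(when)

    def should_quarantine(self, test_name, now):
        """True if the test exceeded the failure limit inside the rolling window."""
        recent = [t for t in self._failures[test_name] if now - t <= WINDOW]
        return len(recent) > FAILURE_LIMIT
```

The same counter works for all frameworks, which is the point: one policy and one dashboard instead of per-framework retry conventions.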
8.5 Test Infrastructure Summary
┌──────────────────────────────────────────────────────────────────┐
│ Current Test Landscape │
├────────────────┬───────────────┬──────────────────────────────────┤
│ Layer │ Count │ Framework & Notes │
├────────────────┼───────────────┼──────────────────────────────────┤
│ Java unit │ ~2,318 tests │ JUnit 5 + Mockito (15 runners) │
│ Backend Jest │ Subset │ Jest (frontend-in-backend) │
│ Frontend Jest │ ~404 specs │ Jest + RTL (jsdom) │
│ Cypress E2E │ ~175 specs │ Cypress 7.7 (12 runners, ret=10)│
│ Selenium E2E │ ~176 tests │ Selenium 4.29 + Firefox (20 run)│
│ Playwright E2E │ ~32 tests │ Playwright (separate repo) │
│ Hammer (price) │ On-demand │ Custom Java (monolith module) │
├────────────────┼───────────────┼──────────────────────────────────┤
│ TOTAL │ ~3,100+ tests │ 7 frameworks across 3+ repos │
└────────────────┴───────────────┴──────────────────────────────────┘
Key takeaway: The test suite is substantial (~3,100+ tests) but fragmented across 7 frameworks and multiple repositories. Consolidating on Playwright (replacing Cypress and Selenium), unifying test reporting, and automating Hammer (currently the only pricing regression tool, but on-demand only) will dramatically improve signal quality and reduce maintenance burden.
9. Quality Metrics
9.1 Metrics That Matter
Beyond test coverage — metrics that actually correlate with software quality:
Leading Indicators (predict quality):
| Metric | What It Measures | Target | Tool |
|---|---|---|---|
| Test coverage trend | Direction of coverage over time (not absolute %) | Increasing quarter-over-quarter | CodeCov, SonarQube |
| Mutation score | % of mutants killed — measures test effectiveness | >70% for critical business logic | PIT, Stryker |
| Flaky test rate | % of test runs that are non-deterministic | <1% of total test suite | Custom tracking, DataDog |
| Build success rate | % of CI builds that pass on first run | >90% | GitHub Actions metrics |
| PR test coverage delta | Coverage change per PR | No decrease (warning), increase (goal) | CodeCov PR comments |
Lagging Indicators (measure quality outcomes):
| Metric | What It Measures | Target | Tool |
|---|---|---|---|
| Escaped defect rate | Bugs that reach production per release | Decreasing trend | Jira/incident tracking |
| Mean Time to Recovery (MTTR) | How fast production issues are fixed | <1 hour for P1 | DataDog, PagerDuty |
| Deployment frequency | How often teams ship to production | Multiple times/week per team | GitHub Actions |
| Change failure rate | % of deployments causing incidents | <5% | Incident tracking |
| Incident frequency | Production incidents per team per month | Decreasing trend | PagerDuty, DataDog |
DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) should be the north-star metrics for engineering quality.
9.2 Quality Dashboard
Build a DataDog dashboard showing quality health per team:
| Section | Metrics | Audience |
|---|---|---|
| Team Health | DORA metrics, escaped defects, incident rate | Engineering leadership |
| Test Health | Coverage, flaky rate, test execution time, mutation score | Quality Guild, team leads |
| Pipeline Health | Build success rate, pipeline duration, quality gate pass rate | Quality Engineering team |
| Production Health | Error rates, p99 latency, SLO compliance | All engineers |
10. Tooling Proposals
10.1 Proposed Tool Stack
| Category | Tool | Why |
|---|---|---|
| Unit testing (Java) | JUnit 5 + Mockito | Industry standard for Spring Boot |
| Unit testing (JS/TS) | Jest + React Testing Library | Already in use, excellent for React |
| Integration testing | Testcontainers | Real Docker containers for MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ |
| E2E testing | Playwright | Consolidate from Playwright + Cypress + Selenium (see Sections 4.4 and 11) |
| Contract testing | Pact | Consumer-driven contracts for service boundaries |
| GraphQL schema | Apollo Rover | Schema checks in CI for federation |
| Performance testing | k6 | JS-based, DataDog integration, GraphQL support, CI-native |
| Security scanning | Snyk or Trivy | Dependency vulnerabilities in CI |
| Mutation testing | PIT (Java), Stryker (JS/TS) | Validate test suite effectiveness |
| Visual regression | Playwright built-in screenshots | Free, no additional tooling |
| Accessibility | axe-core + Playwright | Automated WCAG 2.1 AA checks |
| Code review (AI) | CodeRabbit + Augment | Enforce testing standards, catch quality issues |
| Observability | DataDog + OpenTelemetry | Production quality signals |
| Chaos engineering | AWS FIS + Toxiproxy | Resilience testing (Phase 4+) |
| Intelligence Track | | |
| Unit testing (Python) | pytest + pytest-mock | Already in use across ML repos; standardize fixtures and markers |
| Data quality | Great Expectations | Already in datapipelines — expand to training repos; schema + distribution checks |
| Type checking (Python) | MyPy (strict mode) | Already on pricerator, ml_elasticity — standardize across all Python repos |
| Linting (Python) | Ruff | Already adopted — standardize config (line-length, rule sets) across all ML repos |
| ML experiment tracking | MLflow | Already in use — add automated validation gates (accuracy vs baseline) |
| Golden file testing | pytest + snapshot files | Already in group-forecast-service — expand to all inference endpoints |
| Property-based testing | Hypothesis | For algorithmic code: pricing constraints, elasticity calculations, optimization solvers |
| Pipeline testing | Airflow test utilities + pytest-docker | DAG structure validation, pipeline integration tests |
| AWS mocking | moto | Already in pricerator — standardize for all boto3/S3 testing |
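Mutation testing (PIT, Stryker) validates a test suite by injecting small code changes ("mutants") and checking that tests fail. A toy, self-contained TypeScript illustration of the principle — the real tools operate systematically on compiled code, not on source strings as this sketch does:

```typescript
// Toy mutation test: flip "+" to "-" in a function's source and check
// whether a given test catches ("kills") the mutant. PIT and Stryker
// automate this across many mutation operators and source files.
const addSource = "return a + b;";

type BinFn = (a: number, b: number) => number;

function compile(src: string): BinFn {
  return new Function("a", "b", src) as BinFn;
}

// A test suite is effective for this function only if it kills the mutant.
function mutantKilled(src: string, test: (f: BinFn) => boolean): boolean {
  const mutant = compile(src.replace("+", "-")); // the injected defect
  return !test(mutant); // killed = the test fails when run on the mutant
}

const goodTest = (f: BinFn) => f(2, 3) === 5;
const tautologicalTest = (_f: BinFn) => true; // passes no matter what
```

This is also the mechanism that exposes tautological AI-generated tests: a test that survives every mutant is asserting nothing about behavior.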
10.2 Testcontainers Setup for Duetto's Stack
Testcontainers should be the standard for all backend integration tests, replacing mocks with real infrastructure:
| Container | What it replaces | Use case |
|---|---|---|
| `MongoDBContainer` | Mock MongoDB / embedded MongoDB | Repository tests, aggregation pipeline tests |
| `PostgreSQLContainer` | Mock PostgreSQL / H2 | Neon Postgres compatibility, migration tests |
| `GenericContainer("redis:7")` | Mock Redis | Cache integration tests |
| `LocalStackContainer` (SQS, SNS, Kinesis, S3) | Mock AWS SDK | Event-driven architecture tests |
| `RabbitMQContainer` | Mock RabbitMQ | Message handler tests |
11. Selenium-to-Playwright Migration
11.1 Migration Scope
Duetto has ~176 Selenium tests (Java, Selenium 4.29, Firefox) running across 20 parallel CI runners. These tests use the Page Object pattern, custom @RetryTest annotations, and rely on Xvfb virtual display + Jetty server for test execution. In parallel, ~175 Cypress specs (TypeScript) run across 12 CI containers.
Both test suites should migrate to Playwright. This section focuses on the Selenium migration, which is the larger and more complex effort.
11.2 Why Migrate Now
| Driver | Details |
|---|---|
| Maintenance burden | 20+ disabled Selenium tests with "to be fixed later" notes; custom retry infrastructure needed for stability |
| Browser limitation | Selenium tests only run Firefox; no Safari or Chrome coverage |
| Infrastructure cost | 20 parallel runners + Xvfb virtual display is heavyweight compared to Playwright's headless-by-default approach |
| Framework age | Selenium 4.x is stable but less developer-friendly than Playwright's auto-waiting, built-in assertions, and tracing |
| Consolidation | Running 3 E2E frameworks (Selenium + Cypress + Playwright) across 3 repos is unsustainable; converging to 1 cuts maintenance by ~60% |
| BLAST compatibility | Next.js E2E testing is natively supported by Playwright; Selenium has no first-class Next.js integration |
11.3 Migration Strategy: AI-Accelerated Conversion
The migration of ~176 Selenium tests is a high-volume, pattern-based task — ideal for AI acceleration. Using Claude Code (and optionally Augment Code for codebase-aware suggestions), the migration can be completed in a fraction of the time manual rewriting would require.
Phase 1 — Foundation (Weeks 1-2)
Set up the Playwright project and migrate the base infrastructure:
- Create Playwright project in the existing `duetto-playwright-e2e` repo (or a new directory in `duetto`)
- Convert base test utilities:
  - Selenium `WebDriver` setup → Playwright `Browser`/`BrowserContext`/`Page`
  - Custom `@RetryTest` annotation → Playwright's built-in `retries` config
  - Xvfb display server → Playwright headless mode (no display server needed)
  - Jetty server start/stop → Playwright `webServer` config in `playwright.config.ts`
  - MongoDB/Redis setup → reuse existing GitHub Actions setup or migrate to Testcontainers
- Create a mapping reference for the AI tools:
| Selenium (Java) | Playwright (TypeScript) |
|---|---|
| `driver.findElement(By.id("x"))` | `page.locator('#x')` |
| `driver.findElement(By.cssSelector(".x"))` | `page.locator('.x')` |
| `driver.findElement(By.xpath("//div"))` | `page.locator('div')` or `page.locator('xpath=//div')` |
| `element.click()` | `await locator.click()` |
| `element.sendKeys("text")` | `await locator.fill('text')` |
| `element.getText()` | `await locator.textContent()` |
| `element.isDisplayed()` | `await locator.isVisible()` |
| `new WebDriverWait(driver, 10).until(...)` | Auto-waiting built into Playwright actions |
| `driver.navigate().to(url)` | `await page.goto(url)` |
| `driver.switchTo().frame(...)` | `await page.frameLocator(...)` |
| `Thread.sleep(ms)` | `await page.waitForSelector(...)` or `await expect(locator).toBeVisible()` |
| `Actions(driver).moveToElement(e)` | `await locator.hover()` |
| `Select(element).selectByValue(v)` | `await locator.selectOption(v)` |
| `driver.manage().window().setSize(...)` | `await page.setViewportSize({...})` |
| `Assert.assertEquals(expected, actual)` | `await expect(locator).toHaveText(expected)` |
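The retry and server-startup replacements in the base-infrastructure conversion land in Playwright configuration rather than in test code. A hedged sketch of what the `playwright.config.ts` might look like — the command, port, and retry count are placeholders, not Duetto's actual values:

```typescript
// playwright.config.ts — illustrative only; command and port are assumptions.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 2,               // replaces the custom @RetryTest annotation
  use: { headless: true },  // replaces the Xvfb virtual display
  webServer: {
    command: 'npm run start:test-server', // replaces Jetty start/stop
    port: 3000,
    reuseExistingServer: true,
  },
});
```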
Phase 2 — AI-Powered Page Object Conversion (Weeks 3-6)
Use Claude Code to bulk-convert Selenium Page Objects to Playwright:
Prompt strategy for Claude Code:
Given this Selenium Page Object class (Java), convert it to a Playwright
Page Object (TypeScript). Follow these rules:
1. Replace all Selenium WebDriver calls with Playwright equivalents
2. Replace explicit waits (WebDriverWait) with Playwright auto-waiting
3. Replace Thread.sleep() with proper Playwright waitFor* methods
4. Convert Java assertions to Playwright expect() assertions
5. Use Playwright's built-in locator strategies (prefer role, text,
test-id over CSS/XPath)
6. Keep the Page Object pattern but adapt to TypeScript class syntax
7. Add proper TypeScript types for all methods
8. Replace any Selenium-specific retry logic with Playwright's
built-in retry mechanisms
Source Selenium class:
[paste class]
Output the Playwright TypeScript equivalent.
Batch conversion workflow:
- List all Page Objects from `selenium/src/test/java/com/duetto/frontend/selenium/`
- Prioritize by business value: start with pages that cover critical user journeys (login, pricing, rate management, dashboard)
- Feed each Page Object to Claude Code with the prompt above
- Human review: engineer verifies each converted Page Object — check locator strategies, ensure business logic is preserved, validate assertions
- Run both old and new tests in parallel for the same pages to validate conversion accuracy
Expected velocity with AI assistance:
- Manual conversion: ~2-3 Page Objects per engineer per day
- AI-assisted conversion: ~10-15 Page Objects per engineer per day (3-5x speedup)
- Human review remains essential — AI may miss Duetto-specific patterns, custom wait conditions, or domain-specific assertions
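For illustration, here is the shape of output the conversion aims for — a hypothetical `LoginPage`, not an actual Duetto Page Object. The minimal `Page`/`Locator` interfaces are stubbed inline so the sketch is self-contained; a real conversion would import `Page` from `@playwright/test` instead:

```typescript
// Stubs of the Playwright surface used below — real code imports these
// types from '@playwright/test' rather than declaring them.
interface Locator {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface Page {
  goto(url: string): Promise<void>;
  getByTestId(id: string): Locator;
}

// Hypothetical converted Page Object: Selenium By.id lookups become
// Playwright locators, and explicit waits disappear (auto-waiting).
class LoginPage {
  constructor(private readonly page: Page) {}

  async open(): Promise<void> {
    await this.page.goto('/login');
  }

  async login(user: string, password: string): Promise<void> {
    await this.page.getByTestId('username').fill(user);
    await this.page.getByTestId('password').fill(password);
    await this.page.getByTestId('submit').click();
  }
}
```

Note the locator preference (test-id here) matches the prompt rules above: role, text, or test-id over raw CSS/XPath.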
Phase 3 — Test Conversion (Weeks 5-10)
Convert test files in priority order:
| Priority | Tests | Criteria |
|---|---|---|
| P0 | Login, authentication, critical navigation | Break = users can't access the product |
| P1 | Pricing, rate management, revenue dashboards | Core business functionality |
| P2 | Settings, admin, user management | Important but lower traffic |
| P3 | Disabled tests (~20+) | Evaluate: convert or permanently delete |
For each test file:
1. Feed the Selenium test + its Page Objects to Claude Code
2. AI generates the Playwright equivalent
3. Engineer reviews, adjusts for Duetto-specific patterns
4. Run the new Playwright test against the same environment
5. Validate it covers the same scenarios (compare step-by-step)
6. Once green, mark the Selenium test for deprecation
Claude Code skills for migration:
Consider creating a dedicated Claude Code skill (.claude/skills/selenium-to-playwright.md) that encodes:
- The mapping reference table above
- Duetto-specific Page Object conventions
- Common patterns in Duetto's Selenium tests (e.g., how they handle MongoDB test data, Jetty server initialization)
- Preferred Playwright locator strategy (test-id > role > text > CSS > xpath)
- Assertion patterns used in the Playwright E2E repo
Phase 4 — Validation and Cutover (Weeks 9-12)
- Parallel run period (2-3 weeks):
  - Run both Selenium and Playwright suites in CI
  - Track: same scenarios should produce same pass/fail results
  - Investigate any discrepancies — usually timing or locator differences
- Decommission Selenium:
  - Remove Selenium tests from CI pipeline
  - Archive (don't delete) the Selenium directory for reference
  - Remove Selenium dependencies from `build.gradle`
  - Remove Xvfb and Firefox setup from GitHub Actions workflows
  - Reduce CI runners from 20 → Playwright's built-in sharding
- Update CI infrastructure:
  - Replace 20 Selenium runners with Playwright `--shard` across fewer runners
  - Playwright's native parallelization typically needs 3-5 shards for equivalent coverage
  - Expected CI time reduction: 40-60%
11.4 Cypress Migration (Parallel Track)
The Cypress-to-Playwright migration follows a similar pattern but is simpler due to both being JavaScript/TypeScript:
| Selenium → Playwright | Cypress → Playwright |
|---|---|
| Language change (Java → TS) | Same language (TS → TS) |
| Page Object pattern rewrite | Page Object pattern adaptation |
| New assertion library | Similar assertion patterns |
| ~176 tests | ~175 specs |
| 12 weeks | 6-8 weeks |
Key Cypress → Playwright differences:
| Cypress | Playwright |
|---|---|
| `cy.visit(url)` | `await page.goto(url)` |
| `cy.get('.selector')` | `page.locator('.selector')` |
| `cy.contains('text')` | `page.getByText('text')` |
| `cy.intercept()` | `page.route()` |
| `cy.wait('@alias')` | `await page.waitForResponse(...)` |
| Automatic chaining | Explicit `await` on each action |
| `cy.request()` | `request.get()` / `request.post()` |
Claude Code can perform this conversion even faster since no language change is involved. Expected velocity: 15-25 specs per engineer per day with AI assistance.
11.5 Migration Timeline
Month 1 Month 2 Month 3 Month 4
┌────────────────┬────────────────┬────────────────┬────────────┐
│ SELENIUM │ SELENIUM │ SELENIUM │ │
│ Phase 1: │ Phase 2-3: │ Phase 3-4: │ Selenium │
│ Foundation + │ AI-convert │ Convert P2-P3 │ decomm. │
│ base infra │ P0-P1 tests │ + validation │ │
├────────────────┼────────────────┼────────────────┤ │
│ CYPRESS │ CYPRESS │ │ Cypress │
│ Phase 1: │ Phase 2: │ Cypress │ decomm. │
│ Stop new tests │ AI-convert │ validation │ │
│ + P0 migration │ P1-P3 tests │ + cutover │ │
└────────────────┴────────────────┴────────────────┴────────────┘
Total effort estimate:
- With AI assistance: 2-3 engineers × 3 months (including validation)
- Without AI: 3-4 engineers × 6 months
- AI acceleration saves ~50% of migration effort
11.6 Risk Mitigation
| Risk | Mitigation |
|---|---|
| AI generates incorrect locators | Every converted test must be run and verified by an engineer; use Playwright Codegen to validate locator strategies |
| Test behavior changes during conversion | Run old and new tests in parallel during validation; compare results test-by-test |
| Team bandwidth | Spread migration across teams — each team converts its own Selenium/Cypress tests, guided by QE |
| Loss of Selenium-specific infrastructure | Document all custom Selenium utilities before removal; ensure Playwright equivalents exist |
| Disabled tests never get converted | Audit disabled tests in Phase 1: decide convert or delete. No indefinite quarantine during migration |
12. Implementation Roadmap
Phase 1 — Foundation (Months 1-3)
App/Platform track:
- [ ] Define Quality Engineer role and career ladder (both tracks)
- [ ] Hire or reassign 2-3 App/Platform QEs for pilot embedding (choose teams with willing tech leads)
- [ ] Audit current testing: what exists, what's automated, where are gaps per team
- [ ] Establish basic testing standards document (guild v0 — shared + track-specific sections)
- [ ] Implement Phase 1 CI quality gates (build, unit tests, linting, schema checks)
- [ ] Stop writing new Cypress and Selenium tests — all new E2E in Playwright
- [ ] Begin Selenium-to-Playwright migration: foundation + base infrastructure (Section 11)
- [ ] Create Claude Code skill for Selenium-to-Playwright conversion patterns
- [ ] Add JaCoCo (backend) and Jest --coverage (frontend) for coverage visibility
- [ ] Make SpotBugs blocking for new violations (baseline existing)
- [ ] Set up shared Testcontainers configurations for common Duetto infrastructure
- [ ] Create quality metrics dashboard in DataDog (basic version — both tracks)
Intelligence track:
- [ ] Audit Intelligence repo testing: catalog test counts, coverage gaps, and existing strengths per repo
- [ ] Standardize Ruff config across all Python ML repos (line-length, rule sets, pre-commit)
- [ ] Extend MyPy strict mode to all Intelligence Python repos (currently only pricerator, ml_elasticity)
- [ ] Add pytest --coverage to Intelligence CI pipelines for visibility (no thresholds yet)
- [ ] Document existing MLflow, Great Expectations, and golden file testing practices
- [ ] Automate Hammer: GitHub Actions workflow for pricing PRs + curated hotel sample (Section 4.6.7)
- [ ] Define Hammer tolerance thresholds (warn vs block) to replace binary pass/fail
- [ ] Add scheduled nightly Hammer run on develop with Slack notifications
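The tolerance-threshold idea for Hammer can be sketched as a small decision function. This is written in TypeScript purely for consistency with the other examples in this document — Hammer itself is JVM-based, and the percentage thresholds below are placeholders for the Intelligence track to calibrate:

```typescript
// Sketch of a warn/block tolerance gate for pricing regression diffs,
// replacing binary pass/fail. Thresholds are illustrative placeholders.
type GateResult = "pass" | "warn" | "block";

const WARN_PCT = 0.5;  // warn if any price drifts more than 0.5%
const BLOCK_PCT = 2.0; // block if any price drifts more than 2%

function priceGate(baseline: number[], candidate: number[]): GateResult {
  let worst = 0;
  for (let i = 0; i < baseline.length; i++) {
    const pctDiff = (Math.abs(candidate[i] - baseline[i]) / baseline[i]) * 100;
    worst = Math.max(worst, pctDiff);
  }
  if (worst > BLOCK_PCT) return "block";
  if (worst > WARN_PCT) return "warn";
  return "pass";
}
```

A "warn" result posts a PR comment; only a "block" result fails the check, so expected small model-driven drift no longer breaks builds.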
Phase 2 — Pilot and Guild Formation (Months 3-6)
App/Platform track:
- [ ] Embed QEs in 3-4 App/Platform pilot teams — focus on coaching, not gatekeeping
- [ ] Form Quality Guild with two-track structure — bi-weekly all-hands, monthly track deep dives
- [ ] Implement Phase 2 CI quality gates (integration tests, contract testing, security scanning)
- [ ] Introduce Pact for first service boundary being extracted from monolith
- [ ] AI-powered bulk conversion of Selenium P0-P1 tests to Playwright (Section 11.3)
- [ ] Begin Cypress-to-Playwright migration (critical paths first)
- [ ] Implement CodeRabbit + Augment Code testing enforcement rules
- [ ] Add Snyk or Trivy for dependency security scanning on PRs
- [ ] Start tracking DORA metrics per team
Intelligence track:
- [ ] Hire or assign 1 Intelligence track QE (Python + ML pipeline experience required)
- [ ] Expand Great Expectations from datapipelines into training repos (forecasting, ml_elasticity)
- [ ] Implement golden file testing for pricerator and group-forecast-service inference endpoints
- [ ] Add MLflow automated accuracy comparison vs baseline (warning gate, not blocking)
- [ ] Create CI pipeline templates for Python ML repos (Ruff + MyPy + pytest + coverage)
- [ ] Begin pytest coverage improvement: target utility functions and data transformations first
- [ ] Hammer Phase 2: structured JSON output, PR comment summaries, DataDog metrics integration
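Golden file testing pins inference output to a stored, reviewed reference. A minimal sketch of the comparison step — in TypeScript for consistency with the other examples here, although the actual repos do this with pytest snapshot files, and the forecast shape and file name below are hypothetical:

```typescript
import * as fs from "node:fs";

// Compare an inference response against a stored golden file. Any
// structural or value difference fails; updating the golden file is a
// deliberate, code-reviewed change, not an automatic refresh.
function matchesGolden(goldenPath: string, actual: unknown): boolean {
  const golden = JSON.parse(fs.readFileSync(goldenPath, "utf-8"));
  return JSON.stringify(golden) === JSON.stringify(actual);
}
```

The value of the pattern is catching unintended output changes from refactors or dependency bumps, while keeping intentional model changes visible in review.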
Phase 3 — Scale and Standardize (Months 6-12)
App/Platform track:
- [ ] Expand App/Platform QEs to cover all App + Platform teams (1 QE per 2-4 teams)
- [ ] Implement Phase 3 CI quality gates (performance testing, coverage thresholds, E2E smoke)
- [ ] Complete Selenium and Cypress decommission (Section 11.5)
- [ ] Introduce mutation testing for critical business logic (pricing, revenue)
- [ ] Build shared test data libraries (hotel, rate, reservation factories)
- [ ] Create CI pipeline templates (reusable GitHub Actions workflows)
- [ ] Establish quality onboarding for new developers
Intelligence track:
- [ ] Intelligence QE embedded across Pricing, Forecasting, and Data teams
- [ ] Implement data quality gates as blocking in Airflow DAGs (Great Expectations)
- [ ] Introduce Hypothesis property-based testing for algorithmic code (pricing constraints, elasticity calculations)
- [ ] pytest coverage threshold for Intelligence repos (40% — lower than app teams but meaningful)
- [ ] Automated champion/challenger model testing before promotion to production
- [ ] Data drift detection alerts integrated with DataDog
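Property-based testing asserts invariants over many generated inputs rather than single hand-picked examples. A hand-rolled TypeScript sketch of the idea for a hypothetical price-clamping rule — Hypothesis automates the input generation (and failure shrinking) in the Python repos, so treat this only as an illustration of the concept:

```typescript
// Hypothetical pricing constraint: the recommended price is clamped to
// the hotel's [floor, ceiling] band.
function clampPrice(raw: number, floor: number, ceiling: number): number {
  return Math.min(ceiling, Math.max(floor, raw));
}

// Minimal deterministic generator (MINSTD LCG) standing in for a
// Hypothesis strategy.
function* prices(seed: number, n: number): Generator<number> {
  let s = seed;
  for (let i = 0; i < n; i++) {
    s = (s * 48271) % 2147483647;
    yield (s / 2147483647) * 1000; // price in [0, 1000)
  }
}

// The property: output is always inside the band, and inputs already
// inside the band pass through unchanged.
function propertyHolds(floor: number, ceiling: number): boolean {
  for (const raw of prices(42, 500)) {
    const p = clampPrice(raw, floor, ceiling);
    if (p < floor || p > ceiling) return false;
    if (raw >= floor && raw <= ceiling && p !== raw) return false;
  }
  return true;
}
```

The same pattern applies to elasticity calculations and optimization solvers: state the invariant (monotonicity, bounds, conservation), then let the framework hunt for counterexamples.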
Phase 4 — Optimize and Mature (Months 12-18)
Both tracks:
- [ ] All teams have QE support (embedded or shared)
- [ ] Implement Phase 4 quality gates (flaky test quarantine, test impact analysis, visual regression)
- [ ] Explore chaos engineering (AWS FIS + Toxiproxy for critical services)
- [ ] Evaluate AI-powered test generation tools (Diffblue for Java, Claude Code test skills)
- [ ] Formalize parallel run testing for high-risk service extractions
- [ ] Annual review: guild health, tooling satisfaction, quality metrics by track
Intelligence-specific:
- [ ] End-to-end ML pipeline validation (data → training → inference → output) as CI workflow
- [ ] Model monitoring dashboards per model family (forecasting, elasticity, pricing optimizer)
- [ ] Reproducibility CI checks: same config + data → deterministic output
- [ ] Hammer Phase 3: containerize, decouple from monolith Spring context, AI-assisted diff analysis
13. Pre-Identified Potential Initiatives
The following initiatives have been identified during the analysis that produced this strategy. They are listed here as a starting backlog — not a commitment. Final prioritisation, sequencing, and scope will be determined by the Quality Guild and Quality Engineering Team once established.
Each initiative is tagged with the team charter it belongs to: Guild (TC-006) for governance and standards work, or QE Team (TC-007) for infrastructure and tooling delivery. QE Team initiatives are split by track.
13.1 Quality Guild Initiatives (TC-006)
These initiatives relate to governance, standards, coaching, and culture — owned by the guild as a community of practice.
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Author testing standards document v1 — shared section + track-specific sections, including engineer testing responsibilities (unit/integration/E2E ownership expectations, PR review standards) | 1 | Cross-Team Testing Standard Adoption |
| 2 | Establish flaky test policy and SLA — quarantine rules, remediation ownership, maximum days disabled | 1-2 | Flaky Test Rate |
| 3 | AI-generated code QA strategy — mutation testing adoption criteria, AI-generated test review guidelines, tautological test detection | 2 | Escaped Defect Rate |
| 4 | DORA metrics tracking and review — define measurement per team, monthly review cadence with EMs | 2 | DORA Change Failure Rate |
| 5 | Testing standards document v2 — add Intelligence track sections (ML testing diamond, data quality standards, pipeline testing expectations, Intelligence-specific quality gates) | 2 | Cross-Team Testing Standard Adoption |
13.2 Quality Engineering Team Initiatives — Shared (TC-007)
Infrastructure and tooling that serves both App/Platform and Intelligence tracks.
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Phase 1 CI quality gates — build and improve existing checks: build verification, unit tests, linting/formatting, GraphQL schema checks, code coverage visibility (JaCoCo, Jest, pytest — reporting only, no thresholds) | 1 | Quality Gate Adoption Rate |
| 2 | Quality metrics dashboard in DataDog — test health, pipeline health, per-team and per-track views | 1 | Quality Gate Adoption Rate |
| 3 | Phase 2 CI quality gates — integration tests, contract verification, dependency security scanning (Snyk/Trivy, critical/high = blocking), coverage delta PR comments | 2 | Quality Gate Adoption Rate |
| 4 | CodeRabbit + Augment Code configuration — path-based test enforcement rules, anti-pattern detection, semantic test gap review | 2 | Quality Gate Adoption Rate |
| 5 | Flaky test auto-quarantine system — detection (>3 failures in 7 days), automatic quarantine, Jira ticket creation, unified dashboard | 2 | Defect Detection Rate |
| 6 | Incident RCA tagging model — establish `caught-in-ci` / `escaped-to-production` / `could-have-been-caught` / `not-ci-detectable` classification in Jira | 2 | Defect Detection Rate |
| 7 | Phase 3 CI quality gates and reusable CI pipeline templates — k6 performance regression, coverage thresholds (60% blocking), E2E smoke on staging; published as versioned GitHub Actions workflows for Java/Spring, React/Next.js, Python ML | 3 | Quality Gate Adoption Rate |
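The detection rule behind the flaky-test auto-quarantine initiative (>3 failures in 7 days) is simple enough to sketch. A minimal TypeScript illustration — the event shape is an assumption; the real system would read CI results from GitHub Actions and open the Jira ticket automatically:

```typescript
// Flaky-test quarantine rule: a test with more than 3 failures in the
// trailing 7 days gets quarantined. The TestFailure shape is illustrative.
interface TestFailure {
  testId: string;
  failedAt: Date;
}

function testsToQuarantine(failures: TestFailure[], now: Date): string[] {
  const windowStart = now.getTime() - 7 * 24 * 60 * 60 * 1000;
  const counts = new Map<string, number>();
  for (const f of failures) {
    if (f.failedAt.getTime() >= windowStart) {
      counts.set(f.testId, (counts.get(f.testId) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, n]) => n > 3) // the ">3 failures in 7 days" rule
    .map(([id]) => id);
}
```

Quarantined tests stay visible on the unified dashboard with an owning team and SLA, per the guild's flaky test policy — quarantine is a remediation queue, not a graveyard.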
13.3 Quality Engineering Team Initiatives — App/Platform Track (TC-007)
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Selenium-to-Playwright migration — foundation, base infrastructure, Claude Code migration skill; stop new Cypress and Selenium tests, all new E2E in Playwright (Section 11) | 1 | E2E Framework Consolidation |
| 2 | Make SpotBugs blocking for new violations — baseline existing, fail PR on new | 1 | Quality Gate Adoption Rate |
| 3 | Shared Testcontainers configurations — MongoDB, PostgreSQL, Redis, LocalStack (SQS/SNS/Kinesis/S3), RabbitMQ | 1 | Quality Gate Adoption Rate |
| 4 | E2E framework bulk conversion — AI-powered Selenium P0-P1 test conversion with parallel run validation + Cypress critical path migration (Sections 11.3, 11.4) | 2 | E2E Framework Consolidation |
| 5 | Pact broker setup — contract broker infrastructure, `can-i-deploy` gate, message contract support for first service extraction | 2 | Quality Gate Adoption Rate |
| 6 | Selenium and Cypress decommission — remove from CI, archive tests, remove 20 Selenium runners + 12 Cypress containers, replace with Playwright sharding | 3 | E2E Framework Consolidation |
| 7 | Mutation testing infrastructure — PIT for Java, Stryker for JS/TS on critical business logic (pricing, revenue) | 3 | Defect Detection Rate |
| 8 | Shared test data libraries — factories and fixtures for hotels, rates, reservations, users | 3 | CI Build Success Rate |
13.4 Quality Engineering Team Initiatives — Intelligence Track (TC-007)
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Audit Intelligence repo testing — catalog test counts, coverage gaps, and existing strengths per repo | 1 | Quality Gate Adoption Rate |
| 2 | Standardize Python tooling across all Intelligence repos — unified Ruff config (line-length, rule sets, pre-commit) and MyPy strict mode (currently only pricerator, ml_elasticity) | 1 | Quality Gate Adoption Rate |
| 3 | Hammer CI automation — GitHub Actions for pricing PRs + curated hotel sample + nightly develop run + Slack notifications + configurable tolerance thresholds replacing binary pass/fail (Section 4.6.7) | 1 | Hammer CI Automation |
| 4 | Expand Great Expectations from datapipelines into training repos (forecasting, ml_elasticity) | 2 | Quality Gate Adoption Rate |
| 5 | Model and inference validation — golden file testing for inference endpoints (pricerator, group-forecast-service) + MLflow automated accuracy comparison vs baseline (warning gate, not blocking) | 2 | Defect Detection Rate |
| 6 | Python ML CI pipeline templates — Ruff + MyPy strict + pytest + coverage as reusable GitHub Actions | 2 | Quality Gate Adoption Rate |
| 7 | Hammer structured reporting — JSON output, PR comment summaries, DataDog metrics integration | 2 | Hammer CI Automation |
| 8 | Data quality gates as blocking in Airflow DAGs — Great Expectations suites per pipeline stage | 3 | Defect Detection Rate |
| 9 | Hypothesis property-based testing for algorithmic code — pricing constraints, elasticity calculations, optimization solvers | 3 | Defect Detection Rate |
| 10 | pytest coverage thresholds for Intelligence repos — 40% initial target | 3 | Quality Gate Adoption Rate |
| 11 | Automated champion/challenger model testing before promotion to production | 3 | Defect Detection Rate |
| 12 | Data drift detection alerts integrated with DataDog | 3 | Defect Detection Rate |
13.5 Initiative Summary
| Owner | Count | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|---|---|---|---|---|---|
| Quality Guild | 5 | 2 | 3 | 0 | 0 |
| QE Team — Shared | 7 | 2 | 4 | 1 | 0 |
| QE Team — App/Platform | 8 | 3 | 2 | 3 | 0 |
| QE Team — Intelligence | 12 | 3 | 4 | 5 | 0 |
| Total | 32 | 10 | 13 | 9 | 0 |
Note: These are pre-identified potential initiatives derived from the analysis in this strategy document. They are not commitments. The Quality Guild and Quality Engineering Team will refine, reprioritise, merge, or discard initiatives as they begin work and learn from early phases. Formal APEX initiative IDs (I-YYYY-XX-NNN) will be assigned when initiatives are approved and enter the APEX pipeline.
14. Frequently Asked Questions
Q: Does this mean we're getting rid of manual QA? A: We're evolving it. Manual exploratory testing remains valuable — trained exploratory testers consistently find bugs that automated tests miss. But manual regression testing is eliminated. Developers own automated testing; QE focuses on strategy, coaching, and high-value exploratory work.
Q: Who writes the tests — developers or QEs? A: Developers write unit and integration tests for their code. QEs define the test strategy, coach developers on what and how to test, conduct exploratory testing, and build shared infrastructure. Think of it like security: every developer writes secure code, but security engineers set standards and do penetration testing.
Q: How does this work with AI-generated code? A: AI can generate tests, but AI-generated tests have known risks (tautological tests, testing implementation not behavior). QEs review test strategy and effectiveness, mutation testing validates test suite quality, and CodeRabbit enforces testing standards in code review.
Q: Won't this slow teams down? A: Short-term, introducing quality gates adds friction. Long-term, it reduces escaped defects, incident frequency, and time spent firefighting. Google's data shows that investing in testing infrastructure pays back within 6-12 months through fewer production issues and faster development velocity.
Q: What about our existing Automation Engineers? A: They can transition to Quality Engineers (with coaching/strategy focus) or join the Quality Engineering team (infrastructure focus), depending on their strengths and interests. Both paths are valuable.
Q: Why Playwright over Cypress? A: Multi-browser support, Java bindings for backend teams, native parallelization (free vs. Cypress Cloud paid), built-in visual regression, superior Next.js/BLAST compatibility, and 2-3x faster CI execution. See Section 4.4 for full comparison.
Q: How does this strategy apply to Intelligence/ML teams? A: Intelligence teams operate on a different technology stack (Python, LightGBM, Airflow, MLflow) and face different quality risks (data drift, model accuracy degradation, pipeline failures). The guild has a dedicated Intelligence track with its own QE(s), testing practices (data quality, pipeline tests, model validation), and tools (Great Expectations, golden file testing, Hypothesis). The standard testing honeycomb is replaced by a "testing diamond" adapted for ML systems. See Section 4.6.
Q: Do ML teams need the same coverage thresholds as App teams? A: No. ML training code is inherently harder to unit test because outcomes depend on data distributions, not deterministic logic. We set a lower initial coverage threshold (40% vs 60% for App) but target the testable parts — utility functions, data transformations, API layers, and configuration validation. Quality in ML comes from data validation (Great Expectations), model validation (MLflow metrics), and pipeline testing — not just code coverage.
Q: How long will the Selenium and Cypress migration take? A: With AI-assisted conversion (Claude Code), approximately 3 months with 2-3 engineers. Without AI, it would take 6+ months. The migration runs Selenium and Cypress tests in parallel with new Playwright tests during validation, so there's no coverage gap. See Section 11 for the full plan.
References
- Google, Software Engineering at Google — Chapter 11 (Testing Overview), Chapter 12 (Unit Testing)
- Google Research, State of Mutation Testing at Google (2018)
- Google Testing Blog, Where Do Our Flaky Tests Come From? (2017)
- GitLab, Testing Guide (docs.gitlab.com/ee/development/testing_guide)
- Atlassian, Quality Assistance vs Quality Assurance model
- Spotify, Testing Honeycomb and Squad/Chapter/Guild organizational model
- Stripe, Paved Road approach to developer experience
- Netflix, Paved Road infrastructure and Toxiproxy
- Microsoft, Combined Engineering announcement (2014)
- Pact Foundation, Consumer-Driven Contract Testing (pact.io)
- CodeRabbit, Skills and Configuration documentation
- Google, Reliable Machine Learning — ML testing and validation patterns
- Great Expectations, Data Quality and Testing documentation (greatexpectations.io)
- Hypothesis, Property-Based Testing for Python (hypothesis.readthedocs.io)
Document History
| Date | Author | Change |
|---|---|---|
| 2026-03-04 | Antonio Cortés | Initial draft |
| 2026-03-04 | Antonio Cortés | Added: engineer testing responsibilities (3.4), Augment Code evaluation (5.5), current CI/CD state analysis (Section 8), Selenium-to-Playwright migration plan (Section 11). Updated QE staffing to 4-6 embedded. |
| 2026-03-04 | Antonio Cortés | Added: Intelligence domain testing strategy (Section 4.6) with ML testing diamond, repo analysis, and ML-specific quality gates. Restructured guild model to two-track (App/Platform + Intelligence) with separate embedded QE profiles, cadence, standards, tooling, and roadmap items. |
| 2026-03-04 | Antonio Cortés | Updated L6 title to Staff/Lead QE. Changed language from "recommendation" to "proposal" throughout. Added phasing notes: Intelligence track as Phase 2 for the Quality Guild. Updated folder structure. Staff/Lead QE leads Quality Engineering Team with automation architecture responsibilities. |
| 2026-03-04 | Antonio Cortés | Added Hammer pricing regression testing analysis and 3-phase modernization plan (Section 4.6.7). Added Hammer to existing strengths (4.6.5) and test infrastructure summary (Section 8.5). Added Hammer automation items to implementation roadmap (Phases 1, 2, and 4 Intelligence track). |
| 2026-03-04 | Antonio Cortés | Renamed "Quality Platform Team" to "Quality Engineering Team" throughout to avoid confusion with the App/Platform engineering domain. |
| 2026-03-05 | Antonio Cortés | Renamed document from "QA & Automation Strategy" to "Quality Engineering Strategy." Added Section 13: pre-identified potential initiatives for Quality Guild (5), QE Team Shared (7), QE Team App/Platform (8), and QE Team Intelligence (12) — 32 total. |