
Quality Engineering Strategy: Best Practices for Duetto

Antonio Cortés Updated 2026-03-11


Author: Antonio Cortés
Date: 2026-03-04
Status: DRAFT
Audience: Engineering Leadership, Engineering Managers, Tech Leads
Applies to: All engineering teams — App, Intelligence, and Platform areas

Related Team Charters:
- Quality Guild Charter (TC-006)
- Quality Engineering Team Charter (TC-007)


1. Executive Summary

This document proposes a quality engineering strategy for Duetto, informed by industry best practices from organizations like Google, Spotify, Atlassian, GitLab, Stripe, and Netflix. It addresses Duetto's current challenges — manual QA, inconsistent ways of working across teams, and the unique demands of AI-first engineering — and proposes a path forward.

Current State

| Dimension | Current Reality |
|---|---|
| QA model | Manual QA + some Automation Engineers, not embedded in teams |
| Ways of working | Inconsistent across tens of teams — no shared testing standards |
| Test automation | Some automation exists but coverage, tools, and practices vary by team |
| E2E testing | Both Playwright and Cypress in use — no consolidation strategy |
| AI-generated code | 50-70% of code is AI-generated, but QA practices haven't adapted |
| Architecture transition | Active monolith-to-microservices migration creates new testing challenges at service boundaries |
| Intelligence / ML testing | Minimal test coverage in ML repos (~100 tests across 15+ repos); existing strengths in MLflow, MyPy strict mode, Great Expectations, and pre-commit hooks — but no shared ML testing standards |

Proposed Target State

| Dimension | Target |
|---|---|
| QA model | Hybrid/Guild with two tracks (App/Platform QEs + Intelligence ML/Data QE) plus a central Quality Engineering team |
| Ways of working | Shared testing standards with track-specific adaptations, enforced through CI/CD quality gates and the Quality Guild |
| Test automation | Developer-owned testing with QE coaching; testing honeycomb for App/Platform, testing diamond for Intelligence (data quality, pipeline tests, model validation) |
| E2E testing | Consolidated on Playwright (App/Platform); golden file + data quality testing (Intelligence) |
| AI-generated code | Specific QA strategies for AI-generated code; mutation testing; AI-aware code review |
| Architecture transition | Contract testing (Pact) at service boundaries; parallel run testing for high-risk extractions |

2. QA Organizational Model

2.1 Why the Hybrid/Guild Model

Three QA models exist in the industry. For Duetto's context (tens of teams, lean core, monolith-to-microservices migration), the Hybrid/Guild model is the strongest fit.

| Model | How it works | Pros | Cons | Used by |
|---|---|---|---|---|
| Centralized | Separate QA team serves all product teams | Consistent standards, clear career ladder | Bottleneck, "throw it over the wall" mentality, slow feedback | Legacy enterprises |
| Embedded | QA engineers sit within product teams | Deep domain context, fast feedback, team ownership | Isolated QA, inconsistent practices, no shared infrastructure | Spotify (evolved) |
| Hybrid/Guild | Embedded in teams + central enablement guild | Best of both: context + consistency, shared infrastructure, career growth | Requires guild leadership, matrix reporting complexity | GitLab, Stripe, Atlassian |

Why not centralized: With tens of teams, a centralized QA team would be a severe bottleneck. It reinforces the "QA tests my code" anti-pattern that doesn't scale.

Why not pure embedded: Without a guild, tens of teams would each invent their own practices. QA engineers would be isolated with no career path and no shared tooling.

Why hybrid/guild: Embedded QE gets domain context for microservices testing. The central guild provides consistency, shared infrastructure, career development, and drives org-wide quality initiatives — critical during the monolith migration and AI-first transformation.

2.2 Proposed Structure for Duetto

Duetto's engineering organization spans two distinct technology domains: App/Platform (Java/Spring Boot + React/TypeScript) and Intelligence (Python ML/data pipelines). The guild model must serve both, with shared governance but domain-appropriate practices.

Phasing note: The Intelligence track is designed to be added as a second phase of the Quality Guild. Phase 1 focuses on establishing the guild with the App/Platform track (highest team count, largest test debt, Selenium/Cypress migration). Once the guild is operational and the App/Platform practices are stable (approximately months 3-6), the Intelligence track is introduced with its own embedded QE(s) and domain-specific practices. This phased approach avoids overloading the guild at inception and allows the Intelligence track to benefit from the patterns and infrastructure already established by the App/Platform track.

┌──────────────────────────────────────────────────────────────────────┐
│                          Quality Guild                                │
│  (All QEs + interested engineers meet bi-weekly)                      │
│  (Guild lead coordinates standards, hiring, career growth)            │
├──────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌──────────────────────┐   ┌──────────────────────┐                 │
│  │  App/Platform Track   │   │  Intelligence Track   │                │
│  │                       │   │  (ML / Data)          │                │
│  │  Embedded QEs         │   │  Embedded QE(s)       │                │
│  │  (3-5 across          │   │  (1-2 across          │                │
│  │   App + Platform      │   │   Pricing, Forecast,  │                │
│  │   teams)              │   │   Elasticity, Data)   │                │
│  │                       │   │                       │                │
│  │  • Test strategy      │   │  • Data quality       │                │
│  │  • Dev coaching       │   │  • Pipeline testing   │                │
│  │  • Exploratory testing│   │  • Model validation   │                │
│  │  • Quality metrics    │   │  • ML test coaching   │                │
│  │  • Playwright E2E     │   │  • Great Expectations │                │
│  │  • Pact contracts     │   │  • Golden file tests  │                │
│  └──────────────────────┘   └──────────────────────┘                 │
│                                                                       │
│  ┌──────────────────────────────────────────────────────────────┐    │
│  │       Quality Engineering Team (2-3 people)                      │    │
│  │       Led by Staff/Lead QE                                     │    │
│  │                                                               │    │
│  │  Shared across both tracks:                                   │    │
│  │  • CI/CD quality gates (GitHub Actions templates)             │    │
│  │  • Flaky test detection & remediation systems                 │    │
│  │  • Quality dashboards (DataDog — per team, per track)         │    │
│  │  • AI code testing tools (CodeRabbit, Augment Code config)    │    │
│  │                                                               │    │
│  │  App/Platform-specific:        Intelligence-specific:         │    │
│  │  • Testcontainers configs      • Great Expectations infra     │    │
│  │  • Playwright infra            • MLflow validation gates      │    │
│  │  • Pact broker management      • Data drift monitoring        │    │
│  │  • Test data factories         • Pipeline test frameworks     │    │
│  └──────────────────────────────────────────────────────────────┘    │
│                                                                       │
└──────────────────────────────────────────────────────────────────────┘

App/Platform Track

Embedded Quality Engineers (3-5 people):
- 1 QE serves 2-4 App or Platform teams — lean, coaching-focused
- Embedded in team ceremonies (standups, planning, retros)
- Focuses on test strategy, exploratory testing, and developer coaching
- Expertise: Java/Spring Boot testing, React/TypeScript testing, Playwright E2E, Pact contracts
- Does NOT do all the testing — developers own their tests

Intelligence Track (ML / Data) — Phase 2

This track is proposed as a second phase, introduced once the Quality Guild is established with the App/Platform track (see phasing note above). During Phase 1, Intelligence teams benefit from shared guild standards (Ruff, pre-commit, coverage visibility) and Quality Engineering infrastructure, but the dedicated Intelligence QE(s) and ML-specific practices are added in Phase 2.

Embedded Quality Engineer(s) (1-2 people):
- Serves Pricing, Forecasting, Elasticity, Anomaly Detection, and Data Pipeline teams
- Different skill set than App/Platform QEs — requires Python, data engineering, and ML pipeline experience
- Focuses on data quality strategy (Great Expectations), pipeline testing, model validation gates, and golden file testing
- Coaches ML engineers on testing the testable parts — utility functions, transformations, API layers — rather than forcing the honeycomb onto ML training code
- Does NOT validate model accuracy (that's the ML engineer's domain) — instead ensures the infrastructure for validation exists (MLflow gates, champion/challenger pipelines, drift detection)

Why Intelligence needs its own QE track:
- The technology stack (Python, LightGBM, Airflow, MLflow) shares almost nothing with the App stack (Java, React, Playwright)
- Quality risks are different: data drift and model degradation vs. UI bugs and API contract violations
- Testing tools are different: Great Expectations and golden file testing vs. Playwright and Pact
- A QE trained in Playwright and Pact cannot coach an ML engineer on data validation or pipeline testing — and vice versa
- However, both tracks share governance, CI/CD infrastructure, and quality culture — hence one guild, two tracks

Quality Engineering Team (2-3 people, led by Staff/Lead QE):

The team is led by the Staff/Lead QE (L6) — the most senior technical IC in the guild. This person:
- Sets the technical vision for test automation across the organization — framework choices, infrastructure design, migration strategies (e.g., Selenium-to-Playwright)
- Designs reusable CI/CD quality gate architecture — GitHub Actions templates, quality gate tiers, pipeline optimization
- Makes tooling decisions — evaluates and selects tools (Playwright over Cypress, Pact for contracts, Great Expectations for data quality), defines integration patterns
- Acts as the technical counterpart to the Guild Lead — the Guild Lead owns people and governance, the Staff/Lead QE owns technical strategy and infrastructure
- Mentors Quality Engineering team members and embedded QEs on automation best practices

The team as a whole:
- Builds and maintains shared test infrastructure across both tracks
- Owns CI/CD quality gates and pipeline optimization
- Manages flaky test detection, remediation systems, and quality dashboards
- Develops AI-specific testing strategies and tooling (CodeRabbit, Augment Code)
- Intelligence-specific: maintains Great Expectations infrastructure, MLflow validation gates, data drift monitoring
- App/Platform-specific: maintains Testcontainers configs, Playwright infrastructure, Pact broker
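At its core, flaky test detection is a rerun-and-classify loop: rerun a test under identical conditions and flag it when both outcomes appear. A minimal sketch of that idea (not the actual Duetto tooling; `run_test` is a stand-in for invoking a real test):

```python
import random

def classify_test(run_test, reruns: int = 10) -> str:
    """Rerun a test and classify it as 'pass', 'fail', or 'flaky'.

    A test that both passes and fails across identical reruns is flaky
    by definition; a deterministic test yields exactly one outcome.
    """
    outcomes = {run_test() for _ in range(reruns)}
    if outcomes == {True}:
        return "pass"
    if outcomes == {False}:
        return "fail"
    return "flaky"

# Stand-ins for real test invocations:
always_passes = lambda: True
rng = random.Random(42)
sometimes_fails = lambda: rng.random() < 0.7  # nondeterministic test

print(classify_test(always_passes))    # pass
print(classify_test(sometimes_fails))  # flaky (both outcomes observed)
```

A real remediation system would layer ownership lookup and SLA tracking on top of this classification, but the detection signal is just disagreement between identical reruns.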

Quality Guild:
- All QEs from both tracks + interested developers and ML engineers meet bi-weekly
- Guild lead (can be an Engineering Manager) coordinates standards, hiring, and career growth
- Two-track agenda: shared topics (CI/CD, AI code quality, metrics) + rotating track-specific deep dives
- Cross-pollination between tracks: App engineers learn about data quality; ML engineers learn about contract testing
- Shared playbooks where practices overlap (e.g., pytest best practices, pre-commit hooks, coverage reporting)

2.3 Ratios and Sizing

| Role | Track | Count | Ratio | Notes |
|---|---|---|---|---|
| Embedded QE (App/Platform) | App/Platform | 3-5 | 1 QE : 2-4 teams | Start with 2 in pilot, scale based on results |
| Embedded QE (ML/Data) | Intelligence | 1-2 | 1 QE : 3-5 teams | Phase 2. Requires Python + ML pipeline experience; start with 1 |
| Quality Engineering | Shared | 2-3 | Central team | Led by Staff/Lead QE + 1-2 Engineers |
| Guild Lead | Shared | 1 | Part of EM role | Coordinates both tracks, not a full-time role |
| Total QE headcount | Both | 7-11 | ~1 QE : 12-19 devs | Lean — quality is a shared responsibility |

Hiring priority and phasing: Start with App/Platform QEs in Phase 1 (largest surface area, most teams, highest test debt in Selenium/Cypress migration). The Intelligence track QE is hired in Phase 2 once the guild is operational — this person needs a rare blend of testing expertise and data/ML familiarity, so allow longer time to hire. Intelligence teams still participate in the guild from Phase 1 and benefit from shared standards and infrastructure.

This is significantly leaner than traditional QA models (1:3-5 ratio) because developers own testing. QE enables and elevates; it does not replace developer ownership.


3. Role Definitions

3.1 The Quality Engineer Role (Proposed Primary Role)

The industry is converging on the Quality Engineer as the dominant quality role in modern SaaS. It replaces the traditional QA Engineer role, focusing on prevention over detection and coaching over gatekeeping.

Philosophy shift:
- FROM: "QA finds bugs after development" (detective work)
- TO: "QE prevents bugs by improving the system" (preventive work)

What a Quality Engineer's week looks like at Duetto:

| Day | Activities |
|---|---|
| Monday | Sprint planning — reviewing stories for testability, suggesting acceptance criteria |
| Tuesday | Pair programming with developer on test strategy for new microservice |
| Wednesday | Analyzing test metrics dashboard, identifying flaky tests, reviewing CI pipeline health |
| Thursday | Exploratory testing on high-risk feature, writing up findings and risk assessment |
| Friday | Quality Guild meeting — sharing testing patterns for event-driven architecture |

3.2 Role Comparison

| Dimension | QA Engineer (traditional) | Test Automation Eng. | SDET | Quality Engineer (proposed) |
|---|---|---|---|---|
| Primary focus | Test execution & defect finding | Automated test creation | Test tooling & infrastructure | Quality strategy & enablement |
| Coding level | Light (scripting) | Moderate (test code) | Heavy (production-grade) | Moderate (tooling + automation) |
| Production code | Rarely | Never | Frequently | Sometimes |
| Manual testing | Significant | Minimal | Rare | Strategic/exploratory only |
| Developer coaching | Minimal | Some | Some (on testability) | Primary responsibility |
| Quality strategy | Limited | No | Some | Primary responsibility |
| Typical dev ratio | 1:3-5 | 1:5-8 | 1:8-15 | 1:10-20 |

3.3 Why Not SDET?

The SDET role (Software Development Engineer in Test) was pioneered by Microsoft in the early 2000s at a 1:1 SDE-to-SDET ratio. Microsoft eliminated the title in 2014, merging all SDETs into SDEs under a "Combined Engineering" model. Google similarly moved away from dedicated SETs (Software Engineers in Test) around 2016-2018.

Why it declined:
- The separate role created a "someone else will test my code" mentality
- SDETs often became test-only engineers despite being hired as software engineers
- The 1:1 ratio was expensive and unsustainable
- Modern CI/CD made developer-owned testing more practical

What replaced it: Quality Engineers (coaching model) + Platform/Infrastructure Engineers (shared tooling) + developer-owned testing. This is the model we propose for Duetto.

3.4 Engineer Testing Responsibilities

In the hybrid/guild model, developers own testing — QEs coach and enable, but do not replace engineer responsibility. Every engineer is expected to contribute across the testing honeycomb.

Unit Tests (40% of effort — owned by engineers):
- Write unit tests for all new business logic, domain models, and utility functions
- Maintain test coverage for code they modify — no PR without corresponding tests
- Use JUnit 5 + Mockito (Java) or Jest + React Testing Library (frontend)
- Focus on behavior verification, not implementation testing: test what the code does, not how it does it
- AI-generated code requires the same (or higher) testing standard — engineers must verify AI-generated tests are meaningful, not tautological
- Target: every function with branching logic or business rules has a corresponding test

Integration Tests (50% of effort — engineers with QE guidance):
- Write integration tests for service-to-service communication, database queries, message handling, and GraphQL resolvers
- Use Testcontainers for real infrastructure (MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ) instead of mocks
- Write Pact consumer contracts when consuming another team's API or events
- Test the full request-response cycle through your service, not just internal methods
- QEs help define the integration test strategy and identify critical boundaries; engineers execute
- For event-driven services: test the complete publish → queue → consume → side-effect path

E2E Tests (10% of effort — engineers + QE collaboration):
- Write Playwright E2E tests for critical user journeys that touch your team's features
- Focus on happy paths and high-risk scenarios — E2E tests are expensive to maintain
- Follow the Page Object pattern for maintainable test code
- QEs define which journeys need E2E coverage and review test design; engineers implement
- Run E2E tests locally before pushing — don't rely solely on CI

General expectations for all engineers:

| Responsibility | Expectation |
|---|---|
| Test with every PR | No production code change merges without corresponding tests |
| Fix broken tests | If your change breaks a test, you fix it — not the QE |
| Flaky test ownership | If you wrote a test that becomes flaky, you own remediation (within SLA) |
| Review test quality | In code reviews, evaluate test coverage and quality, not just production code |
| Test data management | Use shared factories and fixtures; don't hardcode test data |
| AI-generated test review | Critically evaluate AI-generated tests: check for tautological assertions, missing edge cases, and implementation coupling |
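To make "tautological assertions" concrete, a hypothetical example (the function and values are invented for illustration; the same pattern appears in Java and TypeScript tests):

```python
def apply_discount(rate: float, discount_pct: float) -> float:
    """Apply a percentage discount to a nightly rate."""
    return round(rate * (1 - discount_pct / 100), 2)

# Tautological: re-derives the expected value with the same formula,
# so it passes even when the formula itself is wrong.
def test_discount_tautological():
    rate, pct = 200.0, 15.0
    assert apply_discount(rate, pct) == round(rate * (1 - pct / 100), 2)

# Meaningful: asserts independently known values and edge cases.
def test_discount_behavior():
    assert apply_discount(200.0, 15.0) == 170.0
    assert apply_discount(200.0, 0.0) == 200.0     # no discount
    assert apply_discount(200.0, 100.0) == 0.0     # full discount

test_discount_tautological()
test_discount_behavior()
```

AI assistants frequently generate the first style because it is derived from the implementation; reviewers should demand the second.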

3.5 Career Ladder for Quality Engineers

| Level | Title | Focus |
|---|---|---|
| L1-L2 | Junior QE | Learns testing fundamentals, assists with test strategy, basic automation |
| L3-L4 | Quality Engineer | Defines test strategy for a team, coaches developers, builds automation |
| L5 | Senior QE | Test strategy across multiple teams, drives quality initiatives, analyzes metrics |
| L6 | Staff/Lead QE | Org-wide quality strategy, defines quality engineering practices, influences engineering culture. Leads the Quality Engineering Team — owns technical vision for automation infrastructure, framework decisions, and CI/CD quality gate design |
| L7 | Principal QE / Head of Quality | Sets quality vision for the organization, industry thought leadership |

4. Testing Strategy

4.1 The Testing Honeycomb (Not the Pyramid)

The traditional testing pyramid (many unit tests, fewer integration, fewer E2E) was designed for monoliths. For Duetto's microservices migration, the testing honeycomb (Spotify model) is more appropriate:

Why the pyramid doesn't work for microservices: The majority of bugs in microservices occur at service boundaries — serialization mismatches, API contract violations, message schema drift, network timeout handling. Unit tests within a single service catch fewer of these real-world failures.

The honeycomb model:

| Layer | % of effort | What to test | Tools |
|---|---|---|---|
| E2E / UI tests | ~10% | Critical user journeys through the full stack | Playwright |
| Integration tests | ~50% | Service-to-service communication, database interactions, message handling, GraphQL resolvers | Testcontainers, Spring Boot Test, Pact |
| Unit tests | ~40% | Complex business logic (pricing algorithms, revenue calculations) | JUnit 5, Jest, React Testing Library |

Key insight: In microservices, integration tests are the most valuable. Internal logic within a well-designed microservice is often simple; the complexity lives in the interactions.
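To illustrate why the interaction layer matters, a minimal sketch of an event-handler integration test; an in-memory queue stands in for the real broker that Testcontainers would provide, and all names are illustrative:

```python
from collections import deque

class InMemoryQueue:
    """Stand-in for SQS/RabbitMQ in this sketch."""
    def __init__(self):
        self._messages = deque()

    def publish(self, message: dict):
        self._messages.append(message)

    def poll(self):
        return self._messages.popleft() if self._messages else None

def rate_updated_handler(message: dict, store: dict):
    """Consume a RateUpdated event and apply its side effect."""
    store[message["hotel_id"]] = message["rate"]

# Integration-style test: publish -> consume -> assert the side effect,
# exercising the full message path rather than the handler in isolation.
def test_rate_update_flow():
    queue, store = InMemoryQueue(), {}
    queue.publish({"hotel_id": "h-123", "rate": 249.0})
    while (msg := queue.poll()) is not None:
        rate_updated_handler(msg, store)
    assert store == {"h-123": 249.0}

test_rate_update_flow()
```

A unit test of `rate_updated_handler` alone would miss serialization, routing, and ordering failures; the integration test at least exercises the full path those failures live on.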

4.2 Testing by Architecture Layer

Backend (Java / Spring Boot)

| Test Type | What | Tools | Who Writes |
|---|---|---|---|
| Unit tests | Business logic, domain models, utilities | JUnit 5, Mockito | Developers |
| Integration tests | Repository operations, service interactions, message handlers | Testcontainers (MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ), Spring Boot Test | Developers + QE guidance |
| API tests | REST/GraphQL endpoints, request/response validation | MockMvc, Spring Boot Test, Apollo subgraph testing | Developers |
| Contract tests | Service boundary contracts | Pact (pact-jvm) | Developers + QE |
| Performance tests | Load, latency, throughput | k6 (with DataDog integration) | QE / Quality Engineering |

Frontend (React / TypeScript / Next.js)

| Test Type | What | Tools | Who Writes |
|---|---|---|---|
| Unit tests | Component logic, hooks, utilities, state management | Jest, React Testing Library | Developers |
| Component tests | Visual + behavioral component validation | Storybook, React Testing Library | Developers |
| Integration tests | Apollo Client queries, form flows, multi-component interactions | Jest + MockedProvider, Playwright | Developers |
| E2E tests | Critical user journeys, cross-page flows | Playwright | Developers + QE |
| Visual regression | UI consistency, design system compliance | Playwright built-in screenshots | QE / Quality Engineering |
| Accessibility | WCAG 2.1 AA compliance | axe-core, Playwright accessibility assertions | Developers + QE |

GraphQL (Apollo Federation)

| Test Type | What | Tools |
|---|---|---|
| Schema validation | Breaking change detection | Apollo `rover subgraph check` in CI |
| Resolver unit tests | Individual resolver logic | DGS test utilities / graphql-java, Jest + MockedProvider |
| Composition tests | Cross-subgraph query resolution | Apollo Router in test mode |
| Contract tests | Consumer-provider contracts for subgraphs | Pact (supports GraphQL) |
| Operation tests | Real query regression against test env | Apollo Studio operation checks |

Event-Driven Architecture (SQS, SNS, Kinesis, RabbitMQ)

| Test Type | What | Tools |
|---|---|---|
| Handler unit tests | Message processing logic, idempotency | JUnit 5, Jest |
| Integration tests | End-to-end message flow (publish → queue → consume → side effects) | Testcontainers (LocalStack, RabbitMQ) |
| Contract tests | Message schema compatibility | Pact message contracts |
| Ordering/failure tests | Out-of-order delivery, DLQ routing, batch failures | Testcontainers + custom scenarios |
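Idempotency deserves a concrete illustration: under at-least-once delivery, the same message handled twice must produce its side effect exactly once. A minimal sketch (message fields and names invented):

```python
def make_idempotent_handler():
    """Build a handler that deduplicates on message_id, as an
    at-least-once delivery consumer must."""
    seen = set()
    balances = {"acct-1": 0}

    def handle(message: dict):
        if message["message_id"] in seen:
            return  # duplicate delivery: skip the side effect
        seen.add(message["message_id"])
        balances[message["account"]] += message["amount"]

    return handle, balances

def test_duplicate_delivery_is_safe():
    handle, balances = make_idempotent_handler()
    msg = {"message_id": "m-1", "account": "acct-1", "amount": 50}
    handle(msg)
    handle(msg)  # redelivered by the broker
    assert balances["acct-1"] == 50  # side effect applied exactly once

test_duplicate_delivery_is_safe()
```

In production the dedup set lives in a durable store rather than memory, but the test shape (deliver twice, assert once) is the same.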

4.3 Contract Testing with Pact

Contract testing is critical during the monolith-to-microservices migration. It verifies that services agree on their API contracts without needing full integration environments.

How Pact works:

  1. Consumer side: Write tests describing expected interactions → Pact generates a contract JSON
  2. Pact Broker: Contracts are published to a central broker
  3. Provider side: Provider runs verification tests against the contract
  4. Can-I-Deploy: Before deploying, pact-broker can-i-deploy checks all contracts are verified
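The mechanics can be illustrated without the Pact library itself. The sketch below captures the idea (consumer records an expected interaction as a contract; provider verification replays it) but is not Pact's actual API, and all service names are invented:

```python
# Consumer side: record the expected interaction as a contract.
# Pact generates this JSON from consumer tests and publishes it to the broker.
contract = {
    "request": {"method": "GET", "path": "/rates/h-123"},
    "response": {"status": 200, "body": {"hotel_id": "h-123", "rate": 249.0}},
}

# Provider side: a stand-in for the real service's request handler.
def provider_handler(method: str, path: str):
    hotel_id = path.rsplit("/", 1)[-1]
    return 200, {"hotel_id": hotel_id, "rate": 249.0}

def verify(contract: dict, handler) -> bool:
    """Replay the contract's request against the provider and compare
    the actual response to the recorded expectation."""
    req, expected = contract["request"], contract["response"]
    status, body = handler(req["method"], req["path"])
    return status == expected["status"] and body == expected["body"]

assert verify(contract, provider_handler)  # provider honors the contract
```

`can-i-deploy` then simply checks that every consumer's contracts have a passing `verify` result for the provider version about to ship.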

When to introduce Pact at Duetto:
- Start writing contracts for the modules you plan to extract from the monolith next
- The monolith acts as "consumer" of the new service; React/Next.js frontends are also consumers
- Use Pact message contracts for SQS/SNS/Kinesis event schemas
- Pact supports GraphQL interactions natively

4.4 E2E Framework Consolidation: Playwright

Duetto currently uses both Playwright and Cypress. We propose consolidating on Playwright.

| Dimension | Playwright | Cypress |
|---|---|---|
| Browser support | Chromium, Firefox, WebKit (Safari) | Chromium, Firefox, WebKit (experimental) |
| Language support | JS, TS, Python, Java, .NET | JS, TS only |
| Multi-tab/window | Full support | Not supported |
| Parallel execution | Built-in (worker-based) | Requires Cypress Cloud (paid) |
| CI performance | 2-3x faster in parallel scenarios | Slower (sequential default) |
| Visual regression | Built-in screenshot comparison | Via plugins (Percy, Applitools) |
| API testing | Full HTTP client (request context) | Limited (cy.request) |
| Cost | Fully open source | Core OSS, paid Cloud features |
| Next.js integration | Native via @playwright/test | Limited |

Why Playwright wins for Duetto:
- Java bindings — backend teams can write integration tests in familiar tools
- Native parallelization — reduces CI pipeline time by 40-60%
- Next.js (BLAST) compatibility — tests server components, edge functions, middleware
- Apollo Federation testing — Playwright's request context is superior for GraphQL API testing
- Cost — fully open source vs. Cypress Cloud (paid for parallelization and dashboard)

Migration strategy:
1. Stop writing new Cypress tests immediately
2. All new E2E tests in Playwright
3. Gradually migrate critical Cypress tests (prioritize by business value)
4. Set a deadline (~6 months) for full Cypress decommission

4.5 Testing During the Monolith Migration

The monolith-to-microservices migration creates a period where the same functionality exists in both systems. Testing strategies must account for this duality.

Strangler Fig Testing Strategy:

| Testing Layer | What | When |
|---|---|---|
| Contract tests | Define contracts before extracting functionality | Before each service extraction |
| Routing tests | Verify the strangler facade correctly directs traffic | During migration |
| Parallel run tests | Route to both old and new, compare responses | During high-risk extractions |
| Data consistency tests | Verify data migration completeness and dual-write consistency | When service takes data ownership |
| Rollback tests | Verify switching back to monolith works | Before each cutover |

Parallel run testing (for high-risk extractions):

Request --> [Router]
              |
              +---> [Monolith] ---> Response (returned to user)
              |
              +---> [New Service] ---> Response (logged, compared)

Route a shadow copy of requests to the new service, compare responses, and track the discrepancy rate. When it drops below 0.1%, switch primary. Tools: Scientist4J pattern, DataDog for tracking comparison results.
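The comparison loop can be sketched after the Scientist4J idea (serve the primary, shadow the candidate, record agreement); the implementations and discrepancy numbers here are illustrative:

```python
def parallel_run(request, monolith, new_service, log: list):
    """Serve from the monolith; shadow the new service and record mismatches."""
    primary = monolith(request)
    try:
        candidate = new_service(request)
        log.append({"request": request, "match": primary == candidate})
    except Exception as exc:  # candidate failures must never affect users
        log.append({"request": request, "match": False, "error": repr(exc)})
    return primary  # the user always gets the monolith's response

def discrepancy_rate(log: list) -> float:
    return sum(not entry["match"] for entry in log) / len(log)

# Illustrative implementations that disagree on one input:
monolith = lambda r: r * 2
new_service = lambda r: r * 2 if r != 3 else 7

log = []
for req in range(10):
    parallel_run(req, monolith, new_service, log)

assert discrepancy_rate(log) == 0.1  # 1 mismatch in 10 requests
```

In production the log entries would be emitted as DataDog metrics, and the cutover decision reads the aggregated discrepancy rate rather than an in-memory list.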

4.6 Testing Strategy for Intelligence Teams (ML/Data)

Phase 2 scope. The Intelligence-specific practices in this section are designed to be introduced as a second phase of the Quality Guild, after the App/Platform track is established (see Section 2.2). During Phase 1, Intelligence teams adopt shared standards (linting, pre-commit, coverage visibility) and benefit from Quality Engineering infrastructure. The dedicated Intelligence QE, ML-specific quality gates, and full testing diamond are introduced in Phase 2.

The Intelligence domain (Pricing, Forecasting, Elasticity, Anomaly Detection, Optimization) operates on a fundamentally different technology stack and development lifecycle than App and Platform teams. The testing honeycomb (Section 4.1) applies to the Java services within Intelligence, but the core ML training pipelines, data engineering, and algorithmic code require a distinct quality approach.

4.6.1 Intelligence Domain Technology Landscape

| Dimension | App / Platform Teams | Intelligence Teams |
|---|---|---|
| Primary language | Java 17 + TypeScript/React | Python 3.10-3.12 (dominant), some Java |
| Frameworks | Spring Boot, Next.js, Apollo GraphQL | Flask, LightGBM, Prophet, Ray, PyTorch, Optuna, PuLP/HiGHS |
| Code nature | CRUD APIs, UI components, service orchestration | ML training pipelines, optimization solvers, statistical models, feature engineering |
| Data stores | MongoDB, PostgreSQL, Redis | S3, MLflow, DynamoDB, Athena/Glue, PostgreSQL (caching) |
| Deployment | Kubernetes, continuous delivery | Docker containers, AWS Lambda, SSM-based version promotion (stage→demo→prod) |
| Orchestration | Request/response APIs | Airflow DAGs, DynamoDB config-driven jobs, Ray distributed training |
| Testing maturity | ~3,100+ tests across repos | ~100 tests across all ML repos; some repos have 3 tests for 30K+ LOC |

Key repos analyzed: pricerator (pricing engine, 82 source files, 47 test files — best-tested), forecasting (LightGBM + Prophet, 30.5K LOC, 3 tests), ml_elasticity (DoubleML, 50+ tests), ml_pricing_engine (LP/MILP optimizer, 15 tests), intelligence-domain (Java/Spring Boot with JaCoCo), datapipelines (31+ Airflow DAGs, Great Expectations).

4.6.2 Why the Standard Honeycomb Doesn't Fully Apply

The testing honeycomb (unit → integration → E2E) assumes request/response services where bugs live at boundaries. In ML systems, the primary quality risks are different:

| Risk Category | What Can Go Wrong | Standard Honeycomb Coverage |
|---|---|---|
| Model accuracy degradation | New training run produces worse predictions than previous version | Not covered — no concept of "model accuracy" in unit/integration tests |
| Data quality drift | Input data schema changes, nulls appear in critical columns, distributions shift | Partially covered — data validation is not traditional testing |
| Feature engineering bugs | Incorrect temporal joins, data leakage across train/test split, wrong aggregation windows | Partially covered — unit tests can catch some, but the bugs are subtle and domain-specific |
| Training pipeline failure | OOM during distributed training, config-driven jobs silently skip steps | Not covered — pipeline orchestration testing is distinct |
| Numerical instability | Floating-point issues in optimization solvers, edge cases in elasticity calculations | Partially covered — requires property-based and boundary testing |
| Reproducibility failure | Same config + data produces different results across runs | Not covered — requires deterministic seeding and environment locking |

4.6.3 The ML Testing Diamond

For Intelligence teams, we propose a testing diamond adapted from Google's ML testing guidelines and industry practice:

            ┌───────────────┐
            │  Model        │  ~5% of effort
            │  Validation   │  Accuracy, backtesting, champion/challenger
            ├───────────────┤
          ┌─┤  Pipeline     ├─┐  ~15% of effort
          │ │  Tests        │ │  DAG correctness, config validation, idempotency
          │ ├───────────────┤ │
        ┌─┤ │  Data         │ ├─┐  ~30% of effort
        │ │ │  Quality      │ │ │  Schema enforcement, drift detection, expectations
        │ │ ├───────────────┤ │ │
        │ │ │  Unit +       │ │ │  ~50% of effort
        │ │ │  Integration  │ │ │  Transformations, utilities, API contracts
        └─┴─┴───────────────┴─┴─┘
| Layer | % Effort | What to Test | Tools | Who |
|---|---|---|---|---|
| Unit + Integration | ~50% | Data transformations, utility functions, API endpoints, service contracts | pytest, unittest, moto (AWS mocking), Testcontainers, Pact | Engineers |
| Data Quality | ~30% | Input schema validation, null/range checks, distribution drift, referential integrity | Great Expectations (already in datapipelines), Pandera, custom validators | Engineers + QE |
| Pipeline Tests | ~15% | Airflow DAG validation, config-driven job correctness, idempotency, failure recovery | pytest-docker, DAG unit tests, golden file testing | Engineers + QE |
| Model Validation | ~5% | Accuracy metrics vs baseline, backtesting on holdout data, champion/challenger comparison | MLflow (already in use), custom metrics, SHAP analysis | ML Engineers + QE |
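The shape of the data-quality layer can be sketched in plain Python; Great Expectations expresses the same checks declaratively, and the column names and thresholds here are invented:

```python
def validate_batch(rows: list, schema: dict) -> list:
    """Check required columns, null rates, and value ranges for a batch.

    `schema` maps column -> {"required": bool, "min": ..., "max": ...,
    "max_null_rate": ...}; returns a list of human-readable failures.
    """
    failures = []
    n = len(rows)
    for col, rules in schema.items():
        values = [r.get(col) for r in rows]
        nulls = sum(v is None for v in values)
        if rules.get("required") and nulls / n > rules.get("max_null_rate", 0.0):
            failures.append(f"{col}: null rate {nulls / n:.0%} exceeds limit")
        for v in values:
            if v is None:
                continue
            if "min" in rules and v < rules["min"]:
                failures.append(f"{col}: {v} below min {rules['min']}")
            if "max" in rules and v > rules["max"]:
                failures.append(f"{col}: {v} above max {rules['max']}")
    return failures

rows = [
    {"hotel_id": "h-1", "occupancy": 0.82},
    {"hotel_id": "h-2", "occupancy": 1.4},   # out of range
    {"hotel_id": "h-3", "occupancy": None},  # null in a required column
]
schema = {"occupancy": {"required": True, "min": 0.0, "max": 1.0,
                        "max_null_rate": 0.0}}
print(validate_batch(rows, schema))  # two failures: null rate, range
```

Run at every pipeline stage and wired to block on failure, this is the 30% layer of the diamond.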

4.6.4 Testing by Intelligence Architecture Layer

ML Training Pipelines (Python — forecasting, ml_elasticity, ml_pricing_engine)

| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| Utility unit tests | Data transformations, feature engineering functions, math utilities | pytest, numpy.testing | Low (~15 tests in ml_pricing_engine, 3 in forecasting) | Every transformation function has tests |
| Data validation | Input schema checks, null detection, range validation, distribution alerts | Great Expectations, Pandera | Present in datapipelines; absent in training repos | Validation at every pipeline stage |
| Golden file tests | Known input → expected output regression tests | pytest + snapshot files | Present in group-forecast-service | Expand to all model inference paths |
| Model validation | Accuracy vs previous version, backtesting, metric tracking | MLflow metrics, custom | MLflow experiment tracking in place | Automated champion/challenger gates |
| Config validation | YAML/DynamoDB config correctness, required fields, value ranges | JSON Schema, Pydantic | Pydantic validation in pricerator | All config-driven jobs validate before execution |
| Reproducibility | Same config + data → same results | Seed locking, pytest-randomly | Partial (deterministic seeds in some models) | Deterministic training with CI reproducibility check |
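Golden file testing from the table above, as a minimal sketch. A real suite would store the snapshot in the repo; the paths and the `predict` function here are invented stand-ins:

```python
import json
import pathlib
import tempfile

def predict(features: dict) -> dict:
    """Stand-in for a model inference path; deterministic by design."""
    return {"rate": round(100 + 5 * features["demand_index"], 2)}

def golden_test(input_features: dict, golden_path: pathlib.Path) -> str:
    """Compare current output against the stored golden file;
    create the golden file on the first run."""
    output = predict(input_features)
    if not golden_path.exists():
        golden_path.write_text(json.dumps(output, sort_keys=True))
        return "golden created"
    expected = json.loads(golden_path.read_text())
    assert output == expected, f"regression: {output} != {expected}"
    return "match"

with tempfile.TemporaryDirectory() as d:
    golden = pathlib.Path(d) / "predict_basic.golden.json"
    features = {"demand_index": 1.5}
    assert golden_test(features, golden) == "golden created"
    assert golden_test(features, golden) == "match"  # regression guard
```

The value is that any unintended change to the inference path, refactor, dependency bump, or AI-generated edit, fails the comparison even when no one wrote an explicit assertion for that case.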
Pricing/Optimization Services (Python — pricerator, group-forecast-service)
Test Type What Tools Current State Target
API tests Flask endpoint validation, request/response schemas pytest + Flask test client Good coverage in pricerator (47 test files) Maintain; add contract tests for consumers
Algorithm tests Pricing step correctness, constraint application, rate selection pytest, property-based testing (Hypothesis) Unit tests exist Expand with property-based tests for edge cases
Integration tests External API client calls (Rate API, Elasticity API, Monolith) pytest-mock, moto (S3), responses Present with moto for S3 Add contract tests for upstream ML services
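
A property-based test for constraint application might look like the following sketch. Hypothesis generates the inputs; `apply_rate_floor_ceiling` is a hypothetical stand-in for a real pricing step, and the property is that no generated input can push a rate outside its bounds:

```python
from hypothesis import given, strategies as st

def apply_rate_floor_ceiling(rate: float, floor: float, ceiling: float) -> float:
    # Hypothetical constraint step: clamp a recommended rate into [floor, ceiling].
    return min(max(rate, floor), ceiling)

@given(
    rate=st.floats(min_value=0, max_value=10_000, allow_nan=False),
    floor=st.floats(min_value=0, max_value=500, allow_nan=False),
)
def test_constrained_rate_respects_bounds(rate: float, floor: float) -> None:
    # Property: for any generated input, the constrained rate stays in bounds.
    ceiling = floor + 100.0
    constrained = apply_rate_floor_ceiling(rate, floor, ceiling)
    assert floor <= constrained <= ceiling
```

Unlike example-based unit tests, this explores edge cases (zero rates, floor near ceiling, very large rates) automatically, which is exactly where hand-written pricing tests tend to be thin.
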
Intelligence Java Services (intelligence-domain, dynamic-optimization)

These repos align with the standard strategy (Sections 4.1-4.2): - intelligence-domain — Spring Boot with JaCoCo, PostgreSQL, RabbitMQ, LLM enrichment (Claude) - dynamic-optimization — Spring Boot with RabbitMQ, Awaitility for async testing, MockServer

Standard honeycomb applies: JUnit 5, Testcontainers, Pact contracts, Playwright for any UI exposure.

Data Pipelines (datapipelines)
Test Type What Tools Current State Target
DAG validation DAG structure, task dependencies, schedule correctness Airflow test utilities, pytest 31+ DAGs, no DAG-specific tests found DAG structure tests for every pipeline
Data quality Great Expectations suites per pipeline stage Great Expectations, DataDog Present in monitoring module Expand to all critical pipelines; block on failures
Transformation tests Spark/Glue job logic, SQL correctness pytest, PySpark test utilities Limited Unit tests for all transformation functions
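
A DAG structure test verifies the shape of the pipeline without executing any tasks. In a real suite this would load DAGs via `airflow.models.DagBag` and assert there are no import errors or cycles; the sketch below shows the core invariant stand-alone, on a hypothetical task-to-downstream-tasks mapping:

```python
def has_cycle(deps: dict[str, list[str]]) -> bool:
    # Three-color DFS: GRAY = on the current path, so revisiting GRAY = cycle.
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {task: WHITE for task in deps}

    def visit(task: str) -> bool:
        color[task] = GRAY
        for nxt in deps.get(task, []):
            c = color.get(nxt, WHITE)
            if c == GRAY or (c == WHITE and visit(nxt)):
                return True
        color[task] = BLACK
        return False

    return any(color[task] == WHITE and visit(task) for task in list(deps))

def test_pipeline_dag_is_acyclic() -> None:
    # Hypothetical extract -> transform -> load pipeline.
    deps = {"extract": ["transform"], "transform": ["load"], "load": []}
    assert not has_cycle(deps)
```
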

4.6.5 Existing Intelligence Strengths to Preserve

The Intelligence domain already has quality practices that should be recognized and built upon:

Practice Where Value
MLflow experiment tracking forecasting, ml_elasticity, ml_pricing_engine Model versioning, hyperparameter logging, SHAP analysis — the ML equivalent of a quality dashboard
MyPy strict mode pricerator, ml_elasticity Stricter type enforcement than any app team repo — prevents a class of runtime errors
Great Expectations datapipelines Data validation framework already in production — expand coverage
Golden file testing group-forecast-service Snapshot-based output validation — effective for deterministic inference
detect-secrets pricerator, ml_elasticity (pre-commit) Secret detection in pre-commit hooks — ahead of app teams
Ruff + pre-commit All Python repos Consistent linting and formatting already enforced
Semantic version promotion pricerator, forecasting stage-NEXT → demo-NEXT → prod-NOW pipeline with SSM gating
Hammer (pricing regression) duetto monolith (hammer/ module) Branch-vs-branch pricing optimization comparison with statistical aggregation — the most sophisticated existing regression testing tool in the org. Currently on-demand only (see Section 4.6.7)

4.6.6 Intelligence-Specific Quality Gates

Supplement the standard CI/CD gates (Section 7) with ML-specific gates:

Phase 1 (align with org Phase 1): - Ruff lint + format check (already present — standardize config across repos) - MyPy strict mode (already on some repos — extend to all) - pytest unit tests with pytest-cov (--cov) coverage reporting (establish baseline)

Phase 2: - Great Expectations data quality checks as blocking gates in DAGs - Golden file regression tests for inference endpoints - Model accuracy comparison vs baseline (MLflow metric check) — warning, not blocking

Phase 3: - Automated champion/challenger testing before model promotion - Data drift detection alerts (via Great Expectations or custom monitors) - pytest coverage threshold (start at 40% for ML repos — lower than app teams, but meaningful) - Property-based testing (Hypothesis) for algorithmic code
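
The Phase 3 champion/challenger gate reduces to a metric comparison. A hedged sketch of the decision logic is below — metric names and the "at least as good on every metric" policy are illustrative assumptions; in practice both metric sets would be read from MLflow runs:

```python
# Direction of improvement per metric (illustrative names, not a real schema).
HIGHER_IS_BETTER = {"r2": True, "mape": False, "rmse": False}

def challenger_wins(champion: dict, challenger: dict,
                    min_relative_gain: float = 0.0) -> bool:
    """Promote only if the challenger is at least as good on every metric.

    min_relative_gain > 0 demands a margin of improvement rather than parity.
    """
    for name, higher_better in HIGHER_IS_BETTER.items():
        champ, chall = champion[name], challenger[name]
        if higher_better and chall < champ * (1 + min_relative_gain):
            return False
        if not higher_better and chall > champ * (1 - min_relative_gain):
            return False
    return True
```

Per the phasing above, this starts as a warning in Phase 2 and only becomes a blocking promotion gate in Phase 3.
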

4.6.7 Hammer: Pricing Regression Testing — Current State and Modernization

What Hammer is:

Hammer is an existing pricing optimization regression testing tool in the duetto monolith (hammer/ module). It is the most sophisticated quality validation tool currently in the Intelligence domain. It consists of two executables:

  • hammerOptimizer — Runs the full pricing optimization for a set of hotels, generating rate recommendations, constrained/unconstrained forecasts, and rate sync data transactions (RSDT). Results are stored in S3 under a {runName}/{branchName}/ prefix.
  • hammerDiffer — Compares optimization outputs from two branches with statistical aggregation: mean squared error per hotel, rate deviation percentages, one-sided diffs (prices in one branch but not the other), and per-hotel forecast diffs. Exits with code 1 if forecast differences are detected.

Typical workflow (current — manual):

```shell
# 1. Run optimization on feature branch
gradle runHammer -PbranchName=feature-branch -PrunName=run1 ...

# 2. Run optimization on develop
gradle runHammer -PbranchName=develop -PrunName=run1 ...

# 3. Compare results
gradle runHammerDiffer -PfirstBranch=feature-branch -PsecondBranch=develop ...
```

Current limitations:

Limitation Impact
On-demand only No CI/CD integration — pricing regressions can ship undetected if nobody runs Hammer
Manual 3-step process Run branch A → run branch B → run differ. Easy to forget or misconfigure
Binary pass/fail Any forecast diff triggers System.exit(1) — no tolerance threshold; a 0.001% rate deviation is treated the same as 50%
No historical tracking Results live as text files in S3 with no dashboard, alerting, or trend analysis
Heavy monolith coupling Loads the full Spring context (:api, :data, :query, :scheduler, :server) — slow startup, tightly coupled to monolith
No subset mode for CI Running all active hotels is too slow for PR pipelines; no curated sample for fast feedback
Opaque reporting Pipe-delimited text files in S3 — no structured output, no PR comments, no Slack notifications
No DataDog integration The rest of the org uses DataDog for observability, but Hammer results are isolated

Proposed modernization — 3 phases:

Phase 1 — Automate in CI (Months 1-3):

  • GitHub Actions workflow for pricing PRs: Trigger Hammer automatically on PRs that modify pricing-related code paths (api/src/**/pricing/**, api/src/**/optimizer/**, hammer/**). Use path filters to avoid running on unrelated PRs.
  • Representative hotel sample: Maintain a curated registry of 10-20 hotels (small, medium, large, different regions and configurations) for fast CI runs. Full hotel set reserved for nightly runs.
  • Scheduled nightly run on develop: Compare develop against the last release tag. Full hotel set. Slack notification if diffs exceed threshold.
  • Configurable tolerance thresholds: Replace binary pass/fail with threshold-based reporting — warn on small deviations (<0.5% mean rate deviation), block on large deviations (>2%). Allow expected changes to be annotated in the PR.
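
The threshold logic proposed above is small enough to sketch directly. Thresholds mirror the proposal (warn below 0.5% mean rate deviation, block above 2%); treating the middle band as WARN is an assumption, since the proposal leaves it unspecified:

```python
def hammer_verdict(mean_rate_deviation_pct: float,
                   block_threshold: float = 2.0) -> str:
    """Replace Hammer's binary exit(1) with a graded verdict.

    PASS: no deviation at all. FAIL: above the blocking threshold.
    WARN: any nonzero deviation up to the threshold (assumed policy).
    """
    if mean_rate_deviation_pct == 0.0:
        return "PASS"
    if mean_rate_deviation_pct > block_threshold:
        return "FAIL"
    return "WARN"
```

A WARN surfaces in the PR comment without blocking merge, so an expected 0.3% shift no longer looks identical to a 50% regression.
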

Phase 2 — Structured reporting and observability (Months 3-6):

  • Structured JSON output: Extend Hammer to produce machine-readable JSON alongside the current text format, enabling consumption by dashboards and CI tooling.
  • PR comment with summary table: GitHub Actions posts a structured comment on pricing PRs:
  • Hotels tested / hotels with diffs
  • Mean rate deviation, largest deviation (hotel + percentage)
  • Status: PASS / WARN / FAIL
  • DataDog integration: Push Hammer results as custom metrics — rate deviation per hotel, number of affected hotels, MSE trends. Build a dashboard tracking pricing stability over time. Alert on trends ("pricing deviation increasing over last 5 runs").
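
Shaping differ output into DataDog custom metrics could look like the sketch below. The metric names and the differ-summary fields are hypothetical; actual submission would go through DogStatsD or the DataDog API client rather than the plain dicts shown here:

```python
import time
from typing import Optional

def hammer_metrics(summary: dict, run_ts: Optional[float] = None) -> list[dict]:
    # Convert one Hammer differ summary into a list of metric payloads.
    ts = run_ts if run_ts is not None else time.time()
    tags = [f"run:{summary['run_name']}", f"branch:{summary['branch']}"]
    metrics = [
        {"metric": "hammer.hotels_with_diffs",
         "points": [(ts, summary["hotels_with_diffs"])], "tags": tags},
        {"metric": "hammer.mean_rate_deviation_pct",
         "points": [(ts, summary["mean_rate_deviation_pct"])], "tags": tags},
    ]
    # One MSE series per hotel enables the per-hotel trend dashboard.
    for hotel, mse in summary.get("mse_per_hotel", {}).items():
        metrics.append({"metric": "hammer.forecast_mse",
                        "points": [(ts, mse)], "tags": tags + [f"hotel:{hotel}"]})
    return metrics
```
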

Phase 3 — Decouple and modernize (Months 6-12):

  • Containerize Hammer: Docker image with Hammer JAR + dependencies. Eliminates the need to build the full monolith to run Hammer. Runs on any CI runner or as a scheduled ECS task.
  • Decouple from monolith Spring context: Extract PricingOptimizerService interface so Hammer can operate with a lighter context. As pricing moves to microservices, Hammer transitions from loading code in-process to calling the pricing service HTTP endpoint.
  • AI-assisted diff analysis: Feed Hammer diff output to Claude Code to generate human-readable summaries for PR reviewers — e.g., "This PR changes rates for 3 APAC hotels by an average of 2.3%, concentrated in room types rt1234 and rt5678 for Q3 dates. This is consistent with the PR description."

Target state:

Dimension Current Target
Trigger Manual, on-demand Automatic on pricing PRs + nightly on develop
Scope All hotels or manual selection Curated sample for PRs, full set for nightly
Pass/fail Binary (any diff = fail) Threshold-based (warn / block with configurable tolerance)
Reporting S3 text files PR comments, DataDog dashboard, Slack alerts
Speed Slow (full hotel set + Spring context) Fast for PRs (sample + containerized), thorough overnight
Historical tracking None DataDog metrics with trend analysis
Monolith coupling Full Spring context load Containerized → eventually HTTP client to pricing service

5. QA for AI-Generated Code

With 50-70% of Duetto's code AI-generated via Claude Code, QA strategy must explicitly address this reality.

5.1 How AI-Generated Code Changes Testing

Traditional Code AI-Generated Code
Developer understands every line they wrote Developer reviews code they didn't write — attention gaps are common
Bugs correlate with developer skill and domain knowledge Bugs correlate with prompt quality and review thoroughness
Testing verifies the developer's implementation Testing must verify the AI's implementation AND the reviewer's understanding
Test quality depends on developer testing discipline AI can generate tests too — but tautological tests (testing the implementation, not the behavior) are a known risk

5.2 Specific Risks and Mitigations

Risk Mitigation
AI-generated code passes review but has subtle logic errors Mutation testing to verify test suite effectiveness
AI generates tests that test the implementation, not the behavior QE reviews test strategy, not just test code; property-based testing for complex logic
AI-generated code has security vulnerabilities (dependency confusion, injection, etc.) SAST/DAST in CI, CodeRabbit security rules, dedicated security scanning
AI generates overly complex code when simple code would do Complexity metrics in CI (warn on high cyclomatic complexity), code review standards
AI-generated tests provide false confidence (high coverage, low signal) Mutation testing score as a quality gate, QE exploratory testing

5.3 Mutation Testing

Mutation testing is the strongest technique for verifying test suite quality. It works by inserting small faults ("mutants") into production code and checking whether tests detect them.

Why it matters for AI-generated code: - AI can generate tests with high line coverage that don't actually catch bugs - Mutation testing reveals whether tests are truly effective, not just comprehensive - Google uses mutation testing at scale (6,000+ engineers, 14,000+ code authors) via a diff-based probabilistic approach
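
The mechanism can be illustrated tool-agnostically (in CI this would be PIT for Java or Stryker for JS/TS, which generate mutants automatically). The function and mutant below are invented for illustration — a single flipped operator survives a non-boundary test with 100% line coverage, but dies to a boundary assertion:

```python
def eligible_for_discount(nights: int) -> bool:
    # Production code: stays from 3 nights qualify.
    return nights >= 3

def eligible_mutant(nights: int) -> bool:
    # Mutant: >= flipped to > (the kind of fault a mutation tool injects).
    return nights > 3

def test_boundary() -> None:
    # This boundary assertion kills the mutant: eligible_mutant(3) is False.
    # A test using only nights=10 would pass on both versions -- full line
    # coverage, zero mutants killed.
    assert eligible_for_discount(3) is True
```
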

Tools: - PIT (pitest) — mutation testing for Java/JVM (Spring Boot, JUnit) - Stryker — mutation testing for JavaScript/TypeScript (Jest, React)

Proposal: Start with mutation testing on critical business logic (pricing algorithms, revenue calculations) and expand. Use as a quality signal, not a hard gate initially.

5.4 AI Tools for QA

Tool What It Does Applicability to Duetto
Claude Code Generate test scaffolding, write test cases from specs, debug test failures Primary — already in use. QE should develop testing-specific prompts and skills.
Diffblue Cover AI-generated unit tests for Java Evaluate for Spring Boot services — auto-generates JUnit tests for existing code
Playwright Codegen Records browser actions and generates Playwright test code Use for rapid E2E test creation; QE reviews and refines generated tests
CodeRabbit AI code review that can enforce testing standards via path-based rules Configure to flag PRs missing tests, enforce coverage thresholds, detect test anti-patterns
Augment AI-assisted code review Complement CodeRabbit for reviewing AI-generated code quality

5.5 AI Code Review Tools for Testing Standards Enforcement

Two AI-powered code review tools are available in Duetto's stack. They serve complementary purposes for quality enforcement.

CodeRabbit

Automated AI code review with rule-based enforcement:

  • Path-based instructions: Require tests for specific directories (e.g., src/services/** must have corresponding src/__tests__/services/**)
  • Code guidelines: Reference testing standards document so CodeRabbit checks compliance
  • AST-based rules (Pro): Enforce patterns like "no console.log in production code" or "test files must use describe/it structure"
  • Default protections: Automatically skips review of generated code, lock files, build artifacts
  • Strengths: Excellent for systematic rule enforcement, configurable per-repo, catches structural test issues (missing test files, coverage regressions, anti-patterns)

Augment Code

AI-powered code review and development assistant with deep codebase context:

  • Codebase-aware reviews: Augment indexes the full codebase to provide contextual review comments — it understands existing patterns and flags deviations
  • Test quality analysis: Can identify when tests don't adequately cover the changed code, suggest missing test scenarios, and flag weak assertions
  • Architecture awareness: Understands service boundaries and can flag when integration or contract tests are missing for cross-service changes
  • IDE integration: Available in VS Code and JetBrains IDEs, providing real-time feedback during development (shift-left)
  • Strengths: Superior contextual understanding of the codebase; better at catching logic-level issues and suggesting what should be tested rather than enforcing structural rules

Comparison and Proposed Usage

Dimension CodeRabbit Augment Code
Primary mode Automated PR review (CI) IDE assistant + PR review
Rule enforcement Strong — configurable path-based and AST rules Moderate — guideline-based, not rule-based
Codebase context Limited to PR diff + configured instructions Deep — indexes full repository
Test gap detection Structural (missing test files, coverage delta) Semantic (missing scenarios, weak assertions)
Custom configuration Extensive (.coderabbit.yaml, path instructions) Moderate (team-level instructions)
Best for Enforcing standards consistently at scale Catching logic-level issues humans miss

Proposal: Use both tools in complement: - CodeRabbit as the systematic enforcer — ensures every PR meets structural quality gates (test files exist, patterns followed, no anti-patterns) - Augment Code as the intelligent reviewer — catches semantic issues like insufficient test scenarios, missing edge cases, and tests that don't match the intent of the change - Configure CodeRabbit rules first (quick wins), then layer Augment for deeper quality insights


6. Cross-Team Consistency

6.1 The Problem

With tens of autonomous teams, inconsistency is the default. Without intervention, each team: - Chooses its own testing tools and patterns - Has different coverage thresholds (or none) - Writes tests at different levels (some heavy on unit, some on E2E, some on nothing) - Has different CI pipeline configurations - Handles flaky tests differently (or doesn't handle them at all)

6.2 The "Paved Road" Approach

Inspired by Stripe and Netflix: make the right thing easy, not mandatory.

Instead of mandating practices through policy, build infrastructure that makes good practices the path of least resistance:

Paved Road What It Means
Shared test templates GitHub repo templates that include pre-configured test setup — App/Platform: Jest, JUnit, Playwright, Testcontainers; Intelligence: pytest, Great Expectations, golden file harness
CI pipeline templates Reusable GitHub Actions workflows with quality gates built in — separate templates for Java/Spring, React/Next.js, and Python ML repos
Test data libraries Shared factories and fixtures for common Duetto entities (hotels, rates, reservations, users); Intelligence: shared test data schemas and sample hotel model fixtures
Quality dashboard DataDog dashboard showing quality metrics per team and per track — visibility drives behavior
Example repos "Golden path" example services showing the proposed testing approach for backend, frontend, full-stack, and ML/data pipeline repos
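
A shared test-data factory from the table above might look like the following sketch. The entity fields are illustrative, not the real domain model; the point is that every team gets valid defaults and overrides only what the test cares about:

```python
import itertools
from dataclasses import dataclass, field

_ids = itertools.count(1)  # monotonically unique IDs across a test session

@dataclass
class Hotel:
    hotel_id: str
    name: str
    region: str = "NA"
    room_types: list[str] = field(default_factory=lambda: ["standard"])

def make_hotel(**overrides) -> Hotel:
    # Sensible defaults; tests override only the fields under test.
    n = next(_ids)
    defaults = {"hotel_id": f"h{n:04d}", "name": f"Test Hotel {n}"}
    defaults.update(overrides)
    return Hotel(**defaults)

# Usage: apac_hotel = make_hotel(region="APAC", room_types=["suite", "standard"])
```
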

6.3 Testing Standards Document

The Quality Guild should own a lightweight testing standards document. What to include and what to leave to teams:

Standardize (guild-owned — all teams): - Test categorization (unit, integration, E2E, contract, data quality, pipeline — definitions and expectations) - Minimum quality gates for CI/CD (what blocks a merge) - Test naming conventions - Flaky test policy (quarantine, SLA for remediation) - Pre-commit hook standards (Ruff/ESLint, type checking, secret detection) - Coverage reporting (visibility required; thresholds by track)

Standardize (App/Platform track): - E2E framework choice (Playwright — not optional) - Contract testing approach (Pact — for service boundaries) - Testcontainers for integration tests (real infrastructure, not mocks)

Standardize (Intelligence track): - Data quality framework (Great Expectations — for all data pipelines) - Golden file testing for inference endpoints - MyPy strict mode for all Python repos - MLflow experiment tracking with minimum logged metrics - Model promotion requires accuracy comparison vs baseline

Leave to teams: - Internal test organization (file structure, test grouping) - Mocking strategies (as long as they follow "test behavior not implementation") - Which specific scenarios to test (teams know their domain best) - Test execution speed optimization (teams manage their own pipeline budget) - ML model architecture and hyperparameter choices

6.4 Quality Guild Cadence

Activity Frequency Participants Purpose
Guild meeting (all-hands) Bi-weekly, 45 min Both tracks + interested engineers Shared topics: CI/CD, AI code quality, metrics review, cross-pollination
App/Platform deep dive Monthly, 30 min App/Platform QEs + leads Track-specific: Playwright, Pact, Testcontainers, frontend testing
Intelligence deep dive Monthly, 30 min Intelligence QE + ML engineers Track-specific: data quality, pipeline testing, model validation, MLflow
Standards review Quarterly Full guild Update testing standards for both tracks based on what's working
Quality metrics review Monthly Guild lead + EM Review dashboards by track, identify teams needing support
Tool evaluation As needed Relevant track Evaluate new tools, make proposals
Onboarding Per new QE/engineer Relevant track QE Testing expectations and tooling walkthrough (track-appropriate)

7. Quality Gates in CI/CD

7.1 Progressive Adoption

Do not implement all gates at once. This creates overwhelming friction. Phase them in:

Phase 1 — Foundation (Weeks 1-4): - Build verification (compilation, Docker image) - Unit tests with existing coverage - Linting/formatting (ESLint, Prettier, Checkstyle — autofix where possible) - GraphQL schema checks (rover subgraph check)

Phase 2 — Contracts and Integration (Weeks 5-8): - Integration tests (with Testcontainers) - Pact contract verification - Security scanning (Snyk, Dependabot, or Trivy — critical/high block) - Code coverage reporting (no hard threshold yet — just visibility)

Phase 3 — Performance and Quality (Weeks 9-12): - Performance regression testing (k6 in CI) - Code coverage thresholds (start conservative: 60%, increase over time) - Bundle size monitoring (frontend) - E2E smoke tests on staging deployment

Phase 4 — Optimization (Ongoing): - Flaky test quarantine system - Test impact analysis (run only tests affected by changed files) - Visual regression testing - Mutation testing on critical paths

7.2 Gate Classification

Tier Behavior Examples
Blocking Must pass to merge Build, unit tests, integration tests, linting, schema checks, contract tests, security (critical/high)
Warning Report but don't block Coverage trending down, performance regression >10%, bundle size growth, complexity increase
Informational Report only Test execution time trends, flaky test rate, dead code, accessibility audit, TODO count

7.3 Test Execution Optimization

Technique How Impact
Parallelization Playwright --shard, JUnit 5 parallel execution, GitHub Actions matrix strategy 2-4x faster CI
Selective test runs Jest --changedSince=main, Gradle --tests with file mapping 50-80% fewer tests on average PR
Fail-fast Run unit tests first; skip integration/E2E if they fail Faster feedback on obvious failures
Caching Cache Docker images (Testcontainers), Playwright browsers, Maven/npm dependencies 30-50% faster pipeline startup
Test result aggregation Publish to DataDog for trend analysis, GitHub Actions test summary annotations Flaky test detection, regression tracking

8. Current State: CI/CD & Test Infrastructure Analysis

This section documents the current state of Duetto's CI/CD pipelines, static code analysis, test coverage, and flaky test handling — derived from analysis of the duetto (backend) and duetto-frontend repositories. Understanding the baseline is essential for prioritizing improvements.

8.1 Current CI/CD Pipeline Architecture

Repository PR Pipeline Push (develop) Pipeline Scheduled
duetto (backend) Static analysis → All tests (Jest + basic + Selenium) Static analysis → All tests → Docker build → GraphQL schema publish Every 2 hours (with commit-change check)
duetto-frontend Lint → Jest → Cypress (12 parallel containers) Lint → Jest → Cypress → Trigger external Playwright E2E Weekly flaky test issue creation (Mondays)
duetto-playwright-e2e PR-specific tests Regression suite (triggered by frontend push) On-demand via workflow dispatch

Key observations: - Pipeline structure is sound — progressive checks with fast failures first - Backend uses larger runners (ARM 64-core for static analysis, ubuntu-latest-m for Jest) - Playwright tests live in a separate repository, triggered via webhook — this adds latency and reduces developer visibility - Scheduled all-tests run (every 2 hours) with Slack notifications provides good ongoing monitoring

8.2 Static Code Analysis — Current State

Tool Repository Configuration Blocking? Notes
Checkstyle 10.26.1 Backend 120-char line length, naming conventions, whitespace, modifier order Yes — fails PR Well-configured, enforces consistent Java style
SpotBugs 6.1.7 Backend Exclude filter for known false positives, HTML+XML reports No — ignoreFailures: true Reports generated but don't block merges
ESLint (Airbnb + TS) Frontend Airbnb + TypeScript config, no-only-tests (error level), deprecated import warnings Yes — fails PR Good configuration; prevents .only() leaks
Prettier 2.4.1 Frontend Integrated into lint-staged pre-commit hooks Yes — pre-commit Formatting consistency ensured
TypeScript compiler Frontend tsc --noEmit strict check Yes — fails PR Catches type errors before runtime
Xray / Frogbot Backend Security scanning via JFrog Manual trigger Available but not on every PR

Gaps and proposals:

Gap Impact Proposal Priority
SpotBugs is non-blocking Potential bugs slip through to production Make SpotBugs blocking for new violations (allow existing baseline) High
No SonarQube or equivalent No centralized quality dashboard, no code smell tracking, no technical debt measurement Evaluate SonarCloud (SaaS) for unified quality visibility across repos Medium
No SAST/DAST security scanning on PRs Security vulnerabilities in dependencies and code not caught early Add Snyk or Trivy to PR pipeline for dependency scanning (critical/high = blocking) High
No frontend complexity analysis Complex components grow unchecked Add ESLint complexity rules (max cyclomatic complexity warning) Low

8.3 Test Coverage — Current State

Critical finding: No code coverage thresholds are enforced in either repository.

Dimension Backend Frontend
Coverage tool Not configured (no JaCoCo) Not configured in jest.config.js
Coverage threshold None None
Coverage reporting None None
Coverage trend tracking None None
Test count ~2,318 Java unit tests + 176 Selenium tests ~404 Jest specs + 175 Cypress specs + 32 Playwright tests

Proposals:

  1. Immediate (Phase 1): Add coverage reporting without thresholds — visibility first
     • Backend: Add JaCoCo to Gradle (jacocoTestReport), publish HTML reports as CI artifacts
     • Frontend: Add --coverage flag to Jest CI command, publish lcov reports
     • Integrate with CodeCov or SonarCloud for PR-level coverage delta comments

  2. Phase 2: Introduce conservative thresholds (not blocking yet)
     • Start with warning-level thresholds: 50% line coverage (likely below current levels to avoid disruption)
     • Track coverage trend per PR — flag PRs that decrease coverage

  3. Phase 3: Enforce blocking thresholds
     • Increase to 60% line coverage (blocking)
     • Require no decrease in coverage per PR (blocking)
     • Target 80%+ for critical business logic (pricing, revenue calculations, event processing)

8.4 Flaky Test Handling — Current State

Duetto has invested meaningfully in flaky test infrastructure, but the approach is fragmented across frameworks.

Backend — Custom Retry System

Mechanism Details
@RetryTest annotation Custom JUnit 5 extension: retries test N times on specific exception types (e.g., TimeoutException, InvocationTargetException)
Usage ~9 annotated tests, primarily Selenium page tests
validate-flapper-fix workflow Manual-trigger workflow that runs a single test up to 250 times to verify a flaky fix
@Disabled tests ~20+ Selenium tests disabled with notes like "To Be Fixed in Later Ticket"
Test splitting 15 runners (basic) + 20 runners (Selenium) with line-count distribution via split-tests action
Slack alerts Scheduled test failures notify via Slack webhook

Frontend — Cypress Retry + Automated Detection

Mechanism Details
Cypress retries retries=10 — each test can fail up to 10 times before final failure
.xspec.ts quarantine 5 test files renamed to .xspec.ts (excluded from runs)
Weekly automation GitHub Action runs every Monday: parses Cypress logs for failures, creates GitHub issue with skip instructions
Cypress Cloud Records all runs for post-mortem analysis
it.skip() / describe.skip() Manual skip annotations throughout the codebase

Playwright E2E

Mechanism Details
Retry 1 retry on CI only (retries: process.env.CI ? 1 : 0)
Workers 1 worker on CI (stability), 4 locally (speed)
Reporting Allure reports with per-team breakdown, DataDog test visibility integration
Traces Retained on failure for debugging

Gaps and proposals:

Gap Impact Proposal Priority
Cypress retries=10 is excessive Masks genuinely flaky tests; 10 retries can add minutes to CI Reduce to 2-3 retries; any test needing >3 retries is flaky and should be quarantined High
No unified flaky test tracking Flaky tests tracked differently per framework; no org-wide view Build a DataDog dashboard tracking flaky rate per test suite, per team Medium
~20+ disabled Selenium tests Unknown test debt; regression risk Audit disabled tests: either fix, delete, or convert to Playwright. Set SLA: no test disabled >30 days without a ticket High
No automatic quarantine Manual process to skip flaky tests; relies on someone noticing Implement automatic quarantine: if a test fails >3 times in 7 days, auto-quarantine + create ticket Medium
Frontend weekly automation is reactive Only runs Mondays; flaky tests can block PRs all week Run flaky detection daily or on every develop push Low
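
The proposed auto-quarantine rule from the table above is simple enough to state precisely. This sketch assumes a store of per-test failure timestamps (which the existing DataDog test visibility integration could supply); the thresholds match the proposal:

```python
from datetime import datetime, timedelta

def should_quarantine(failure_times: list[datetime], now: datetime,
                      max_failures: int = 3, window_days: int = 7) -> bool:
    # Quarantine a test that fails more than max_failures times within the
    # window; the caller then skips it in CI and files a remediation ticket.
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in failure_times if t >= cutoff]
    return len(recent) > max_failures
```

Pairing this with the 30-day SLA on disabled tests keeps quarantine a holding pen rather than a graveyard.
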

8.5 Test Infrastructure Summary

┌──────────────────────────────────────────────────────────────────┐
│                     Current Test Landscape                        │
├────────────────┬───────────────┬──────────────────────────────────┤
│ Layer          │ Count         │ Framework & Notes                │
├────────────────┼───────────────┼──────────────────────────────────┤
│ Java unit      │ ~2,318 tests  │ JUnit 5 + Mockito (15 runners)  │
│ Backend Jest   │ Subset        │ Jest (frontend-in-backend)       │
│ Frontend Jest  │ ~404 specs    │ Jest + RTL (jsdom)               │
│ Cypress E2E    │ ~175 specs    │ Cypress 7.7 (12 runners, ret=10)│
│ Selenium E2E   │ ~176 tests    │ Selenium 4.29 + Firefox (20 run)│
│ Playwright E2E │ ~32 tests     │ Playwright (separate repo)       │
│ Hammer (price) │ On-demand     │ Custom Java (monolith module)    │
├────────────────┼───────────────┼──────────────────────────────────┤
│ TOTAL          │ ~3,100+ tests │ 7 frameworks across 3+ repos     │
└────────────────┴───────────────┴──────────────────────────────────┘

Key takeaway: The test suite is substantial (~3,100+ tests) but fragmented across 7 frameworks and multiple repositories. Consolidation on Playwright (replacing Cypress and Selenium), unifying test reporting, and automating Hammer (currently the only pricing regression tool, but on-demand only) will dramatically improve signal quality and reduce maintenance burden.


9. Quality Metrics

9.1 Metrics That Matter

Beyond test coverage — metrics that actually correlate with software quality:

Leading Indicators (predict quality):

Metric What It Measures Target Tool
Test coverage trend Direction of coverage over time (not absolute %) Increasing quarter-over-quarter CodeCov, SonarQube
Mutation score % of mutants killed — measures test effectiveness >70% for critical business logic PIT, Stryker
Flaky test rate % of test runs that are non-deterministic <1% of total test suite Custom tracking, DataDog
Build success rate % of CI builds that pass on first run >90% GitHub Actions metrics
PR test coverage delta Coverage change per PR No decrease (warning), increase (goal) CodeCov PR comments

Lagging Indicators (measure quality outcomes):

Metric What It Measures Target Tool
Escaped defect rate Bugs that reach production per release Decreasing trend Jira/incident tracking
Mean Time to Recovery (MTTR) How fast production issues are fixed <1 hour for P1 DataDog, PagerDuty
Deployment frequency How often teams ship to production Multiple times/week per team GitHub Actions
Change failure rate % of deployments causing incidents <5% Incident tracking
Incident frequency Production incidents per team per month Decreasing trend PagerDuty, DataDog

DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) should be the north-star metrics for engineering quality.

9.2 Quality Dashboard

Build a DataDog dashboard showing quality health per team:

Section Metrics Audience
Team Health DORA metrics, escaped defects, incident rate Engineering leadership
Test Health Coverage, flaky rate, test execution time, mutation score Quality Guild, team leads
Pipeline Health Build success rate, pipeline duration, quality gate pass rate Quality Engineering team
Production Health Error rates, p99 latency, SLO compliance All engineers

10. Tooling Proposals

10.1 Proposed Tool Stack

Category Tool Why
Unit testing (Java) JUnit 5 + Mockito Industry standard for Spring Boot
Unit testing (JS/TS) Jest + React Testing Library Already in use, excellent for React
Integration testing Testcontainers Real Docker containers for MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ
E2E testing Playwright Consolidate Cypress and Selenium onto Playwright (see Sections 4.4 and 11)
Contract testing Pact Consumer-driven contracts for service boundaries
GraphQL schema Apollo Rover Schema checks in CI for federation
Performance testing k6 JS-based, DataDog integration, GraphQL support, CI-native
Security scanning Snyk or Trivy Dependency vulnerabilities in CI
Mutation testing PIT (Java), Stryker (JS/TS) Validate test suite effectiveness
Visual regression Playwright built-in screenshots Free, no additional tooling
Accessibility axe-core + Playwright Automated WCAG 2.1 AA checks
Code review (AI) CodeRabbit + Augment Enforce testing standards, catch quality issues
Observability DataDog + OpenTelemetry Production quality signals
Chaos engineering AWS FIS + Toxiproxy Resilience testing (Phase 4+)
Intelligence Track
Unit testing (Python) pytest + pytest-mock Already in use across ML repos; standardize fixtures and markers
Data quality Great Expectations Already in datapipelines — expand to training repos; schema + distribution checks
Type checking (Python) MyPy (strict mode) Already on pricerator, ml_elasticity — standardize across all Python repos
Linting (Python) Ruff Already adopted — standardize config (line-length, rule sets) across all ML repos
ML experiment tracking MLflow Already in use — add automated validation gates (accuracy vs baseline)
Golden file testing pytest + snapshot files Already in group-forecast-service — expand to all inference endpoints
Property-based testing Hypothesis For algorithmic code: pricing constraints, elasticity calculations, optimization solvers
Pipeline testing Airflow test utilities + pytest-docker DAG structure validation, pipeline integration tests
AWS mocking moto Already in pricerator — standardize for all boto3/S3 testing
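Several of these practices reduce to small, testable helpers. For example, the golden file approach already used in group-forecast-service can be sketched with nothing but the standard library (the helper name, payload shape, and update-flag workflow are illustrative assumptions, not the existing implementation):

```python
import json
import tempfile
from pathlib import Path

def check_golden(actual: dict, golden_path: Path, update: bool = False) -> bool:
    """Compare an inference payload against a stored golden file.

    With update=True the golden file is (re)written instead of compared —
    the usual workflow when an intentional model or output change lands.
    """
    if update or not golden_path.exists():
        golden_path.write_text(json.dumps(actual, indent=2, sort_keys=True))
        return True
    expected = json.loads(golden_path.read_text())
    return expected == actual

# Demo against a throwaway directory
golden = Path(tempfile.mkdtemp()) / "forecast.golden.json"
payload = {"hotel": "h1", "rate": 199.0}
first = check_golden(payload, golden)    # first run records the golden file
second = check_golden(payload, golden)   # identical output passes
drifted = check_golden({"hotel": "h1", "rate": 210.0}, golden)  # mismatch fails
```

In CI, a failing comparison should block and require either a fix or an explicit, reviewed golden-file update.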

10.2 Testcontainers Setup for Duetto's Stack

Testcontainers should be the standard for all backend integration tests, replacing mocks with real infrastructure:

Container What it replaces Use case
MongoDBContainer Mock MongoDB / embedded MongoDB Repository tests, aggregation pipeline tests
PostgreSQLContainer Mock PostgreSQL / H2 Neon Postgres compatibility, migration tests
GenericContainer("redis:7") Mock Redis Cache integration tests
LocalStackContainer (SQS, SNS, Kinesis, S3) Mock AWS SDK Event-driven architecture tests
RabbitMQContainer Mock RabbitMQ Message handler tests

11. Selenium-to-Playwright Migration

11.1 Migration Scope

Duetto has ~176 Selenium tests (Java, Selenium 4.29, Firefox) running across 20 parallel CI runners. These tests use the Page Object pattern, custom @RetryTest annotations, and rely on Xvfb virtual display + Jetty server for test execution. In parallel, ~175 Cypress specs (TypeScript) run across 12 CI containers.

Both test suites should migrate to Playwright. This section focuses on the Selenium migration, which is the larger and more complex effort.

11.2 Why Migrate Now

Driver Details
Maintenance burden 20+ disabled Selenium tests with "to be fixed later" notes; custom retry infrastructure needed for stability
Browser limitation Selenium tests only run Firefox; no Safari or Chrome coverage
Infrastructure cost 20 parallel runners + Xvfb virtual display is heavyweight compared to Playwright's headless-by-default approach
Framework age Selenium 4.x is stable but less developer-friendly than Playwright's auto-waiting, built-in assertions, and tracing
Consolidation Running 3 E2E frameworks (Selenium + Cypress + Playwright) across 3 repos is unsustainable; converging to 1 cuts maintenance by ~60%
BLAST compatibility Next.js E2E testing is natively supported by Playwright; Selenium has no first-class Next.js integration

11.3 Migration Strategy: AI-Accelerated Conversion

The migration of ~176 Selenium tests is a high-volume, pattern-based task — ideal for AI acceleration. Using Claude Code (and optionally Augment Code for codebase-aware suggestions), the migration can be completed in a fraction of the time manual rewriting would require.

Phase 1 — Foundation (Weeks 1-2)

Set up the Playwright project and migrate the base infrastructure:

  1. Create Playwright project in the existing duetto-playwright-e2e repo (or a new directory in duetto)
  2. Convert base test utilities:
     • Selenium WebDriver setup → Playwright Browser/BrowserContext/Page
     • Custom @RetryTest annotation → Playwright's built-in retries config
     • Xvfb display server → Playwright headless mode (no display server needed)
     • Jetty server start/stop → Playwright webServer config in playwright.config.ts
     • MongoDB/Redis setup → reuse existing GitHub Actions setup or migrate to Testcontainers

  3. Create a mapping reference for the AI tools:

Selenium (Java) Playwright (TypeScript)
driver.findElement(By.id("x")) page.locator('#x')
driver.findElement(By.cssSelector(".x")) page.locator('.x')
driver.findElement(By.xpath("//div")) page.locator('div') or page.locator('xpath=//div')
element.click() await locator.click()
element.sendKeys("text") await locator.fill('text')
element.getText() await locator.textContent()
element.isDisplayed() await locator.isVisible()
new WebDriverWait(driver, 10).until(...) Auto-waiting built into Playwright actions
driver.navigate().to(url) await page.goto(url)
driver.switchTo().frame(...) await page.frameLocator(...)
Thread.sleep(ms) await page.waitForSelector(...) or await expect(locator).toBeVisible()
Actions(driver).moveToElement(e) await locator.hover()
Select(element).selectByValue(v) await locator.selectOption(v)
driver.manage().window().setSize(...) await page.setViewportSize({...})
Assert.assertEquals(expected, actual) await expect(locator).toHaveText(expected)
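Some rows of this table are mechanical enough to apply as a regex pre-pass before handing a file to Claude Code. A hypothetical stdlib sketch (it covers only the simplest one-line cases — `await` insertion, waits, and anything structural are left to the AI conversion and human review):

```python
import re

# Straightforward one-line rewrites taken from the mapping table above.
RULES: list[tuple[str, str]] = [
    (r'driver\.findElement\(By\.id\("([^"]+)"\)\)', r"page.locator('#\1')"),
    (r'driver\.findElement\(By\.cssSelector\("([^"]+)"\)\)', r"page.locator('\1')"),
    (r'driver\.navigate\(\)\.to\(([^)]+)\)', r'await page.goto(\1)'),
    (r'\.sendKeys\("([^"]+)"\)', r".fill('\1')"),
    (r'\.isDisplayed\(\)', r'.isVisible()'),
]

def pre_convert(java_line: str) -> str:
    """Apply the mechanical Selenium→Playwright rewrites to one line.

    Note: does NOT add the required `await` on actions — that, plus
    locator-strategy upgrades (role/test-id), is for AI + human review.
    """
    out = java_line
    for pattern, replacement in RULES:
        out = re.sub(pattern, replacement, out)
    return out

converted = pre_convert('driver.findElement(By.id("login")).sendKeys("user");')
print(converted)  # page.locator('#login').fill('user');
```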

Phase 2 — AI-Powered Page Object Conversion (Weeks 3-6)

Use Claude Code to bulk-convert Selenium Page Objects to Playwright:

Prompt strategy for Claude Code:

Given this Selenium Page Object class (Java), convert it to a Playwright
Page Object (TypeScript). Follow these rules:

1. Replace all Selenium WebDriver calls with Playwright equivalents
2. Replace explicit waits (WebDriverWait) with Playwright auto-waiting
3. Replace Thread.sleep() with proper Playwright waitFor* methods
4. Convert Java assertions to Playwright expect() assertions
5. Use Playwright's built-in locator strategies (prefer role, text,
   test-id over CSS/XPath)
6. Keep the Page Object pattern but adapt to TypeScript class syntax
7. Add proper TypeScript types for all methods
8. Replace any Selenium-specific retry logic with Playwright's
   built-in retry mechanisms

Source Selenium class:
[paste class]

Output the Playwright TypeScript equivalent.

Batch conversion workflow:

  1. List all Page Objects from selenium/src/test/java/com/duetto/frontend/selenium/
  2. Prioritize by business value: Start with pages that cover critical user journeys (login, pricing, rate management, dashboard)
  3. Feed each Page Object to Claude Code with the prompt above
  4. Human review: Engineer verifies each converted Page Object — check locator strategies, ensure business logic is preserved, validate assertions
  5. Run both old and new tests in parallel for the same pages to validate conversion accuracy

Expected velocity with AI assistance:

  • Manual conversion: ~2-3 Page Objects per engineer per day
  • AI-assisted conversion: ~10-15 Page Objects per engineer per day (3-5x speedup)
  • Human review remains essential — AI may miss Duetto-specific patterns, custom wait conditions, or domain-specific assertions


Phase 3 — Test Conversion (Weeks 5-10)

Convert test files in priority order:

Priority Tests Criteria
P0 Login, authentication, critical navigation Break = users can't access the product
P1 Pricing, rate management, revenue dashboards Core business functionality
P2 Settings, admin, user management Important but lower traffic
P3 Disabled tests (~20+) Evaluate: convert or permanently delete

For each test file:

  1. Feed the Selenium test + its Page Objects to Claude Code
  2. AI generates the Playwright equivalent
  3. Engineer reviews, adjusts for Duetto-specific patterns
  4. Run the new Playwright test against the same environment
  5. Validate it covers the same scenarios (compare step-by-step)
  6. Once green, mark the Selenium test for deprecation

Claude Code skills for migration:

Consider creating a dedicated Claude Code skill (.claude/skills/selenium-to-playwright.md) that encodes:

  • The mapping reference table above
  • Duetto-specific Page Object conventions
  • Common patterns in Duetto's Selenium tests (e.g., how they handle MongoDB test data, Jetty server initialization)
  • Preferred Playwright locator strategy (test-id > role > text > CSS > xpath)
  • Assertion patterns used in the Playwright E2E repo

Phase 4 — Validation and Cutover (Weeks 9-12)

  1. Parallel run period (2-3 weeks):
     • Run both Selenium and Playwright suites in CI
     • Track: same scenarios should produce same pass/fail results
     • Investigate any discrepancies — usually timing or locator differences

  2. Decommission Selenium:
     • Remove Selenium tests from CI pipeline
     • Archive (don't delete) the Selenium directory for reference
     • Remove Selenium dependencies from build.gradle
     • Remove Xvfb and Firefox setup from GitHub Actions workflows
     • Reduce CI runners from 20 → Playwright's built-in sharding

  3. Update CI infrastructure:
     • Replace 20 Selenium runners with Playwright --shard across fewer runners
     • Playwright's native parallelization typically needs 3-5 shards for equivalent coverage
     • Expected CI time reduction: 40-60%
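The "same pass/fail results" check in the parallel run period can be automated. A minimal sketch, assuming each suite's results have been reduced to a {scenario name → status} map (the report format is an assumption, not a prescribed schema):

```python
def parity_report(selenium: dict[str, str],
                  playwright: dict[str, str]) -> dict[str, list[str]]:
    """Compare pass/fail status of the same scenarios across both suites."""
    shared = selenium.keys() & playwright.keys()
    return {
        "agree": sorted(t for t in shared if selenium[t] == playwright[t]),
        "disagree": sorted(t for t in shared if selenium[t] != playwright[t]),
        "missing_in_playwright": sorted(selenium.keys() - playwright.keys()),
    }

report = parity_report(
    {"login_ok": "pass", "rate_edit": "pass", "dashboard": "fail"},
    {"login_ok": "pass", "rate_edit": "fail"},
)
# "disagree" entries are the discrepancies to investigate;
# "missing_in_playwright" entries block the Selenium cutover.
```

Run as a CI step during the parallel period, a non-empty `disagree` or `missing_in_playwright` list would fail the validation job.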

11.4 Cypress Migration (Parallel Track)

The Cypress-to-Playwright migration follows a similar pattern but is simpler, since both frameworks are JavaScript/TypeScript:

Selenium → Playwright Cypress → Playwright
Language change (Java → TS) Same language (TS → TS)
Page Object pattern rewrite Page Object pattern adaptation
New assertion library Similar assertion patterns
~176 tests ~175 specs
12 weeks 6-8 weeks

Key Cypress → Playwright differences:

Cypress Playwright
cy.visit(url) await page.goto(url)
cy.get('.selector') page.locator('.selector')
cy.contains('text') page.getByText('text')
cy.intercept() page.route()
cy.wait('@alias') await page.waitForResponse(...)
Automatic chaining Explicit await on each action
cy.request() request.get() / request.post()

Claude Code can perform this conversion even faster since no language change is involved. Expected velocity: 15-25 specs per engineer per day with AI assistance.

11.5 Migration Timeline

Month 1          Month 2          Month 3          Month 4
┌────────────────┬────────────────┬────────────────┬────────────┐
│ SELENIUM       │ SELENIUM       │ SELENIUM       │            │
│ Phase 1:       │ Phase 2-3:     │ Phase 3-4:     │ Selenium   │
│ Foundation +   │ AI-convert     │ Convert P2-P3  │ decomm.    │
│ base infra     │ P0-P1 tests    │ + validation   │            │
├────────────────┼────────────────┼────────────────┤            │
│ CYPRESS        │ CYPRESS        │                │ Cypress    │
│ Phase 1:       │ Phase 2:       │ Cypress        │ decomm.    │
│ Stop new tests │ AI-convert     │ validation     │            │
│ + P0 migration │ P1-P3 tests    │ + cutover      │            │
└────────────────┴────────────────┴────────────────┴────────────┘

Total effort estimate:

  • With AI assistance: 2-3 engineers × 3 months (including validation)
  • Without AI: 3-4 engineers × 6 months
  • AI acceleration saves ~50% of migration effort

11.6 Risk Mitigation

Risk Mitigation
AI generates incorrect locators Every converted test must be run and verified by an engineer; use Playwright Codegen to validate locator strategies
Test behavior changes during conversion Run old and new tests in parallel during validation; compare results test-by-test
Team bandwidth Spread migration across teams — each team converts its own Selenium/Cypress tests, guided by QE
Loss of Selenium-specific infrastructure Document all custom Selenium utilities before removal; ensure Playwright equivalents exist
Disabled tests never get converted Audit disabled tests in Phase 1: decide convert or delete. No indefinite quarantine during migration

12. Implementation Roadmap

Phase 1 — Foundation (Months 1-3)

App/Platform track:

  - [ ] Define Quality Engineer role and career ladder (both tracks)
  - [ ] Hire or reassign 2-3 App/Platform QEs for pilot embedding (choose teams with willing tech leads)
  - [ ] Audit current testing: what exists, what's automated, where are gaps per team
  - [ ] Establish basic testing standards document (guild v0 — shared + track-specific sections)
  - [ ] Implement Phase 1 CI quality gates (build, unit tests, linting, schema checks)
  - [ ] Stop writing new Cypress and Selenium tests — all new E2E in Playwright
  - [ ] Begin Selenium-to-Playwright migration: foundation + base infrastructure (Section 11)
  - [ ] Create Claude Code skill for Selenium-to-Playwright conversion patterns
  - [ ] Add JaCoCo (backend) and Jest --coverage (frontend) for coverage visibility
  - [ ] Make SpotBugs blocking for new violations (baseline existing)
  - [ ] Set up shared Testcontainers configurations for common Duetto infrastructure
  - [ ] Create quality metrics dashboard in DataDog (basic version — both tracks)

Intelligence track:

  - [ ] Audit Intelligence repo testing: catalog test counts, coverage gaps, and existing strengths per repo
  - [ ] Standardize Ruff config across all Python ML repos (line-length, rule sets, pre-commit)
  - [ ] Extend MyPy strict mode to all Intelligence Python repos (currently only pricerator, ml_elasticity)
  - [ ] Add pytest coverage reporting (pytest-cov) to Intelligence CI pipelines for visibility (no thresholds yet)
  - [ ] Document existing MLflow, Great Expectations, and golden file testing practices
  - [ ] Automate Hammer: GitHub Actions workflow for pricing PRs + curated hotel sample (Section 4.6.7)
  - [ ] Define Hammer tolerance thresholds (warn vs block) to replace binary pass/fail
  - [ ] Add scheduled nightly Hammer run on develop with Slack notifications
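The Hammer tolerance-threshold item can be made concrete with a small classifier — a sketch with illustrative thresholds (the actual warn/block percentages are for the Intelligence track to define, not proposed values):

```python
def classify_price_diff(baseline: float, candidate: float,
                        warn_pct: float = 1.0, block_pct: float = 5.0) -> str:
    """Replace binary pass/fail with tolerance bands on pricing diffs.

    Illustrative bands: <=1% diff passes, <=5% warns, above that blocks.
    """
    if baseline == 0:
        return "block" if candidate != 0 else "pass"
    diff_pct = abs(candidate - baseline) / abs(baseline) * 100
    if diff_pct <= warn_pct:
        return "pass"
    if diff_pct <= block_pct:
        return "warn"
    return "block"

# A 3% price shift warns (visible in the PR comment) without blocking merge
print(classify_price_diff(200.0, 206.0))  # warn
```

Warn results would surface in the PR comment summary; only block results would fail the Hammer CI job.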

Phase 2 — Pilot and Guild Formation (Months 3-6)

App/Platform track:

  - [ ] Embed QEs in 3-4 App/Platform pilot teams — focus on coaching, not gatekeeping
  - [ ] Form Quality Guild with two-track structure — bi-weekly all-hands, monthly track deep dives
  - [ ] Implement Phase 2 CI quality gates (integration tests, contract testing, security scanning)
  - [ ] Introduce Pact for first service boundary being extracted from monolith
  - [ ] AI-powered bulk conversion of Selenium P0-P1 tests to Playwright (Section 11.3)
  - [ ] Begin Cypress-to-Playwright migration (critical paths first)
  - [ ] Implement CodeRabbit + Augment Code testing enforcement rules
  - [ ] Add Snyk or Trivy for dependency security scanning on PRs
  - [ ] Start tracking DORA metrics per team

Intelligence track:

  - [ ] Hire or assign 1 Intelligence track QE (Python + ML pipeline experience required)
  - [ ] Expand Great Expectations from datapipelines into training repos (forecasting, ml_elasticity)
  - [ ] Implement golden file testing for pricerator and group-forecast-service inference endpoints
  - [ ] Add MLflow automated accuracy comparison vs baseline (warning gate, not blocking)
  - [ ] Create CI pipeline templates for Python ML repos (Ruff + MyPy + pytest + coverage)
  - [ ] Begin pytest coverage improvement: target utility functions and data transformations first
  - [ ] Hammer Phase 2: structured JSON output, PR comment summaries, DataDog metrics integration

Phase 3 — Scale and Standardize (Months 6-12)

App/Platform track:

  - [ ] Expand App/Platform QEs to cover all App + Platform teams (1 QE per 2-4 teams)
  - [ ] Implement Phase 3 CI quality gates (performance testing, coverage thresholds, E2E smoke)
  - [ ] Complete Selenium and Cypress decommission (Section 11.5)
  - [ ] Introduce mutation testing for critical business logic (pricing, revenue)
  - [ ] Build shared test data libraries (hotel, rate, reservation factories)
  - [ ] Create CI pipeline templates (reusable GitHub Actions workflows)
  - [ ] Establish quality onboarding for new developers

Intelligence track:

  - [ ] Intelligence QE embedded across Pricing, Forecasting, and Data teams
  - [ ] Implement data quality gates as blocking in Airflow DAGs (Great Expectations)
  - [ ] Introduce Hypothesis property-based testing for algorithmic code (pricing constraints, elasticity calculations)
  - [ ] pytest coverage threshold for Intelligence repos (40% — lower than app teams but meaningful)
  - [ ] Automated champion/challenger model testing before promotion to production
  - [ ] Data drift detection alerts integrated with DataDog
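For the drift-detection item, one widely used statistic is the population stability index (PSI). A minimal sketch over pre-binned feature counts (the bin counts and the alerting rule of thumb are illustrative, not a committed alert threshold):

```python
import math

def psi(expected_counts: list[int], actual_counts: list[int]) -> float:
    """Population stability index over matching histogram bins.

    Common rule of thumb (illustrative): <0.1 stable, 0.1-0.25 moderate
    drift, >0.25 significant drift worth alerting on.
    """
    e_total, a_total = sum(expected_counts), sum(actual_counts)
    score = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Smooth empty bins to avoid division by zero / log(0)
        e_frac = max(e / e_total, 1e-6)
        a_frac = max(a / a_total, 1e-6)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score

identical = psi([100, 200, 300], [100, 200, 300])  # no drift, PSI ~ 0
shifted = psi([100, 200, 300], [300, 200, 100])    # reversed distribution
```

In practice the baseline bins would come from training data and the actual bins from a recent inference window, with the score emitted as a DataDog metric per feature.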

Phase 4 — Optimize and Mature (Months 12-18)

Both tracks:

  - [ ] All teams have QE support (embedded or shared)
  - [ ] Implement Phase 4 quality gates (flaky test quarantine, test impact analysis, visual regression)
  - [ ] Explore chaos engineering (AWS FIS + Toxiproxy for critical services)
  - [ ] Evaluate AI-powered test generation tools (Diffblue for Java, Claude Code test skills)
  - [ ] Formalize parallel run testing for high-risk service extractions
  - [ ] Annual review: guild health, tooling satisfaction, quality metrics by track

Intelligence-specific:

  - [ ] End-to-end ML pipeline validation (data → training → inference → output) as CI workflow
  - [ ] Model monitoring dashboards per model family (forecasting, elasticity, pricing optimizer)
  - [ ] Reproducibility CI checks: same config + data → deterministic output
  - [ ] Hammer Phase 3: containerize, decouple from monolith Spring context, AI-assisted diff analysis


13. Pre-Identified Potential Initiatives

The following initiatives have been identified during the analysis that produced this strategy. They are listed here as a starting backlog — not a commitment. Final prioritisation, sequencing, and scope will be determined by the Quality Guild and Quality Engineering Team once established.

Each initiative is tagged with the team charter it belongs to: Guild (TC-006) for governance and standards work, or QE Team (TC-007) for infrastructure and tooling delivery. QE Team initiatives are split by track.

13.1 Quality Guild Initiatives (TC-006)

These initiatives relate to governance, standards, coaching, and culture — owned by the guild as a community of practice.

# Initiative Phase Primary Metric Impacted
1 Author testing standards document v1 — shared section + track-specific sections, including engineer testing responsibilities (unit/integration/E2E ownership expectations, PR review standards) 1 Cross-Team Testing Standard Adoption
2 Establish flaky test policy and SLA — quarantine rules, remediation ownership, maximum days disabled 1-2 Flaky Test Rate
3 AI-generated code QA strategy — mutation testing adoption criteria, AI-generated test review guidelines, tautological test detection 2 Escaped Defect Rate
4 DORA metrics tracking and review — define measurement per team, monthly review cadence with EMs 2 DORA Change Failure Rate
5 Testing standards document v2 — add Intelligence track sections (ML testing diamond, data quality standards, pipeline testing expectations, Intelligence-specific quality gates) 2 Cross-Team Testing Standard Adoption

13.2 Quality Engineering Team Initiatives — Shared (TC-007)

Infrastructure and tooling that serves both App/Platform and Intelligence tracks.

# Initiative Phase Primary Metric Impacted
1 Phase 1 CI quality gates — build and improve existing checks: build verification, unit tests, linting/formatting, GraphQL schema checks, code coverage visibility (JaCoCo, Jest, pytest — reporting only, no thresholds) 1 Quality Gate Adoption Rate
2 Quality metrics dashboard in DataDog — test health, pipeline health, per-team and per-track views 1 Quality Gate Adoption Rate
3 Phase 2 CI quality gates — integration tests, contract verification, dependency security scanning (Snyk/Trivy, critical/high = blocking), coverage delta PR comments 2 Quality Gate Adoption Rate
4 CodeRabbit + Augment Code configuration — path-based test enforcement rules, anti-pattern detection, semantic test gap review 2 Quality Gate Adoption Rate
5 Flaky test auto-quarantine system — detection (>3 failures in 7 days), automatic quarantine, Jira ticket creation, unified dashboard 2 Defect Detection Rate
6 Incident RCA tagging model — establish caught-in-ci / escaped-to-production / could-have-been-caught / not-ci-detectable classification in Jira 2 Defect Detection Rate
7 Phase 3 CI quality gates and reusable CI pipeline templates — k6 performance regression, coverage thresholds (60% blocking), E2E smoke on staging; published as versioned GitHub Actions workflows for Java/Spring, React/Next.js, Python ML 3 Quality Gate Adoption Rate
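Initiative 5's detection rule (>3 failures in 7 days) reduces to a small function over a per-test failure log — a sketch assuming failure timestamps are available from CI reporting (the data source and window are the initiative's to finalize):

```python
from datetime import datetime, timedelta

def is_flaky(failure_times: list[datetime], now: datetime,
             threshold: int = 3, window_days: int = 7) -> bool:
    """Quarantine candidate: more than `threshold` failures in the window."""
    cutoff = now - timedelta(days=window_days)
    recent = [t for t in failure_times if t >= cutoff]
    return len(recent) > threshold

now = datetime(2026, 3, 11)
log = [now - timedelta(days=d) for d in (0, 1, 2, 6, 30)]
print(is_flaky(log, now))  # 4 failures in the last 7 days -> True
```

A True result would trigger the automatic quarantine and Jira ticket creation described in the initiative; the 30-day-old failure falls outside the window and is ignored.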

13.3 Quality Engineering Team Initiatives — App/Platform Track (TC-007)

# Initiative Phase Primary Metric Impacted
1 Selenium-to-Playwright migration — foundation, base infrastructure, Claude Code migration skill; stop new Cypress and Selenium tests, all new E2E in Playwright (Section 11) 1 E2E Framework Consolidation
2 Make SpotBugs blocking for new violations — baseline existing, fail PR on new 1 Quality Gate Adoption Rate
3 Shared Testcontainers configurations — MongoDB, PostgreSQL, Redis, LocalStack (SQS/SNS/Kinesis/S3), RabbitMQ 1 Quality Gate Adoption Rate
4 E2E framework bulk conversion — AI-powered Selenium P0-P1 test conversion with parallel run validation + Cypress critical path migration (Sections 11.3, 11.4) 2 E2E Framework Consolidation
5 Pact broker setup — contract broker infrastructure, can-i-deploy gate, message contract support for first service extraction 2 Quality Gate Adoption Rate
6 Selenium and Cypress decommission — remove from CI, archive tests, remove 20 Selenium runners + 12 Cypress containers, replace with Playwright sharding 3 E2E Framework Consolidation
7 Mutation testing infrastructure — PIT for Java, Stryker for JS/TS on critical business logic (pricing, revenue) 3 Defect Detection Rate
8 Shared test data libraries — factories and fixtures for hotels, rates, reservations, users 3 CI Build Success Rate

13.4 Quality Engineering Team Initiatives — Intelligence Track (TC-007)

# Initiative Phase Primary Metric Impacted
1 Audit Intelligence repo testing — catalog test counts, coverage gaps, and existing strengths per repo 1 Quality Gate Adoption Rate
2 Standardize Python tooling across all Intelligence repos — unified Ruff config (line-length, rule sets, pre-commit) and MyPy strict mode (currently only pricerator, ml_elasticity) 1 Quality Gate Adoption Rate
3 Hammer CI automation — GitHub Actions for pricing PRs + curated hotel sample + nightly develop run + Slack notifications + configurable tolerance thresholds replacing binary pass/fail (Section 4.6.7) 1 Hammer CI Automation
4 Expand Great Expectations from datapipelines into training repos (forecasting, ml_elasticity) 2 Quality Gate Adoption Rate
5 Model and inference validation — golden file testing for inference endpoints (pricerator, group-forecast-service) + MLflow automated accuracy comparison vs baseline (warning gate, not blocking) 2 Defect Detection Rate
6 Python ML CI pipeline templates — Ruff + MyPy strict + pytest + coverage as reusable GitHub Actions 2 Quality Gate Adoption Rate
7 Hammer structured reporting — JSON output, PR comment summaries, DataDog metrics integration 2 Hammer CI Automation
8 Data quality gates as blocking in Airflow DAGs — Great Expectations suites per pipeline stage 3 Defect Detection Rate
9 Hypothesis property-based testing for algorithmic code — pricing constraints, elasticity calculations, optimization solvers 3 Defect Detection Rate
10 pytest coverage thresholds for Intelligence repos — 40% initial target 3 Quality Gate Adoption Rate
11 Automated champion/challenger model testing before promotion to production 3 Defect Detection Rate
12 Data drift detection alerts integrated with DataDog 3 Defect Detection Rate

13.5 Initiative Summary

Owner Count Phase 1 Phase 2 Phase 3 Phase 4
Quality Guild 5 2 3 0 0
QE Team — Shared 7 2 4 1 0
QE Team — App/Platform 8 3 2 3 0
QE Team — Intelligence 12 3 4 5 0
Total 32 10 13 9 0

Note: These are pre-identified potential initiatives derived from the analysis in this strategy document. They are not commitments. The Quality Guild and Quality Engineering Team will refine, reprioritise, merge, or discard initiatives as they begin work and learn from early phases. Formal APEX initiative IDs (I-YYYY-XX-NNN) will be assigned when initiatives are approved and enter the APEX pipeline.


14. Frequently Asked Questions

Q: Does this mean we're getting rid of manual QA? A: We're evolving it. Manual exploratory testing remains valuable — trained exploratory testers consistently find bugs that automated tests miss. But manual regression testing is eliminated. Developers own automated testing; QE focuses on strategy, coaching, and high-value exploratory work.

Q: Who writes the tests — developers or QEs? A: Developers write unit and integration tests for their code. QEs define the test strategy, coach developers on what and how to test, conduct exploratory testing, and build shared infrastructure. Think of it like security: every developer writes secure code, but security engineers set standards and do penetration testing.

Q: How does this work with AI-generated code? A: AI can generate tests, but AI-generated tests have known risks (tautological tests, testing implementation not behavior). QEs review test strategy and effectiveness, mutation testing validates test suite quality, and CodeRabbit enforces testing standards in code review.

Q: Won't this slow teams down? A: Short-term, introducing quality gates adds friction. Long-term, it reduces escaped defects, incident frequency, and time spent firefighting. Google's data shows that investing in testing infrastructure pays back within 6-12 months through fewer production issues and faster development velocity.

Q: What about our existing Automation Engineers? A: They can transition to Quality Engineers (with coaching/strategy focus) or join the Quality Engineering team (infrastructure focus), depending on their strengths and interests. Both paths are valuable.

Q: Why Playwright over Cypress? A: Multi-browser support, Java bindings for backend teams, native parallelization (free vs. Cypress Cloud paid), built-in visual regression, superior Next.js/BLAST compatibility, and 2-3x faster CI execution. See Section 4.4 for full comparison.

Q: How does this strategy apply to Intelligence/ML teams? A: Intelligence teams operate on a different technology stack (Python, LightGBM, Airflow, MLflow) and face different quality risks (data drift, model accuracy degradation, pipeline failures). The guild has a dedicated Intelligence track with its own QE(s), testing practices (data quality, pipeline tests, model validation), and tools (Great Expectations, golden file testing, Hypothesis). The standard testing honeycomb is replaced by a "testing diamond" adapted for ML systems. See Section 4.6.

Q: Do ML teams need the same coverage thresholds as App teams? A: No. ML training code is inherently harder to unit test because outcomes depend on data distributions, not deterministic logic. We set a lower initial coverage threshold (40% vs 60% for App) but target the testable parts — utility functions, data transformations, API layers, and configuration validation. Quality in ML comes from data validation (Great Expectations), model validation (MLflow metrics), and pipeline testing — not just code coverage.

Q: How long will the Selenium and Cypress migration take? A: With AI-assisted conversion (Claude Code), approximately 3 months with 2-3 engineers. Without AI, it would take 6+ months. The migration runs Selenium and Cypress tests in parallel with new Playwright tests during validation, so there's no coverage gap. See Section 11 for the full plan.


References

  • Google, Software Engineering at Google — Chapter 11 (Testing Overview), Chapter 12 (Unit Testing)
  • Google Research, State of Mutation Testing at Google (2018)
  • Google Testing Blog, Where Do Our Flaky Tests Come From? (2017)
  • GitLab, Testing Guide (docs.gitlab.com/ee/development/testing_guide)
  • Atlassian, Quality Assistance vs Quality Assurance model
  • Spotify, Testing Honeycomb and Squad/Chapter/Guild organizational model
  • Stripe, Paved Road approach to developer experience
  • Netflix, Paved Road infrastructure and Toxiproxy
  • Microsoft, Combined Engineering announcement (2014)
  • Pact Foundation, Consumer-Driven Contract Testing (pact.io)
  • CodeRabbit, Skills and Configuration documentation
  • Google, Reliable Machine Learning — ML testing and validation patterns
  • Great Expectations, Data Quality and Testing documentation (greatexpectations.io)
  • Hypothesis, Property-Based Testing for Python (hypothesis.readthedocs.io)

Document History

Date Author Change
2026-03-04 Antonio Cortés Initial draft
2026-03-04 Antonio Cortés Added: engineer testing responsibilities (3.4), Augment Code evaluation (5.5), current CI/CD state analysis (Section 8), Selenium-to-Playwright migration plan (Section 11). Updated QE staffing to 4-6 embedded.
2026-03-04 Antonio Cortés Added: Intelligence domain testing strategy (Section 4.6) with ML testing diamond, repo analysis, and ML-specific quality gates. Restructured guild model to two-track (App/Platform + Intelligence) with separate embedded QE profiles, cadence, standards, tooling, and roadmap items.
2026-03-04 Antonio Cortés Updated L6 title to Staff/Lead QE. Changed language from "recommendation" to "proposal" throughout. Added phasing notes: Intelligence track as Phase 2 for the Quality Guild. Updated folder structure. Staff/Lead QE leads Quality Engineering Team with automation architecture responsibilities.
2026-03-04 Antonio Cortés Added Hammer pricing regression testing analysis and 3-phase modernization plan (Section 4.6.7). Added Hammer to existing strengths (4.6.5) and test infrastructure summary (Section 8.5). Added Hammer automation items to implementation roadmap (Phases 1, 2, and 4 Intelligence track).
2026-03-04 Antonio Cortés Renamed "Quality Platform Team" to "Quality Engineering Team" throughout to avoid confusion with the App/Platform engineering domain.
2026-03-05 Antonio Cortés Renamed document from "QA & Automation Strategy" to "Quality Engineering Strategy." Added Section 13: pre-identified potential initiatives for Quality Guild (10), QE Team Shared (13), QE Team App/Platform (11), and QE Team Intelligence (20) — 54 total.