Quality Engineering Strategy: Best Practices for Duetto
Author: Antonio Cortés
Date: 2026-03-04
Status: DRAFT
Audience: Engineering Leadership, Engineering Managers, Tech Leads
Applies to: All engineering teams — App, Intelligence, and Platform areas
Related Team Charters:
- Quality Guild Charter (TC-006)
- Quality Engineering Team Charter (TC-007)
1. Executive Summary
This document proposes a quality engineering strategy for Duetto, informed by industry best practices from organizations like Google, Spotify, Atlassian, GitLab, Stripe, and Netflix. It addresses Duetto's current challenges — manual QA, inconsistent ways of working across teams, and the unique demands of AI-first engineering — and proposes a path forward.
Current State
| Dimension | Current Reality |
|---|---|
| QA model | Manual QA + some Automation Engineers, not embedded in teams |
| Ways of working | Inconsistent across tens of teams — no shared testing standards |
| Test automation | Some automation exists but coverage, tools, and practices vary by team |
| E2E testing | Both Playwright and Cypress in use — no consolidation strategy |
| AI-generated code | 50-70% of code is AI-generated, but QA practices haven't adapted |
| Architecture transition | Active monolith-to-microservices migration creates new testing challenges at service boundaries |
| Intelligence / ML testing | Minimal test coverage in ML repos (~100 tests across 15+ repos); existing strengths in MLflow, MyPy strict mode, Great Expectations, and pre-commit hooks — but no shared ML testing standards |
Proposed Target State
| Dimension | Target |
|---|---|
| QA model | Hybrid/Guild with two tracks: App/Platform QEs + Intelligence (ML/Data) QE + central Quality Engineering team |
| Ways of working | Shared testing standards with track-specific adaptations, enforced through CI/CD quality gates and Quality Guild |
| Test automation | Developer-owned testing with QE coaching; testing honeycomb for App/Platform, testing diamond for Intelligence (data quality, pipeline tests, model validation) |
| E2E testing | Consolidated on Playwright (App/Platform); golden file + data quality testing (Intelligence) |
| AI-generated code | Specific QA strategies for AI-generated code; mutation testing; AI-aware code review |
| Architecture transition | Contract testing (Pact) at service boundaries; parallel run testing for high-risk extractions |
2. QA Organizational Model
2.1 Why the Hybrid/Guild Model
Three QA models exist in the industry. For Duetto's context (tens of teams, lean core, monolith-to-microservices migration), the Hybrid/Guild model is the strongest fit.
| Model | How it works | Pros | Cons | Used by |
|---|---|---|---|---|
| Centralized | Separate QA team serves all product teams | Consistent standards, clear career ladder | Bottleneck, "throw it over the wall" mentality, slow feedback | Legacy enterprises |
| Embedded | QA engineers sit within product teams | Deep domain context, fast feedback, team ownership | Isolated QA, inconsistent practices, no shared infrastructure | Spotify (evolved) |
| Hybrid/Guild | Embedded in teams + central enablement guild | Best of both: context + consistency, shared infrastructure, career growth | Requires guild leadership, matrix reporting complexity | GitLab, Stripe, Atlassian |
Why not centralized: With tens of teams, a centralized QA team would be a severe bottleneck. It reinforces the "QA tests my code" anti-pattern that doesn't scale.
Why not pure embedded: Without a guild, tens of teams would each invent their own practices. QA engineers would be isolated with no career path and no shared tooling.
Why hybrid/guild: Embedded QE gets domain context for microservices testing. The central guild provides consistency, shared infrastructure, career development, and drives org-wide quality initiatives — critical during the monolith migration and AI-first transformation.
2.2 Proposed Structure for Duetto
Duetto's engineering organization spans two distinct technology domains: App/Platform (Java/Spring Boot + React/TypeScript) and Intelligence (Python ML/data pipelines). The guild model must serve both, with shared governance but domain-appropriate practices.
Phasing note: The Intelligence track is designed to be added as a second phase of the Quality Guild. Phase 1 focuses on establishing the guild with the App/Platform track (highest team count, largest test debt, Selenium/Cypress migration). Once the guild is operational and the App/Platform practices are stable (approximately months 3-6), the Intelligence track is introduced with its own embedded QE(s) and domain-specific practices. This phased approach avoids overloading the guild at inception and allows the Intelligence track to benefit from the patterns and infrastructure already established by the App/Platform track.
┌──────────────────────────────────────────────────────────────────────┐
│ Quality Guild │
│ (All QEs + interested engineers meet bi-weekly) │
│ (Guild lead coordinates standards, hiring, career growth) │
├──────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────────────┐ ┌──────────────────────┐ │
│ │ App/Platform Track │ │ Intelligence Track │ │
│ │ │ │ (ML / Data) │ │
│ │ Embedded QEs │ │ Embedded QE(s) │ │
│ │ (3-5 across │ │ (1-2 across │ │
│ │ App + Platform │ │ Pricing, Forecast, │ │
│ │ teams) │ │ Elasticity, Data) │ │
│ │ │ │ │ │
│ │ • Test strategy │ │ • Data quality │ │
│ │ • Dev coaching │ │ • Pipeline testing │ │
│ │ • Exploratory testing│ │ • Model validation │ │
│ │ • Quality metrics │ │ • ML test coaching │ │
│ │ • Playwright E2E │ │ • Great Expectations │ │
│ │ • Pact contracts │ │ • Golden file tests │ │
│ └──────────────────────┘ └──────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────┐ │
│ │ Quality Engineering Team (2-3 people) │ │
│ │ Led by Staff/Lead QE │ │
│ │ │ │
│ │ Shared across both tracks: │ │
│ │ • CI/CD quality gates (GitHub Actions templates) │ │
│ │ • Flaky test detection & remediation systems │ │
│ │ • Quality dashboards (DataDog — per team, per track) │ │
│ │ • AI code testing tools (CodeRabbit, Augment Code config) │ │
│ │ │ │
│ │ App/Platform-specific: Intelligence-specific: │ │
│ │ • Testcontainers configs • Great Expectations infra │ │
│ │ • Playwright infra • MLflow validation gates │ │
│ │ • Pact broker management • Data drift monitoring │ │
│ │ • Test data factories • Pipeline test frameworks │ │
│ └──────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────┘
App/Platform Track
Embedded Quality Engineers (3-5 people):
- 1 QE serves 2-4 App or Platform teams — lean, coaching-focused
- Embedded in team ceremonies (standups, planning, retros)
- Focuses on test strategy, exploratory testing, and developer coaching
- Expertise: Java/Spring Boot testing, React/TypeScript testing, Playwright E2E, Pact contracts
- Does NOT do all the testing — developers own their tests
Intelligence Track (ML / Data) — Phase 2
This track is proposed as a second phase, introduced once the Quality Guild is established with the App/Platform track (see phasing note above). During Phase 1, Intelligence teams benefit from shared guild standards (Ruff, pre-commit, coverage visibility) and Quality Engineering infrastructure, but the dedicated Intelligence QE(s) and ML-specific practices are added in Phase 2.
Embedded Quality Engineer(s) (1-2 people):
- Serves Pricing, Forecasting, Elasticity, Anomaly Detection, and Data Pipeline teams
- Different skill set than App/Platform QEs — requires Python, data engineering, and ML pipeline experience
- Focuses on data quality strategy (Great Expectations), pipeline testing, model validation gates, and golden file testing
- Coaches ML engineers on testing the testable parts — utility functions, transformations, API layers — rather than forcing the honeycomb onto ML training code
- Does NOT validate model accuracy (that's the ML engineer's domain) — instead ensures the infrastructure for validation exists (MLflow gates, champion/challenger pipelines, drift detection)
Why Intelligence needs its own QE track:
- The technology stack (Python, LightGBM, Airflow, MLflow) shares almost nothing with the App stack (Java, React, Playwright)
- Quality risks are different: data drift and model degradation vs. UI bugs and API contract violations
- Testing tools are different: Great Expectations and golden file testing vs. Playwright and Pact
- A QE trained in Playwright and Pact cannot coach an ML engineer on data validation or pipeline testing — and vice versa
- However, both tracks share governance, CI/CD infrastructure, and quality culture — hence one guild, two tracks
Quality Engineering Team (2-3 people, led by Staff/Lead QE):
The team is led by the Staff/Lead QE (L6) — the most senior technical IC in the guild. This person:
- Sets the technical vision for test automation across the organization — framework choices, infrastructure design, migration strategies (e.g., Selenium-to-Playwright)
- Designs reusable CI/CD quality gate architecture — GitHub Actions templates, quality gate tiers, pipeline optimization
- Makes tooling decisions — evaluates and selects tools (Playwright over Cypress, Pact for contracts, Great Expectations for data quality), defines integration patterns
- Acts as the technical counterpart to the Guild Lead — the Guild Lead owns people and governance, the Staff/Lead QE owns technical strategy and infrastructure
- Mentors Quality Engineering team members and embedded QEs on automation best practices
The team as a whole:
- Builds and maintains shared test infrastructure across both tracks
- Owns CI/CD quality gates and pipeline optimization
- Manages flaky test detection, remediation systems, and quality dashboards
- Develops AI-specific testing strategies and tooling (CodeRabbit, Augment Code)
- Intelligence-specific: maintains Great Expectations infrastructure, MLflow validation gates, data drift monitoring
- App/Platform-specific: maintains Testcontainers configs, Playwright infrastructure, Pact broker
Quality Guild:
- All QEs from both tracks + interested developers and ML engineers meet bi-weekly
- Guild lead (can be an Engineering Manager) coordinates standards, hiring, and career growth
- Two-track agenda: shared topics (CI/CD, AI code quality, metrics) + rotating track-specific deep dives
- Cross-pollination between tracks: App engineers learn about data quality; ML engineers learn about contract testing
- Shared playbooks where practices overlap (e.g., pytest best practices, pre-commit hooks, coverage reporting)
2.3 Ratios and Sizing
| Role | Track | Count | Ratio | Notes |
|---|---|---|---|---|
| Embedded QE (App/Platform) | App/Platform | 3-5 | 1 QE : 2-4 teams | Start with 2 in pilot, scale based on results |
| Embedded QE (ML/Data) | Intelligence | 1-2 | 1 QE : 3-5 teams | Phase 2. Requires Python + ML pipeline experience; start with 1 |
| Quality Engineering | Shared | 2-3 | Central team | Led by Staff/Lead QE + 1-2 Engineers |
| Guild Lead | Shared | 1 | Part of EM role | Coordinates both tracks, not a full-time role |
| Total QE headcount | Both | 7-11 | ~1 QE : 12-19 devs | Lean — quality is a shared responsibility |
Hiring priority and phasing: Start with App/Platform QEs in Phase 1 (largest surface area, most teams, highest test debt in Selenium/Cypress migration). The Intelligence track QE is hired in Phase 2 once the guild is operational — this person needs a rare blend of testing expertise and data/ML familiarity, so allow longer time to hire. Intelligence teams still participate in the guild from Phase 1 and benefit from shared standards and infrastructure.
This is significantly leaner than traditional QA models (1:3-5 ratios) because developers own testing. QE enables and elevates rather than replaces.
3. Role Definitions
3.1 The Quality Engineer Role (Proposed Primary Role)
The industry is converging on the Quality Engineer as the dominant quality role in modern SaaS. It replaces the traditional QA Engineer role, emphasizing prevention over detection and coaching over gatekeeping.
Philosophy shift:
- FROM: "QA finds bugs after development" (detective work)
- TO: "QE prevents bugs by improving the system" (preventive work)
What a Quality Engineer's week looks like at Duetto:
| Day | Activities |
|---|---|
| Monday | Sprint planning — reviewing stories for testability, suggesting acceptance criteria |
| Tuesday | Pair programming with developer on test strategy for new microservice |
| Wednesday | Analyzing test metrics dashboard, identifying flaky tests, reviewing CI pipeline health |
| Thursday | Exploratory testing on high-risk feature, writing up findings and risk assessment |
| Friday | Quality Guild meeting — sharing testing patterns for event-driven architecture |
3.2 Role Comparison
| Dimension | QA Engineer (traditional) | Test Automation Eng. | SDET | Quality Engineer (proposed) |
|---|---|---|---|---|
| Primary focus | Test execution & defect finding | Automated test creation | Test tooling & infrastructure | Quality strategy & enablement |
| Coding level | Light (scripting) | Moderate (test code) | Heavy (production-grade) | Moderate (tooling + automation) |
| Production code | Rarely | Never | Frequently | Sometimes |
| Manual testing | Significant | Minimal | Rare | Strategic/exploratory only |
| Developer coaching | Minimal | Some | Some (on testability) | Primary responsibility |
| Quality strategy | Limited | No | Some | Primary responsibility |
| Typical dev ratio | 1:3-5 | 1:5-8 | 1:8-15 | 1:10-20 |
3.3 Why Not SDET?
The SDET role (Software Development Engineer in Test) was pioneered by Microsoft in the early 2000s at a 1:1 SDE-to-SDET ratio. Microsoft eliminated the title in 2014, merging all SDETs into SDEs under a "Combined Engineering" model. Google similarly moved away from dedicated SETs (Software Engineers in Test) around 2016-2018.
Why it declined:
- The separate role created a "someone else will test my code" mentality
- SDETs often became test-only engineers despite being hired as software engineers
- The 1:1 ratio was expensive and unsustainable
- Modern CI/CD made developer-owned testing more practical
What replaced it: Quality Engineers (coaching model) + Platform/Infrastructure Engineers (shared tooling) + developer-owned testing. This is the model we propose for Duetto.
3.4 Engineer Testing Responsibilities
In the hybrid/guild model, developers own testing — QEs coach and enable, but do not replace engineer responsibility. Every engineer is expected to contribute across the testing honeycomb.
Unit Tests (40% of effort — owned by engineers):
- Write unit tests for all new business logic, domain models, and utility functions
- Maintain test coverage for code they modify — no PR without corresponding tests
- Use JUnit 5 + Mockito (Java) or Jest + React Testing Library (frontend)
- Focus on behavior verification, not implementation testing: test what the code does, not how it does it
- AI-generated code requires the same (or higher) testing standard — engineers must verify AI-generated tests are meaningful, not tautological
- Target: every function with branching logic or business rules has a corresponding test
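To make the "meaningful, not tautological" distinction concrete, here is a minimal sketch in Python. The `apply_seasonal_discount` function and its threshold are hypothetical, invented only for illustration; the point is the difference between a test that mirrors the implementation and one that pins the business rule to independent expected values.

```python
# Hypothetical pricing helper, used only to illustrate the testing guidance above.
def apply_seasonal_discount(rate: float, occupancy: float) -> float:
    """Discount the rate by 10% when forecast occupancy is below 40%."""
    if not 0.0 <= occupancy <= 1.0:
        raise ValueError("occupancy must be between 0 and 1")
    return round(rate * 0.9, 2) if occupancy < 0.4 else rate

# Tautological: re-derives the expectation from the same logic as the code,
# so it passes even if the business rule itself is wrong.
def test_discount_tautological():
    rate, occupancy = 200.0, 0.3
    expected = round(rate * 0.9, 2) if occupancy < 0.4 else rate  # mirrors the code
    assert apply_seasonal_discount(rate, occupancy) == expected

# Behavioral: pins the rule to concrete expected values, including the boundary
# and the invalid-input case.
def test_discount_behavior():
    assert apply_seasonal_discount(200.0, 0.3) == 180.0   # below threshold: discounted
    assert apply_seasonal_discount(200.0, 0.4) == 200.0   # boundary: no discount
    try:
        apply_seasonal_discount(200.0, 1.5)
        assert False, "expected ValueError for invalid occupancy"
    except ValueError:
        pass
```

Both tests pass today, but only the behavioral one would catch a wrong threshold or a wrong discount factor — the kind of check engineers should apply when reviewing AI-generated tests.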
Integration Tests (50% of effort — engineers with QE guidance):
- Write integration tests for service-to-service communication, database queries, message handling, and GraphQL resolvers
- Use Testcontainers for real infrastructure (MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ) instead of mocks
- Write Pact consumer contracts when consuming another team's API or events
- Test the full request-response cycle through your service, not just internal methods
- QEs help define the integration test strategy and identify critical boundaries; engineers execute
- For event-driven services: test the complete publish → queue → consume → side-effect path
E2E Tests (10% of effort — engineers + QE collaboration):
- Write Playwright E2E tests for critical user journeys that touch your team's features
- Focus on happy paths and high-risk scenarios — E2E tests are expensive to maintain
- Follow the Page Object pattern for maintainable test code
- QEs define which journeys need E2E coverage and review test design; engineers implement
- Run E2E tests locally before pushing — don't rely solely on CI
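The Page Object pattern referenced above can be sketched as follows. To keep the example self-contained, `FakePage` stands in for a real Playwright `page` object; the `LoginPage` class and its selectors are hypothetical, not taken from any Duetto suite. The structural idea is what matters: tests talk to a page object's methods, never to raw selectors.

```python
# Stand-in for a Playwright page; records actions instead of driving a browser.
class FakePage:
    def __init__(self):
        self.actions = []
    def goto(self, url): self.actions.append(("goto", url))
    def fill(self, selector, value): self.actions.append(("fill", selector, value))
    def click(self, selector): self.actions.append(("click", selector))

class LoginPage:
    """Page object: encapsulates selectors and interactions for one screen."""
    URL = "/login"
    def __init__(self, page):
        self.page = page
    def login(self, user, password):
        self.page.goto(self.URL)
        self.page.fill("#username", user)
        self.page.fill("#password", password)
        self.page.click("button[type=submit]")

# A test depends only on the page object's API; if a selector changes,
# only LoginPage changes, not every test that logs in.
page = FakePage()
LoginPage(page).login("qe@example.com", "secret")
assert ("click", "button[type=submit]") in page.actions
```

With real Playwright, `FakePage` is replaced by the `page` fixture and the page object code stays the same, which is the maintainability payoff of the pattern.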
General expectations for all engineers:
| Responsibility | Expectation |
|---|---|
| Test with every PR | No production code change merges without corresponding tests |
| Fix broken tests | If your change breaks a test, you fix it — not the QE |
| Flaky test ownership | If you wrote a test that becomes flaky, you own remediation (within SLA) |
| Review test quality | In code reviews, evaluate test coverage and quality, not just production code |
| Test data management | Use shared factories and fixtures; don't hardcode test data |
| AI-generated test review | Critically evaluate AI-generated tests: check for tautological assertions, missing edge cases, and implementation coupling |
3.5 Career Ladder for Quality Engineers
| Level | Title | Focus |
|---|---|---|
| L1-L2 | Junior QE | Learns testing fundamentals, assists with test strategy, basic automation |
| L3-L4 | Quality Engineer | Defines test strategy for a team, coaches developers, builds automation |
| L5 | Senior QE | Test strategy across multiple teams, drives quality initiatives, analyzes metrics |
| L6 | Staff/Lead QE | Org-wide quality strategy, defines quality engineering practices, influences engineering culture. Leads the Quality Engineering Team — owns technical vision for automation infrastructure, framework decisions, and CI/CD quality gate design |
| L7 | Principal QE / Head of Quality | Sets quality vision for the organization, industry thought leadership |
4. Testing Strategy
4.1 The Testing Honeycomb (Not the Pyramid)
The traditional testing pyramid (many unit tests, fewer integration, fewer E2E) was designed for monoliths. For Duetto's microservices migration, the testing honeycomb (Spotify model) is more appropriate:
Why the pyramid doesn't work for microservices: The majority of bugs in microservices occur at service boundaries — serialization mismatches, API contract violations, message schema drift, network timeout handling. Unit tests within a single service catch fewer of these real-world failures.
The honeycomb model:
| Layer | % of effort | What to test | Tools |
|---|---|---|---|
| E2E / UI tests | ~10% | Critical user journeys through the full stack | Playwright |
| Integration tests | ~50% | Service-to-service communication, database interactions, message handling, GraphQL resolvers | Testcontainers, Spring Boot Test, Pact |
| Unit tests | ~40% | Complex business logic (pricing algorithms, revenue calculations) | JUnit 5, Jest, React Testing Library |
Key insight: In microservices, integration tests are the most valuable. Internal logic within a well-designed microservice is often simple; the complexity lives in the interactions.
4.2 Testing by Architecture Layer
Backend (Java / Spring Boot)
| Test Type | What | Tools | Who Writes |
|---|---|---|---|
| Unit tests | Business logic, domain models, utilities | JUnit 5, Mockito | Developers |
| Integration tests | Repository operations, service interactions, message handlers | Testcontainers (MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ), Spring Boot Test | Developers + QE guidance |
| API tests | REST/GraphQL endpoints, request/response validation | MockMvc, Spring Boot Test, Apollo subgraph testing | Developers |
| Contract tests | Service boundary contracts | Pact (pact-jvm) | Developers + QE |
| Performance tests | Load, latency, throughput | k6 (with DataDog integration) | QE / Quality Engineering |
Frontend (React / TypeScript / Next.js)
| Test Type | What | Tools | Who Writes |
|---|---|---|---|
| Unit tests | Component logic, hooks, utilities, state management | Jest, React Testing Library | Developers |
| Component tests | Visual + behavioral component validation | Storybook, React Testing Library | Developers |
| Integration tests | Apollo Client queries, form flows, multi-component interactions | Jest + MockedProvider, Playwright | Developers |
| E2E tests | Critical user journeys, cross-page flows | Playwright | Developers + QE |
| Visual regression | UI consistency, design system compliance | Playwright built-in screenshots | QE / Quality Engineering |
| Accessibility | WCAG 2.1 AA compliance | axe-core, Playwright accessibility assertions | Developers + QE |
GraphQL (Apollo Federation)
| Test Type | What | Tools |
|---|---|---|
| Schema validation | Breaking change detection | Apollo rover subgraph check in CI |
| Resolver unit tests | Individual resolver logic | DGS test utilities / graphql-java, Jest + MockedProvider |
| Composition tests | Cross-subgraph query resolution | Apollo Router in test mode |
| Contract tests | Consumer-provider contracts for subgraphs | Pact (supports GraphQL) |
| Operation tests | Real query regression against test env | Apollo Studio operation checks |
Event-Driven Architecture (SQS, SNS, Kinesis, RabbitMQ)
| Test Type | What | Tools |
|---|---|---|
| Handler unit tests | Message processing logic, idempotency | JUnit 5, Jest |
| Integration tests | End-to-end message flow (publish → queue → consume → side effects) | Testcontainers (LocalStack, RabbitMQ) |
| Contract tests | Message schema compatibility | Pact message contracts |
| Ordering/failure tests | Out-of-order delivery, DLQ routing, batch failures | Testcontainers + custom scenarios |
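The idempotency requirement in the "Handler unit tests" row is worth one concrete sketch: with at-least-once delivery (SQS, RabbitMQ), the same message can arrive twice, and the handler must apply its side effect exactly once. The `PaymentHandler` class and message shape below are hypothetical, and the in-memory set stands in for a durable deduplication store.

```python
# Sketch of an idempotent message handler and the unit test that pins it down.
class PaymentHandler:
    def __init__(self):
        self.processed_ids = set()   # in production: a durable store, not memory
        self.applied = []            # recorded side effects, for the test
    def handle(self, message: dict) -> None:
        msg_id = message["id"]
        if msg_id in self.processed_ids:
            return                   # duplicate delivery: ignore silently
        self.processed_ids.add(msg_id)
        self.applied.append(message["amount"])

handler = PaymentHandler()
msg = {"id": "evt-1", "amount": 100}
handler.handle(msg)
handler.handle(msg)               # simulate at-least-once redelivery
assert handler.applied == [100]   # side effect applied exactly once
```

The same test shape extends to the Testcontainers-based integration layer: publish the same message twice through a real queue and assert the downstream side effect happened once.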
4.3 Contract Testing with Pact
Contract testing is critical during the monolith-to-microservices migration. It verifies that services agree on their API contracts without needing full integration environments.
How Pact works:
- Consumer side: Write tests describing expected interactions → Pact generates a contract JSON
- Pact Broker: Contracts are published to a central broker
- Provider side: Provider runs verification tests against the contract
- Can-I-Deploy: Before deploying, `pact-broker can-i-deploy` checks that all contracts are verified
When to introduce Pact at Duetto:
- Start writing contracts for the modules you plan to extract from the monolith next
- The monolith acts as "consumer" of the new service; React/Next.js frontends are also consumers
- Use Pact message contracts for SQS/SNS/Kinesis event schemas
- Pact supports GraphQL interactions natively
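The consumer/provider flow above can be illustrated with a deliberately simplified sketch. This is the shape of what Pact automates, not the pact-jvm or pact-python API; the service names, the request, and the response schema are all invented for illustration.

```python
# Conceptual sketch of consumer-driven contract testing.
# Step 1 (consumer side): the consumer's test produces a contract describing
# the interaction it depends on.
consumer_contract = {
    "consumer": "monolith",
    "provider": "rate-service",
    "interaction": {
        "request": {"method": "GET", "path": "/rates/hotel-42"},
        "response_schema": {"hotel_id": str, "rate": float, "currency": str},
    },
}

def provider_handles(request):
    """Stand-in for the real provider; Pact verification hits the live provider."""
    return {"hotel_id": "hotel-42", "rate": 189.0, "currency": "USD"}

def verify(contract):
    """Step 2 (provider side): replay the request, check the response shape."""
    interaction = contract["interaction"]
    response = provider_handles(interaction["request"])
    schema = interaction["response_schema"]
    return set(response) == set(schema) and all(
        isinstance(response[k], t) for k, t in schema.items()
    )

# Step 3: only if every contract verifies is the provider safe to deploy,
# which is what `pact-broker can-i-deploy` gates on.
assert verify(consumer_contract)
```

In real Pact the contract JSON is generated from the consumer's tests and published to the broker, and verification runs against the actual provider, but the agreement being checked is exactly this: the provider's response satisfies what each consumer recorded.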
4.4 E2E Framework Consolidation: Playwright
Duetto currently uses both Playwright and Cypress. We propose consolidating on Playwright.
| Dimension | Playwright | Cypress |
|---|---|---|
| Browser support | Chromium, Firefox, WebKit (Safari) | Chromium, Firefox, WebKit (experimental) |
| Language support | JS, TS, Python, Java, .NET | JS, TS only |
| Multi-tab/window | Full support | Not supported |
| Parallel execution | Built-in (worker-based) | Requires Cypress Cloud (paid) |
| CI performance | 2-3x faster in parallel scenarios | Slower (sequential default) |
| Visual regression | Built-in screenshot comparison | Via plugins (Percy, Applitools) |
| API testing | Full HTTP client (`request` context) | Limited (`cy.request`) |
| Cost | Fully open source | Core OSS, paid Cloud features |
| Next.js integration | Native via `@playwright/test` | Limited |
Why Playwright wins for Duetto:
- Java bindings — backend teams can write integration tests in familiar tools
- Native parallelization — can cut CI pipeline time by roughly 40-60%
- Next.js (BLAST) compatibility — tests server components, edge functions, middleware
- Apollo Federation testing — request context is superior for GraphQL API testing
- Cost — fully open source vs. Cypress Cloud (paid for parallelization and dashboard)
Migration strategy:
1. Stop writing new Cypress tests immediately
2. Write all new E2E tests in Playwright
3. Gradually migrate critical Cypress tests (prioritize by business value)
4. Set a deadline (~6 months) for full Cypress decommission
4.5 Testing During the Monolith Migration
The monolith-to-microservices migration creates a period where the same functionality exists in both systems. Testing strategies must account for this duality.
Strangler Fig Testing Strategy:
| Testing Layer | What | When |
|---|---|---|
| Contract tests | Define contracts before extracting functionality | Before each service extraction |
| Routing tests | Verify the strangler facade correctly directs traffic | During migration |
| Parallel run tests | Route to both old and new, compare responses | During high-risk extractions |
| Data consistency tests | Verify data migration completeness and dual-write consistency | When service takes data ownership |
| Rollback tests | Verify switching back to monolith works | Before each cutover |
Parallel run testing (for high-risk extractions):
Request --> [Router]
|
+---> [Monolith] ---> Response (returned to user)
|
+---> [New Service] ---> Response (logged, compared)
Route a shadow copy of requests to the new service, compare responses, and track the discrepancy rate. When it drops below 0.1%, switch primary. Tools: Scientist4J pattern, DataDog for tracking comparison results.
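The comparison loop at the heart of parallel run testing is small enough to sketch. Both backends below are stubs invented for illustration (the "new service" deliberately introduces a rounding difference); in a real run the monolith's response goes to the user while the new service's response is only logged and diffed, with the discrepancy rate shipped to DataDog.

```python
# Sketch of the parallel-run (shadow traffic) comparison described above.
def monolith(req):
    return {"price": req["base"] * 1.2}

def new_service(req):
    # hypothetical extraction under test; rounds differently from the monolith
    return {"price": round(req["base"] * 1.2, 2)}

def parallel_run(requests, tolerance=1e-9):
    """Send each request to both backends and return the discrepancy rate."""
    mismatches = 0
    for req in requests:
        primary = monolith(req)       # this response is served to the user
        shadow = new_service(req)     # this one is only compared and logged
        if abs(primary["price"] - shadow["price"]) > tolerance:
            mismatches += 1
    return mismatches / len(requests)

rate = parallel_run([{"base": b} for b in (100.0, 149.99, 200.0)])
# 149.99 * 1.2 rounds differently, so this extraction is not ready:
assert rate > 0.001   # above the 0.1% threshold, keep the monolith as primary
```

Switching primary only when the measured rate stays below the threshold over a sustained window (not a single sample) is what makes this safe for high-risk extractions.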
4.6 Testing Strategy for Intelligence Teams (ML/Data)
Phase 2 scope. The Intelligence-specific practices in this section are designed to be introduced as a second phase of the Quality Guild, after the App/Platform track is established (see Section 2.2). During Phase 1, Intelligence teams adopt shared standards (linting, pre-commit, coverage visibility) and benefit from Quality Engineering infrastructure. The dedicated Intelligence QE, ML-specific quality gates, and full testing diamond are introduced in Phase 2.
The Intelligence domain (Pricing, Forecasting, Elasticity, Anomaly Detection, Optimization) operates on a fundamentally different technology stack and development lifecycle than App and Platform teams. The testing honeycomb (Section 4.1) applies to the Java services within Intelligence, but the core ML training pipelines, data engineering, and algorithmic code require a distinct quality approach.
4.6.1 Intelligence Domain Technology Landscape
| Dimension | App / Platform Teams | Intelligence Teams |
|---|---|---|
| Primary language | Java 17 + TypeScript/React | Python 3.10-3.12 (dominant), some Java |
| Frameworks | Spring Boot, Next.js, Apollo GraphQL | Flask, LightGBM, Prophet, Ray, PyTorch, Optuna, PuLP/HiGHS |
| Code nature | CRUD APIs, UI components, service orchestration | ML training pipelines, optimization solvers, statistical models, feature engineering |
| Data stores | MongoDB, PostgreSQL, Redis | S3, MLflow, DynamoDB, Athena/Glue, PostgreSQL (caching) |
| Deployment | Kubernetes, continuous delivery | Docker containers, AWS Lambda, SSM-based version promotion (stage→demo→prod) |
| Orchestration | Request/response APIs | Airflow DAGs, DynamoDB config-driven jobs, Ray distributed training |
| Testing maturity | ~3,100+ tests across repos | ~100 tests across all ML repos; some repos have 3 tests for 30K+ LOC |
Key repos analyzed: pricerator (pricing engine, 82 source files, 47 test files — best-tested), forecasting (LightGBM + Prophet, 30.5K LOC, 3 tests), ml_elasticity (DoubleML, 50+ tests), ml_pricing_engine (LP/MILP optimizer, 15 tests), intelligence-domain (Java/Spring Boot with JaCoCo), datapipelines (31+ Airflow DAGs, Great Expectations).
4.6.2 Why the Standard Honeycomb Doesn't Fully Apply
The testing honeycomb (unit → integration → E2E) assumes request/response services where bugs live at boundaries. In ML systems, the primary quality risks are different:
| Risk Category | What Can Go Wrong | Standard Honeycomb Coverage |
|---|---|---|
| Model accuracy degradation | New training run produces worse predictions than previous version | Not covered — no concept of "model accuracy" in unit/integration tests |
| Data quality drift | Input data schema changes, nulls appear in critical columns, distributions shift | Partially covered — data validation is not traditional testing |
| Feature engineering bugs | Incorrect temporal joins, data leakage across train/test split, wrong aggregation windows | Partially covered — unit tests can catch some, but the bugs are subtle and domain-specific |
| Training pipeline failure | OOM during distributed training, config-driven jobs silently skip steps | Not covered — pipeline orchestration testing is distinct |
| Numerical instability | Floating-point issues in optimization solvers, edge cases in elasticity calculations | Partially covered — requires property-based and boundary testing |
| Reproducibility failure | Same config + data produces different results across runs | Not covered — requires deterministic seeding and environment locking |
4.6.3 The ML Testing Diamond
For Intelligence teams, we propose a testing diamond adapted from Google's ML testing guidelines and industry practice:
┌───────────────┐
│ Model │ ~5% of effort
│ Validation │ Accuracy, backtesting, champion/challenger
├───────────────┤
┌─┤ Pipeline ├─┐ ~15% of effort
│ │ Tests │ │ DAG correctness, config validation, idempotency
│ ├───────────────┤ │
┌─┤ │ Data │ ├─┐ ~30% of effort
│ │ │ Quality │ │ │ Schema enforcement, drift detection, expectations
│ │ ├───────────────┤ │ │
│ │ │ Unit + │ │ │ ~50% of effort
│ │ │ Integration │ │ │ Transformations, utilities, API contracts
└─┴─┴───────────────┴─┴─┘
| Layer | % Effort | What to Test | Tools | Who |
|---|---|---|---|---|
| Unit + Integration | ~50% | Data transformations, utility functions, API endpoints, service contracts | pytest, unittest, moto (AWS mocking), Testcontainers, Pact | Engineers |
| Data Quality | ~30% | Input schema validation, null/range checks, distribution drift, referential integrity | Great Expectations (already in datapipelines), Pandera, custom validators | Engineers + QE |
| Pipeline Tests | ~15% | Airflow DAG validation, config-driven job correctness, idempotency, failure recovery | pytest-docker, DAG unit tests, golden file testing | Engineers + QE |
| Model Validation | ~5% | Accuracy metrics vs baseline, backtesting on holdout data, champion/challenger comparison | MLflow (already in use), custom metrics, SHAP analysis | ML Engineers + QE |
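The Data Quality layer's checks (schema, nulls, ranges) can be shown with a hand-rolled sketch; Great Expectations and Pandera provide the same idea as declarative, reusable suites. The column names and expectations below are hypothetical, chosen only to mirror the kinds of rules an Intelligence pipeline would declare.

```python
# Hand-rolled sketch of declarative data expectations for one pipeline stage.
EXPECTATIONS = {
    "occupancy": {"type": float, "min": 0.0, "max": 1.0, "nullable": False},
    "room_type": {"type": str, "nullable": False},
}

def validate_rows(rows):
    """Return a list of (row_index, column, problem) tuples; empty means pass."""
    failures = []
    for i, row in enumerate(rows):
        for col, exp in EXPECTATIONS.items():
            value = row.get(col)
            if value is None:
                if not exp["nullable"]:
                    failures.append((i, col, "null"))
                continue
            if not isinstance(value, exp["type"]):
                failures.append((i, col, "type"))
            elif "min" in exp and not exp["min"] <= value <= exp["max"]:
                failures.append((i, col, "range"))
    return failures

rows = [
    {"occupancy": 0.85, "room_type": "deluxe"},
    {"occupancy": 1.7,  "room_type": "suite"},     # out of range
    {"occupancy": None, "room_type": "standard"},  # null in a required column
]
assert validate_rows(rows) == [(1, "occupancy", "range"), (2, "occupancy", "null")]
```

The key design choice, matching the table above, is that a non-empty failure list blocks the pipeline stage rather than merely logging, so bad data never reaches training.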
4.6.4 Testing by Intelligence Architecture Layer
ML Training Pipelines (Python — forecasting, ml_elasticity, ml_pricing_engine)
| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| Utility unit tests | Data transformations, feature engineering functions, math utilities | pytest, numpy.testing | Low (~15 tests in ml_pricing_engine, 3 in forecasting) | Every transformation function has tests |
| Data validation | Input schema checks, null detection, range validation, distribution alerts | Great Expectations, Pandera | Present in datapipelines; absent in training repos | Validation at every pipeline stage |
| Golden file tests | Known input → expected output regression tests | pytest + snapshot files | Present in group-forecast-service | Expand to all model inference paths |
| Model validation | Accuracy vs previous version, backtesting, metric tracking | MLflow metrics, custom | MLflow experiment tracking in place | Automated champion/challenger gates |
| Config validation | YAML/DynamoDB config correctness, required fields, value ranges | JSON Schema, Pydantic | Pydantic validation in pricerator | All config-driven jobs validate before execution |
| Reproducibility | Same config + data → same results | Seed locking, pytest-randomly | Partial (deterministic seeds in some models) | Deterministic training with CI reproducibility check |
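The reproducibility row above reduces to one testable property: the same config and data must produce the same result across runs. A minimal sketch, with a toy "training" step standing in for a real model fit; real pipelines would additionally pin library versions and seed numpy/LightGBM, not just the stdlib RNG.

```python
import random

def train(config: dict, data: list) -> float:
    """Toy stand-in for a training run; returns a 'metric' for the model."""
    # Seed comes from config, via a local Random instance, never global state.
    rng = random.Random(config["seed"])
    sample = rng.sample(data, k=config["sample_size"])
    return sum(sample) / len(sample)

config = {"seed": 42, "sample_size": 3}
data = [1.0, 2.0, 3.0, 4.0, 5.0]

run_a = train(config, data)
run_b = train(config, data)
assert run_a == run_b   # deterministic: a CI job can rerun and diff the results
```

The CI reproducibility check named in the table is exactly this assertion at pipeline scale: run training twice from the same config and fail the build if the outputs diverge.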
Pricing/Optimization Services (Python — pricerator, group-forecast-service)
| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| API tests | Flask endpoint validation, request/response schemas | pytest + Flask test client | Good coverage in pricerator (47 test files) | Maintain; add contract tests for consumers |
| Algorithm tests | Pricing step correctness, constraint application, rate selection | pytest, property-based testing (Hypothesis) | Unit tests exist | Expand with property-based tests for edge cases |
| Integration tests | External API client calls (Rate API, Elasticity API, Monolith) | pytest-mock, moto (S3), responses | Present with moto for S3 | Add contract tests for upstream ML services |
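To illustrate the property-based row above: the core idea, which Hypothesis automates with input generation and shrinking of failing examples, is asserting invariants over many random inputs rather than a handful of fixed cases. `apply_rate_constraints` is a hypothetical stand-in for a real pricing step:

```python
import random

# Hypothetical constraint step: clamp a recommended rate to a hotel's floor/ceiling.
def apply_rate_constraints(rate, floor, ceiling):
    return max(floor, min(rate, ceiling))

def check_constraint_properties(trials=1000, seed=42):
    """Hand-rolled property check (Hypothesis automates this pattern):
    for any inputs, the output stays within [floor, ceiling], and
    in-range rates pass through unchanged."""
    rng = random.Random(seed)
    for _ in range(trials):
        floor = rng.uniform(10, 200)
        ceiling = floor + rng.uniform(0, 500)
        rate = rng.uniform(-100, 1000)
        out = apply_rate_constraints(rate, floor, ceiling)
        assert floor <= out <= ceiling
        if floor <= rate <= ceiling:
            assert out == rate
    return True
```

With Hypothesis, the `rng.uniform` generation above becomes `@given(st.floats(...))` decorators and the library shrinks any failing input to a minimal counterexample.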
Intelligence Java Services (intelligence-domain, dynamic-optimization)
These repos align with the standard strategy (Sections 4.1-4.2):
- intelligence-domain — Spring Boot with JaCoCo, PostgreSQL, RabbitMQ, LLM enrichment (Claude)
- dynamic-optimization — Spring Boot with RabbitMQ, Awaitility for async testing, MockServer
Standard honeycomb applies: JUnit 5, Testcontainers, Pact contracts, Playwright for any UI exposure.
Data Pipelines (datapipelines)
| Test Type | What | Tools | Current State | Target |
|---|---|---|---|---|
| DAG validation | DAG structure, task dependencies, schedule correctness | Airflow test utilities, pytest | 31+ DAGs, no DAG-specific tests found | DAG structure tests for every pipeline |
| Data quality | Great Expectations suites per pipeline stage | Great Expectations, DataDog | Present in monitoring module | Expand to all critical pipelines; block on failures |
| Transformation tests | Spark/Glue job logic, SQL correctness | pytest, PySpark test utilities | Limited | Unit tests for all transformation functions |
4.6.5 Existing Intelligence Strengths to Preserve
The Intelligence domain already has quality practices that should be recognized and built upon:
| Practice | Where | Value |
|---|---|---|
| MLflow experiment tracking | forecasting, ml_elasticity, ml_pricing_engine | Model versioning, hyperparameter logging, SHAP analysis — the ML equivalent of a quality dashboard |
| MyPy strict mode | pricerator, ml_elasticity | Stricter type enforcement than any app team repo — prevents a class of runtime errors |
| Great Expectations | datapipelines | Data validation framework already in production — expand coverage |
| Golden file testing | group-forecast-service | Snapshot-based output validation — effective for deterministic inference |
| detect-secrets | pricerator, ml_elasticity (pre-commit) | Secret detection in pre-commit hooks — ahead of app teams |
| Ruff + pre-commit | All Python repos | Consistent linting and formatting already enforced |
| Semantic version promotion | pricerator, forecasting | stage-NEXT → demo-NEXT → prod-NOW pipeline with SSM gating |
| Hammer (pricing regression) | duetto monolith (hammer/ module) | Branch-vs-branch pricing optimization comparison with statistical aggregation — the most sophisticated existing regression testing tool in the org. Currently on-demand only (see Section 4.6.7) |
4.6.6 Intelligence-Specific Quality Gates
Supplement the standard CI/CD gates (Section 7) with ML-specific gates:
Phase 1 (align with org Phase 1):
- Ruff lint + format check (already present — standardize config across repos)
- MyPy strict mode (already on some repos — extend to all)
- pytest unit tests with coverage reporting via pytest-cov (establish baseline)
Phase 2:
- Great Expectations data quality checks as blocking gates in DAGs
- Golden file regression tests for inference endpoints
- Model accuracy comparison vs baseline (MLflow metric check) — warning, not blocking
Phase 3:
- Automated champion/challenger testing before model promotion
- Data drift detection alerts (via Great Expectations or custom monitors)
- pytest coverage threshold (start at 40% for ML repos — lower than app teams, but meaningful)
- Property-based testing (Hypothesis) for algorithmic code
4.6.7 Hammer: Pricing Regression Testing — Current State and Modernization
What Hammer is:
Hammer is an existing pricing optimization regression testing tool in the duetto monolith (hammer/ module). It is the most sophisticated quality validation tool currently in the Intelligence domain. It consists of two executables:
- `hammerOptimizer` — Runs the full pricing optimization for a set of hotels, generating rate recommendations, constrained/unconstrained forecasts, and rate sync data transactions (RSDT). Results are stored in S3 under a `{runName}/{branchName}/` prefix.
- `hammerDiffer` — Compares optimization outputs from two branches with statistical aggregation: mean squared error per hotel, rate deviation percentages, one-sided diffs (prices in one branch but not the other), and per-hotel forecast diffs. Exits with code 1 if forecast differences are detected.
Typical workflow (current — manual):
```shell
# 1. Run optimization on feature branch
gradle runHammer -PbranchName=feature-branch -PrunName=run1 ...
# 2. Run optimization on develop
gradle runHammer -PbranchName=develop -PrunName=run1 ...
# 3. Compare results
gradle runHammerDiffer -PfirstBranch=feature-branch -PsecondBranch=develop ...
```
Current limitations:
| Limitation | Impact |
|---|---|
| On-demand only | No CI/CD integration — pricing regressions can ship undetected if nobody runs Hammer |
| Manual 3-step process | Run branch A → run branch B → run differ. Easy to forget or misconfigure |
| Binary pass/fail | Any forecast diff triggers System.exit(1) — no tolerance threshold; a 0.001% rate deviation is treated the same as 50% |
| No historical tracking | Results live as text files in S3 with no dashboard, alerting, or trend analysis |
| Heavy monolith coupling | Loads the full Spring context (:api, :data, :query, :scheduler, :server) — slow startup, tightly coupled to monolith |
| No subset mode for CI | Running all active hotels is too slow for PR pipelines; no curated sample for fast feedback |
| Opaque reporting | Pipe-delimited text files in S3 — no structured output, no PR comments, no Slack notifications |
| No DataDog integration | The rest of the org uses DataDog for observability, but Hammer results are isolated |
Proposed modernization — 3 phases:
Phase 1 — Automate in CI (Months 1-3):
- GitHub Actions workflow for pricing PRs: Trigger Hammer automatically on PRs that modify pricing-related code paths (`api/src/**/pricing/**`, `api/src/**/optimizer/**`, `hammer/**`). Use path filters to avoid running on unrelated PRs.
- Representative hotel sample: Maintain a curated registry of 10-20 hotels (small, medium, large, different regions and configurations) for fast CI runs. Full hotel set reserved for nightly runs.
- Scheduled nightly run on `develop`: Compare `develop` against the last release tag. Full hotel set. Slack notification if diffs exceed threshold.
- Configurable tolerance thresholds: Replace binary pass/fail with threshold-based reporting — warn on small deviations (<0.5% mean rate deviation), block on large deviations (>2%). Allow expected changes to be annotated in the PR.
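A minimal sketch of the proposed tolerance logic, using the illustrative thresholds from the bullet above (warn over 0.5% mean rate deviation, block over 2%); the function name and signature are hypothetical, not existing Hammer code:

```python
# Sketch of threshold-based Hammer verdicts, replacing the current
# "any diff -> System.exit(1)" behavior. Thresholds are the proposal's
# illustrative numbers, tunable per pipeline.
WARN_PCT = 0.5
BLOCK_PCT = 2.0

def hammer_verdict(mean_rate_deviation_pct, expected_change=False):
    """Classify a branch-vs-branch pricing diff instead of hard-failing on any diff."""
    if expected_change:  # annotated in the PR as an intentional pricing change
        return "PASS"
    if mean_rate_deviation_pct > BLOCK_PCT:
        return "FAIL"
    if mean_rate_deviation_pct > WARN_PCT:
        return "WARN"
    return "PASS"
```

The three-way verdict maps directly onto CI: FAIL blocks the merge, WARN posts a PR comment, PASS stays quiet.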
Phase 2 — Structured reporting and observability (Months 3-6):
- Structured JSON output: Extend Hammer to produce machine-readable JSON alongside the current text format, enabling consumption by dashboards and CI tooling.
- PR comment with summary table: GitHub Actions posts a structured comment on pricing PRs:
- Hotels tested / hotels with diffs
- Mean rate deviation, largest deviation (hotel + percentage)
- Status: PASS / WARN / FAIL
- DataDog integration: Push Hammer results as custom metrics — rate deviation per hotel, number of affected hotels, MSE trends. Build a dashboard tracking pricing stability over time. Alert on trends ("pricing deviation increasing over last 5 runs").
Phase 3 — Decouple and modernize (Months 6-12):
- Containerize Hammer: Docker image with Hammer JAR + dependencies. Eliminates the need to build the full monolith to run Hammer. Runs on any CI runner or as a scheduled ECS task.
- Decouple from monolith Spring context: Extract a `PricingOptimizerService` interface so Hammer can operate with a lighter context. As pricing moves to microservices, Hammer transitions from loading code in-process to calling the pricing service HTTP endpoint.
- AI-assisted diff analysis: Feed Hammer diff output to Claude Code to generate human-readable summaries for PR reviewers — e.g., "This PR changes rates for 3 APAC hotels by an average of 2.3%, concentrated in room types rt1234 and rt5678 for Q3 dates. This is consistent with the PR description."
Target state:
| Dimension | Current | Target |
|---|---|---|
| Trigger | Manual, on-demand | Automatic on pricing PRs + nightly on develop |
| Scope | All hotels or manual selection | Curated sample for PRs, full set for nightly |
| Pass/fail | Binary (any diff = fail) | Threshold-based (warn / block with configurable tolerance) |
| Reporting | S3 text files | PR comments, DataDog dashboard, Slack alerts |
| Speed | Slow (full hotel set + Spring context) | Fast for PRs (sample + containerized), thorough overnight |
| Historical tracking | None | DataDog metrics with trend analysis |
| Monolith coupling | Full Spring context load | Containerized → eventually HTTP client to pricing service |
5. QA for AI-Generated Code
With 50-70% of Duetto's code AI-generated via Claude Code, QA strategy must explicitly address this reality.
5.1 How AI-Generated Code Changes Testing
| Traditional Code | AI-Generated Code |
|---|---|
| Developer understands every line they wrote | Developer reviews code they didn't write — attention gaps are common |
| Bugs correlate with developer skill and domain knowledge | Bugs correlate with prompt quality and review thoroughness |
| Testing verifies the developer's implementation | Testing must verify the AI's implementation AND the reviewer's understanding |
| Test quality depends on developer testing discipline | AI can generate tests too — but tautological tests (testing the implementation, not the behavior) are a known risk |
5.2 Specific Risks and Mitigations
| Risk | Mitigation |
|---|---|
| AI-generated code passes review but has subtle logic errors | Mutation testing to verify test suite effectiveness |
| AI generates tests that test the implementation, not the behavior | QE reviews test strategy, not just test code; property-based testing for complex logic |
| AI-generated code has security vulnerabilities (dependency confusion, injection, etc.) | SAST/DAST in CI, CodeRabbit security rules, dedicated security scanning |
| AI generates overly complex code when simple code would do | Complexity metrics in CI (warn on high cyclomatic complexity), code review standards |
| AI-generated tests provide false confidence (high coverage, low signal) | Mutation testing score as a quality gate, QE exploratory testing |
5.3 Mutation Testing
Mutation testing is the strongest technique for verifying test suite quality. It works by inserting small faults ("mutants") into production code and checking whether tests detect them.
Why it matters for AI-generated code:
- AI can generate tests with high line coverage that don't actually catch bugs
- Mutation testing reveals whether tests are truly effective, not just comprehensive
- Google uses mutation testing at scale (6,000+ engineers, 14,000+ code authors) via a diff-based probabilistic approach

Tools:
- PIT (pitest) — mutation testing for Java/JVM (Spring Boot, JUnit)
- Stryker — mutation testing for JavaScript/TypeScript (Jest, React)
Proposal: Start with mutation testing on critical business logic (pricing algorithms, revenue calculations) and expand. Use as a quality signal, not a hard gate initially.
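A hand-rolled illustration of the mechanism (PIT and Stryker automate the mutant generation and reporting): a high-coverage but assertion-weak test lets a mutant survive, while a behavioral test kills it. All function names here are hypothetical:

```python
# Original function and a "mutant" with one small injected fault.
def occupancy_rate(occupied, total):
    return occupied / total if total else 0.0

def occupancy_rate_mutant(occupied, total):
    return occupied / total if total else 1.0  # mutated fallback: 0.0 -> 1.0

def weak_test(fn):
    """Covers the (one-line) function fully, but never exercises the
    zero-total edge case, so the mutant passes too: it "survives"."""
    return fn(5, 10) == 0.5

def strong_test(fn):
    """Also asserts the zero-total behavior, so the mutant fails: it is "killed"."""
    return fn(5, 10) == 0.5 and fn(0, 0) == 0.0
```

A mutation score is simply the fraction of generated mutants killed; a suite where most mutants survive is giving false confidence regardless of its coverage number.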
5.4 AI Tools for QA
| Tool | What It Does | Applicability to Duetto |
|---|---|---|
| Claude Code | Generate test scaffolding, write test cases from specs, debug test failures | Primary — already in use. QE should develop testing-specific prompts and skills. |
| Diffblue Cover | AI-generated unit tests for Java | Evaluate for Spring Boot services — auto-generates JUnit tests for existing code |
| Playwright Codegen | Records browser actions and generates Playwright test code | Use for rapid E2E test creation; QE reviews and refines generated tests |
| CodeRabbit | AI code review that can enforce testing standards via path-based rules | Configure to flag PRs missing tests, enforce coverage thresholds, detect test anti-patterns |
| Augment | AI-assisted code review | Complement CodeRabbit for reviewing AI-generated code quality |
5.5 AI Code Review Tools for Testing Standards Enforcement
Two AI-powered code review tools are available in Duetto's stack. They serve complementary purposes for quality enforcement.
CodeRabbit
Automated AI code review with rule-based enforcement:
- Path-based instructions: Require tests for specific directories (e.g., `src/services/**` must have corresponding `src/__tests__/services/**`)
- Code guidelines: Reference the testing standards document so CodeRabbit checks compliance
- AST-based rules (Pro): Enforce patterns like "no console.log in production code" or "test files must use describe/it structure"
- Default protections: Automatically skips review of generated code, lock files, build artifacts
- Strengths: Excellent for systematic rule enforcement, configurable per-repo, catches structural test issues (missing test files, coverage regressions, anti-patterns)
Augment Code
AI-powered code review and development assistant with deep codebase context:
- Codebase-aware reviews: Augment indexes the full codebase to provide contextual review comments — it understands existing patterns and flags deviations
- Test quality analysis: Can identify when tests don't adequately cover the changed code, suggest missing test scenarios, and flag weak assertions
- Architecture awareness: Understands service boundaries and can flag when integration or contract tests are missing for cross-service changes
- IDE integration: Available in VS Code and JetBrains IDEs, providing real-time feedback during development (shift-left)
- Strengths: Superior contextual understanding of the codebase; better at catching logic-level issues and suggesting what should be tested rather than enforcing structural rules
Comparison and Proposed Usage
| Dimension | CodeRabbit | Augment Code |
|---|---|---|
| Primary mode | Automated PR review (CI) | IDE assistant + PR review |
| Rule enforcement | Strong — configurable path-based and AST rules | Moderate — guideline-based, not rule-based |
| Codebase context | Limited to PR diff + configured instructions | Deep — indexes full repository |
| Test gap detection | Structural (missing test files, coverage delta) | Semantic (missing scenarios, weak assertions) |
| Custom configuration | Extensive (.coderabbit.yaml, path instructions) | Moderate (team-level instructions) |
| Best for | Enforcing standards consistently at scale | Catching logic-level issues humans miss |
Proposal: Use the two tools as complements:
- CodeRabbit as the systematic enforcer — ensures every PR meets structural quality gates (test files exist, patterns followed, no anti-patterns)
- Augment Code as the intelligent reviewer — catches semantic issues like insufficient test scenarios, missing edge cases, and tests that don't match the intent of the change
- Configure CodeRabbit rules first (quick wins), then layer in Augment for deeper quality insights
6. Cross-Team Consistency
6.1 The Problem
With tens of autonomous teams, inconsistency is the default. Without intervention, each team:
- Chooses its own testing tools and patterns
- Has different coverage thresholds (or none)
- Writes tests at different levels (some heavy on unit, some on E2E, some on nothing)
- Has different CI pipeline configurations
- Handles flaky tests differently (or doesn't handle them at all)
6.2 The "Paved Road" Approach
Inspired by Stripe and Netflix: make the right thing easy, not mandatory.
Instead of mandating practices through policy, build infrastructure that makes good practices the path of least resistance:
| Paved Road | What It Means |
|---|---|
| Shared test templates | GitHub repo templates that include pre-configured test setup — App/Platform: Jest, JUnit, Playwright, Testcontainers; Intelligence: pytest, Great Expectations, golden file harness |
| CI pipeline templates | Reusable GitHub Actions workflows with quality gates built in — separate templates for Java/Spring, React/Next.js, and Python ML repos |
| Test data libraries | Shared factories and fixtures for common Duetto entities (hotels, rates, reservations, users); Intelligence: shared test data schemas and sample hotel model fixtures |
| Quality dashboard | DataDog dashboard showing quality metrics per team and per track — visibility drives behavior |
| Example repos | "Golden path" example services showing the proposed testing approach for backend, frontend, full-stack, and ML/data pipeline repos |
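As an illustration of the shared test data libraries row, a factory sketch in the Intelligence track's language; the entity fields are invented for the example, not Duetto's actual schema:

```python
import itertools

# Sketch of a shared test-data factory. Centralizing this avoids every team
# hand-rolling slightly different hotel/rate fixtures. Field names are illustrative.
_ids = itertools.count(1)

def make_hotel(**overrides):
    hotel = {
        "id": f"hotel-{next(_ids)}",
        "name": "Test Hotel",
        "region": "EMEA",
        "room_types": ["standard", "suite"],
    }
    hotel.update(overrides)  # tests override only the fields they care about
    return hotel

def make_rate(hotel, **overrides):
    rate = {
        "hotel_id": hotel["id"],
        "room_type": "standard",
        "amount": 100.0,
        "currency": "USD",
    }
    rate.update(overrides)
    return rate
```

Usage in a test is then one line, e.g. `make_hotel(region="APAC")`, which keeps test intent visible: only the field under test deviates from the defaults.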
6.3 Testing Standards Document
The Quality Guild should own a lightweight testing standards document. What to include and what to leave to teams:
Standardize (guild-owned — all teams):
- Test categorization (unit, integration, E2E, contract, data quality, pipeline — definitions and expectations)
- Minimum quality gates for CI/CD (what blocks a merge)
- Test naming conventions
- Flaky test policy (quarantine, SLA for remediation)
- Pre-commit hook standards (Ruff/ESLint, type checking, secret detection)
- Coverage reporting (visibility required; thresholds by track)

Standardize (App/Platform track):
- E2E framework choice (Playwright — not optional)
- Contract testing approach (Pact — for service boundaries)
- Testcontainers for integration tests (real infrastructure, not mocks)

Standardize (Intelligence track):
- Data quality framework (Great Expectations — for all data pipelines)
- Golden file testing for inference endpoints
- MyPy strict mode for all Python repos
- MLflow experiment tracking with minimum logged metrics
- Model promotion requires accuracy comparison vs baseline

Leave to teams:
- Internal test organization (file structure, test grouping)
- Mocking strategies (as long as they follow "test behavior not implementation")
- Which specific scenarios to test (teams know their domain best)
- Test execution speed optimization (teams manage their own pipeline budget)
- ML model architecture and hyperparameter choices
6.4 Quality Guild Cadence
| Activity | Frequency | Participants | Purpose |
|---|---|---|---|
| Guild meeting (all-hands) | Bi-weekly, 45 min | Both tracks + interested engineers | Shared topics: CI/CD, AI code quality, metrics review, cross-pollination |
| App/Platform deep dive | Monthly, 30 min | App/Platform QEs + leads | Track-specific: Playwright, Pact, Testcontainers, frontend testing |
| Intelligence deep dive | Monthly, 30 min | Intelligence QE + ML engineers | Track-specific: data quality, pipeline testing, model validation, MLflow |
| Standards review | Quarterly | Full guild | Update testing standards for both tracks based on what's working |
| Quality metrics review | Monthly | Guild lead + EM | Review dashboards by track, identify teams needing support |
| Tool evaluation | As needed | Relevant track | Evaluate new tools, make proposals |
| Onboarding | Per new QE/engineer | Relevant track QE | Testing expectations and tooling walkthrough (track-appropriate) |
7. Quality Gates in CI/CD
7.1 Progressive Adoption
Do not implement all gates at once. This creates overwhelming friction. Phase them in:
Phase 1 — Foundation (Weeks 1-4):
- Build verification (compilation, Docker image)
- Unit tests with existing coverage
- Linting/formatting (ESLint, Prettier, Checkstyle — autofix where possible)
- GraphQL schema checks (rover subgraph check)
Phase 2 — Contracts and Integration (Weeks 5-8):
- Integration tests (with Testcontainers)
- Pact contract verification
- Security scanning (Snyk, Dependabot, or Trivy — critical/high block)
- Code coverage reporting (no hard threshold yet — just visibility)

Phase 3 — Performance and Quality (Weeks 9-12):
- Performance regression testing (k6 in CI)
- Code coverage thresholds (start conservative: 60%, increase over time)
- Bundle size monitoring (frontend)
- E2E smoke tests on staging deployment

Phase 4 — Optimization (Ongoing):
- Flaky test quarantine system
- Test impact analysis (run only tests affected by changed files)
- Visual regression testing
- Mutation testing on critical paths
7.2 Gate Classification
| Tier | Behavior | Examples |
|---|---|---|
| Blocking | Must pass to merge | Build, unit tests, integration tests, linting, schema checks, contract tests, security (critical/high) |
| Warning | Report but don't block | Coverage trending down, performance regression >10%, bundle size growth, complexity increase |
| Informational | Report only | Test execution time trends, flaky test rate, dead code, accessibility audit, TODO count |
7.3 Test Execution Optimization
| Technique | How | Impact |
|---|---|---|
| Parallelization | Playwright `--shard`, JUnit 5 parallel execution, GitHub Actions matrix strategy | 2-4x faster CI |
| Selective test runs | Jest `--changedSince=main`, Gradle `--tests` with file mapping | 50-80% fewer tests on average PR |
| Fail-fast | Run unit tests first; skip integration/E2E if they fail | Faster feedback on obvious failures |
| Caching | Cache Docker images (Testcontainers), Playwright browsers, Maven/npm dependencies | 30-50% faster pipeline startup |
| Test result aggregation | Publish to DataDog for trend analysis, GitHub Actions test summary annotations | Flaky test detection, regression tracking |
8. Current State: CI/CD & Test Infrastructure Analysis
This section documents the current state of Duetto's CI/CD pipelines, static code analysis, test coverage, and flaky test handling — derived from analysis of the duetto (backend) and duetto-frontend repositories. Understanding the baseline is essential for prioritizing improvements.
8.1 Current CI/CD Pipeline Architecture
| Repository | PR Pipeline | Push (develop) Pipeline | Scheduled |
|---|---|---|---|
| duetto (backend) | Static analysis → All tests (Jest + basic + Selenium) | Static analysis → All tests → Docker build → GraphQL schema publish | Every 2 hours (with commit-change check) |
| duetto-frontend | Lint → Jest → Cypress (12 parallel containers) | Lint → Jest → Cypress → Trigger external Playwright E2E | Weekly flaky test issue creation (Mondays) |
| duetto-playwright-e2e | PR-specific tests | Regression suite (triggered by frontend push) | On-demand via workflow dispatch |
Key observations:
- Pipeline structure is sound — progressive checks with fast failures first
- Backend uses larger runners (ARM 64-core for static analysis, ubuntu-latest-m for Jest)
- Playwright tests live in a separate repository, triggered via webhook — this adds latency and reduces developer visibility
- Scheduled all-tests run (every 2 hours) with Slack notifications provides good ongoing monitoring
8.2 Static Code Analysis — Current State
| Tool | Repository | Configuration | Blocking? | Notes |
|---|---|---|---|---|
| Checkstyle 10.26.1 | Backend | 120-char line length, naming conventions, whitespace, modifier order | Yes — fails PR | Well-configured, enforces consistent Java style |
| SpotBugs 6.1.7 | Backend | Exclude filter for known false positives, HTML+XML reports | No — `ignoreFailures: true` | Reports generated but don't block merges |
| ESLint (Airbnb + TS) | Frontend | Airbnb + TypeScript config, no-only-tests (error level), deprecated import warnings | Yes — fails PR | Good configuration; prevents .only() leaks |
| Prettier 2.4.1 | Frontend | Integrated into lint-staged pre-commit hooks | Yes — pre-commit | Formatting consistency ensured |
| TypeScript compiler | Frontend | `tsc --noEmit` strict check | Yes — fails PR | Catches type errors before runtime |
| Xray / Frogbot | Backend | Security scanning via JFrog | Manual trigger | Available but not on every PR |
Gaps and proposals:
| Gap | Impact | Proposal | Priority |
|---|---|---|---|
| SpotBugs is non-blocking | Potential bugs slip through to production | Make SpotBugs blocking for new violations (allow existing baseline) | High |
| No SonarQube or equivalent | No centralized quality dashboard, no code smell tracking, no technical debt measurement | Evaluate SonarCloud (SaaS) for unified quality visibility across repos | Medium |
| No SAST/DAST security scanning on PRs | Security vulnerabilities in dependencies and code not caught early | Add Snyk or Trivy to PR pipeline for dependency scanning (critical/high = blocking) | High |
| No frontend complexity analysis | Complex components grow unchecked | Add ESLint complexity rules (max cyclomatic complexity warning) | Low |
8.3 Test Coverage — Current State
Critical finding: No code coverage thresholds are enforced in either repository.
| Dimension | Backend | Frontend |
|---|---|---|
| Coverage tool | Not configured (no JaCoCo) | Not configured in jest.config.js |
| Coverage threshold | None | None |
| Coverage reporting | None | None |
| Coverage trend tracking | None | None |
| Test count | ~2,318 Java unit tests + 176 Selenium tests | ~404 Jest specs + 175 Cypress specs + 32 Playwright tests |
Proposals:
- Immediate (Phase 1): Add coverage reporting without thresholds — visibility first
  - Backend: Add JaCoCo to Gradle (`jacocoTestReport`), publish HTML reports as CI artifacts
  - Frontend: Add the `--coverage` flag to the Jest CI command, publish lcov reports
  - Integrate with CodeCov or SonarCloud for PR-level coverage delta comments
- Phase 2: Introduce conservative thresholds (not blocking yet)
  - Start with warning-level thresholds: 50% line coverage (likely below current levels to avoid disruption)
  - Track coverage trend per PR — flag PRs that decrease coverage
- Phase 3: Enforce blocking thresholds
  - Increase to 60% line coverage (blocking)
  - Require no decrease in coverage per PR (blocking)
  - Target 80%+ for critical business logic (pricing, revenue calculations, event processing)
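A sketch of what the frontend side of this could look like in `jest.config.js`, assuming the repo's existing Jest setup. Phase 1 enables reporting only; the commented threshold block is the Phase 3 step, since Jest's `coverageThreshold` fails the run when unmet:

```javascript
// jest.config.js — sketch, not the repo's actual config.
const config = {
  collectCoverage: true,
  coverageReporters: ['lcov', 'text-summary'], // lcov feeds CodeCov/SonarCloud
  // Phase 3 (blocking) — uncomment once the baseline is established:
  // coverageThreshold: { global: { lines: 60 } },
};

module.exports = config;
```

Keeping the threshold in config (rather than a CI flag) makes the gate visible and reviewable in the same PR that raises it.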
8.4 Flaky Test Handling — Current State
Duetto has invested meaningfully in flaky test infrastructure, but the approach is fragmented across frameworks.
Backend — Custom Retry System
| Mechanism | Details |
|---|---|
| `@RetryTest` annotation | Custom JUnit 5 extension: retries test N times on specific exception types (e.g., TimeoutException, InvocationTargetException) |
| Usage | ~9 annotated tests, primarily Selenium page tests |
| `validate-flapper-fix` workflow | Manual-trigger workflow that runs a single test up to 250 times to verify a flaky fix |
| `@Disabled` tests | ~20+ Selenium tests disabled with notes like "To Be Fixed in Later Ticket" |
| Test splitting | 15 runners (basic) + 20 runners (Selenium) with line-count distribution via split-tests action |
| Slack alerts | Scheduled test failures notify via Slack webhook |
Frontend — Cypress Retry + Automated Detection
| Mechanism | Details |
|---|---|
| Cypress retries | retries=10 — each test can fail up to 10 times before final failure |
| `.xspec.ts` quarantine | 5 test files renamed to `.xspec.ts` (excluded from runs) |
| Weekly automation | GitHub Action runs every Monday: parses Cypress logs for failures, creates GitHub issue with skip instructions |
| Cypress Cloud | Records all runs for post-mortem analysis |
| `it.skip()` / `describe.skip()` | Manual skip annotations throughout the codebase |
Playwright E2E
| Mechanism | Details |
|---|---|
| Retry | 1 retry on CI only (retries: process.env.CI ? 1 : 0) |
| Workers | 1 worker on CI (stability), 4 locally (speed) |
| Reporting | Allure reports with per-team breakdown, DataDog test visibility integration |
| Traces | Retained on failure for debugging |
Gaps and proposals:
| Gap | Impact | Proposal | Priority |
|---|---|---|---|
| Cypress retries=10 is excessive | Masks genuinely flaky tests; 10 retries can add minutes to CI | Reduce to 2-3 retries; any test needing >3 retries is flaky and should be quarantined | High |
| No unified flaky test tracking | Flaky tests tracked differently per framework; no org-wide view | Build a DataDog dashboard tracking flaky rate per test suite, per team | Medium |
| ~20+ disabled Selenium tests | Unknown test debt; regression risk | Audit disabled tests: either fix, delete, or convert to Playwright. Set SLA: no test disabled >30 days without a ticket | High |
| No automatic quarantine | Manual process to skip flaky tests; relies on someone noticing | Implement automatic quarantine: if a test fails >3 times in 7 days, auto-quarantine + create ticket | Medium |
| Frontend weekly automation is reactive | Only runs Mondays; flaky tests can block PRs all week | Run flaky detection daily or on every develop push | Low |
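The proposed auto-quarantine rule ("fails more than 3 times in 7 days") can be sketched as follows. Class and method names are illustrative; in practice the failure stream would come from DataDog test visibility or CI results, and quarantining would rename/tag the test and file a ticket:

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Sketch of the proposed auto-quarantine policy. Illustrative names throughout.
FAILURE_LIMIT = 3          # more than this many failures...
WINDOW = timedelta(days=7) # ...within this window triggers quarantine

class FlakyTracker:
    def __init__(self):
        self._failures = defaultdict(list)  # test name -> failure timestamps

    def record_failure(self, test_name, when):
        self._failures[test_name].append(when)

    def should_quarantine(self, test_name, now):
        """True if the test exceeded the failure limit inside the rolling window."""
        recent = [t for t in self._failures[test_name] if now - t <= WINDOW]
        return len(recent) > FAILURE_LIMIT
```

The same counter works for all frameworks, which is the point: one policy and one dashboard instead of per-framework retry conventions.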
8.5 Test Infrastructure Summary
┌──────────────────────────────────────────────────────────────────┐
│ Current Test Landscape │
├────────────────┬───────────────┬──────────────────────────────────┤
│ Layer │ Count │ Framework & Notes │
├────────────────┼───────────────┼──────────────────────────────────┤
│ Java unit │ ~2,318 tests │ JUnit 5 + Mockito (15 runners) │
│ Backend Jest │ Subset │ Jest (frontend-in-backend) │
│ Frontend Jest │ ~404 specs │ Jest + RTL (jsdom) │
│ Cypress E2E │ ~175 specs │ Cypress 7.7 (12 runners, ret=10)│
│ Selenium E2E │ ~176 tests │ Selenium 4.29 + Firefox (20 run)│
│ Playwright E2E │ ~32 tests │ Playwright (separate repo) │
│ Hammer (price) │ On-demand │ Custom Java (monolith module) │
├────────────────┼───────────────┼──────────────────────────────────┤
│ TOTAL │ ~3,100+ tests │ 7 frameworks across 3+ repos │
└────────────────┴───────────────┴──────────────────────────────────┘
Key takeaway: The test suite is substantial (~3,100+ tests) but fragmented across 7 frameworks and multiple repositories. Consolidating on Playwright (replacing Cypress and Selenium), unifying test reporting, and automating Hammer (currently the only pricing regression tool, but on-demand only) will dramatically improve signal quality and reduce maintenance burden.
9. Quality Metrics
9.1 Metrics That Matter
Beyond test coverage — metrics that actually correlate with software quality:
Leading Indicators (predict quality):
| Metric | What It Measures | Target | Tool |
|---|---|---|---|
| Test coverage trend | Direction of coverage over time (not absolute %) | Increasing quarter-over-quarter | CodeCov, SonarQube |
| Mutation score | % of mutants killed — measures test effectiveness | >70% for critical business logic | PIT, Stryker |
| Flaky test rate | % of test runs that are non-deterministic | <1% of total test suite | Custom tracking, DataDog |
| Build success rate | % of CI builds that pass on first run | >90% | GitHub Actions metrics |
| PR test coverage delta | Coverage change per PR | No decrease (warning), increase (goal) | CodeCov PR comments |
Lagging Indicators (measure quality outcomes):
| Metric | What It Measures | Target | Tool |
|---|---|---|---|
| Escaped defect rate | Bugs that reach production per release | Decreasing trend | Jira/incident tracking |
| Mean Time to Recovery (MTTR) | How fast production issues are fixed | <1 hour for P1 | DataDog, PagerDuty |
| Deployment frequency | How often teams ship to production | Multiple times/week per team | GitHub Actions |
| Change failure rate | % of deployments causing incidents | <5% | Incident tracking |
| Incident frequency | Production incidents per team per month | Decreasing trend | PagerDuty, DataDog |
DORA metrics (Deployment Frequency, Lead Time, Change Failure Rate, MTTR) should be the north-star metrics for engineering quality.
9.2 Quality Dashboard
Build a DataDog dashboard showing quality health per team:
| Section | Metrics | Audience |
|---|---|---|
| Team Health | DORA metrics, escaped defects, incident rate | Engineering leadership |
| Test Health | Coverage, flaky rate, test execution time, mutation score | Quality Guild, team leads |
| Pipeline Health | Build success rate, pipeline duration, quality gate pass rate | Quality Engineering team |
| Production Health | Error rates, p99 latency, SLO compliance | All engineers |
10. Tooling Proposals
10.1 Proposed Tool Stack
| Category | Tool | Why |
|---|---|---|
| Unit testing (Java) | JUnit 5 + Mockito | Industry standard for Spring Boot |
| Unit testing (JS/TS) | Jest + React Testing Library | Already in use, excellent for React |
| Integration testing | Testcontainers | Real Docker containers for MongoDB, PostgreSQL, Redis, LocalStack, RabbitMQ |
| E2E testing | Playwright | Consolidate from Playwright + Cypress + Selenium (see Sections 4.4 and 11) |
| Contract testing | Pact | Consumer-driven contracts for service boundaries |
| GraphQL schema | Apollo Rover | Schema checks in CI for federation |
| Performance testing | k6 | JS-based, DataDog integration, GraphQL support, CI-native |
| Security scanning | Snyk or Trivy | Dependency vulnerabilities in CI |
| Mutation testing | PIT (Java), Stryker (JS/TS) | Validate test suite effectiveness |
| Visual regression | Playwright built-in screenshots | Free, no additional tooling |
| Accessibility | axe-core + Playwright | Automated WCAG 2.1 AA checks |
| Code review (AI) | CodeRabbit + Augment | Enforce testing standards, catch quality issues |
| Observability | DataDog + OpenTelemetry | Production quality signals |
| Chaos engineering | AWS FIS + Toxiproxy | Resilience testing (Phase 4+) |
| Intelligence Track | | |
| Unit testing (Python) | pytest + pytest-mock | Already in use across ML repos; standardize fixtures and markers |
| Data quality | Great Expectations | Already in datapipelines — expand to training repos; schema + distribution checks |
| Type checking (Python) | MyPy (strict mode) | Already on pricerator, ml_elasticity — standardize across all Python repos |
| Linting (Python) | Ruff | Already adopted — standardize config (line-length, rule sets) across all ML repos |
| ML experiment tracking | MLflow | Already in use — add automated validation gates (accuracy vs baseline) |
| Golden file testing | pytest + snapshot files | Already in group-forecast-service — expand to all inference endpoints |
| Property-based testing | Hypothesis | For algorithmic code: pricing constraints, elasticity calculations, optimization solvers |
| Pipeline testing | Airflow test utilities + pytest-docker | DAG structure validation, pipeline integration tests |
| AWS mocking | moto | Already in pricerator — standardize for all boto3/S3 testing |
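Mutation testing (PIT, Stryker) validates a test suite by injecting small code changes ("mutants") and checking that tests fail. A toy, self-contained TypeScript illustration of the principle — the real tools operate systematically on compiled code, not on source strings as this sketch does:

```typescript
// Toy mutation test: flip "+" to "-" in a function's source and check
// whether a given test catches ("kills") the mutant. PIT and Stryker
// automate this across many mutation operators and source files.
const addSource = "return a + b;";

type BinFn = (a: number, b: number) => number;

function compile(src: string): BinFn {
  return new Function("a", "b", src) as BinFn;
}

// A test suite is effective for this function only if it kills the mutant.
function mutantKilled(src: string, test: (f: BinFn) => boolean): boolean {
  const mutant = compile(src.replace("+", "-")); // the injected defect
  return !test(mutant); // killed = the test fails when run on the mutant
}

const goodTest = (f: BinFn) => f(2, 3) === 5;
const tautologicalTest = (_f: BinFn) => true; // passes no matter what
```

This is also the mechanism that exposes tautological AI-generated tests: a test that survives every mutant is asserting nothing about behavior.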
10.2 Testcontainers Setup for Duetto's Stack
Testcontainers should be the standard for all backend integration tests, replacing mocks with real infrastructure:
| Container | What it replaces | Use case |
|---|---|---|
| `MongoDBContainer` | Mock MongoDB / embedded MongoDB | Repository tests, aggregation pipeline tests |
| `PostgreSQLContainer` | Mock PostgreSQL / H2 | Neon Postgres compatibility, migration tests |
| `GenericContainer("redis:7")` | Mock Redis | Cache integration tests |
| `LocalStackContainer` (SQS, SNS, Kinesis, S3) | Mock AWS SDK | Event-driven architecture tests |
| `RabbitMQContainer` | Mock RabbitMQ | Message handler tests |
11. Selenium-to-Playwright Migration
11.1 Migration Scope
Duetto has ~176 Selenium tests (Java, Selenium 4.29, Firefox) running across 20 parallel CI runners. These tests use the Page Object pattern, custom @RetryTest annotations, and rely on Xvfb virtual display + Jetty server for test execution. In parallel, ~175 Cypress specs (TypeScript) run across 12 CI containers.
Both test suites should migrate to Playwright. This section focuses on the Selenium migration, which is the larger and more complex effort.
11.2 Why Migrate Now
| Driver | Details |
|---|---|
| Maintenance burden | 20+ disabled Selenium tests with "to be fixed later" notes; custom retry infrastructure needed for stability |
| Browser limitation | Selenium tests only run Firefox; no Safari or Chrome coverage |
| Infrastructure cost | 20 parallel runners + Xvfb virtual display is heavyweight compared to Playwright's headless-by-default approach |
| Framework age | Selenium 4.x is stable but less developer-friendly than Playwright's auto-waiting, built-in assertions, and tracing |
| Consolidation | Running 3 E2E frameworks (Selenium + Cypress + Playwright) across 3 repos is unsustainable; converging to 1 cuts maintenance by ~60% |
| BLAST compatibility | Next.js E2E testing is natively supported by Playwright; Selenium has no first-class Next.js integration |
11.3 Migration Strategy: AI-Accelerated Conversion
The migration of ~176 Selenium tests is a high-volume, pattern-based task — ideal for AI acceleration. Using Claude Code (and optionally Augment Code for codebase-aware suggestions), the migration can be completed in a fraction of the time manual rewriting would require.
Phase 1 — Foundation (Weeks 1-2)
Set up the Playwright project and migrate the base infrastructure:
- Create Playwright project in the existing `duetto-playwright-e2e` repo (or a new directory in `duetto`)
- Convert base test utilities:
  - Selenium `WebDriver` setup → Playwright `Browser`/`BrowserContext`/`Page`
  - Custom `@RetryTest` annotation → Playwright's built-in `retries` config
  - Xvfb display server → Playwright headless mode (no display server needed)
  - Jetty server start/stop → Playwright `webServer` config in `playwright.config.ts`
  - MongoDB/Redis setup → reuse existing GitHub Actions setup or migrate to Testcontainers
- Create a mapping reference for the AI tools:
| Selenium (Java) | Playwright (TypeScript) |
|---|---|
| `driver.findElement(By.id("x"))` | `page.locator('#x')` |
| `driver.findElement(By.cssSelector(".x"))` | `page.locator('.x')` |
| `driver.findElement(By.xpath("//div"))` | `page.locator('div')` or `page.locator('xpath=//div')` |
| `element.click()` | `await locator.click()` |
| `element.sendKeys("text")` | `await locator.fill('text')` |
| `element.getText()` | `await locator.textContent()` |
| `element.isDisplayed()` | `await locator.isVisible()` |
| `new WebDriverWait(driver, 10).until(...)` | Auto-waiting built into Playwright actions |
| `driver.navigate().to(url)` | `await page.goto(url)` |
| `driver.switchTo().frame(...)` | `await page.frameLocator(...)` |
| `Thread.sleep(ms)` | `await page.waitForSelector(...)` or `await expect(locator).toBeVisible()` |
| `Actions(driver).moveToElement(e)` | `await locator.hover()` |
| `Select(element).selectByValue(v)` | `await locator.selectOption(v)` |
| `driver.manage().window().setSize(...)` | `await page.setViewportSize({...})` |
| `Assert.assertEquals(expected, actual)` | `await expect(locator).toHaveText(expected)` |
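The retry and server-startup replacements in the base-infrastructure conversion land in Playwright configuration rather than in test code. A hedged sketch of what the `playwright.config.ts` might look like — the command, port, and retry count are placeholders, not Duetto's actual values:

```typescript
// playwright.config.ts — illustrative only; command and port are assumptions.
import { defineConfig } from '@playwright/test';

export default defineConfig({
  retries: 2,               // replaces the custom @RetryTest annotation
  use: { headless: true },  // replaces the Xvfb virtual display
  webServer: {
    command: 'npm run start:test-server', // replaces Jetty start/stop
    port: 3000,
    reuseExistingServer: true,
  },
});
```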
Phase 2 — AI-Powered Page Object Conversion (Weeks 3-6)
Use Claude Code to bulk-convert Selenium Page Objects to Playwright:
Prompt strategy for Claude Code:
Given this Selenium Page Object class (Java), convert it to a Playwright
Page Object (TypeScript). Follow these rules:
1. Replace all Selenium WebDriver calls with Playwright equivalents
2. Replace explicit waits (WebDriverWait) with Playwright auto-waiting
3. Replace Thread.sleep() with proper Playwright waitFor* methods
4. Convert Java assertions to Playwright expect() assertions
5. Use Playwright's built-in locator strategies (prefer role, text,
test-id over CSS/XPath)
6. Keep the Page Object pattern but adapt to TypeScript class syntax
7. Add proper TypeScript types for all methods
8. Replace any Selenium-specific retry logic with Playwright's
built-in retry mechanisms
Source Selenium class:
[paste class]
Output the Playwright TypeScript equivalent.
Batch conversion workflow:
- List all Page Objects from `selenium/src/test/java/com/duetto/frontend/selenium/`
- Prioritize by business value: start with pages that cover critical user journeys (login, pricing, rate management, dashboard)
- Feed each Page Object to Claude Code with the prompt above
- Human review: engineer verifies each converted Page Object — check locator strategies, ensure business logic is preserved, validate assertions
- Run both old and new tests in parallel for the same pages to validate conversion accuracy
Expected velocity with AI assistance:
- Manual conversion: ~2-3 Page Objects per engineer per day
- AI-assisted conversion: ~10-15 Page Objects per engineer per day (3-5x speedup)
- Human review remains essential — AI may miss Duetto-specific patterns, custom wait conditions, or domain-specific assertions
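For illustration, here is the shape of output the conversion aims for — a hypothetical `LoginPage`, not an actual Duetto Page Object. The minimal `Page`/`Locator` interfaces are stubbed inline so the sketch is self-contained; a real conversion would import `Page` from `@playwright/test` instead:

```typescript
// Stubs of the Playwright surface used below — real code imports these
// types from '@playwright/test' rather than declaring them.
interface Locator {
  fill(value: string): Promise<void>;
  click(): Promise<void>;
}
interface Page {
  goto(url: string): Promise<void>;
  getByTestId(id: string): Locator;
}

// Hypothetical converted Page Object: Selenium By.id lookups become
// Playwright locators, and explicit waits disappear (auto-waiting).
class LoginPage {
  constructor(private readonly page: Page) {}

  async open(): Promise<void> {
    await this.page.goto('/login');
  }

  async login(user: string, password: string): Promise<void> {
    await this.page.getByTestId('username').fill(user);
    await this.page.getByTestId('password').fill(password);
    await this.page.getByTestId('submit').click();
  }
}
```

Note the locator preference (test-id here) matches the prompt rules above: role, text, or test-id over raw CSS/XPath.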
Phase 3 — Test Conversion (Weeks 5-10)
Convert test files in priority order:
| Priority | Tests | Criteria |
|---|---|---|
| P0 | Login, authentication, critical navigation | Break = users can't access the product |
| P1 | Pricing, rate management, revenue dashboards | Core business functionality |
| P2 | Settings, admin, user management | Important but lower traffic |
| P3 | Disabled tests (~20+) | Evaluate: convert or permanently delete |
For each test file:
1. Feed the Selenium test + its Page Objects to Claude Code
2. AI generates the Playwright equivalent
3. Engineer reviews, adjusts for Duetto-specific patterns
4. Run the new Playwright test against the same environment
5. Validate it covers the same scenarios (compare step-by-step)
6. Once green, mark the Selenium test for deprecation
Claude Code skills for migration:
Consider creating a dedicated Claude Code skill (.claude/skills/selenium-to-playwright.md) that encodes:
- The mapping reference table above
- Duetto-specific Page Object conventions
- Common patterns in Duetto's Selenium tests (e.g., how they handle MongoDB test data, Jetty server initialization)
- Preferred Playwright locator strategy (test-id > role > text > CSS > xpath)
- Assertion patterns used in the Playwright E2E repo
Phase 4 — Validation and Cutover (Weeks 9-12)
- Parallel run period (2-3 weeks):
  - Run both Selenium and Playwright suites in CI
  - Track: same scenarios should produce same pass/fail results
  - Investigate any discrepancies — usually timing or locator differences
- Decommission Selenium:
  - Remove Selenium tests from CI pipeline
  - Archive (don't delete) the Selenium directory for reference
  - Remove Selenium dependencies from `build.gradle`
  - Remove Xvfb and Firefox setup from GitHub Actions workflows
  - Reduce CI runners from 20 → Playwright's built-in sharding
- Update CI infrastructure:
  - Replace 20 Selenium runners with Playwright `--shard` across fewer runners
  - Playwright's native parallelization typically needs 3-5 shards for equivalent coverage
  - Expected CI time reduction: 40-60%
11.4 Cypress Migration (Parallel Track)
The Cypress-to-Playwright migration follows a similar pattern but is simpler due to both being JavaScript/TypeScript:
| Selenium → Playwright | Cypress → Playwright |
|---|---|
| Language change (Java → TS) | Same language (TS → TS) |
| Page Object pattern rewrite | Page Object pattern adaptation |
| New assertion library | Similar assertion patterns |
| ~176 tests | ~175 specs |
| 12 weeks | 6-8 weeks |
Key Cypress → Playwright differences:
| Cypress | Playwright |
|---|---|
| `cy.visit(url)` | `await page.goto(url)` |
| `cy.get('.selector')` | `page.locator('.selector')` |
| `cy.contains('text')` | `page.getByText('text')` |
| `cy.intercept()` | `page.route()` |
| `cy.wait('@alias')` | `await page.waitForResponse(...)` |
| Automatic chaining | Explicit `await` on each action |
| `cy.request()` | `request.get()` / `request.post()` |
Claude Code can perform this conversion even faster since no language change is involved. Expected velocity: 15-25 specs per engineer per day with AI assistance.
11.5 Migration Timeline
Month 1 Month 2 Month 3 Month 4
┌────────────────┬────────────────┬────────────────┬────────────┐
│ SELENIUM │ SELENIUM │ SELENIUM │ │
│ Phase 1: │ Phase 2-3: │ Phase 3-4: │ Selenium │
│ Foundation + │ AI-convert │ Convert P2-P3 │ decomm. │
│ base infra │ P0-P1 tests │ + validation │ │
├────────────────┼────────────────┼────────────────┤ │
│ CYPRESS │ CYPRESS │ │ Cypress │
│ Phase 1: │ Phase 2: │ Cypress │ decomm. │
│ Stop new tests │ AI-convert │ validation │ │
│ + P0 migration │ P1-P3 tests │ + cutover │ │
└────────────────┴────────────────┴────────────────┴────────────┘
Total effort estimate:
- With AI assistance: 2-3 engineers × 3 months (including validation)
- Without AI: 3-4 engineers × 6 months
- AI acceleration saves ~50% of migration effort
11.6 Risk Mitigation
| Risk | Mitigation |
|---|---|
| AI generates incorrect locators | Every converted test must be run and verified by an engineer; use Playwright Codegen to validate locator strategies |
| Test behavior changes during conversion | Run old and new tests in parallel during validation; compare results test-by-test |
| Team bandwidth | Spread migration across teams — each team converts its own Selenium/Cypress tests, guided by QE |
| Loss of Selenium-specific infrastructure | Document all custom Selenium utilities before removal; ensure Playwright equivalents exist |
| Disabled tests never get converted | Audit disabled tests in Phase 1: decide convert or delete. No indefinite quarantine during migration |
12. Implementation Roadmap
Phase 1 — Foundation (Months 1-3)
App/Platform track:
- [ ] Define Quality Engineer role and career ladder (both tracks)
- [ ] Hire or reassign 2-3 App/Platform QEs for pilot embedding (choose teams with willing tech leads)
- [ ] Audit current testing: what exists, what's automated, where are gaps per team
- [ ] Establish basic testing standards document (guild v0 — shared + track-specific sections)
- [ ] Implement Phase 1 CI quality gates (build, unit tests, linting, schema checks)
- [ ] Stop writing new Cypress and Selenium tests — all new E2E in Playwright
- [ ] Begin Selenium-to-Playwright migration: foundation + base infrastructure (Section 11)
- [ ] Create Claude Code skill for Selenium-to-Playwright conversion patterns
- [ ] Add JaCoCo (backend) and Jest --coverage (frontend) for coverage visibility
- [ ] Make SpotBugs blocking for new violations (baseline existing)
- [ ] Set up shared Testcontainers configurations for common Duetto infrastructure
- [ ] Create quality metrics dashboard in DataDog (basic version — both tracks)
Intelligence track:
- [ ] Audit Intelligence repo testing: catalog test counts, coverage gaps, and existing strengths per repo
- [ ] Standardize Ruff config across all Python ML repos (line-length, rule sets, pre-commit)
- [ ] Extend MyPy strict mode to all Intelligence Python repos (currently only pricerator, ml_elasticity)
- [ ] Add pytest --coverage to Intelligence CI pipelines for visibility (no thresholds yet)
- [ ] Document existing MLflow, Great Expectations, and golden file testing practices
- [ ] Automate Hammer: GitHub Actions workflow for pricing PRs + curated hotel sample (Section 4.6.7)
- [ ] Define Hammer tolerance thresholds (warn vs block) to replace binary pass/fail
- [ ] Add scheduled nightly Hammer run on develop with Slack notifications
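The tolerance-threshold idea for Hammer can be sketched as a small decision function. This is written in TypeScript purely for consistency with the other examples in this document — Hammer itself is JVM-based, and the percentage thresholds below are placeholders for the Intelligence track to calibrate:

```typescript
// Sketch of a warn/block tolerance gate for pricing regression diffs,
// replacing binary pass/fail. Thresholds are illustrative placeholders.
type GateResult = "pass" | "warn" | "block";

const WARN_PCT = 0.5;  // warn if any price drifts more than 0.5%
const BLOCK_PCT = 2.0; // block if any price drifts more than 2%

function priceGate(baseline: number[], candidate: number[]): GateResult {
  let worst = 0;
  for (let i = 0; i < baseline.length; i++) {
    const pctDiff = (Math.abs(candidate[i] - baseline[i]) / baseline[i]) * 100;
    worst = Math.max(worst, pctDiff);
  }
  if (worst > BLOCK_PCT) return "block";
  if (worst > WARN_PCT) return "warn";
  return "pass";
}
```

A "warn" result posts a PR comment; only a "block" result fails the check, so expected small model-driven drift no longer breaks builds.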
Phase 2 — Pilot and Guild Formation (Months 3-6)
App/Platform track:
- [ ] Embed QEs in 3-4 App/Platform pilot teams — focus on coaching, not gatekeeping
- [ ] Form Quality Guild with two-track structure — bi-weekly all-hands, monthly track deep dives
- [ ] Implement Phase 2 CI quality gates (integration tests, contract testing, security scanning)
- [ ] Introduce Pact for first service boundary being extracted from monolith
- [ ] AI-powered bulk conversion of Selenium P0-P1 tests to Playwright (Section 11.3)
- [ ] Begin Cypress-to-Playwright migration (critical paths first)
- [ ] Implement CodeRabbit + Augment Code testing enforcement rules
- [ ] Add Snyk or Trivy for dependency security scanning on PRs
- [ ] Start tracking DORA metrics per team
Intelligence track:
- [ ] Hire or assign 1 Intelligence track QE (Python + ML pipeline experience required)
- [ ] Expand Great Expectations from datapipelines into training repos (forecasting, ml_elasticity)
- [ ] Implement golden file testing for pricerator and group-forecast-service inference endpoints
- [ ] Add MLflow automated accuracy comparison vs baseline (warning gate, not blocking)
- [ ] Create CI pipeline templates for Python ML repos (Ruff + MyPy + pytest + coverage)
- [ ] Begin pytest coverage improvement: target utility functions and data transformations first
- [ ] Hammer Phase 2: structured JSON output, PR comment summaries, DataDog metrics integration
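Golden file testing pins inference output to a stored, reviewed reference. A minimal sketch of the comparison step — in TypeScript for consistency with the other examples here, although the actual repos do this with pytest snapshot files, and the forecast shape and file name below are hypothetical:

```typescript
import * as fs from "node:fs";

// Compare an inference response against a stored golden file. Any
// structural or value difference fails; updating the golden file is a
// deliberate, code-reviewed change, not an automatic refresh.
function matchesGolden(goldenPath: string, actual: unknown): boolean {
  const golden = JSON.parse(fs.readFileSync(goldenPath, "utf-8"));
  return JSON.stringify(golden) === JSON.stringify(actual);
}
```

The value of the pattern is catching unintended output changes from refactors or dependency bumps, while keeping intentional model changes visible in review.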
Phase 3 — Scale and Standardize (Months 6-12)
App/Platform track:
- [ ] Expand App/Platform QEs to cover all App + Platform teams (1 QE per 2-4 teams)
- [ ] Implement Phase 3 CI quality gates (performance testing, coverage thresholds, E2E smoke)
- [ ] Complete Selenium and Cypress decommission (Section 11.5)
- [ ] Introduce mutation testing for critical business logic (pricing, revenue)
- [ ] Build shared test data libraries (hotel, rate, reservation factories)
- [ ] Create CI pipeline templates (reusable GitHub Actions workflows)
- [ ] Establish quality onboarding for new developers
Intelligence track:
- [ ] Intelligence QE embedded across Pricing, Forecasting, and Data teams
- [ ] Implement data quality gates as blocking in Airflow DAGs (Great Expectations)
- [ ] Introduce Hypothesis property-based testing for algorithmic code (pricing constraints, elasticity calculations)
- [ ] pytest coverage threshold for Intelligence repos (40% — lower than app teams but meaningful)
- [ ] Automated champion/challenger model testing before promotion to production
- [ ] Data drift detection alerts integrated with DataDog
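Property-based testing asserts invariants over many generated inputs rather than single hand-picked examples. A hand-rolled TypeScript sketch of the idea for a hypothetical price-clamping rule — Hypothesis automates the input generation (and failure shrinking) in the Python repos, so treat this only as an illustration of the concept:

```typescript
// Hypothetical pricing constraint: the recommended price is clamped to
// the hotel's [floor, ceiling] band.
function clampPrice(raw: number, floor: number, ceiling: number): number {
  return Math.min(ceiling, Math.max(floor, raw));
}

// Minimal deterministic generator (MINSTD LCG) standing in for a
// Hypothesis strategy.
function* prices(seed: number, n: number): Generator<number> {
  let s = seed;
  for (let i = 0; i < n; i++) {
    s = (s * 48271) % 2147483647;
    yield (s / 2147483647) * 1000; // price in [0, 1000)
  }
}

// The property: output is always inside the band, and inputs already
// inside the band pass through unchanged.
function propertyHolds(floor: number, ceiling: number): boolean {
  for (const raw of prices(42, 500)) {
    const p = clampPrice(raw, floor, ceiling);
    if (p < floor || p > ceiling) return false;
    if (raw >= floor && raw <= ceiling && p !== raw) return false;
  }
  return true;
}
```

The same pattern applies to elasticity calculations and optimization solvers: state the invariant (monotonicity, bounds, conservation), then let the framework hunt for counterexamples.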
Phase 4 — Optimize and Mature (Months 12-18)
Both tracks:
- [ ] All teams have QE support (embedded or shared)
- [ ] Implement Phase 4 quality gates (flaky test quarantine, test impact analysis, visual regression)
- [ ] Explore chaos engineering (AWS FIS + Toxiproxy for critical services)
- [ ] Evaluate AI-powered test generation tools (Diffblue for Java, Claude Code test skills)
- [ ] Formalize parallel run testing for high-risk service extractions
- [ ] Annual review: guild health, tooling satisfaction, quality metrics by track
Intelligence-specific:
- [ ] End-to-end ML pipeline validation (data → training → inference → output) as CI workflow
- [ ] Model monitoring dashboards per model family (forecasting, elasticity, pricing optimizer)
- [ ] Reproducibility CI checks: same config + data → deterministic output
- [ ] Hammer Phase 3: containerize, decouple from monolith Spring context, AI-assisted diff analysis
13. Pre-Identified Potential Initiatives
The following initiatives have been identified during the analysis that produced this strategy. They are listed here as a starting backlog — not a commitment. Final prioritisation, sequencing, and scope will be determined by the Quality Guild and Quality Engineering Team once established.
Each initiative is tagged with the team charter it belongs to: Guild (TC-006) for governance and standards work, or QE Team (TC-007) for infrastructure and tooling delivery. QE Team initiatives are split by track.
13.1 Quality Guild Initiatives (TC-006)
These initiatives relate to governance, standards, coaching, and culture — owned by the guild as a community of practice.
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Author testing standards document v1 — shared section + track-specific sections, including engineer testing responsibilities (unit/integration/E2E ownership expectations, PR review standards) | 1 | Cross-Team Testing Standard Adoption |
| 2 | Establish flaky test policy and SLA — quarantine rules, remediation ownership, maximum days disabled | 1-2 | Flaky Test Rate |
| 3 | AI-generated code QA strategy — mutation testing adoption criteria, AI-generated test review guidelines, tautological test detection | 2 | Escaped Defect Rate |
| 4 | DORA metrics tracking and review — define measurement per team, monthly review cadence with EMs | 2 | DORA Change Failure Rate |
| 5 | Testing standards document v2 — add Intelligence track sections (ML testing diamond, data quality standards, pipeline testing expectations, Intelligence-specific quality gates) | 2 | Cross-Team Testing Standard Adoption |
13.2 Quality Engineering Team Initiatives — Shared (TC-007)
Infrastructure and tooling that serves both App/Platform and Intelligence tracks.
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Phase 1 CI quality gates — build and improve existing checks: build verification, unit tests, linting/formatting, GraphQL schema checks, code coverage visibility (JaCoCo, Jest, pytest — reporting only, no thresholds) | 1 | Quality Gate Adoption Rate |
| 2 | Quality metrics dashboard in DataDog — test health, pipeline health, per-team and per-track views | 1 | Quality Gate Adoption Rate |
| 3 | Phase 2 CI quality gates — integration tests, contract verification, dependency security scanning (Snyk/Trivy, critical/high = blocking), coverage delta PR comments | 2 | Quality Gate Adoption Rate |
| 4 | CodeRabbit + Augment Code configuration — path-based test enforcement rules, anti-pattern detection, semantic test gap review | 2 | Quality Gate Adoption Rate |
| 5 | Flaky test auto-quarantine system — detection (>3 failures in 7 days), automatic quarantine, Jira ticket creation, unified dashboard | 2 | Defect Detection Rate |
| 6 | Incident RCA tagging model — establish `caught-in-ci` / `escaped-to-production` / `could-have-been-caught` / `not-ci-detectable` classification in Jira | 2 | Defect Detection Rate |
| 7 | Phase 3 CI quality gates and reusable CI pipeline templates — k6 performance regression, coverage thresholds (60% blocking), E2E smoke on staging; published as versioned GitHub Actions workflows for Java/Spring, React/Next.js, Python ML | 3 | Quality Gate Adoption Rate |
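The detection rule behind the flaky-test auto-quarantine initiative (>3 failures in 7 days) is simple enough to sketch. A minimal TypeScript illustration — the event shape is an assumption; the real system would read CI results from GitHub Actions and open the Jira ticket automatically:

```typescript
// Flaky-test quarantine rule: a test with more than 3 failures in the
// trailing 7 days gets quarantined. The TestFailure shape is illustrative.
interface TestFailure {
  testId: string;
  failedAt: Date;
}

function testsToQuarantine(failures: TestFailure[], now: Date): string[] {
  const windowStart = now.getTime() - 7 * 24 * 60 * 60 * 1000;
  const counts = new Map<string, number>();
  for (const f of failures) {
    if (f.failedAt.getTime() >= windowStart) {
      counts.set(f.testId, (counts.get(f.testId) ?? 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, n]) => n > 3) // the ">3 failures in 7 days" rule
    .map(([id]) => id);
}
```

Quarantined tests stay visible on the unified dashboard with an owning team and SLA, per the guild's flaky test policy — quarantine is a remediation queue, not a graveyard.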
13.3 Quality Engineering Team Initiatives — App/Platform Track (TC-007)
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Selenium-to-Playwright migration — foundation, base infrastructure, Claude Code migration skill; stop new Cypress and Selenium tests, all new E2E in Playwright (Section 11) | 1 | E2E Framework Consolidation |
| 2 | Make SpotBugs blocking for new violations — baseline existing, fail PR on new | 1 | Quality Gate Adoption Rate |
| 3 | Shared Testcontainers configurations — MongoDB, PostgreSQL, Redis, LocalStack (SQS/SNS/Kinesis/S3), RabbitMQ | 1 | Quality Gate Adoption Rate |
| 4 | E2E framework bulk conversion — AI-powered Selenium P0-P1 test conversion with parallel run validation + Cypress critical path migration (Sections 11.3, 11.4) | 2 | E2E Framework Consolidation |
| 5 | Pact broker setup — contract broker infrastructure, `can-i-deploy` gate, message contract support for first service extraction | 2 | Quality Gate Adoption Rate |
| 6 | Selenium and Cypress decommission — remove from CI, archive tests, remove 20 Selenium runners + 12 Cypress containers, replace with Playwright sharding | 3 | E2E Framework Consolidation |
| 7 | Mutation testing infrastructure — PIT for Java, Stryker for JS/TS on critical business logic (pricing, revenue) | 3 | Defect Detection Rate |
| 8 | Shared test data libraries — factories and fixtures for hotels, rates, reservations, users | 3 | CI Build Success Rate |
13.4 Quality Engineering Team Initiatives — Intelligence Track (TC-007)
| # | Initiative | Phase | Primary Metric Impact |
|---|---|---|---|
| 1 | Audit Intelligence repo testing — catalog test counts, coverage gaps, and existing strengths per repo | 1 | Quality Gate Adoption Rate |
| 2 | Standardize Python tooling across all Intelligence repos — unified Ruff config (line-length, rule sets, pre-commit) and MyPy strict mode (currently only pricerator, ml_elasticity) | 1 | Quality Gate Adoption Rate |
| 3 | Hammer CI automation — GitHub Actions for pricing PRs + curated hotel sample + nightly develop run + Slack notifications + configurable tolerance thresholds replacing binary pass/fail (Section 4.6.7) | 1 | Hammer CI Automation |
| 4 | Expand Great Expectations from datapipelines into training repos (forecasting, ml_elasticity) | 2 | Quality Gate Adoption Rate |
| 5 | Model and inference validation — golden file testing for inference endpoints (pricerator, group-forecast-service) + MLflow automated accuracy comparison vs baseline (warning gate, not blocking) | 2 | Defect Detection Rate |
| 6 | Python ML CI pipeline templates — Ruff + MyPy strict + pytest + coverage as reusable GitHub Actions | 2 | Quality Gate Adoption Rate |
| 7 | Hammer structured reporting — JSON output, PR comment summaries, DataDog metrics integration | 2 | Hammer CI Automation |
| 8 | Data quality gates as blocking in Airflow DAGs — Great Expectations suites per pipeline stage | 3 | Defect Detection Rate |
| 9 | Hypothesis property-based testing for algorithmic code — pricing constraints, elasticity calculations, optimization solvers | 3 | Defect Detection Rate |
| 10 | pytest coverage thresholds for Intelligence repos — 40% initial target | 3 | Quality Gate Adoption Rate |
| 11 | Automated champion/challenger model testing before promotion to production | 3 | Defect Detection Rate |
| 12 | Data drift detection alerts integrated with DataDog | 3 | Defect Detection Rate |
13.5 Initiative Summary
| Owner | Count | Phase 1 | Phase 2 | Phase 3 | Phase 4 |
|---|---|---|---|---|---|
| Quality Guild | 5 | 2 | 3 | 0 | 0 |
| QE Team — Shared | 7 | 2 | 4 | 1 | 0 |
| QE Team — App/Platform | 8 | 3 | 2 | 3 | 0 |
| QE Team — Intelligence | 12 | 3 | 4 | 5 | 0 |
| Total | 32 | 10 | 13 | 9 | 0 |
Note: These are pre-identified potential initiatives derived from the analysis in this strategy document. They are not commitments. The Quality Guild and Quality Engineering Team will refine, reprioritise, merge, or discard initiatives as they begin work and learn from early phases. Formal APEX initiative IDs (I-YYYY-XX-NNN) will be assigned when initiatives are approved and enter the APEX pipeline.
14. Frequently Asked Questions
Q: Does this mean we're getting rid of manual QA? A: We're evolving it. Manual exploratory testing remains valuable — trained exploratory testers consistently find bugs that automated tests miss. But manual regression testing is eliminated. Developers own automated testing; QE focuses on strategy, coaching, and high-value exploratory work.
Q: Who writes the tests — developers or QEs? A: Developers write unit and integration tests for their code. QEs define the test strategy, coach developers on what and how to test, conduct exploratory testing, and build shared infrastructure. Think of it like security: every developer writes secure code, but security engineers set standards and do penetration testing.
Q: How does this work with AI-generated code? A: AI can generate tests, but AI-generated tests have known risks (tautological tests, testing implementation not behavior). QEs review test strategy and effectiveness, mutation testing validates test suite quality, and CodeRabbit enforces testing standards in code review.
Q: Won't this slow teams down? A: Short-term, introducing quality gates adds friction. Long-term, it reduces escaped defects, incident frequency, and time spent firefighting. Google's data shows that investing in testing infrastructure pays back within 6-12 months through fewer production issues and faster development velocity.
Q: What about our existing Automation Engineers? A: They can transition to Quality Engineers (with coaching/strategy focus) or join the Quality Engineering team (infrastructure focus), depending on their strengths and interests. Both paths are valuable.
Q: Why Playwright over Cypress? A: Multi-browser support, Java bindings for backend teams, native parallelization (free vs. Cypress Cloud paid), built-in visual regression, superior Next.js/BLAST compatibility, and 2-3x faster CI execution. See Section 4.4 for full comparison.
Q: How does this strategy apply to Intelligence/ML teams? A: Intelligence teams operate on a different technology stack (Python, LightGBM, Airflow, MLflow) and face different quality risks (data drift, model accuracy degradation, pipeline failures). The guild has a dedicated Intelligence track with its own QE(s), testing practices (data quality, pipeline tests, model validation), and tools (Great Expectations, golden file testing, Hypothesis). The standard testing honeycomb is replaced by a "testing diamond" adapted for ML systems. See Section 4.6.
Q: Do ML teams need the same coverage thresholds as App teams? A: No. ML training code is inherently harder to unit test because outcomes depend on data distributions, not deterministic logic. We set a lower initial coverage threshold (40% vs 60% for App) but target the testable parts — utility functions, data transformations, API layers, and configuration validation. Quality in ML comes from data validation (Great Expectations), model validation (MLflow metrics), and pipeline testing — not just code coverage.
Q: How long will the Selenium and Cypress migration take? A: With AI-assisted conversion (Claude Code), approximately 3 months with 2-3 engineers. Without AI, it would take 6+ months. The migration runs Selenium and Cypress tests in parallel with new Playwright tests during validation, so there's no coverage gap. See Section 11 for the full plan.
References
- Google, Software Engineering at Google — Chapter 11 (Testing Overview), Chapter 12 (Unit Testing)
- Google Research, State of Mutation Testing at Google (2018)
- Google Testing Blog, Where Do Our Flaky Tests Come From? (2017)
- GitLab, Testing Guide (docs.gitlab.com/ee/development/testing_guide)
- Atlassian, Quality Assistance vs Quality Assurance model
- Spotify, Testing Honeycomb and Squad/Chapter/Guild organizational model
- Stripe, Paved Road approach to developer experience
- Netflix, Paved Road infrastructure and Toxiproxy
- Microsoft, Combined Engineering announcement (2014)
- Pact Foundation, Consumer-Driven Contract Testing (pact.io)
- CodeRabbit, Skills and Configuration documentation
- Google, Reliable Machine Learning — ML testing and validation patterns
- Great Expectations, Data Quality and Testing documentation (greatexpectations.io)
- Hypothesis, Property-Based Testing for Python (hypothesis.readthedocs.io)
Document History
| Date | Author | Change |
|---|---|---|
| 2026-03-04 | Antonio Cortés | Initial draft |
| 2026-03-04 | Antonio Cortés | Added: engineer testing responsibilities (3.4), Augment Code evaluation (5.5), current CI/CD state analysis (Section 8), Selenium-to-Playwright migration plan (Section 11). Updated QE staffing to 4-6 embedded. |
| 2026-03-04 | Antonio Cortés | Added: Intelligence domain testing strategy (Section 4.6) with ML testing diamond, repo analysis, and ML-specific quality gates. Restructured guild model to two-track (App/Platform + Intelligence) with separate embedded QE profiles, cadence, standards, tooling, and roadmap items. |
| 2026-03-04 | Antonio Cortés | Updated L6 title to Staff/Lead QE. Changed language from "recommendation" to "proposal" throughout. Added phasing notes: Intelligence track as Phase 2 for the Quality Guild. Updated folder structure. Staff/Lead QE leads Quality Engineering Team with automation architecture responsibilities. |
| 2026-03-04 | Antonio Cortés | Added Hammer pricing regression testing analysis and 3-phase modernization plan (Section 4.6.7). Added Hammer to existing strengths (4.6.5) and test infrastructure summary (Section 8.5). Added Hammer automation items to implementation roadmap (Phases 1, 2, and 4 Intelligence track). |
| 2026-03-04 | Antonio Cortés | Renamed "Quality Platform Team" to "Quality Engineering Team" throughout to avoid confusion with the App/Platform engineering domain. |
| 2026-03-05 | Antonio Cortés | Renamed document from "QA & Automation Strategy" to "Quality Engineering Strategy." Added Section 13: pre-identified potential initiatives for Quality Guild (5), QE Team Shared (7), QE Team App/Platform (8), and QE Team Intelligence (12) — 32 total. |