A Claude-powered BDD test-automation framework on TypeScript + playwright-bdd (Gherkin → Playwright Test runner). It turns a plain-language user story into a working, reviewed test suite — grounded in your app's real DOM and real code, not guesses.
What it is — a copilot for the first draft of your BDD tests. You describe a feature; aiwright proposes what to test (risk-ranked scenarios you curate), grounds the code in the live DOM (inspected, verified selectors) and your existing page objects/steps, self-heals until it compiles and runs, and classifies failures so you know whether the test or the app is wrong.
What it is not — an autonomous test writer that ships unreviewed tests. A human stays in the loop on every side-effecting step, and it never fakes a green test for behaviour your app does not have: a real app bug is escalated, not "healed" away.
flowchart TD
story(["📝 User Story"]):::io
page(["🌐 Live page"]):::io
pass(["✅ reviewed, green suite"]):::io
design["<b>design</b><br/>risk-ranked scenario ideas"]:::step
human{{"👤 human curates scope"}}:::human
inspect["<b>inspect</b><br/>real DOM → verified selectors"]:::step
generate["<b>generate</b><br/>.feature · steps · page objects"]:::step
verify["<b>verify</b> · tsc"]:::step
heal["<b>heal</b><br/>rewrite until it compiles"]:::heal
run["<b>run</b><br/>Playwright · real browser"]:::step
healsel["<b>heal-selectors</b><br/>re-inspect + patch real selector"]:::heal
analyze["<b>analyze</b><br/>classify the failure"]:::step
escalate[["⚠️ real app-bug<br/>STOP — ask a human"]]:::stop
story --> design --> human --> generate
page --> inspect
inspect -. verified selectors .-> generate
generate --> verify
verify -- fails --> heal --> verify
verify -- ok --> run
run -- green --> pass
run -- fails --> analyze
healsel --> verify
analyze -- "test-bug<br/>(locator)" --> healsel
analyze -- "flaky / env" --> run
analyze -- "app-bug" --> escalate
classDef io fill:#161b22,stroke:#30363d,color:#e6edf3
classDef step fill:#1f6feb,stroke:#1158c7,color:#ffffff
classDef heal fill:#238636,stroke:#196c2e,color:#ffffff
classDef human fill:#9e6a03,stroke:#7d5400,color:#ffffff
classDef stop fill:#da3633,stroke:#b62324,color:#ffffff
Drive it autonomously with the agent, or run each step yourself from the CLI. The two green nodes are the self-heal loops; a real app bug is escalated, never healed green.
npm install
npx playwright install chromium
cp .env.example .env # set ANTHROPIC_API_KEY

Retargeting to your app: everything app-specific lives in one file, aiwright.config.ts
(targetUrl, apiBaseUrl, openApiSpec, testIdAttributes); env vars
(TARGET_URL/BASE_URL, API_BASE_URL, OPENAPI_SPEC) override it at runtime. To scaffold the
project-owned layer for a fresh target (config, .env, a starter story, the directory layout):
npm run ai:init -- [dir] --target https://your-app.com --api https://api.your-app.com

The agent is an Anthropic tool-use loop that sequences the whole pipeline itself
(design → inspect → generate → verify → heal → run → analyze), pausing for your OK before
any side-effecting step and self-healing failures along the way.
npm run ai:agent -- stories/getmobil-search.txt # interactive: confirms inspect/generate/run
npm run ai:agent -- stories/getmobil-search.txt --auto # CI / non-interactive: gates become non-blocking

It carries state across steps (verified selectors, generated files, run history) in
reports/agent-run-<slug>.json, so it reasons about the whole run instead of starting each
command from scratch.
Guardrails — "amplify, don't replace":
- Read-only steps (design, verify, analyze, heal) run automatically.
- Side-effecting steps (inspect, probe, generate, run, heal-selectors, heal-contract) pause for a human OK (skipped under --auto, but the run record still shows what would have asked).
- Semantic escalation: if the agent decides a failure is a real app bug (not a test bug), it stops and asks a human. It will not rewrite the test to make a genuine regression go green.
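The guardrails above can be pictured as a small policy function. This is a minimal sketch, not the project's actual policy module: the step names come from the lists above, while `PipelineStep`, `GateDecision`, `gateStep`, and `mustEscalate` are hypothetical names introduced for illustration.

```typescript
// Hypothetical sketch of the guardrail policy. Step names are taken from the
// README; every type and function name here is illustrative.
type PipelineStep =
  | "design" | "verify" | "analyze" | "heal"
  | "inspect" | "probe" | "generate" | "run" | "heal-selectors" | "heal-contract";

const READ_ONLY_STEPS: PipelineStep[] = ["design", "verify", "analyze", "heal"];

interface GateDecision {
  step: PipelineStep;
  proceed: boolean;        // may the agent execute the step?
  wouldHaveAsked: boolean; // recorded even when --auto skips the prompt
}

function gateStep(
  step: PipelineStep,
  auto: boolean,
  confirm: (step: PipelineStep) => boolean,
): GateDecision {
  const sideEffecting = READ_ONLY_STEPS.indexOf(step) === -1;
  if (!sideEffecting) {
    return { step, proceed: true, wouldHaveAsked: false };
  }
  if (auto) {
    // Non-blocking gate: run, but remember that a human would have been asked.
    return { step, proceed: true, wouldHaveAsked: true };
  }
  return { step, proceed: confirm(step), wouldHaveAsked: true };
}

// Semantic escalation: an app-bug classification stops the run.
type FailureClass = "app-bug" | "test-bug" | "flaky" | "environment";

function mustEscalate(failure: FailureClass): boolean {
  return failure === "app-bug";
}
```

Keeping the decision a pure function of the step, the mode, and a confirm callback makes the gate easy to test without a terminal prompt.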
Each stage is also a standalone command — useful when you want to review between steps:
npm run ai:design -- stories/getmobil-search.txt
npm run ai:inspect -- https://getmobil.com
npm run ai:probe -- docs/api/openapi.json --live # API lane: spec → verified endpoint map
npm run ai:generate -- stories/getmobil-search.txt --design <report> --selectors <map>
npm run ai:analyze

The feedback loop is closed in two places, at compile time and at run time, both bounded (they stop and escalate instead of looping forever) and both honest (a real app bug is never healed green):
| Layer | Trigger | What it does |
|---|---|---|
| heal (compile) | verify (tsc) fails | Feeds the TypeScript errors + current sources back to the model and rewrites only what's needed to compile (merging new members into existing page objects). Re-verifies. |
| heal-selectors (runtime, UI) | a scenario fails on a locator (timeout / strict-mode / not-visible) | Pulls the failing locator from the run report, re-inspects the live page, and patches the bad selector with a real one from the fresh map. Writes are confined to src/pages/ and src/steps with a .bak rollback, never touch Gherkin step text, and re-verify with tsc. |
| heal-contract (runtime, API) | an API scenario fails on schema drift (a thrown Contract violation, a body-field assertion, or an unexpected status) | Pulls the failure from the report, re-fetches the live response from the endpoint, and rewrites the stale contract/assertions. Writes are confined to src/api/ and src/steps/api with a .bak rollback, never touch Gherkin step text, and re-verify with tsc. |
A locator or contract drift is treated as a test bug, not an app bug — the selector/schema drifted, the app didn't. Re-inspect/re-fetch + patch is exactly the right fix, and it's automatic. A real API regression (an error status or missing data the story requires) is an app-bug: escalated, never healed green.
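What "bounded" means for the compile loop can be sketched as follows. This is a hedged illustration only: `verify` and `heal` are injected stand-ins for "run tsc" and "ask the model to rewrite", and `maxAttempts`, `VerifyOutcome`, `HealOutcome`, and `healUntilCompiles` are assumed names, not the project's API.

```typescript
// Hypothetical sketch of a bounded compile self-heal loop.
interface VerifyOutcome {
  ok: boolean;
  errors: string[];
}

interface HealOutcome {
  healed: boolean;    // true once verify passes
  attempts: number;   // heal rewrites performed
  escalated: boolean; // true when the budget ran out and a human is needed
}

function healUntilCompiles(
  verify: () => VerifyOutcome,
  heal: (errors: string[]) => void,
  maxAttempts: number,
): HealOutcome {
  let attempts = 0;
  let outcome = verify();
  while (!outcome.ok && attempts < maxAttempts) {
    heal(outcome.errors); // feed the compiler errors back to the rewriter
    attempts += 1;
    outcome = verify();   // always re-verify after a rewrite
  }
  // Out of budget and still failing: stop and escalate instead of looping.
  return { healed: outcome.ok, attempts, escalated: !outcome.ok };
}
```

The loop never reports success without a passing verify, which is the property that keeps a heal from producing a fake green.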
From a user story, produces a test design for a human to review: risk areas, prioritised scenario ideas, open questions (ambiguous requirements), assumptions, and deliberate out-of-scope calls.
npm run ai:design -- stories/getmobil-search.txt

Output: reports/test-design-<slug>.md. Flow: review/edit → approve the scenarios → generate.
Opens the live page and extracts a stability-ranked selector map from the DOM, so the generator uses real selectors instead of guessing. Strategy priority:
test-id attribute > stable id > role + accessible name > static text > structural CSS
Recognised test-id attributes: data-test, data-testid, data-test-id, data-cy,
data-qa, data-automation-id, data-e2e — and the selector is built with the actual
attribute name found. Each selector is verified unique against the live DOM; ambiguous ones
are scoped to a stable ancestor; repeated list rows collapse to one representative to
parametrize. Page text is PII-redacted before the map is written.
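The strategy priority can be sketched as a simple cascade. This is an assumption-laden illustration rather than the inspector's real code: `ElementFacts`, `RankedSelector`, and `pickSelector` are hypothetical, any `id` is treated as stable, the structural fallback is reduced to the tag name, and the uniqueness check against the live DOM is left out. The role and text selector strings follow Playwright's selector syntax.

```typescript
// Hypothetical sketch of the selector strategy priority.
const TEST_ID_ATTRIBUTES = [
  "data-test", "data-testid", "data-test-id", "data-cy",
  "data-qa", "data-automation-id", "data-e2e",
];

interface ElementFacts {
  tag: string;
  attributes: { [attr: string]: string };
  role?: string;
  accessibleName?: string;
  staticText?: string;
}

interface RankedSelector {
  strategy: "test-id" | "id" | "role" | "text" | "css";
  selector: string;
}

function pickSelector(el: ElementFacts): RankedSelector {
  for (const attr of TEST_ID_ATTRIBUTES) {
    const value = el.attributes[attr];
    if (value !== undefined) {
      // Build the selector with the attribute name actually found on the element.
      return { strategy: "test-id", selector: `[${attr}="${value}"]` };
    }
  }
  const id = el.attributes["id"];
  if (id !== undefined) {
    return { strategy: "id", selector: `#${id}` }; // assumes the id is stable
  }
  if (el.role && el.accessibleName) {
    return { strategy: "role", selector: `role=${el.role}[name="${el.accessibleName}"]` };
  }
  if (el.staticText) {
    return { strategy: "text", selector: `text="${el.staticText}"` };
  }
  return { strategy: "css", selector: el.tag }; // last resort: structural CSS
}
```

For an input carrying both data-test-id and id, the test-id attribute wins, so the result uses the data-test-id value and ignores the id.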
npm run ai:inspect -- https://getmobil.com # any public page
npm run ai:inspect -- "https://getmobil.com/ara/?term=iphone" # a results/listing page

Output: reports/selector-map-<slug>.json. Accepts a full URL or a path resolved against
BASE_URL.
The API counterpart of inspect. Instead of a live DOM it reads an OpenAPI spec and
turns it into a map of real, declared endpoints (methods, params, response schemas), so
API tests target real paths/shapes instead of guessing. With --live it also calls each GET
endpoint and records the observed status — the verification half, mirroring how inspect
checks selectors against the live DOM. Deterministic (no LLM); JSON spec only (so it parses
dependency-free).
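The spec-only half of probe amounts to flattening the OpenAPI paths object into a list of declared operations. The sketch below assumes the JSON spec has already been parsed; `OpenApiDocument`, `EndpointEntry`, and `buildEndpointMap` are hypothetical names, and the real endpoint map format may differ.

```typescript
// Hypothetical sketch: flatten a parsed OpenAPI JSON document into an endpoint list.
interface OpenApiOperation {
  parameters?: { name: string; in: string }[];
  responses?: { [statusCode: string]: unknown };
}

interface OpenApiDocument {
  paths: { [path: string]: { [method: string]: OpenApiOperation } };
}

interface EndpointEntry {
  method: string;
  path: string;
  params: string[];
  declaredStatuses: string[];
}

const HTTP_METHODS = ["get", "put", "post", "delete", "patch", "head", "options", "trace"];

function buildEndpointMap(spec: OpenApiDocument): EndpointEntry[] {
  const entries: EndpointEntry[] = [];
  for (const path of Object.keys(spec.paths)) {
    const item = spec.paths[path];
    if (!item) continue;
    for (const method of HTTP_METHODS) {
      const operation = item[method];
      if (!operation) continue; // method not declared for this path
      entries.push({
        method: method.toUpperCase(),
        path,
        params: (operation.parameters || []).map((p) => p.name),
        declaredStatuses: Object.keys(operation.responses || {}),
      });
    }
  }
  return entries;
}
```

Because the function only reads what the spec declares, it stays deterministic; a --live pass would then call each GET entry and record the observed status next to the declared ones.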
npm run ai:probe # spec only (default docs/api/openapi.json)
npm run ai:probe -- docs/api/openapi.json --live # also verify each GET against the running API
npm run ai:probe -- <spec.json> --base http://localhost:4010 --live

Output: reports/endpoint-map-<slug>.json.
npm run ai:generate -- stories/getmobil-search.txt

Grounded generation (recommended):
# --design: exact scenarios to build; --selectors: verified selectors, verbatim; --max: quick trial, top N scenarios only
npm run ai:generate -- stories/getmobil-search.txt \
  --design reports/test-design-product-search-on-getmobil.md \
  --selectors reports/selector-map-getmobil-com.json \
  --max 2

- --design makes the curated design the authoritative scope: it builds exactly those scenarios, inventing none and dropping none.
- --selectors makes the generator use the inspected selectors verbatim instead of guessing.
- --max N caps a fast trial run to the N highest-priority scenarios (works with --design).
- --verify type-checks the result; --fix runs the compile self-heal loop; --run executes the scenarios once they compile.
It never silently overwrites an existing file — on a conflict it writes a .generated
sibling. New page objects come with a fixture-registration snippet in the output notes.
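The no-overwrite rule can be illustrated with a small path resolver. This is only a sketch: the exact name of the .generated sibling is an assumption (here the marker is placed before the file extension), and `resolveOutputPath` and the injected `exists` check are hypothetical.

```typescript
// Hypothetical sketch: choose where to write a generated file without
// overwriting an existing one. The sibling naming scheme is assumed.
function resolveOutputPath(target: string, exists: (path: string) => boolean): string {
  if (!exists(target)) return target; // no conflict: write in place
  const dot = target.lastIndexOf(".");
  const slash = target.lastIndexOf("/");
  // No extension in the file name (or a dotfile): append the marker.
  if (dot <= slash + 1) return target + ".generated";
  // Otherwise insert the marker before the extension.
  return target.slice(0, dot) + ".generated" + target.slice(dot);
}
```

Under this assumption, a conflict on src/pages/SearchPage.ts would produce src/pages/SearchPage.generated.ts, leaving the existing file for a human to diff and merge.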
run executes the scenarios in a real browser and retries on failure to tell a flaky
scenario (passes on re-run) from a consistent one. analyze reads the run's results and
classifies each failure as app-bug | test-bug | flaky | environment with a root cause and a
concrete fix.
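The retry rule that separates a flaky scenario from a consistent failure can be sketched like this. It is an illustration under assumptions: `attempt` stands in for one scenario execution, and `maxRetries`, `RunVerdict`, and `runWithRetries` are hypothetical names.

```typescript
// Hypothetical sketch: retry a failing scenario to tell flaky from consistent.
type RunVerdict = "passed" | "flaky" | "consistent-failure";

function runWithRetries(
  attempt: () => boolean, // true when the scenario passes
  maxRetries: number,
): { verdict: RunVerdict; attempts: number } {
  let attempts = 1;
  if (attempt()) return { verdict: "passed", attempts };
  while (attempts <= maxRetries) {
    attempts += 1;
    // A pass on re-run means the first failure was not deterministic.
    if (attempt()) return { verdict: "flaky", attempts };
  }
  return { verdict: "consistent-failure", attempts };
}
```

Only consistent failures need a root-cause classification such as app-bug or test-bug; a flaky verdict already explains itself.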
npm test # all scenarios (parallel)
npm run ai:analyze # → reports/ai-analysis.md

npm test # all scenarios (UI + API, parallel) → one Allure result set
npm run test:api # only the @api lane (browserless)
npm run test:smoke # only @smoke tagged
npm run test:ui # Playwright UI mode
HEADLESS=false npm test # watch the browser
npm run report # generate + open the Allure report

Reporting: Allure is the human-facing report for both lanes (UI + API): run history, per-step detail, and traces/screenshots attached on failure. The test scripts clear reports/allure-results first, so npm test (which runs both projects) gives one combined report; npm run report renders it (needs Java for the Allure CLI). A Cucumber JSON (reports/*-report.json) is still emitted, but only as the machine feed the AI pipeline reads (analyze/heal-selectors/heal-contract), not a report you open.
CI note: npm test targets the live app (https://getmobil.com). Public sites can sit behind bot challenges that block data-centre IPs (e.g. CI runners), so the browser tests run locally where a normal browser/IP passes. CI gates on the offline checks (type-check, redaction). Reports land under reports/; screenshots and traces are captured for failed scenarios under reports/test-results/.
Alongside the UI lane there's a second, browserless lane for HTTP/API tests — same
BDD/Gherkin format, but driven by Playwright's APIRequestContext instead of a page. The two
lanes share the features/ tree and are split by tag: @api scenarios run in the api
Playwright project (no browser), everything else in chromium.
npm run test:api # @api scenarios only (browserless) → Allure results + cucumber JSON feed
npm run mock:api # run the local mock API standalone (otherwise auto-booted)

getmobil.com doesn't publish a JSON API yet, so a dummy contract (docs/api/openapi.json)
plus a local Express mock (mock/server.ts) stand in. Playwright's webServer auto-boots
the mock for test:api. When the real API ships, point API_BASE_URL at it and update the
spec — the clients/steps don't change.
The lane mirrors the UI structure: src/api/BaseApiClient.ts is the page-object analogue over
APIRequestContext, src/api/clients/*Api.ts are the resource clients, src/api/contracts/*
are dependency-free response validators (the seam heal-contract acts on), and
src/api/fixtures.ts wires it all into the api BDD project. The same agent pipeline applies
— probe grounds it (the API twin of inspect) and heal-contract self-heals schema drift
(the API twin of heal-selectors).
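A dependency-free response validator of the kind described above might look like the sketch below. The names `FieldContract` and `assertContract` and the product fields are hypothetical; only the idea of throwing a "Contract violation" on drift comes from this document.

```typescript
// Hypothetical sketch of a dependency-free contract validator.
type FieldType = "string" | "number" | "boolean";

interface FieldContract {
  [field: string]: FieldType;
}

function assertContract(body: unknown, contract: FieldContract): void {
  if (typeof body !== "object" || body === null) {
    throw new Error("Contract violation: body is not an object");
  }
  const record = body as { [key: string]: unknown };
  for (const field of Object.keys(contract)) {
    const expected = contract[field];
    const actual = typeof record[field];
    if (actual !== expected) {
      throw new Error(`Contract violation: ${field} expected ${expected}, got ${actual}`);
    }
  }
}

// Illustrative contract; the field names are assumptions, not the real spec.
const productContract: FieldContract = { id: "number", name: "string", price: "number" };
```

Keeping the contract as plain data is what gives a healer a narrow seam to act on: when the live response drifts, only the contract object needs rewriting, not the steps or the Gherkin text.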
A small browser front end over the same pipeline: paste a user story → review the AI design and tick the scenarios you want → generate the code, preview it, and save it into the project. The API key stays server-side.
npm run web # http://localhost:5173

Endpoints (/api/design, /api/generate, /api/save, /api/fix) reuse the CLI functions
directly. Saving type-checks the result, and Auto-fix runs the compile self-heal loop.
The server binds to 127.0.0.1, rejects non-localhost Host headers, and can require a
shared token (AIWRIGHT_TOKEN).
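The Host-header and token checks can be expressed as one pure function. This is a sketch under assumptions, not the server's code: `IncomingCheck` and `isRequestAllowed` are hypothetical, the set of hostnames treated as local is assumed, and how the token reaches the server (header, query, cookie) is left open.

```typescript
// Hypothetical sketch of the localhost + shared-token guard.
interface IncomingCheck {
  hostHeader: string | undefined;     // raw Host header, e.g. "localhost:5173"
  presentedToken: string | undefined; // token supplied by the client, if any
}

function isRequestAllowed(req: IncomingCheck, requiredToken: string | undefined): boolean {
  if (!req.hostHeader) return false;
  // Strip an optional :port; a bracketed IPv6 host such as [::1] keeps its brackets.
  const host = req.hostHeader.replace(/:\d+$/, "").toLowerCase();
  const localHosts = ["localhost", "127.0.0.1", "[::1]"]; // assumed allow-list
  if (localHosts.indexOf(host) === -1) return false;
  // Token check applies only when a token is configured (e.g. via AIWRIGHT_TOKEN).
  if (requiredToken === undefined || requiredToken === "") return true;
  return req.presentedToken === requiredToken;
}
```

Rejecting foreign Host headers matters even on a loopback-bound server, because it blocks DNS-rebinding style requests from a page open in the same browser.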
1 — Inspect finds getmobil's real selectors (it has data-test-id="selenium-..." hooks
that the broader recognition picks up; the old data-test-only inspector missed them):
$ npm run ai:inspect -- https://getmobil.com
Getmobil ile Yenilenmiş Teknoloji Ürünlerini Keşfedin!
Elements found : 69
Unique selectors : 53
Repeated (lists) : 2 (parametrize per item)
Needs disambig. : 9
Unresolved (0 hit): 5
Selector map: reports/selector-map-getmobil-com.json
# search box → [data-test-id="selenium-header-search-input"]
2 — Self-heal recovers from a drifted selector (a test-bug, fixed automatically):
break a selector → npm test → ✘ Timeout waiting for locator('…WRONG…')
→ agent: heal-selectors → re-inspect getmobil, patch the real selector, tsc ✓
→ npm test → ✓ 2 passed
The bundled stories/getmobil-search.txt (product search) is live-green end to end.
playwright.config.ts defineBddProject (chromium UI + browserless api) + reporter + webServer
features/ Gherkin feature files (api/ holds the @api lane)
fixtures/ Test data (users.json, sensitive/ …)
docs/api/ OpenAPI spec(s) — grounding for the API lane (openapi.json)
mock/ Local Express mock standing in for the real API (server.ts)
src/
ai/ Claude pipeline: testDesigner · pageInspector · specProbe · testGenerator ·
failureAnalyzer · selectorHealer · contractHealer · prompts · redact · client
agent/ Autonomous orchestrator: orchestrator (tool-use loop) · tools ·
state · policy (guardrails) · prompts · io
cli/ ai:design / ai:inspect / ai:probe / ai:generate / ai:analyze / ai:agent
api/ API lane: BaseApiClient · clients/*Api · contracts/* (validators) · fixtures
pages/ Page Object Model (extends BasePage)
selectors/ Centralised selector modules (one per site, *.selectors.ts)
steps/ Step definitions (fixture-based, via createBdd); steps/api/ for the @api lane
fixtures/ Playwright fixtures (page objects) + data helpers
web/ Express server exposing the pipeline (npm run web)
public/ AI QA Studio single-page UI
.features-gen/ specs generated by bddgen (not committed)
reports/ designs, selector/endpoint maps, run state, analysis (not committed)
Sensitive data (national IDs, cards, IBANs, …) never reaches the LLM. Three layers:
- Isolation: real PII lives under fixtures/sensitive/, git-ignored (only *.example.json templates are committed).
- Read-deny: permissions.deny in .claude/settings.json stops the coding agent from reading fixtures/sensitive/** and .env.
- Redaction: before any Claude API call (src/ai/redact.ts):
  - Pattern-based: national ID (11 digits), card, IBAN, email, phone.
  - Value-based denylist: every real value read via loadSensitive() is masked verbatim even when it matches no format (names, secret codes, …).
Regression check: npm run verify:redaction. Policy: fixtures/sensitive/README.md.
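The two redaction modes can be combined in a few lines. The sketch below is not the contents of src/ai/redact.ts: the regular expressions are simplified assumptions (phone numbers are omitted, and the IBAN pattern only matches the compact form), and `PII_PATTERNS`, `redact`, and the placeholder format are hypothetical.

```typescript
// Hypothetical sketch of pattern-based + denylist redaction.
const PII_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "IBAN", pattern: /\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g },
  { label: "EMAIL", pattern: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "CARD", pattern: /\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b/g },
  { label: "NATIONAL_ID", pattern: /\b\d{11}\b/g },
];

function redact(text: string, denylist: string[]): string {
  let out = text;
  // Denylist first, longest value first, so a short value cannot split a longer one.
  const values = denylist
    .filter((v) => v.length > 0)
    .sort((a, b) => b.length - a.length);
  for (const value of values) {
    out = out.split(value).join("[REDACTED:VALUE]"); // verbatim match, no regex escaping needed
  }
  for (const { label, pattern } of PII_PATTERNS) {
    out = out.replace(pattern, `[REDACTED:${label}]`);
  }
  return out;
}
```

The denylist pass is what covers values with no recognisable format, such as a name or a one-off secret code, which no pattern could catch.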
npm run eval scores the pipeline (redaction, project-surface discovery, and the inspector
against the live home page); npm run eval -- --full also checks that design produces
structured output and that generation compiles. Non-zero on failure, so it can gate CI —
giving a number to "how well does it work" instead of a vibe.
- Steps use fixtures: async ({ searchPage }, param) => …, never new SearchPage(page).
- Selectors are centralised: one src/pages/selectors/<site>.selectors.ts per app; page objects read from it, with no raw selector strings scattered inline.
- Selector priority: a data-test* attribute > stable id > role/accessible name. No brittle structural CSS chains.
- Test data lives in fixtures/*.json: no hardcoded credentials in steps (getUser(...)).
- Scenarios are independent: shared setup goes in Background; no state shared between scenarios. Generated scenarios target 6–10 declarative steps each.
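The first two conventions can be sketched together. This is a self-contained illustration, not project code: `PageLike` and `LocatorLike` are minimal stand-ins for Playwright's Page and Locator, `getmobilSelectors`, `searchFor`, and `searchStep` are hypothetical names, and in the project the step body would be registered through the Given/When/Then functions returned by createBdd.

```typescript
// Hypothetical selector module: one object per site, no raw strings elsewhere.
const getmobilSelectors = {
  header: {
    searchInput: '[data-test-id="selenium-header-search-input"]',
  },
} as const;

// Minimal structural stand-ins so the sketch compiles without Playwright.
interface LocatorLike {
  fill(value: string): Promise<void>;
  press(key: string): Promise<void>;
}
interface PageLike {
  locator(selector: string): LocatorLike;
}

// Page object: reads selectors from the central module.
class SearchPage {
  private readonly page: PageLike;

  constructor(page: PageLike) {
    this.page = page;
  }

  async searchFor(term: string): Promise<void> {
    const input = this.page.locator(getmobilSelectors.header.searchInput);
    await input.fill(term);
    await input.press("Enter");
  }
}

// Step body in the fixture style: the page object arrives as a fixture,
// it is never constructed inside the step.
const searchStep = async (
  { searchPage }: { searchPage: SearchPage },
  term: string,
): Promise<void> => searchPage.searchFor(term);
```

When a selector drifts, the only edit is in the selector module, which is also what keeps a selector heal from having to touch step text.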
- playwright-bdd core (fixtures, POM, parallel runs, reporting)
- design: user story → risk-ranked test design ("what to test")
- inspect: live page → verified, stability-ranked selector map
- generate: user story → feature + steps + page objects (structured outputs)
- analyze: failure classification (app-bug | test-bug | flaky | environment)
- agent: autonomous orchestrator over the whole pipeline, with guardrails
- self-healing: compile (heal) + runtime selector (heal-selectors) + runtime contract (heal-contract)
- API lane: browserless APIRequestContext suite with probe (OpenAPI → endpoint map) and heal-contract
- generate API mode: endpoint map → @api feature + steps + client/contract files (build API suites end to end)
- CI/CD integration (GitHub Actions: type-check + redaction gate)
- Allure reporting: combined UI + API report (history, steps, traces/screenshots); cucumber JSON kept as the AI feed
- human-approval web UI over the agent loop (surface run state + confirm/escalate gates)
- Jira integration (pull stories, write results back)
- MCP server (secure tool access layer)
- TestRail / Slack notifications