aiwright — AI QA Agent

A Claude-powered BDD test-automation framework on TypeScript + playwright-bdd (Gherkin → Playwright Test runner). It turns a plain-language user story into a working, reviewed test suite — grounded in your app's real DOM and real code, not guesses.

What it is — a copilot for the first draft of your BDD tests. You describe a feature; aiwright proposes what to test (risk-ranked scenarios you curate), grounds the code in the live DOM (inspected, verified selectors) and your existing page objects/steps, self-heals until it compiles and runs, and classifies failures so you know whether the test or the app is wrong.

What it is not — an autonomous test writer that ships unreviewed tests. A human stays in the loop on every side-effecting step, and it never fakes a green test for behaviour your app does not have: a real app bug is escalated, not "healed" away.

flowchart TD
    story(["📝 User Story"]):::io
    page(["🌐 Live page"]):::io
    pass(["✅ reviewed, green suite"]):::io

    design["<b>design</b><br/>risk-ranked scenario ideas"]:::step
    human{{"👤 human curates scope"}}:::human
    inspect["<b>inspect</b><br/>real DOM → verified selectors"]:::step
    generate["<b>generate</b><br/>.feature · steps · page objects"]:::step
    verify["<b>verify</b> · tsc"]:::step
    heal["<b>heal</b><br/>rewrite until it compiles"]:::heal
    run["<b>run</b><br/>Playwright · real browser"]:::step
    healsel["<b>heal-selectors</b><br/>re-inspect + patch real selector"]:::heal
    analyze["<b>analyze</b><br/>classify the failure"]:::step
    escalate[["⚠️ real app-bug<br/>STOP — ask a human"]]:::stop

    story --> design --> human --> generate
    page --> inspect
    inspect -. verified selectors .-> generate
    generate --> verify
    verify -- fails --> heal --> verify
    verify -- ok --> run
    run -- green --> pass
    run -- fails --> analyze
    healsel --> verify
    analyze -- "test-bug<br/>(locator)" --> healsel
    analyze -- "flaky / env" --> run
    analyze -- "app-bug" --> escalate

    classDef io fill:#161b22,stroke:#30363d,color:#e6edf3
    classDef step fill:#1f6feb,stroke:#1158c7,color:#ffffff
    classDef heal fill:#238636,stroke:#196c2e,color:#ffffff
    classDef human fill:#9e6a03,stroke:#7d5400,color:#ffffff
    classDef stop fill:#da3633,stroke:#b62324,color:#ffffff

Drive it autonomously with the agent, or run each step yourself from the CLI. The two green nodes are the self-heal loops; a real app bug is escalated, never healed green.

Setup

npm install
npx playwright install chromium
cp .env.example .env      # set ANTHROPIC_API_KEY

Retargeting to your app — everything app-specific lives in one file, aiwright.config.ts (targetUrl, apiBaseUrl, openApiSpec, testIdAttributes); env vars (TARGET_URL/BASE_URL, API_BASE_URL, OPENAPI_SPEC) override it at runtime. To scaffold the project-owned layer for a fresh target (config, .env, a starter story, the directory layout):

npm run ai:init -- [dir] --target https://your-app.com --api https://api.your-app.com
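As a rough illustration of the override order described above, the sketch below merges a config object with environment variables. The field and variable names come from this README; the `resolveConfig` helper, the `AiwrightConfig` type, and the rule that TARGET_URL wins over BASE_URL when both are set are assumptions, not the project's actual code.

```typescript
// Hypothetical shape: field names are from the README, the types are assumed.
interface AiwrightConfig {
  targetUrl: string;
  apiBaseUrl?: string;
  openApiSpec?: string;
  testIdAttributes?: string[];
}

type Env = Record<string, string | undefined>;

// Environment variables, when set and non-empty, override the file config.
// Assumption: TARGET_URL takes precedence over BASE_URL when both are present.
function resolveConfig(fileConfig: AiwrightConfig, env: Env): AiwrightConfig {
  const pick = (...values: (string | undefined)[]): string | undefined =>
    values.find((v) => v !== undefined && v !== "");

  return {
    ...fileConfig,
    targetUrl: pick(env.TARGET_URL, env.BASE_URL) ?? fileConfig.targetUrl,
    apiBaseUrl: pick(env.API_BASE_URL) ?? fileConfig.apiBaseUrl,
    openApiSpec: pick(env.OPENAPI_SPEC) ?? fileConfig.openApiSpec,
  };
}

// Usage: only BASE_URL is set, so it replaces targetUrl and nothing else.
const resolved = resolveConfig(
  { targetUrl: "https://your-app.com", apiBaseUrl: "https://api.your-app.com" },
  { BASE_URL: "http://localhost:3000" },
);
console.log(resolved.targetUrl); // http://localhost:3000
```

Passing the environment in as a plain object, rather than reading it globally, keeps the merge easy to test.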

Two ways to drive it

A) The autonomous agent (one command)

agent is an Anthropic tool-use loop that sequences the whole pipeline itself — design → inspect → generate → verify → heal → run → analyze — pausing for your OK before any side-effecting step, and self-healing failures along the way.

npm run ai:agent -- stories/getmobil-search.txt        # interactive: confirms inspect/generate/run
npm run ai:agent -- stories/getmobil-search.txt --auto  # CI / non-interactive: gates become non-blocking

It carries state across steps (verified selectors, generated files, run history) in reports/agent-run-<slug>.json, so it reasons about the whole run instead of starting each command from scratch.

Guardrails — "amplify, don't replace":

Read-only steps (design, verify, analyze, heal) run automatically.
Side-effecting steps (inspect, probe, generate, run, heal-selectors, heal-contract) pause for a human OK (skipped under --auto, but the run record still shows what it would have asked).
Semantic escalation: if the agent decides a failure is a real app bug (not a test bug), it stops and asks a human — it will not rewrite the test to make a genuine regression go green.
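The guardrail split above can be pictured as a small policy function. This is a minimal sketch: the step names are the ones listed in this README, while `gate`, `GateDecision`, and the field names are hypothetical and may not match what src/agent/policy actually exposes.

```typescript
type Step =
  | "design" | "verify" | "analyze" | "heal"
  | "inspect" | "probe" | "generate" | "run" | "heal-selectors" | "heal-contract";

// Steps the README lists as side-effecting; everything else runs automatically.
const SIDE_EFFECTING: ReadonlySet<Step> = new Set<Step>([
  "inspect", "probe", "generate", "run", "heal-selectors", "heal-contract",
]);

interface GateDecision {
  step: Step;
  needsConfirmation: boolean; // true only when a human must approve first
  wouldHaveAsked: boolean;    // kept for the run record, even under --auto
}

function gate(step: Step, auto: boolean): GateDecision {
  const sideEffecting = SIDE_EFFECTING.has(step);
  return {
    step,
    needsConfirmation: sideEffecting && !auto,
    wouldHaveAsked: sideEffecting,
  };
}

console.log(gate("generate", false)); // interactive: needs a human OK
console.log(gate("generate", true));  // --auto: proceeds, but the ask is still recorded
```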

B) Step by step from the CLI

Each stage is also a standalone command — useful when you want to review between steps:

npm run ai:design   -- stories/getmobil-search.txt
npm run ai:inspect  -- https://getmobil.com
npm run ai:probe    -- docs/api/openapi.json --live    # API lane: spec → verified endpoint map
npm run ai:generate -- stories/getmobil-search.txt --design <report> --selectors <map>
npm run ai:analyze

Self-healing

The feedback loop is closed at compile time and at runtime. Every layer is bounded (it stops and escalates instead of looping forever) and honest (a real app bug is never healed green):

Layer	Trigger	What it does
`heal` (compile)	`verify` (tsc) fails	Feeds the TypeScript errors + current sources back to the model and rewrites only what's needed to compile (merging new members into existing page objects). Re-verifies.
`heal-selectors` (runtime, UI)	a scenario fails on a locator (timeout / strict-mode / not-visible)	Pulls the failing locator from the run report, re-inspects the live page, and patches the bad selector with a real one from the fresh map. Writes are confined to `src/pages`/`src/steps` with a `.bak` rollback, never touch Gherkin step text, and re-verify with `tsc`.
`heal-contract` (runtime, API)	an API scenario fails on schema drift (a thrown Contract violation, a body-field assertion, or an unexpected status)	Pulls the failure from the report, re-fetches the live response from the endpoint, and rewrites the stale contract/assertions. Writes are confined to `src/api`/`src/steps/api` with a `.bak` rollback, never touch Gherkin step text, and re-verify with `tsc`.

A locator or contract drift is treated as a test bug, not an app bug — the selector/schema drifted, the app didn't. Re-inspect/re-fetch + patch is exactly the right fix, and it's automatic. A real API regression (an error status or missing data the story requires) is an app-bug: escalated, never healed green.
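To make "bounded" concrete, here is a minimal sketch of a compile-heal loop that gives up after a fixed number of attempts. The `healUntilCompiles` name, the `maxAttempts` default of 3, and the synchronous callbacks are illustrative assumptions; the real loop calls the model and tsc and is presumably asynchronous.

```typescript
interface VerifyResult {
  ok: boolean;
  errors: string[];
}

type HealOutcome =
  | { status: "compiled"; attempts: number }
  | { status: "escalated"; attempts: number; errors: string[] };

// Re-verify after every heal attempt; stop and escalate once the budget is spent.
function healUntilCompiles(
  verify: () => VerifyResult,
  heal: (errors: string[]) => void,
  maxAttempts = 3, // assumed bound, not the project's actual value
): HealOutcome {
  let result = verify();
  let attempts = 0;
  while (!result.ok && attempts < maxAttempts) {
    heal(result.errors);
    attempts += 1;
    result = verify();
  }
  return result.ok
    ? { status: "compiled", attempts }
    : { status: "escalated", attempts, errors: result.errors };
}

// Usage with stand-ins: two errors, each heal call fixes one of them.
let remaining = 2;
const outcome = healUntilCompiles(
  () => (remaining > 0 ? { ok: false, errors: [`${remaining} error(s)`] } : { ok: true, errors: [] }),
  () => { remaining -= 1; },
);
console.log(outcome); // compiles after 2 heal attempts
```

Returning an explicit "escalated" outcome, rather than throwing, lets the caller hand the remaining errors to a human.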

The pipeline stages

`design` — what to test (no code)

From a user story, produces a test design for a human to review: risk areas, prioritised scenario ideas, open questions (ambiguous requirements), assumptions, and deliberate out-of-scope calls.

npm run ai:design -- stories/getmobil-search.txt

Output: reports/test-design-<slug>.md. Flow: review/edit → approve the scenarios → generate.

`inspect` — real DOM, no guessing

Opens the live page and extracts a stability-ranked selector map from the DOM, so the generator uses real selectors instead of guessing. Strategy priority:

test-id attribute  >  stable id  >  role + accessible name  >  static text  >  structural CSS

Recognised test-id attributes: data-test, data-testid, data-test-id, data-cy, data-qa, data-automation-id, data-e2e — and the selector is built with the actual attribute name found. Each selector is verified unique against the live DOM; ambiguous ones are scoped to a stable ancestor; repeated list rows collapse to one representative to parametrize. Page text is PII-redacted before the map is written.

npm run ai:inspect -- https://getmobil.com               # any public page
npm run ai:inspect -- "https://getmobil.com/ara/?term=iphone"   # a results/listing page

Output: reports/selector-map-<slug>.json. Accepts a full URL or a path resolved against BASE_URL.
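The strategy priority can be sketched as a single ranking function. The attribute list is the one given above; `ElementInfo`, `bestSelector`, and the selector string formats are assumptions for illustration, and the stability check for ids and the live-DOM uniqueness check are left out.

```typescript
const TEST_ID_ATTRIBUTES = [
  "data-test", "data-testid", "data-test-id", "data-cy",
  "data-qa", "data-automation-id", "data-e2e",
] as const;

// Hypothetical snapshot of one DOM element, as an inspector might record it.
interface ElementInfo {
  attributes: Record<string, string>;
  role?: string;
  accessibleName?: string;
  text?: string;
  cssPath: string;
}

interface Candidate {
  strategy: "test-id" | "id" | "role" | "text" | "css";
  selector: string;
}

// Walk the priority list and return the first strategy that applies.
function bestSelector(el: ElementInfo): Candidate {
  const testAttr = TEST_ID_ATTRIBUTES.find((name) => el.attributes[name] !== undefined);
  if (testAttr) {
    // Build the selector with the attribute name actually found on the element.
    return { strategy: "test-id", selector: `[${testAttr}="${el.attributes[testAttr]}"]` };
  }
  if (el.attributes.id) {
    // Simplification: every id is treated as stable here.
    return { strategy: "id", selector: `#${el.attributes.id}` };
  }
  if (el.role && el.accessibleName) {
    return { strategy: "role", selector: `role=${el.role}[name="${el.accessibleName}"]` };
  }
  if (el.text) {
    return { strategy: "text", selector: `text="${el.text}"` };
  }
  return { strategy: "css", selector: el.cssPath };
}

const searchBox = bestSelector({
  attributes: { "data-test-id": "selenium-header-search-input", id: "q" },
  cssPath: "header input",
});
console.log(searchBox.selector); // [data-test-id="selenium-header-search-input"]
```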

`probe` — real API contract, no guessing

The API counterpart of inspect. Instead of a live DOM it reads an OpenAPI spec and turns it into a map of real, declared endpoints (methods, params, response schemas), so API tests target real paths/shapes instead of guessing. With --live it also calls each GET endpoint and records the observed status — the verification half, mirroring how inspect checks selectors against the live DOM. Deterministic (no LLM); JSON spec only (so it parses dependency-free).

npm run ai:probe                                   # spec only (default docs/api/openapi.json)
npm run ai:probe -- docs/api/openapi.json --live   # also verify each GET against the running API
npm run ai:probe -- <spec.json> --base http://localhost:4010 --live

Output: reports/endpoint-map-<slug>.json.
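The spec-only half of probe amounts to walking the `paths` object of an OpenAPI JSON document. The sketch below shows that idea with no dependencies; `buildEndpointMap`, the `Endpoint` shape, and the `/products` sample spec are invented for illustration and do not reflect the real endpoint-map format.

```typescript
interface Endpoint {
  method: string;
  path: string;
  params: string[];
  responses: string[];
}

const HTTP_METHODS = ["get", "post", "put", "patch", "delete"] as const;

// Just the parts of an OpenAPI document this sketch reads.
interface OperationLike {
  parameters?: { name: string }[];
  responses?: Record<string, unknown>;
}
interface OpenApiLike {
  paths?: Record<string, Record<string, OperationLike | undefined>>;
}

// Deterministic: parse the JSON spec and list every declared operation.
function buildEndpointMap(specJson: string): Endpoint[] {
  const spec = JSON.parse(specJson) as OpenApiLike;
  const endpoints: Endpoint[] = [];
  for (const [path, item] of Object.entries(spec.paths ?? {})) {
    for (const method of HTTP_METHODS) {
      const op = item[method];
      if (!op) continue;
      endpoints.push({
        method: method.toUpperCase(),
        path,
        params: (op.parameters ?? []).map((p) => p.name),
        responses: Object.keys(op.responses ?? {}),
      });
    }
  }
  return endpoints;
}

// Made-up sample spec, for illustration only.
const sampleSpec = JSON.stringify({
  paths: {
    "/products": {
      get: { parameters: [{ name: "term" }], responses: { "200": {}, "404": {} } },
      post: { responses: { "201": {} } },
    },
  },
});
console.log(buildEndpointMap(sampleSpec));
```

A --live pass would then call each GET entry and record the observed status next to the declared responses.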

`generate` — feature + steps + page objects

npm run ai:generate -- stories/getmobil-search.txt

Grounded generation (recommended):

# --design: exact scenarios to build; --selectors: verified selectors, verbatim;
# --max 2: quick trial, top N scenarios only
npm run ai:generate -- stories/getmobil-search.txt \
  --design    reports/test-design-product-search-on-getmobil.md \
  --selectors reports/selector-map-getmobil-com.json \
  --max 2

--design makes the curated design the authoritative scope — it builds exactly those scenarios, inventing none and dropping none.
--selectors makes the generator use the inspected selectors verbatim instead of guessing.
--max N caps a fast trial run to the N highest-priority scenarios (works with --design).
--verify type-checks the result; --fix runs the compile self-heal loop; --run executes the scenarios once they compile.

It never silently overwrites an existing file — on a conflict it writes a .generated sibling. New page objects come with a fixture-registration snippet in the output notes.

`run` + `analyze`

run executes the scenarios in a real browser and retries on failure to tell a flaky scenario (passes on re-run) from a consistent one. analyze reads the run's results and classifies each failure as app-bug | test-bug | flaky | environment with a root cause and a concrete fix.

npm test           # all scenarios (parallel)
npm run ai:analyze # → reports/ai-analysis.md
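The retry rule that separates a flaky scenario from a consistent failure can be written down in a few lines. This is a sketch under the assumption that each scenario records one pass/fail flag per attempt, in order; `verdictFromAttempts` and the verdict labels are hypothetical names.

```typescript
type RunVerdict = "passed" | "flaky" | "consistent-failure";

// attempts[i] is true when attempt i passed; attempt 0 is the first run.
function verdictFromAttempts(attempts: boolean[]): RunVerdict {
  if (attempts.length === 0) {
    throw new Error("no attempts recorded");
  }
  if (attempts[0]) return "passed";
  // Failed first, then passed on a re-run: flaky rather than broken.
  return attempts.slice(1).some(Boolean) ? "flaky" : "consistent-failure";
}

console.log(verdictFromAttempts([true]));         // passed
console.log(verdictFromAttempts([false, true]));  // flaky
console.log(verdictFromAttempts([false, false])); // consistent-failure
```

Only the consistent failures then need the finer app-bug, test-bug, or environment classification from analyze.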

Running tests

npm test                  # all scenarios (UI + API, parallel) → one Allure result set
npm run test:api          # only the @api lane (browserless)
npm run test:smoke        # only @smoke tagged
npm run test:ui           # Playwright UI mode
HEADLESS=false npm test   # watch the browser
npm run report            # generate + open the Allure report

Reporting: Allure is the human-facing report for both lanes (UI + API) — run history, per-step detail, and traces/screenshots attached on failure. The test scripts clear reports/allure-results first, so npm test (which runs both projects) gives one combined report; npm run report renders it (needs Java for the Allure CLI). A Cucumber JSON (reports/*-report.json) is still emitted, but only as the machine feed the AI pipeline reads (analyze / heal-selectors / heal-contract) — not a report you open.

CI note: npm test targets the live app (https://getmobil.com). Public sites can sit behind bot challenges that block data-centre IPs (e.g. CI runners), so the browser tests run locally where a normal browser/IP passes. CI gates on the offline checks (type-check, redaction). Reports land under reports/; screenshots and traces are captured for failed scenarios under reports/test-results/.

API testing (APIRequestContext lane)

Alongside the UI lane there's a second, browserless lane for HTTP/API tests — same BDD/Gherkin format, but driven by Playwright's APIRequestContext instead of a page. The two lanes share the features/ tree and are split by tag: @api scenarios run in the api Playwright project (no browser), everything else in chromium.

npm run test:api          # @api scenarios only (browserless) → Allure results + cucumber JSON feed
npm run mock:api          # run the local mock API standalone (otherwise auto-booted)

getmobil.com doesn't publish a JSON API yet, so a dummy contract (docs/api/openapi.json) plus a local Express mock (mock/server.ts) stand in. Playwright's webServer auto-boots the mock for test:api. When the real API ships, point API_BASE_URL at it and update the spec — the clients/steps don't change.

The lane mirrors the UI structure: src/api/BaseApiClient.ts is the page-object analogue over APIRequestContext, src/api/clients/*Api.ts are the resource clients, src/api/contracts/* are dependency-free response validators (the seam heal-contract acts on), and src/api/fixtures.ts wires it all into the api BDD project. The same agent pipeline applies — probe grounds it (the API twin of inspect) and heal-contract self-heals schema drift (the API twin of heal-selectors).
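A dependency-free response validator of the kind described for src/api/contracts/* might look roughly like this. The `Product` shape, `assertProduct`, and the `ContractViolation` class are invented for illustration; the project's real contracts and error type may differ.

```typescript
// Thrown when a response body does not match the expected shape.
class ContractViolation extends Error {
  readonly field: string;

  constructor(field: string, message: string) {
    super(`Contract violation at "${field}": ${message}`);
    this.name = "ContractViolation";
    this.field = field;
  }
}

// Hypothetical resource shape, not taken from the project's spec.
interface Product {
  id: number;
  name: string;
  price: number;
}

// Validate an unknown JSON body field by field, with no schema library.
function assertProduct(body: unknown): Product {
  if (typeof body !== "object" || body === null) {
    throw new ContractViolation("$", "expected an object");
  }
  const { id, name, price } = body as Record<string, unknown>;
  if (typeof id !== "number") throw new ContractViolation("id", "expected a number");
  if (typeof name !== "string") throw new ContractViolation("name", "expected a string");
  if (typeof price !== "number") throw new ContractViolation("price", "expected a number");
  return { id, name, price };
}

console.log(assertProduct({ id: 1, name: "Phone", price: 499 }));
```

Keeping each check on its own line gives a healer one obvious place to patch when a single field drifts.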

Web UI (AI QA Studio)

A small browser front end over the same pipeline: paste a user story → review the AI design and tick the scenarios you want → generate the code, preview it, and save it into the project. The API key stays server-side.

npm run web          # http://localhost:5173

Endpoints (/api/design, /api/generate, /api/save, /api/fix) reuse the CLI functions directly. Saving type-checks the result, and Auto-fix runs the compile self-heal loop. The server binds to 127.0.0.1, rejects non-localhost Host headers, and can require a shared token (AIWRIGHT_TOKEN).
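Rejecting non-localhost Host headers comes down to a small predicate over the header value. The sketch below assumes an allow-list of `localhost`, `127.0.0.1`, and `[::1]`; the exact set the server accepts, the `isLocalHost` name, and how the token check is wired in are not stated in this README.

```typescript
// Assumed allow-list; the real server may accept a different set.
const LOCAL_HOSTNAMES: ReadonlySet<string> = new Set(["localhost", "127.0.0.1", "[::1]"]);

// Returns true only when the Host header names the local machine.
function isLocalHost(hostHeader: string | undefined): boolean {
  if (!hostHeader) return false;
  // Drop an optional :port suffix; bracketed IPv6 literals keep their brackets.
  const hostname = hostHeader.trim().toLowerCase().replace(/:\d+$/, "");
  return LOCAL_HOSTNAMES.has(hostname);
}

console.log(isLocalHost("localhost:5173"));          // true
console.log(isLocalHost("localhost.evil.com:5173")); // false
```

Comparing the whole hostname against an allow-list, instead of using a prefix or substring test, is what blocks look-alike names such as the second example.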

Worked example

1 — Inspect finds getmobil's real selectors (it has data-test-id="selenium-..." hooks that the broader recognition picks up; the old data-test-only inspector missed them):

$ npm run ai:inspect -- https://getmobil.com
Getmobil ile Yenilenmiş Teknoloji Ürünlerini Keşfedin!
  Elements found    : 69
  Unique selectors  : 53
  Repeated (lists)  : 2  (parametrize per item)
  Needs disambig.   : 9
  Unresolved (0 hit): 5
Selector map: reports/selector-map-getmobil-com.json
# search box → [data-test-id="selenium-header-search-input"]

2 — Self-heal recovers from a drifted selector (a test-bug, fixed automatically):

break a selector  →  npm test            →  ✘ Timeout waiting for locator('…WRONG…')
                  →  agent: heal-selectors →  re-inspect getmobil, patch the real selector, tsc ✓
                  →  npm test            →  ✓ 2 passed

The bundled stories/getmobil-search.txt (product search) is live-green end to end.

Project Structure

playwright.config.ts   defineBddProject (chromium UI + browserless api) + reporter + webServer
features/              Gherkin feature files (api/ holds the @api lane)
fixtures/              Test data (users.json, sensitive/ …)
docs/api/              OpenAPI spec(s) — grounding for the API lane (openapi.json)
mock/                  Local Express mock standing in for the real API (server.ts)
src/
  ai/                  Claude pipeline: testDesigner · pageInspector · specProbe · testGenerator ·
                       failureAnalyzer · selectorHealer · contractHealer · prompts · redact · client
  agent/               Autonomous orchestrator: orchestrator (tool-use loop) · tools ·
                       state · policy (guardrails) · prompts · io
  cli/                 ai:design / ai:inspect / ai:probe / ai:generate / ai:analyze / ai:agent
  api/                 API lane: BaseApiClient · clients/*Api · contracts/* (validators) · fixtures
  pages/               Page Object Model (extends BasePage)
    selectors/         Centralised selector modules (one per site, *.selectors.ts)
  steps/               Step definitions (fixture-based, via createBdd); steps/api/ for the @api lane
  fixtures/            Playwright fixtures (page objects) + data helpers
  web/                 Express server exposing the pipeline (npm run web)
public/                AI QA Studio single-page UI
.features-gen/         specs generated by bddgen (not committed)
reports/               designs, selector/endpoint maps, run state, analysis (not committed)

Sensitive Data Protection (PII)

Sensitive data (national IDs, cards, IBANs, …) never reaches the LLM. Three layers:

Isolation — real PII lives under fixtures/sensitive/, git-ignored (only *.example.json templates are committed).
Read-deny — permissions.deny in .claude/settings.json stops the coding agent from reading fixtures/sensitive/** and .env.
Redaction — before any Claude API call (src/ai/redact.ts):
- Pattern-based: national ID (11 digits), card, IBAN, email, phone.
- Value-based denylist: every real value read via loadSensitive() is masked verbatim even when it matches no format (names, secret codes, …).

Regression check: npm run verify:redaction. Policy: fixtures/sensitive/README.md.
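The redaction layer's two passes can be sketched as follows. The regular expressions, the mask labels, and the `redact` signature are assumptions made for illustration (card and phone patterns are omitted for brevity); src/ai/redact.ts may implement this differently.

```typescript
// Simplified patterns, assumed for this sketch.
const PATTERNS: { label: string; regex: RegExp }[] = [
  { label: "EMAIL", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { label: "IBAN", regex: /\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b/g },
  { label: "NATIONAL_ID", regex: /\b\d{11}\b/g },
];

// Mask known sensitive values verbatim, then anything matching a known format.
function redact(text: string, denylist: string[] = []): string {
  let out = text;
  // Value-based pass: longest values first so overlapping entries are fully masked.
  const values = [...denylist].filter((v) => v.length > 0).sort((a, b) => b.length - a.length);
  for (const value of values) {
    out = out.split(value).join("[REDACTED]");
  }
  // Pattern-based pass.
  for (const { label, regex } of PATTERNS) {
    out = out.replace(regex, `[${label}]`);
  }
  return out;
}

const masked = redact("ID 12345678901, mail a@b.co, code ZETA-7", ["ZETA-7"]);
console.log(masked); // ID [NATIONAL_ID], mail [EMAIL], code [REDACTED]
```

The denylist pass is what catches values such as names or secret codes that match no recognisable format.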

Quality scorecard

npm run eval scores the pipeline (redaction, project-surface discovery, and the inspector against the live home page); npm run eval -- --full also checks that design produces structured output and that generation compiles. The command exits non-zero on failure, so it can gate CI and puts a number on "how well does it work" instead of a vibe.

Conventions

Steps use fixtures: async ({ searchPage }, param) => … — never new SearchPage(page).
Selectors are centralised: one src/pages/selectors/<site>.selectors.ts per app; page objects read from it — no raw selector strings scattered inline.
Selector priority: a data-test* attribute > stable id > role/accessible name. No brittle structural CSS chains.
Test data lives in fixtures/*.json: no hardcoded credentials in steps (getUser(...)).
Scenarios are independent: shared setup goes in Background; no state shared between scenarios. Generated scenarios target 6–10 declarative steps each.
