Open
43 commits
e84dd9e
docs: add candidate interview grader plan
asabaylus May 10, 2026
de1eab4
feat(interview): add rubric module with 9 observation dimensions
asabaylus May 10, 2026
dcc31c8
feat(interview): register `teamhero interview` CLI subcommand
asabaylus May 10, 2026
4003048
feat(tui): wire `teamhero interview` subcommand with verb stubs
asabaylus May 10, 2026
017b18d
feat(interview): add JSON-lines protocol stub for bidirectional events
asabaylus May 10, 2026
138862a
docs(interview): add rubric and classification-rationale docs
asabaylus May 10, 2026
e05e173
feat(interview): slice 2 — bootstrap, project validator/generator, ro…
asabaylus May 11, 2026
c1321b6
feat(interview): slice 3 — recording kit (privacy gate + start/end sc…
asabaylus May 11, 2026
6e928d7
feat(interview): slice 4 — assessment (collectors, extractors, observ…
asabaylus May 11, 2026
b87cbeb
feat(interview): slice 5 — cohort summary + Claude skill
asabaylus May 11, 2026
ae0ad2b
fix(ci): bash arithmetic robustness + cover more interview functions
asabaylus May 11, 2026
c7ff78f
feat(interview): v1.5 — wizard, TUI polish, manual test script
asabaylus May 11, 2026
106d3ac
docs(interview): mark MVP + v1.5 as shipped in the planning doc
asabaylus May 11, 2026
d295fdc
chore: add CodeRabbit auto-review configuration
asabaylus May 11, 2026
87d20dc
refactor(interview): rename 'grade' to 'review' across CLI and codebase
asabaylus May 11, 2026
ffe507a
test(discrepancy): pin getEnv mock to neutralize cross-file leakage
asabaylus May 11, 2026
790b9e8
fix(interview): address CodeRabbit critical/major findings on PR #10
asabaylus May 11, 2026
b80560d
fix(cli): allow positional args on top-level commands
asabaylus May 11, 2026
cf8229b
📝 CodeRabbit Chat: Add generated unit tests
coderabbitai[bot] May 11, 2026
fddbb66
fix(interview): YAML control-char escapes + report subcommand guard
asabaylus May 11, 2026
be38c7f
fix(interview): reject null bytes in generated file paths
asabaylus May 11, 2026
a993968
feat(tui): refactor interview bootstrap wizard onto shared layout
asabaylus May 11, 2026
a72a594
fix: address adversarial review of PR #10
asabaylus May 11, 2026
31817f9
fix: address remaining CodeRabbit threads + confirm-default UX bug
asabaylus May 11, 2026
6eb3ae6
Merge remote-tracking branch 'origin/main' into claude/slice-1-founda…
asabaylus May 13, 2026
32cb14e
fix(interview): timeout + model config so wizard finishes idea-fetch …
asabaylus May 15, 2026
8ca39cc
fix(interview): make project validator polyglot (closes teamhero-scri…
asabaylus May 15, 2026
3ec0d72
feat(interview): headless bootstrap offers GitHub publish + clickable…
asabaylus May 16, 2026
14164d9
fix(interview): picker no longer hides the first option on initial paint
asabaylus May 16, 2026
310fb76
feat(interview): sensible defaults — output to ./interviews/<role>, 6…
asabaylus May 16, 2026
e872a3b
fix(interview): declutter the Ready-to-bootstrap confirm screen
asabaylus May 16, 2026
f09205a
fix(interview): bootstrap reliably hits the 400-700 LOC budget on fir…
asabaylus May 16, 2026
1402b9f
fix(interview): shorten static wizard descriptions to dodge huh bar-b…
asabaylus May 16, 2026
390468c
feat(interview): kit scaffolding ships with every bootstrap by default
asabaylus May 16, 2026
f09741c
refactor(interview): collapse redundant prompt-source steps + add deb…
asabaylus May 16, 2026
368e62f
feat(interview): greenfield/brownfield project type + question descri…
asabaylus May 16, 2026
03ada8a
refactor(interview): drop the project-size validator + prompt enforce…
asabaylus May 16, 2026
3754852
fix(interview): strip GLOSSARY, sample tests, and kit CLAUDE.md from …
asabaylus May 16, 2026
73840ef
feat(interview): JD becomes a standalone input with project-influence…
asabaylus May 16, 2026
b7197c1
fix(interview): --debug actually prints the per-field body logs
asabaylus May 16, 2026
a6ba2d5
fix(interview): JD steps move before Domain; rubric label names the f…
asabaylus May 16, 2026
07cfe6e
fix(interview): manual-test feedback pass — kit ships from wizard, co…
asabaylus May 16, 2026
f49a65b
feat(interview): --json agent payload + --publish flag (orthogonal)
asabaylus May 16, 2026
19 changes: 19 additions & 0 deletions .coderabbit.yaml
@@ -0,0 +1,19 @@
# yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json
# CodeRabbit configuration — https://docs.coderabbit.ai/configuration/auto-review
#
# Goal: auto-review PRs opened by Claude (branches named claude/*).
# CodeRabbit auto-reviews every PR targeting the default branch (main) by default,
# so PRs from claude/* branches are covered automatically.
#
# Note: the CodeRabbit schema has no source/head-branch filter. `base_branches`
# only matches the TARGET branch, not the source. The configuration below simply
# ensures auto-review stays on for the default branch.

language: en-US

reviews:
  auto_review:
    enabled: true
    drafts: false
    auto_incremental_review: true
    auto_pause_after_reviewed_commits: 5
7 changes: 7 additions & 0 deletions .gitignore
@@ -66,6 +66,13 @@ dist/
teamhero-report-*
docs/maintenance_results.md

# Interview kit scaffolds — `teamhero interview bootstrap` defaults to
# ./interviews/<slug>. Both entries are listed: `interviews/` is the current
# default; `roles/` remains so users with content from the prior default
# don't accidentally commit it on upgrade.
interviews/
roles/

# Cache
out
claude-plugin/bin/
12 changes: 0 additions & 12 deletions .mcp.json

This file was deleted.

29 changes: 29 additions & 0 deletions README.md
@@ -91,6 +91,35 @@ teamhero report --headless --since 2026-03-01 --until 2026-03-14 --sections loc,

Run `teamhero report --help` for all flags.

### 4. Review candidate interviews

TeamHero also includes a candidate AI-collaboration interview reviewer. Run
the interactive wizard to configure a role:

```bash
teamhero interview bootstrap
```

The wizard walks you through role slug, tech stack, business domain, feature
description, time-box, project mode, analysis mode, and rubric mode (with
conditional follow-ups for a custom prompt or a job-description file).

Once a candidate has submitted their repository, review it:

```bash
teamhero interview review --candidate "Jane Doe" --repo https://github.com/jane/submission
```

The review run prints a phased progress display (clone → collect-evidence →
extract-measurements → observe → audit-write) and finishes with a
glamour-rendered preview of the audit. **Every audit ships with a mandatory
ADVISORY banner** — the audit is advisory; hiring decisions are made by
humans. See `docs/interview-classification-rationale.md` for the full
ethical framing.

For scripting or agents, the headless flags listed by
`teamhero interview bootstrap --help` are fully equivalent to the wizard.

---

## Use with Claude Code
809 changes: 809 additions & 0 deletions docs/2026-05-09-candidate-interview-reviewer-plan.md

Large diffs are not rendered by default.

263 changes: 263 additions & 0 deletions docs/interview-classification-rationale.md
@@ -0,0 +1,263 @@
# Interview Classification — Methodology and Ethics Rationale

This document accompanies the interview rubric. It exists for two reasons:

1. **Defensibility.** If a hiring decision informed by this tool is ever
legally challenged, this document is the methodology of record. It states
what the AI does, what it deliberately does *not* do, and why.
2. **Internal honesty.** Anyone running an interview through this tool should
read the preamble before they trust any output.

The preamble is binding. The per-dimension methodology details that follow are
descriptions of the implementation; they do not soften, qualify, or contradict
the preamble.

---

## Preamble — Ethical Commitments

These four commitments shape every part of the system. They are non-negotiable
and any change to them requires explicit org sign-off.

### 1. Observations, not scores

The AI does not produce a score for a candidate. It does not produce a
per-dimension score, a weighted total, a band ("Strong Hire / Mixed / No
Hire"), or any other reductive label. The output is:

- **Narrative observations** for dimensions where the LLM is the judge —
1–3 sentences, paired with the reasoning chain that produced them and the
cited evidence excerpts that ground them.
- **Raw measurements** for dimensions that are deterministically observable —
e.g. "ran tests 5 times, interleaved with 12 prompts" presented as a fact.

The categorical decision is the hiring manager's, captured in the sign-off
section of each candidate's `summary.md` (Hire / Hire with notes / No hire).
That decision is *theirs*. The AI's output is one input among many.

**Why this matters:** numerical scoring of humans creates harms that
safeguards (opt-in, banners, sign-off) do not fully address — cognitive
anchoring on a number, false precision without true calibration, comparative
drift across candidates, bias amplification when scores are averaged or
thresholded, and increased legal exposure. Observations + evidence +
measurements provide the structure of a rubric without the harms of a score.

### 2. Bias diversification, NOT bias elimination

This tool **never** claims the AI is "non-biased," "objective," "neutral," or
"bias-free." Those claims are factually wrong and indefensible when
challenged.

LLMs trained via RLHF carry well-documented systematic biases:

- **Training-data bias** — overrepresented demographics, languages, cultural
contexts in the corpus.
- **Preference-tuning bias** — RLHF raters' demographic and aesthetic
preferences encoded in the "preferred response" signal.
- **Sycophancy** — LLMs tend to agree with their user, including subtly
approving of what the framing implies they should approve of.
- **Familiarity bias** — model is more familiar with mainstream tools and
patterns; less-mainstream alternatives are systematically disadvantaged.
- **Verbosity preference** — verbose output rated more favorably than
concise output, even when concise is better.
- **Name- and demographic-cue bias** — empirically documented disparate
treatment based on names alone replicates in LLM evaluators.

The defensible claim about the AI is this: **the AI offers a structurally
different perspective with different biases than the human reviewer.** Two
imperfect perspectives covering different blind spots is genuinely better
than one — but only because the *overlap set* of biases is smaller, not
because either perspective is unbiased.

Critically, AI bias is *systematic across all evaluations* — every candidate
faces the same biased model — while individual reviewer biases are local.
This means AI bias can scale harm more efficiently than individual bias if
deployed without the safeguards in commitment #3.

### 3. Human-in-the-loop is mandatory

Every interview run **requires** a human hiring manager to read the AI's
observations and write a sign-off. The sign-off has three categorical
outcomes (Hire / Hire with notes / No hire) plus a free-form reasoning field
where the manager explains their decision in their own words.

The tool refuses to consider an interview "complete" without this sign-off.
The cohort report displays sign-off status and the manager's recommendation
only — it does not display anything the AI produced as a verdict.
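
The completeness rule can be sketched as a small check. This is a minimal
illustration under stated assumptions, not the tool's implementation: the
`SignOff` class, its field names, and `is_complete` are hypothetical; only
the three outcome labels come from this document.

```python
from dataclasses import dataclass
from typing import Optional

# The three categorical outcomes named in this document.
ALLOWED_OUTCOMES = {"Hire", "Hire with notes", "No hire"}


@dataclass
class SignOff:
    """Hypothetical shape of a hiring manager's sign-off."""
    outcome: str
    reasoning: str  # free-form, in the manager's own words


def is_complete(sign_off: Optional[SignOff]) -> bool:
    """An interview counts as complete only with a valid, reasoned sign-off."""
    if sign_off is None:
        return False
    if sign_off.outcome not in ALLOWED_OUTCOMES:
        return False
    # An empty or whitespace-only reasoning field does not count.
    return bool(sign_off.reasoning.strip())
```

Note that the check takes no AI output as input: completeness depends only
on what the human wrote.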

The standing copy at the top of every per-candidate audit and the cohort
report reads:

> ⚠ THIS AUDIT IS ADVISORY. Hiring decisions are made by humans using
> professional judgment. The candidate is a person, not a score. This rubric
> is one factor among many; your evaluation is the primary, first, and most
> important basis for your decision.

This is not boilerplate; it is the load-bearing framing of the tool.

### 4. GDPR Article 15 caveat — candidate audit access (MVP)

GDPR Article 15 ("right of access by the data subject") grants candidates in
the EU/EEA the right to obtain confirmation of, and access to, personal data
processed about them. The observations and measurements this tool produces
about a candidate fall within scope.

**MVP behavior:** candidate-facing audit access is **not** included. The
audit artifacts are stored locally on the hiring manager's disk and shared
only within the company. This is a *deliberate constraint*, not an oversight:
exposing the audit externally introduces legal review burden the MVP cannot
absorb.

**Implications the company must accept when running the tool in EU/EEA
contexts:**

- A candidate filing an Article 15 request must be served via the company's
existing data-subject-request process. The company is responsible for
producing the audit artifacts on request, not the tool.
- The candidate must be informed at the start of the interview that AI
observation is occurring (consent / transparency obligation under Article
13). This is implemented as the opt-in privacy gate in `bootstrap` and is
reproduced in the per-candidate `PRIVACY_RELEASE.md`.
- Candidates do not see the AI's narrative observation about them as part of
the standard hiring process. If a request is made, the audit is shared in
full — the reasoning chain is preserved precisely so this is possible
without redaction surprises.

A future enhancement may add a candidate-facing audit-access flow. Until that
is built and legally reviewed, the MVP default stands: company-only access,
candidate-served-on-request via existing processes.

---

## Per-dimension methodology

The implementation details below describe *how* observations and measurements
are produced for each dimension. They do not change anything in the preamble.

### 1. Upfront design & decomposition (`upfront-design`)

**Evidence mode:** llm-judge.

The LLM observer reads the interview log and terminal recording, looking for
evidence of decomposition behavior before the candidate began prompting:
explicit problem framing, identification of constraints, sketching of
interfaces or data flow, alignment on approach.

Output: narrative observation (1–3 sentences), reasoning chain, and 1–3
evidence excerpts cited from the interview log or transcript.

### 2. Context engineering (`context-engineering`)

**Evidence mode:** hybrid.

Deterministic extractor counts: CLAUDE.md references in prompts, glossary
terms used, file paths cited verbatim, examples provided as context.

LLM observer interprets the counts in context: high counts with poor
relevance are different from low counts with high relevance. Narrative
observation pairs with the raw counts.

### 3. Critical evaluation / "tasting" (`critical-evaluation`)

**Evidence mode:** llm-judge.

LLM observer scans the diff stream and prompt log for evidence of the
candidate rejecting, modifying, or pushing back on AI suggestions versus
accepting them verbatim. Reasoning chain preserved alongside the
observation.

### 4. Verification discipline (`verification`)

**Evidence mode:** deterministic.

Deterministic extractor counts: test invocations, type-check invocations,
diff/grep commands, manual verification commands. Reports the count and
interleaving rhythm (e.g. "8 test runs, alternating roughly every other
prompt").

No LLM observation is generated for this dimension. The facts speak for
themselves.
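
A deterministic count of this kind might look like the sketch below. The
event format (`(kind, text)` tuples), the command patterns, and both function
names are assumptions made for illustration; the real extractor may read
different inputs and recognize different commands.

```python
import re
from typing import Dict, List, Tuple

# Assumed command patterns; a real extractor would be configured per stack.
VERIFY_PATTERNS = {
    "test": re.compile(r"\b(npm test|pytest|go test|cargo test)\b"),
    "type_check": re.compile(r"\b(tsc|mypy)\b"),
    "inspect": re.compile(r"^\s*(git diff|grep)\b"),
}

# (kind, text), where kind is "prompt" or "command".
Event = Tuple[str, str]


def count_verification(events: List[Event]) -> Dict[str, int]:
    """Count commands matching each verification category."""
    counts = {name: 0 for name in VERIFY_PATTERNS}
    for kind, text in events:
        if kind != "command":
            continue
        for name, pattern in VERIFY_PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


def prompts_between_tests(events: List[Event]) -> List[int]:
    """For each test run, the number of prompts since the previous test run."""
    gaps: List[int] = []
    pending = 0
    for kind, text in events:
        if kind == "prompt":
            pending += 1
        elif VERIFY_PATTERNS["test"].search(text):
            gaps.append(pending)
            pending = 0
    return gaps
```

The gap list is one simple way to express the interleaving rhythm: small,
steady gaps mean tests were run close to every prompt.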

### 5. Course-correction (`course-correction`)

**Evidence mode:** hybrid.

Deterministic extractor detects course-correction signals: `git reset`,
`git checkout --`, file rollbacks, prompt re-asks, abandoned branches.

LLM observer pairs the detected signals with a narrative observation about
whether they reflect productive correction or thrashing.
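
As a rough sketch of the deterministic half, the two git signals can be
matched with regular expressions over the command history. The labels,
patterns, and function name here are hypothetical; file rollbacks, prompt
re-asks, and abandoned branches need other inputs and are not covered.

```python
import re
from typing import List, Tuple

# Assumed patterns for the two git-based signals named above.
SIGNALS = [
    ("git-reset", re.compile(r"^\s*git\s+reset\b")),
    # Matches `git checkout -- <path>` but not `git checkout <branch>`.
    ("git-checkout-file", re.compile(r"^\s*git\s+checkout\s+--(\s|$)")),
]


def detect_course_corrections(commands: List[str]) -> List[Tuple[int, str]]:
    """Return (command index, signal label) for each detected signal."""
    hits: List[Tuple[int, str]] = []
    for index, command in enumerate(commands):
        for label, pattern in SIGNALS:
            if pattern.search(command):
                hits.append((index, label))
    return hits
```

The output is only a list of events; judging whether they show productive
correction or thrashing is left to the narrative observation.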

### 6. Risk awareness (`risk-awareness`)

**Evidence mode:** deterministic.

Deterministic extractor detects destructive operations (`rm -rf`, `git push
--force`, schema-altering migrations, prod-affecting commands) and reports
them with timestamps and the pause-before-Enter duration if available.

No LLM observation is generated. The detected events and timings are the
output.
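
A minimal sketch of such a detector follows. The `CommandEvent` and
`RiskEvent` shapes, the two patterns, and the idea of deriving the pause from
a typed-at and entered-at timestamp pair are all assumptions for
illustration; migrations and prod-affecting commands are omitted.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed patterns; note `--force-with-lease` also matches "force-push" here.
DESTRUCTIVE = [
    ("rm-rf", re.compile(r"\brm\s+-(rf|fr)\b")),
    ("force-push", re.compile(r"\bgit\s+push\b.*--force\b")),
]


@dataclass
class CommandEvent:
    command: str
    typed_at: float               # seconds since session start
    entered_at: Optional[float]   # when Enter was pressed, if recorded


@dataclass
class RiskEvent:
    label: str
    command: str
    timestamp: float
    pause_seconds: Optional[float]


def detect_risk_events(events: List[CommandEvent]) -> List[RiskEvent]:
    """Report destructive commands with their timestamp and pause, if known."""
    found: List[RiskEvent] = []
    for event in events:
        for label, pattern in DESTRUCTIVE:
            if pattern.search(event.command):
                pause = None
                if event.entered_at is not None:
                    pause = round(event.entered_at - event.typed_at, 3)
                found.append(RiskEvent(label, event.command, event.typed_at, pause))
                break  # one report per command
    return found
```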

### 7. Architectural quality (`architectural-quality`)

**Evidence mode:** llm-judge.

LLM observer reads the final repo state and produces a narrative observation
on modularity, naming, separation of concerns, and depth of abstraction.
Reasoning chain preserved. Cited evidence excerpts from the produced code.

### 8. Test pass / spec satisfaction (`test-pass`)

**Evidence mode:** deterministic.

Deterministic extractor runs the role-specific acceptance tests against the
candidate's final repo state and reports pass/fail per acceptance criterion.

No LLM observation is generated. Pass/fail is a fact.

### 9. Throughput (`throughput`)

**Evidence mode:** deterministic.

Deterministic extractor reports timestamps from the terminal recording, git
log, and agent log. Reports time-to-first-passing-test, commits within the
time-box, and total elapsed time. No LLM interpretation.
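
The three reported figures reduce to timestamp arithmetic. The sketch below
assumes the timestamps have already been parsed from the recording and logs;
the function name, argument shapes, and dictionary keys are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple


def throughput(
    start: datetime,
    end: datetime,
    time_box: timedelta,
    test_runs: List[Tuple[datetime, bool]],  # (when, passed)
    commit_times: List[datetime],
) -> Dict[str, object]:
    """Compute raw throughput measurements; no interpretation is applied."""
    # Earliest passing test run, if any.
    first_pass: Optional[datetime] = next(
        (at for at, passed in sorted(test_runs) if passed), None
    )
    deadline = start + time_box
    return {
        "time_to_first_passing_test": (
            first_pass - start if first_pass is not None else None
        ),
        "commits_within_time_box": sum(
            1 for at in commit_times if start <= at <= deadline
        ),
        "total_elapsed": end - start,
    }
```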

---

## Interviewer-bias guard (binding)

Audio transcripts and interviewer notes are fed to the LLM observer as
context. A biased interviewer remark ("she seemed nervous", "he was
hesitant") can propagate into the AI's narrative observation if not guarded
against.

The observation prompt MUST include this instruction verbatim:

> The audio transcript and interviewer notes are provided as context about
> what was happening during the session. Treat the interviewer's verbal
> commentary as situational context only — do NOT weight it as evidence of
> the candidate's skill, competence, or character. Your observations must be
> grounded in the candidate's *actions* (prompts they wrote, tools they used,
> code they produced, tests they ran, decisions they made) — not in the
> interviewer's framing of those actions. If an interviewer remark could be
> interpreted multiple ways, do not let it bias your observation; rely on the
> directly observable artifacts (interview.log, terminal.cast, repo state).

Validation: the first 10 candidates run through the tool will have their
observations inspected for phrasing that echoes interviewer commentary
verbatim. If found, the instruction is tightened further before broader use.
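
One way to support that inspection is a simple word n-gram overlap check
between an observation and the interviewer's remarks. This is an assumed
aid for the human reviewer, not part of the described tool; the function
names and the default window of three words are illustrative choices.

```python
import re
from typing import List, Set


def _ngrams(text: str, n: int) -> Set[str]:
    """Lowercased word n-grams of a text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def echoed_phrases(
    observation: str, interviewer_remarks: List[str], n: int = 3
) -> List[str]:
    """Phrases of n words that appear in both the observation and a remark."""
    remark_grams: Set[str] = set()
    for remark in interviewer_remarks:
        remark_grams |= _ngrams(remark, n)
    return sorted(_ngrams(observation, n) & remark_grams)
```

A non-empty result does not prove bias propagated; it only flags an
observation for a closer read.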

---

## Schema-level guard against scoring drift

The LLM is called via the OpenAI Responses API with a strict `json_schema`
that explicitly omits `score`, `weighted_total`, `raw_total`, `band`,
`signal_count`, and similar reductive fields. Because the schema is
`strict: true`, the provider constrains output to the declared fields, so a
response cannot carry an unlisted field and the LLM cannot drift into
scoring even if prompted to.

If the schema is ever relaxed, this document and the rubric must be
re-reviewed in the same change. This guard is load-bearing.
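
A regression check for this guard could look like the sketch below. The
schema shown is a hypothetical stand-in (its property names are assumed, not
taken from the codebase); only the banned field names come from this
document. The check walks the schema and reports any banned property.

```python
from typing import Any, Set

# Reductive fields this document says the schema must omit.
BANNED_FIELDS = {"score", "weighted_total", "raw_total", "band", "signal_count"}

# Hypothetical observation schema; property names are illustrative only.
OBSERVATION_SCHEMA = {
    "type": "object",
    "properties": {
        "dimension": {"type": "string"},
        "observation": {"type": "string"},
        "reasoning": {"type": "string"},
        "evidence": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["dimension", "observation", "reasoning", "evidence"],
    "additionalProperties": False,  # closed object: no unlisted fields
}


def scoring_fields_in(schema: Any) -> Set[str]:
    """Recursively collect banned property names declared in a schema."""
    found: Set[str] = set()
    if isinstance(schema, dict):
        for name, sub in schema.get("properties", {}).items():
            if name in BANNED_FIELDS:
                found.add(name)
            found |= scoring_fields_in(sub)
        found |= scoring_fields_in(schema.get("items"))
    return found
```

Running a check like this in CI would make a relaxed schema fail loudly in
the same change that introduces it.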