Status: Draft v0.5
Date: 30 April 2026
Author: Spec Architect (interviewing user)
Classification: Personal project, single user
Changes from v0.4 (alignment-resolution pass before M0):
- FR-007 reconciled with the full ExerciseFormat enum, organized by mode affinity, and LISTENING restored.
- FR-017 (manual approval buffer for tutor-surfaced words) deleted; superseded by the §4.6 conversational on-ramp.
- §4.7 Pronunciation Engine: documented NOT_DETECTED as a third operational outcome that does not affect coverage; updated the quoted PronunciationScorer interface to match the contract signature; stated the word-vs-sense denominator rule for word-pronunciation coverage.
- §4.5 cooldown wording: "5 days of active use" → "5 calendar days."
- §6.2 FsrsState: documented lazy creation on first exposure (UNTESTED → LEARNING).
- §6.5: pruned the audio-retention open question; retention is indefinite per A-014.
- New FR-046: FSRS rating mapping from ExerciseOutcome → Again/Hard/Good. Easy reserved for V2.
- New FR-047: Tutor Engine composes weekly-review sessions (≥2 fresh contexts per CANDIDATE, 1 per maintenance, plus pronunciation sample).
- §16.3: assumption added recording the FR-017 deletion.
Changes from v0.3: Restructured the user-facing experience into four explicit modes (Vocabulary, Sentences, Speaking, Conversation) — each with its own success criteria and exercise mix. Promoted Pronunciation Engine to a fully independent module with pluggable scorer interface; V1 ships with a Whisper-based heuristic scorer ("did transcript match?"), Azure Speech Pronunciation Assessment slots in as a configuration change later. Pronunciation tracking is now coverage-based (numerator/denominator) and decoupled from semantic vocabulary count. Added the "Explain in English" touchpoint as a first-class interaction primitive. Cost cap raised to $60/month.
Changes from v0.2: Closed eight of ten open questions per user direction. Added pronunciation grading as a V1 feature (now via pluggable interface — see v0.4 changes). Strengthened the conversational stretch-word mechanic into an automatic vocabulary on-ramp.
Changes from v0.1: Added the Pacer module — a macro-progression layer with four independent levers, advance-only.
A browser-based, voice-and-text Spanish tutor for a single learner (the project owner), built around the CEFR proficiency scale. Initial build covers levels A0 → A1, with the architecture extending cleanly to A2 and B1. The system uses Claude (via the Anthropic API) for all language-understanding work — exercise generation, response evaluation, conversational tutoring, and CEFR-level assessment — paired with third-party speech services for STT and TTS.
The system is structured around three layers of progression. The Vocabulary Engine uses Anki-style spaced repetition (FSRS) extended to track word senses rather than words, with a formal "weekly review" gate before a sense is admitted to confirmed vocabulary. The Pacer sits above this and decides when the user is fluent enough at the current level to be pushed harder, advancing along four independent levers: vocabulary breadth, sense depth, grammatical complexity, and production demand. The Tutor Engine uses Claude to generate exercises and conduct dialogue, scoped to whatever the Vocabulary Engine and Pacer say is appropriate for the current moment.
The target variety is Castilian Spanish (Spain): the system follows its pronunciation, vocabulary choices, and grammatical conventions (e.g. vosotros, the /θ/ vs /s/ distinction, coger used freely).
The owner is starting Spanish at A0 and wants a learning tool that (a) genuinely measures progress against the CEFR scale instead of inventing its own metrics, (b) doesn't pretend a word is "known" after one correct use, (c) supports speaking practice and not just reading/writing, and (d) is a single coherent system rather than a stack of disconnected apps (Duolingo + Anki + iTalki + ChatGPT).
- G1. Reach verified A1 within 6 months of consistent daily use. (Verified = passing an internal CEFR-A1 assessment that mirrors the official descriptors.)
- G2. Maintain a confirmed vocabulary list where every entry has been validated across all of its senses relevant at the current CEFR level, not just used once.
- G3. Support voice-in / voice-out as a first-class interaction mode, not a bolt-on.
- G4. Use Claude as the language-evaluation backbone for everything except STT/TTS.
- Multi-user support, accounts, sharing, or sync across devices. Single user, single device profile.
- Spanish variants other than Castilian.
- Levels above A1 in the initial build (architecture supports them; content does not).
- Mobile-native apps. Browser only; mobile-browser usage is fine but not optimised in V1.
- Offline mode. Always-online assumed.
- Single developer, personal time budget. Build incrementally; an A0-only working slice should be usable within ~4–6 weekends. [ASSUMPTION — confirm your time budget.]
- Anthropic API for all AI work. Non-Anthropic services permitted only where Anthropic has no offering — specifically STT and TTS.
- Cost. Personal budget, not enterprise. Target steady-state cost under $30/month at daily use. [ASSUMPTION — confirm.]
- Browser-based. No native desktop or mobile binaries.
- Short term (3 months): Confirmed vocabulary ≥ 300 senses (A1 baseline is ~500), daily active use ≥ 5 days/week, weekly review completed each week.
- Long term (6 months): Pass internal A1 assessment. Confirmed vocabulary ≥ 500 senses. Hold a 5-minute Castilian conversation with the tutor on a familiar topic without falling back to English.
A single user: the project owner. No other personas. No admin role; you are the admin.
This is worth stating explicitly because it removes a huge amount of complexity from the spec — no auth flows, no RBAC, no multi-tenancy, no user-content moderation, no abuse prevention beyond what protects the API keys.
- Vocabulary engine: word and word-sense tracking, FSRS-based scheduling, validation state machine, confirmed-vocabulary list. Drives the Vocabulary mode.
- Pacer: macro-progression engine that monitors performance signals and advances the four difficulty levers (breadth, depth, grammar, production) when the user is demonstrably ready. Advance-only; regression is manual.
- Four user-facing modes, from which the user explicitly chooses what kind of session to start:
  - Vocabulary mode: single-word drill (translation, recognition, definition matching). Fast-paced, ~15 items per session.
  - Sentences mode: sentence-level work (cloze, translation, error correction, production prompts). Slower, ~10 items per session.
  - Speaking mode: pronunciation-focused; user reads or repeats prompts and gets pronunciation feedback. Independent of semantic vocabulary tracking.
  - Conversation mode: open-ended dialogue with the tutor, voice or text, with the on-ramp mechanic active.
- Tutor engine (Claude): generates exercises across modes, conducts conversation, evaluates responses. Maintains "tutor feel" via on-demand English scaffolding (see the "Explain in English" touchpoint in this list).
- Voice I/O: STT for user speech, TTS for tutor speech, push-to-talk in the browser. Available across all modes that need it.
- "Explain in English" touchpoint: a first-class interaction primitive available in any mode. A button (and voice command, where appropriate) asks the tutor to explain the current item, last response, or current concept in English without losing session context. Makes the system feel like a tutor, not a quizmaster.
- Pronunciation engine (independent module): scores user speech against expected pronunciation. V1 ships with a Whisper-based heuristic scorer (boolean: did transcript match?). Architecture supports plugging in Azure Speech Pronunciation Assessment or other phoneme-level scorers later via a PronunciationScorer interface. Pronunciation results are stored on a separate track from semantic vocabulary mastery; the dashboards show them as coverage percentages over a defined denominator (set of phonemes + set of in-pool words to be pronounced).
- Weekly review: scheduled assessment session that promotes/demotes word-senses in/out of confirmed vocabulary.
- CEFR progress dashboard: current level estimate, vocabulary coverage against A1 thematic domains, grammar competency checklist, per-lever Pacer panel, and separate pronunciation coverage bars (% phonemes covered, % in-pool words pronounceable).
- Pacer decision log: every advancement decision recorded with the signals that drove it.
- Session history: every exchange logged, replayable, searchable.
- Reading practice from external texts (news, books). Possible for V2.
- Writing practice with longer compositions. Sentence-level only in V1.
- Listening practice from external audio (podcasts, films).
- Grammar drills as a dedicated mode (grammar is evaluated as it surfaces in exercises and dialogue, not drilled).
- Cultural / regional content beyond what arises naturally.
User journeys are organised around the four modes. The user picks a mode from the dashboard; each mode has its own session shape, exercise mix, and success criteria. The Pacer operates across all of them.
Journey 1 — Vocabulary mode session (10–15 min). User picks "Vocabulary" → system pulls due word-senses (FSRS-scheduled) plus any priority-drill flags from prior conversations → runs ~15 single-word exercises across translation/recognition/definition formats → user gets per-item feedback with the "Explain in English" touchpoint always available → session ends with summary of state changes.
Journey 2 — Sentences mode session (15–20 min). User picks "Sentences" → system generates ~10 sentence-level exercises (cloze, full-sentence translation, error correction, sentence production from prompts) drawing from the active pool → grammar lever's state determines which constructions appear → "Explain in English" available throughout → session ends.
Journey 3 — Speaking mode session (10–15 min). User picks "Speaking" → system presents prompts (single words, then phrases, then sentences) for the user to read aloud or repeat after audio → Pronunciation Engine scores each utterance via the configured scorer (V1: Whisper heuristic; later: Azure phoneme-level) → coverage metrics update → no semantic mastery is affected. This mode is purely about whether the user can produce the sounds correctly.
Journey 4 — Conversation mode session (open-ended, typically 5–10 min of dialogue). User picks "Conversation" → tutor opens with a greeting or topic → push-to-talk dialogue, voice or text → on-ramp mechanic active (untracked words evaluated and routed per §4.6) → "Explain in English" available as a button and voice command ("explícame en inglés" or just clicking the button) → session ends with summary including any vocabulary added via on-ramp.
Journey 5 — Weekly review. Once per week, scheduled session that ignores FSRS scheduling and instead samples broadly across modes: vocabulary knowledge, sentence-level competency, and a small pronunciation sample. Every word-sense currently in CANDIDATE status gets tested in ≥2 fresh contexts; CONFIRMED words get a maintenance check; pronunciation coverage gets re-sampled to detect decay; the user gets a CEFR-level reassessment.
Journey 6 — Vocabulary inspection. User asks "do I know quedar?" → system shows the word, all senses currently tracked, semantic state of each, separate pronunciation status for the word, recent exercise history, next due date.
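The sampling rule in Journey 5 can be sketched as a plain plan builder. This is a minimal sketch under stated assumptions, not the Tutor Engine's implementation: the function name, the `(kind, target)` tuple layout, and the item labels are hypothetical, and the pronunciation sample size of 5 follows the default given in FR-047.

```python
def compose_weekly_review(candidates, due_confirmed, phonemes, words,
                          fresh_contexts=2, pron_sample=5):
    """Build an ordered list of (kind, target) review items.

    All parameter names and item labels are illustrative assumptions.
    """
    plan = []
    for sense in candidates:
        # Every CANDIDATE sense is tested in >= 2 fresh contexts.
        plan.extend(("fresh_context", sense) for _ in range(fresh_contexts))
    # Each due CONFIRMED sense gets a single maintenance check.
    plan.extend(("maintenance", sense) for sense in due_confirmed)
    # Small pronunciation sample, used only to detect decay.
    plan.extend(("phoneme", p) for p in phonemes[:pron_sample])
    plan.extend(("word", w) for w in words[:pron_sample])
    return plan


plan = compose_weekly_review(
    candidates=["quedar:to_remain"],
    due_confirmed=["casa:house", "comer:to_eat"],
    phonemes=["θ", "r", "x"],
    words=["perro", "zapato"],
)
```

Generating the actual fresh contexts remains Claude's job; the sketch only fixes how many items of each kind the session contains.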
This is the heart of the system. Each (word, sense) pair moves through these states:
                           ┌────────────────┐
                           │    UNTESTED    │
                           │  (just added)  │
                           └────────┬───────┘
                                    │ first exposure exercise
                                    ▼
                           ┌────────────────┐
                           │    LEARNING    │
                           │  needs ≥3 OK   │
                           │  in different  │
                           │    exercise    │
                           │    formats     │
                           └────────┬───────┘
                                    │ 3 successful varied uses
                                    ▼
                    ┌───────────────────────────────┐
                    │           CANDIDATE           │
                    │  in FSRS rotation; awaiting   │
                    │  confirmation in next         │
                    │  weekly review                │
                    └────┬───────────┬──────────────┘
                         │           │
    passes weekly review │           │ fails weekly review
                         ▼           ▼
                  ┌──────────────┐ ┌──────────────┐
                  │  CONFIRMED   │ │   LEARNING   │
                  │    counts    │ │ (back down)  │
                  │    toward    │ └──────────────┘
                  │  vocabulary  │
                  └──────┬───────┘
                         │ 2 consecutive misses in maintenance
                         ▼
                  ┌──────────────┐
                  │    LAPSED    │
                  │ removed from │
                  │    count;    │
                  │  re-enters   │
                  │   LEARNING   │
                  └──────────────┘
The transitions are spec'd precisely in §5.
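The state diagram can also be read as a small transition table. The following is a minimal sketch: the state names come from the diagram, while the event names (`FIRST_EXPOSURE`, `VARIED_SUCCESSES_MET`, and so on) and the helper functions are hypothetical labels for the diagram's edges, not identifiers from the spec. The precise guards remain those of §5.

```python
from enum import Enum, auto


class SenseState(Enum):
    UNTESTED = auto()
    LEARNING = auto()
    CANDIDATE = auto()
    CONFIRMED = auto()
    LAPSED = auto()


class Event(Enum):  # hypothetical edge labels, for illustration only
    FIRST_EXPOSURE = auto()
    VARIED_SUCCESSES_MET = auto()    # 3 successful varied uses
    REVIEW_PASSED = auto()
    REVIEW_FAILED = auto()
    TWO_MAINTENANCE_MISSES = auto()
    RELEARN = auto()                 # LAPSED re-enters LEARNING


TRANSITIONS = {
    (SenseState.UNTESTED, Event.FIRST_EXPOSURE): SenseState.LEARNING,
    (SenseState.LEARNING, Event.VARIED_SUCCESSES_MET): SenseState.CANDIDATE,
    (SenseState.CANDIDATE, Event.REVIEW_PASSED): SenseState.CONFIRMED,
    (SenseState.CANDIDATE, Event.REVIEW_FAILED): SenseState.LEARNING,
    (SenseState.CONFIRMED, Event.TWO_MAINTENANCE_MISSES): SenseState.LAPSED,
    (SenseState.LAPSED, Event.RELEARN): SenseState.LEARNING,
}


def next_state(state, event):
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event.name}")


def counts_toward_vocabulary(state):
    """Only CONFIRMED senses count toward the displayed vocabulary."""
    return state is SenseState.CONFIRMED
```

Keeping the edges in one table makes it easy to reject any transition the diagram does not draw, such as UNTESTED jumping straight to CONFIRMED.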
The Pacer holds four levers, each independently advanceable. Each lever has a small set of discrete states; the Pacer never takes fractional steps. State is durable — saved after every advancement and never forgotten.
Lever 1 — Breadth. Controls the size of the active vocabulary pool (the set of word-senses eligible to appear in exercises and tutor speech). States: core_500 → core_1000 → extended_1500 → extended_2000. Numbers correspond roughly to frequency-ranked words from the Plan curricular del Instituto Cervantes. Advancing one step pulls in the next ~500 words at A1 level into the UNTESTED queue.
Lever 2 — Depth. Controls how aggressively secondary senses of known words are introduced. States: primary_only → common_secondary → all_documented. At primary_only, quedar is tracked only in its most-frequent sense ("to remain"); at common_secondary, the next 1–2 most-frequent senses are activated as UNTESTED; at all_documented, every sense in the lexicon entry is tracked.
Lever 3 — Grammar. Controls the grammatical constructions the Tutor expects and uses. States: present_only → +past_tenses → +subjunctive_mood → +conditional_and_compound. Each step adds tense/mood families to the Tutor's "expect and use" list, which feeds into both exercise generation prompts and conversational system prompts.
Lever 4 — Production. Controls the mix between recognition-heavy and production-heavy exercises. States: recognition_heavy (70/30 split favouring translation/cloze/listening) → balanced (50/50) → production_heavy (30/70 favouring use-in-sentence and free dialogue). Affects the daily session's exercise format weighting.
Lever interaction. The levers are independent in that each can advance on its own, but the content the Pacer pulls in is influenced by all four. Adding +past_tenses (Grammar) doesn't add new words to the pool — but it does change which exercises are generated for words already in the pool. Advancing Breadth from core_500 to core_1000 adds words but those words then inherit the current Grammar and Production settings for how they're tested.
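One way to picture the four levers is as ordered ladders of discrete states with a one-step, advance-only move. This is a minimal sketch: the state names are taken from the lever descriptions above, while the `LEVER_STATES` mapping and the `advance` function are hypothetical names chosen for illustration.

```python
LEVER_STATES = {
    "breadth": ["core_500", "core_1000", "extended_1500", "extended_2000"],
    "depth": ["primary_only", "common_secondary", "all_documented"],
    "grammar": ["present_only", "+past_tenses", "+subjunctive_mood",
                "+conditional_and_compound"],
    "production": ["recognition_heavy", "balanced", "production_heavy"],
}


def advance(levers, lever):
    """Return a new lever mapping with `lever` moved up exactly one step.

    The other levers are copied unchanged, which mirrors their independence.
    A lever already at its top state stays where it is.
    """
    ladder = LEVER_STATES[lever]
    idx = ladder.index(levers[lever])
    if idx == len(ladder) - 1:
        return dict(levers)
    return {**levers, lever: ladder[idx + 1]}


start = {name: ladder[0] for name, ladder in LEVER_STATES.items()}
after = advance(start, "grammar")
```

In this example `after["grammar"]` is `+past_tenses` while Breadth, Depth, and Production keep their starting states, so the pool itself is unchanged and only exercise generation is affected.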
    ┌──────────────────────────────────────────────────┐
    │                PACER (per lever)                 │
    │                                                  │
    │  ┌──────────┐  signals met &                     │
    │  │ STATE_N  │──cooldown elapsed──────┐           │
    │  └────┬─────┘                        │           │
    │       │                              ▼           │
    │       │  user manually steps    ┌──────────┐     │
    │       │  down (regression)      │ STATE_N+1│     │
    │       ◄─────────────────────────┤          │     │
    │                                 └────┬─────┘     │
    │                                      │           │
    │                             (continues to N+2…)  │
    └──────────────────────────────────────────────────┘
Cooldown. After any advancement on any lever, that lever cannot advance again for 5 calendar days. Other levers are not affected. Calendar time, not active-use time: a 3-day break still counts toward the cooldown — punishing the user for being away would be the wrong incentive. The cooldown exists because new content needs to enter the FSRS rotation and produce real signal before the system can know whether further advancement is warranted; without it, the Pacer would advance on stale data.
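The calendar-day rule is simple enough to state in code. A minimal sketch, assuming the cooldown is tracked per lever as the date of its last advancement and that "5 calendar days" means a difference of at least 5 days between dates; the function and constant names are hypothetical.

```python
from datetime import date

COOLDOWN_DAYS = 5  # assumed to be read from configuration in a real build


def cooldown_elapsed(last_advanced, today, cooldown_days=COOLDOWN_DAYS):
    """True if the lever may advance again.

    Uses calendar dates only, so days on which the user was away
    still count toward the cooldown.
    """
    if last_advanced is None:  # lever has never advanced
        return True
    return (today - last_advanced).days >= cooldown_days


last = date(2026, 4, 30)
blocked = cooldown_elapsed(last, date(2026, 5, 4))   # 4 days later
allowed = cooldown_elapsed(last, date(2026, 5, 5))   # 5 days later
```

Because the check takes one lever's date at a time, advancing Breadth never delays Grammar, Depth, or Production.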
The "Suggest ease back" mechanic. Per the user's choice, the Pacer never automatically regresses. But it does compute a "drift signal" each session: when 7-day accuracy on a lever's recently-advanced content drops below 60%, or when the FSRS system is heavily redistributing toward failed cards, the dashboard shows a non-blocking suggestion: "Production has been at production_heavy for 8 days and accuracy is at 54%. You may want to step it back to balanced." The user can act on it or ignore it. The decision log records both the suggestion and the user's response (or lack thereof).
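The accuracy half of the drift signal could look like the sketch below. It covers only the 60% accuracy trigger; the FSRS-redistribution trigger is left out because the text does not quantify it. The function name and signature are hypothetical, and the message wording follows the example above.

```python
def ease_back_suggestion(lever, state, previous_state, days_at_state,
                         accuracy_7d, threshold=0.60):
    """Return a non-blocking suggestion string, or None if no drift.

    This never changes Pacer state; it only produces dashboard text.
    """
    if accuracy_7d >= threshold:
        return None
    return (
        f"{lever} has been at {state} for {days_at_state} days and "
        f"accuracy is at {accuracy_7d:.0%}. You may want to step it "
        f"back to {previous_state}."
    )


msg = ease_back_suggestion("Production", "production_heavy", "balanced",
                           days_at_state=8, accuracy_7d=0.54)
```

Both the returned message and the user's eventual response would then be written to the decision log.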
The conversational tutor doesn't just expose the user to stretch words passively — it acts as an active vocabulary on-ramp. Every word the tutor uses that is not already in the user's tracked vocabulary becomes a candidate for tracking, and the tutor's evaluation of the user's response determines what happens next.
When a tutor turn introduces an untracked word (within the conversation's stretch budget — see FR-010), the system tags the word in the conversation log. When the user responds, Claude evaluates not just whether the response is grammatically and contextually correct, but specifically whether the user demonstrated comprehension of the new word. There are three outcomes:
Handled well — user produced a contextually appropriate response, used or correctly responded to the new word, no signs of confusion. The word is fast-tracked: it skips the UNTESTED state and enters LEARNING with one "correct" already credited (since the conversational use counts as one successful exposure across one format). The user is informed at session end: "You handled posiblemente correctly when I used it. I've added it to your tracking with a head start."
Handled neutrally — user's response didn't engage with the new word either way (they might have answered around it). The word enters UNTESTED state for normal tracking, no head start.
Struggled — user asked for clarification, gave a clearly off-topic response, or used the word incorrectly when reusing it. The word enters UNTESTED state but is also flagged as priority for next daily session — meaning the next session's exercise generator will explicitly pull this word in for direct drill, ahead of the FSRS schedule.
This means conversation is a real vocabulary growth pathway, not just exposure. It also means the Breadth lever isn't the only source of new words — important caveat for Pacer behaviour: words added via the on-ramp count toward the active pool but do not consume the Breadth lever's batch budget. The Pacer's Breadth advancement still pulls from the next frequency band when its thresholds are met; conversational on-ramps run in parallel.
Stretch budget cap remains. No more than 3 untracked words per conversation (per FR-010), to avoid drowning the user. The cap is on tutor-introduced novelty per session, not a cumulative cap.
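The three on-ramp outcomes and the stretch budget can be summarised in a small routing function. This is a sketch under stated assumptions: the `Comprehension` enum, the `OnRampResult` record, and both function names are hypothetical; `priority_drill_next_session` is the flag name used in FR-037.

```python
from dataclasses import dataclass
from enum import Enum


class Comprehension(Enum):  # hypothetical names for the three outcomes
    HANDLED_WELL = "handled_well"
    HANDLED_NEUTRALLY = "handled_neutrally"
    STRUGGLED = "struggled"


@dataclass(frozen=True)
class OnRampResult:
    state: str                         # state the new word-sense enters
    credited_successes: int            # head start toward CANDIDATE
    priority_drill_next_session: bool


def route_untracked_word(outcome):
    """Map Claude's comprehension verdict to the tracking action."""
    if outcome is Comprehension.HANDLED_WELL:
        # Fast-track: skips UNTESTED, one success already credited.
        return OnRampResult("LEARNING", 1, False)
    if outcome is Comprehension.STRUGGLED:
        # Normal entry, but drilled ahead of the FSRS schedule.
        return OnRampResult("UNTESTED", 0, True)
    return OnRampResult("UNTESTED", 0, False)


STRETCH_BUDGET = 3  # per conversation, not cumulative


def may_introduce(untracked_used_so_far, budget=STRETCH_BUDGET):
    """True while the tutor may still introduce another untracked word."""
    return untracked_used_so_far < budget
```

Resetting the counter passed to `may_introduce` at the start of each conversation is what makes the cap per-session rather than cumulative.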
Pronunciation is tracked independently from semantic vocabulary mastery. A word can be CONFIRMED in the vocabulary engine while still being "uncovered" for pronunciation, and vice versa. The user's headline vocabulary count is the semantic count; pronunciation has its own coverage metrics.
Two denominators, two coverage metrics:
- Phoneme coverage. Castilian Spanish has ~24 distinct phonemes (5 vowels + ~19 consonants, depending on how you count ʎ vs ʝ and whether /θ/ is kept distinct from /s/). The system tracks which phonemes the user has demonstrably pronounced correctly. Phoneme coverage % = covered / 24.
- Word-pronunciation coverage. Of the words in the user's active pool, what percentage have been pronounced correctly at least twice across separate sessions? A word is "in the active pool" iff at least one of its tracked senses is in LEARNING, CANDIDATE, or CONFIRMED state — pronunciation is a property of the surface form, not the sense (banco sounds the same whether you mean "bank" or "bench"), so we count distinct words rather than multiplying by sense count. Word-pronunciation coverage % = pronounceable_distinct_words / distinct_active_pool_words.
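The two denominators can be made concrete with a short sketch. It assumes senses are available as `(word, sense_id, state)` tuples and that the set of already-pronounceable words is known; these shapes and all function names are illustrative assumptions.

```python
ACTIVE_STATES = {"LEARNING", "CANDIDATE", "CONFIRMED"}
TOTAL_PHONEMES = 24  # approximate Castilian inventory used by the spec


def active_pool_words(senses):
    """Distinct words with at least one sense in an active state."""
    return {word for word, _sense_id, state in senses if state in ACTIVE_STATES}


def word_pronunciation_coverage(senses, pronounceable_words):
    """pronounceable distinct words / distinct active-pool words."""
    pool = active_pool_words(senses)
    if not pool:
        return 0.0
    return len(pool & set(pronounceable_words)) / len(pool)


def phoneme_coverage(covered_phonemes, total=TOTAL_PHONEMES):
    """covered phonemes / total phonemes."""
    return len(set(covered_phonemes)) / total


senses = [
    ("banco", "bank", "CONFIRMED"),
    ("banco", "bench", "LEARNING"),       # same surface form, counted once
    ("quedar", "to_remain", "UNTESTED"),  # not in the active pool
    ("casa", "house", "CANDIDATE"),
]
coverage = word_pronunciation_coverage(senses, {"banco"})
```

Here the pool is {banco, casa}, so covering banco alone gives a word-pronunciation coverage of 0.5 even though banco has two tracked senses.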
Boolean per attempt with confirmation and decay:
Each pronunciation attempt has one of three outcomes from the configured PronunciationScorer: CORRECT, INCORRECT, or NOT_DETECTED. NOT_DETECTED is an operational outcome covering mic failure, silence, or scorer error. Such attempts are logged for diagnostics but are invisible to the coverage state machine: they count toward neither advancement nor decay. The user-facing tracking layer applies the following rules:
- A phoneme or word is uncovered by default.
- A phoneme/word transitions to covered after 2 correct attempts in different sessions.
- A covered phoneme/word decays back to uncovered after 3 consecutive failures in subsequent sessions.
This is deliberately stricter than semantic mastery (no "candidate" intermediate state) because pronunciation is a motor skill — you can do it or you can't, and there's less value in a partial-credit state. The decay rule means the percentage reflects current ability, not historical bests.
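The coverage rules can be sketched as a tiny per-item tracker. Several details are assumptions, since the text leaves them open: the two correct attempts need not be consecutive, an incorrect attempt on an uncovered item changes nothing, a correct attempt resets the failure streak, and "consecutive failures" is counted over attempts made while the item is covered. The class name and outcome strings are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageTracker:
    """Tracks pronunciation coverage for one phoneme or one word."""
    covered: bool = False
    correct_sessions: set = field(default_factory=set)
    consecutive_failures: int = 0

    def record(self, outcome, session_id):
        """Apply one attempt and return the resulting covered flag."""
        if outcome == "NOT_DETECTED":
            return self.covered  # invisible to the state machine
        if outcome == "CORRECT":
            self.consecutive_failures = 0
            if not self.covered:
                self.correct_sessions.add(session_id)
                # Covered after correct attempts in 2 different sessions.
                if len(self.correct_sessions) >= 2:
                    self.covered = True
        elif outcome == "INCORRECT":
            if self.covered:
                self.consecutive_failures += 1
                # Decay after 3 consecutive failures.
                if self.consecutive_failures >= 3:
                    self.covered = False
                    self.correct_sessions.clear()
                    self.consecutive_failures = 0
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
        return self.covered
```

Storing session identifiers rather than a plain counter is what enforces the "different sessions" condition: two correct attempts in the same session add only one entry to the set.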
No effect on vocabulary count. Pronunciation does not gate semantic state transitions. A word can move UNTESTED → LEARNING → CANDIDATE → CONFIRMED on the semantic side regardless of pronunciation performance, and vice versa.
The PronunciationScorer interface:

    from __future__ import annotations

    from abc import ABC, abstractmethod
    from typing import Any

    # ScorerName and PronunciationOutcome are assumed to come from the contract module.

    class IPronunciationScorer(ABC):
        @property
        @abstractmethod
        def scorer_name(self) -> ScorerName: ...

        @abstractmethod
        def score(
            self,
            audio_path: str,
            expected_text: str,
            language: str = "es-ES",
        ) -> tuple[PronunciationOutcome, float, dict[str, Any]]:
            """Returns (outcome, confidence, raw_response)."""
            ...

V1 ships with WhisperHeuristicScorer: it sends the audio file to Whisper, asks for a transcript, and returns CORRECT if the normalised transcript equals the normalised expected text, INCORRECT if they differ, and NOT_DETECTED if Whisper returns nothing usable. This is a crude proxy. Future implementations include AzurePronunciationScorer (real phoneme-level scoring) and potentially a local model. Configuration selects which scorer is active; the rest of the system is scorer-agnostic.
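The comparison step of the Whisper heuristic might look like the sketch below. It assumes the transcript has already been obtained from the STT call, which is not shown. The spec does not define "normalised", so the rules used here (Unicode NFC, case folding, punctuation stripping, whitespace collapsing, accents preserved) are assumptions, as are the function names.

```python
import string
import unicodedata

# ASCII punctuation plus the Spanish inverted marks.
_DROP = str.maketrans("", "", string.punctuation + "¿¡")


def normalise(text):
    """Fold case, strip punctuation, collapse whitespace; keep accents."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = text.translate(_DROP)
    return " ".join(text.split())


def heuristic_outcome(transcript, expected_text):
    """Return 'CORRECT', 'INCORRECT', or 'NOT_DETECTED' for one attempt."""
    if transcript is None or not normalise(transcript):
        return "NOT_DETECTED"  # nothing usable came back
    if normalise(transcript) == normalise(expected_text):
        return "CORRECT"
    return "INCORRECT"


match = heuristic_outcome("¿Dónde está el baño?", "dónde está el baño")
silent = heuristic_outcome("   ", "hola")
homophone = heuristic_outcome("ola", "hola")
```

The last call shows why the heuristic is only a proxy: "ola" and "hola" sound the same in Spanish, yet a text comparison marks the attempt INCORRECT if the transcriber happens to choose the other spelling.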
| ID | Requirement | Priority |
|---|---|---|
| FR-001 | The system shall track vocabulary at the (word, sense) granularity, not the word level. | Must |
| FR-002 | Each word-sense shall hold a state ∈ {UNTESTED, LEARNING, CANDIDATE, CONFIRMED, LAPSED} and an FSRS state (stability, difficulty, last review, next due). | Must |
| FR-003 | A word-sense shall transition from LEARNING to CANDIDATE only after ≥3 successful exercises in ≥2 different exercise formats, with ≥24h between the first and third success. | Must |
| FR-004 | A word-sense shall transition from CANDIDATE to CONFIRMED only via a weekly review session, requiring ≥2 successful evaluations in fresh contexts not previously seen by the user. | Must |
| FR-005 | A CONFIRMED word-sense shall transition to LAPSED after 2 consecutive maintenance failures (FSRS "Again" responses). LAPSED returns to LEARNING with reset FSRS parameters. | Must |
| FR-006 | The "vocabulary count" displayed to the user shall include only senses in CONFIRMED state. | Must |
| FR-007 | The system shall generate exercises in the following formats, with the indicated mode affinities — Vocabulary mode: TRANSLATE_EN_ES, TRANSLATE_ES_EN, DEFINITION_MATCH, SENSE_DISAMBIGUATION, LISTENING (single-word audio); Sentences mode: CLOZE, FULL_TRANSLATE, ERROR_CORRECTION, PRODUCTION, LISTENING (sentence-level audio); Speaking mode: READ_ALOUD, REPEAT_AFTER (no semantic evaluation, pronunciation only); Conversation mode: no formal exercise format (uses ConversationTurn). | Must |
| FR-008 | All exercise generation shall be done via Claude API calls, parameterised by the target word-sense, current user level, and exercise format. | Must |
| FR-009 | Response evaluation (was the user correct?) shall be done via Claude with a structured JSON output schema; no regex or string-match grading. | Must |
| FR-010 | The conversational tutor mode shall constrain its outputs to vocabulary the user has been EXPOSED to (any state including UNTESTED/LEARNING) plus an explicit "stretch budget" of ≤3 new words per conversation, surfaced for later vocabulary tracking. | Must |
| FR-011 | The system shall support push-to-talk speech input, transcribe via configured STT provider, and play tutor responses via configured TTS provider. | Must |
| FR-012 | The system shall log every exchange (timestamp, mode, prompt, user response, evaluation, state changes) to a local persistent store. | Must |
| FR-013 | The weekly review shall be scheduled (by default, every Sunday) and shall not be skippable without an explicit override; missed reviews shall be flagged on the dashboard. | Should |
| FR-014 | The CEFR dashboard shall show: current estimated level, vocabulary count by state, A1 thematic domain coverage (greetings, family, food, etc.), grammar competency checklist. | Must |
| FR-015 | The user shall be able to inspect any word's full state including all tracked senses, history, and next due date. | Must |
| FR-016 | The system shall support inserting new word-senses both manually (user adds quedar:to_remain) and automatically (tutor surfaces a new word during conversation). | Must |
| FR-017 | (Removed in v0.5; superseded by the §4.6 conversational on-ramp. Numbering preserved to avoid breaking references.) | — |
| FR-018 | The CEFR level estimation shall be re-computed after each weekly review and shall use Claude as the assessor, fed the user's vocabulary state and a sample of recent productions. | Must |
| FR-019 | The system shall include a Pacer module that holds four independent difficulty levers (Breadth, Depth, Grammar, Production), each with discrete advancement states per §4.5. | Must |
| FR-020 | The Pacer shall evaluate advancement signals after every completed daily session and after every weekly review. It shall not run during active sessions. | Must |
| FR-021 | The Pacer shall advance the Breadth lever when ALL of: (a) 7-day rolling accuracy across active pool ≥ 85% on at least 3 distinct exercise formats, (b) most recent weekly review CANDIDATE→CONFIRMED conversion rate ≥ 70%, (c) the lever's 5-day cooldown has elapsed, (d) at least 80% of the current pool is in CANDIDATE or CONFIRMED state (no advancing while there's still substantial UNTESTED inventory). | Must |
| FR-022 | The Pacer shall advance the Depth lever when: (a) ≥ 30 distinct words have been in CONFIRMED state for ≥ 14 days, (b) the user has encountered ≥ 5 secondary senses in conversation/exercises and handled them correctly without confusion, (c) cooldown elapsed. | Must |
| FR-023 | The Pacer shall advance the Grammar lever when: (a) the user has successfully produced the next-tier construction unprompted in ≥ 3 separate exchanges within the past 14 days (verified by Claude evaluation), (b) cooldown elapsed. | Must |
| FR-024 | The Pacer shall advance the Production lever when: (a) accuracy on production-format exercises ≥ 80% over the rolling 7-day window, (b) the user has completed ≥ 5 conversational tutor sessions in the past 14 days, (c) cooldown elapsed. | Must |
| FR-025 | The Pacer shall never automatically regress any lever. Regression is a manual action initiated by the user from the dashboard. | Must |
| FR-026 | The Pacer shall compute a "drift signal" per lever each session and surface a non-blocking Suggest ease back notification on the dashboard when drift exceeds threshold (per §4.5). The notification does not change Pacer state. | Should |
| FR-027 | Every Pacer decision (advance, hold, suggest-ease-back) shall be written to a persistent PacerDecision log including: timestamp, lever, decision type, all signal values evaluated, threshold values used, and the resulting state (or unchanged state). The log is append-only and user-browsable. | Must |
| FR-028 | The dashboard shall display the current state of each of the four levers, the time-since-last-advancement, the next signal value needed for advancement (e.g. "needs 7-day accuracy ≥ 85%, currently 81%"), and the cooldown status. | Must |
| FR-029 | When the Pacer advances Breadth, the new words shall be selected from the next frequency band of the active CEFR level using the Plan curricular del Instituto Cervantes frequency ordering, and shall enter UNTESTED state in the Vocabulary Engine in batches of 10 (configurable). | Must |
| FR-030 | The Pacer's threshold values (the 85%, 70%, 60%, 14 days, 5 days, etc.) shall be stored in a configuration file, not hardcoded, so they can be calibrated after 4 weeks of real-use data without code changes. | Must |
| FR-031 | The system shall include a Pronunciation Engine as an independent backend module with a PronunciationScorer interface (per §4.7). V1 shall ship with a WhisperHeuristicScorer implementation; the architecture shall support adding AzurePronunciationScorer and other implementations as configuration changes without code changes elsewhere. | Must |
| FR-032 | Pronunciation scoring shall produce one outcome per attempt: CORRECT, INCORRECT, or NOT_DETECTED (an operational outcome that does not affect coverage, per §4.7), plus a confidence score and the raw scorer response for debugging. | Must |
| FR-033 | Pronunciation results shall be stored as PronunciationAttempt entities and shall not affect the semantic FSRS state of the underlying word-sense. The two tracks are independent. | Must |
| FR-034 | A phoneme or word shall be considered covered for pronunciation after 2 correct attempts in different sessions, and shall decay back to uncovered after 3 consecutive failures in subsequent sessions. (Per §4.7.) | Must |
| FR-035 | The dashboard shall display two pronunciation coverage metrics as percentages: (a) phoneme coverage out of ~24 Castilian phonemes, (b) word-pronunciation coverage out of active pool size. These are shown separately from the semantic vocabulary count. | Must |
| FR-036 | When the conversational tutor introduces an untracked word (within the stretch budget), the system shall tag the word in the conversation log and Claude shall evaluate the user's comprehension of that specific word in their response. | Must |
| FR-037 | Based on the comprehension evaluation, the untracked word shall be processed per §4.6: handled-well → fast-track to LEARNING with one credit; handled-neutrally → enter UNTESTED; struggled → enter UNTESTED with priority_drill_next_session flag set. | Must |
| FR-038 | Words flagged priority_drill_next_session shall be force-included in the next Vocabulary or Sentences mode session's exercise queue, ahead of FSRS-scheduled items, until the priority flag is cleared (cleared after 1 successful exercise). | Should |
| FR-039 | The user shall be presented with a session-end summary listing all words added via the conversational on-ramp during that session, with their resulting state, so the user is never surprised by silent vocabulary additions. | Must |
| FR-040 | The system shall expose four explicit user-facing modes: Vocabulary, Sentences, Speaking, and Conversation. Each mode shall have its own session shape, exercise mix, and end-of-session summary tailored to that mode's goals. | Must |
| FR-041 | The dashboard shall let the user start a session in any mode directly (single click/tap), with the most recently used mode pre-selected. | Should |
| FR-042 | An "Explain in English" touchpoint shall be available in every mode that involves Spanish-language content. It shall be invocable as a UI button at all times during a session, and as a voice command (the user saying "explícame" or "in English") in voice-enabled modes. | Must |
| FR-043 | When invoked, the Explain-in-English touchpoint shall send the current item, the user's last response (if any), and a brief context summary to Claude, and shall stream back an English-language explanation. The session state (current exercise, conversation context) shall not be lost — the explanation is a side conversation that resumes the session afterward. | Must |
| FR-044 | The Explain-in-English explanation shall be logged in the session history, distinguishable from the main exchange, so the user can review where they needed help. | Should |
| FR-045 | The Pronunciation Engine module shall be invoked only by the Speaking mode in V1. Other modes that involve speech (Conversation) shall route audio through STT for transcription only; pronunciation tracking from those modes shall be deferred to V2. [Rationale: keeps the Speaking mode a clean, focused surface for pronunciation work; conversation latency would suffer if every utterance were also pronunciation-scored.] | Should |
| FR-046 | The Vocabulary Engine shall map exercise outcomes to FSRS ratings as follows: INCORRECT → Again; PARTIAL with score < 0.7 → Hard; PARTIAL with score ≥ 0.7 → Good; CORRECT → Good. The Easy rating is reserved for V2 and shall not be used in V1. [Rationale: defaulting CORRECT → Good is conservative — reviews come slightly more often than they need to, but always safe. Wrong-direction Easy assignments would inflate intervals and silently degrade retention.] | Must |
| FR-047 | The Tutor Engine shall compose weekly-review sessions by sampling each CANDIDATE sense in ≥2 fresh contexts (formats and prompts not previously seen by the user), each due-for-maintenance CONFIRMED sense once, plus a small pronunciation sample (default 5 phonemes, 5 words) for decay detection. | Must |
erDiagram
WORD ||--o{ SENSE : has
SENSE ||--o{ EXERCISE_RESULT : tested_in
SENSE ||--o{ STATE_TRANSITION : changes
SENSE ||--|| FSRS_STATE : has
SESSION ||--o{ EXERCISE : contains
SESSION ||--o{ CONVERSATION_TURN : contains
EXERCISE ||--|| EXERCISE_RESULT : produces
THEMATIC_DOMAIN ||--o{ SENSE : groups
PACER_LEVER ||--o{ PACER_DECISION : produces
PACER_LEVER ||--o{ PACER_SIGNAL_SAMPLE : observed_by
Word — surface form, lemma, part of speech, gender (for nouns), CEFR level introduced, IPA pronunciation, audio sample URL, frequency rank within Plan curricular (used by Pacer for breadth advancement ordering).
Sense — parent word ID, sense identifier (e.g. quedar:to_remain vs quedar:to_meet_up), gloss in English, example sentences (≥3, generated up front by Claude and human-spot-checked by you), CEFR level for this sense, current state, thematic domain tags, sense rank within parent word (1 = primary, 2 = first secondary, etc; used by Pacer for depth advancement).
FSRS state — stability, difficulty, retrievability, last review timestamp, next due timestamp, review count, lapse count. (Parameters per the FSRS algorithm; see §8.6 for the library.) FsrsState is created lazily on a sense's first exposure (UNTESTED → LEARNING transition), not at sense registration; UNTESTED senses are not scheduled and have no FsrsState row.
Exercise — format, target sense ID, prompt text, expected response shape, generated_at, generation_model_version, generation_prompt_hash (so we can replay), grammar constructions required (set of tags drawn from the Grammar lever vocabulary, e.g. present_indicative, preterite, subjunctive_present).
ExerciseResult — exercise ID, user response (text + optional audio file path), Claude evaluation (correct / partial / incorrect + reasoning + score 0–1), latency, state-change-triggered, grammar constructions used (extracted by Claude from the response; feeds Grammar lever's "user produced unprompted" signal).
Session — type (daily / conversation / weekly_review / inspection), start, end, summary stats, pacer_evaluated_at_end (whether the Pacer ran after this session and what it decided).
ConversationTurn — session ID, speaker (user / tutor), text, audio file path if applicable, words-surfaced (for stretch budget tracking), constructions-used (Claude-tagged grammar features in this turn).
StateTransition — sense ID, from state, to state, reason, timestamp. Append-only audit log.
ThematicDomain — id, name, CEFR level, description. Static reference data seeded from CEFR documentation.
PacerLever — lever name (Breadth / Depth / Grammar / Production), current state, last_advancement_at, cooldown_expires_at. Exactly four rows; one per lever.
PacerDecision — id, timestamp, lever, decision_type (advance / hold / suggest_ease_back / user_regressed), from_state, to_state (same as from_state if hold), signals_evaluated (JSON blob: each signal name, its computed value, the threshold it was compared against, pass/fail), reasoning_text. Append-only.
PacerSignalSample — timestamp, signal_name (e.g. accuracy_7day_translation_es_to_en), value, computed_from (window definition: which exercises feed this signal). Used so the Pacer's decisions are reproducible — given the same samples, the same decision should be reached. Useful for retroactive calibration after threshold changes.
PronunciationAttempt — id, timestamp, target_type (phoneme / word / phrase), target_value (the actual phoneme symbol, word, or phrase text), session_id, audio file path, scorer_name (which PronunciationScorer was used), correct (boolean), confidence (0–1), raw_response (full scorer response for debugging).
PhonemeCoverage — phoneme symbol (e.g. r̄ for trilled r, θ for ceceo s), correct_attempts_count, distinct_sessions_with_correct_attempt, consecutive_failures, current_state (covered / uncovered), last_attempt_at. One row per Castilian phoneme.
WordPronunciationCoverage — word_id (FK to Word), correct_attempts_count, distinct_sessions_with_correct_attempt, consecutive_failures, current_state (covered / uncovered), last_attempt_at. Independent of word's semantic state.
ConversationOnRamp — id, conversation_turn ID where word was introduced, word ID, comprehension_evaluation (handled_well / handled_neutrally / struggled), claude_reasoning, resulting_state_change. Auditable record of every conversational vocabulary addition.
EnglishExplanationLog — id, session_id, exercise_id_or_turn_id (what triggered the explanation), trigger_method (button / voice command), user_context_at_trigger (current exercise/turn snapshot), claude_explanation, timestamp. Lets the user review what they needed help with.
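The PhonemeCoverage and WordPronunciationCoverage entities share the same counters, so one small state machine could serve both. The sketch below is illustrative only. It assumes that "covered" requires correct attempts in two distinct sessions, that three consecutive failures decay the item back to uncovered (the threshold R-023 attributes to FR-034), and that progress restarts after a decay. The class and constant names are hypothetical.

```python
from dataclasses import dataclass, field

COVER_AFTER_SESSIONS = 2  # assumed reading of the "2-correct" rule
DECAY_AFTER_FAILURES = 3  # consecutive failures before decay


@dataclass
class Coverage:
    """Coverage counters for one phoneme or one word."""
    correct_attempts_count: int = 0
    consecutive_failures: int = 0
    covered: bool = False
    _sessions_with_correct: set = field(default_factory=set)

    @property
    def distinct_sessions_with_correct_attempt(self) -> int:
        return len(self._sessions_with_correct)

    def record(self, session_id: str, correct: bool) -> None:
        """Apply one scored attempt to the counters and the covered flag."""
        if correct:
            self.correct_attempts_count += 1
            self.consecutive_failures = 0
            self._sessions_with_correct.add(session_id)
            if self.distinct_sessions_with_correct_attempt >= COVER_AFTER_SESSIONS:
                self.covered = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= DECAY_AFTER_FAILURES:
                self.covered = False
                # Assumption: a decayed item must be re-earned from scratch.
                self._sessions_with_correct.clear()
```

Attempts where the scorer detects nothing usable would simply not call `record`, so they leave coverage untouched.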
- Creation. Words come from (a) manual entry, (b) curriculum seed lists, (c) tutor-surfaced during conversation, admitted through the §4.6 conversational on-ramp (no manual approval step).
- Mutation. State changes are append-only — we never overwrite the state field; we insert a StateTransition and the sense's "current state" is derived from the latest transition. This makes the entire history auditable.
- Deletion. No deletion in V1. If you regret adding a word, mark it ARCHIVED (a soft state outside the main lifecycle).
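The append-only rule above can be sketched with the standard-library `sqlite3` module. Table and column names here are assumptions modelled on the StateTransition entity, and treating a sense with no transitions as UNTESTED is likewise an assumption. The production schema would live in SQLAlchemy models.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE state_transition (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        sense_id INTEGER NOT NULL,
        from_state TEXT,
        to_state TEXT NOT NULL,
        reason TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def transition(sense_id: int, from_state: str, to_state: str, reason: str) -> None:
    """Record a state change by inserting a row. Nothing is ever updated."""
    conn.execute(
        "INSERT INTO state_transition (sense_id, from_state, to_state, reason) "
        "VALUES (?, ?, ?, ?)",
        (sense_id, from_state, to_state, reason),
    )


def current_state(sense_id: int) -> str:
    """Derive the current state from the most recent transition."""
    row = conn.execute(
        "SELECT to_state FROM state_transition "
        "WHERE sense_id = ? ORDER BY id DESC LIMIT 1",
        (sense_id,),
    ).fetchone()
    return row[0] if row else "UNTESTED"  # assumed default for untouched senses
```

Ordering by the autoincrement `id` rather than the timestamp avoids ambiguity when two transitions land within the same second.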
Almost none. The user is the sole user; their voice recordings and chat logs are personal but not regulated. Stored locally. No PII processing in any meaningful sense. [NOTE — this changes the moment you put this on a server with auth. Don't.]
Estimated upper bound at A1: ~500 confirmed senses, ~1500 senses tracked total, ~10k exercise results, ~100k conversation turns. All of this fits comfortably in SQLite. Audio files are the bulky asset; budget ~1GB if you keep all session audio. All session audio is retained indefinitely in V1, with no rotation policy. Storage growth is acknowledged but acceptable for a single user (per A-014).
Single SQLite file + audio directory. Backup = nightly copy to a cloud sync folder (Dropbox / iCloud / Google Drive). RPO: 24h is fine. RTO: whatever it takes you to download the backup and reopen the app.
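A plain file copy of a live SQLite database can catch it mid-write. One way to take the nightly snapshot safely is the `Connection.backup` method in Python's standard `sqlite3` module. The sketch below assumes a dated file name and a caller-supplied backup directory (for example, a folder inside the cloud sync directory); both are hypothetical choices.

```python
import sqlite3
from datetime import date
from pathlib import Path


def backup_database(db_path: Path, backup_dir: Path) -> Path:
    """Write a consistent snapshot of db_path into backup_dir and return its path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"db-{date.today().isoformat()}.sqlite"
    source = sqlite3.connect(db_path)
    dest = sqlite3.connect(target)
    try:
        # Uses SQLite's online backup API, so the app may keep the file open.
        source.backup(dest)
    finally:
        dest.close()
        source.close()
    return target
```

Running it more than once on the same day overwrites that day's snapshot, which matches the 24h RPO. The audio directory would still need an ordinary file sync.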
| ID | Category | Requirement | Target | How verified |
|---|---|---|---|---|
| NFR-001 | Latency | Exercise generation round trip (Claude API) | p95 < 4s | Manual; Claude API typically 2–6s for short generations |
| NFR-002 | Latency | Speech-to-text round trip | p95 < 3s for ≤15s audio | Provider SLA |
| NFR-003 | Latency | TTS playback start | p95 < 2s | Provider SLA |
| NFR-004 | Availability | App functional when online | best effort, single user | n/a |
| NFR-005 | Reliability | No data loss on crash | Every state transition persisted before UI confirmation | Test with mid-session kill |
| NFR-006 | Cost | Steady-state monthly cost | ≤ $60/mo at 30 min/day (expected $25–35) | Monthly review of API bills |
| NFR-007 | Usability | Time from "open app" to "first exercise" | < 5s | Manual stopwatch |
| NFR-008 | Maintainability | Anyone (you, in 6 months) can read the code | Reasonable file size, comments on the FSRS layer and the Claude prompt construction | Code review by future-you |
| NFR-009 | Portability | Browser support | Latest Chrome, Firefox, Safari on desktop | Manual smoke test |
| NFR-010 | Privacy | Voice and chat data never leave your control unencrypted | API providers see request payloads in transit (unavoidable); local store is your machine | Configuration review |
Categories I'm omitting from the table because they don't apply: scalability (single user), accessibility (you are the user; you know your own needs), i18n (the UI chrome can stay in English; the content is the Spanish — that's the whole point).
A single-page web app (React + Vite) talking to a thin local backend (Python with FastAPI) that owns the SQLite database and proxies calls to Claude, the STT provider, and the TTS provider. The backend exists primarily to keep the API keys off the client and to centralise the prompt construction logic. Run it locally — no deployment, no hosting cost.
The "interesting" software is concentrated in five modules:
- Vocabulary engine — owns the FSRS layer, the state machine, and the validation logic. Pure logic, no AI. This is the module worth unit-testing thoroughly.
- Pacer — owns the four difficulty levers, computes signals from the database, makes advancement decisions per the rules in §4.5 and FR-019–030. Pure logic, no AI calls (signals are all derived from existing data). Runs after sessions, not during. Highly unit-testable; do it.
- Tutor engine — owns the Claude prompts and the response parsing. Has one job: turn (user state, Pacer state, intent) into (exercise or dialogue turn) and back into (evaluation). The Pacer state feeds into prompt construction — when Grammar is at +past_tenses, exercise-generation prompts say so. Also evaluates comprehension of stretch-budget words for the conversational on-ramp.
- Speech engine — STT and TTS adapters with a common interface. Pluggable so you can swap providers.
- Pronunciation engine — owns the pluggable PronunciationScorer interface. V1 ships the WhisperHeuristicScorer; Azure Speech Pronunciation Assessment (or equivalent) slots in later as a configuration change. Takes audio + reference text, returns structured scores. It is kept separate from the STT transcription path: when both are needed, transcription (for content evaluation) and pronunciation scoring (for accent feedback) run in parallel on the same audio.
graph LR
U[User<br/>browser + mic + speakers]
UI[Web UI<br/>SPA]
BE[Local backend<br/>API + SQLite]
Claude[Anthropic Claude API]
STT[STT provider<br/>Whisper / Deepgram]
TTS[TTS provider<br/>ElevenLabs / Azure]
U <--> UI
UI <--> BE
BE <--> Claude
BE <--> STT
BE <--> TTS
graph TB
API[HTTP API layer]
VE[Vocabulary Engine<br/>FSRS + state machine]
PA[Pacer<br/>4 levers + signal computation]
TE[Tutor Engine<br/>Claude prompts + parsing]
SE[Speech Engine<br/>STT/TTS adapters]
PE[Pronunciation Engine<br/>pluggable scorer]
DB[(SQLite)]
AUDIO[(Audio file store)]
API --> VE
API --> PA
API --> TE
API --> SE
API --> PE
VE --> DB
PA --> DB
PA -.reads state from.-> VE
PA -.publishes state to.-> TE
TE --> DB
PE --> DB
TE -.calls.-> Claude[Claude API]
SE -.calls.-> ExtSTT[STT API]
SE -.calls.-> ExtTTS[TTS API]
PE -.calls.-> ExtPron[Pronunciation scorer API<br/>Whisper in V1, Azure later]
SE --> AUDIO
PE --> AUDIO
The Pacer reads from both the Vocabulary Engine's state and the raw exercise/result history; it doesn't write to FSRS state directly. When the Pacer advances Breadth, it inserts new word-senses into UNTESTED state via the Vocabulary Engine's normal "add new sense" path — no side door. When the Pacer advances Grammar or Production, those settings are read by the Tutor Engine when it constructs prompts; nothing in the Vocabulary Engine changes.
The Pronunciation Engine is independent of the semantic evaluation path. When the user submits an audio response in a mode that scores pronunciation (Speaking mode only in V1, per FR-045), it goes to both the Speech Engine (STT for transcription → Tutor Engine for content evaluation) and the Pronunciation Engine (the configured PronunciationScorer; Azure phoneme scoring once enabled) in parallel. The two evaluations are stitched together in the session response so the user gets both kinds of feedback in one place.
Scenario A — Single exercise during a daily session
sequenceDiagram
actor User
participant UI
participant BE as Backend
participant VE as Vocab Engine
participant TE as Tutor Engine
participant Claude
User->>UI: Start daily session
UI->>BE: POST /session/start
BE->>VE: get_due_senses(limit=15)
VE-->>BE: list of (sense, format_to_use)
BE->>TE: generate_exercise(sense, format, pacer_state)
TE->>Claude: structured prompt (incorporates Pacer Grammar/Production state)
Claude-->>TE: exercise JSON
TE-->>BE: exercise
BE-->>UI: exercise payload
UI-->>User: render exercise
User->>UI: response (text or audio)
UI->>BE: POST /exercise/respond
BE->>TE: evaluate(exercise, response)
TE->>Claude: evaluation prompt (also extracts grammar constructions used)
Claude-->>TE: {correct, score, reasoning, constructions_used}
TE-->>BE: evaluation
BE->>VE: record_result(sense, result)
VE->>VE: update FSRS, check state transitions
VE-->>BE: updated state
BE-->>UI: result + state delta
UI-->>User: feedback
Scenario B — Post-session Pacer evaluation
sequenceDiagram
participant UI
participant BE as Backend
participant PA as Pacer
participant DB
UI->>BE: POST /session/end
BE->>PA: evaluate_after_session(session_id)
PA->>DB: read 7-day exercise results, FSRS states, sense states
PA->>PA: compute signals for each lever
loop per lever (Breadth, Depth, Grammar, Production)
PA->>PA: check cooldown
PA->>PA: check thresholds
alt thresholds met & cooldown elapsed
PA->>DB: write PacerDecision (advance) + update PacerLever state
opt lever == Breadth
PA->>DB: insert N new senses as UNTESTED
end
else drift signal triggered
PA->>DB: write PacerDecision (suggest_ease_back)
else
PA->>DB: write PacerDecision (hold) with signal values
end
end
PA-->>BE: summary of decisions
BE-->>UI: session summary including any Pacer changes
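The per-lever branch in Scenario B reduces to a pure function of the computed signals, the cooldown, and the drift flag, which is what makes Pacer decisions reproducible from stored samples. A minimal sketch, assuming every threshold is a "value must be at least X" comparison (real signals may need other comparators); `Signal` and `decide` are hypothetical names:

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Sequence


@dataclass(frozen=True)
class Signal:
    """One computed signal and the threshold it is compared against."""
    name: str
    value: float
    threshold: float

    @property
    def passed(self) -> bool:
        return self.value >= self.threshold


def decide(
    signals: Sequence[Signal],
    cooldown_expires_at: datetime,
    drift_triggered: bool,
    now: datetime,
) -> str:
    """Return 'advance', 'suggest_ease_back', or 'hold' for one lever."""
    cooldown_elapsed = now >= cooldown_expires_at
    thresholds_met = bool(signals) and all(s.passed for s in signals)
    if thresholds_met and cooldown_elapsed:
        return "advance"
    if drift_triggered:
        return "suggest_ease_back"
    return "hold"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the decision deterministic under test and during retroactive calibration.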
Session-end orchestration (API-layer responsibility): the Backend, on POST /session/end, drives the following ordered sequence:
- Vocabulary Engine finalises any pending state on the session.
- Pronunciation Engine flushes any pending coverage updates.
- Pacer's evaluate_after_session(session_id) runs.
- The session summary is composed and returned.
The Pacer is never invoked as a side effect of any other module's work. Only the API layer triggers it. This keeps the modules decoupled: the Vocabulary Engine doesn't know the Pacer exists, the Pronunciation Engine doesn't know the Pacer exists, and bugs in upstream modules cannot accidentally trigger Pacer evaluations on inconsistent state.
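One way to keep that ordering explicit is a single API-layer function that receives the three module entry points as callables, so no module imports another. This is a sketch under that assumption. The function and parameter names are hypothetical, and the summary shape is a placeholder.

```python
from typing import Callable


def end_session(
    session_id: str,
    finalise_vocab: Callable[[str], None],
    flush_pronunciation: Callable[[str], None],
    evaluate_pacer: Callable[[str], list],
) -> dict:
    """Run the session-end steps in their required order and build a summary."""
    finalise_vocab(session_id)       # 1. Vocabulary Engine settles pending state
    flush_pronunciation(session_id)  # 2. Pronunciation Engine flushes coverage
    decisions = evaluate_pacer(session_id)  # 3. Pacer sees consistent data
    return {"session_id": session_id, "pacer_decisions": decisions}
```

Because the callables are injected, a unit test can substitute recording stubs and assert the call order without touching the database.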
Localhost only. Single machine. Run via docker compose up or a package.json script. Database is a SQLite file in a known location (e.g. ~/.castellano/db.sqlite). Audio in ~/.castellano/audio/. Backups via the user's existing cloud sync.
- Frontend: React + Vite. Confirmed.
- Backend: Python with FastAPI. Confirmed.
- Database: SQLite. Single user, embedded, zero ops. Use SQLAlchemy + Alembic for schema migrations.
- FSRS library: fsrs Python package — Apache 2.0, well-maintained, the same algorithm Anki has shipped as a built-in scheduler option since version 23.10.
- STT: OpenAI Whisper API as default. Pluggable interface.
- TTS: ElevenLabs as default (best Castilian voices), Azure Neural TTS as fallback. Pluggable.
- Pronunciation Assessment: Pluggable PronunciationScorer interface. V1 default = WhisperHeuristicScorer (uses Whisper API, transcript-match boolean — already paying for Whisper for STT, so marginal cost is zero). Future = AzurePronunciationScorer (Azure Speech Pronunciation Assessment with phoneme-level scoring) — pluggable as a configuration change when desired. Acknowledge the V1 limitation: Whisper normalises non-native pronunciation toward correct text, so the heuristic will be over-generous; Speaking mode's coverage metrics will reflect "intelligible enough for Whisper" rather than "Castilian-accurate." Acceptable trade-off for V1.
- Claude routing — confirmed split for cost control:
  - Opus (claude-opus-4-7) — used for: response evaluation in daily sessions, weekly review assessment, CEFR level estimation, conversational on-ramp comprehension evaluation, grammar-construction extraction. The high-stakes "is this correct" judgements where errors compound.
  - Sonnet (claude-sonnet-4-6) — used for: exercise generation, conversational tutor turns, content seed structuring during Milestone 6. The high-volume creative work where speed matters and minor variation is fine.

  Estimated split at 30 min/day: ~70% Sonnet calls, ~30% Opus calls by count, but Opus dominates cost share. Monthly target ~$15 for Anthropic API. [ASSUMPTION — based on current Anthropic pricing; verify and adjust split if cost runs hot. Hard kill switch in §14.]
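To make the "pluggable scorer" idea concrete, here is a rough sketch of what the interface and the V1 heuristic could look like. It is not the contract signature: the method name `score`, the `PronunciationResult` fields (borrowed from the PronunciationAttempt entity), and the injected `transcribe` callable are all assumptions, and the match rule (case-insensitive equality after trimming) is a placeholder until Q-013 is settled.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class PronunciationResult:
    correct: bool
    confidence: float  # 0-1
    raw_response: str  # kept for debugging


class PronunciationScorer(Protocol):
    """Anything with a name and a score() method can be configured as the scorer."""
    name: str

    def score(self, audio: bytes, reference_text: str) -> PronunciationResult: ...


class WhisperHeuristicScorer:
    """Marks an attempt correct when the transcript matches the reference text."""
    name = "whisper_heuristic"

    def __init__(self, transcribe: Callable[[bytes], str]) -> None:
        # The STT call is injected so this class does not hard-code a provider client.
        self._transcribe = transcribe

    def score(self, audio: bytes, reference_text: str) -> PronunciationResult:
        transcript = self._transcribe(audio)
        matched = transcript.strip().casefold() == reference_text.strip().casefold()
        # The heuristic has no graded confidence, so it reports 1.0 or 0.0.
        return PronunciationResult(matched, 1.0 if matched else 0.0, transcript)
```

An Azure-backed scorer would implement the same `score` method and be selected in configuration, leaving Speaking mode unchanged.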
| Provider | Purpose | Failure mode | Fallback |
|---|---|---|---|
| Anthropic | Generation, evaluation, conversation, assessment | API down or rate-limited | Show clear error; cached exercises for offline-ish drill mode? [OPEN] |
| STT provider | Speech → text | API down | Fall back to typed input, surface mic-disabled state |
| TTS provider | Text → speech | API down | Show text only; mute the audio button |
The backend exposes a small REST API for the SPA. Sketched:
- POST /session/start { mode: vocabulary | sentences | speaking | conversation } → session ID + first item appropriate to the mode
- POST /exercise/respond { exerciseId, response } → evaluation + state delta (used by Vocabulary, Sentences, Speaking modes)
- POST /session/end { sessionId } → session summary including Pacer decisions made, on-ramp additions, pronunciation coverage updates
- POST /explain { sessionId, contextRef } → English-language explanation of the current item/turn, streamed; does not advance or end the session
- GET /vocab → semantic confirmed list, paginated
- GET /vocab/:wordId → all senses + history + pronunciation coverage status
- POST /vocab/word { word, candidate_senses } → manual addition
- POST /conversation/turn { sessionId, audioOrText } → tutor response (audio + text)
- GET /pronunciation/coverage → phoneme coverage + word-pronunciation coverage percentages, plus per-phoneme breakdown
- POST /pronunciation/score { targetType, targetValue, audio } → uses configured scorer to evaluate (used by Speaking mode)
- GET /progress → CEFR snapshot
- POST /review/weekly → run the weekly review session
- GET /pacer → current state of all four levers + signals computed at last evaluation + cooldown status
- GET /pacer/decisions?lever=&since= → browseable decision log
- POST /pacer/regress { lever, target_state } → manual user-initiated regression on a lever
- POST /pacer/dismiss-suggestion { lever } → user dismissing a suggest-ease-back notification
OpenAPI schema lives in the repo. Single user, no auth, but bind to 127.0.0.1 only — never expose to the network.
Seven primary screens:
- Dashboard — vocab count, due today, current CEFR estimate, pronunciation coverage bars (phonemes %, words %), Pacer panel, weekly review status, and four prominent mode-start buttons: Vocabulary, Sentences, Speaking, Conversation. Most-recently-used mode is pre-selected.
- Vocabulary mode session screen — single exercise, response area, feedback panel, progress bar, persistent "Explain in English" button.
- Sentences mode session screen — sentence-level exercise display, response area, feedback panel, progress bar, persistent "Explain in English" button.
- Speaking mode session screen — current prompt (word/phrase/sentence), large push-to-talk button, last-attempt feedback (correct / not yet, with confidence), running coverage display.
- Conversation mode screen — chat-style transcript, large push-to-talk button, current topic header, words-surfaced sidebar, "Explain in English" button and voice command, on-ramp evaluations shown post-session.
- Vocabulary browser — searchable, filterable list of all tracked words and senses; shows semantic state and pronunciation coverage state in separate columns.
- Pacer detail screen — per-lever drilldown with decision log, signal history, manual regression controls.
Plus a Settings screen for provider keys, scorer selection (Whisper-heuristic vs Azure when added), voice selection, weekly review schedule.
Personal project, localhost-only — but a few things still matter.
- API keys for Claude / STT / TTS go in a .env file, never in code, never in localStorage. Backend reads them; UI never sees them.
- Bind to 127.0.0.1 explicitly. If you ever run this on a laptop on hotel WiFi and bind to 0.0.0.0 by accident, you've shared your Claude key with the hotel.
- No auth on the local API is fine only because of the binding above. If that ever changes, add auth.
- Audio files stored unencrypted on local disk. Reasonable for personal use; full-disk encryption (FileVault, BitLocker) is your friend here.
- Prompt injection. The user (you) is the only source of input, so the threat is low — but the tutor mode will be feeding Claude its own prior outputs as conversation context, which is a classic spot for self-injection drift. Mitigation: clear system prompt boundaries, structured output schemas where possible, and don't let the model's prior outputs influence the validation logic.
- No secret-ful logs. Don't log API request bodies that include the keys.
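The loopback-only rule in the list above is easy to break with a stray config value, so a startup guard can help. The sketch below uses only the standard-library `ipaddress` module; the function name is hypothetical, and wiring it into the server launch is left out.

```python
import ipaddress


def require_loopback(host: str) -> str:
    """Return host unchanged if it is a loopback address; raise ValueError otherwise."""
    if host == "localhost":
        return host
    try:
        addr = ipaddress.ip_address(host)
    except ValueError as exc:
        raise ValueError(f"not an IP address or 'localhost': {host!r}") from exc
    if not addr.is_loopback:
        raise ValueError(f"refusing to bind to non-loopback address {host}")
    return host
```

Calling `require_loopback(configured_host)` before the server starts turns an accidental 0.0.0.0 into a startup failure instead of a silent exposure.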
You are the only user. No GDPR, no CCPA, no HIPAA, no PCI. You have a duty of care to yourself; that's it.
One genuine consideration: third-party processors see your speech. Whisper, ElevenLabs, etc. process your voice recordings on their servers. Read their data-use policies. OpenAI explicitly says API audio is not used for training; ElevenLabs has a similar policy. Worth confirming before onboarding any new provider.
Personal project. Don't over-build this. But:
- Logs. All Claude calls, STT calls, TTS calls logged with cost estimate (token counts, audio seconds). Daily tally to a local file. This is how you'll catch a runaway prompt that's burning money.
- No metrics stack. Don't run Prometheus on your laptop for this.
- Health. A /health endpoint that pings each upstream provider with a trivial call. Useful when something seems broken.
- The one alert that matters: monthly API cost over budget. A simple cron-ish check that emails you (or just shows a banner in the UI) if month-to-date cost exceeds the threshold.
There is no deployment. There is git pull && docker compose up. There is no release; there is main.
Migration strategy: Alembic migrations against the SQLite file. Run on app start.
If you ever want to share this with someone else, this whole section needs rewriting.
Revised back-of-envelope at ~30 min/day, with v0.4 changes:
- Claude. Sonnet handles ~70% of calls (exercise generation, tutor dialogue, English explanations) at ~$0.30/day. Opus handles ~30% of calls (evaluation, assessment, comprehension judgement) at ~$0.50/day. Adding the Explain-in-English mechanic adds maybe 5–10 calls/day at Sonnet pricing — negligible. Total ~$15-20/month.
- STT / Whisper. $0.006/min × 30 min/day × 30 days ≈ $5/month. Whisper now serves double duty (transcription + V1 pronunciation scorer) so no separate pronunciation cost in V1.
- TTS. ElevenLabs paid tier ~$5–11/month depending on tier.
- Azure Pronunciation (deferred). Not in V1 cost. When enabled later, expect ~$15/month additional.
Total V1 estimate: $25–35/month — comfortably under the new $60 cap. Headroom for adding Azure pronunciation later ($40–50/month total) or for higher usage.
Kill switches:
- Daily soft cap: pause Opus calls if daily cost > $2; fall back to Sonnet for everything except weekly review.
- Monthly hard cap: refuse new Claude calls if month-to-date > $60 budget. Recover next month or with explicit override.
- (Future, when Azure pronunciation is enabled) Daily soft cap on pronunciation calls.
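The two V1 kill switches can be sketched as a routing check that runs before every Claude call. The cap values come from the list above. The function name, the `"opus"` / `"sonnet"` labels, and the assumption that running cost totals are already available from the cost log are all hypothetical.

```python
DAILY_SOFT_CAP_USD = 2.00
MONTHLY_HARD_CAP_USD = 60.00


class BudgetExceeded(RuntimeError):
    """Raised when the monthly hard cap blocks a new Claude call."""


def pick_model(
    preferred: str,
    daily_cost_usd: float,
    month_cost_usd: float,
    *,
    is_weekly_review: bool = False,
    override: bool = False,
) -> str:
    """Return the model to use for this call, or raise if the hard cap applies."""
    if month_cost_usd > MONTHLY_HARD_CAP_USD and not override:
        raise BudgetExceeded(
            f"month-to-date ${month_cost_usd:.2f} exceeds ${MONTHLY_HARD_CAP_USD:.2f}"
        )
    over_daily_cap = daily_cost_usd > DAILY_SOFT_CAP_USD
    if preferred == "opus" and over_daily_cap and not is_weekly_review:
        return "sonnet"  # soft cap: downgrade instead of refusing
    return preferred
```

The soft cap degrades evaluation quality rather than blocking the session, while the hard cap fails loudly so the UI can show the budget banner.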
There is no rollout. There is "you start using it." But there is a sensible build order:
Milestone 0 — skeleton (1 weekend). Backend + frontend wired up, mode-selection dashboard, hello-world exercise hardcoded in Vocabulary mode, Claude API call working end-to-end.
Milestone 1 — vocab engine, no AI (1 weekend). Database schema (including PronunciationAttempt, PhonemeCoverage, WordPronunciationCoverage, PacerLever), FSRS integration, semantic state machine implemented and unit-tested with synthetic exercise results. Pronunciation coverage state machine (boolean + 2-correct/3-fail rules) implemented and unit-tested. No Claude in this milestone.
Milestone 2 — Vocabulary mode and Sentences mode (1 weekend). Exercise generation prompts for word-level and sentence-level formats. Sonnet for generation, Opus for evaluation. Both modes end-to-end with typed input. Explain-in-English touchpoint integrated as a button.
Milestone 3 — voice + Conversation mode (1–2 weekends). STT and TTS integrated, push-to-talk in the UI, Conversation mode end-to-end with voice. Conversational on-ramp logic. Explain-in-English voice command added (Conversation mode only).
Milestone 4 — Speaking mode and Pronunciation Engine (1 weekend). PronunciationScorer interface defined. WhisperHeuristicScorer implementation. Speaking mode UI with prompts, push-to-talk, immediate feedback. Phoneme coverage and word-pronunciation coverage displays on dashboard.
Milestone 5 — weekly review and CEFR dashboard (1 weekend). The validation gate from CANDIDATE → CONFIRMED, the assessment session, the level estimator (Opus). Weekly review samples broadly across modes including a small pronunciation re-check for decay detection.
Milestone 6 — Pacer (1 weekend). Four levers, signal computation, advancement logic, decision log, dashboard panel. Mostly pure-logic code that operates on data already produced by milestones 1–5.
Milestone 7 — content seed (1–2 weekends). Scrape the CVC Plan curricular A1-A2 inventories. Use Claude (Sonnet) to structure into JSON. Spot-check by hand. Load into database. Seed the phoneme list (~24 Castilian phonemes) for pronunciation coverage tracking.
Milestone 8 — polish. Cost monitoring dashboard with kill switches, error handling, audio replay UI, Pacer threshold calibration based on first 4 weeks of real use, prompt regression testing infrastructure. Possibly: integrate AzurePronunciationScorer if Whisper-heuristic is proving too generous.
Start using the system for real after Milestone 4 — that's when all four modes are functional. Milestones 5–8 add structure and polish but the four core modes work as of M4.
| ID | Risk | Probability | Impact | Mitigation |
|---|---|---|---|---|
| R-001 | Claude evaluations are inconsistent — same response graded differently across sessions | Med | High | Use structured output schemas, low temperature, Opus for evaluation, sample-test a fixed set of (response, expected_grade) pairs before each prompt change |
| R-002 | V1's Whisper-heuristic pronunciation scorer is too generous because Whisper normalises non-native pronunciation. Coverage % will drift up faster than actual pronunciation skill. | High | Med | Acknowledged limitation. Treat V1 coverage % as "intelligibility coverage" not "Castilian-accuracy coverage." Plan to swap in AzurePronunciationScorer in M8 if the heuristic proves uninformative. |
| R-003 | The "weekly review" gate is too strict and confirmed vocabulary grows too slowly | Med | Med | Start strict (per the spec); relax thresholds based on data after 4 weeks of use |
| R-004 | The "weekly review" gate is too lenient and confirmed vocabulary inflates | Med | Med | Same — calibrate after 4 weeks |
| R-005 | API cost overruns | Low | Med | Hard kill switch in §14; daily cost log |
| R-006 | Prompt drift over time as you tweak prompts | High | Med | Version every prompt template; log which version produced each exercise; keep a regression set of (prompt, expected output shape) pairs |
| R-007 | FSRS parameters tuned for native-language flashcards behave oddly for L2 vocabulary acquisition with sense-level granularity | Med | Med | Start with FSRS defaults; collect ≥1000 reviews; re-fit parameters using the FSRS optimizer with your own data |
| R-008 | Single-user assumption is violated (you let a friend try it) and the data model doesn't support it | Low | Low | Ignore until it happens; the migration is straightforward (add a user_id column) |
| R-009 | Castilian-specific vocabulary surfaces issues with Claude's default Spanish (which leans Latin American) | Med | Low | Pin the system prompt: "Castilian Spanish from Spain. Use vosotros. Use peninsular vocabulary (coger, ordenador, zumo). Avoid Latin Americanisms." Spot-check generations. |
| R-010 | Speech latency makes conversational mode feel sluggish | Med | Med | Stream Claude responses as they generate; start TTS on the first sentence rather than waiting for full response |
| R-011 | Pacer thresholds are wrong out of the box and either advance too aggressively (frustrating) or too slowly (boring) | High | Med | Configurable thresholds (FR-030); explicit calibration pass at week 4 of real use; log every signal value computed so retroactive analysis is possible |
| R-012 | Pacer advances on noisy short-window data — e.g. one good 7-day window gets you advanced, then accuracy drops the next week | Med | Med | Cooldown (5 days) gives FSRS time to redistribute; FR-021 also requires high CANDIDATE→CONFIRMED rate at the most recent weekly review, which is a slower-moving signal that filters out noise |
| R-013 | Manual-only regression means user lets things drift past the point where regression would help, because regressing feels like failure | Med | Med | The "Suggest ease back" mechanic (FR-026) explicitly normalises the action by surfacing the option non-judgementally, with the data justifying it; the decision log makes regression a recordable choice rather than a hidden setback |
| R-014 | Grammar-construction extraction (used by Lever 3 advancement signals) requires Claude to reliably tag what tense/mood you used in your responses, which is itself an evaluation task and could be wrong | Med | Med | Use Opus for this extraction; build a regression set of (response, expected tags) pairs; tolerate noise by requiring 3+ unprompted productions before advancement (FR-023) — single misclassification doesn't matter |
| R-015 | Pacer adds complexity to the Tutor's prompt construction (4 levers' worth of state injected into every exercise/conversation prompt), increasing token usage and the risk of prompt drift | Med | Low | Keep Pacer-state injection terse and structured (a small JSON object the prompt references), not prose; version Pacer-prompt templates separately from base templates |
| R-016 | STT (Whisper) auto-corrects beginner pronunciation errors before the Tutor sees the transcript, so semantic evaluation is based on what the user would have said correctly, not what they actually said | High | Med | This is now somewhat mitigated because we have a separate Pronunciation Engine for accent feedback; but for content evaluation, accept the limitation. Pronunciation correctness is a separate track from semantic correctness. |
| R-017 | Conversational on-ramp fast-tracks words that the user actually didn't understand (Claude evaluates "handled well" too generously) | Med | Med | Conservative bar in the comprehension prompt: require explicit comprehension evidence (the user used the word back, or responded specifically to its meaning), not just absence of confusion; log the evaluation reasoning for review |
| R-018 | Cost overruns due to pronunciation engine pricing per audio minute (applies once AzurePronunciationScorer is enabled; the V1 Whisper-heuristic scorer adds no separate cost) | Med | Med | Daily kill switch on Azure pronunciation calls (§14); rate-limit to scoring N utterances per session, not all of them, if needed |
| R-019 | CVC scrape produces noisy structured data (HTML inconsistent across pages, words in tables vs. lists, sense disambiguation requires linguistic judgement) | High | Med | This is content work expected to need manual cleanup. Plan for ~20% of seed words needing hand-correction. Don't try to fully automate it. |
| R-020 | Agentic codegen produces working code that doesn't survive the first weekend's real use (subtle bugs in the FSRS integration, off-by-one in scheduler, etc.) | High | Med | Heavy unit testing on Vocabulary Engine and Pacer is non-negotiable. These are the modules where agentic generation is most likely to produce something that seems right but isn't. Real session data the second weekend will surface issues; budget for one debugging weekend after M3. |
| R-021 | The "Explain in English" touchpoint becomes a crutch — user invokes it constantly and never builds Spanish-only fluency | Med | Med | Log invocations and surface a "you used Explain 12 times in 5 sessions" gentle nudge on the dashboard; do NOT rate-limit or block it (defeats the tutor-feel goal). Trust the user to self-correct once the data is visible. |
| R-022 | Modes get used unevenly — user always picks Vocabulary, never Speaking, and gets a lopsided skill profile | Med | Med | Dashboard surfaces "last session per mode" timestamps; weekly review samples across all four modes regardless of recent usage to keep coverage honest |
| R-023 | The boolean-with-decay pronunciation tracking misses real progress — user pronounces a word correctly today, fails it tomorrow because of context, decays back to uncovered, gets discouraged | Med | Low | The 3-consecutive-failures threshold (FR-034) is intentionally lenient; one bad session doesn't decay anything. If it still feels punishing in practice, raise to 4 consecutive failures. |
| R-024 | Whisper-as-pronunciation-scorer means Speaking mode looks identical in cost to other voice modes, so users may not realise pronunciation work has lower-quality feedback than they think | Low | Low | Surface scorer name in Speaking mode UI: "scoring via Whisper-heuristic — for Castilian-accurate scoring, configure Azure in Settings." |
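
The decay rule discussed in R-023 can be sketched as a small per-word state machine. This is a minimal illustration, not the Pronunciation Engine's actual contract: the names `WordCoverage`, `Outcome`, and `record` are hypothetical, and two behaviours are assumptions not stated in this spec (a correct attempt resets the failure streak, and re-confirmation starts from zero after a decay). The threshold is a parameter so the "raise to 4" fallback is a configuration change.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    CORRECT = "correct"
    INCORRECT = "incorrect"
    NOT_DETECTED = "not_detected"  # operational outcome; does not affect coverage


@dataclass
class WordCoverage:
    """Hypothetical per-word pronunciation coverage state."""
    decay_after: int = 3       # consecutive failures before a covered word decays
    confirm_sessions: int = 2  # distinct sessions with a correct attempt needed to cover
    covered: bool = False
    correct_sessions: set[str] = field(default_factory=set)
    failure_streak: int = 0

    def record(self, outcome: Outcome, session_id: str) -> bool:
        """Apply one scored attempt and return the resulting covered flag."""
        if outcome is Outcome.NOT_DETECTED:
            return self.covered  # neither confirms nor counts as a failure
        if outcome is Outcome.CORRECT:
            self.failure_streak = 0  # assumption: any correct attempt breaks the streak
            self.correct_sessions.add(session_id)
            if len(self.correct_sessions) >= self.confirm_sessions:
                self.covered = True
        else:
            self.failure_streak += 1
            if self.failure_streak >= self.decay_after:
                self.covered = False
                self.correct_sessions.clear()  # assumption: confirmation restarts
                self.failure_streak = 0
        return self.covered
```

With the default of 3, a single bad session of two failures leaves a covered word covered; only a third consecutive failure decays it.
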
Most open questions from v0.2 and v0.3 have been closed by user direction. Two remain open, plus one new question introduced in v0.4:
| ID | Question | Why it matters | Needed by |
|---|---|---|---|
| Q-011 | Does the conversational on-ramp surface the new word visually during the conversation (subtle highlight, footnote with translation) or strictly through the post-session summary? | UX design choice with real learning-experience implications | Milestone 3 |
| Q-012 | When pronunciation coverage and semantic mastery for a word disagree (e.g. semantic CONFIRMED but word is uncovered for pronunciation), how is this surfaced to the user? Default proposal: vocabulary count is purely semantic; a separate "speakable vocabulary" count reports the intersection. | Affects dashboard headline metrics | Milestone 4 |
| Q-013 (NEW) | When the Whisper-heuristic scorer judges an utterance as "correct," what's the threshold? Strict equality (transcript matches expected exactly), normalised equality (case/punctuation/diacritics ignored), or fuzzy match (Levenshtein distance below threshold)? | Determines V1 Speaking mode's leniency | Milestone 4 |
Closed questions: Q-001 through Q-010 (per v0.2 and v0.3 user direction), plus Q-007-revised (cost cap raised to $60).
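
To make the three Q-013 options concrete, the sketch below implements each as a mode of one comparison function. It does not decide the question. The function names (`normalise`, `levenshtein`, `matches`) and the `max_distance` default are hypothetical, and the code uses only the Python standard library. One caveat worth weighing when closing Q-013: ignoring diacritics also folds ñ into n, so "año" and "ano" would compare equal.

```python
import string
import unicodedata


def normalise(text: str, strip_diacritics: bool = True) -> str:
    """Lower-case, optionally drop diacritics, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).casefold()
    if strip_diacritics:
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    punctuation = string.punctuation + "¿¡"  # Spanish opening marks are not in string.punctuation
    text = "".join(ch for ch in text if ch not in punctuation)
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost))
        previous = current
    return previous[-1]


def matches(transcript: str, expected: str, mode: str = "normalised", max_distance: int = 1) -> bool:
    """Compare a transcript to the expected text under one of the Q-013 options."""
    if mode == "strict":
        return transcript == expected
    t, e = normalise(transcript), normalise(expected)
    if mode == "normalised":
        return t == e
    if mode == "fuzzy":
        return levenshtein(t, e) <= max_distance
    raise ValueError(f"unknown mode: {mode}")
```

Running the same recorded utterances through all three modes during Milestone 4 would show how much leniency each option actually buys.
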
| ID | Assumption | Section |
|---|---|---|
| A-001 | Time budget: fast, agentic development; one weekend per milestone target | §2.4, §15 |
| A-002 | Cost ceiling raised to $60/month for V1 (was $50 in v0.3, $30 in v0.2) | §2.4, §14 |
| A-003 | Python/FastAPI backend, React + Vite frontend — confirmed, no longer assumption | §8.6 |
| A-004 | Whisper API for STT, ElevenLabs for TTS, Azure Speech for pronunciation assessment | §8.6 |
| A-005 | Local SQLite, no cloud DB | §8.5 |
| A-006 | Castilian variant pinned via system prompt; no separate fine-tuning | §8.6, R-009 |
| A-007 | Pronunciation grading in scope for V1 via pluggable PronunciationScorer; V1 default = WhisperHeuristicScorer; Azure later (was: Azure required in V1) | §4.1 #7, §8.6 |
| A-008 | Single-user, localhost-bound, no auth | §3, §10 |
| A-009 | Pacer pacing setting "moderate" — thresholds at the values listed in FR-021 through FR-024 | §4.5, §5 |
| A-010 | Pacer is advance-only; regression is a manual user action triggered from the dashboard | §4.5, FR-025 |
| A-011 | All four Pacer levers visible to user as separate dashboard panels, not a single combined dial | §4.1, §9.3 |
| A-012 | Pacer threshold defaults will be calibrated against ≥4 weeks of real-use data before being treated as final | §16.1 R-011 |
| A-013 | Breadth lever advances in batches of 10 word-senses (default, configurable) | FR-029 |
| A-014 | Audio recordings retained indefinitely (no rotation policy); user has personal storage to absorb the volume | §6.5 |
| A-015 | Opus/Sonnet routing per §8.6 to balance evaluation quality with monthly cost | §8.6, §14 |
| A-016 | Conversational on-ramp acts as automatic vocabulary growth pathway with bidirectional outcomes (fast-track or priority-drill) | §4.6, FR-036–039 |
| A-017 | Seed vocabulary acquired by scraping CVC's Plan curricular HTML pages and structuring with Claude (Sonnet); manual cleanup expected for ~20% of entries | §15 Milestone 7, R-019 |
| A-018 | User experience structured around four explicit modes: Vocabulary, Sentences, Speaking, Conversation. Each is a separate session type with its own start button on the dashboard. | §4.1 #3, §4.3, FR-040 |
| A-019 | Pronunciation tracking uses a coverage model (numerator over denominator), not a mastery model. Two coverages: phoneme % out of ~24, word-pronunciation % out of active pool. | §4.7, FR-035 |
| A-020 | Pronunciation transitions are boolean per attempt with confirmation (2 corrects in different sessions = covered) and decay (3 consecutive failures = uncovered). | §4.7, FR-034 |
| A-021 | "Explain in English" is a first-class interaction primitive available in every Spanish-content mode, invocable as button always and as voice command in voice modes. | FR-042–044 |
| A-022 | Pronunciation Engine is invoked only by Speaking mode in V1; Conversation mode does not score pronunciation per utterance (deferred to V2). | FR-045 |
| A-023 | FR-017 (manual approval buffer for tutor-surfaced words) was deleted in v0.5; the §4.6 conversational on-ramp supersedes it. Adding a manual approval gate would defeat the on-ramp's purpose, which is to make conversation an automatic vocabulary growth path. The post-session summary (FR-039) provides visibility without creating a gate; misclassifications are handled by manual regression. | §4.6, FR-039 |
| A-024 | FSRS rating mapping is INCORRECT→Again, PARTIAL<0.7→Hard, PARTIAL≥0.7→Good, CORRECT→Good. The Easy rating is reserved for V2. | §5 FR-046 |
| A-025 | Pacer cooldowns are calendar time, not active-use time. A break still counts toward the cooldown. | §4.5 |
| A-026 | FsrsState is created lazily on first exposure (UNTESTED → LEARNING), not at sense registration. | §6.2 |
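
The rating mapping in A-024 is small enough to pin down in code. The sketch below is illustrative only. `ExerciseOutcome` is named in FR-046, but its members, the local `Rating` enum, and the `partial_score` argument (assumed to be the 0 to 1 value that the 0.7 threshold is compared against) are assumptions here.

```python
from enum import Enum
from typing import Optional


class ExerciseOutcome(Enum):
    INCORRECT = "incorrect"
    PARTIAL = "partial"
    CORRECT = "correct"


class Rating(Enum):
    """Local stand-in for the FSRS rating scale."""
    AGAIN = 1
    HARD = 2
    GOOD = 3
    EASY = 4  # reserved for V2; never returned by the V1 mapping


def to_fsrs_rating(outcome: ExerciseOutcome, partial_score: Optional[float] = None) -> Rating:
    """Map an exercise outcome to an FSRS rating per A-024."""
    if outcome is ExerciseOutcome.INCORRECT:
        return Rating.AGAIN
    if outcome is ExerciseOutcome.CORRECT:
        return Rating.GOOD
    if partial_score is None:
        raise ValueError("PARTIAL outcomes need a partial_score in [0, 1]")
    return Rating.GOOD if partial_score >= 0.7 else Rating.HARD
```

Because CORRECT and a strong PARTIAL both map to Good, V1 never produces Easy, so intervals grow no faster than the Good path allows.
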
- CEFR — Common European Framework of Reference for Languages. The A0/A1/A2/B1/B2/C1/C2 scale.
- A0 — Pre-A1, "absolute beginner" / "breakthrough." Not officially in the CEFR but commonly used.
- FSRS — Free Spaced Repetition Scheduler. Modern open-source algorithm built into Anki since version 23.10 (late 2023). A successor to the older SM-2.
- Sense — a specific meaning of a word. Quedar has senses including "to remain", "to agree to meet", "to fit (clothing)", "to look (a certain way)."
- STT / TTS — Speech-to-text / text-to-speech.
- Stretch budget — the number of words above your current vocabulary that the tutor is permitted to introduce in a single conversation.
- Thematic domain — CEFR groups vocabulary by topic area (greetings, family, food, weather, etc.). Used for coverage tracking.
- Pacer — the macro-progression engine that decides when to push the user along each of four difficulty levers.
- Lever — one axis of difficulty controlled independently by the Pacer: Breadth, Depth, Grammar, or Production.
- Active pool — the set of word-senses currently eligible to appear in exercises and tutor speech, governed by the Breadth lever's state.
- Cooldown — minimum time after a Pacer advancement before the same lever can be evaluated for advancement again. Prevents overshoot from stale signal.
- Drift signal — Pacer-computed indicator that the user is failing on recently-advanced content, used to surface non-blocking "ease back" suggestions.
- Advance-only ratchet — the Pacer never automatically reduces a lever's state; only the user can do that, manually, from the dashboard.
- Pronunciation mastery — separate from semantic mastery; tracks whether the user can say a word correctly, as judged by the configured PronunciationScorer (Whisper heuristic in V1; phoneme-level scoring via Azure Speech later). A word can be at CONFIRMED for semantic mastery but uncovered for pronunciation.
- Conversational on-ramp — the mechanic by which words introduced in tutor speech are automatically routed into the vocabulary tracking system based on Claude's evaluation of whether the user comprehended them. See §4.6.
- Priority drill — a flag set on a word when the user struggled with it during conversation; ensures the next daily session pulls that word in for explicit drill ahead of FSRS scheduling.
- CVC — Centro Virtual Cervantes; the Instituto Cervantes's online resource portal, hosting the Plan curricular HTML.
- Mode — a top-level user-facing session type. The system has four: Vocabulary, Sentences, Speaking, Conversation. Each has its own UI screen, exercise mix, and success criteria.
- PronunciationScorer — pluggable interface that takes audio + expected text and returns a boolean correct/incorrect plus confidence. V1 implementation is WhisperHeuristicScorer; future implementations include AzurePronunciationScorer.
- Coverage (pronunciation) — the percentage-based metric for pronunciation skill. Two flavours: phoneme coverage (out of ~24 Castilian phonemes) and word-pronunciation coverage (out of active pool). Distinct from semantic vocabulary count.
- Explain in English — first-class interaction primitive invocable in every Spanish-content mode; sends current context to Claude for an English-language explanation without ending or advancing the session.
The spec's most opinionated claim is FR-003 + FR-004: a word-sense isn't "yours" until you've used it correctly in three different exercise types over at least 24 hours, and survived a fresh-context test in the next weekly review. That's slow. That's deliberate. The reason: vocabulary apps that count a word as "learned" after one or two correct answers are measuring recognition, not mastery, and produce inflated numbers that don't survive contact with real Spanish. By raising the bar — multiple formats, time-spaced, fresh contexts at review — the confirmed count is a number you can actually trust.
The cost is that growth feels slow at first. That's fine; calibration after 4 weeks of real use will tell you if the gates are too strict.
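The gate described above can be written down as two predicates. This is a sketch under stated assumptions, not the Vocabulary Engine's contract. The `Attempt` type and the function names are hypothetical, the format strings in the usage example are placeholders rather than confirmed `ExerciseFormat` values, and "over at least 24 hours" is read as "first and last correct attempts are at least 24 hours apart".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Attempt:
    exercise_format: str  # placeholder for an ExerciseFormat value
    at: datetime
    correct: bool


def is_candidate(attempts: list[Attempt], min_formats: int = 3,
                 min_span: timedelta = timedelta(hours=24)) -> bool:
    """True once correct use spans enough distinct formats and enough elapsed time."""
    correct = [a for a in attempts if a.correct]
    if not correct:
        return False
    if len({a.exercise_format for a in correct}) < min_formats:
        return False
    times = [a.at for a in correct]
    return max(times) - min(times) >= min_span


def is_confirmed(attempts: list[Attempt], passed_fresh_context_review: bool) -> bool:
    """A sense is confirmed only after the weekly fresh-context test also passes."""
    return is_candidate(attempts) and passed_fresh_context_review
```

Three correct answers in three formats inside one sitting fail the time check, which is the inflated-count case the gate exists to exclude.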
SM-2 (the algorithm Anki originally used and many imitators still use) treats every card the same way: ease factor adjusts up or down, intervals are deterministic. FSRS models each card's memory with three variables (stability, difficulty, retrievability), fits its parameters from your actual review history, and predicts the next interval that hits a target retention rate. In practice it is reported to give ~20–30% fewer reviews for the same retention. Anki has shipped it as a built-in scheduler since version 23.10 (late 2023). Use it.
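To show what "predicts the next interval for a target retention rate" means, the sketch below uses the forgetting curve published for FSRS-4.5, where stability is defined as the interval at which recall probability is 90%. The constants are specific to that version (later versions fit the decay). The function names are illustrative, and the project would rely on a maintained FSRS library rather than this code.

```python
DECAY = -0.5      # FSRS-4.5 power-law decay exponent
FACTOR = 19 / 81  # chosen so that retrievability is exactly 0.9 when elapsed == stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted probability of recall after elapsed_days for a given stability."""
    return (1 + FACTOR * elapsed_days / stability) ** DECAY


def next_interval(stability: float, desired_retention: float = 0.9) -> float:
    """Days to wait until predicted recall falls to desired_retention."""
    return stability / FACTOR * (desired_retention ** (1 / DECAY) - 1)
```

At a 90% target the interval equals the stability itself; lowering the target lengthens the interval, which is where the reduction in review load comes from.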
The Instituto Cervantes's Plan curricular is the canonical CEFR-aligned reference for Spanish. It lists vocabulary by level and thematic domain, and it's specifically Castilian. Free online. Use this as the source of truth for what counts as A0/A1 vocabulary rather than scraping random word lists.
The Pacer's job is to notice when you have headroom — not just when you're succeeding. These are different things and the distinction matters.
Succeeding means: this morning's session went well. That is a noisy, short-window signal. A system that advanced on this would advance and retreat constantly, and you'd live in a thrashing equilibrium where the system never lets you settle and never lets you breathe.
Headroom means: you are succeeding, and the success isn't effortful, and it isn't on stale, over-rehearsed material, and the validation gate (weekly review) is comfortably passing. That's a slower, more boring signal, and it's the one worth advancing on. The thresholds in FR-021 through FR-024 are written to enforce this distinction:
- 85% accuracy on three formats prevents advancing when you're great at translation but failing at production.
- 70% CANDIDATE→CONFIRMED rate at the most recent weekly review is the slowest-moving of all the signals — it's a once-a-week check on whether the validation gate is really working, not just whether you got lucky on a few exercises. This is the load-bearing signal.
- 5-day cooldown lets new content settle into FSRS rotation before being evaluated. Without it, you'd advance, the new words wouldn't have shown up as failures yet, and you'd advance again.
- The pool-saturation requirement on Breadth (≥80% of pool out of UNTESTED) prevents the case where the Pacer pulls in 10 new words, you don't get to them yet, and the Pacer pulls in 10 more.
Each of these is here because of a specific way the Pacer could fail without it. Removing one is fine if you're calibrating — but know which failure mode you're re-enabling.
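The four safeguards above can be combined into one check that reports which gate is blocking, so that disabling one during calibration is an explicit choice. This is a minimal sketch for the Breadth lever only. `BreadthSignals`, `breadth_blockers`, and the field names are hypothetical, and it assumes that every reported format must clear the accuracy bar and that the cooldown is counted in calendar days. FR-021 through FR-024 remain the authority on the actual thresholds.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class BreadthSignals:
    accuracy_by_format: dict[str, float]  # recent accuracy per exercise format, 0..1
    weekly_confirm_rate: float            # CANDIDATE to CONFIRMED rate at the latest weekly review
    last_advanced_on: Optional[date]      # None if the lever has never advanced
    pool_size: int                        # word-senses in the active pool
    pool_untested: int                    # of those, still UNTESTED


def breadth_blockers(s: BreadthSignals, today: date, *, min_accuracy: float = 0.85,
                     min_formats: int = 3, min_confirm_rate: float = 0.70,
                     cooldown_days: int = 5, min_saturation: float = 0.80) -> list[str]:
    """Return the gates that currently block advancement (empty list = headroom)."""
    blockers = []
    passing = [f for f, acc in s.accuracy_by_format.items() if acc >= min_accuracy]
    # Assumption: at least min_formats formats are reported and all of them pass.
    if len(passing) < min_formats or len(passing) < len(s.accuracy_by_format):
        blockers.append("accuracy")
    if s.weekly_confirm_rate < min_confirm_rate:
        blockers.append("weekly_review")
    if s.last_advanced_on is not None and (today - s.last_advanced_on).days < cooldown_days:
        blockers.append("cooldown")
    saturation = 1.0 if s.pool_size == 0 else (s.pool_size - s.pool_untested) / s.pool_size
    if saturation < min_saturation:
        blockers.append("saturation")
    return blockers
```

Returning the list of blockers rather than a bare boolean lets the dashboard show why a lever is holding, and makes a relaxed threshold visible as a changed parameter.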
The reason there are four levers, rather than one, is that the four kinds of fluency are genuinely independent. You can recognise 1500 words and produce 200; you can handle present indicative comfortably and freeze on subjunctive; you can know one sense of a word fluently and not the others. Pretending these advance together is what produces the Duolingo problem — looking advanced on the dashboard and being unable to hold a conversation. By making them separate, the system is honest about which kind of fluency it has evidence for.