Castellano — Personal Spanish Tutor

Status: Draft v0.5
Date: 30 April 2026
Author: Spec Architect (interviewing user)
Classification: Personal project, single user

Changes from v0.4 (alignment-resolution pass before M0):

  • FR-007 reconciled with the full ExerciseFormat enum, organized by mode affinity, and LISTENING restored.
  • FR-017 (manual approval buffer for tutor-surfaced words) deleted — superseded by the §4.6 conversational on-ramp.
  • §4.7 Pronunciation Engine: documented NOT_DETECTED as a third operational outcome that does not affect coverage; updated the quoted PronunciationScorer interface to match the contract signature; stated the word-vs-sense denominator rule for word-pronunciation coverage.
  • §4.5 cooldown wording: "5 days of active use" → "5 calendar days."
  • §6.2 FsrsState: documented lazy creation on first exposure (UNTESTED → LEARNING).
  • §6.5: pruned the audio-retention open question; retention is indefinite per A-014.
  • New FR-046: FSRS rating mapping from ExerciseOutcome → Again/Hard/Good. Easy reserved for V2.
  • New FR-047: Tutor Engine composes weekly-review sessions (≥2 fresh contexts per CANDIDATE, 1 per maintenance, plus pronunciation sample).
  • §16.3: assumption added recording the FR-017 deletion.

Changes from v0.3:

  • Restructured the user-facing experience into four explicit modes (Vocabulary, Sentences, Speaking, Conversation), each with its own success criteria and exercise mix.
  • Promoted the Pronunciation Engine to a fully independent module with a pluggable scorer interface; V1 ships with a Whisper-based heuristic scorer ("did transcript match?"), and Azure Speech Pronunciation Assessment slots in later as a configuration change.
  • Pronunciation tracking is now coverage-based (numerator/denominator) and decoupled from the semantic vocabulary count.
  • Added the "Explain in English" touchpoint as a first-class interaction primitive.
  • Cost cap raised to $60/month.

Changes from v0.2: Closed eight of ten open questions per user direction. Added pronunciation grading as a V1 feature (now via pluggable interface — see v0.4 changes). Strengthened the conversational stretch-word mechanic into an automatic vocabulary on-ramp.

Changes from v0.1: Added the Pacer module — a macro-progression layer with four independent levers, advance-only.


1. Executive Summary

A browser-based, voice-and-text Spanish tutor for a single learner (the project owner), built around the CEFR proficiency scale. Initial build covers levels A0 → A1, with the architecture extending cleanly to A2 and B1. The system uses Claude (via the Anthropic API) for all language-understanding work — exercise generation, response evaluation, conversational tutoring, and CEFR-level assessment — paired with third-party speech services for STT and TTS.

The system is structured around three layers of progression. The Vocabulary Engine uses Anki-style spaced repetition (FSRS) extended to track word senses rather than words, with a formal "weekly review" gate before a sense is admitted to confirmed vocabulary. The Pacer sits above this and decides when the user is fluent enough at the current level to be pushed harder, advancing along four independent levers: vocabulary breadth, sense depth, grammatical complexity, and production demand. The Tutor Engine uses Claude to generate exercises and conduct dialogue, scoped to whatever the Vocabulary Engine and Pacer say is appropriate for the current moment.

The system teaches Castilian Spanish (Spain), and the owner's target pronunciation, vocabulary choices, and grammatical conventions follow that variety (e.g. vosotros, the /θ/ vs /s/ distinction, coger used freely).

2. Context and Goals

2.1 Problem statement

The owner is starting Spanish at A0 and wants a learning tool that (a) genuinely measures progress against the CEFR scale instead of inventing its own metrics, (b) doesn't pretend a word is "known" after one correct use, (c) supports speaking practice and not just reading/writing, and (d) is a single coherent system rather than a stack of disconnected apps (Duolingo + Anki + iTalki + ChatGPT).

2.2 Goals

  • G1. Reach verified A1 within 6 months of consistent daily use. (Verified = passing an internal CEFR-A1 assessment that mirrors the official descriptors.)
  • G2. Maintain a confirmed vocabulary list where every entry has been validated across all of its senses relevant at the current CEFR level, not just used once.
  • G3. Support voice-in / voice-out as a first-class interaction mode, not a bolt-on.
  • G4. Use Claude as the language-evaluation backbone for everything except STT/TTS.

2.3 Non-goals (V1)

  • Multi-user support, accounts, sharing, or sync across devices. Single user, single device profile.
  • Spanish variants other than Castilian.
  • Levels above A1 in the initial build (architecture supports them; content does not).
  • Mobile-native apps. Browser only; mobile-browser usage is fine but not optimised in V1.
  • Offline mode. Always-online assumed.

2.4 Constraints

  • Single developer, personal time budget. Build incrementally; an A0-only working slice should be usable within ~4–6 weekends. [ASSUMPTION — confirm your time budget.]
  • Anthropic API for all AI work. Non-Anthropic services permitted only where Anthropic has no offering — specifically STT and TTS.
  • Cost. Personal budget, not enterprise. Target steady-state cost under $30/month at daily use. [ASSUMPTION — confirm.]
  • Browser-based. No native desktop or mobile binaries.

2.5 Success metrics

  • Short term (3 months): Confirmed vocabulary ≥ 300 senses (A1 baseline is ~500), daily active use ≥ 5 days/week, weekly review completed each week.
  • Long term (6 months): Pass internal A1 assessment. Confirmed vocabulary ≥ 500 senses. Hold a 5-minute Castilian conversation with the tutor on a familiar topic without falling back to English.

3. Users

A single user: the project owner. No other personas. No admin role; you are the admin.

This is worth stating explicitly because it removes a huge amount of complexity from the spec — no auth flows, no RBAC, no multi-tenancy, no user-content moderation, no abuse prevention beyond what protects the API keys.

4. Scope and Use Cases

4.1 In-scope features (V1, A0 → A1)

  1. Vocabulary engine — word and word-sense tracking, FSRS-based scheduling, validation state machine, confirmed-vocabulary list. Drives the Vocabulary mode.
  2. Pacer — macro-progression engine: monitors performance signals and advances the four difficulty levers (breadth, depth, grammar, production) when the user is demonstrably ready. Advance-only; regression is manual.
  3. Four user-facing modes — the user explicitly chooses what kind of session to start:
    • Vocabulary mode — single-word drill (translation, recognition, definition matching). Fast-paced, ~15 items per session.
    • Sentences mode — sentence-level work (cloze, translation, error correction, production prompts). Slower, ~10 items per session.
    • Speaking mode — pronunciation-focused; user reads or repeats prompts and gets pronunciation feedback. Independent of semantic vocabulary tracking.
    • Conversation mode — open-ended dialogue with the tutor, voice or text, with the on-ramp mechanic active.
  4. Tutor engine (Claude) — generates exercises across modes, conducts conversation, evaluates responses. Maintains "tutor feel" via on-demand English scaffolding (see #6).
  5. Voice I/O — STT for user speech, TTS for tutor speech, push-to-talk in the browser. Available across all modes that need it.
  6. "Explain in English" touchpoint — a first-class interaction primitive available in any mode: a button (and voice command, where appropriate) that asks the tutor to explain the current item, last response, or current concept in English without losing session context. Makes the system feel like a tutor, not a quizmaster.
  7. Pronunciation engine (independent module) — scores user speech against expected pronunciation. V1 ships with a Whisper-based heuristic scorer (boolean: did transcript match?). Architecture supports plugging in Azure Speech Pronunciation Assessment or other phoneme-level scorers later via a PronunciationScorer interface. Pronunciation results are stored on a separate track from semantic vocabulary mastery; the dashboards show them as coverage percentages over a defined denominator (set of phonemes + set of in-pool words to be pronounced).
  8. Weekly review — scheduled assessment session that promotes/demotes word-senses in/out of confirmed vocabulary.
  9. CEFR progress dashboard — current level estimate, vocabulary coverage against A1 thematic domains, grammar competency checklist, per-lever Pacer panel, and separate pronunciation coverage bars (% phonemes covered, % in-pool words pronounceable).
  10. Pacer decision log — every advancement decision recorded with the signals that drove it.
  11. Session history — every exchange logged, replayable, searchable.

4.2 Out of scope (V1)

  • Reading practice from external texts (news, books). Possible for V2.
  • Writing practice with longer compositions. Sentence-level only in V1.
  • Listening practice from external audio (podcasts, films).
  • Grammar drills as a dedicated mode (grammar is evaluated as it surfaces in exercises and dialogue, not drilled).
  • Cultural / regional content beyond what arises naturally.

4.3 Primary user journeys

User journeys are organised around the four modes. The user picks a mode from the dashboard; each mode has its own session shape, exercise mix, and success criteria. The Pacer operates across all of them.

Journey 1 — Vocabulary mode session (10–15 min). User picks "Vocabulary" → system pulls due word-senses (FSRS-scheduled) plus any priority-drill flags from prior conversations → runs ~15 single-word exercises across translation/recognition/definition formats → user gets per-item feedback with the "Explain in English" touchpoint always available → session ends with summary of state changes.

Journey 2 — Sentences mode session (15–20 min). User picks "Sentences" → system generates ~10 sentence-level exercises (cloze, full-sentence translation, error correction, sentence production from prompts) drawing from the active pool → grammar lever's state determines which constructions appear → "Explain in English" available throughout → session ends.

Journey 3 — Speaking mode session (10–15 min). User picks "Speaking" → system presents prompts (single words, then phrases, then sentences) for the user to read aloud or repeat after audio → Pronunciation Engine scores each utterance via the configured scorer (V1: Whisper heuristic; later: Azure phoneme-level) → coverage metrics update → no semantic mastery is affected. This mode is purely about whether the user can produce the sounds correctly.

Journey 4 — Conversation mode session (open-ended, typically 5–10 min of dialogue). User picks "Conversation" → tutor opens with a greeting or topic → push-to-talk dialogue, voice or text → on-ramp mechanic active (untracked words evaluated and routed per §4.6) → "Explain in English" available as a button or as a voice command ("explícame en inglés") → session ends with summary including any vocabulary added via on-ramp.

Journey 5 — Weekly review. Once per week, scheduled session that ignores FSRS scheduling and instead samples broadly across modes: vocabulary knowledge, sentence-level competency, and a small pronunciation sample. Every word-sense currently in CANDIDATE status gets tested in ≥2 fresh contexts; CONFIRMED words get a maintenance check; pronunciation coverage gets re-sampled to detect decay; the user gets a CEFR-level reassessment.

Journey 6 — Vocabulary inspection. User asks "do I know quedar?" → system shows the word, all senses currently tracked, semantic state of each, separate pronunciation status for the word, recent exercise history, next due date.

4.4 Key state machine — word-sense lifecycle

This is the heart of the system. Each (word, sense) pair moves through these states:

                              ┌────────────────┐
                              │   UNTESTED     │
                              │  (just added)  │
                              └────────┬───────┘
                                       │ first exposure exercise
                                       ▼
                              ┌────────────────┐
                              │   LEARNING     │
                              │  needs ≥3 OK   │
                              │  in different  │
                              │  exercise      │
                              │  formats       │
                              └────────┬───────┘
                                       │ 3 successful varied uses
                                       ▼
                              ┌───────────────────────────────┐
                              │           CANDIDATE           │
                              │  in FSRS rotation; awaiting   │
                              │  confirmation in next         │
                              │  weekly review                │
                              └────┬───────────┬──────────────┘
                                   │           │
              passes weekly review │           │ fails weekly review
                                   ▼           ▼
                         ┌──────────────┐  ┌──────────────┐
                         │  CONFIRMED   │  │  LEARNING    │
                         │  counts      │  │  (back down) │
                         │  toward      │  └──────────────┘
                         │  vocabulary  │
                         └──────┬───────┘
                                │ 2 consecutive misses in maintenance
                                ▼
                         ┌──────────────┐
                         │   LAPSED     │
                         │  removed from│
                         │  count;      │
                         │  re-enters   │
                         │  LEARNING    │
                         └──────────────┘

The transitions are spec'd precisely in §5.
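
As a rough illustration of the lifecycle above, the following Python sketch encodes the transitions using the thresholds from FR-003 to FR-005. The names WordSense, record_exercise, and record_weekly_review are hypothetical and not part of any contract. The sketch also makes assumptions the spec leaves open: a correctly answered first exposure counts as a success, a failed weekly review clears accumulated successes, a LAPSED sense moves back to LEARNING on its next exercise, and the 24-hour rule is checked between the first and the most recent success.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class SenseState(Enum):
    UNTESTED = "UNTESTED"
    LEARNING = "LEARNING"
    CANDIDATE = "CANDIDATE"
    CONFIRMED = "CONFIRMED"
    LAPSED = "LAPSED"


@dataclass
class WordSense:
    word: str
    sense: str
    state: SenseState = SenseState.UNTESTED
    successes: list = field(default_factory=list)  # (timestamp, exercise format)
    maintenance_misses: int = 0

    def record_exercise(self, when: datetime, fmt: str, correct: bool) -> None:
        """Apply the result of one ordinary (non-review) exercise."""
        if self.state in (SenseState.UNTESTED, SenseState.LAPSED):
            # Assumption: first exposure, or re-entry after a lapse, restarts LEARNING.
            self.state = SenseState.LEARNING
            self.successes.clear()
        if self.state is SenseState.LEARNING:
            if correct:
                self.successes.append((when, fmt))
            if self._ready_for_candidate():
                self.state = SenseState.CANDIDATE
        elif self.state is SenseState.CONFIRMED:
            self.maintenance_misses = 0 if correct else self.maintenance_misses + 1
            if self.maintenance_misses >= 2:  # 2 consecutive maintenance misses
                self.state = SenseState.LAPSED
                self.maintenance_misses = 0

    def record_weekly_review(self, fresh_context_passes: int) -> None:
        """Only the weekly review can confirm a CANDIDATE."""
        if self.state is not SenseState.CANDIDATE:
            return
        if fresh_context_passes >= 2:
            self.state = SenseState.CONFIRMED
        else:
            self.state = SenseState.LEARNING
            self.successes.clear()  # assumption: a failed review resets progress

    def _ready_for_candidate(self) -> bool:
        # >=3 successes, >=2 formats, >=24h between first and latest success.
        if len(self.successes) < 3:
            return False
        formats = {fmt for _, fmt in self.successes}
        span = self.successes[-1][0] - self.successes[0][0]
        return len(formats) >= 2 and span >= timedelta(hours=24)
```

FSRS scheduling is deliberately left out; the sketch covers only the validation states.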

4.5 Pacer state model

The Pacer holds four levers, each independently advanceable. Each lever has a small set of discrete states; the Pacer never takes fractional steps. State is durable — saved after every advancement and never forgotten.

Lever 1 — Breadth. Controls the size of the active vocabulary pool (the set of word-senses eligible to appear in exercises and tutor speech). States: core_500 → core_1000 → extended_1500 → extended_2000. Numbers correspond roughly to frequency-ranked words from the Plan curricular del Instituto Cervantes. Advancing one step pulls in the next ~500 words at A1 level into the UNTESTED queue.

Lever 2 — Depth. Controls how aggressively secondary senses of known words are introduced. States: primary_only → common_secondary → all_documented. At primary_only, quedar is tracked only in its most-frequent sense ("to remain"); at common_secondary, the next 1–2 most-frequent senses are activated as UNTESTED; at all_documented, every sense in the lexicon entry is tracked.

Lever 3 — Grammar. Controls the grammatical constructions the Tutor expects and uses. States: present_only → +past_tenses → +subjunctive_mood → +conditional_and_compound. Each step adds tense/mood families to the Tutor's "expect and use" list, which feeds into both exercise generation prompts and conversational system prompts.

Lever 4 — Production. Controls the mix between recognition-heavy and production-heavy exercises. States: recognition_heavy (70/30 split favouring translation/cloze/listening) → balanced (50/50) → production_heavy (30/70 favouring use-in-sentence and free dialogue). Affects the daily session's exercise format weighting.

Lever interaction. The levers are independent in that each can advance on its own, but the content the Pacer pulls in is influenced by all four. Adding +past_tenses (Grammar) doesn't add new words to the pool — but it does change which exercises are generated for words already in the pool. Advancing Breadth from core_500 to core_1000 adds words but those words then inherit the current Grammar and Production settings for how they're tested.

                ┌────────────────────────────────────────────────┐
                │               PACER (per lever)                │
                │                                                │
                │   ┌──────────┐  signals met &                  │
                │   │  STATE_N │──cooldown elapsed──────┐        │
                │   └────┬─────┘                        │        │
                │        │                              ▼        │
                │        │ user manually steps     ┌──────────┐  │
                │        │ down (regression)       │ STATE_N+1│  │
                │        ◄─────────────────────────┤          │  │
                │                                  └────┬─────┘  │
                │                                       │        │
                │                           (continues to N+2…)  │
                └────────────────────────────────────────────────┘
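
One way to picture the per-lever behaviour is as an index into an ordered tuple of state names, moved one whole step at a time. The sketch below is illustrative only: LEVER_STATES, Lever, advance, and step_down are hypothetical names, and the state lists are read off the lever descriptions above.

```python
from dataclasses import dataclass

# Ordered states per lever, lowest first (taken from the lever descriptions).
LEVER_STATES = {
    "breadth": ("core_500", "core_1000", "extended_1500", "extended_2000"),
    "depth": ("primary_only", "common_secondary", "all_documented"),
    "grammar": ("present_only", "+past_tenses", "+subjunctive_mood",
                "+conditional_and_compound"),
    "production": ("recognition_heavy", "balanced", "production_heavy"),
}


@dataclass
class Lever:
    name: str
    index: int = 0  # position in LEVER_STATES[name]

    @property
    def state(self) -> str:
        return LEVER_STATES[self.name][self.index]

    def advance(self) -> bool:
        """Pacer-initiated: move exactly one step up; False if already at the top."""
        if self.index + 1 >= len(LEVER_STATES[self.name]):
            return False
        self.index += 1
        return True

    def step_down(self) -> bool:
        """User-initiated only; the Pacer itself never calls this."""
        if self.index == 0:
            return False
        self.index -= 1
        return True
```

Keeping the four levers as separate Lever instances mirrors their independence: advancing one never touches the index of another.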

Cooldown. After any advancement on any lever, that lever cannot advance again for 5 calendar days. Other levers are not affected. Calendar time, not active-use time: a 3-day break still counts toward the cooldown — punishing the user for being away would be the wrong incentive. The cooldown exists because new content needs to enter the FSRS rotation and produce real signal before the system can know whether further advancement is warranted; without it, the Pacer would advance on stale data.
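
The calendar-day rule can be sketched as a simple date comparison. The function name cooldown_elapsed and the constant COOLDOWN_DAYS are hypothetical; treating "exactly 5 days later" as eligible is an assumption about the boundary, and in the real system the value would come from configuration (FR-030).

```python
from datetime import date, timedelta
from typing import Optional

COOLDOWN_DAYS = 5  # per lever; assumed to be loaded from config in practice


def cooldown_elapsed(
    last_advanced: Optional[date],
    today: date,
    days: int = COOLDOWN_DAYS,
) -> bool:
    """True when the lever may advance again.

    Counts calendar days, so days without any session still count.
    """
    if last_advanced is None:  # the lever has never advanced
        return True
    return today - last_advanced >= timedelta(days=days)
```

Because the check uses only dates, a break from studying never extends the cooldown.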

The "Suggest ease back" mechanic. Per the user's choice, the Pacer never automatically regresses. But it does compute a "drift signal" each session: when 7-day accuracy on a lever's recently-advanced content drops below 60%, or when the FSRS system is heavily redistributing toward failed cards, the dashboard shows a non-blocking suggestion: "Production has been at production_heavy for 8 days and accuracy is at 54%. You may want to step it back to balanced." The user can act on it or ignore it. The decision log records both the suggestion and the user's response (or lack thereof).
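
A minimal sketch of the accuracy half of the drift signal follows. DriftInput and ease_back_suggestion are hypothetical names, the FSRS-redistribution criterion is not modelled because the spec does not quantify it, and staying silent when there are no recent attempts is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

DRIFT_ACCURACY_THRESHOLD = 0.60  # assumed to be loaded from config (FR-030)


@dataclass
class DriftInput:
    lever: str
    state: str
    previous_state: str
    days_in_state: int
    correct_7d: int   # correct answers on recently advanced content, last 7 days
    attempts_7d: int  # all attempts on that content, last 7 days


def ease_back_suggestion(d: DriftInput) -> Optional[str]:
    """Return a dashboard message, or None when no suggestion is warranted."""
    if d.attempts_7d == 0:
        return None  # assumption: no recent signal means no suggestion
    accuracy = d.correct_7d / d.attempts_7d
    if accuracy >= DRIFT_ACCURACY_THRESHOLD:
        return None
    return (
        f"{d.lever.capitalize()} has been at {d.state} for {d.days_in_state} days "
        f"and accuracy is at {accuracy:.0%}. "
        f"You may want to step it back to {d.previous_state}."
    )
```

The function only produces text; it never changes lever state, which keeps the suggestion non-blocking.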

4.6 Conversational vocabulary on-ramp

The conversational tutor doesn't just expose the user to stretch words passively — it acts as an active vocabulary on-ramp. Every word the tutor uses that is not already in the user's tracked vocabulary becomes a candidate for tracking, and the tutor's evaluation of the user's response determines what happens next.

When a tutor turn introduces an untracked word (within the conversation's stretch budget — see FR-010), the system tags the word in the conversation log. When the user responds, Claude evaluates not just whether the response is grammatically and contextually correct, but specifically whether the user demonstrated comprehension of the new word. There are three outcomes:

Handled well — user produced a contextually appropriate response, used or correctly responded to the new word, no signs of confusion. The word is fast-tracked: it skips the UNTESTED state and enters LEARNING with one "correct" already credited (since the conversational use counts as one successful exposure across one format). The user is informed at session end: "You handled posiblemente correctly when I used it. I've added it to your tracking with a head start."

Handled neutrally — user's response didn't engage with the new word either way (they might have answered around it). The word enters UNTESTED state for normal tracking, no head start.

Struggled — user asked for clarification, gave a clearly off-topic response, or used the word incorrectly when reusing it. The word enters UNTESTED state but is also flagged as priority for next daily session — meaning the next session's exercise generator will explicitly pull this word in for direct drill, ahead of the FSRS schedule.

This means conversation is a real vocabulary growth pathway, not just exposure. It also means the Breadth lever isn't the only source of new words — important caveat for Pacer behaviour: words added via the on-ramp count toward the active pool but do not consume the Breadth lever's batch budget. The Pacer's Breadth advancement still pulls from the next frequency band when its thresholds are met; conversational on-ramps run in parallel.

Stretch budget cap remains. No more than 3 untracked words per conversation (per FR-010), to avoid drowning the user. The cap is on tutor-introduced novelty per session, not a cumulative cap.
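
The three on-ramp outcomes and the stretch budget reduce to a small routing function. The sketch below is illustrative: Comprehension, OnRampResult, route_stretch_word, and may_introduce are hypothetical names, while priority_drill_next_session is the flag named in FR-037.

```python
from dataclasses import dataclass
from enum import Enum

STRETCH_BUDGET = 3  # max untracked words the tutor may introduce per conversation


class Comprehension(Enum):
    HANDLED_WELL = "handled_well"
    HANDLED_NEUTRALLY = "handled_neutrally"
    STRUGGLED = "struggled"


@dataclass
class OnRampResult:
    word: str
    state: str                        # "LEARNING" or "UNTESTED"
    credited_successes: int           # head start granted by the conversation
    priority_drill_next_session: bool


def route_stretch_word(word: str, outcome: Comprehension) -> OnRampResult:
    """Map Claude's comprehension verdict to the word's initial tracking state."""
    if outcome is Comprehension.HANDLED_WELL:
        # Fast-track: skip UNTESTED and credit one success.
        return OnRampResult(word, "LEARNING", 1, False)
    if outcome is Comprehension.STRUGGLED:
        # Normal entry, but pulled into the next session ahead of FSRS.
        return OnRampResult(word, "UNTESTED", 0, True)
    return OnRampResult(word, "UNTESTED", 0, False)


def may_introduce(stretch_words_so_far: int) -> bool:
    """True while the current conversation still has stretch budget left."""
    return stretch_words_so_far < STRETCH_BUDGET
```

The counter passed to may_introduce would reset at the start of each conversation, since the cap is per session rather than cumulative.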

4.7 Pronunciation tracking model

Pronunciation is tracked independently from semantic vocabulary mastery. A word can be CONFIRMED in the vocabulary engine while still being "uncovered" for pronunciation, and vice versa. The user's headline vocabulary count is the semantic count; pronunciation has its own coverage metrics.

Two denominators, two coverage metrics:

  1. Phoneme coverage. Castilian Spanish has ~24 distinct phonemes (5 vowels + ~19 consonants, depending on how you count ʎ vs ʝ and whether /θ/ is kept distinct from /s/). The system tracks which phonemes the user has demonstrably pronounced correctly. Phoneme coverage % = covered / 24.
  2. Word-pronunciation coverage. Of the words in the user's active pool, what percentage have been pronounced correctly at least twice across separate sessions? A word is "in the active pool" iff at least one of its tracked senses is in LEARNING, CANDIDATE, or CONFIRMED state — pronunciation is a property of the surface form, not the sense (banco sounds the same whether you mean "bank" or "bench"), so we count distinct words rather than multiplying by sense count. Word-pronunciation coverage % = pronounceable_distinct_words / distinct_active_pool_words.
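
The two coverage formulas, including the distinct-word denominator rule, can be sketched as follows. The function names and the (word, sense, state) tuple shape are hypothetical; returning 0.0 for an empty pool is an assumption.

```python
ACTIVE_STATES = {"LEARNING", "CANDIDATE", "CONFIRMED"}
PHONEME_DENOMINATOR = 24  # approximate Castilian phoneme count used by the spec


def active_pool_words(senses) -> set:
    """senses: iterable of (word, sense, state). Returns distinct active words."""
    return {word for word, _sense, state in senses if state in ACTIVE_STATES}


def word_pronunciation_coverage(senses, covered_words: set) -> float:
    """Share of distinct active-pool words that are covered for pronunciation."""
    pool = active_pool_words(senses)
    if not pool:
        return 0.0  # assumption: an empty pool reports 0% rather than failing
    return len(pool & covered_words) / len(pool)


def phoneme_coverage(covered_phonemes: set) -> float:
    """Share of the phoneme inventory pronounced correctly so far."""
    return len(covered_phonemes) / PHONEME_DENOMINATOR
```

For example, with banco tracked in two senses, the word still contributes one entry to the denominator:

```python
senses = [
    ("banco", "bank", "CONFIRMED"),
    ("banco", "bench", "LEARNING"),
    ("quedar", "to_remain", "CANDIDATE"),
    ("coger", "to_take", "UNTESTED"),  # not in the active pool
]
ratio = word_pronunciation_coverage(senses, {"banco", "coger"})  # 1 of 2 words
```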

Per-attempt outcomes with confirmation and decay:

Each pronunciation attempt has one of three outcomes from the configured PronunciationScorer: correct, incorrect, or not_detected. NOT_DETECTED is an operational outcome covering mic failure, silence, or scorer error — these attempts are logged for diagnostics but are invisible to the coverage state machine: they neither count toward advancement nor toward decay. The user-facing tracking layer applies the following rules:

  • A phoneme or word is uncovered by default.
  • A phoneme/word transitions to covered after 2 correct attempts in different sessions.
  • A covered phoneme/word decays back to uncovered after 3 consecutive failures in subsequent sessions.

This is deliberately stricter than semantic mastery (no "candidate" intermediate state) because pronunciation is a motor skill — you can do it or you can't, and there's less value in a partial-credit state. The decay rule means the percentage reflects current ability, not historical bests.
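
The covered/uncovered rules above amount to a two-state tracker per phoneme or word. The sketch below uses hypothetical names (CoverageTracker, record) and plain string outcomes. It assumes that consecutive failures are counted across attempts regardless of session boundaries, and that decay clears the earlier correct sessions; the spec does not settle either point.

```python
from dataclasses import dataclass, field


@dataclass
class CoverageTracker:
    covered: bool = False
    correct_sessions: set = field(default_factory=set)  # sessions with a correct attempt
    consecutive_failures: int = 0

    def record(self, session_id: str, outcome: str) -> None:
        """Apply one scorer outcome: 'correct', 'incorrect', or 'not_detected'."""
        if outcome == "not_detected":
            return  # logged elsewhere; invisible to advancement and decay
        if outcome == "correct":
            self.consecutive_failures = 0
            if not self.covered:
                self.correct_sessions.add(session_id)
                if len(self.correct_sessions) >= 2:  # 2 correct, different sessions
                    self.covered = True
        elif outcome == "incorrect":
            if self.covered:
                self.consecutive_failures += 1
                if self.consecutive_failures >= 3:  # decay back to uncovered
                    self.covered = False
                    self.correct_sessions.clear()
                    self.consecutive_failures = 0
        else:
            raise ValueError(f"unknown outcome: {outcome!r}")
```

Two correct attempts inside the same session do not flip the state, because the set of session ids still has only one member.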

No effect on vocabulary count. Pronunciation does not gate semantic state transitions. A word can move UNTESTED → LEARNING → CANDIDATE → CONFIRMED on the semantic side regardless of pronunciation performance, and vice versa.

The PronunciationScorer interface:

class IPronunciationScorer(ABC):
    @property
    @abstractmethod
    def scorer_name(self) -> ScorerName: ...

    @abstractmethod
    def score(
        self,
        audio_path: str,
        expected_text: str,
        language: str = "es-ES",
    ) -> tuple[PronunciationOutcome, float, dict[str, Any]]:
        """Returns (outcome, confidence, raw_response)."""
        ...

V1 ships with WhisperHeuristicScorer — sends the audio file to Whisper, asks for transcript, returns CORRECT if the normalised transcript equals the normalised expected text, INCORRECT if they differ, NOT_DETECTED if Whisper returns nothing usable. This is a crude proxy. Future implementations include AzurePronunciationScorer (real phoneme-level scoring) and potentially a local model. Configuration selects which scorer is active; the rest of the system is scorer-agnostic.
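
The matching step of the heuristic can be sketched without committing to a particular Whisper client. In the sketch below, normalise and heuristic_outcome are hypothetical names, and transcribe is an injected callable standing in for whatever STT call the project uses. The normalisation rules (case-fold, strip punctuation, collapse whitespace, keep accents) are an assumption, since the spec only says "normalised". Wrapping the result into the (outcome, confidence, raw_response) tuple is left to the scorer class.

```python
import re
import unicodedata
from typing import Callable, Optional


def normalise(text: str) -> str:
    """Case-fold, drop punctuation, collapse whitespace; accents are kept."""
    text = unicodedata.normalize("NFC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # removes ¿ ? ¡ ! , . and similar
    return " ".join(text.split())


def heuristic_outcome(
    audio_path: str,
    expected_text: str,
    transcribe: Callable[[str], Optional[str]],
) -> str:
    """Return 'CORRECT', 'INCORRECT', or 'NOT_DETECTED' for one attempt."""
    try:
        transcript = transcribe(audio_path)
    except Exception:
        return "NOT_DETECTED"  # scorer error is an operational outcome
    if not transcript or not normalise(transcript):
        return "NOT_DETECTED"  # silence or nothing usable
    if normalise(transcript) == normalise(expected_text):
        return "CORRECT"
    return "INCORRECT"
```

Keeping accents in the comparison is a deliberate choice here: stripping them would make minimal pairs such as esta and está indistinguishable, although STT output may not always preserve them reliably.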

5. Functional Requirements

ID Requirement Priority
FR-001 The system shall track vocabulary at the (word, sense) granularity, not the word level. Must
FR-002 Each word-sense shall hold a state ∈ {UNTESTED, LEARNING, CANDIDATE, CONFIRMED, LAPSED} and an FSRS state (stability, difficulty, last review, next due). Must
FR-003 A word-sense shall transition from LEARNING to CANDIDATE only after ≥3 successful exercises in ≥2 different exercise formats, with ≥24h between the first and third success. Must
FR-004 A word-sense shall transition from CANDIDATE to CONFIRMED only via a weekly review session, requiring ≥2 successful evaluations in fresh contexts not previously seen by the user. Must
FR-005 A CONFIRMED word-sense shall transition to LAPSED after 2 consecutive maintenance failures (FSRS "Again" responses). LAPSED returns to LEARNING with reset FSRS parameters. Must
FR-006 The "vocabulary count" displayed to the user shall include only senses in CONFIRMED state. Must
FR-007 The system shall generate exercises in the following formats, with the indicated mode affinities — Vocabulary mode: TRANSLATE_EN_ES, TRANSLATE_ES_EN, DEFINITION_MATCH, SENSE_DISAMBIGUATION, LISTENING (single-word audio); Sentences mode: CLOZE, FULL_TRANSLATE, ERROR_CORRECTION, PRODUCTION, LISTENING (sentence-level audio); Speaking mode: READ_ALOUD, REPEAT_AFTER (no semantic evaluation, pronunciation only); Conversation mode: no formal exercise format (uses ConversationTurn). Must
FR-008 All exercise generation shall be done via Claude API calls, parameterised by the target word-sense, current user level, and exercise format. Must
FR-009 Response evaluation (was the user correct?) shall be done via Claude with a structured JSON output schema; no regex or string-match grading. Must
FR-010 The conversational tutor mode shall constrain its outputs to vocabulary the user has been EXPOSED to (any state including UNTESTED/LEARNING) plus an explicit "stretch budget" of ≤3 new words per conversation, surfaced for later vocabulary tracking. Must
FR-011 The system shall support push-to-talk speech input, transcribe via configured STT provider, and play tutor responses via configured TTS provider. Must
FR-012 The system shall log every exchange (timestamp, mode, prompt, user response, evaluation, state changes) to a local persistent store. Must
FR-013 The weekly review shall be scheduled (by default, every Sunday) and shall not be skippable without an explicit override; missed reviews shall be flagged on the dashboard. Should
FR-014 The CEFR dashboard shall show: current estimated level, vocabulary count by state, A1 thematic domain coverage (greetings, family, food, etc.), grammar competency checklist. Must
FR-015 The user shall be able to inspect any word's full state including all tracked senses, history, and next due date. Must
FR-016 The system shall support inserting new word-senses both manually (user adds quedar:to_remain) and automatically (tutor surfaces a new word during conversation). Must
FR-017 (Removed in v0.5; superseded by the §4.6 conversational on-ramp. Numbering preserved to avoid breaking references.)
FR-018 The CEFR level estimation shall be re-computed after each weekly review and shall use Claude as the assessor, fed the user's vocabulary state and a sample of recent productions. Must
FR-019 The system shall include a Pacer module that holds four independent difficulty levers (Breadth, Depth, Grammar, Production), each with discrete advancement states per §4.5. Must
FR-020 The Pacer shall evaluate advancement signals after every completed daily session and after every weekly review. It shall not run during active sessions. Must
FR-021 The Pacer shall advance the Breadth lever when ALL of: (a) 7-day rolling accuracy across active pool ≥ 85% on at least 3 distinct exercise formats, (b) most recent weekly review CANDIDATE→CONFIRMED conversion rate ≥ 70%, (c) the lever's 5-day cooldown has elapsed, (d) at least 80% of the current pool is in CANDIDATE or CONFIRMED state (no advancing while there's still substantial UNTESTED inventory). Must
FR-022 The Pacer shall advance the Depth lever when: (a) ≥ 30 distinct words have been in CONFIRMED state for ≥ 14 days, (b) the user has encountered ≥ 5 secondary senses in conversation/exercises and handled them correctly without confusion, (c) cooldown elapsed. Must
FR-023 The Pacer shall advance the Grammar lever when: (a) the user has successfully produced the next-tier construction unprompted in ≥ 3 separate exchanges within the past 14 days (verified by Claude evaluation), (b) cooldown elapsed. Must
FR-024 The Pacer shall advance the Production lever when: (a) accuracy on production-format exercises ≥ 80% over the rolling 7-day window, (b) the user has completed ≥ 5 conversational tutor sessions in the past 14 days, (c) cooldown elapsed. Must
FR-025 The Pacer shall never automatically regress any lever. Regression is a manual action initiated by the user from the dashboard. Must
FR-026 The Pacer shall compute a "drift signal" per lever each session and surface a non-blocking Suggest ease back notification on the dashboard when drift exceeds threshold (per §4.5). The notification does not change Pacer state. Should
FR-027 Every Pacer decision (advance, hold, suggest-ease-back) shall be written to a persistent PacerDecision log including: timestamp, lever, decision type, all signal values evaluated, threshold values used, and the resulting state (or unchanged state). The log is append-only and user-browsable. Must
FR-028 The dashboard shall display the current state of each of the four levers, the time-since-last-advancement, the next signal value needed for advancement (e.g. "needs 7-day accuracy ≥ 85%, currently 81%"), and the cooldown status. Must
FR-029 When the Pacer advances Breadth, the new words shall be selected from the next frequency band of the active CEFR level using the Plan curricular del Instituto Cervantes frequency ordering, and shall enter UNTESTED state in the Vocabulary Engine in batches of 10 (configurable). Must
FR-030 The Pacer's threshold values (the 85%, 70%, 60%, 14 days, 5 days, etc.) shall be stored in a configuration file, not hardcoded, so they can be calibrated after 4 weeks of real-use data without code changes. Must
FR-031 The system shall include a Pronunciation Engine as an independent backend module with a PronunciationScorer interface (per §4.7). V1 shall ship with a WhisperHeuristicScorer implementation; the architecture shall support adding AzurePronunciationScorer and other implementations as configuration changes without code changes elsewhere. Must
FR-032 Pronunciation scoring shall produce one of three outcomes per attempt: correct, incorrect, or not_detected (an operational outcome that does not affect coverage, per §4.7), plus a confidence score and the raw scorer response for debugging. Must
FR-033 Pronunciation results shall be stored as PronunciationAttempt entities and shall not affect the semantic FSRS state of the underlying word-sense. The two tracks are independent. Must
FR-034 A phoneme or word shall be considered covered for pronunciation after 2 correct attempts in different sessions, and shall decay back to uncovered after 3 consecutive failures in subsequent sessions. (Per §4.7.) Must
FR-035 The dashboard shall display two pronunciation coverage metrics as percentages: (a) phoneme coverage out of ~24 Castilian phonemes, (b) word-pronunciation coverage out of active pool size. These are shown separately from the semantic vocabulary count. Must
FR-036 When the conversational tutor introduces an untracked word (within the stretch budget), the system shall tag the word in the conversation log and Claude shall evaluate the user's comprehension of that specific word in their response. Must
FR-037 Based on the comprehension evaluation, the untracked word shall be processed per §4.6: handled-well → fast-track to LEARNING with one credit; handled-neutrally → enter UNTESTED; struggled → enter UNTESTED with priority_drill_next_session flag set. Must
FR-038 Words flagged priority_drill_next_session shall be force-included in the next Vocabulary or Sentences mode session's exercise queue, ahead of FSRS-scheduled items, until the priority flag is cleared (cleared after 1 successful exercise). Should
FR-039 The user shall be presented with a session-end summary listing all words added via the conversational on-ramp during that session, with their resulting state, so the user is never surprised by silent vocabulary additions. Must
FR-040 The system shall expose four explicit user-facing modes: Vocabulary, Sentences, Speaking, and Conversation. Each mode shall have its own session shape, exercise mix, and end-of-session summary tailored to that mode's goals. Must
FR-041 The dashboard shall let the user start a session in any mode directly (single click/tap), with the most recently used mode pre-selected. Should
FR-042 An "Explain in English" touchpoint shall be available in every mode that involves Spanish-language content. It shall be invocable as a UI button at all times during a session, and as a voice command (the user saying "explícame" or "in English") in voice-enabled modes. Must
FR-043 When invoked, the Explain-in-English touchpoint shall send the current item, the user's last response (if any), and a brief context summary to Claude, and shall stream back an English-language explanation. The session state (current exercise, conversation context) shall not be lost — the explanation is a side conversation that resumes the session afterward. Must
FR-044 The Explain-in-English explanation shall be logged in the session history, distinguishable from the main exchange, so the user can review where they needed help. Should
FR-045 The Pronunciation Engine module shall be invoked only by the Speaking mode in V1. Other modes that involve speech (Conversation) shall route audio through STT for transcription only; pronunciation tracking from those modes shall be deferred to V2. [Rationale: keeps the Speaking mode a clean, focused surface for pronunciation work; conversation latency would suffer if every utterance were also pronunciation-scored.] Should
FR-046 The Vocabulary Engine shall map exercise outcomes to FSRS ratings as follows: INCORRECT → Again; PARTIAL with score < 0.7 → Hard; PARTIAL with score ≥ 0.7 → Good; CORRECT → Good. The Easy rating is reserved for V2 and shall not be used in V1. [Rationale: defaulting CORRECT → Good is conservative — reviews come slightly more often than they need to, but always safe. Wrong-direction Easy assignments would inflate intervals and silently degrade retention.] Must
FR-047 The Tutor Engine shall compose weekly-review sessions by sampling each CANDIDATE sense in ≥2 fresh contexts (formats and prompts not previously seen by the user), each due-for-maintenance CONFIRMED sense once, plus a small pronunciation sample (default 5 phonemes, 5 words) for decay detection. Must
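The FR-046 mapping is small enough to sketch directly. The following is a minimal illustration, not the project's actual code: the `Outcome` and `Rating` enums and the `to_fsrs_rating` function are hypothetical names, and a real implementation would presumably translate to whatever rating type the chosen FSRS library exposes.

```python
from enum import Enum


class Outcome(Enum):
    """Evaluation outcome as described for ExerciseResult (names assumed)."""
    INCORRECT = "incorrect"
    PARTIAL = "partial"
    CORRECT = "correct"


class Rating(Enum):
    """Hypothetical local rating enum; EASY is deliberately absent in V1."""
    AGAIN = 1
    HARD = 2
    GOOD = 3


PARTIAL_GOOD_THRESHOLD = 0.7  # boundary stated in FR-046


def to_fsrs_rating(outcome: Outcome, score: float) -> Rating:
    """Map an exercise outcome and its 0-1 score to an FSRS rating."""
    if outcome is Outcome.INCORRECT:
        return Rating.AGAIN
    if outcome is Outcome.PARTIAL:
        # PARTIAL splits on the score: below 0.7 is Hard, otherwise Good.
        return Rating.GOOD if score >= PARTIAL_GOOD_THRESHOLD else Rating.HARD
    # CORRECT maps to Good; Easy is reserved for V2.
    return Rating.GOOD
```

Leaving `EASY` out of the enum entirely means a V1 code path cannot assign it by accident.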

6. Data Model

6.1 Core entities

erDiagram
    WORD ||--o{ SENSE : has
    SENSE ||--o{ EXERCISE_RESULT : tested_in
    SENSE ||--o{ STATE_TRANSITION : changes
    SENSE ||--|| FSRS_STATE : has
    SESSION ||--o{ EXERCISE : contains
    SESSION ||--o{ CONVERSATION_TURN : contains
    EXERCISE ||--|| EXERCISE_RESULT : produces
    THEMATIC_DOMAIN ||--o{ SENSE : groups
    PACER_LEVER ||--o{ PACER_DECISION : produces
    PACER_LEVER ||--o{ PACER_SIGNAL_SAMPLE : observed_by

6.2 Entity field specifications

Word — surface form, lemma, part of speech, gender (for nouns), CEFR level introduced, IPA pronunciation, audio sample URL, frequency rank within Plan curricular (used by Pacer for breadth advancement ordering).

Sense — parent word ID, sense identifier (e.g. quedar:to_remain vs quedar:to_meet_up), gloss in English, example sentences (≥3, generated up front by Claude and human-spot-checked by you), CEFR level for this sense, current state, thematic domain tags, sense rank within parent word (1 = primary, 2 = first secondary, etc; used by Pacer for depth advancement).

FSRS state — stability, difficulty, retrievability, last review timestamp, next due timestamp, review count, lapse count. (Parameters per the FSRS algorithm; see §8.6 for the library.) FsrsState is created lazily on a sense's first exposure (UNTESTED → LEARNING transition), not at sense registration; UNTESTED senses are not scheduled and have no FsrsState row.

Exercise — format, target sense ID, prompt text, expected response shape, generated_at, generation_model_version, generation_prompt_hash (so we can replay), grammar constructions required (set of tags drawn from the Grammar lever vocabulary, e.g. present_indicative, preterite, subjunctive_present).

ExerciseResult — exercise ID, user response (text + optional audio file path), Claude evaluation (correct / partial / incorrect + reasoning + score 0–1), latency, state-change-triggered, grammar constructions used (extracted by Claude from the response; feeds Grammar lever's "user produced unprompted" signal).

Session — type (daily / conversation / weekly_review / inspection), start, end, summary stats, pacer_evaluated_at_end (whether the Pacer ran after this session and what it decided).

ConversationTurn — session ID, speaker (user / tutor), text, audio file path if applicable, words-surfaced (for stretch budget tracking), constructions-used (Claude-tagged grammar features in this turn).

StateTransition — sense ID, from state, to state, reason, timestamp. Append-only audit log.

ThematicDomain — id, name, CEFR level, description. Static reference data seeded from CEFR documentation.

PacerLever — lever name (Breadth / Depth / Grammar / Production), current state, last_advancement_at, cooldown_expires_at. Exactly four rows; one per lever.

PacerDecision — id, timestamp, lever, decision_type (advance / hold / suggest_ease_back / user_regressed), from_state, to_state (same as from_state if hold), signals_evaluated (JSON blob: each signal name, its computed value, the threshold it was compared against, pass/fail), reasoning_text. Append-only.

PacerSignalSample — timestamp, signal_name (e.g. accuracy_7day_translation_es_to_en), value, computed_from (window definition: which exercises feed this signal). Used so the Pacer's decisions are reproducible — given the same samples, the same decision should be reached. Useful for retroactive calibration after threshold changes.

PronunciationAttempt — id, timestamp, target_type (phoneme / word / phrase), target_value (the actual phoneme symbol, word, or phrase text), session_id, audio file path, scorer_name (which PronunciationScorer was used), correct (boolean), confidence (0–1), raw_response (full scorer response for debugging).

PhonemeCoverage — phoneme symbol (e.g. r for the trilled r, θ for the Castilian c/z sound), correct_attempts_count, distinct_sessions_with_correct_attempt, consecutive_failures, current_state (covered / uncovered), last_attempt_at. One row per Castilian phoneme.

WordPronunciationCoverage — word_id (FK to Word), correct_attempts_count, distinct_sessions_with_correct_attempt, consecutive_failures, current_state (covered / uncovered), last_attempt_at. Independent of word's semantic state.
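The two coverage entities share the same rule from FR-034 (covered after 2 correct attempts in different sessions, uncovered again after 3 consecutive failures). A minimal sketch of that rule follows. The `Coverage` class and its method are hypothetical, and two details are assumptions the spec does not settle: failures are counted per attempt, and a decayed item must re-earn coverage from scratch.

```python
from dataclasses import dataclass, field


@dataclass
class Coverage:
    """Coverage state for one phoneme or word (hypothetical shape)."""
    correct_sessions: set = field(default_factory=set)
    consecutive_failures: int = 0
    covered: bool = False

    def record(self, session_id: str, correct: bool) -> None:
        """Apply one scored attempt. NOT_DETECTED attempts should not be passed in."""
        if correct:
            self.consecutive_failures = 0
            self.correct_sessions.add(session_id)
            # Two correct attempts only count if they come from different sessions.
            if len(self.correct_sessions) >= 2:
                self.covered = True
        else:
            self.consecutive_failures += 1
            if self.covered and self.consecutive_failures >= 3:
                self.covered = False
                # Assumption: coverage must be re-earned from zero after decay.
                self.correct_sessions.clear()
```

Because the class holds no reference to the semantic state, it stays consistent with the rule that pronunciation results never touch FSRS.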

ConversationOnRamp — id, conversation_turn ID where word was introduced, word ID, comprehension_evaluation (handled_well / handled_neutrally / struggled), claude_reasoning, resulting_state_change. Auditable record of every conversational vocabulary addition.

EnglishExplanationLog — id, session_id, exercise_id_or_turn_id (what triggered the explanation), trigger_method (button / voice command), user_context_at_trigger (current exercise/turn snapshot), claude_explanation, timestamp. Lets the user review what they needed help with.

6.3 Data lifecycle

  • Creation. Words come from (a) manual entry, (b) curriculum seed lists, (c) tutor-surfaced words during conversation, added automatically through the conversational on-ramp (§4.6, FR-036–039) and listed in the session-end summary.
  • Mutation. State changes are append-only — we never overwrite the state field; we insert a StateTransition and the sense's "current state" is derived from the latest transition. This makes the entire history auditable.
  • Deletion. No deletion in V1. If you regret adding a word, mark it ARCHIVED (a soft state outside the main lifecycle).
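The append-only rule above can be illustrated with plain `sqlite3`. This is a sketch under assumptions: the table and column names are invented for the example (the real schema would go through SQLAlchemy and Alembic per §8.6), and "latest" is resolved by insertion order rather than timestamp so that two transitions in the same second cannot tie.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE state_transition (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        sense_id TEXT NOT NULL,
        from_state TEXT,
        to_state TEXT NOT NULL,
        reason TEXT NOT NULL,
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    )
    """
)


def transition(sense_id, from_state, to_state, reason):
    """Record a state change by inserting a row; nothing is ever updated."""
    conn.execute(
        "INSERT INTO state_transition (sense_id, from_state, to_state, reason) "
        "VALUES (?, ?, ?, ?)",
        (sense_id, from_state, to_state, reason),
    )


def current_state(sense_id):
    """Derive the current state from the most recent transition, if any."""
    row = conn.execute(
        "SELECT to_state FROM state_transition "
        "WHERE sense_id = ? ORDER BY id DESC LIMIT 1",
        (sense_id,),
    ).fetchone()
    return row[0] if row else None
```

Archiving a regretted word is then just another inserted row whose `to_state` is `ARCHIVED`.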

6.4 Sensitive data

Almost none. The user is the sole user; their voice recordings and chat logs are personal but not regulated. Stored locally. No PII processing in any meaningful sense. [NOTE — this changes the moment you put this on a server with auth. Don't.]

6.5 Volumes

Estimated upper bound at A1: ~500 confirmed senses, ~1500 senses tracked total, ~10k exercise results, ~100k conversation turns. All of this fits comfortably in SQLite. Audio files are the bulky asset; budget ~1GB, because all session audio is retained indefinitely in V1 with no rotation policy. Storage growth is acknowledged but acceptable for a single user (per A-014).

6.6 Backup

Single SQLite file + audio directory. Backup = nightly copy to a cloud sync folder (Dropbox / iCloud / Google Drive). RPO: 24h is fine. RTO: whatever it takes you to download the backup and reopen the app.
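A nightly copy could look roughly like the sketch below. It assumes the backup target is just a local folder that the cloud sync client watches; the `backup` function name and its arguments are illustrative. It uses `sqlite3.Connection.backup` rather than a raw file copy so the snapshot is consistent even if the app is writing at the time.

```python
import shutil
import sqlite3
from pathlib import Path


def backup(db_path: Path, audio_dir: Path, dest_dir: Path) -> Path:
    """Copy the SQLite database and the audio directory into dest_dir."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest_db = dest_dir / db_path.name
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(dest_db)
    try:
        # SQLite's online backup API copies a consistent snapshot.
        src.backup(dst)
    finally:
        dst.close()
        src.close()
    if audio_dir.exists():
        # Re-running the backup overwrites existing copies in place.
        shutil.copytree(audio_dir, dest_dir / audio_dir.name, dirs_exist_ok=True)
    return dest_db
```

Scheduling (cron, launchd, Task Scheduler) is left out; any of them can call this once a night to meet the 24h RPO.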

7. Non-Functional Requirements

| ID | Category | Requirement | Target | How verified |
|---|---|---|---|---|
| NFR-001 | Latency | Exercise generation round trip (Claude API) | p95 < 4s | Manual; Claude API typically 2–6s for short generations |
| NFR-002 | Latency | Speech-to-text round trip | p95 < 3s for ≤15s audio | Provider SLA |
| NFR-003 | Latency | TTS playback start | p95 < 2s | Provider SLA |
| NFR-004 | Availability | App functional when online | best effort, single user | n/a |
| NFR-005 | Reliability | No data loss on crash | Every state transition persisted before UI confirmation | Test with mid-session kill |
| NFR-006 | Cost | Steady-state monthly cost | ≤ $60/mo at 30 min/day (the cap in §14) | Monthly review of API bills |
| NFR-007 | Usability | Time from "open app" to "first exercise" | < 5s | Manual stopwatch |
| NFR-008 | Maintainability | Anyone (you, in 6 months) can read the code | Reasonable file size, comments on the FSRS layer and the Claude prompt construction | Code review by future-you |
| NFR-009 | Portability | Browser support | Latest Chrome, Firefox, Safari on desktop | Manual smoke test |
| NFR-010 | Privacy | Voice and chat data never leave your control unencrypted | API providers see request payloads in transit (unavoidable); local store is your machine | Configuration review |

Categories I'm omitting from the table because they don't apply: scalability (single user), accessibility (you are the user; you know your own needs), i18n (the UI chrome can stay in English; the content is the Spanish — that's the whole point).

8. Architecture

8.1 Solution strategy

A single-page web app (React + Vite, per §8.6) talking to a thin local backend (Python with FastAPI) that owns the SQLite database and proxies calls to Claude, the STT provider, and the TTS provider. The backend exists primarily to keep the API keys off the client and to centralise the prompt construction logic. Run it locally: no deployment, no hosting cost.

The "interesting" software is concentrated in five modules:

  1. Vocabulary engine — owns the FSRS layer, the state machine, and the validation logic. Pure logic, no AI. This is the module worth unit-testing thoroughly.
  2. Pacer — owns the four difficulty levers, computes signals from the database, makes advancement decisions per the rules in §4.5 and FR-019–030. Pure logic, no AI calls (signals are all derived from existing data). Runs after sessions, not during. Highly unit-testable; do it.
  3. Tutor engine — owns the Claude prompts and the response parsing. Has one job: turn (user state, Pacer state, intent) into (exercise or dialogue turn) and back into (evaluation). The Pacer state feeds into prompt construction — when Grammar is at +past_tenses, exercise-generation prompts say so. Also evaluates comprehension of stretch-budget words for the conversational on-ramp.
  4. Speech engine — STT and TTS adapters with a common interface. Pluggable so you can swap providers.
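  5. Pronunciation engine — owns the PronunciationScorer interface (§4.7, FR-031) and its implementations. V1 ships WhisperHeuristicScorer; AzurePronunciationScorer (Azure Speech Pronunciation Assessment) can be added later as a configuration change. Takes audio + reference text, returns a correct/incorrect outcome with a confidence score and the raw scorer response (FR-032). Independent of the semantic evaluation path, and in V1 invoked only by Speaking mode (FR-045).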

8.2 Context diagram

graph LR
    U[User<br/>browser + mic + speakers]
    UI[Web UI<br/>SPA]
    BE[Local backend<br/>API + SQLite]
    Claude[Anthropic Claude API]
    STT[STT provider<br/>Whisper / Deepgram]
    TTS[TTS provider<br/>ElevenLabs / Azure]

    U <--> UI
    UI <--> BE
    BE <--> Claude
    BE <--> STT
    BE <--> TTS

8.3 Component diagram (backend internals)

graph TB
    API[HTTP API layer]
    VE[Vocabulary Engine<br/>FSRS + state machine]
    PA[Pacer<br/>4 levers + signal computation]
    TE[Tutor Engine<br/>Claude prompts + parsing]
    SE[Speech Engine<br/>STT/TTS adapters]
    PE[Pronunciation Engine<br/>PronunciationScorer adapters]
    DB[(SQLite)]
    AUDIO[(Audio file store)]

    API --> VE
    API --> PA
    API --> TE
    API --> SE
    API --> PE
    VE --> DB
    PA --> DB
    PA -.reads state from.-> VE
    PA -.publishes state to.-> TE
    TE --> DB
    PE --> DB
    TE -.calls.-> Claude[Claude API]
    SE -.calls.-> ExtSTT[STT API]
    SE -.calls.-> ExtTTS[TTS API]
    PE -.calls.-> ExtPron[Scorer backend<br/>Whisper API in V1]
    SE --> AUDIO
    PE --> AUDIO

The Pacer reads from both the Vocabulary Engine's state and the raw exercise/result history; it doesn't write to FSRS state directly. When the Pacer advances Breadth, it inserts new word-senses into UNTESTED state via the Vocabulary Engine's normal "add new sense" path — no side door. When the Pacer advances Grammar or Production, those settings are read by the Tutor Engine when it constructs prompts; nothing in the Vocabulary Engine changes.

The Pronunciation Engine is independent of the semantic evaluation path. In V1 only Speaking mode invokes it (FR-045): the audio and its reference text go to the configured PronunciationScorer (WhisperHeuristicScorer by default), and the result updates pronunciation coverage without touching FSRS state (FR-033). Audio from Conversation mode goes to the Speech Engine for transcription only (STT → Tutor Engine for content evaluation). Scoring the same audio on both paths in parallel, with the two evaluations stitched together in one session response, is deferred to V2.

8.4 Key runtime scenarios

Scenario A — Single exercise during a daily session

sequenceDiagram
    actor User
    participant UI
    participant BE as Backend
    participant VE as Vocab Engine
    participant TE as Tutor Engine
    participant Claude

    User->>UI: Start daily session
    UI->>BE: POST /session/start
    BE->>VE: get_due_senses(limit=15)
    VE-->>BE: list of (sense, format_to_use)
    BE->>TE: generate_exercise(sense, format, pacer_state)
    TE->>Claude: structured prompt (incorporates Pacer Grammar/Production state)
    Claude-->>TE: exercise JSON
    TE-->>BE: exercise
    BE-->>UI: exercise payload
    UI-->>User: render exercise
    User->>UI: response (text or audio)
    UI->>BE: POST /exercise/respond
    BE->>TE: evaluate(exercise, response)
    TE->>Claude: evaluation prompt (also extracts grammar constructions used)
    Claude-->>TE: {correct, score, reasoning, constructions_used}
    TE-->>BE: evaluation
    BE->>VE: record_result(sense, result)
    VE->>VE: update FSRS, check state transitions
    VE-->>BE: updated state
    BE-->>UI: result + state delta
    UI-->>User: feedback

Scenario B — Post-session Pacer evaluation

sequenceDiagram
    participant UI
    participant BE as Backend
    participant PA as Pacer
    participant DB

    UI->>BE: POST /session/end
    BE->>PA: evaluate_after_session(session_id)
    PA->>DB: read 7-day exercise results, FSRS states, sense states
    PA->>PA: compute signals for each lever
    loop per lever (Breadth, Depth, Grammar, Production)
        PA->>PA: check cooldown
        PA->>PA: check thresholds
        alt thresholds met & cooldown elapsed
            PA->>DB: write PacerDecision (advance) + update PacerLever state
            opt lever == Breadth
                PA->>DB: insert N new senses as UNTESTED
            end
        else drift signal triggered
            PA->>DB: write PacerDecision (suggest_ease_back)
        else
            PA->>DB: write PacerDecision (hold) with signal values
        end
    end
    PA-->>BE: summary of decisions
    BE-->>UI: session summary including any Pacer changes
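
The per-lever branch in Scenario B reduces to a small pure function. The sketch below mirrors the order of the branches in the diagram. `LeverInput` and `decide` are hypothetical names, and the two booleans stand in for the real signal and threshold computation (FR-019–030), which is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LeverInput:
    """Pre-computed inputs for one lever (hypothetical shape)."""
    name: str
    cooldown_expires_at: datetime
    thresholds_met: bool
    drift_triggered: bool


def decide(lever: LeverInput, now: datetime) -> str:
    """Return the PacerDecision decision_type for one lever."""
    if lever.thresholds_met and now >= lever.cooldown_expires_at:
        return "advance"
    if lever.drift_triggered:
        # Regression is never automatic; the Pacer only suggests it.
        return "suggest_ease_back"
    return "hold"
```

Keeping the decision free of database access is what makes the Pacer easy to unit-test against recorded PacerSignalSample values.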

Session-end orchestration (API-layer responsibility): the Backend, on POST /session/end, drives the following ordered sequence:

  1. Vocabulary Engine finalises any pending state on the session.
  2. Pronunciation Engine flushes any pending coverage updates.
  3. Pacer's evaluate_after_session(session_id) runs.
  4. The session summary is composed and returned.

The Pacer is never invoked as a side effect of any other module's work. Only the API layer triggers it. This keeps the modules decoupled: the Vocabulary Engine doesn't know the Pacer exists, the Pronunciation Engine doesn't know the Pacer exists, and bugs in upstream modules cannot accidentally trigger Pacer evaluations on inconsistent state.
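The ordered sequence can be sketched as a single API-layer function that owns the ordering, with each module passed in as a plain callable. The step names and the `end_session` signature are invented for this example; the point is only that the modules never call each other.

```python
from typing import Callable, Dict

# Fixed order from the session-end orchestration list; step names are assumed.
SESSION_END_ORDER = ("vocabulary_finalise", "pronunciation_flush", "pacer_evaluate")


def end_session(session_id: str, steps: Dict[str, Callable[[str], object]]) -> dict:
    """Run the session-end steps in the fixed order and compose a summary."""
    summary = {"session_id": session_id}
    for name in SESSION_END_ORDER:
        # Each module is called by the API layer only, never by another module.
        summary[name] = steps[name](session_id)
    return summary
```

Because the order lives in one tuple, a failure in an earlier step naturally prevents the Pacer from running on inconsistent state.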

8.5 Deployment view

Localhost only. Single machine. Run via docker compose up or a package.json script. Database is a SQLite file in a known location (e.g. ~/.castellano/db.sqlite). Audio in ~/.castellano/audio/. Backups via the user's existing cloud sync.

8.6 Technology choices

  • Frontend: React + Vite. Confirmed.

  • Backend: Python with FastAPI. Confirmed.

  • Database: SQLite. Single user, embedded, zero ops. Use SQLAlchemy + Alembic for schema migrations.

  • FSRS library: fsrs Python package. Open source, well-maintained, and an implementation of the same FSRS algorithm that Anki has shipped as a built-in scheduler since version 23.10.

  • STT: OpenAI Whisper API as default. Pluggable interface.

  • TTS: ElevenLabs as default (best Castilian voices), Azure Neural TTS as fallback. Pluggable.

  • Pronunciation Assessment: Pluggable PronunciationScorer interface. V1 default = WhisperHeuristicScorer (uses Whisper API, transcript-match boolean — already paying for Whisper for STT, so marginal cost is zero). Future = AzurePronunciationScorer (Azure Speech Pronunciation Assessment with phoneme-level scoring) — pluggable as a configuration change when desired. Acknowledge the V1 limitation: Whisper normalises non-native pronunciation toward correct text, so the heuristic will be over-generous; Speaking mode's coverage metrics will reflect "intelligible enough for Whisper" rather than "Castilian-accurate." Acceptable trade-off for V1.

  • Claude routing — confirmed split for cost control:

    • Opus (claude-opus-4-7) — used for: response evaluation in daily sessions, weekly review assessment, CEFR level estimation, conversational on-ramp comprehension evaluation, grammar-construction extraction. The high-stakes "is this correct" judgements where errors compound.
    • Sonnet (claude-sonnet-4-6) — used for: exercise generation, conversational tutor turns, content seed structuring during Milestone 6. The high-volume creative work where speed matters and minor variation is fine.

    Estimated split at 30 min/day: ~70% Sonnet calls, ~30% Opus calls by count, but Opus dominates cost share. Monthly target ~$15 for Anthropic API. [ASSUMPTION — based on current Anthropic pricing; verify and adjust split if cost runs hot. Hard kill switch in §14.]
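The routing split above amounts to a lookup table. A minimal sketch, assuming task names that are made up for the example (the model identifiers are the ones given in this section, and English explanations are routed to Sonnet as described in §14):

```python
OPUS = "claude-opus-4-7"
SONNET = "claude-sonnet-4-6"

# Task keys are hypothetical labels for the uses listed in the routing split.
MODEL_FOR_TASK = {
    "response_evaluation": OPUS,
    "weekly_review_assessment": OPUS,
    "cefr_estimation": OPUS,
    "onramp_comprehension": OPUS,
    "grammar_extraction": OPUS,
    "exercise_generation": SONNET,
    "tutor_turn": SONNET,
    "english_explanation": SONNET,
    "content_seed_structuring": SONNET,
}


def model_for(task: str) -> str:
    """Return the model for a task, failing loudly on anything unrouted."""
    try:
        return MODEL_FOR_TASK[task]
    except KeyError:
        raise ValueError(f"unrouted task: {task}") from None
```

Raising on unknown tasks keeps a new call site from silently defaulting to the expensive model.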

9. Interfaces

9.1 External APIs consumed

| Provider | Purpose | Failure mode | Fallback |
|---|---|---|---|
| Anthropic | Generation, evaluation, conversation, assessment | API down or rate-limited | Show clear error; cached exercises for offline-ish drill mode? [OPEN] |
| STT provider | Speech → text | API down | Fall back to typed input, surface mic-disabled state |
| TTS provider | Text → speech | API down | Show text only; mute the audio button |

9.2 APIs produced

The backend exposes a small REST API for the SPA. Sketched:

  • POST /session/start { mode: vocabulary | sentences | speaking | conversation } → session ID + first item appropriate to the mode
  • POST /exercise/respond { exerciseId, response } → evaluation + state delta (used by Vocabulary, Sentences, Speaking modes)
  • POST /session/end { sessionId } → session summary including Pacer decisions made, on-ramp additions, pronunciation coverage updates
  • POST /explain { sessionId, contextRef } → English-language explanation of the current item/turn, streamed; does not advance or end the session
  • GET /vocab → semantic confirmed list, paginated
  • GET /vocab/:wordId → all senses + history + pronunciation coverage status
  • POST /vocab/word { word, candidate_senses } → manual addition
  • POST /conversation/turn { sessionId, audioOrText } → tutor response (audio + text)
  • GET /pronunciation/coverage → phoneme coverage + word-pronunciation coverage percentages, plus per-phoneme breakdown
  • POST /pronunciation/score { targetType, targetValue, audio } → uses configured scorer to evaluate (used by Speaking mode)
  • GET /progress → CEFR snapshot
  • POST /review/weekly → run the weekly review session
  • GET /pacer → current state of all four levers + signals computed at last evaluation + cooldown status
  • GET /pacer/decisions?lever=&since= → browseable decision log
  • POST /pacer/regress { lever, target_state } → manual user-initiated regression on a lever
  • POST /pacer/dismiss-suggestion { lever } → user dismissing a suggest-ease-back notification

OpenAPI schema lives in the repo. Single user, no auth, but bind to 127.0.0.1 only — never expose to the network.
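One cheap way to enforce the loopback-only rule is a guard that runs before the server starts. This is a sketch using only the standard library; `require_loopback` is a hypothetical helper, and how its result is handed to the actual server is left out.

```python
import ipaddress


def require_loopback(host: str) -> str:
    """Return host unchanged if it is a loopback IP literal, else raise."""
    # ip_address() itself raises ValueError for anything that is not an IP literal.
    if not ipaddress.ip_address(host).is_loopback:
        raise ValueError(f"refusing to bind to non-loopback address {host!r}")
    return host
```

Calling this on the configured bind address at startup turns an accidental `0.0.0.0` into an immediate crash instead of an exposed, unauthenticated API.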

9.3 User-facing UI surfaces

Seven primary screens:

  1. Dashboard — vocab count, due today, current CEFR estimate, pronunciation coverage bars (phonemes %, words %), Pacer panel, weekly review status, and four prominent mode-start buttons: Vocabulary, Sentences, Speaking, Conversation. Most-recently-used mode is pre-selected.
  2. Vocabulary mode session screen — single exercise, response area, feedback panel, progress bar, persistent "Explain in English" button.
  3. Sentences mode session screen — sentence-level exercise display, response area, feedback panel, progress bar, persistent "Explain in English" button.
  4. Speaking mode session screen — current prompt (word/phrase/sentence), large push-to-talk button, last-attempt feedback (correct / not yet, with confidence), running coverage display, persistent "Explain in English" button (FR-042).
  5. Conversation mode screen — chat-style transcript, large push-to-talk button, current topic header, words-surfaced sidebar, "Explain in English" button and voice command, on-ramp evaluations shown post-session.
  6. Vocabulary browser — searchable, filterable list of all tracked words and senses; shows semantic state and pronunciation coverage state in separate columns.
  7. Pacer detail screen — per-lever drilldown with decision log, signal history, manual regression controls.

Plus a Settings screen for provider keys, scorer selection (Whisper-heuristic vs Azure when added), voice selection, weekly review schedule.

10. Security

Personal project, localhost-only — but a few things still matter.

  • API keys for Claude / STT / TTS go in a .env file, never in code, never in localStorage. Backend reads them; UI never sees them.
  • Bind to 127.0.0.1 explicitly. If you ever run this on a laptop on hotel WiFi and bind to 0.0.0.0 by accident, you've shared your Claude key with the hotel.
  • No auth on the local API is fine only because of the binding above. If that ever changes, add auth.
  • Audio files stored unencrypted on local disk. Reasonable for personal use; full-disk encryption (FileVault, BitLocker) is your friend here.
  • Prompt injection. The user (you) is the only source of input, so the threat is low — but the tutor mode will be feeding Claude its own prior outputs as conversation context, which is a classic spot for self-injection drift. Mitigation: clear system prompt boundaries, structured output schemas where possible, and don't let the model's prior outputs influence the validation logic.
  • No secret-ful logs. Don't log API request bodies that include the keys.

11. Privacy and Compliance

You are the only user. No GDPR, no CCPA, no HIPAA, no PCI. You have a duty of care to yourself; that's it.

One genuine consideration: third-party processors see your speech. Whisper, ElevenLabs, etc. process your voice recordings on their servers. Read their data-use policies. OpenAI explicitly says API audio is not used for training; ElevenLabs has a similar policy. Worth confirming before onboarding any new provider.

12. Observability and Operations

Personal project. Don't over-build this. But:

  • Logs. All Claude calls, STT calls, TTS calls logged with cost estimate (token counts, audio seconds). Daily tally to a local file. This is how you'll catch a runaway prompt that's burning money.
  • No metrics stack. Don't run Prometheus on your laptop for this.
  • Health. A /health endpoint that pings each upstream provider with a trivial call. Useful when something seems broken.
  • The one alert that matters: monthly API cost over budget. A simple cron-ish check that emails you (or just shows a banner in the UI) if month-to-date cost exceeds the threshold.
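The cost alert can be computed straight from the daily tally file. The sketch below assumes a JSON-lines log with one entry per call carrying `date` (ISO format) and `cost_usd` fields. That log shape, the function names, and the default threshold parameter are all assumptions for illustration.

```python
import json
from datetime import date
from typing import Iterable


def month_to_date_cost(log_lines: Iterable[str], today: date) -> float:
    """Sum cost_usd over entries dated within today's month, up to today."""
    total = 0.0
    for line in log_lines:
        entry = json.loads(line)
        day = date.fromisoformat(entry["date"])
        if (day.year, day.month) == (today.year, today.month) and day <= today:
            total += entry["cost_usd"]
    return total


def over_budget(log_lines: Iterable[str], today: date, threshold_usd: float = 60.0) -> bool:
    """True when month-to-date spend exceeds the threshold."""
    return month_to_date_cost(log_lines, today) > threshold_usd
```

The UI banner or email is then a one-line check against `over_budget` on app start or session end.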

13. Deployment and Release

There is no deployment. There is git pull && docker compose up. There is no release; there is main.

Migration strategy: Alembic migrations against the SQLite file (per §8.6). Run on app start.

If you ever want to share this with someone else, this whole section needs rewriting.

14. Cost

Revised back-of-envelope at ~30 min/day, with v0.4 changes:

  • Claude. Sonnet handles ~70% of calls (exercise generation, tutor dialogue, English explanations) at ~$0.30/day. Opus handles ~30% of calls (evaluation, assessment, comprehension judgement) at ~$0.50/day. The Explain-in-English mechanic adds maybe 5–10 calls/day at Sonnet pricing, which is negligible. Total ~$15–20/month, treating the daily figures as upper bounds ($0.80/day sustained for 30 days would be ~$24/month).
  • STT / Whisper. $0.006/min × 30 min/day × 30 days ≈ $5/month. Whisper now serves double duty (transcription + V1 pronunciation scorer) so no separate pronunciation cost in V1.
  • TTS. ElevenLabs paid tier ~$5–11/month depending on tier.
  • Azure Pronunciation (deferred). Not in V1 cost. When enabled later, expect ~$15/month additional.

Total V1 estimate: $25–35/month — comfortably under the new $60 cap. Headroom for adding Azure pronunciation later ($40–50/month total) or for higher usage.

Kill switches:

  • Daily soft cap: pause Opus calls if daily cost > $2; fall back to Sonnet for everything except weekly review.
  • Monthly hard cap: refuse new Claude calls if month-to-date > $60 budget. Recover next month or with explicit override.
  • (Future, when Azure pronunciation is enabled) Daily soft cap on pronunciation calls.
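The two active kill switches can be expressed as one gate in front of every Claude call. A minimal sketch, assuming the caller already knows today's and this month's spend; the `choose_model` function, its parameters, and the `"weekly_review"` task label are hypothetical, while the $2 and $60 defaults come from the list above.

```python
from typing import Optional

OPUS = "claude-opus-4-7"
SONNET = "claude-sonnet-4-6"


def choose_model(
    task: str,
    requested_model: str,
    daily_cost: float,
    month_cost: float,
    *,
    override: bool = False,
    daily_soft_cap: float = 2.0,
    monthly_hard_cap: float = 60.0,
) -> Optional[str]:
    """Return the model to use, or None if the call must be refused."""
    # Monthly hard cap: refuse everything unless explicitly overridden.
    if month_cost > monthly_hard_cap and not override:
        return None
    # Daily soft cap: downgrade Opus to Sonnet, except for weekly review.
    if daily_cost > daily_soft_cap and requested_model == OPUS and task != "weekly_review":
        return SONNET
    return requested_model
```

Returning `None` rather than raising lets the UI show a budget banner instead of a generic error.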

15. Rollout Plan

There is no rollout. There is "you start using it." But there is a sensible build order:

Milestone 0 — skeleton (1 weekend). Backend + frontend wired up, mode-selection dashboard, hello-world exercise hardcoded in Vocabulary mode, Claude API call working end-to-end.

Milestone 1 — vocab engine, no AI (1 weekend). Database schema (including PronunciationAttempt, PhonemeCoverage, WordPronunciationCoverage, PacerLever), FSRS integration, semantic state machine implemented and unit-tested with synthetic exercise results. Pronunciation coverage state machine (boolean + 2-correct/3-fail rules) implemented and unit-tested. No Claude in this milestone.

Milestone 2 — Vocabulary mode and Sentences mode (1 weekend). Exercise generation prompts for word-level and sentence-level formats. Sonnet for generation, Opus for evaluation. Both modes end-to-end with typed input. Explain-in-English touchpoint integrated as a button.

Milestone 3 — voice + Conversation mode (1–2 weekends). STT and TTS integrated, push-to-talk in the UI, Conversation mode end-to-end with voice. Conversational on-ramp logic. Explain-in-English voice command added (Conversation mode only).

Milestone 4 — Speaking mode and Pronunciation Engine (1 weekend). PronunciationScorer interface defined. WhisperHeuristicScorer implementation. Speaking mode UI with prompts, push-to-talk, immediate feedback. Phoneme coverage and word-pronunciation coverage displays on dashboard.

Milestone 5 — weekly review and CEFR dashboard (1 weekend). The validation gate from CANDIDATE → CONFIRMED, the assessment session, the level estimator (Opus). Weekly review samples broadly across modes including a small pronunciation re-check for decay detection.

Milestone 6 — Pacer (1 weekend). Four levers, signal computation, advancement logic, decision log, dashboard panel. Mostly pure-logic code that operates on data already produced by milestones 1–5.

Milestone 7 — content seed (1–2 weekends). Scrape the CVC Plan curricular A1-A2 inventories. Use Claude (Sonnet) to structure into JSON. Spot-check by hand. Load into database. Seed the phoneme list (~24 Castilian phonemes) for pronunciation coverage tracking.

Milestone 8 — polish. Cost monitoring dashboard with kill switches, error handling, audio replay UI, Pacer threshold calibration based on first 4 weeks of real use, prompt regression testing infrastructure. Possibly: integrate AzurePronunciationScorer if Whisper-heuristic is proving too generous.

Start using the system for real after Milestone 4 — that's when all four modes are functional. Milestones 5–8 add structure and polish but the four core modes work as of M4.

16. Risks and Open Questions

16.1 Risk register

ID Risk Probability Impact Mitigation
R-001 Claude evaluations are inconsistent — same response graded differently across sessions Med High Use structured output schemas, low temperature, Opus for evaluation, sample-test a fixed set of (response, expected_grade) pairs before each prompt change
R-002 V1's Whisper-heuristic pronunciation scorer is too generous because Whisper normalises non-native pronunciation. Coverage % will drift up faster than actual pronunciation skill. High Med Acknowledged limitation. Treat V1 coverage % as "intelligibility coverage" not "Castilian-accuracy coverage." Plan to swap in AzurePronunciationScorer in M8 if the heuristic proves uninformative.
R-003 The "weekly review" gate is too strict and confirmed vocabulary grows too slowly Med Med Start strict (per the spec); relax thresholds based on data after 4 weeks of use
R-004 The "weekly review" gate is too lenient and confirmed vocabulary inflates Med Med Same — calibrate after 4 weeks
R-005 API cost overruns Low Med Hard kill switch in §14; daily cost log
R-006 Prompt drift over time as you tweak prompts High Med Version every prompt template; log which version produced each exercise; keep a regression set of (prompt, expected output shape) pairs
R-007 FSRS parameters tuned for native-language flashcards behave oddly for L2 vocabulary acquisition with sense-level granularity Med Med Start with FSRS defaults; collect ≥1000 reviews; re-fit parameters using the FSRS optimizer with your own data
R-008 Single-user assumption is violated (you let a friend try it) and the data model doesn't support it Low Low Ignore until it happens; the migration is straightforward (add a user_id column)
R-009 Castilian-specific vocabulary surfaces issues with Claude's default Spanish (which leans Latin American) Med Low Pin the system prompt: "Castilian Spanish from Spain. Use vosotros. Use peninsular vocabulary (coger, ordenador, zumo). Avoid Latin Americanisms." Spot-check generations.
R-010 Speech latency makes conversational mode feel sluggish Med Med Stream Claude responses as they generate; start TTS on the first sentence rather than waiting for full response
R-011 Pacer thresholds are wrong out of the box and either advance too aggressively (frustrating) or too slowly (boring) High Med Configurable thresholds (FR-030); explicit calibration pass at week 4 of real use; log every signal value computed so retroactive analysis is possible
R-012 Pacer advances on noisy short-window data — e.g. one good 7-day window gets you advanced, then accuracy drops the next week Med Med Cooldown (5 days) gives FSRS time to redistribute; FR-021 also requires high CANDIDATE→CONFIRMED rate at the most recent weekly review, which is a slower-moving signal that filters out noise
R-013 Manual-only regression means user lets things drift past the point where regression would help, because regressing feels like failure Med Med The "Suggest ease back" mechanic (FR-026) explicitly normalises the action by surfacing the option non-judgementally, with the data justifying it; the decision log makes regression a recordable choice rather than a hidden setback
R-014 Grammar-construction extraction (used by Lever 3 advancement signals) requires Claude to reliably tag what tense/mood you used in your responses, which is itself an evaluation task and could be wrong Med Med Use Opus for this extraction; build a regression set of (response, expected tags) pairs; tolerate noise by requiring 3+ unprompted productions before advancement (FR-023) — single misclassification doesn't matter
R-015 Pacer adds complexity to the Tutor's prompt construction (4 levers' worth of state injected into every exercise/conversation prompt), increasing token usage and the risk of prompt drift Med Low Keep Pacer-state injection terse and structured (a small JSON object the prompt references), not prose; version Pacer-prompt templates separately from base templates
R-016 STT (Whisper) auto-corrects beginner pronunciation errors before the Tutor sees the transcript, so semantic evaluation is based on what the user would have said correctly, not what they actually said High Med Partly mitigated: the separate Pronunciation Engine handles accent feedback. For content evaluation, accept the limitation; pronunciation correctness is a separate track from semantic correctness.
R-017 Conversational on-ramp fast-tracks words that the user actually didn't understand (Claude evaluates "handled well" too generously) Med Med Conservative bar in the comprehension prompt: require explicit comprehension evidence (the user used the word back, or responded specifically to its meaning), not just absence of confusion; log the evaluation reasoning for review
R-018 Cost overruns due to pronunciation engine pricing per audio minute Med Med Daily kill switch on Azure pronunciation calls (§14); rate-limit to scoring N utterances per session, not all of them, if needed
R-019 CVC scrape produces noisy structured data (HTML inconsistent across pages, words in tables vs. lists, sense disambiguation requires linguistic judgement) High Med This is content work expected to need manual cleanup. Plan for ~20% of seed words needing hand-correction. Don't try to fully automate it.
R-020 Agentic codegen produces working code that doesn't survive the first weekend's real use (subtle bugs in the FSRS integration, off-by-one in scheduler, etc.) High Med Heavy unit testing on Vocabulary Engine and Pacer is non-negotiable. These are the modules where agentic generation is most likely to produce something that seems right but isn't. Real session data the second weekend will surface issues; budget for one debugging weekend after M3.
R-021 The "Explain in English" touchpoint becomes a crutch — user invokes it constantly and never builds Spanish-only fluency Med Med Log invocations and surface a "you used Explain 12 times in 5 sessions" gentle nudge on the dashboard; do NOT rate-limit or block it (defeats the tutor-feel goal). Trust the user to self-correct once the data is visible.
R-022 Modes get used unevenly — user always picks Vocabulary, never Speaking, and gets a lopsided skill profile Med Med Dashboard surfaces "last session per mode" timestamps; weekly review samples across all four modes regardless of recent usage to keep coverage honest
R-023 The boolean-with-decay pronunciation tracking misses real progress — user pronounces a word correctly today, fails it tomorrow because of context, decays back to uncovered, gets discouraged Med Low The 3-consecutive-failures threshold (FR-034) is intentionally lenient; one bad session doesn't decay anything. If it still feels punishing in practice, raise to 4 consecutive failures.
R-024 Whisper-as-pronunciation-scorer means Speaking mode looks identical in cost to other voice modes, so users may not realise pronunciation work has lower-quality feedback than they think Low Low Surface scorer name in Speaking mode UI: "scoring via Whisper-heuristic — for Castilian-accurate scoring, configure Azure in Settings."
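
The R-010 mitigation (start TTS on the first sentence instead of waiting for the full response) depends on cutting a streamed response into sentences as it arrives. The sketch below shows one way to do that with the standard library only. The function name `sentences_from_stream` is hypothetical, the chunks stand in for whatever the streaming client yields, and the split rule is deliberately naive: it would also split after abbreviations such as "Sr.".

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., !, ? or … followed by whitespace.
# Spanish opening marks (¿ ¡) need no special handling.
_SENTENCE_END = re.compile(r"(?<=[.!?…])\s+")


def sentences_from_stream(chunks: Iterable[str]) -> Iterator[str]:
    """Yield each complete sentence as soon as it appears in the stream.

    Hypothetical helper: `chunks` stands in for the text deltas of a
    streamed model response.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a finished sentence.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        # The last part may still be growing; keep it buffered.
        buffer = parts[-1]
    # Flush whatever is left when the stream closes.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be handed to the TTS call immediately, so audio for the first sentence can start while later sentences are still being generated.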

16.2 Open questions

Most of the open questions from v0.2 and v0.3 have been closed by user direction. Three remain: two carried over and one introduced in v0.4:

| ID | Question | Why it matters | Needed by |
| --- | --- | --- | --- |
| Q-011 | Does the conversational on-ramp surface the new word visually during the conversation (subtle highlight, footnote with translation) or strictly through the post-session summary? | UX design choice with real learning-experience implications | Milestone 3 |
| Q-012 | When pronunciation coverage and semantic mastery for a word disagree (e.g. semantic CONFIRMED but word is uncovered for pronunciation), how is this surfaced to the user? Default proposal: vocabulary count is purely semantic; a separate "speakable vocabulary" count reports the intersection. | Affects dashboard headline metrics | Milestone 4 |
| Q-013 (NEW) | When the Whisper-heuristic scorer judges an utterance as "correct," what's the threshold? Strict equality (transcript matches expected exactly), normalised equality (case/punctuation/diacritics ignored), or fuzzy match (Levenshtein distance below threshold)? | Determines V1 Speaking mode's leniency | Milestone 4 |
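
To make the three options in Q-013 concrete, here is a minimal sketch of strict, normalised, and fuzzy matching using only the standard library. The function names, the `mode` strings, and the default `max_distance` are illustrative assumptions, not decisions the spec has made.

```python
import string
import unicodedata


def normalise(text: str, strip_diacritics: bool = True) -> str:
    """Lower-case, drop punctuation, collapse whitespace, optionally strip accents."""
    text = text.casefold()
    if strip_diacritics:
        # NFD splits "é" into "e" plus a combining accent. Dropping the
        # combining marks also turns "ñ" into "n", so "año" and "ano" collide.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(c for c in decomposed if not unicodedata.combining(c))
    punctuation = string.punctuation + "¿¡«»"
    text = "".join(c for c in text if c not in punctuation)
    return " ".join(text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def matches(transcript: str, expected: str,
            mode: str = "normalised", max_distance: int = 2) -> bool:
    """Hypothetical comparison used by a Whisper-based heuristic scorer."""
    if mode == "strict":
        return transcript == expected
    t, e = normalise(transcript), normalise(expected)
    if mode == "normalised":
        return t == e
    if mode == "fuzzy":
        return levenshtein(t, e) <= max_distance
    raise ValueError(f"unknown mode: {mode}")
```

One design point worth weighing when closing Q-013: ignoring diacritics makes the scorer tolerant of transcript formatting, but it also erases distinctions that matter in Spanish (ñ versus n, stressed versus unstressed vowels), which is why `strip_diacritics` is a separate switch here.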

Closed questions: Q-001 through Q-010 (per v0.2 and v0.3 user direction), plus Q-007-revised (cost cap raised to $60).

16.3 Assumptions made

| ID | Assumption | Section |
| --- | --- | --- |
| A-001 | Time budget: fast, agentic development; one weekend per milestone target | §2.4, §15 |
| A-002 | Cost ceiling raised to $60/month for V1 (was $50 in v0.3, $30 in v0.2) | §2.4, §14 |
| A-003 | Python/FastAPI backend, React + Vite frontend (confirmed; no longer an assumption) | §8.6 |
| A-004 | Whisper API for STT, ElevenLabs for TTS, Azure Speech for pronunciation assessment (Azure is deferred past V1; see A-007) | §8.6 |
| A-005 | Local SQLite, no cloud DB | §8.5 |
| A-006 | Castilian variant pinned via system prompt; no separate fine-tuning | §8.6, R-009 |
| A-007 | Pronunciation grading in scope for V1 via pluggable PronunciationScorer; V1 default = WhisperHeuristicScorer; Azure later (was: Azure required in V1) | §4.1 #7, §8.6 |
| A-008 | Single-user, localhost-bound, no auth | §3, §10 |
| A-009 | Pacer pacing setting "moderate": thresholds at the values listed in FR-021 through FR-024 | §4.5, §5 |
| A-010 | Pacer is advance-only; regression is a manual user action triggered from the dashboard | §4.5, FR-025 |
| A-011 | All four Pacer levers visible to user as separate dashboard panels, not a single combined dial | §4.1, §9.3 |
| A-012 | Pacer threshold defaults will be calibrated against ≥4 weeks of real-use data before being treated as final | §16.1 R-011 |
| A-013 | Breadth lever advances in batches of 10 word-senses (default, configurable) | FR-029 |
| A-014 | Audio recordings retained indefinitely (no rotation policy); user has personal storage to absorb the volume | §6.5 |
| A-015 | Opus/Sonnet routing per §8.6 to balance evaluation quality with monthly cost | §8.6, §14 |
| A-016 | Conversational on-ramp acts as an automatic vocabulary growth pathway with bidirectional outcomes (fast-track or priority-drill) | §4.6, FR-036–039 |
| A-017 | Seed vocabulary acquired by scraping CVC's Plan curricular HTML pages and structuring with Claude (Sonnet); manual cleanup expected for ~20% of entries | §15 Milestone 7, R-019 |
| A-018 | User experience structured around four explicit modes: Vocabulary, Sentences, Speaking, Conversation. Each is a separate session type with its own start button on the dashboard. | §4.1 #3, §4.3, FR-040 |
| A-019 | Pronunciation tracking uses a coverage model (numerator over denominator), not a mastery model. Two coverages: phoneme % out of ~24, word-pronunciation % out of active pool. | §4.7, FR-035 |
| A-020 | Pronunciation transitions are boolean per attempt with confirmation (2 corrects in different sessions = covered) and decay (3 consecutive failures = uncovered). | §4.7, FR-034 |
| A-021 | "Explain in English" is a first-class interaction primitive available in every Spanish-content mode, invocable as button always and as voice command in voice modes. | FR-042–044 |
| A-022 | Pronunciation Engine is invoked only by Speaking mode in V1; Conversation mode does not score pronunciation per utterance (deferred to V2). | FR-045 |
| A-023 | FR-017 (manual approval buffer for tutor-surfaced words) was deleted in v0.5; the §4.6 conversational on-ramp supersedes it. Adding a manual approval gate would defeat the on-ramp's purpose, which is to make conversation an automatic vocabulary growth path. The post-session summary (FR-039) provides visibility without creating a gate; misclassifications are handled by manual regression. | §4.6, FR-039 |
| A-024 | FSRS rating mapping is INCORRECT→Again, PARTIAL<0.7→Hard, PARTIAL≥0.7→Good, CORRECT→Good. The Easy rating is reserved for V2. | §5 FR-046 |
| A-025 | Pacer cooldowns are calendar time, not active-use time. A break still counts toward the cooldown. | §4.5 |
| A-026 | FsrsState is created lazily on first exposure (UNTESTED → LEARNING), not at sense registration. | §6.2 |
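
The rating mapping recorded in A-024 (FR-046) is small enough to show in full. The sketch below is illustrative only: `ExerciseOutcome` and its three members come from the spec, while `FsrsRating`, `to_fsrs_rating`, and the `partial_score` parameter (the 0-to-1 number that the 0.7 threshold is compared against) are assumed names. The integer values follow the usual FSRS convention of Again=1 through Easy=4.

```python
from enum import Enum
from typing import Optional


class ExerciseOutcome(Enum):
    INCORRECT = "INCORRECT"
    PARTIAL = "PARTIAL"
    CORRECT = "CORRECT"


class FsrsRating(Enum):
    AGAIN = 1
    HARD = 2
    GOOD = 3
    # EASY (4) is reserved for V2 and is never produced by this mapping.


# Threshold from A-024; a PARTIAL at or above it counts as Good.
PARTIAL_GOOD_THRESHOLD = 0.7


def to_fsrs_rating(outcome: ExerciseOutcome,
                   partial_score: Optional[float] = None) -> FsrsRating:
    """Map an exercise outcome to an FSRS rating per A-024.

    `partial_score` is an assumed name for the score attached to a
    PARTIAL outcome; it is ignored for the other two outcomes.
    """
    if outcome is ExerciseOutcome.INCORRECT:
        return FsrsRating.AGAIN
    if outcome is ExerciseOutcome.CORRECT:
        return FsrsRating.GOOD
    if partial_score is None:
        raise ValueError("PARTIAL outcomes need a partial_score")
    if partial_score >= PARTIAL_GOOD_THRESHOLD:
        return FsrsRating.GOOD
    return FsrsRating.HARD
```

Keeping the threshold as a named constant makes it easy to revisit during the week-4 calibration pass without touching the mapping logic.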

17. Glossary

  • CEFR — Common European Framework of Reference for Languages. The A0/A1/A2/B1/B2/C1/C2 scale.
  • A0 — Pre-A1, "absolute beginner." Not one of the six official CEFR levels but commonly used. ("Breakthrough" is the CEFR's own name for A1.)
  • FSRS — Free Spaced Repetition Scheduler. Modern open-source algorithm, built into Anki since release 23.10 as an alternative to the older SM-2.
  • Sense — a specific meaning of a word. Quedar has senses including "to remain", "to agree to meet", "to fit (clothing)", "to look (a certain way)."
  • STT / TTS — Speech-to-text / text-to-speech.
  • Stretch budget — the number of words above your current vocabulary that the tutor is permitted to introduce in a single conversation.
  • Thematic domain — a topic area (greetings, family, food, weather, etc.) by which the CEFR-aligned Plan curricular groups vocabulary. Used for coverage tracking.
  • Pacer — the macro-progression engine that decides when to push the user along each of four difficulty levers.
  • Lever — one axis of difficulty controlled independently by the Pacer: Breadth, Depth, Grammar, or Production.
  • Active pool — the set of word-senses currently eligible to appear in exercises and tutor speech, governed by the Breadth lever's state.
  • Cooldown — minimum time after a Pacer advancement before the same lever can be evaluated for advancement again. Prevents overshoot from stale signal.
  • Drift signal — Pacer-computed indicator that the user is failing on recently-advanced content, used to surface non-blocking "ease back" suggestions.
  • Advance-only ratchet — the Pacer never automatically reduces a lever's state; only the user can do that, manually, from the dashboard.
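  • Pronunciation mastery — older term for what the spec now tracks as pronunciation coverage (see Coverage below). It is separate from semantic mastery and records whether the user can say a word correctly, as judged by the configured PronunciationScorer (WhisperHeuristicScorer in V1; Azure Speech, which scores at the phoneme level, is a later option). A word can be CONFIRMED for semantic mastery but still uncovered for pronunciation.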
  • Conversational on-ramp — the mechanic by which words introduced in tutor speech are automatically routed into the vocabulary tracking system based on Claude's evaluation of whether the user comprehended them. See §4.6.
  • Priority drill — a flag set on a word when the user struggled with it during conversation; ensures the next daily session pulls that word in for explicit drill ahead of FSRS scheduling.
  • CVC — Centro Virtual Cervantes; the Instituto Cervantes's online resource portal, hosting the Plan curricular HTML.
  • Mode — a top-level user-facing session type. The system has four: Vocabulary, Sentences, Speaking, Conversation. Each has its own UI screen, exercise mix, and success criteria.
  • PronunciationScorer — pluggable interface that takes audio + expected text and returns a boolean correct/incorrect plus confidence. V1 implementation is WhisperHeuristicScorer; future implementations include AzurePronunciationScorer.
  • Coverage (pronunciation) — the percentage-based metric for pronunciation skill. Two flavours: phoneme coverage (out of ~24 Castilian phonemes) and word-pronunciation coverage (out of active pool). Distinct from semantic vocabulary count.
  • Explain in English — first-class interaction primitive invocable in every Spanish-content mode (always as a button, and as a voice command in voice modes); sends current context to Claude for an English-language explanation without ending or advancing the session.
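
The pronunciation coverage rules referenced in the glossary and in A-020 (two corrects in different sessions confirm a word, three consecutive failures decay it, NOT_DETECTED changes nothing) can be sketched as a small state holder. All class and field names below are hypothetical. Two behaviours are assumptions the spec does not settle: a correct attempt resets the failure streak, and a decayed word must earn its two sessions again from scratch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Iterable


class AttemptResult(Enum):
    CORRECT = "CORRECT"
    INCORRECT = "INCORRECT"
    NOT_DETECTED = "NOT_DETECTED"  # operational outcome; never affects coverage


@dataclass
class WordPronunciationState:
    covered: bool = False
    correct_sessions: set[str] = field(default_factory=set)
    consecutive_failures: int = 0

    def record(self, result: AttemptResult, session_id: str) -> None:
        if result is AttemptResult.NOT_DETECTED:
            return
        if result is AttemptResult.CORRECT:
            self.consecutive_failures = 0  # assumption: a correct resets the streak
            self.correct_sessions.add(session_id)
            # Confirmation: corrects in two different sessions.
            if len(self.correct_sessions) >= 2:
                self.covered = True
            return
        # INCORRECT: decay only after three failures in a row.
        self.consecutive_failures += 1
        if self.consecutive_failures >= 3:
            self.covered = False
            self.correct_sessions.clear()  # assumption: confirmation starts over
            self.consecutive_failures = 0


def coverage(states: Iterable[WordPronunciationState]) -> float:
    """Fraction of tracked words currently covered (0.0 for an empty pool)."""
    states = list(states)
    if not states:
        return 0.0
    return sum(s.covered for s in states) / len(states)
```

If the three-failure threshold feels punishing in practice (R-023), only the literal `3` in `record` needs to change.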

Appendix A — The validation argument, in plain prose

The spec's most opinionated claim is FR-003 + FR-004: a word-sense isn't "yours" until you've used it correctly in three different exercise types over at least 24 hours, and survived a fresh-context test in the next weekly review. That's slow. That's deliberate. The reason: vocabulary apps that count a word as "learned" after one or two correct answers are measuring recognition, not mastery, and produce inflated numbers that don't survive contact with real Spanish. By raising the bar — multiple formats, time-spaced, fresh contexts at review — the confirmed count is a number you can actually trust.

The cost is that growth feels slow at first. That's fine; calibration after 4 weeks of real use will tell you if the gates are too strict.

Appendix B — Why FSRS over SM-2

SM-2 (the algorithm Anki originally used and many imitators still use) treats every card the same way: ease factor adjusts up or down, intervals are deterministic. FSRS models each card with three latent variables (stability, difficulty, retrievability), fits parameters from your actual review history, and predicts the optimal next interval to hit a target retention rate. In practice it gives ~20–30% fewer reviews for the same retention. Anki has shipped it as a built-in scheduler since release 23.10 (late 2023). Use it.
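
To show what "predicts the optimal next interval to hit a target retention rate" means in practice, the sketch below uses the power forgetting curve published for FSRS-4.5. The constants are an assumption tied to that version; other FSRS releases use a different or trainable curve, so whichever FSRS library the project adopts should be treated as the source of truth. The function names are illustrative.

```python
# FSRS-4.5 forgetting-curve constants (assumed version; see lead-in).
DECAY = -0.5
FACTOR = 19 / 81  # chosen so that retrievability is 0.9 when elapsed == stability


def retrievability(elapsed_days: float, stability: float) -> float:
    """Predicted recall probability after `elapsed_days` for a given stability."""
    return (1 + FACTOR * elapsed_days / stability) ** DECAY


def next_interval(stability: float, target_retention: float = 0.9) -> float:
    """Days to wait so that predicted recall falls to `target_retention`."""
    return stability / FACTOR * (target_retention ** (1 / DECAY) - 1)
```

With a 0.9 retention target the interval equals the stability itself; lowering the target stretches the interval, which is where the saving in review count comes from.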

Appendix C — Suggested seed vocabulary source

The Plan curricular del Instituto Cervantes is the canonical CEFR-aligned reference for Spanish. It lists vocabulary by level and thematic domain, and it's specifically Castilian. It is free online, hosted on the CVC. Use this as the source of truth for what counts as A0/A1 vocabulary rather than scraping random word lists.

Appendix D — The Pacer's signal philosophy

The Pacer's job is to notice when you have headroom — not just when you're succeeding. These are different things and the distinction matters.

Succeeding means: this morning's session went well. That is a noisy, short-window signal. A system that advanced on this would advance and retreat constantly, and you'd live in a thrashing equilibrium where the system never lets you settle and never lets you breathe.

Headroom means: you are succeeding and the success isn't effortful and it isn't on stale, over-rehearsed material and the validation gate (weekly review) is comfortably passing. That's a slower, more boring signal — and it's the one worth advancing on. The thresholds in FR-021 through FR-024 are written to enforce this distinction:

  • 85% accuracy on three formats prevents advancing when you're great at translation but failing at production.
  • 70% CANDIDATE→CONFIRMED rate at the most recent weekly review is the slowest-moving of all the signals — it's a once-a-week check on whether the validation gate is really working, not just whether you got lucky on a few exercises. This is the load-bearing signal.
  • 5-day cooldown lets new content settle into FSRS rotation before being evaluated. Without it, you'd advance, the new words wouldn't have shown up as failures yet, and you'd advance again.
  • The pool-saturation requirement on Breadth (≥80% of pool out of UNTESTED) prevents the case where the Pacer pulls in 10 new words, you don't get to them yet, and the Pacer pulls in 10 more.

Each of these is here because of a specific way the Pacer could fail without it. Removing one is fine if you're calibrating — but know which failure mode you're re-enabling.
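
The four gates above can be expressed as one check that reports which gate is blocking, which also supports the "log every signal value" mitigation in R-011. This is a sketch under assumptions: the field names and format names are hypothetical, "85% accuracy on three formats" is read as "at least three formats each at or above 85%", and the authoritative thresholds live in FR-021 through FR-024.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

ACCURACY_THRESHOLD = 0.85
MIN_STRONG_FORMATS = 3
CONFIRM_RATE_THRESHOLD = 0.70
COOLDOWN = timedelta(days=5)  # calendar days, per A-025
POOL_SATURATION_THRESHOLD = 0.80


@dataclass
class LeverSignals:
    accuracy_by_format: dict[str, float]   # e.g. {"translation": 0.9, ...}
    weekly_confirm_rate: float             # CANDIDATE→CONFIRMED at last weekly review
    last_advanced_on: Optional[date]       # None if the lever has never advanced
    pool_tested_fraction: Optional[float] = None  # Breadth only: share of pool out of UNTESTED


def blocking_reasons(signals: LeverSignals, today: date) -> list[str]:
    """Return the gates that currently block advancement (empty list = headroom)."""
    reasons = []
    strong = [f for f, acc in signals.accuracy_by_format.items()
              if acc >= ACCURACY_THRESHOLD]
    if len(strong) < MIN_STRONG_FORMATS:
        reasons.append("accuracy")
    if signals.weekly_confirm_rate < CONFIRM_RATE_THRESHOLD:
        reasons.append("confirm_rate")
    if (signals.last_advanced_on is not None
            and today - signals.last_advanced_on < COOLDOWN):
        reasons.append("cooldown")
    if (signals.pool_tested_fraction is not None
            and signals.pool_tested_fraction < POOL_SATURATION_THRESHOLD):
        reasons.append("pool_saturation")
    return reasons
```

Returning the list of blocking gates, not a single boolean, keeps each failure mode visible: removing a gate during calibration shows up as one reason string disappearing from the logs.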

The reason there are four levers, rather than one, is that the four kinds of fluency are genuinely independent. You can recognise 1500 words and produce 200; you can handle present indicative comfortably and freeze on subjunctive; you can know one sense of a word fluently and not the others. Pretending these advance together is what produces the Duolingo problem — looking advanced on the dashboard and being unable to hold a conversation. By making them separate, the system is honest about which kind of fluency it has evidence for.