An on-device English learning app for IELTS speaking and reading practice, built on KittenML neural TTS and an on-device Whisper speech recognizer. Read a passage aloud, and the app transcribes your voice, scores your reading accuracy (Word Error Rate), and estimates an IELTS band across the four official criteria. Everything runs offline; no internet connection is required after install.
It also keeps the original Voice Studio (text-to-speech) so you can hear a model read any passage before you try it yourself.
The app has three tabs: Reading, Speaking, and Voice Studio.
Reading (read-aloud)
- Pick from 200 bundled passages, graded across IELTS bands and topics.
- Listen: a neural voice reads the passage aloud (KittenTTS).
- Record & read aloud: your microphone audio is captured at 16 kHz.
- Score: Whisper-tiny.en (ONNX, on-device) transcribes your speech; the transcript is then aligned word by word against the passage to compute Word Error Rate, reading accuracy, and an IELTS-style band breakdown.
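
The scoring step above can be sketched as a word-level Levenshtein distance divided by the number of words in the passage. This is a minimal illustration, not the contents of WerScorer.kt: the function names and the normalization rules (lowercasing, stripping punctuation) are assumptions.

```kotlin
// Sketch only: names and normalization rules are assumptions,
// not taken from WerScorer.kt.

/** Lowercases, replaces punctuation (except apostrophes) with spaces, and splits on whitespace. */
fun normalizeWords(text: String): List<String> =
    text.lowercase()
        .replace(Regex("[^a-z0-9' ]+"), " ")
        .trim()
        .split(Regex("\\s+"))
        .filter { it.isNotEmpty() }

/** Word-level Levenshtein distance: minimum substitutions + deletions + insertions. */
fun wordEditDistance(reference: List<String>, hypothesis: List<String>): Int {
    // prev[j] = distance between the reference words seen so far and the first j hypothesis words
    var prev = IntArray(hypothesis.size + 1) { it }
    for (i in 1..reference.size) {
        val curr = IntArray(hypothesis.size + 1)
        curr[0] = i
        for (j in 1..hypothesis.size) {
            val cost = if (reference[i - 1] == hypothesis[j - 1]) 0 else 1
            curr[j] = minOf(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        }
        prev = curr
    }
    return prev[hypothesis.size]
}

/** WER = edit distance / number of reference words. It can exceed 1.0 when the transcript has many extra words. */
fun wordErrorRate(passage: String, transcript: String): Double {
    val ref = normalizeWords(passage)
    val hyp = normalizeWords(transcript)
    if (ref.isEmpty()) return if (hyp.isEmpty()) 0.0 else 1.0
    return wordEditDistance(ref, hyp).toDouble() / ref.size
}
```

For example, reading "The cat sat on the mat." as "the cat sat on a mat" gives one substitution over six reference words.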
Speaking (free response)
- Pick an IELTS-style prompt (Part 1 questions, Part 2 cue cards, Part 3 discussion).
- Record your answer in your own words.
- Score: Whisper transcribes your answer, then a fluency analyzer (pace, pauses, fillers) and a free-speech scorer estimate four IELTS-style criteria from your own vocabulary and grammar: Task Response, Fluency & Coherence, Lexical Resource, and Grammatical Range & Accuracy, plus an overall band. The result also shows which prompt points you covered, along with the transcript.
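
The fluency signals mentioned above (pace, pauses, fillers) could be computed roughly as follows. This is a hedged sketch: the filler list, the 20 ms frame size, the RMS silence threshold, and the 700 ms minimum pause are all illustrative assumptions, not values from FluencyAnalyzer.kt.

```kotlin
import kotlin.math.sqrt

// Illustrative sketch: thresholds, filler list, and names are assumptions,
// not values taken from FluencyAnalyzer.kt.

val FILLERS = setOf("um", "uh", "er", "erm", "hmm")

/** Speaking pace from the transcript word count and the recording length. */
fun wordsPerMinute(transcript: String, durationSeconds: Double): Double {
    val words = transcript.trim().split(Regex("\\s+")).count { it.isNotEmpty() }
    return if (durationSeconds <= 0.0) 0.0 else words * 60.0 / durationSeconds
}

/** Counts transcript tokens that appear in the filler list. */
fun countFillers(transcript: String): Int =
    transcript.lowercase().split(Regex("[^a-z']+")).count { it in FILLERS }

/** Counts runs of quiet frames lasting at least minPauseMs. Samples are expected in [-1, 1]. */
fun countLongPauses(
    samples: FloatArray,
    sampleRate: Int = 16_000,
    frameMs: Int = 20,
    silenceRms: Double = 0.01,
    minPauseMs: Int = 700
): Int {
    val frameLen = sampleRate * frameMs / 1000
    val minFrames = (minPauseMs + frameMs - 1) / frameMs
    var pauses = 0
    var quietRun = 0
    var start = 0
    while (start + frameLen <= samples.size) {
        var sumSq = 0.0
        for (k in start until start + frameLen) sumSq += samples[k].toDouble() * samples[k]
        val quiet = sqrt(sumSq / frameLen) < silenceRms
        if (quiet) {
            quietRun++
            if (quietRun == minFrames) pauses++ // count each pause once, when it becomes "long"
        } else {
            quietRun = 0
        }
        start += frameLen
    }
    return pauses
}
```

An energy threshold like this is sensitive to background noise, so a real analyzer would likely need calibration per recording.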
IELTS reading practice
- 200 bundled read-aloud passages graded by IELTS band (4.5–9) and topic
- On-device speech recognition (Whisper-tiny.en, ONNX) — fully offline
- Word Error Rate scoring with word-level highlighting (correct / misread / skipped)
- IELTS band estimate across the four official criteria
- "Listen" button — hear a neural voice read the passage first
IELTS speaking practice (free response)
- IELTS-style prompts across Part 1, Part 2 (cue cards), and Part 3
- Answer in your own words; Whisper transcribes on-device
- Fluency analysis from the audio: words-per-minute, long pauses, filler words
- Full four-criteria band estimate from your own vocabulary and grammar
- Prompt-point coverage feedback + transcript
Voice Studio (text-to-speech)
- 3 model sizes: Nano (15M), Micro (40M), Mini (80M)
- 8 voices: Rosie, Bella, Jasper, Luna, Bruno, Hugo, Kiki, Leo
- Adjustable speed (0.5x – 2.0x)
- Download generated audio as WAV to device
- Long text / large context support — paste entire articles, stories, or paragraphs. Text is automatically split into chunks at sentence boundaries, each chunk is synthesized independently, and the audio is seamlessly concatenated into a single output.
- 100% on-device inference via ONNX Runtime
- Dark theme UI matching the iOS version
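
The long-text handling described above can be sketched as greedy packing of sentences into chunks. The 400-character default matches the limit stated in the TTS pipeline; the function name, the sentence-splitting regex, and the word-level fallback for oversized sentences are assumptions for illustration.

```kotlin
// Sketch under assumptions: the splitting rules and the name chunkText are
// illustrative, not the app's actual implementation.

fun chunkText(text: String, maxChars: Int = 400): List<String> {
    require(maxChars > 0)
    // Split after '.', '!' or '?' when followed by whitespace.
    val sentences = text.trim().split(Regex("(?<=[.!?])\\s+")).filter { it.isNotBlank() }
    // A sentence longer than the limit falls back to word-level pieces.
    // A single word longer than maxChars still becomes its own oversized chunk.
    val pieces = sentences.flatMap { s ->
        if (s.length <= maxChars) listOf(s) else s.split(Regex("\\s+"))
    }
    val chunks = mutableListOf<String>()
    val current = StringBuilder()
    for (piece in pieces) {
        // Flush the current chunk if adding this piece (plus a space) would exceed the limit.
        if (current.isNotEmpty() && current.length + 1 + piece.length > maxChars) {
            chunks.add(current.toString())
            current.setLength(0)
        }
        if (current.isNotEmpty()) current.append(' ')
        current.append(piece)
    }
    if (current.isNotEmpty()) chunks.add(current.toString())
    return chunks
}
```

Each returned chunk would then be synthesized on its own and the resulting audio buffers appended in order.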
Microphone (16 kHz mono PCM)
→ Log-mel spectrogram (80 bins, Whisper spec)
→ Whisper encoder ONNX → hidden states
→ Whisper decoder ONNX → greedy token decode (30 s windows)
→ Byte-level BPE detokenize → transcript
→ WER alignment vs. passage → accuracy + highlighting
→ IELTS four-criteria band estimate
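
As an illustration of the 30 s windowing step, the sketch below splits 16 kHz mono samples into fixed windows of 480,000 samples and zero-pads the last one. The function and constant names are assumptions, and the app may handle window boundaries differently.

```kotlin
// Sketch: 16 kHz and 30 s come from the pipeline above; the names and the
// zero-padding of the final window are assumptions about the implementation.

const val ASR_SAMPLE_RATE = 16_000
const val WINDOW_SECONDS = 30
const val WINDOW_SAMPLES = ASR_SAMPLE_RATE * WINDOW_SECONDS // 480,000 samples

fun splitIntoWindows(pcm: FloatArray): List<FloatArray> {
    val windows = mutableListOf<FloatArray>()
    var start = 0
    while (start < pcm.size) {
        val end = minOf(start + WINDOW_SAMPLES, pcm.size)
        // A new FloatArray is zero-filled, so a short final window is padded with silence.
        windows.add(pcm.copyInto(FloatArray(WINDOW_SAMPLES), 0, start, end))
        start = end
    }
    return windows
}
```

Each window would then go through the log-mel and encoder/decoder stages independently, with the per-window transcripts joined afterwards.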
The Whisper ONNX models (quantized whisper-tiny.en, ~41 MB) are bundled in the APK
(assets/asr/) and run on the same ONNX Runtime used for TTS. When building from source, fetch them with
tools/download_whisper_onnx.py (no PyTorch needed);
see app/src/main/assets/asr/README.md for details. The pipeline
(mel spectrogram + greedy decode + byte-level tokenizer) is validated to match the
reference Hugging Face implementation exactly.
The IELTS band estimate is an honest approximation derived from reading accuracy, coverage, and pace. A read-aloud task cannot fully measure free-speech lexical or grammatical range, so those criteria are estimated through proxies, which are documented in
IeltsScorer.kt.
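
To make the idea of a heuristic band estimate concrete, here is a purely hypothetical sketch that blends accuracy, coverage, and pace scores into a half-band value. The weights, the linear mapping, and the 4.0 floor are invented for illustration; the app's real rules are in IeltsScorer.kt and BandUtil.kt.

```kotlin
import kotlin.math.roundToInt

// Hypothetical sketch: the linear mapping, weights, and band floor are
// assumptions for illustration, not the app's actual heuristics.

/** Rounds to the nearest half band and clamps to the 0 to 9 scale. */
fun toHalfBand(raw: Double): Double =
    ((raw * 2).roundToInt() / 2.0).coerceIn(0.0, 9.0)

/** Maps a quality score in [0, 1] linearly onto a band range. */
fun qualityToBand(quality: Double, minBand: Double = 4.0, maxBand: Double = 9.0): Double =
    toHalfBand(minBand + quality.coerceIn(0.0, 1.0) * (maxBand - minBand))

/** Blends accuracy, coverage, and pace scores (each in [0, 1]) into one band. */
fun readAloudBand(accuracy: Double, coverage: Double, pace: Double): Double =
    qualityToBand(0.6 * accuracy + 0.2 * coverage + 0.2 * pace)
```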
Text Input (any length)
→ Auto-chunking (max 400 chars at sentence boundaries)
→ Per-chunk: Punctuation normalization
→ Per-chunk: espeak-ng phonemization (JNI/NDK)
→ Per-chunk: IPA tokenization (178-token vocabulary)
→ Per-chunk: ONNX Runtime inference (24 kHz Float32 PCM)
→ Concatenate all chunk audio
→ AudioTrack playback / WAV download
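
The final WAV export step could look like the sketch below, which wraps float samples in a standard 44-byte RIFF/WAVE header. It assumes mono audio converted to 16-bit integer PCM at 24 kHz; the app's export may use a different sample format, and the function name is hypothetical.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder

// Sketch: assumes mono output converted to 16-bit PCM; the function name
// is hypothetical and the app's actual export may differ.

fun floatPcmToWav(samples: FloatArray, sampleRate: Int = 24_000): ByteArray {
    val dataBytes = samples.size * 2
    val buf = ByteBuffer.allocate(44 + dataBytes).order(ByteOrder.LITTLE_ENDIAN)
    buf.put("RIFF".toByteArray(Charsets.US_ASCII))
    buf.putInt(36 + dataBytes)                 // RIFF chunk size = file size minus 8
    buf.put("WAVE".toByteArray(Charsets.US_ASCII))
    buf.put("fmt ".toByteArray(Charsets.US_ASCII))
    buf.putInt(16)                             // fmt chunk size for PCM
    buf.putShort(1.toShort())                  // audio format 1 = integer PCM
    buf.putShort(1.toShort())                  // channels (mono)
    buf.putInt(sampleRate)
    buf.putInt(sampleRate * 2)                 // byte rate = rate * channels * bytes per sample
    buf.putShort(2.toShort())                  // block align = channels * bytes per sample
    buf.putShort(16.toShort())                 // bits per sample
    buf.put("data".toByteArray(Charsets.US_ASCII))
    buf.putInt(dataBytes)
    for (s in samples) {
        // Clamp to [-1, 1] before scaling so out-of-range samples do not wrap around.
        buf.putShort((s.coerceIn(-1f, 1f) * 32767f).toInt().toShort())
    }
    return buf.array()
}
```

The returned bytes can be written directly to a `.wav` file.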
| Component | Technology |
|---|---|
| UI | Kotlin + Jetpack Compose + Navigation Compose |
| ML Inference | ONNX Runtime Android (TTS + Whisper ASR) |
| Speech recognition | Whisper-tiny.en (ONNX, on-device) |
| Phonemization | espeak-ng (C via JNI/NDK) |
| Audio | AudioTrack playback (24 kHz) · AudioRecord capture (16 kHz) |
| Scoring | Word Error Rate (Levenshtein) + IELTS band heuristics |
| Build | Gradle KTS, Android NDK, CMake |
Get the latest APK from Releases.
- Android Studio (latest)
- Android SDK 34
- Android NDK 27+
- JDK 17
- Clone the repo:

      git clone https://github.com/rockerritesh/kitten-tts-android.git
      cd kitten-tts-android
      git lfs pull

- Build the espeak-ng native library:

      ./build-espeak-ng.sh

- Fetch the Whisper ASR model into assets/asr/ (needed for speaking/reading scoring):

      python3 tools/download_whisper_onnx.py

- Open in Android Studio and build, or:

      ./gradlew assembleDebug

app/src/main/
├── java/com/kittenml/tts/
│   ├── MainActivity.kt              # Entry point + bottom-nav (Practice / Voice Studio)
│   ├── engine/                      # TTS
│   │   ├── KittenTTSEngine.kt       # Core TTS pipeline
│   │   ├── EspeakBridge.kt          # JNI wrapper
│   │   └── AudioPlayer.kt           # AudioTrack playback
│   ├── asr/                         # Speech recognition
│   │   ├── AudioRecorder.kt         # 16 kHz mic capture
│   │   ├── MelSpectrogram.kt        # 80-bin log-mel features
│   │   ├── WhisperTokenizer.kt      # byte-level BPE decode
│   │   ├── WhisperAsrEngine.kt      # encoder/decoder ONNX inference
│   │   └── AsrState.kt
│   ├── scoring/
│   │   ├── WerScorer.kt             # Word Error Rate + alignment
│   │   ├── ScoreResult.kt
│   │   ├── IeltsScorer.kt           # read-aloud four-criteria estimate
│   │   ├── IeltsAssessment.kt
│   │   ├── BandUtil.kt              # quality → IELTS band helpers
│   │   ├── FluencyAnalyzer.kt       # pace / pauses / fillers from audio
│   │   └── FreeSpeechScorer.kt      # free-response four-criteria estimate
│   ├── data/
│   │   ├── Paragraph.kt / ParagraphRepository.kt
│   │   └── SpeakingPrompt.kt / SpeakingPromptRepository.kt
│   ├── ui/
│   │   ├── theme/                   # Dark theme (Color, Theme, Type)
│   │   └── screen/
│   │       ├── TTSScreen.kt         # Voice Studio UI
│   │       ├── TTSViewModel.kt
│   │       ├── practice/            # reading: list + record/score + ViewModel
│   │       └── speaking/            # free speaking: list + record/score + ViewModel
│   └── model/
│       ├── TTSModel.kt              # Model enum
│       └── EngineState.kt           # Engine state
├── cpp/
│   ├── espeak-bridge.c              # C phonemization bridge
│   ├── espeak-jni.c                 # JNI glue layer
│   └── CMakeLists.txt               # NDK build config
└── assets/
    ├── models/                      # TTS ONNX model files (~168 MB)
    ├── voices/                      # Voice embedding JSONs (~51 MB)
    ├── espeak-ng-data/              # Phoneme data files (~1 MB)
    ├── asr/                         # Whisper ONNX + vocab (see asr/README.md)
    └── ielts/
        ├── paragraphs.json          # 200 read-aloud passages
        └── speaking_prompts.json    # free-speaking prompts (Parts 1–3)
The bundled passages live in app/src/main/assets/ielts/paragraphs.json,
each with id, title, topic, band, and text. The app ships with 200 graded
passages across 15 topics and IELTS bands 4.5–9.0. Add or edit passages by changing this
file — no code change is needed, as the list is read at runtime. Free-speaking prompts
live alongside it in speaking_prompts.json.
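
When editing paragraphs.json, it can help to think of each entry as a small record with the five fields listed above. The sketch below models that shape and a few sanity checks. The class name `PassageEntry`, the Kotlin field types, and the validation rules are assumptions; the app's own model is in Paragraph.kt.

```kotlin
// Sketch: field names follow the README (id, title, topic, band, text);
// the class name, Kotlin types, and validation rules are assumptions.

data class PassageEntry(
    val id: String,
    val title: String,
    val topic: String,
    val band: Double,
    val text: String
)

/** Returns a list of problems; an empty list means the entry looks usable. */
fun validate(p: PassageEntry): List<String> {
    val problems = mutableListOf<String>()
    if (p.id.isBlank()) problems.add("id is blank")
    if (p.text.isBlank()) problems.add("text is blank")
    if (p.band < 4.5 || p.band > 9.0) problems.add("band ${p.band} is outside 4.5 to 9.0")
    if ((p.band * 2) % 1.0 != 0.0) problems.add("band ${p.band} is not a half-band step")
    return problems
}

/** Selects passages for one topic within a band range (inclusive). */
fun filterPassages(
    all: List<PassageEntry>,
    topic: String,
    minBand: Double,
    maxBand: Double
): List<PassageEntry> =
    all.filter { it.topic == topic && it.band >= minBand && it.band <= maxBand }
```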
Apache 2.0

