Scribe

Transcribe, translate, and burn subtitles into videos — 100% on your Mac.

Plus a real-time Live Captions window that captions whatever you hear from your speakers.

What it does

Drop a video in, pick an ASR engine, hit Start. The whole pipeline is local except for the optional translation API call.

video.mp4  →  audio.wav  →  transcript     →  translated SRT  →  subtitled video
              (ffmpeg)      (Parakeet on ANE,  (OpenAI / Google   (ffmpeg + libass)
                            Qwen3 0.6B/1.7B    / off)
                            on MLX with
                            forced aligner)

Or open Live Captions (⌘⇧L) and talk: it captures system audio with ScreenCaptureKit and streams a transcript into a notepad-shaped window in real time.

_{Drop videos, pick an engine, hit Start. Settings live in the native Inspector.}

Why I built it

I had a Python prototype using PyQt6 + MLX that shipped as a ~200 MB PyInstaller bundle with a 3-second cold start and a UI that felt Linux-y on macOS. I wanted to see how far a modern Swift rewrite could go if it followed SOLID strictly and shipped a native UI.

Result: native UI, sub-second launch, then about a year of feature growth (Qwen3 ASR, Live Captions, customisable subtitle styling) on top of the same SOLID core.

The numbers

	Python (original)	Swift (Scribe)
App binary (Mach-O)	~200 MB	69 MB
Cold launch	2–5 s	< 0.5 s
ASR runtime	MLX (GPU)	CoreML (ANE) + MLX
Tests	0	168 across 26 suites
Transcript languages	English	English (Parakeet) + 30+ via Qwen3 ASR
External dependencies	6	2 SPM (FluidAudio + speech-swift) + 2 binary (sherpa-onnx + ONNX Runtime, fetched on demand)

The full .app bundle is ~160 MB because it ships ffmpeg-full, dylibbundler-rewired Homebrew dylibs, the MLX Metal shader library, and an ad-hoc code signature so end users never have to install anything.

Pair-programmed with Claude Code

Scribe was built end-to-end with Claude Code (Opus 4.7 · 1M context) as an experiment in AI pair programming. I drove the product decisions and reviewed every change; Claude wrote most of the code, suggested refactors, and caught its own SOLID violations when I asked for an audit.

The build followed strict Red → Green → Refactor TDD: 168 tests across 26 suites were written before the implementation they test. When the audit flagged ProcessingViewModel for juggling five responsibilities and a Dependency Inversion violation, the refactor split it into four focused services without breaking a single test.

Architecture

UI (SwiftUI)
   │
   ▼
Core (pipeline, parsers, orchestration)
   │
   ▼
Protocols ◄── Infrastructure (FFmpeg, FluidAudio, MLX, sherpa-onnx, OpenAI, Google)
   │
   ▼
Domain (pure value types · zero imports)

Five SPM targets with strictly unidirectional dependencies. Zero singletons. The composition root assembles concrete types exactly once per pipeline run, so every component is replaceable and every layer is independently testable.

Features

Burn pipeline — speech recognition

User picks the engine in Settings; the pipeline injects the matching SpeechRecognizing conformer at runtime.

Parakeet TDT 0.6B v2 — English-only, FluidAudio on the Apple Neural Engine. ~120× realtime on M4 Pro. Default.
Qwen3-ASR 0.6B (4-bit MLX) — multilingual incl. zh/en code-switching. ~342 MB first-run download.
Qwen3-ASR 1.7B (4-bit MLX) — same coverage, higher accuracy, ~3× the runtime. ~700 MB first-run download.

Both Qwen3 paths run on soniqo/speech-swift (MLX-Swift). The companion Qwen3-ForcedAligner-0.6B model produces acoustic word-level timestamps so SRT cue boundaries land on real speech pauses; if alignment fails, the pipeline falls back to a char-weighted SentenceChunker.

Live Captions (⌘⇧L)

A separate window that captures system audio via ScreenCaptureKit and streams a transcript into a notepad-shaped scroll view in real time.

Nemotron 0.6B — true streaming (~1.1 s latency), English-only
Zipformer zh-XLarge — sherpa-onnx Mandarin-focused transducer (int8, 2025-06-30 release), larger than the original bilingual Zipformer for noticeably better Chinese WER. ~570 MB.
Paraformer zh-yue-en — sherpa-onnx streaming non-autoregressive Paraformer covering Mandarin + Cantonese + English in a shared decoder. ~999 MB.

Selection → instant translate popup. Highlight any line of the live transcript and a popover slides in with the translation: Apple's on-device Translation framework first, falling back to Google Translate when the language pack isn't installed (or after a 4-second watchdog if the framework silently hangs). Direction is auto-detected from CJK character ratio — Chinese selection → English popup, English selection → Chinese popup.

Toolbar exports the rolling transcript as SRT, plain text, or copies it to the clipboard. macOS Screen-Recording permission is requested on first start.

Translation

Off / OpenAI / Google — single 3-way picker; OpenAI mode reveals base URL / API key / model / system prompt / batching steppers
OpenAI-compatible APIs (OpenAI, OpenRouter, Azure, local proxies — anything that speaks /v1/chat/completions)
Google Translate free endpoint (no API key)
Decorator-pattern fallback: if OpenAI fails, try Google; if both fail, keep originals
Smart batch splitting with a multi-strategy separator-recovery heuristic for malformed responses

Subtitle styling

Concrete SubtitleStyle struct: font / size / colours / box style / vertical + horizontal margins. No preset abstraction layer — every field is directly editable.
macOS-bundled fonts only: New York (default serif), Helvetica Neue, Times, Hiragino Sans GB. Resolved through libass via fontsdir=/System/Library/Fonts.
ffmpeg's subtitles filter receives original_size=WxH, so FontSize is interpreted in real pixels rather than scaled against libass's implicit PlayResY=288 (the wall-of-text bug on portrait clips).
Margins keep cues inside the canvas; libass auto-wraps before the edge.

Video

Audio extraction with optional VideoToolbox hardware acceleration
Subtitle burn-in via FFmpeg's subtitles filter + libass
Bilingual SRT side-car (_en.srt, _bi.srt)
ffmpeg-full bundled in the release DMG so users never have to install it

Native macOS UI

SwiftUI .toolbar, .inspector, .dropDestination, .regularMaterial
Whole-window drop target — drag more videos in over the queue, no dedicated drop bar
List(selection:) + ⌫ to remove the selected queued item
Drag the finished .mp4 back out to Finder, or the SRT side-car
Video thumbnails, duration, resolution, file size (via AVAssetImageGenerator)
Settings auto-apply — every control writes through on .onChange, no Apply button
ProcessingViewModel.addVideos dedups against the existing queue
System notifications on completion, transient toasts for in-session feedback

Keyboard shortcuts


`⌘O`	Open videos
`⌘R`	Start processing
`⌘⇧L`	Open Live Captions
`⌘,`	Toggle settings
`⌘⇧C`	Copy Live Captions transcript
`⌘⇧⌫`	Clear Live Captions
`⌫`	Remove selected queue item
`⌘Q`	Quit

Install

From release

Download Scribe-<version>.dmg from the Releases page
Open the DMG, drag Scribe to Applications
First launch: right-click → Open (the build is ad-hoc signed, so Gatekeeper needs explicit permission once)

From source

git clone https://github.com/itchat/Scribe.git
cd Scribe
brew install ffmpeg-full dylibbundler   # ffmpeg ships libass; dylibbundler rewires its dylibs
./scripts/fetch-sherpa-onnx.sh          # one-time: vendor sherpa-onnx + ONNX Runtime xcframeworks
swift run Scribe                        # dev run
./scripts/build-app.sh --dmg            # package .app + .dmg into dist/

build-app.sh also runs scripts/build-mlx-metallib.sh to compile MLX's Metal shaders into Contents/MacOS/mlx.metallib — without this, the Qwen3 engines crash at runtime with "Failed to load the default metallib".

Requirements: macOS 15 (Sequoia) or later, Apple Silicon (M1+). Full Xcode is needed at build time (xcrun metal is not in the Command Line Tools).

Design highlights

A few pieces I'm happy with:

Pipeline is pure orchestration. VideoPipeline has six injected protocol dependencies and knows nothing about FFmpeg, OpenAI, or CoreML. The translationMode and skipSubtitleBurning flags are independent — skipping translation alone still burns the original-language SRT into the video.

Translator is a decorator. FallbackTranslator(primary: openAI, fallback: google) conforms to the same protocol as its children. You can nest it arbitrarily without touching the pipeline.

FFmpeg is three small types, not one big one. A Locator that finds the binary, a CommandBuilder (pure functions, 100% unit-testable), and a ProcessRunner with timeout + stderr handling. Extractor and Probe each pick what they need — ISP in practice.

Auto-save settings, all the way down. The Inspector has no Apply button. Every control wires .onChange (not .onSubmit) so changes persist when focus moves away — not only on Enter. ConfigService loads from disk on init so SwiftUI views see persisted values on first body render.

Subtitle timing is acoustic when possible. Qwen3 path runs the audio + transcript through Qwen3ForcedAligner to get word-level timestamps, then WordGroupingChunker groups words into reading-shaped cues at sentence terminators. When alignment is unavailable, SentenceChunker distributes time proportionally to chunk character count — the standard heuristic from aeneas / WhisperX.

Error handling is a single enum. ScribeError has ~15 cases, each carrying just the data that case needs. One switch in the pipeline decides whether to retry, surface the error, or fail softly — no exception hierarchy, no Any casts.

Tech stack

SwiftUI (macOS 15+) with Swift 6 concurrency (async/await, actor)
FluidAudio — CoreML Parakeet + Nemotron streaming
soniqo/speech-swift — MLX-Swift port of Qwen3-ASR + Qwen3-ForcedAligner
sherpa-onnx — ONNX Runtime streaming Zipformer for zh-en
FFmpeg (subprocess, bundled) — audio + video
swift-testing — all 168 tests
ScreenCaptureKit for live system-audio capture
URLSession for HTTP; Codable + JSON for config persistence (~/Library/Application Support/Scribe/)

Project layout

Scribe/
├── Package.swift                  # 5 targets, strict dependency graph
├── Sources/
│   ├── Domain/                    # value types, zero imports
│   ├── Protocols/                 # ~10 small interfaces
│   ├── Core/                      # pipeline, parsers, retry, batch
│   ├── Infrastructure/            # FFmpeg, ASR (~10 engines + helpers), translation, config
│   └── App/
│       ├── LiveCaptions/          # standalone ⌘⇧L window
│       ├── Services/              # ConfigService, ASRModelService, Toast, Notifier
│       ├── ViewModels/            # ProcessingViewModel, SettingsViewModel
│       └── Views/                 # ContentView, SettingsInspector, etc.
├── Tests/
│   ├── UnitTests/                 # Domain / Core / Infrastructure suites
│   └── IntegrationTests/          # VideoPipeline + Qwen3 E2E
├── scripts/
│   ├── build-app.sh               # release → .app → DMG
│   ├── build-mlx-metallib.sh      # xcrun metal → mlx.metallib next to binary
│   ├── fetch-sherpa-onnx.sh       # vendor sherpa-onnx + ONNX Runtime xcframeworks
│   ├── setup-signing-cert.sh      # generate stable self-signed cert (TCC-friendly)
│   └── make-icon.sh               # regenerate the app icon from SF Symbols
├── .github/workflows/             # CI on push; Release on tag
└── Resources/                     # Info.plist, AppIcon.icns

Releasing

git tag v1.1.0
git push origin v1.1.0

GitHub Actions tests → builds → attaches Scribe-v1.1.0.app.zip + Scribe-v1.1.0.dmg to the release.

License

MIT. See LICENSE.

_{Built by @itchat with Claude Code (Opus 4.7 · 1M context).}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Scribe

What it does

Why I built it

The numbers

Pair-programmed with Claude Code

Architecture

Features

Burn pipeline — speech recognition

Live Captions (⌘⇧L)

Translation

Subtitle styling

Video

Native macOS UI

Keyboard shortcuts

Install

From release

From source

Design highlights

Tech stack

Project layout

Releasing

License

About

Uh oh!

Releases 4

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
.github		.github
Resources		Resources
Sources		Sources
Tests		Tests
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
Package.resolved		Package.resolved
Package.swift		Package.swift
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Scribe

What it does

Why I built it

The numbers

Pair-programmed with Claude Code

Architecture

Features

Burn pipeline — speech recognition

Live Captions (⌘⇧L)

Translation

Subtitle styling

Video

Native macOS UI

Keyboard shortcuts

Install

From release

From source

Design highlights

Tech stack

Project layout

Releasing

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages