Audio transcription CLI built on Vertex AI Gemini — speaker inference, multi-language output, and structured JSON.
gem-transcribe is a focused "transcription foundation": one audio file in,
a structured transcript out. Minutes generation, summarization, and
action-item extraction live downstream in tools like
meeting-note.
- Speaker inference: by default, names are inferred from the audio itself (self-introductions, direct address, third-party mentions). Speakers whose names cannot be determined are labelled Speaker A, Speaker B, etc. Provide --speaker-hint="Yamada,Sato" to give Gemini a closed list of candidate names
- Multi-language output: --lang=en,ja produces both the original and a translation in a single API call
- Long audio support: local files are auto-uploaded to a GCS staging bucket and removed after processing; pre-uploaded gs:// URIs are also accepted directly
- Multiple output formats: JSON to stdout by default, plus --format text|md|srt|vtt. With multi-language SRT/VTT and --output-file=meeting.srt --lang=en,ja, the tool writes meeting.en.srt and meeting.ja.srt automatically. --output-dir emits both .json and .txt for the same basename
- Progress on stderr: the CLI prints one-line milestones (upload, transcribe start, elapsed time on completion) so long-running calls do not appear frozen. Pass --quiet to suppress, or --verbose for full INFO-level logs
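The per-language file naming described above can be sketched in a few lines. This is an illustration only, not the tool's actual code: the function name `per_language_paths` is hypothetical, and leaving the path untouched when a single language is requested is an assumption.

```python
from pathlib import Path


def per_language_paths(output_file: str, langs: list[str]) -> list[Path]:
    """Insert each language code before the file extension.

    Assumption: with zero or one language the path is returned unchanged.
    """
    path = Path(output_file)
    if len(langs) <= 1:
        return [path]
    # meeting.srt + "en" -> meeting.en.srt
    return [path.with_name(f"{path.stem}.{lang}{path.suffix}") for lang in langs]


if __name__ == "__main__":
    for p in per_language_paths("meeting.srt", ["en", "ja"]):
        print(p.name)
```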
uv tool install git+https://github.com/nlink-jp/gem-transcribe.git
# or, from a clone
uv sync --all-extras

Requires Python 3.11+.
- Create a GCS bucket for staging uploads:
    gsutil mb -l us-central1 gs://your-bucket
- Configure ADC (Application Default Credentials):
    gcloud auth application-default login
- Create the config file at ~/.config/gem-transcribe/config.toml (see config.example.toml for the full template):
    [gcp]
    project = "your-gcp-project"
    location = "us-central1"
    [model]
    name = "gemini-2.5-flash"
    [storage]
    staging_bucket = "gs://your-bucket/gem-transcribe/"
The IAM principal must hold roles/aiplatform.user and roles/storage.objectAdmin on the staging bucket.
# JSON to stdout (default)
gem-transcribe meeting.mp3
# Multi-language output
gem-transcribe interview.m4a --lang=en,ja
# Speaker name attribution
gem-transcribe meeting.mp3 --speaker-hint="Yamada,Sato,Tanaka"
# Both JSON and plain text into a directory
gem-transcribe meeting.mp3 --output-dir=./transcripts/
# Markdown timeline
gem-transcribe meeting.mp3 --format=md --output-file=meeting.md
# SRT subtitles
gem-transcribe meeting.mp3 --format=srt --output-file=meeting.srt
# Multi-language SRT — writes meeting.en.srt and meeting.ja.srt
gem-transcribe meeting.mp3 --lang=en,ja --format=srt --output-file=meeting.srt
# WebVTT subtitles (with <v Speaker> voice tag)
gem-transcribe meeting.mp3 --format=vtt --output-file=meeting.vtt
# Pre-uploaded audio in GCS
gem-transcribe gs://your-bucket/recordings/2026-05-15.mp3

Segment start / end values come from Gemini's audio-token estimate, not from sample-accurate decoding. Treat them as rough markers, not sync-grade references. Concretely:
- Drift accumulates on long audio. Single-pass transcription of long recordings (roughly 20 minutes and up) drifts noticeably, with errors growing toward the end of the file. Short recordings (a few minutes) are usually within a second or two.
- Timestamps can exceed the actual audio duration. In real-world tests with gemini-2.5-pro, a 25-minute recording produced segments tagged past the 30-minute mark. Both gemini-2.5-flash and gemini-2.5-pro show this behaviour; switching model does not reliably fix it.
- end is occasionally emitted as a duration instead of an absolute offset (a known Gemini 2.5 quirk). The orchestrator detects and rewrites these cases (end = start + end) and logs a warning under --verbose.
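The `end = start + end` rewrite can be illustrated with a small sketch. The detection rule used here (an `end` smaller than `start` is taken to be a duration) is an assumption made for illustration; the source does not say how the orchestrator actually detects these cases, and `normalize_end` is a hypothetical name.

```python
def normalize_end(start: float, end: float) -> tuple[float, bool]:
    """Return (absolute_end, was_rewritten) for one segment.

    Assumed heuristic: an end value below start cannot be an absolute
    offset, so it is treated as a duration and added to start.
    """
    if end < start:
        return start + end, True
    return end, False


if __name__ == "__main__":
    # A segment starting at 120 s whose "end" of 4.5 s is really a duration.
    print(normalize_end(120.0, 4.5))
    # A well-formed segment passes through unchanged.
    print(normalize_end(10.0, 12.0))
```

A heuristic of this kind cannot catch every case: near the start of a file a duration can be larger than `start` and still look like a valid absolute offset.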
If you need sample-accurate alignment (e.g. burning subtitles onto video), do not consume these timestamps directly. Established workarounds — out of scope for this tool and intended for downstream pipelines:
- Measure the audio's true duration locally (ffprobe) and clip / rescale segment timestamps against it.
- Split the audio into shorter chunks before transcription and re-offset each chunk's timestamps. Bounds drift to the chunk length.
- Use a dedicated forced-alignment pass (e.g. WhisperX) on the produced text against the original audio.
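The first two workarounds can be sketched as downstream helpers. This is a rough illustration under stated assumptions: segments are taken to be dicts with `start` and `end` in seconds (field names assumed, not taken from the tool's schema), the true duration is supplied by the caller (for example from ffprobe), and the rescale is a simple linear one, which will not correct non-uniform drift.

```python
from typing import Any

Segment = dict[str, Any]  # assumed shape: {"start": float, "end": float, ...}


def rescale_to_duration(segments: list[Segment], true_duration: float) -> list[Segment]:
    """Linearly shrink timestamps when they overshoot the measured duration."""
    if not segments:
        return []
    last_end = max(seg["end"] for seg in segments)
    # Only rescale when the transcript runs past the real end of the audio.
    scale = true_duration / last_end if last_end > true_duration else 1.0
    return [
        {
            **seg,
            "start": seg["start"] * scale,
            "end": min(seg["end"] * scale, true_duration),  # clip rounding overshoot
        }
        for seg in segments
    ]


def reoffset_chunk(segments: list[Segment], chunk_start: float) -> list[Segment]:
    """Shift chunk-relative timestamps to offsets in the full recording."""
    return [
        {**seg, "start": seg["start"] + chunk_start, "end": seg["end"] + chunk_start}
        for seg in segments
    ]
```

With chunking, each chunk is transcribed separately and `reoffset_chunk` is applied with that chunk's start position, so any drift is confined to a single chunk.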
The model occasionally emits a structurally broken segment, especially on long recordings. The request is constrained with a response schema to make this rare, but as a safety net each segment is validated independently:
- Default (salvage): invalid segments are dropped, a warning reports how many (and roughly where), and a partial transcript is returned. metadata.dropped_segments in the JSON output records the count. Losing a few seconds beats discarding a multi-minute transcription.
- --strict: fail the whole run if any segment is dropped (the old all-or-nothing behaviour), for pipelines that need a complete transcript.
- An empty transcript (zero valid segments) is always an error.
Set the default in config.toml under [transcribe] strict = true|false;
override per run with --strict / --no-strict (CLI wins over config).
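The salvage and strict behaviours can be summarised in a short sketch. It is not the tool's implementation: `TranscriptError`, `is_valid_segment`, `salvage`, and the segment fields `start`, `end`, and `text` are all hypothetical names chosen for illustration. Only `metadata.dropped_segments` comes from the description above.

```python
import logging
from typing import Any

logger = logging.getLogger(__name__)


class TranscriptError(Exception):
    """Raised when no usable transcript can be returned (hypothetical name)."""


def is_valid_segment(seg: Any) -> bool:
    """Validate one segment on its own; the field names are assumed."""
    if not isinstance(seg, dict):
        return False
    start, end, text = seg.get("start"), seg.get("end"), seg.get("text")
    if not all(isinstance(v, (int, float)) for v in (start, end)):
        return False
    return isinstance(text, str) and bool(text.strip()) and 0 <= start <= end


def salvage(raw_segments: list[Any], strict: bool = False) -> dict[str, Any]:
    """Keep valid segments, count the dropped ones, and honour strict mode."""
    valid = [seg for seg in raw_segments if is_valid_segment(seg)]
    dropped = len(raw_segments) - len(valid)
    if strict and dropped:
        raise TranscriptError(f"{dropped} invalid segment(s) in strict mode")
    if not valid:
        # An empty transcript is an error regardless of mode.
        raise TranscriptError("no valid segments")
    if dropped:
        logger.warning("dropped %d invalid segment(s)", dropped)
    return {"segments": valid, "metadata": {"dropped_segments": dropped}}
```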
Priority (high → low):
- CLI flags
- Environment variables (GEM_TRANSCRIBE_*)
- .env file
- ~/.config/gem-transcribe/config.toml
- Built-in defaults
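A layered lookup like this one can be modelled with `collections.ChainMap`, which returns the value from the first mapping that contains a key. The sketch below is illustrative only: the setting keys and the `resolve_settings` helper are hypothetical, and treating `None` as "not set" (for example an unspecified CLI flag) is an assumption.

```python
from collections import ChainMap
from typing import Any, Mapping


def resolve_settings(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge layers passed in priority order, highest priority first.

    Assumption: a value of None means the layer did not set that key.
    """
    cleaned = [{k: v for k, v in layer.items() if v is not None} for layer in layers]
    return dict(ChainMap(*cleaned))


if __name__ == "__main__":
    settings = resolve_settings(
        {"strict": True, "model": None},                    # CLI flags
        {"model": "gemini-2.5-pro"},                        # environment variables
        {},                                                 # .env file
        {"model": "gemini-2.5-flash", "strict": False},     # config.toml
        {"model": "gemini-2.5-flash", "strict": False, "quiet": False},  # defaults
    )
    print(settings["strict"], settings["model"], settings["quiet"])
```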
make test # uv run pytest tests/ -v
make lint # ruff check + format check
make build # uv build --out-dir dist/

- docs/en/gem-transcribe-rfp.md: design RFP
- docs/ja/gem-transcribe-rfp.ja.md: design RFP (Japanese)
MIT