gem-transcribe

Audio transcription CLI built on Vertex AI Gemini — speaker inference, multi-language output, and structured JSON.

gem-transcribe is a focused "transcription foundation": one audio file in, a structured transcript out. Minutes generation, summarization, and action-item extraction live downstream in tools like meeting-note.

Features

Speaker inference — by default, names are inferred from the audio itself (self-introductions, direct address, third-party mentions). Speakers whose names cannot be determined are labelled Speaker A, Speaker B, etc. Provide --speaker-hint="Yamada,Sato" to give Gemini a closed list of candidate names
Multi-language output — --lang=en,ja produces both the original and a translation in a single API call
Long audio support — local files are auto-uploaded to a GCS staging bucket and removed after processing; pre-uploaded gs:// URIs are also accepted directly
Multiple output formats — JSON to stdout by default, plus --format text|md|srt|vtt. With multi-language SRT/VTT and --output-file=meeting.srt --lang=en,ja, the tool writes meeting.en.srt and meeting.ja.srt automatically. --output-dir emits both .json and .txt for the same basename
Progress on stderr — the CLI prints one-line milestones (upload, transcribe start, elapsed time on completion) so long-running calls do not appear frozen. Pass --quiet to suppress, or --verbose for full INFO-level logs
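The multi-language file naming described above (`meeting.srt` with `--lang=en,ja` becoming `meeting.en.srt` and `meeting.ja.srt`) can be sketched as follows. This is an illustrative sketch, not the tool's actual code: the function name `per_language_paths` is hypothetical, and leaving the name untouched for single-language runs is an assumption.

```python
from pathlib import Path


def per_language_paths(output_file: str, langs: list[str]) -> list[Path]:
    """Insert the language code before the extension when several languages are requested."""
    path = Path(output_file)
    if len(langs) <= 1:
        # Assumption: a single-language run keeps the file name exactly as given.
        return [path]
    # meeting.srt + "en" -> meeting.en.srt
    return [path.with_name(f"{path.stem}.{lang}{path.suffix}") for lang in langs]


for p in per_language_paths("meeting.srt", ["en", "ja"]):
    print(p.name)
```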

Installation

```
uv tool install git+https://github.com/nlink-jp/gem-transcribe.git
# or, from a clone
uv sync --all-extras
```

Requires Python 3.11+.

Setup

Create a GCS bucket for staging uploads:

```
gsutil mb -l us-central1 gs://your-bucket
```

Configure ADC (Application Default Credentials):
```
gcloud auth application-default login
```
Create the config file at ~/.config/gem-transcribe/config.toml (see config.example.toml for the full template):
```
[gcp]
project = "your-gcp-project"
location = "us-central1"

[model]
name = "gemini-2.5-flash"

[storage]
staging_bucket = "gs://your-bucket/gem-transcribe/"
```
The IAM principal must hold roles/aiplatform.user and roles/storage.objectAdmin on the staging bucket.

Usage

```
# JSON to stdout (default)
gem-transcribe meeting.mp3

# Multi-language output
gem-transcribe interview.m4a --lang=en,ja

# Speaker name attribution
gem-transcribe meeting.mp3 --speaker-hint="Yamada,Sato,Tanaka"

# Both JSON and plain text into a directory
gem-transcribe meeting.mp3 --output-dir=./transcripts/

# Markdown timeline
gem-transcribe meeting.mp3 --format=md --output-file=meeting.md

# SRT subtitles
gem-transcribe meeting.mp3 --format=srt --output-file=meeting.srt

# Multi-language SRT: writes meeting.en.srt and meeting.ja.srt
gem-transcribe meeting.mp3 --lang=en,ja --format=srt --output-file=meeting.srt

# WebVTT subtitles (with <v Speaker> voice tag)
gem-transcribe meeting.mp3 --format=vtt --output-file=meeting.vtt

# Pre-uploaded audio in GCS
gem-transcribe gs://your-bucket/recordings/2026-05-15.mp3
```

Known limitations

Timestamp accuracy

Segment start / end values come from Gemini's audio-token estimate, not from sample-accurate decoding. Treat them as rough markers, not sync-grade references. Concretely:

Drift accumulates on long audio. Single-pass transcription of long recordings (roughly 20 minutes and up) drifts noticeably, with errors growing toward the end of the file. Short recordings (a few minutes) are usually within a second or two.
Timestamps can exceed the actual audio duration. In real-world tests with gemini-2.5-pro, a 25-minute recording produced segments tagged past the 30-minute mark. Both gemini-2.5-flash and gemini-2.5-pro show this behaviour; switching model does not reliably fix it.
end is occasionally emitted as a duration instead of an absolute offset (a known Gemini 2.5 quirk). The orchestrator detects and rewrites these cases (end = start + end) and logs a warning under --verbose.
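The duration-versus-offset rewrite (`end = start + end`) can be illustrated with a small sketch. The detection rule used here, "an `end` smaller than `start` cannot be an absolute offset", is an assumption chosen for illustration; the orchestrator's actual heuristic is not documented in this README, and the function name `normalize_end` is hypothetical.

```python
def normalize_end(start: float, end: float) -> tuple[float, bool]:
    """Return (absolute_end, was_rewritten) for one segment, in seconds."""
    if end < start:
        # Assumed heuristic: an end before the start is read as a duration.
        return start + end, True
    return end, False


print(normalize_end(120.0, 4.5))   # (124.5, True)
print(normalize_end(10.0, 12.0))   # (12.0, False)
```

A rule this simple misses durations that happen to be larger than `start` (for example a 6-second segment starting at 3 seconds), so it should be read as a lower bound on what a real detector needs to handle.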

If you need sample-accurate alignment (e.g. burning subtitles onto video), do not consume these timestamps directly. Established workarounds — out of scope for this tool and intended for downstream pipelines:

Measure the audio's true duration locally (ffprobe) and clip / rescale segment timestamps against it.
Split the audio into shorter chunks before transcription and re-offset each chunk's timestamps. Bounds drift to the chunk length.
Use a dedicated forced-alignment pass (e.g. WhisperX) on the produced text against the original audio.
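The first two workarounds can be sketched as downstream helpers. This is a rough sketch under stated assumptions: segments are taken to be dicts with `start` and `end` in seconds, the linear rescale is a crude model of drift rather than a measured correction, and the function names are hypothetical. `probe_duration` requires `ffprobe` on `PATH`.

```python
import json
import subprocess


def probe_duration(path: str) -> float:
    """Ask ffprobe for the container duration in seconds."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration", "-of", "json", path],
        capture_output=True, text=True, check=True,
    )
    return float(json.loads(result.stdout)["format"]["duration"])


def rescale(segments: list[dict], true_duration: float) -> list[dict]:
    """Linearly shrink timestamps when the last segment overshoots the real duration."""
    if not segments:
        return []
    last_end = max(seg["end"] for seg in segments)
    scale = true_duration / last_end if last_end > true_duration else 1.0
    return [
        {**seg, "start": seg["start"] * scale, "end": min(seg["end"] * scale, true_duration)}
        for seg in segments
    ]


def reoffset(segments: list[dict], chunk_start: float) -> list[dict]:
    """Shift chunk-relative timestamps back onto the full recording's timeline."""
    return [
        {**seg, "start": seg["start"] + chunk_start, "end": seg["end"] + chunk_start}
        for seg in segments
    ]
```

With chunking, each chunk is transcribed separately and passed through `reoffset` with the chunk's start time, so the drift in any segment is bounded by the chunk length.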

Malformed segments (partial results)

The model occasionally emits a structurally broken segment, especially on long recordings. The request is constrained with a response schema to make this rare, but as a safety net each segment is validated independently:

Default (salvage): invalid segments are dropped, a warning reports how many (and roughly where), and a partial transcript is returned. metadata.dropped_segments in the JSON output records the count. Losing a few seconds beats discarding a multi-minute transcription.
--strict: fail the whole run if any segment is dropped (the old all-or-nothing behaviour), for pipelines that need a complete transcript.
An empty transcript (zero valid segments) is always an error.

Set the default in config.toml under [transcribe] strict = true|false; override per run with --strict / --no-strict (CLI wins over config).
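The salvage and strict behaviours described above can be sketched as follows. This is an illustrative sketch, not the tool's implementation: the validity rules in `is_valid`, the `segments` key, and the `text` field are assumptions. Only `metadata.dropped_segments` is taken from the documented output.

```python
def is_valid(seg: object) -> bool:
    """Assumed validity rule: numeric, non-negative start/end and non-empty text."""
    if not isinstance(seg, dict):
        return False
    start, end, text = seg.get("start"), seg.get("end"), seg.get("text")
    return (
        isinstance(start, (int, float))
        and isinstance(end, (int, float))
        and start >= 0
        and end >= 0
        and isinstance(text, str)
        and text.strip() != ""
    )


def salvage(raw_segments: list, strict: bool = False) -> dict:
    """Validate each segment independently; drop invalid ones unless strict."""
    valid = [seg for seg in raw_segments if is_valid(seg)]
    dropped = len(raw_segments) - len(valid)
    if not valid:
        # An empty transcript is always an error, in either mode.
        raise ValueError("no valid segments in model output")
    if strict and dropped:
        raise ValueError(f"{dropped} segment(s) failed validation")
    return {"segments": valid, "metadata": {"dropped_segments": dropped}}


raw = [{"start": 0.0, "end": 2.0, "text": "Hello"}, {"start": "oops", "text": ""}]
print(salvage(raw)["metadata"])  # {'dropped_segments': 1}
```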

Configuration

Priority (high → low):

CLI flags
Environment variables (GEM_TRANSCRIBE_*)
.env file
~/.config/gem-transcribe/config.toml
Built-in defaults
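The priority order above amounts to a layered lookup in which the first layer that defines a setting wins. A minimal sketch using `collections.ChainMap` follows. The function `resolve` and the setting names are hypothetical, and treating `None` as "not set" is an assumption about how unset CLI flags would be represented.

```python
from collections import ChainMap


def resolve(cli: dict, env: dict, dotenv: dict, config_file: dict, defaults: dict) -> dict:
    """Merge setting layers, highest priority first; None means 'not set'."""
    layers = [
        {key: value for key, value in layer.items() if value is not None}
        for layer in (cli, env, dotenv, config_file, defaults)
    ]
    # ChainMap searches its mappings left to right, so earlier layers win.
    return dict(ChainMap(*layers))


settings = resolve(
    cli={"strict": None, "lang": "en,ja"},
    env={"model": "gemini-2.5-pro"},
    dotenv={},
    config_file={"model": "gemini-2.5-flash", "strict": True},
    defaults={"strict": False, "lang": "en", "model": "gemini-2.5-flash"},
)
print(settings["lang"], settings["model"], settings["strict"])
```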

Build and test

```
make test     # uv run pytest tests/ -v
make lint     # ruff check + format check
make build    # uv build --out-dir dist/
```

Documentation

docs/en/gem-transcribe-rfp.md — design RFP
docs/ja/gem-transcribe-rfp.ja.md — 設計 RFP

License

MIT
