Forced Alignment Evaluation Toolkit

Computes standard phonetic alignment quality metrics by comparing an automatically aligned Praat TextGrid against a human-annotated reference.

Metrics

| Metric | Description | Key Reference |
|--------|-------------|---------------|
| Boundary Displacement | Absolute time difference (ms) between each reference boundary and the nearest hypothesis boundary. Reports mean, median, std, and percentage within 10/20/25/50/100 ms thresholds. | McAuliffe et al. (2017) |
| Intersection over Union (IoU) | Temporal overlap between matched segments, computed per phone/word. | Gonzalez et al. (2020) |
| Phone Error Rate (PER) | Levenshtein edit distance between phone label sequences, normalised by reference length. | Standard ASR evaluation |
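The three metrics can be sketched in plain Python. This is a minimal illustration of the definitions above, not the toolkit's actual implementation in src/metrics.py; the (start, end, label) interval representation is an assumption for the example.

```python
# Minimal sketches of the three metrics. Intervals are (start_s, end_s, label)
# tuples; boundaries are plain floats in seconds.

def boundary_displacements(ref_bounds, hyp_bounds):
    """Absolute distance (ms) from each reference boundary to the
    nearest hypothesis boundary."""
    return [min(abs(r - h) for h in hyp_bounds) * 1000.0 for r in ref_bounds]

def interval_iou(a, b):
    """Temporal intersection-over-union of two (start, end, label) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def phone_error_rate(ref, hyp):
    """Levenshtein edit distance between label sequences,
    normalised by reference length (single-row DP)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1] / len(ref)
```

For example, swapping one phone out of three gives a PER of 1/3, and two half-second segments offset by 0.25 s have an IoU of 1/3.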

Project Structure

alignment-eval-project/
├── .vscode/
│   └── launch.json          # VS Code debug configurations
├── data/
│   └── examples/            # Demo TextGrid files (auto-generated)
├── outputs/
│   ├── logs/                # Timestamped log files
│   └── reports/             # Evaluation reports (txt + csv)
├── src/
│   ├── __init__.py          # Package metadata
│   ├── __main__.py          # python -m src entry point
│   ├── main.py              # CLI and evaluation pipeline
│   ├── loader.py            # TextGrid loading utilities
│   ├── metrics.py           # Metric computations
│   ├── reporting.py         # Report formatting and file output
│   ├── log_config.py        # Logging setup
│   └── demo.py              # Demo TextGrid generator
├── tests/                   # (placeholder for unit tests)
├── .gitignore
├── requirements.txt
└── README.md

Installation

# Clone the repository
git clone <repo-url>
cd alignment-eval-project

# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # Linux/macOS
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

Usage

Run the demo

python -m src --demo

This generates synthetic reference and hypothesis TextGrid files in data/examples/, runs the evaluation, prints results to the console, and saves reports to outputs/reports/ and logs to outputs/logs/.
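For reference, a minimal Praat long-format TextGrid with a single interval tier can be written like this. This is an illustration of the file format the toolkit consumes, not the code in src/demo.py, which may format its files differently.

```python
# Write a minimal Praat long-format TextGrid with one interval tier.
# Purely illustrative: src/demo.py may produce different content.

def write_textgrid(path, tier_name, intervals):
    """intervals: contiguous list of (xmin, xmax, label) tuples."""
    xmin, xmax = intervals[0][0], intervals[-1][1]
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        f"xmin = {xmin}",
        f"xmax = {xmax}",
        "tiers? <exists>",
        "size = 1",
        "item []:",
        "    item [1]:",
        '        class = "IntervalTier"',
        f'        name = "{tier_name}"',
        f"        xmin = {xmin}",
        f"        xmax = {xmax}",
        f"        intervals: size = {len(intervals)}",
    ]
    for i, (lo, hi, label) in enumerate(intervals, 1):
        lines += [
            f"        intervals [{i}]:",
            f"            xmin = {lo}",
            f"            xmax = {hi}",
            f'            text = "{label}"',
        ]
    with open(path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")

write_textgrid("demo_ref.TextGrid", "phones",
               [(0.0, 0.30, "sil"), (0.30, 0.55, "k"), (0.55, 1.00, "ae")])
```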

Evaluate your own TextGrids

# Basic evaluation (phone tier)
python -m src --ref data/my_reference.TextGrid --hyp data/my_hypothesis.TextGrid --tier phones

# Word-level evaluation
python -m src --ref data/my_reference.TextGrid --hyp data/my_hypothesis.TextGrid --tier words

# Save report and CSV to outputs/reports/
python -m src --ref data/my_reference.TextGrid --hyp data/my_hypothesis.TextGrid --tier phones --save

# Exclude silence-adjacent boundaries
python -m src --ref data/my_reference.TextGrid --hyp data/my_hypothesis.TextGrid --tier phones --exclude-silence --save

VS Code

Open the project in VS Code and use the debug configurations in .vscode/launch.json:

  • Run Demo — runs with built-in example data
  • Evaluate Phones — evaluates the example phone tier
  • Evaluate Words — evaluates the example word tier
  • Evaluate (Pick Files) — prompts you for file paths and tier name

Programmatic usage

from src.main import evaluate_alignment

results = evaluate_alignment(
    ref_path="data/reference.TextGrid",
    hyp_path="data/hypothesis.TextGrid",
    tier_name="phones",
)

# Access individual metrics
bd = results["boundary_displacement"]
print(f"Median displacement: {bd['median_ms']:.1f} ms")
print(f"Within 25 ms: {bd['pct_within_25ms']:.0f}%")

iou = results["iou"]
print(f"Mean IoU: {iou['mean_iou']:.3f}")

per = results["phone_error_rate"]
print(f"PER: {per['per']:.1%}")

Interpretation Guide

  • Boundary displacement: MFA typically achieves ~12–17 ms median on English benchmarks (Buckeye, TIMIT). Human inter-annotator agreement is ~10–13 ms. If your aligner is within this range, it is performing at near-human level.
  • IoU: 1.0 = perfect overlap. Values above 0.8 are generally good for phone-level alignment.
  • PER: 0.0 = perfect label match. Analogous to Word Error Rate (WER) in ASR evaluation.
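The rules of thumb above can be turned into a quick sanity check. The thresholds are taken directly from this guide, and the results-dict keys mirror the programmatic-usage example; treat this as a sketch, not part of the toolkit's API.

```python
def interpret(results):
    """Apply the rule-of-thumb thresholds from the interpretation guide."""
    bd = results["boundary_displacement"]["median_ms"]
    iou = results["iou"]["mean_iou"]
    notes = []
    # ~10-13 ms is human inter-annotator agreement; ~12-17 ms is typical MFA.
    notes.append("near human-level" if bd <= 13 else
                 "typical MFA range" if bd <= 17 else
                 "worse than typical MFA")
    # IoU above 0.8 is generally good for phone-level alignment.
    notes.append("good phone-level overlap" if iou > 0.8 else
                 "check segment overlap")
    return notes

print(interpret({"boundary_displacement": {"median_ms": 12.0},
                 "iou": {"mean_iou": 0.85}}))
```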

References

  • McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. (2017). Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. Interspeech 2017, 498–502.
  • Gonzalez, S., Grama, J., & Travis, C. E. (2020). Comparing the performance of forced aligners used in sociophonetic research. Linguistics Vanguard, 6(1).
  • Kelley, M. C., et al. (2023). MAPS: A Mason-Alberta Phonetic Segmenter. Interspeech 2023.
  • Rousso, T., et al. (2024). Evaluating forced alignment tools. Speech Communication.

License

MIT
