Diarization Simulation

Overview
Tutorial
Installation
- Requirements
Usage
Statistical Model
- Parameters
- Model Characteristics
Citation

A Python package for simulating speaker diarization with LENA and VTC from ground truth vocalization data.

Diarization algorithms segment and classify speech into predefined speaker categories, including child (CHI), other child (OCH), female adult (FEM), and male adult (MAL). In child development and language acquisition research, these segments are aggregated into vocalization counts (see below) that measure children's speech output and their speech input in naturalistic daylong recordings.

However, algorithms make errors (e.g. by mistaking speakers for one another) which propagate into the measured vocalization counts, introducing biases in downstream analyses. Simulating diarization algorithms can help assess the sensitivity of a statistical analysis to classification errors. For instance, simulations can help determine whether correlations between speakers' speech quantities are entirely consistent with spurious correlations due to classification errors.

Overview

Diarization Simulation helps you assess how classification errors in automated diarization algorithms — LENA and VTC — affect downstream analyses of vocalization counts. You provide a ground truth dataset (true vocalization counts per speaker and per recording), and the tool simulates what those algorithms would measure, drawing on pre-computed detection and confusion rates calibrated against ~30 hours of manual annotations. Across many simulated samples, you can then ask: how sensitive is my finding to classification errors? Or: is an observed result consistent with errors alone?

Ground truth data can come from your own sources, or be generated synthetically. For users working with a ChildProject corpus, the package also includes an optional truth-simulate tool that infers realistic vocalization distributions from manual annotations and generates synthetic ground truth accordingly — see Generating Ground Truth from a ChildProject Corpus.

Internally, the simulation loads your ground truth, applies algorithm-specific hyperparameters, and generates measured vocalization counts using either a Poisson approximation, which neglects the underdispersion of the counts, or a Gamma approximation, which accounts for it. See Statistical Model for details.

For a concrete end-to-end example, see the Tutorial.

Tutorial

A step-by-step tutorial walks through a complete worked example using this package. It is available in two formats:

tutorial.ipynb — Jupyter notebook using the Python API directly in an interactive mode
tutorial.Rmd — R Markdown calling diarization-simulate via the command-line interface (through system2())

Both follow the same structure and are aligned with Gautheron et al. (2025).

As a running example, the tutorial asks whether an observed correlation between child vocalizations (CHI) and female adult vocalizations (FEM), written R(CHI, FEM), could be an artifact of diarization errors rather than a genuine relationship. This is a reasonable concern: if an algorithm systematically confuses CHI and FEM speech, recordings with many child vocalizations may also appear to have many female adult vocalizations, even if no true relationship exists.

The tutorial therefore poses the question: is the observed correlation consistent with what we would expect under the null hypothesis that the true correlation is zero? It does this by simulating what a diarization algorithm would measure if the ground truth had R(CHI, FEM) = 0, and checking whether the observed correlation falls within the range of simulated values.
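
The final check described above can be sketched as follows. This is a minimal illustration rather than the tutorial's own code: the helper names `null_interval` and `consistent_with_null` are hypothetical, the 95% central interval is an assumed choice, and the simulated correlations are placeholder numbers, not results from the package.

```python
import numpy as np

def null_interval(simulated_r, level=0.95):
    """Central interval of the correlations simulated under the null."""
    tail = 100 * (1 - level) / 2
    low, high = np.percentile(simulated_r, [tail, 100 - tail])
    return float(low), float(high)

def consistent_with_null(observed_r, simulated_r, level=0.95):
    """True if the observed correlation lies inside the null interval."""
    low, high = null_interval(simulated_r, level)
    return low <= observed_r <= high

# Placeholder values of R(CHI, FEM), one per simulated sample (hypothetical)
rng = np.random.default_rng(0)
simulated_r = rng.normal(loc=0.2, scale=0.05, size=1000)

print(null_interval(simulated_r))
print(consistent_with_null(0.25, simulated_r))  # observed value close to the null range
print(consistent_with_null(0.60, simulated_r))  # observed value far from the null range
```

An observed correlation inside the interval could be produced by classification errors alone; one far outside it is harder to attribute to errors.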

This kind of sensitivity analysis is the primary use case for the package, and the tutorial is the fastest way to see the full workflow — from synthetic ground truth generation through to simulation and interpretation — end to end. For a 5-minute version, see the "Quick start" section at the top of the tutorial.

Installation

# Clone the repository
git clone https://github.com/LAAC-LSCP/diarization-simulation.git
cd diarization-simulation

# Install the package
pip install -e .

Requirements

You will need Python 3.8+ to run this package. Key dependencies include:

pandas
numpy
scipy
numba
tqdm

If you followed the Installation instructions, you should have those packages already.

For users working with a ChildProject corpus, the optional truth-simulate tool requires the following additional packages, which are not installed by default:

cmdstanpy (see its installation instructions)
ChildProject (see its installation instructions)

Usage

The package can be used via the command-line interface or the Python API.

Input Format

Both the CLI and Python API require the same columns in the input CSV:

observation: Unique identifier for each recording/observation
CHI: Child vocalization count
OCH: Other child vocalization count
FEM: Female adult vocalization count
MAL: Male adult vocalization count

Example:

observation,CHI,OCH,FEM,MAL
1,120,30,200,50
2,90,15,180,70
3,150,25,220,45
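
Before running a simulation, it can help to check a file against this format. The sketch below is not part of the package: `check_truth_csv` is a hypothetical helper, and it assumes that counts must be non-negative integers and that observation identifiers must not repeat.

```python
import csv
import io

REQUIRED_COLUMNS = ["observation", "CHI", "OCH", "FEM", "MAL"]
SPEAKERS = REQUIRED_COLUMNS[1:]

def check_truth_csv(handle):
    """Return a list of problems found in a ground truth CSV (empty if none)."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    seen = set()
    for row_no, row in enumerate(reader, start=1):
        obs = row["observation"]
        if obs in seen:
            problems.append(f"row {row_no}: duplicate observation {obs!r}")
        seen.add(obs)
        for speaker in SPEAKERS:
            value = row[speaker] or ""  # short rows yield None
            if not value.isdigit():
                problems.append(
                    f"row {row_no}: {speaker}={value!r} is not a non-negative integer"
                )
    return problems

# The example file shown above
example = io.StringIO(
    "observation,CHI,OCH,FEM,MAL\n"
    "1,120,30,200,50\n"
    "2,90,15,180,70\n"
    "3,150,25,220,45\n"
)
print(check_truth_csv(example))
```

To check a file on disk, pass an open file handle instead of the `io.StringIO` object.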

Command-line Interface

The main command line interface can be accessed through diarization-simulate:

diarization-simulate --truth path/to/truth.csv \
                    --output path/to/output.csv \
                    --algo vtc \
                    --samples 1000 \
                    --distribution poisson

Command-line Arguments

| Argument | Description | Default |
|---|---|---|
| `--truth` | Path to the synthetic truth dataset (in CSV format) | Required |
| `--output` | Location of the output file | Required |
| `--output-format` | Output file format (`csv`, `parquet`, or `npz`) | `csv` |
| `--algo` | Algorithm to simulate (`vtc` or `lena`) | Required |
| `--samples` | Number of samples per observation | 1000 |
| `--average-hyperpriors` | Use the mean value of the hyperpriors (mu and alpha) | False |
| `--unique-hyperpriors` | Use fixed hyperpriors (mu and alpha) throughout all samples | False |
| `--distribution` | Distribution for vocalization counts (`poisson` or `gamma`) | `poisson` |
| `--seed` | Random seed for reproducibility | None |

Python API

import pandas as pd
from diarization_simulation import simulate_diarization

# Load your ground truth data
truth_data = pd.read_csv("truth.csv")

# Simulate detections
results = simulate_diarization(
    truth_data=truth_data,
    algorithm="vtc",
    distribution="poisson",
    n_samples=1000,
    random_seed=42
)

# Analyze results
mean_detections = results.groupby('observation')[['CHI', 'OCH', 'FEM', 'MAL']].mean()
print("Mean detections per observation:")
print(mean_detections)

To compare results across algorithms and distribution types:

# Run simulations with different parameters
algorithms = ["vtc", "lena"]
distributions = ["poisson", "gamma"]
results = {}

for algo in algorithms:
    for dist in distributions:
        key = f"{algo}_{dist}"
        results[key] = simulate_diarization(
            truth_data=truth_data,
            algorithm=algo,
            distribution=dist,
            n_samples=1000,
            random_seed=42
        )

# Compare R(CHI, FEM) across configurations
for key, result in results.items():
    correlation = result[['CHI', 'FEM']].corr().iloc[0, 1]
    print(f"{key}: CHI-FEM correlation = {correlation:.3f}")

Python API Arguments

| Parameter | Type | Description | Default |
|---|---|---|---|
| `truth_data` | str or DataFrame | Path to CSV file or pandas DataFrame with truth data | Required |
| `algorithm` | str | Algorithm to simulate (`"vtc"` or `"lena"`) | `"vtc"` |
| `distribution` | str | Distribution type (`"poisson"` or `"gamma"`) | `"poisson"` |
| `n_samples` | int | Number of samples to generate per observation | 1000 |
| `hyperprior_mode` | str | Hyperprior handling (`"sample"`, `"average"`, `"unique"`) | `"sample"` |
| `random_seed` | int or None | Random seed for reproducibility | None |
| `verbose` | bool | Show progress bar | True |

Hyperprior modes:

"sample": Each sample gets its own hyperpriors (captures algorithm uncertainty)
"average": Use mean hyperprior values (reduced variance)
"unique": Same hyperpriors for all samples (minimal variance)

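One way to picture the three modes is sketched below. It assumes the pre-computed hyperpriors are available as a set of draws, which is an assumption about the internals and not something this README confirms; `pick_hyperpriors` and the `draws` values are hypothetical.

```python
import numpy as np

def pick_hyperpriors(draws, n_samples, mode="sample", rng=None):
    """Return one hyperprior value per simulated sample.

    draws: 1-D array of hypothetical draws of a hyperprior (e.g. mu).
    """
    rng = np.random.default_rng() if rng is None else rng
    draws = np.asarray(draws, dtype=float)
    if mode == "sample":
        # A different draw for each sample: propagates algorithm uncertainty
        return rng.choice(draws, size=n_samples)
    if mode == "average":
        # The mean of all draws, repeated for every sample
        return np.full(n_samples, draws.mean())
    if mode == "unique":
        # A single draw, shared by every sample
        return np.full(n_samples, rng.choice(draws))
    raise ValueError(f"unknown mode: {mode!r}")

draws = [0.70, 0.75, 0.80, 0.85]  # hypothetical values
for mode in ("sample", "average", "unique"):
    print(mode, pick_hyperpriors(draws, 5, mode, np.random.default_rng(1)))
```

In this picture, the spread of values across samples is widest in "sample" mode and vanishes in the other two, which matches the variance ordering described above.
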
Output Format

Both the CLI and Python API return the same structure:

sample: Sample number (0 to n_samples-1)
observation: Original observation identifier
CHI, OCH, FEM, MAL: Simulated vocalization counts per speaker

Example:

sample,observation,CHI,OCH,FEM,MAL
0,1,118,28,195,52
0,2,87,16,175,73
1,1,122,31,198,49
1,2,92,14,182,68
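
Because each value of sample is one complete simulated dataset, statistics such as R(CHI, FEM) are typically computed within each sample and then compared across samples. The sketch below shows this for a CSV output; `correlation_per_sample` is a hypothetical helper, and the rows for observation 3 are made up for illustration.

```python
import csv
import io
from collections import defaultdict

import numpy as np

def correlation_per_sample(handle, x="CHI", y="FEM"):
    """Map each sample number to the Pearson correlation between x and y."""
    by_sample = defaultdict(list)
    for row in csv.DictReader(handle):
        by_sample[int(row["sample"])].append((float(row[x]), float(row[y])))
    return {
        sample: float(np.corrcoef(np.array(pairs).T)[0, 1])
        for sample, pairs in sorted(by_sample.items())
    }

# Illustrative output with three observations per sample
output = io.StringIO(
    "sample,observation,CHI,OCH,FEM,MAL\n"
    "0,1,118,28,195,52\n"
    "0,2,87,16,175,73\n"
    "0,3,149,27,214,44\n"
    "1,1,122,31,198,49\n"
    "1,2,92,14,182,68\n"
    "1,3,151,22,225,47\n"
)
for sample, r in correlation_per_sample(output).items():
    print(f"sample {sample}: R(CHI, FEM) = {r:.3f}")
```

With real output, the dictionary holds one correlation per sample, giving the distribution of the statistic under the simulated errors.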

Generating Ground Truth from a ChildProject Corpus

If you have a ChildProject corpus with manual annotations, the package includes an optional truth-simulate tool that generates synthetic ground truth by fitting a Bayesian hierarchical model to your annotation data and inferring a realistic distribution of speech from it. This requires installing the optional dependencies listed in Requirements.

Command-line Interface for Ground Truth Generation

truth-simulate --corpus path/to/corpus \
               --annotator annotation_set_name \
               --output ground_truth.csv \
               --samples 1000

Command-line Arguments for `truth-simulate`

| Argument | Description | Default |
|---|---|---|
| `--corpus` | Path to the input ChildProject corpus | Required |
| `--annotator` | Annotation set containing the manual annotations | Required |
| `--output` | Location of the output file | Required |
| `--recordings` | Path to a CSV file containing the list of recordings to include | None |
| `--samples` | Number of samples to generate | 1000 |
| `--mode` | Sample from the mode of the posterior distribution of hyperparameters | False |
| `--show-distribution` | Show the marginal distribution of speech for each speaker according to the manual annotations | False |

The output CSV contains synthetic ground truth with columns recording_filename, observation, CHI, OCH, FEM, and MAL, with K×N rows where K is the number of recordings and N the number of samples requested.

How Ground Truth Generation Works

The truth-simulate tool uses a Bayesian hierarchical model to infer vocalization rate distributions from sparse manual annotations and then generates complete ground truth datasets. The process works as follows:

Load corpus data: Reads a ChildProject corpus containing recordings and manual annotations
Extract annotation statistics: Counts vocalizations per speaker type (CHI, OCH, FEM, MAL) in manually annotated segments
Fit hierarchical model: Uses Stan to fit a Gamma-Poisson model that estimates vocalization rates per speaker across the corpus
Generate samples: Produces synthetic ground truth vocalization counts for all recordings in the corpus
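
The last step can be pictured with a generic Gamma-Poisson draw. This sketch is not the Stan model that truth-simulate fits: `sample_ground_truth` is a hypothetical helper, the mean and shape values are made up, and the sketch ignores recording durations and any dependence between speakers.

```python
import numpy as np

SPEAKERS = ["CHI", "OCH", "FEM", "MAL"]

def sample_ground_truth(n_recordings, mean, shape, rng=None):
    """Draw synthetic counts: Gamma-distributed rates, then Poisson counts."""
    rng = np.random.default_rng() if rng is None else rng
    mean = np.asarray(mean, dtype=float)
    shape = np.asarray(shape, dtype=float)
    # One expected count per recording and speaker; scale = mean / shape
    # gives each Gamma the requested mean
    rates = rng.gamma(shape=shape, scale=mean / shape, size=(n_recordings, len(mean)))
    # Observed counts scatter around the recording-level rates
    return rng.poisson(rates)

counts = sample_ground_truth(
    n_recordings=3,
    mean=[120, 25, 200, 55],   # hypothetical average counts per speaker
    shape=[4, 2, 5, 3],        # hypothetical shapes (larger = less variable)
    rng=np.random.default_rng(0),
)
print(SPEAKERS)
print(counts)
```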

The output CSV of truth-simulate contains the same CHI, OCH, FEM, and MAL columns as in Output Format, plus two identifier columns: recording_filename, which contains the original recording filename, and observation, a unique identifier combining the recording filename and the sample number (e.g., "recording_001.wav,0").

Example output:

recording_filename,observation,CHI,OCH,FEM,MAL
recording_001.wav,"recording_001.wav,0",145,23,198,67
recording_002.wav,"recording_002.wav,0",112,18,176,45
recording_001.wav,"recording_001.wav,1",138,25,203,72
recording_002.wav,"recording_002.wav,1",119,16,181,49
...
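
Since the observation identifier itself contains a comma, it is quoted in the CSV, and a CSV-aware reader is needed to parse it. The sketch below uses Python's standard `csv` module; `split_observation` is a hypothetical helper that assumes the identifier always has the form shown in the example, with the sample number after the last comma.

```python
import csv
import io

def split_observation(observation):
    """Split 'recording_001.wav,0' into ('recording_001.wav', 0)."""
    filename, _, sample = observation.rpartition(",")
    return filename, int(sample)

# First two rows of the example output above
text = (
    "recording_filename,observation,CHI,OCH,FEM,MAL\n"
    'recording_001.wav,"recording_001.wav,0",145,23,198,67\n'
    'recording_002.wav,"recording_002.wav,0",112,18,176,45\n'
)
for row in csv.DictReader(io.StringIO(text)):
    filename, sample = split_observation(row["observation"])
    print(filename, sample, row["CHI"])
```

Splitting on the last comma keeps the helper correct even if a filename contains commas of its own.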

As above, the output contains K×N rows, where K is the number of recordings and N the number of samples requested.

Adapted Workflow

When the ground truth is generated with truth-simulate, a complete simulation workflow has two steps:

Generate ground truth from your corpus annotations:

truth-simulate --corpus /path/to/corpus \
               --annotator human_annotations \
               --output ground_truth.csv \
               --samples 100

Simulate diarization on the generated ground truth:

diarization-simulate --truth ground_truth.csv \
                    --output simulated_detections.csv \
                    --algo vtc \
                    --samples 100

In this example, the output simulated_detections.csv will contain 100 × 100 × K rows, where K is the number of recordings in the dataset: each of the 100 × K ground truth rows is simulated 100 times.

Statistical Model

The simulation uses a hierarchical model where:

Detection/confusion rates $\lambda_{ij}$ follow: $\lambda_{ij} \sim \mathrm{Gamma}(\alpha_{ij}, \mu_{ij}/\alpha_{ij})$
Detected vocalizations are generated using one of two distribution options:

The Poisson distribution:

$$\mathrm{Detected}_{ij} \sim \mathrm{Poisson}(\lambda_{ij} \cdot \mathrm{true}_{i})$$

$$\mathrm{Detected}_{j} = \sum_i \mathrm{Detected}_{ij}$$

The Gamma distribution:

$$\mathrm{Detected}_{ij} \sim \lfloor\mathrm{Gamma}(\alpha, \beta)+0.5\rfloor$$

$$\mathrm{Detected}_{j} = \sum_i \mathrm{Detected}_{ij}$$

With $\alpha$ and $\beta$ being fixed such that:

$$\mathbb{E}[\mathrm{Detected}_{ij}] = \lambda_{ij} \cdot \mathrm{true}_{i} \text{ and } \sigma[\mathrm{Detected}_{ij}] = \sqrt{\frac{\lambda_{ij} \cdot \mathrm{true}_{i}}{\tau}}$$
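
These two conditions determine $\alpha$ and $\beta$ in closed form. As a short derivation, assuming the shape and rate convention listed under Parameters (a $\mathrm{Gamma}(\alpha, \beta)$ variable has mean $\alpha/\beta$ and variance $\alpha/\beta^2$) and neglecting the small effect of rounding:

$$\frac{\alpha}{\beta} = \lambda_{ij} \cdot \mathrm{true}_{i} \text{ and } \frac{\alpha}{\beta^2} = \frac{\lambda_{ij} \cdot \mathrm{true}_{i}}{\tau}$$

Dividing the first equation by the second gives:

$$\beta = \tau \text{ and } \alpha = \tau \cdot \lambda_{ij} \cdot \mathrm{true}_{i}$$

With $\tau = 1$ the variance equals the mean, as for a Poisson distribution; $\tau > 1$ gives a variance smaller than the mean, which is the underdispersed case.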

Parameters

| Parameter | Description |
|---|---|
| $\lambda_{ij}$ | Detection rate from speaker $i$ to detected speaker $j$ |
| $\mathrm{true}_{i}$ | True vocalization count for speaker $i$ |
| $\tau$ | Underdispersion parameter |
| $\alpha_{ij}, \mu_{ij}$ | Shape and scale parameters for the detection rate prior |
| $\alpha, \beta$ | Shape and rate parameters for the gamma detection model |

Model Characteristics

The original model assumed a Generalized Poisson distribution, because the vocalization counts are underdispersed with respect to the Poisson distribution. Sampling from that distribution is more difficult, however, so the simulation offers two approximation schemes instead:

Poisson scheme: neglects the underdispersion of the count data
Gamma scheme: better captures the true variance, but is only approximate for small counts
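
The two schemes can be compared side by side in a few lines of NumPy. This is a sketch of the equations in Statistical Model, not the package's implementation: `simulate_detected` is a hypothetical helper, the values of mu, alpha and tau are made up rather than the calibrated ones, and $\mathrm{Gamma}(\alpha_{ij}, \mu_{ij}/\alpha_{ij})$ is read as shape and scale, which makes $\mu_{ij}$ the mean of $\lambda_{ij}$.

```python
import numpy as np

def simulate_detected(true_counts, mu, alpha, tau=1.0, distribution="poisson", rng=None):
    """Draw one set of detected counts per speaker from true counts."""
    rng = np.random.default_rng() if rng is None else rng
    true_counts = np.asarray(true_counts, dtype=float)
    mu = np.asarray(mu, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    # Detection/confusion rates; row i = true speaker, column j = detected speaker
    lam = rng.gamma(shape=alpha, scale=mu / alpha)
    expected = lam * true_counts[:, None]
    if distribution == "poisson":
        detected = rng.poisson(expected)
    elif distribution == "gamma":
        # shape = tau * mean and rate = tau give mean `expected`
        # and variance `expected / tau`; rounding to the nearest integer
        detected = np.floor(rng.gamma(shape=tau * expected, scale=1.0 / tau) + 0.5)
    else:
        raise ValueError(f"unknown distribution: {distribution!r}")
    # Sum over true speakers i to get the count attributed to each speaker j
    return detected.sum(axis=0).astype(int)

true_counts = [120, 30, 200, 50]                  # CHI, OCH, FEM, MAL
mu = np.full((4, 4), 0.05) + 0.75 * np.eye(4)     # hypothetical mean rates
alpha = np.full((4, 4), 20.0)                     # hypothetical shapes
for dist in ("poisson", "gamma"):
    out = simulate_detected(true_counts, mu, alpha, tau=2.0,
                            distribution=dist, rng=np.random.default_rng(0))
    print(dist, out)
```

The off-diagonal entries of `mu` play the role of confusion rates: they move vocalizations from their true speaker to another detected speaker, which is how spurious correlations between speakers can arise.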

Citation

If you use this package, please mention both of the following references:

@online{diarization-simulation,
author={Lucas Gautheron},
year=2025,
title={Diarization Simulation: A Python package for simulating speaker diarization with {LENA and VTC} from ground truth vocalization data},
url={https://github.com/LAAC-LSCP/diarization-simulation}
}

@misc{Gautheron2025,
  title = {Classification errors distort findings in automated speech processing: examples and solutions from child-development research},
  url = {http://dx.doi.org/10.31234/osf.io/u925y_v1},
  author = {Gautheron,  Lucas and Kidd,  Evan and Malko,  Anton and Lavechin,  Marvin and Cristia,  Alejandrina},
  year = {2025},
}
