This repository contains the data and training code for the speech-to-speech model described in *RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue*.
```shell
git clone https://github.com/maiong25/relays2s.git
cd relays2s
pip install -r requirements.txt
```

Requirements: Python ≥ 3.10, PyTorch ≥ 2.1, and a CUDA-capable GPU (tested on an NVIDIA L40S, 48 GB).
Download the speech-to-speech dataset and our model checkpoints at:
Place the downloaded files so that your directory looks like:
```
.
├── data/
│   ├── train.jsonl
│   ├── valid.jsonl
│   ├── test.jsonl
│   ├── syn_dialogs/          # synthesized audio files
│   └── noises/               # TAU Urban Acoustic Scenes noise clips
└── checkpoints/
    ├── pretrained_models/
    │   ├── speech_encoder.pt
    │   └── adapter.pt
    ├── adapter/
    │   └── best.ckpt
    └── s2s/
        └── best.ckpt
```
Each line in the JSONL files contains a single conversation with the following fields:
| Field | Description |
|---|---|
| `conv_id` | Unique conversation identifier |
| `conv_len` | Total conversation duration in seconds |
| `num_samples` | Total number of audio samples |
| `sampling_rate` | Audio sample rate (16 kHz) |
| `user_audio` | Path to the user's audio file (`.wav`) |
| `user_label` | List of user utterances with start/end timestamps and text |
| `assistant_audio` | Path to the assistant's audio file (`.wav`) |
| `assistant_label` | List of assistant utterances with start/end timestamps and text |
Each utterance in `user_label` / `assistant_label` has:

- `start` / `end`: timestamps in seconds
- `text`: transcript
- `type`: `"standard"` (normal speech), `"backchannel"` (e.g. "oh", "got it"), or `"interrupted"` (user barge-in)
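As a quick sanity check on the schema above, the two label streams can be merged into a single time-ordered transcript. This is an illustrative helper, not part of the repository:

```python
import json

def load_conversations(path):
    """Yield one conversation dict per line of a JSONL split."""
    with open(path) as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def merge_turns(conv):
    """Interleave user and assistant utterances in time order."""
    segs = [dict(u, speaker="user") for u in conv["user_label"]]
    segs += [dict(u, speaker="assistant") for u in conv["assistant_label"]]
    return sorted(segs, key=lambda s: s["start"])
```

Because backchannels overlap the other speaker's turn, sorting by `start` places them where they actually occurred in the dialogue.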
Example data entry
```json
{
  "conv_id": "3531e611927743a18cfaead4fe95ed11",
  "conv_len": 122.974,
  "num_samples": 1967583,
  "sampling_rate": 16000,
  "subset": "back_channels",
  "user_audio": "3531e611927743a18cfaead4fe95ed11_id08234_cajgMTmxOLk.wav",
  "user_label": [
    {
      "start": 0.0,
      "end": 10.596,
      "text": "Hey there! I was just thinking about the upcoming weekend and wanted to brainstorm some ideas for getting out of the city. I'm really craving some fresh air and maybe something a bit adventurous.",
      "type": "standard"
    },
    {
      "start": 27.614,
      "end": 40.246,
      "text": "Hmm, definitely more active. I was actually picturing something by the water, maybe even trying something new. I've always wanted to try water skiing, but I've never actually done it before. It seems like a lot of fun, but also a bit intimidating.",
      "type": "standard"
    },
    {
      "start": 56.336,
      "end": 68.002,
      "text": "That's good to know! I guess I just need to embrace the wipeouts then, haha. If I were to try it, what kind of gear would I need? I assume a wetsuit is a must, especially this time of year?",
      "type": "standard"
    },
    {
      "start": 85.558,
      "end": 97.652,
      "text": "Okay, wetsuit, life vest, skis, sunscreen. That makes sense. And what about after the activity? Is there anything specific I should plan for, like a change of clothes or a warm drink?",
      "type": "standard"
    },
    {
      "start": 113.646,
      "end": 118.834,
      "text": "That sounds like a perfect plan. I'm feeling much more confident about trying it now. Thanks for all the tips!",
      "type": "standard"
    }
  ],
  "assistant_audio": "3531e611927743a18cfaead4fe95ed11_id05535_pqxDKOc2ENo.wav",
  "assistant_label": [
    {
      "start": 10.796,
      "end": 27.414,
      "text": "Absolutely! I can help you brainstorm some options. What kind of adventure are you leaning towards? Are you thinking more active, like hiking or water sports, or something more relaxed, like a scenic drive or a visit to a charming town?",
      "type": "standard"
    },
    {
      "start": 35.874,
      "end": 36.674,
      "text": "oh",
      "type": "backchannel"
    },
    {
      "start": 40.446,
      "end": 56.136,
      "text": "Water skiing sounds like a fantastic idea for an active day out! It's definitely exhilarating once you get the hang of it. For a beginner, it's common to wipe out a few times initially, but most people pick it up surprisingly quickly with a good instructor and calm water conditions.",
      "type": "standard"
    },
    {
      "start": 68.202,
      "end": 85.358,
      "text": "Yes, a wetsuit is absolutely essential, especially if the water is still cool. You'll also need a life vest, which is usually provided by rental places or tour operators, and the skis themselves. Don't forget sunscreen, even on cooler days, as the sun reflects strongly off the water.",
      "type": "standard"
    },
    {
      "start": 89.536,
      "end": 90.336,
      "text": "got it",
      "type": "backchannel"
    },
    {
      "start": 97.852,
      "end": 113.446,
      "text": "Definitely bring a towel and a full change of dry, warm clothes for after you get out of the water. A warm drink, like hot chocolate or tea, would be perfect to warm you up, and maybe some snacks to refuel. Many water sports locations also have cafes or restaurants nearby.",
      "type": "standard"
    },
    {
      "start": 119.034,
      "end": 122.974,
      "text": "You're very welcome! I'm glad I could help. Have an amazing time on the water!",
      "type": "standard"
    }
  ]
}
```

Training proceeds in three stages. All stages can run on a single NVIDIA L40S GPU (48 GB VRAM).
Convert the raw JSONL + audio into Lhotse Shar shards with precomputed filterbank features. If a noise directory is provided, background noise is mixed in during feature extraction.
```shell
# Training set
python -m data.prepare_shar \
    --input-jsonl data/train.jsonl \
    --audios-dir data/syn_dialogs/ \
    --output-dir shars/train \
    --noise_dir data/noises \
    --num-workers 4
```
Preparation for the validation and test sets is analogous (swap in `valid.jsonl` / `test.jsonl` and the output directories `shars/valid` / `shars/test`).
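The background-noise augmentation mentioned above (see `data/noise_mixing.py`) amounts to additive mixing at a target SNR. A minimal sketch; the repository's actual SNR range and noise-tiling policy are assumptions here:

```python
import numpy as np

def mix_noise(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Additively mix a noise clip into speech at a target SNR (dB).

    Illustrative only: tiles/trims the noise to the speech length,
    then scales it so the speech-to-noise power ratio matches snr_db.
    """
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]
    speech_pow = np.mean(speech ** 2)
    noise_pow = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_pow / (noise_pow * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

At 0 dB SNR the scaled noise carries the same power as the speech; higher SNR values attenuate it.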
Freeze the LLM backbone and train only the speech encoder (upper layers) and CNN adapter, using speech transcription as the learning objective to align adapted representations with the LLM's text embedding space.
```shell
python train.py \
    config=configs/adapter.yaml \
    data.train_shar=shars/train \
    data.val_shar=shars/valid \
    speech_encoder.ckpt_path=checkpoints/pretrained_models/speech_encoder.pt \
    adapter.ckpt_path=checkpoints/pretrained_models/adapter.pt
```

The checkpoint will be saved to `checkpoints/adapter/`.
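Conceptually, the stage-1 freezing looks like the following toy sketch. `ToyS2S` and its submodule names are illustrative stand-ins, not the repository's `S2SModel`:

```python
import torch.nn as nn

class ToyS2S(nn.Module):
    """Stand-in with the same three parts: speech encoder + adapter + LLM."""
    def __init__(self):
        super().__init__()
        self.speech_encoder = nn.Linear(80, 64)
        self.adapter = nn.Conv1d(64, 64, kernel_size=3)
        self.llm = nn.Linear(64, 64)

def freeze_llm(model: nn.Module):
    """Stage 1: freeze the LLM backbone; encoder and adapter stay trainable."""
    for p in model.llm.parameters():
        p.requires_grad = False
    return [name for name, p in model.named_parameters() if p.requires_grad]
```

Passing only the still-trainable parameters to the optimizer then realizes the stage-1 objective of aligning adapted speech representations with the frozen LLM's text embedding space.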
Unfreeze all parameters and train end-to-end on the synthetic duplex conversation data. Initialize from the adapter pretraining checkpoint.
```shell
python train.py \
    config=configs/s2s.yaml \
    data.train_shar=shars/train \
    data.val_shar=shars/valid \
    training.load_weights_from_ckpt=checkpoints/adapter/last.ckpt
```

The checkpoint will be saved to `checkpoints/s2s/`.
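Initializing from a prior stage amounts to restoring a checkpoint's `state_dict` into the model. A hedged sketch — the `"state_dict"` key (Lightning's convention) and the use of `strict=False` to tolerate stage-specific keys are assumptions about the repository's checkpoints:

```python
import torch
import torch.nn as nn

def load_weights_from_ckpt(model: nn.Module, ckpt_path: str) -> None:
    """Initialize a model from an earlier stage's checkpoint.

    Assumes a Lightning-style file with a 'state_dict' entry;
    strict=False skips parameters that exist in only one stage.
    """
    ckpt = torch.load(ckpt_path, map_location="cpu")
    state = ckpt.get("state_dict", ckpt)
    model.load_state_dict(state, strict=False)
```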
Any value in the YAML configs can be overridden from the command line:
```shell
python train.py \
    config=configs/s2s.yaml \
    training.learning_rate=1e-5 \
    training.max_steps=500000 \
    training.grad_accum=4 \
    data.train_shar=shars/train \
    data.val_shar=shars/valid \
    wandb.run_name=s2s_lr1e5
```

Run the generation config to produce response generations and (optionally) log per-token hidden-state features for the prefix verifier:
```shell
python train.py \
    config=configs/generation.yaml \
    data.train_shar=shars/test \
    data.val_shar=shars/test \
    training.load_weights_from_ckpt=checkpoints/s2s/last.ckpt
```

This runs a validation pass that, for each `[BOS]` position in the test set:
- Performs autoregressive decoding with zeroed speech features (simulating the speculative stream).
- Records the generated response, dialogue context, and word-level timing.
- Optionally saves per-token hidden states and logit features (first 20 tokens) for verifier training.
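The decoding step is ordinary greedy autoregressive generation, just with the speech features zeroed out. A generic illustrative loop, not the repository's implementation — `step_fn` abstracts the model forward pass:

```python
import torch

@torch.no_grad()
def greedy_decode(step_fn, bos_id: int, eos_id: int, max_len: int = 32):
    """Greedy autoregressive decoding from a [BOS] position.

    step_fn(token_ids) returns next-token logits; the caller is assumed
    to have zeroed the speech features (simulating the speculative stream).
    """
    tokens = [bos_id]
    for _ in range(max_len):
        next_id = int(step_fn(torch.tensor(tokens)).argmax())
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens
```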
Outputs are written to checkpoints/generations/:
| File | Description |
|---|---|
| `generations.jsonl` | One JSON object per generated response with `conv_id`, `context`, `response`, and `time_to_words` |
| `features/` | Per-sample `.pt` files containing hidden states and logit calibration signals (when `eval.log_features=true`) |
Coming soon!
```
.
├── configs/
│   ├── adapter.yaml             # Stage 1: encoder–adapter pretraining
│   ├── s2s.yaml                 # Stage 2: full-duplex S2S training
│   ├── speech_encoder.yaml      # Conformer encoder hyperparameters
│   └── generation.yaml          # Response generation & feature logging
├── data/
│   ├── dataset.py               # Lhotse-based S2S dataset & collation
│   ├── prepare_shar.py          # JSONL → Shar conversion with fbank extraction
│   ├── noise_mixing.py          # Additive noise augmentation
│   └── split_train_symlinks.sh  # Split shards for multi-GPU
├── models/
│   ├── s2s_model.py             # S2SModel: encoder + adapter + LLM
│   ├── adapter.py               # CNN subsampling adapter
│   ├── masks.py                 # Attention mask utilities
│   └── encoder/                 # Streaming conformer speech encoder
├── modules/
│   ├── s2s_module.py            # PyTorch Lightning modules (S2S + Adapter)
│   ├── data_module.py           # Lightning DataModule with Lhotse samplers
│   └── feature_extractor.py     # Offline & streaming fbank extraction
├── metrics.py                   # Turn-taking metric computation
├── utils.py                     # Seed, tokenizer setup, word-timing
├── train.py                     # Main training entry point
└── README.md
```
```bibtex
@misc{mai2026relays2sdualpathspeculativegeneration,
  title={RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue},
  author={Long Mai},
  year={2026},
  eprint={2603.23346},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2603.23346},
}
```
Special thanks to the VITA team (Freeze-Omni) for providing the source code and pretrained weights for the streaming speech encoder and adapter.
