Performance Benchmarks

This page records the measurements behind the scheduling rules and the concurrent-decode cap described in Prioritisation and Queuing. The numbers come from benchmarking every workflow step on three storage types, plus the live observation that prompted the recent CPU-affinity work.

The raw CSVs and the original benchmark plan are not tracked in the repo (they were removed in e986c01 "Cleanup benchmarks" after the analysis was extracted), but everything below is reconstructed from that work and confirmed by ongoing observation on the reference machine.

Test system

CPU: AMD Ryzen 9 9950X3D, 16 physical cores / 32 SMT threads, two L3 cache domains (96 MiB V-Cache CCD + 32 MiB standard CCD)
RAM: ~126 GB

Storage:

Pool	Type	Filesystem
`hdd1bpool`	2× 16 TB HDD mirror	ZFS, 128K recordsize, zstd-3
`intel1tb`	Intel 960 GB SATA SSD	ZFS, 128K recordsize, zstd-3
`nvme2tb`	Kingston 2 TB NVMe	ZFS, 1M recordsize, zstd-3

Test file: Mercer_Tearooms_New2 — 16 GB .lds, PAL.
Method: monitor CPU%, RAM, per-device read/write MB/s and disk utilisation at 2 s intervals during each step. Each step was then repeated with caches dropped (`echo 3 > /proc/sys/vm/drop_caches`, ZFS ARC limited to 8 GB) to separate I/O-bound from algorithm-bound steps.
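
The original monitoring script is not in the repo, so the following is only a minimal sketch of how per-device MB/s and utilisation can be derived from two reads of `/proc/diskstats` taken a known interval apart. The names `DiskSample`, `parse_diskstats` and `rates` are hypothetical, and the device name passed in (for example `sda`) is an assumption about the pool's member disk.

```python
from dataclasses import dataclass

SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


@dataclass
class DiskSample:
    sectors_read: int
    sectors_written: int
    io_ms: int  # cumulative milliseconds the device spent doing I/O


def parse_diskstats(text: str, device: str) -> DiskSample:
    """Extract one device's counters from the contents of /proc/diskstats."""
    for line in text.splitlines():
        fields = line.split()
        # fields: major, minor, name, reads, reads merged, sectors read,
        # ms reading, writes, writes merged, sectors written, ms writing,
        # I/Os in flight, ms doing I/O, ...
        if len(fields) >= 13 and fields[2] == device:
            return DiskSample(int(fields[5]), int(fields[9]), int(fields[12]))
    raise KeyError(device)


def rates(prev: DiskSample, cur: DiskSample, interval_s: float) -> dict:
    """Convert two samples taken interval_s apart into MB/s and utilisation %."""
    to_mb = SECTOR_BYTES / 1e6
    return {
        "read_mb_s": (cur.sectors_read - prev.sectors_read) * to_mb / interval_s,
        "write_mb_s": (cur.sectors_written - prev.sectors_written) * to_mb / interval_s,
        "util_pct": 100.0 * (cur.io_ms - prev.io_ms) / (interval_s * 1000.0),
    }
```

In a live run this would sit in a loop that reads `/proc/diskstats`, sleeps 2 s, reads it again and logs the result of `rates(prev, cur, 2.0)` alongside CPU% and RAM.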

Per-step single-job results

ARC-cached baseline (CPU-bound view)

Step	FPS	Duration	Avg CPU	Peak CPU	RAM growth	Notes
Decode	~9.5	~14 min	7–8%	~22%	+13 GB	Algorithm-bound; CPU% is misleadingly low because vhs-decode is multi-process, not multi-threaded
Export	~170	~47 s	68%	100%	+20 GB	CPU-bound once I/O is out of the way
Compress	—	~6.5 min	10%	33%	+12.7 GB	CPU-only path (no GPU); low CPU%
Align	—	~15 s	7%	14%	~0	Trivial
Final mux	—	~14 s	13%	54%	+7.5 GB	Brief CPU spike

Uncached, real disk I/O

SATA SSD:

Step	Duration	Avg CPU	Read MB/s	Write MB/s	Disk util	Bottleneck
Decode	~12 min	7.5%	15	25	9%	Algorithm
Export	~1.8 min	28%	117	30	56%	I/O
Compress	~7 min	4%	27	22	13%	Algorithm
Align	~8 s	2.5%	4	12	4%	Trivial
Final mux	~38 s	5%	110	103	89%	I/O

HDD:

Step	Duration	Avg CPU	Read MB/s	Write MB/s	Disk util	Bottleneck
Decode	~16 min	6.8%	12	19	39%	Algorithm (HDD slower by ~33%)
Export	~1.9 min	38%	94	39	83%	I/O
Compress	~15 min	4%	13	11	61%	I/O (~2× slower than SSD)
Align	~6 s	5%	—	—	13%	Trivial
Final mux	~78 s	5%	48	64	94%	I/O

NVMe:

Step	Duration	Avg CPU	Read MB/s	Write MB/s	Disk util	Bottleneck
Decode	~12 min	7.3%	18	25	1%	Algorithm
Export	~58 s	77%	292	36	9%	CPU (storage no longer limits)
Compress	~6.5 min	4%	30	23	2%	Algorithm
Align	~8 s	5%	—	—	1%	Trivial
Final mux	~8 s	19%	388	379	16%	Storage-fast; ~5× faster than SATA

What this told us

Decode is algorithm-bound (consistently ~9.5 FPS at ~7% CPU) on every storage type. Faster storage doesn't help; what matters is the CPU cores and cache available to each decode.
Export's bottleneck moves from I/O on HDD and SATA SSD to CPU on NVMe: at NVMe speeds the encoder becomes the limit.
Compress is algorithm-bound on SSD and NVMe and becomes I/O-bound only on HDD.
Final mux is heavily I/O-bound on everything except NVMe.

Parallel-decode finding (HDD)

The single number that drove the original HDD scheduling rule:

HDD: 4 parallel decodes @ ~4.0 FPS each = 16 FPS total (vs 9.5 FPS for one).

So even though each decode slows down (9.5 → 4.0 FPS), the aggregate throughput goes up. Beyond 4, head-seek contention dominates and the total stops rising. This is why SCHEDULING_RULES["hdd"]["normal"] allows 4 light jobs concurrently.
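
The arithmetic behind that trade-off, written out as a small sketch using only the two figures quoted above (the function name is illustrative):

```python
def aggregate_fps(per_decode_fps: float, concurrent: int) -> float:
    """Total frames per second across all concurrent decodes."""
    return per_decode_fps * concurrent


single = aggregate_fps(9.5, 1)   # one decode on HDD
four = aggregate_fps(4.0, 4)     # four parallel decodes on HDD

print(f"total: {four:.1f} FPS, {four / single:.2f}x one decode")
print(f"each decode runs at {4.0 / 9.5:.0%} of its solo speed")
# total: 16.0 FPS, 1.68x one decode
# each decode runs at 42% of its solo speed
```

So a single capture takes more than twice as long to finish, but a queue of four finishes roughly 1.7× sooner than running them back to back.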

The SSD/NVMe rules (8 light jobs) were extrapolated from the much lower per-decode disk utilisation (~9% on SATA SSD) — disk isn't going to be the limit, so the limit must come from somewhere else. Where it came from on SSD wasn't pinned down until the CPU work below.

The CPU side — where the "4 decodes" cap really comes from

The original benchmark planning estimated decode could run "10+ jobs" in parallel based purely on per-decode CPU% (~7%). That estimate didn't survive contact with reality, and the failure mode taught us something useful:

vhs-decode is multi-process, not multi-threaded. With -t 3 it spawns three worker processes plus a main, consuming about 3.25 cores worth (we measured 325% CPU on a running decode). Per-job CPU% as reported by ps understates the actual core occupancy.
The 9950X3D has two CCDs, each with its own L3 cache. Cores 0–7 share the 96 MiB V-Cache L3; cores 8–15 share the 32 MiB standard L3. A cross-CCD hop (a thread on CCD 0 touching data recently cached in CCD 1's L3) costs ~80 ns vs ~10 ns within a CCD.
Without pinning, the kernel scheduler bounces vhs-decode's 9 threads across all 32 logical CPUs. Two concurrent decodes routinely have threads on both CCDs, with each L3 evicting the other decode's working set. Per-decode FPS drops well beyond what the disk numbers predict.

So the actual ceiling for full-speed parallel decodes on this hardware is floor(physical_cores / 4) = 4 — one decode per physical-core quartet, ideally with each decode's quartet inside a single L3 group. That's what the toolkit now does, by pinning each decode to a disjoint physical-core set at job start (see Prioritisation and Queuing for the placement logic).
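
As a rough illustration of that placement idea (not the toolkit's actual allocator, which is described in Prioritisation and Queuing), the sketch below splits each L3 group into disjoint four-core sets and shows the Linux call that pins a process to one of them. `decode_cap`, `core_sets` and `pin_current_process` are hypothetical names, and the mapping from physical cores to logical CPU ids (SMT siblings) is deliberately left out.

```python
from __future__ import annotations

import os

CORES_PER_DECODE = 4  # main process + three workers (-t 3)


def decode_cap(physical_cores: int, cores_per_decode: int = CORES_PER_DECODE) -> int:
    """floor(physical_cores / cores_per_decode)."""
    return physical_cores // cores_per_decode


def core_sets(l3_groups: list[list[int]],
              cores_per_decode: int = CORES_PER_DECODE) -> list[set[int]]:
    """Split each L3 group into disjoint core sets that never span two groups."""
    sets: list[set[int]] = []
    for group in l3_groups:
        for i in range(0, len(group) - cores_per_decode + 1, cores_per_decode):
            sets.append(set(group[i:i + cores_per_decode]))
    return sets


def pin_current_process(cores: set[int]) -> None:
    """Restrict the calling process to the given CPUs (Linux only)."""
    os.sched_setaffinity(0, cores)


# The reference machine as described above: cores 0-7 on one L3, 8-15 on the other.
quartets = core_sets([list(range(0, 8)), list(range(8, 16))])
print(decode_cap(16), quartets)
```

One plausible way to apply a set at job start is to pass `preexec_fn=lambda: pin_current_process(cores)` to `subprocess.Popen`, so the affinity is set in the child before vhs-decode starts and is inherited by its workers.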

Why the formula uses 4 cores per decode

Component	Cores
vhs-decode main process	1
Worker processes (`-t 3`)	3
Total per decode	4

If -t is changed in the executor, the constant _CORES_PER_DECODE in job_queue_manager.py should change to match.

Cap on other CPUs

CPU example	Physical cores	Decode cap
4-core (Ryzen 5 / Core i3)	4	1
8-core (Ryzen 7 / Core i7)	8	2
12-core	12	3
16-core (this machine)	16	4
32-core (Threadripper / EPYC)	32	8

The cap is auto-detected at WCC startup and can be overridden via performance_settings.max_concurrent_decodes in config/config.json.
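
The table and the override can be reproduced with a few lines. This is a sketch under two assumptions: that the config key is nested as `performance_settings` → `max_concurrent_decodes` in the JSON, and that an override simply replaces the detected value. The function names are illustrative and do not come from job_queue_manager.py.

```python
import json

WORKERS = 3  # value passed to vhs-decode via -t


def cores_per_decode(workers: int) -> int:
    """One main process plus one core per worker process."""
    return 1 + workers


def effective_decode_cap(physical_cores: int, config: dict,
                         workers: int = WORKERS) -> int:
    """Detected cap, unless the config supplies an explicit override."""
    detected = physical_cores // cores_per_decode(workers)
    override = config.get("performance_settings", {}).get("max_concurrent_decodes")
    return detected if override is None else int(override)


# Assumed shape of the relevant fragment of config/config.json.
example_config = json.loads('{"performance_settings": {"max_concurrent_decodes": 2}}')

for cores in (4, 8, 12, 16, 32):
    print(cores, effective_decode_cap(cores, {}))
print("16 cores with override:", effective_decode_cap(16, example_config))
```

Changing `WORKERS` shows why the constant has to track `-t`: with `-t 7`, for example, the same 16-core machine would drop to a cap of 2.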

What the benchmarks did not capture

Worth being explicit about gaps so the rationale is honest:

No SSD parallel-decode benchmark was run with the original suite — the HDD "4 @ 4.0 FPS" was the only parallel measurement. The SSD rule of "8 light jobs" was extrapolated, not measured. The CPU work has since shown that even on SSD the cap-by-CPU-cores constraint binds before the disk one does, so the 8-light rule is effectively unreachable on a 16-core CPU.
No cross-CCD vs single-CCD comparison was run as a controlled experiment. The conclusion that pinning helps is inferred from kernel migration behaviour and reasoned-about hardware properties; the magnitude of the win is currently anecdotal ("per-decode fps drops noticeably with multiple concurrent decodes when unpinned").
No before/after pinning benchmark has been published.

Planned: before/after pinning benchmark

A controlled measurement of the CPU-affinity work is the next benchmark to run. The shape of the test:

Disable pinning temporarily (e.g. force affinity_allocator = None in JobQueueManager.__init__, or comment out the preexec_fn argument in _execute_vhs_decode_job).
Queue 2 concurrent decodes of a known capture. Record per-decode FPS over the steady-state portion of the run.
Repeat with 3 concurrent decodes, then 4.
Re-enable pinning. Repeat each of the 2 / 3 / 4 concurrent runs.
Compare per-decode FPS and total throughput, unpinned vs pinned.
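
For the comparison step, a helper along these lines would be enough. It is a sketch only: the input format (a mapping from concurrency level to the list of steady-state per-decode FPS readings) and the names `summarise` and `compare` are assumptions, and no results are included because none have been measured yet.

```python
from __future__ import annotations

from statistics import mean


def summarise(per_decode_fps: list[float]) -> dict:
    """Mean per-decode FPS and total throughput for one concurrent run."""
    return {"per_decode_fps": mean(per_decode_fps), "total_fps": sum(per_decode_fps)}


def compare(unpinned: dict[int, list[float]],
            pinned: dict[int, list[float]]) -> list[dict]:
    """One row per concurrency level present in both result sets."""
    rows = []
    for n in sorted(unpinned.keys() & pinned.keys()):
        u, p = summarise(unpinned[n]), summarise(pinned[n])
        rows.append({
            "concurrent": n,
            "unpinned_per_decode_fps": u["per_decode_fps"],
            "pinned_per_decode_fps": p["per_decode_fps"],
            "unpinned_total_fps": u["total_fps"],
            "pinned_total_fps": p["total_fps"],
            # Positive means pinning raised total throughput.
            "pinned_gain_pct": 100.0 * (p["total_fps"] / u["total_fps"] - 1.0),
        })
    return rows
```

Calling `compare(unpinned, pinned)` with the 2 / 3 / 4 concurrent runs from each configuration gives the rows needed for a results table in this section.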

What we expect to see, based on the cross-CCD reasoning:

2 concurrent: clear win for pinning (each decode gets a dedicated L3 cache; unpinned, both decodes can have threads on both CCDs).
3 concurrent: smaller but still measurable win (two decodes still get dedicated L3s, the third shares with one).
4 concurrent: smallest win or no win (cores are exhausted either way; pinning prevents cross-CCD migration but L3 is shared 2-and-2 between the four decodes regardless).

If anyone with comparable hardware (Zen 4 / Zen 5 with two CCDs, or Intel hybrid with P+E clusters) wants to run this and post results, it would close the loop on the rationale above. The results would go in this section.

Storage rules now in code

For reference, the rules that the benchmarks led to, currently encoded in job_queue_manager.py::SCHEDULING_RULES:

Storage	Normal (no heavy job)	Heavy job running
HDD	4 light, 1 heavy	2 light, 0 heavy
SSD / NVMe	8 light, 1 heavy	4 light, 0 heavy
default	4 light, 1 heavy	2 light, 0 heavy

Light = vhs-decode, lds-compress, audio-align, checksum. Heavy = tbc-export, final-mux.

The decode cap from CPU topology applies on top of these.
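
To make "applies on top" concrete, here is a sketch of how the two limits combine for decodes. The table values are copied from above; the dictionary layout below the first two keys, the `heavy_running` key name and the function `decode_slots` are assumptions for illustration, not the contents of job_queue_manager.py.

```python
# Illustrative copy of the table above; the inner layout is assumed.
SCHEDULING_RULES_SKETCH = {
    "hdd":     {"normal": {"light": 4, "heavy": 1}, "heavy_running": {"light": 2, "heavy": 0}},
    "ssd":     {"normal": {"light": 8, "heavy": 1}, "heavy_running": {"light": 4, "heavy": 0}},
    "default": {"normal": {"light": 4, "heavy": 1}, "heavy_running": {"light": 2, "heavy": 0}},
}


def decode_slots(storage: str, heavy_running: bool, decode_cap: int) -> int:
    """Concurrent decodes allowed: the storage rule, clamped by the CPU cap."""
    rules = SCHEDULING_RULES_SKETCH.get(storage, SCHEDULING_RULES_SKETCH["default"])
    state = "heavy_running" if heavy_running else "normal"
    return min(rules[state]["light"], decode_cap)


# On the 16-core reference machine (decode cap 4):
print(decode_slots("ssd", False, 4))  # storage allows 8 light jobs, CPU cap wins
print(decode_slots("hdd", True, 4))   # storage rule wins while a heavy job runs
```

The other light job types (lds-compress, audio-align, checksum) are not decodes, so in this sketch only the storage rule limits them.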

Related pages

Prioritisation and Queuing — how the rules are applied at scheduling time
Progress Reporting — how the per-job FPS / MB/s the benchmarks measured are surfaced live in the matrix
