
Performance Benchmarks

Marshalleq edited this page May 24, 2026 · 2 revisions


This page records the basis for the scheduling rules and the concurrent-decode cap described in Prioritisation and Queuing. The numbers come from a methodical pass through every workflow step on three storage types, plus the live finding that informed the recent CPU-affinity work.

The raw CSVs and the original benchmark plan are not tracked in the repo (they were removed in e986c01 "Cleanup benchmarks" after the analysis was extracted), but everything below is reconstructed from that work and confirmed by ongoing observation on the reference machine.

Test system

  • CPU: AMD Ryzen 9 9950X3D, 16 physical cores / 32 SMT threads, two L3 cache instances (96 MiB V-Cache CCD + 32 MiB standard CCD)

  • RAM: ~126 GB

  • Storage:

    | Pool | Type | Filesystem |
    |---|---|---|
    | hdd1bpool | 2× 16 TB HDD mirror | ZFS, 128K recordsize, zstd-3 |
    | intel1tb | Intel 960 GB SATA SSD | ZFS, 128K recordsize, zstd-3 |
    | nvme2tb | Kingston 2 TB NVMe | ZFS, 1M recordsize, zstd-3 |
  • Test file: Mercer_Tearooms_New2 — 16 GB .lds, PAL.

  • Method: monitor CPU%, RAM, per-device read/write MB/s and disk utilisation at 2 s intervals during each step. Repeated with caches dropped (echo 3 > /proc/sys/vm/drop_caches, ZFS ARC limited to 8 GB) to separate I/O-bound from algorithm-bound steps.
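
The per-device read/write rates and disk utilisation named in the method can be derived from two snapshots of /proc/diskstats taken one interval apart. The sketch below illustrates that idea only; it is not the script used for these benchmarks. The function names are hypothetical, and it assumes the Linux diskstats layout, in which sector counts are in 512-byte units and the tenth statistics field is the number of milliseconds the device spent doing I/O.

```python
import time

SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of device


def parse_diskstats(text):
    """Map device name -> (sectors_read, sectors_written, ms_doing_io)."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 14:
            continue  # skip blank or truncated lines
        # fields: major, minor, name, then the per-device counters
        stats[fields[2]] = (int(fields[5]), int(fields[9]), int(fields[12]))
    return stats


def device_rates(before, after, interval_s):
    """Return (read MB/s, write MB/s, util %) between two snapshots of one device."""
    read_mb = (after[0] - before[0]) * SECTOR_BYTES / 1e6 / interval_s
    write_mb = (after[1] - before[1]) * SECTOR_BYTES / 1e6 / interval_s
    util = 100.0 * (after[2] - before[2]) / (interval_s * 1000.0)
    return read_mb, write_mb, min(util, 100.0)


def sample(device, interval_s=2.0, path="/proc/diskstats"):
    """Take two snapshots interval_s apart and report rates for one device."""
    with open(path) as fh:
        first = parse_diskstats(fh.read())[device]
    time.sleep(interval_s)
    with open(path) as fh:
        second = parse_diskstats(fh.read())[device]
    return device_rates(first, second, interval_s)
```

Calling sample("sda") in a loop while a step runs would give one row of readings every 2 s; the device name is a placeholder for whichever disk backs the pool under test.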

Per-step single-job results

ARC-cached baseline (CPU-bound view)

| Step | FPS | Duration | Avg CPU | Peak CPU | RAM growth | Notes |
|---|---|---|---|---|---|---|
| Decode | ~9.5 | ~14 min | 7–8% | ~22% | +13 GB | Algorithm-bound; CPU% is misleadingly low because vhs-decode is multi-process, not multi-threaded |
| Export | ~170 | ~47 s | 68% | 100% | +20 GB | CPU-bound once I/O is out of the way |
| Compress | - | ~6.5 min | 10% | 33% | +12.7 GB | CPU-only path (no GPU); low CPU% |
| Align | - | ~15 s | 7% | 14% | ~0 | Trivial |
| Final mux | - | ~14 s | 13% | 54% | +7.5 GB | Brief CPU spike |

Uncached, real disk I/O

SATA SSD:

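| Step | Duration | Avg CPU | Read MB/s | Write MB/s | Disk util | Bottleneck |
|---|---|---|---|---|---|---|
| Decode | ~12 min | 7.5% | 15 | 25 | 9% | Algorithm |
| Export | ~1.8 min | 28% | 117 | 30 | 56% | I/O |
| Compress | ~7 min | 4% | 27 | 22 | 13% | Algorithm |
| Align | ~8 s | 2.5% | 4 | 12 | 4% | Trivial |
| Final mux | ~38 s | 5% | 110 | 103 | 89% | I/O |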

HDD:

| Step | Duration | Avg CPU | Read MB/s | Write MB/s | Disk util | Bottleneck |
|---|---|---|---|---|---|---|
| Decode | ~16 min | 6.8% | 12 | 19 | 39% | Algorithm (HDD slower by ~33%) |
| Export | ~1.9 min | 38% | 94 | 39 | 83% | I/O |
| Compress | ~15 min | 4% | 13 | 11 | 61% | I/O (~2× slower than SSD) |
| Align | ~6 s | 5% | - | - | 13% | Trivial |
| Final mux | ~78 s | 5% | 48 | 64 | 94% | I/O |

NVMe:

| Step | Duration | Avg CPU | Read MB/s | Write MB/s | Disk util | Bottleneck |
|---|---|---|---|---|---|---|
| Decode | ~12 min | 7.3% | 18 | 25 | 1% | Algorithm |
| Export | ~58 s | 77% | 292 | 36 | 9% | CPU (storage no longer limits) |
| Compress | ~6.5 min | 4% | 30 | 23 | 2% | Algorithm |
| Align | ~8 s | 5% | - | - | 1% | Trivial |
| Final mux | ~8 s | 19% | 388 | 379 | 16% | Storage-fast; ~5× faster than SATA |

What this told us

  • Decode is algorithm-bound (consistent ~9.5 FPS, ~7% CPU) on every storage type. Storage doesn't help; what matters is the per-decode CPU + cache resources.
  • Export's bottleneck moves from I/O on HDD/SATA to CPU on NVMe — at NVMe speeds the encoder becomes the limit.
  • Compress is algorithm-bound on SSD and NVMe and becomes I/O-bound only on HDD.
  • Final mux is heavily I/O-bound on everything except NVMe.

Parallel-decode finding (HDD)

The single number that drove the original HDD scheduling rule:

HDD: 4 parallel decodes @ ~4.0 FPS each = 16 FPS total (vs 9.5 FPS for one).

So even though each decode slows down (9.5 → 4.0 FPS), the aggregate throughput goes up. Beyond 4, head-seek contention dominates and the total stops rising. This is why SCHEDULING_RULES["hdd"]["normal"] allows 4 light jobs concurrently.
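
The arithmetic behind that trade-off, using only the figures quoted above, can be written out as a small check. The function name is illustrative and does not come from the toolkit.

```python
def aggregate_fps(per_decode_fps, concurrent):
    """Total throughput when `concurrent` decodes each run at per_decode_fps."""
    return per_decode_fps * concurrent


single_decode_fps = 9.5            # one decode on its own
total = aggregate_fps(4.0, 4)      # four decodes at ~4.0 FPS each
speedup = total / single_decode_fps

print(f"{total:.1f} FPS total, {speedup:.2f}x one decode")
# prints: 16.0 FPS total, 1.68x one decode
```

Each decode runs at well under half its solo speed, yet the pool as a whole gets through about two thirds more frames per second.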

The SSD/NVMe rules (8 light jobs) were extrapolated from the much lower per-decode disk utilisation (~9% on SATA SSD) — disk isn't going to be the limit, so the limit must come from somewhere else. Where it came from on SSD wasn't pinned down until the CPU work below.

The CPU side — where the "4 decodes" cap really comes from

The original benchmark planning estimated decode could run "10+ jobs" in parallel based purely on per-decode CPU% (~7%). That estimate didn't survive contact with reality, and the failure mode taught us something useful:

  • vhs-decode is multi-process, not multi-threaded. With -t 3 it spawns three worker processes plus a main, consuming about 3.25 cores worth (we measured 325% CPU on a running decode). Per-job CPU% as reported by ps understates the actual core occupancy.
  • The 9950X3D has two CCDs, each with its own L3 cache. Cores 0–7 share the 96 MiB V-Cache L3; cores 8–15 share the 32 MiB standard L3. A cross-CCD hop (a thread on CCD 0 touching data recently cached in CCD 1's L3) costs ~80 ns vs ~10 ns within a CCD.
  • Without pinning, the kernel scheduler bounces vhs-decode's 9 threads across all 32 logical CPUs. Two concurrent decodes routinely have threads on both CCDs, with each L3 evicting the other decode's working set. Per-decode FPS drops well beyond what the disk numbers predict.
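
Core occupancy for a multi-process job can be estimated by summing CPU time over every process in the job rather than reading one row of ps. The sketch below is one way to do that from /proc/<pid>/stat lines sampled twice; it is an illustration, not the measurement code behind the 325% figure. The function names are hypothetical, the same set of processes is assumed to be present in both samples, and ticks_per_s should be set from os.sysconf("SC_CLK_TCK") on the machine being measured (100 is only a common default).

```python
def cpu_ticks(stat_line):
    """Return utime + stime, in clock ticks, from one /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces, so split on the last ')' before tokenising the remainder.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so field 14 (utime) is rest[11]
    # and field 15 (stime) is rest[12].
    return int(rest[11]) + int(rest[12])


def cores_occupied(before_lines, after_lines, interval_s, ticks_per_s=100):
    """Average number of cores kept busy by a group of processes."""
    before = sum(cpu_ticks(line) for line in before_lines)
    after = sum(cpu_ticks(line) for line in after_lines)
    return (after - before) / ticks_per_s / interval_s
```

A result of 3.25 from this calculation corresponds to the 325% reading: the main process and its workers together keep a little over three cores busy.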

So the actual ceiling for full-speed parallel decodes on this hardware is floor(physical_cores / 4) = 4 — one decode per physical-core quartet, ideally with each decode's quartet inside a single L3 group. That's what the toolkit now does, by pinning each decode to a disjoint physical-core set at job start (see Prioritisation and Queuing for the placement logic).
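
A minimal sketch of that placement idea follows, under stated assumptions. It is not the toolkit's allocator: core_sets and launch_pinned are hypothetical names, the L3 groups are passed in by hand rather than detected, and logical CPU n is assumed to map to physical core n (SMT siblings are ignored). The only system calls used are os.sched_setaffinity, which is Linux-only, and subprocess.Popen with preexec_fn.

```python
import os
import subprocess

CORES_PER_DECODE = 4  # main process + three workers (-t 3)


def core_sets(l3_groups, cores_per_decode=CORES_PER_DECODE):
    """Split each L3 group into disjoint core sets that never straddle two L3s."""
    sets = []
    for group in l3_groups:
        # Step through the group in whole quartets; leftover cores are unused.
        for i in range(0, len(group) - cores_per_decode + 1, cores_per_decode):
            sets.append(set(group[i:i + cores_per_decode]))
    return sets


def launch_pinned(cmd, cores):
    """Start cmd with its CPU affinity restricted to `cores` (Linux only).

    The affinity is set in the child just before exec, and any worker
    processes the child spawns inherit it.
    """
    return subprocess.Popen(cmd, preexec_fn=lambda: os.sched_setaffinity(0, cores))


# Two L3 groups as on the reference machine: cores 0-7 and cores 8-15.
slots = core_sets([list(range(0, 8)), list(range(8, 16))])
print(len(slots))  # prints: 4
```

Because each slot is carved out of a single L3 group, a decode given one slot never has its working set split across CCDs, and the number of slots is the same 4 that the formula produces.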

Why the formula uses 4 cores per decode

| Component | Cores |
|---|---|
| vhs-decode main process | 1 |
| Worker processes (-t 3) | 3 |
| Total per decode | 4 |

If -t is changed in the executor, the constant _CORES_PER_DECODE in job_queue_manager.py should change to match.

Cap on other CPUs

| CPU example | Physical cores | Decode cap |
|---|---|---|
| 4-core (Ryzen 5 / Core i3) | 4 | 1 |
| 8-core (Ryzen 7 / Core i7) | 8 | 2 |
| 12-core | 12 | 3 |
| 16-core (this machine) | 16 | 4 |
| 32-core (Threadripper / EPYC) | 32 | 8 |

The cap is auto-detected at WCC startup and can be overridden via performance_settings.max_concurrent_decodes in config/config.json.
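
The cap calculation and the override can be sketched as follows. This is an illustration rather than the code in job_queue_manager.py: the function name is hypothetical, the physical core count is passed in instead of detected, and two behaviours are assumptions of this sketch, namely that a missing or null setting means "auto-detect" and that machines with fewer than four cores still get a cap of 1. Only the key names performance_settings and max_concurrent_decodes come from this page.

```python
import json

CORES_PER_DECODE = 4  # main process + three workers (-t 3)


def max_concurrent_decodes(physical_cores, config):
    """Return the decode cap, honouring an explicit override if one is set."""
    override = config.get("performance_settings", {}).get("max_concurrent_decodes")
    if override is not None:
        return int(override)
    # Assumption of this sketch: never report a cap below 1.
    return max(1, physical_cores // CORES_PER_DECODE)


# Auto-detected caps for the core counts in the table above.
for cores in (4, 8, 12, 16, 32):
    print(cores, "->", max_concurrent_decodes(cores, {}))
# prints: 4 -> 1, 8 -> 2, 12 -> 3, 16 -> 4, 32 -> 8 (one pair per line)

# A config fragment that forces the cap to 2 regardless of core count.
config = json.loads('{"performance_settings": {"max_concurrent_decodes": 2}}')
print(max_concurrent_decodes(16, config))  # prints: 2
```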

What the benchmarks did not capture

Worth being explicit about gaps so the rationale is honest:

  • No SSD parallel-decode benchmark was run with the original suite — the HDD "4 @ 4.0 FPS" was the only parallel measurement. The SSD rule of "8 light jobs" was extrapolated, not measured. The CPU work has since shown that even on SSD the cap-by-CPU-cores constraint binds before the disk one does, so the 8-light rule is effectively unreachable on a 16-core CPU.
  • No cross-CCD vs single-CCD comparison was run as a controlled experiment. The conclusion that pinning helps is inferred from kernel migration behaviour and reasoned-about hardware properties; the magnitude of the win is currently anecdotal ("per-decode fps drops noticeably with multiple concurrent decodes when unpinned").
  • No before/after pinning benchmark has been published.

Planned: before/after pinning benchmark

A controlled measurement of the CPU-affinity work is the next benchmark to run. The shape of the test:

  1. Disable pinning temporarily (e.g. force affinity_allocator = None in JobQueueManager.__init__, or comment out the preexec_fn argument in _execute_vhs_decode_job).
  2. Queue 2 concurrent decodes of a known capture. Record per-decode FPS over the steady-state portion of the run.
  3. Repeat with 3 concurrent decodes, then 4.
  4. Re-enable pinning. Repeat each of the 2 / 3 / 4 concurrent runs.
  5. Compare per-decode FPS and total throughput, unpinned vs pinned.
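
The comparison in the last step could be tabulated with something like the sketch below. The function names and the input shape (a mapping from concurrency level to the list of steady-state per-decode FPS readings) are assumptions made for illustration; no results are included because none have been measured yet.

```python
from statistics import mean


def summarise(per_decode_fps):
    """per_decode_fps: steady-state FPS of each concurrent decode in one run."""
    return {"per_decode": mean(per_decode_fps), "total": sum(per_decode_fps)}


def compare(unpinned, pinned):
    """Both arguments map concurrency (2, 3, 4) -> list of per-decode FPS.

    Returns one row per concurrency level:
    (concurrency, unpinned total FPS, pinned total FPS, gain in %).
    """
    rows = []
    for n in sorted(unpinned):
        u = summarise(unpinned[n])
        p = summarise(pinned[n])
        gain = 100.0 * (p["total"] - u["total"]) / u["total"]
        rows.append((n, u["total"], p["total"], gain))
    return rows
```

Feeding it the 2, 3 and 4 concurrent runs from both configurations gives one row per concurrency level, which is the shape needed to check the expectations listed next.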

What we expect to see, based on the cross-CCD reasoning:

  • 2 concurrent: clear win for pinning (each decode gets a dedicated L3 cache; unpinned, both decodes can have threads on both CCDs).
  • 3 concurrent: smaller but still measurable win (one decode still gets a dedicated L3; the other two share the second).
  • 4 concurrent: smallest win or no win (cores are exhausted either way; pinning prevents cross-CCD migration but L3 is shared 2-and-2 between the four decodes regardless).

If anyone with comparable hardware (Zen 4 / Zen 5 with two CCDs, or Intel Hybrid with P+E clusters) wants to run this and post results, it would close the loop on the rationale above. The results would go in this section.

Storage rules now in code

For reference, the rules that the benchmarks led to, currently encoded in job_queue_manager.py::SCHEDULING_RULES:

| Storage | Normal (no heavy job) | Heavy job running |
|---|---|---|
| HDD | 4 light, 1 heavy | 2 light, 0 heavy |
| SSD / NVMe | 8 light, 1 heavy | 4 light, 0 heavy |
| default | 4 light, 1 heavy | 2 light, 0 heavy |

Light = vhs-decode, lds-compress, audio-align, checksum. Heavy = tbc-export, final-mux.

The decode cap from CPU topology applies on top of these.
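
How the storage rules and the CPU-derived decode cap combine can be sketched as below. This is a hedged illustration, not the contents of job_queue_manager.py. The limits and job names are taken from the table and the Light/Heavy list above, and the "hdd" and "normal" keys appear earlier on this page; the "ssd" and "heavy_running" keys, the can_start function and the list-of-running-jobs argument are hypothetical.

```python
SCHEDULING_RULES = {
    "hdd":     {"normal": {"light": 4, "heavy": 1}, "heavy_running": {"light": 2, "heavy": 0}},
    "ssd":     {"normal": {"light": 8, "heavy": 1}, "heavy_running": {"light": 4, "heavy": 0}},
    "default": {"normal": {"light": 4, "heavy": 1}, "heavy_running": {"light": 2, "heavy": 0}},
}

LIGHT = {"vhs-decode", "lds-compress", "audio-align", "checksum"}
HEAVY = {"tbc-export", "final-mux"}


def can_start(job, storage, running, decode_cap):
    """Decide whether `job` may start given the jobs already running."""
    rules = SCHEDULING_RULES.get(storage, SCHEDULING_RULES["default"])
    heavy_running = sum(1 for j in running if j in HEAVY)
    light_running = sum(1 for j in running if j in LIGHT)
    limits = rules["heavy_running" if heavy_running else "normal"]

    if job in HEAVY:
        return heavy_running < limits["heavy"]
    # The CPU-topology cap applies on top of the storage rule for decodes.
    if job == "vhs-decode" and running.count("vhs-decode") >= decode_cap:
        return False
    return light_running < limits["light"]


# On SSD with four decodes running and a decode cap of 4:
print(can_start("vhs-decode", "ssd", ["vhs-decode"] * 4, 4))    # prints: False
print(can_start("lds-compress", "ssd", ["vhs-decode"] * 4, 4))  # prints: True
```

The two calls show the layering: the SSD rule would still admit a fifth light job, but a fifth decode is refused because the CPU cap is reached first.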
