Performance Benchmarks
The basis for the scheduling rules and the concurrent-decode cap in Prioritisation and Queuing. Numbers come from a methodical pass through every workflow step against three storage types, plus the live finding that informed the recent CPU-affinity work.
The raw CSVs and the original benchmark plan are not tracked in the repo (they were removed in e986c01 "Cleanup benchmarks" after the analysis was extracted), but everything below is reconstructed from that work and confirmed by ongoing observation on the reference machine.
- CPU: AMD Ryzen 9 9950X3D, 16 physical cores / 32 SMT threads, two L3 cache instances (96 MiB V-Cache CCD + 32 MiB standard CCD)
- RAM: ~126 GB
- Storage:

  | Pool | Type | Filesystem |
  |---|---|---|
  | `hdd1bpool` | 2× 16 TB HDD mirror | ZFS, 128K recordsize, zstd-3 |
  | `intel1tb` | Intel 960 GB SATA SSD | ZFS, 128K recordsize, zstd-3 |
  | `nvme2tb` | Kingston 2 TB NVMe | ZFS, 1M recordsize, zstd-3 |

- Test file: `Mercer_Tearooms_New2`, 16 GB `.lds`, PAL.
- Method: monitor CPU%, RAM, per-device read/write MB/s and disk utilisation at 2 s intervals during each step. Repeated with caches dropped (`echo 3 > /proc/sys/vm/drop_caches`, ZFS ARC limited to 8 GB) to separate I/O-bound from algorithm-bound steps.
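Per-device throughput and utilisation figures of this kind can be derived from two snapshots of `/proc/diskstats` taken one sampling interval apart. The sketch below shows one way to do that; it is not necessarily how the original suite collected its numbers. The function names are illustrative, and MB is assumed to mean 10^6 bytes.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats_line(line):
    """Pull the counters needed for throughput and utilisation from one line."""
    fields = line.split()
    return {
        "name": fields[2],
        "sectors_read": int(fields[5]),
        "sectors_written": int(fields[9]),
        "io_ms": int(fields[12]),  # milliseconds the device spent doing I/O
    }


def disk_rates(before, after, interval_s=2.0):
    """Return (read MB/s, write MB/s, utilisation %) between two snapshots."""
    a, b = parse_diskstats_line(before), parse_diskstats_line(after)
    to_mb_s = SECTOR_BYTES / 1e6 / interval_s
    read_mb_s = (b["sectors_read"] - a["sectors_read"]) * to_mb_s
    write_mb_s = (b["sectors_written"] - a["sectors_written"]) * to_mb_s
    util_pct = (b["io_ms"] - a["io_ms"]) / (interval_s * 1000) * 100
    return read_mb_s, write_mb_s, util_pct
```

Feeding it the same device's line from two reads of `/proc/diskstats` taken 2 s apart gives one row of samples; averaging those over a step gives figures comparable to the tables.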
| Step | FPS | Duration | Avg CPU | Peak CPU | RAM growth | Notes |
|---|---|---|---|---|---|---|
| Decode | ~9.5 | ~14 min | 7–8% | ~22% | +13 GB | Algorithm-bound; CPU% is misleadingly low because vhs-decode is multi-process, not multi-threaded |
| Export | ~170 | ~47 s | 68% | 100% | +20 GB | CPU-bound once I/O is out of the way |
| Compress | — | ~6.5 min | 10% | 33% | +12.7 GB | CPU-only path (no GPU); low CPU% |
| Align | — | ~15 s | 7% | 14% | ~0 | Trivial |
| Final mux | — | ~14 s | 13% | 54% | +7.5 GB | Brief CPU spike |
SATA SSD:
| Step | Duration | Avg CPU | Read MB/s | Write MB/s | Disk util | Bottleneck |
|---|---|---|---|---|---|---|
| Decode | ~12 min | 7.5% | 15 | 25 | 9% | Algorithm |
| Export | ~1.8 min | 28% | 117 | 30 | 56% | I/O |
| Compress | ~7 min | 4% | 27 | 22 | 13% | Algorithm |
| Align | ~8 s | 2.5% | 4 | 12 | 4% | Trivial |
| Final mux | ~38 s | 5% | 110 | 103 | 89% | I/O |
HDD:
| Step | Duration | Avg CPU | Read MB/s | Write MB/s | Disk util | Bottleneck |
|---|---|---|---|---|---|---|
| Decode | ~16 min | 6.8% | 12 | 19 | 39% | Algorithm (HDD slower by ~33%) |
| Export | ~1.9 min | 38% | 94 | 39 | 83% | I/O |
| Compress | ~15 min | 4% | 13 | 11 | 61% | I/O (~2× slower than SSD) |
| Align | ~6 s | 5% | — | — | 13% | Trivial |
| Final mux | ~78 s | 5% | 48 | 64 | 94% | I/O |
NVMe:
| Step | Duration | Avg CPU | Read MB/s | Write MB/s | Disk util | Bottleneck |
|---|---|---|---|---|---|---|
| Decode | ~12 min | 7.3% | 18 | 25 | 1% | Algorithm |
| Export | ~58 s | 77% | 292 | 36 | 9% | CPU (storage no longer limits) |
| Compress | ~6.5 min | 4% | 30 | 23 | 2% | Algorithm |
| Align | ~8 s | 5% | — | — | 1% | Trivial |
| Final mux | ~8 s | 19% | 388 | 379 | 16% | Storage-fast; ~5× faster than SATA |
- Decode is algorithm-bound (consistent ~9.5 FPS, ~7% CPU) on every storage type. Storage doesn't help; what matters is the per-decode CPU + cache resources.
- Export's bottleneck moves from I/O on HDD/SATA to CPU on NVMe — at NVMe speeds the encoder becomes the limit.
- Compress is algorithm-bound on SSD/NVMe, becomes I/O-bound on HDD only.
- Final mux is heavily I/O-bound on everything except NVMe.
The single number that drove the original HDD scheduling rule:
HDD: 4 parallel decodes @ ~4.0 FPS each = 16 FPS total (vs 9.5 FPS for one).
So even though each decode slows down (9.5 → 4.0 FPS), the aggregate throughput goes up. Beyond 4, head-seek contention dominates and the total stops rising. This is why `SCHEDULING_RULES["hdd"]["normal"]` allows 4 light jobs concurrently.
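The arithmetic behind that rule, as a short worked example. It uses only the two measured figures quoted above; the function name is illustrative.

```python
def aggregate_fps(per_decode_fps, concurrent):
    """Total frames per second across identical concurrent decodes."""
    return per_decode_fps * concurrent


single = aggregate_fps(9.5, 1)    # one decode on its own
parallel = aggregate_fps(4.0, 4)  # four decodes sharing the HDD mirror
gain = parallel / single

print(f"{parallel:.1f} FPS total, {gain:.2f}x a single decode")
# prints: 16.0 FPS total, 1.68x a single decode
```

Each decode runs at well under half speed, yet the queue as a whole drains roughly 1.7 times faster.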
The SSD/NVMe rules (8 light jobs) were extrapolated from the much lower per-decode disk utilisation (~9% on SATA SSD) — disk isn't going to be the limit, so the limit must come from somewhere else. Where it came from on SSD wasn't pinned down until the CPU work below.
The original benchmark planning estimated decode could run "10+ jobs" in parallel based purely on per-decode CPU% (~7%). That estimate didn't survive contact with reality, and the failure mode taught us something useful:
- vhs-decode is multi-process, not multi-threaded. With `-t 3` it spawns three worker processes plus a main, consuming about 3.25 cores worth (we measured 325% CPU on a running decode). Per-job CPU% as reported by `ps` understates the actual core occupancy.
- The Zen X3D is two CCDs, two L3 caches. Cores 0–7 share the 96 MiB V-Cache L3; cores 8–15 share the 32 MiB standard L3. A cross-CCD hop (a thread on CCD 0 touching memory recently in CCD 1's L3) costs ~80 ns vs ~10 ns within a CCD.
- Without pinning, the kernel scheduler bounces vhs-decode's 9 threads across all 32 logical CPUs. Two concurrent decodes routinely have threads on both CCDs, with each L3 evicting the other decode's working set. Per-decode FPS drops well beyond what the disk numbers predict.
So the actual ceiling for full-speed parallel decodes on this hardware is floor(physical_cores / 4) = 4 — one decode per physical-core quartet, ideally with each decode's quartet inside a single L3 group. That's what the toolkit now does, by pinning each decode to a disjoint physical-core set at job start (see Prioritisation and Queuing for the placement logic).
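A minimal sketch of what disjoint, L3-aware placement can look like. This is not the toolkit's actual allocator: `placement_order` and `pin_to` are hypothetical names, and the code assumes logical CPU *n* corresponds to physical core *n* (SMT siblings are left out of the mask). `os.sched_setaffinity` is Linux-only.

```python
import os

CORES_PER_DECODE = 4
# L3 groups as described above: cores 0-7 on the V-Cache CCD, 8-15 on the other.
CCD_CORES = [list(range(0, 8)), list(range(8, 16))]


def placement_order(ccd_cores, size=CORES_PER_DECODE):
    """List disjoint core sets, alternating CCDs so early decodes get their own L3."""
    per_ccd = [
        [frozenset(cores[i:i + size]) for i in range(0, len(cores) - size + 1, size)]
        for cores in ccd_cores
    ]
    order = []
    for rank in range(max((len(sets) for sets in per_ccd), default=0)):
        for sets in per_ccd:
            if rank < len(sets):
                order.append(sets[rank])
    return order


def pin_to(cores):
    """Build a preexec_fn that pins the child process before it execs."""
    def _apply():
        os.sched_setaffinity(0, cores)  # 0 = the calling (child) process
    return _apply
```

A decode would then be launched with something like `subprocess.Popen(cmd, preexec_fn=pin_to(core_set))`. The affinity mask is inherited by child processes, so the workers spawned by `-t 3` stay on the same core set as the main process.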
| Component | Cores |
|---|---|
| vhs-decode main process | 1 |
| Worker processes (`-t 3`) | 3 |
| Total per decode | 4 |
If `-t` is changed in the executor, the constant `_CORES_PER_DECODE` in `job_queue_manager.py` should change to match.
| CPU example | Physical cores | Decode cap |
|---|---|---|
| 4-core (Ryzen 5 / Core i3) | 4 | 1 |
| 8-core (Ryzen 7 / Core i7) | 8 | 2 |
| 12-core | 12 | 3 |
| 16-core (this machine) | 16 | 4 |
| 32-core (Threadripper / EPYC) | 32 | 8 |
The cap is auto-detected at WCC startup and can be overridden via `performance_settings.max_concurrent_decodes` in `config/config.json`.
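The table and the override can be expressed in a few lines. This is a sketch under assumptions, not the code in `job_queue_manager.py`: `decode_cap` is a hypothetical name, the config is assumed to be nested JSON matching the dotted key above, and clamping the result to at least 1 is an assumption made so that CPUs with fewer than 4 cores can still decode.

```python
import json

CORES_PER_DECODE = 4  # 1 main process + 3 workers (-t 3)


def decode_cap(physical_cores, config=None, cores_per_decode=CORES_PER_DECODE):
    """Concurrent-decode cap: config override if present, else cores // 4."""
    settings = (config or {}).get("performance_settings", {})
    override = settings.get("max_concurrent_decodes")
    if override is not None:
        return int(override)
    # Assumption: never return 0, even on a CPU with fewer than 4 cores.
    return max(1, physical_cores // cores_per_decode)


for cores in (4, 8, 12, 16, 32):
    print(cores, decode_cap(cores))  # 1, 2, 3, 4, 8 as in the table

config = json.loads('{"performance_settings": {"max_concurrent_decodes": 2}}')
print(decode_cap(16, config))  # override wins: 2
```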
Worth being explicit about gaps so the rationale is honest:
- No SSD parallel-decode benchmark was run with the original suite — the HDD "4 @ 4.0 FPS" was the only parallel measurement. The SSD rule of "8 light jobs" was extrapolated, not measured. The CPU work has since shown that even on SSD the cap-by-CPU-cores constraint binds before the disk one does, so the 8-light rule is effectively unreachable on a 16-core CPU.
- No cross-CCD vs single-CCD comparison was run as a controlled experiment. The conclusion that pinning helps is inferred from kernel migration behaviour and reasoned-about hardware properties; the magnitude of the win is currently anecdotal ("per-decode fps drops noticeably with multiple concurrent decodes when unpinned").
- No before/after pinning benchmark has been published.
A controlled measurement of the CPU-affinity work is the next benchmark to run. The shape of the test:
- Disable pinning temporarily (e.g. force `affinity_allocator = None` in `JobQueueManager.__init__`, or comment out the `preexec_fn` argument in `_execute_vhs_decode_job`).
- Queue 2 concurrent decodes of a known capture. Record per-decode FPS over the steady-state portion of the run.
- Repeat with 3 concurrent decodes, then 4.
- Re-enable pinning. Repeat each of the 2 / 3 / 4 concurrent runs.
- Compare per-decode FPS and total throughput, unpinned vs pinned.
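The comparison in the last step could be reduced to two small helpers. The names and the input shape (a mapping from decode label to its steady-state FPS samples) are assumptions for illustration; no measurements are implied.

```python
from statistics import mean


def summarise(fps_by_decode):
    """Per-decode mean FPS and summed throughput for one run.

    fps_by_decode maps a decode label to its steady-state FPS samples.
    """
    per_decode = {name: mean(samples) for name, samples in fps_by_decode.items()}
    return per_decode, sum(per_decode.values())


def pinning_gain(unpinned, pinned):
    """Ratio of pinned to unpinned total throughput (above 1 means pinning helped)."""
    _, total_unpinned = summarise(unpinned)
    _, total_pinned = summarise(pinned)
    return total_pinned / total_unpinned
```

Running `pinning_gain` once per concurrency level (2, 3, 4) gives three ratios that can be tabulated directly.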
What we expect to see, based on the cross-CCD reasoning:
- 2 concurrent: clear win for pinning (each decode gets a dedicated L3 cache; unpinned, both decodes can have threads on both CCDs).
- 3 concurrent: smaller but still measurable win (two decodes still get dedicated L3s, the third shares with one).
- 4 concurrent: smallest win or no win (cores are exhausted either way; pinning prevents cross-CCD migration but L3 is shared 2-and-2 between the four decodes regardless).
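The L3-sharing pattern behind those three expectations follows from placing decodes round-robin across the two CCDs. The sketch below just counts decodes per L3 group; round-robin placement is an assumption about the allocator, and the function name is illustrative.

```python
def decodes_per_ccd(concurrent, ccds=2):
    """Count decodes per L3 group when decodes are placed round-robin across CCDs."""
    counts = [0] * ccds
    for i in range(concurrent):
        counts[i % ccds] += 1
    return counts


for n in (2, 3, 4):
    print(n, decodes_per_ccd(n))
# 2 [1, 1]  -> each decode has an L3 to itself
# 3 [2, 1]  -> one L3 is shared by two decodes
# 4 [2, 2]  -> both L3s are shared
```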
If anyone with comparable hardware (Zen 4 / Zen 5 with two CCDs, or Intel Hybrid with P+E clusters) wants to run this and post results, it would close the loop on the rationale below. The results would go in this section.
For reference, the rules that the benchmarks led to, currently encoded in `job_queue_manager.py::SCHEDULING_RULES`:
| Storage | Normal (no heavy job) | Heavy job running |
|---|---|---|
| HDD | 4 light, 1 heavy | 2 light, 0 heavy |
| SSD / NVMe | 8 light, 1 heavy | 4 light, 0 heavy |
| default | 4 light, 1 heavy | 2 light, 0 heavy |
Light = vhs-decode, lds-compress, audio-align, checksum.
Heavy = tbc-export, final-mux.
The decode cap from CPU topology applies on top of these.
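How the storage rule and the decode cap might combine at admission time, as a hedged sketch. Only the `SCHEDULING_RULES["hdd"]["normal"]` lookup is attested above; the inner `light`/`heavy` keys, the `heavy_running` mode name, the `ssd`/`nvme` storage keys and the `can_start` function are assumptions. The sketch only checks the count for the job's own class.

```python
# Assumed layout: SCHEDULING_RULES[storage][mode] -> per-class limits.
_SLOW = {"normal": {"light": 4, "heavy": 1}, "heavy_running": {"light": 2, "heavy": 0}}
_FAST = {"normal": {"light": 8, "heavy": 1}, "heavy_running": {"light": 4, "heavy": 0}}
SCHEDULING_RULES = {"hdd": _SLOW, "ssd": _FAST, "nvme": _FAST, "default": _SLOW}

HEAVY_JOBS = {"tbc-export", "final-mux"}


def can_start(job, storage, running_light, running_heavy, running_decodes, decode_cap):
    """Decide whether one more job of this type may start right now."""
    mode = "heavy_running" if running_heavy > 0 else "normal"
    limits = SCHEDULING_RULES.get(storage, SCHEDULING_RULES["default"])[mode]
    if job in HEAVY_JOBS:
        return running_heavy < limits["heavy"]
    if job == "vhs-decode" and running_decodes >= decode_cap:
        return False  # the CPU-topology cap applies on top of the storage rule
    return running_light < limits["light"]


# On SSD with 4 decodes running and a cap of 4: no fifth decode,
# but another light job such as a checksum still fits under the 8-light rule.
print(can_start("vhs-decode", "ssd", 4, 0, 4, 4))  # False
print(can_start("checksum", "ssd", 4, 0, 4, 4))    # True
```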
- Prioritisation and Queuing — how the rules are applied at scheduling time
- Progress Reporting — how the per-job FPS / MB/s the benchmarks measured are surfaced live in the matrix