Releases: NeuroBrix/neurobrix
v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (14/16 cells)
NeuroBrix v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (2026-05-15)
162 commits since v0.1.6 (2026-04-20). Minor version bump per SemVer:
features added are rétrocompatibles, no breaking API changes. Major
acquis closes the P-PRISM-NEVER-REFUSE v2 mandate with 14/16 hardware ×
mode cells validated end-to-end on Sana 1.6B 4Kpx diffusion (4096×4096
output).
Highlights
- Doctrine R34 model-agnostic ratified and audited clean — zero
active hardcode-by-model-name violations across runtime, kernels,
strategies, dispatchers. - Doctrine R35 "Prism never refuses" implemented and empirically
validated — cascade strategy covers every legitimate hardware ×
model combination down tocpu_executionfallback. - 14/16 cells matrice hardware × mode validated end-to-end on
Sana 4Kpx (single 32 GiB, single 16 GiB, 2× 16 GiB; compiled,
sequential, triton, triton_sequential). The 2 ⏸ cells remain on
upstream-blocked CPU triton modes (triton-cpu PyPI wheel pending). - 18× speedup on 32g triton mode: 1443 s → 79.81 s wall on
Sana 4Kpx via DC-AE residual chain band-streaming (each of 9 VAE
chains streams band-by-band, peak transient drops from 6.7 GiB to
1.6 GiB per chain).
Hardware coverage
| Config × Mode | compiled | sequential | triton | triton-seq |
|---|---|---|---|---|
| Single GPU 32 GiB | ✓ 23.9 s | ✓ | ✓ 79.8 s | ✓ |
| Single GPU 16 GiB | ✓ | ✓ | ✓ 79.4 s | ✓ 84.5 s |
| Multi-GPU 2× 16 GiB | ✓ | ✓ | ✓ 81.8 s | ✓ 84.1 s |
| CPU pure | ✓ | ✓ | ⏸ S3 | ⏸ S3 |
Wall times measured on Sana 1.6B 4Kpx single-step at 4096×4096 spatial
output. CPU triton cells wait on upstream triton-cpu (meta-pytorch
project) shipping PyPI wheels.
Architecture
Hybrid CPU+GPU per-component placement
core/strategies/lazy_sequential.py + core/runtime/executor.py now
correctly route mixed-device plans through strategy.execute_component
so a CPU component consuming a GPU producer no longer stalls in
implicit transfer or raises device-mismatch. Unlocked 16g compiled +
sequential on models whose VAE alone exceeds the GPU budget.
Intra-component multi-device cascade
The single_gpu strategy correctly picks cuda:0 on 2× 16 GiB
hardware when a component fits there post-S5 chain tiling. No solver
extension required — the cascade naturally handles the case.
DC-AE residual chain tiling — band-streamed (R30 dual-branch)
- Compiled mode:
band_streamed_chain_torchmaterialises the residual
chain band-by-band in PyTorch ATen, writes back to T_base in place. - Triton mode:
band_streamed_chain_nbx(143 lines, R33-pure) mirrors
the algorithm using NBX wrappers + NBXTensor methods only. - Mode-aware dispatcher in
core/module/tiling_engine.pyselects the
variant pergraph_executor.mode. - Chain wrapper is default-ON on triton modes;
NBX_TRITON_CHAIN_WRAPPER=0
reverts to the c9d2581 skip path as emergency rollback.
Edge-padding + halo-offset wrapper math fix (_tiled_conv2d_spatial_*)
Two coupled bugs in the 4 tiled-conv wrappers (_tiled_conv2d_spatial_torch,
_tiled_conv2d_spatial_nbx, _fused_upsample_conv2d_torch,
_fused_upsample_conv2d_nbx):
- Edge-band double-padding (max(0, -in_read_start) already provides
pad_h; the extra+ pad_h if is_top_bandterm over-padded by
pad_h rows). - Internal-frontier halo offset on the conv_band read side
(output slice was offset by halo_top rows on every internal band).
Validated bit-exact viascripts/microtest_tiled_conv2d_small_scale.py
sweeping (kh ∈ {1,3,5}, pad_h ∈ {0,1,2}, tile_factor ∈ {1,2,4,8}) at
1024² and 2048² — cos=1.0000 max_abs=0.0000 universally.
NBXTensor.getitem negative-slice fix — ROOT FIX for chain wrapper
NBXTensor.__getitem__ slice path used k.start or 0 which silently
left negative starts untouched (-2 or 0 evaluates to -2, not 0).
For t[-N:], the resulting narrow(dim, -N, length) produced an
_offset = parent_offset - N*stride → data_ptr() pointed BEFORE the
cudaMalloc'd block → Triton's cuPointerGetAttribute rejected the
pointer with "Pointer argument (at 0) cannot be accessed from Triton
(cpu tensor?)".
The Sana 4Kpx VAE chain wrapper's halo_carry[:, :, -top_size:, :]
triggered this at the 3rd iteration of every 4096² chain. Fix restores
the universal Python/torch slice contract:
start = k.start if k.start is not None else 0
stop = k.stop if k.stop is not None else shape[dim]
if start < 0: start = max(0, shape[dim] + start)
if stop < 0: stop = shape[dim] + stop8-line fix, 4 cells unlocked. Latent bug across the codebase, now closed.
Triton-CPU integration stage 1
triton-cpu (meta-pytorch project) is not on PyPI today and requires
build-from-source. Stage 1 ships the install gate (raises clean
ImportError with actionable message when missing), coverage docs,
and the activation flow. CPU triton + CPU triton-sequential cells
remain ⏸ until upstream ships PyPI wheels — automatic unblock then.
Doctrine ratified
- R34 model-agnostic strict: structural pattern detection, no
hardcode-by-model-name in runtime / kernels / strategies / dispatchers. - R35 Prism never refuses: cascade fully implemented to
cpu_executionfallback; every legitimate plan finds a strategy. - R30 dualité runtime: the 4 execution modes (compiled, sequential,
triton, triton_sequential) cover the 14 cells symmetrically;
band_streamed_chain_*mirror restored chain coverage on triton path. - R33 zero torch in
triton/: preserved; chain wrapper NBX
variant uses only NBX wrappers + NBXTensor methods, no torch
import on the execute path. - R29 inspectable artefacts: every cell ✓ has a coherent red
apple PNG indocs/verdicts/p_triton_chain_cpu_pointer/and
docs/verdicts/p_nbx_tiled_conv2d_small_scale/.
Known limitations (transparency)
- 2/16 cells ⏸ : CPU pure × triton and CPU pure × triton_sequential
blocked on absence oftriton-cpuPyPI wheel
(meta-pytorch/triton-cpu, build-from-source only at v0.2.0 release
time). Backlog itemP-TRITON-CPU-UPSTREAM-WHEEL-FOLLOWUP
unblocks automatically when upstream ships wheels. - Op-level cross-device split (Gap B) out of scope of v2 mandate:
for models where a single op exceeds per-device VRAM. Backlog item
P-OP-LEVEL-CROSS-DEVICE-SPLITopens when a concrete model demands
it (no Sana 4Kpx need it now that VAE fits 16 GiB post-S5 tiling).
Validation artefacts
The mandate v2 verdict and per-cell R29 artefacts are committed:
docs/verdicts/p_prism_never_refuse_v2_closed.md— full
verdict, sub-chantier history, commit hashes, anti-régression table.docs/verdicts/p_triton_chain_cpu_pointer/— 6 coherent red
apple PNGs (32g compiled / 32g triton / 16g compiled / 16g triton /
16g triton-seq / 2× 16g triton / 2× 16g triton-seq).docs/verdicts/p_nbx_tiled_conv2d_small_scale/— microtest
logs + 16g compiled anti-regression PNG + 32g triton baseline PNG.
Key commits
Sub-chantiers and bugfixes that shaped this release:
| Sub-chantier / fix | Commits |
|---|---|
| S1 hybrid CPU+GPU dispatch | de5fb9e |
| S2 native sequential CPU (RoPE fix) | 8b4d020 |
| S3 triton-cpu install gate | b3e479f, d0974e6 |
| S5 residual chain detection + wrapper | 198ab1b → 33c6b21 |
| P-S5 depthwise tile-skip | 8af7848 |
| R30 chain skip on triton (interim) | c9d2581 |
| S4 closure (cascade docs) | f58f6cc |
| JSONL dump O(N) | da484ae |
| P-NBX-TILED-CONV2D-SMALL-SCALE | 176bc7e, 63edb03 |
| L1 NBX_LIVE_WATERMARK_TRACE | 993181f |
| L2 per-slot LIVE_DUMP_SLOT | 5a714a5 |
| L4 band_streamed_chain_nbx + dispatcher | f8a8ad8 |
| L4b NBXTensor isinstance + _set_device | f997479 |
| C1 NBX_CHAIN_DIAG per-chain device-state | 9ddcc7b |
| C3d ROOT FIX — NBXTensor.getitem | 8a6daf2 |
| C4 chain wrapper default-ON triton | 23de696 |
| MANDATE CLOSED verdict + R29 | 77571b7 |
Tag
p-prism-never-refuse-v2-closedposted on commit77571b7(mandate
closure), pushed tooriginandgitlab.v0.2.0semver release tag posted on this commit, pushed to both
remotes.
Upgrading from v0.1.x
No breaking API changes. The chain wrapper is default-ON on triton
modes (set NBX_TRITON_CHAIN_WRAPPER=0 to fall back to the v0.1.x
behaviour). Other internal changes are transparent to call sites.
NeuroBrix 0.1.6 — Multi-GPU triton hardened + Item 3 closed
Fifth release since 0.1.5 on April 15. Four substantive commits hardening the multi-GPU Triton path and closing the long-standing Item 3 perf/VRAM trade-off, plus a regression-harness overhaul.
Highlights
-
Multi-GPU Triton hardened — 163 of 211 kernel-launch sites were latently broken.
kernels/wrappers.py's documented_set_devicecontract was honoured at only 48 of 211 production launch sites. On single-GPU this was invisible (active device always matched the tensor), but any multi-GPU Prism plan with a component on a non-active device crashedargmax,bmm,addmm,softmax,rms_norm, and most elementwise wrappers. Fix promotes_set_deviceto a single source of truth innbx_tensor.pyand inserts the guard at every red site. 211 / 211 green. Phase 2 multi-GPU strategy matrix now 8 / 8 cells green. -
Zero3 block-wise ratchet pipelining (native + triton). Qwen3-30B-A3B-Thinking-2507 on a single 16 GB V100 now runs under bounded VRAM (two blocks of weights + KV + activations). Overlaps H2D(N+1) with compute(N) on a dedicated transfer stream. Prefill dramatically faster than the per-op slow path. Polymorphic over
torch.Tensor+CompiledSequenceandNBXTensor+TritonSequence. -
Item 3 closed — in-kernel
fp16 → fp32tile promotion formm/bmm/addmmon pre-Ampere (Path A'). NewPROMOTE_B: tl.constexprflag onmatmul_kernel/addmm_kernel: thebtile is cast toa.dtypeaftertl.load, beforetl.dot. Bit-exact. On TinyLlama-1.1Bv100-16g --triton: peak VRAM 4.74 → 2.54 GB (-46 %), decode 9.94 → 5.04 s on 8 tokens (-49 %), character-identical output across 30-token greedy decode. On Qwen3-30B zero3: -64 MB, -5 %. -
NBX_FORCE_STRATEGYenv var for deterministic Prism strategy selection. Short-circuits the score cascade. Unknown strategy →RuntimeErrorlisting the 9 valid values. Valid but unavailable for the device count →RuntimeError. Fits but under-performs the auto-selected one → still used (zero silent fallback). Enables per-strategy matrix regression testing. -
Regression harness tracked in git + 4
::nativeaudio fails resolved. Prior diagnoses had attributed the fails to "tokenizers env-drift" without sourced stderr; this release captured actual stderr and rooted 3 of them in interpreter selection (venv vs system Python user-site) plus one genuineaten::cudnn_batch_normregression on Kokoro-82M (markedxfailwith follow-up scoped indocs/follow-ups/). Newpytest_configurehook auto-detects$VIRTUAL_ENV. Harness goes from4 failed / 11 passed / 11 xfailed / 14 skippedto0 failed / 14 passed / 12 xfailed / 14 skipped.
Full changelog
See CHANGELOG.md §0.1.6.
Install
pip install neurobrix==0.1.6
Wheel sha256 7fca8534dd5c87d6dc455ad206618cda95f165460e74e8b6a944e8745a97e410, sdist sha256 be3258ebd8eae326d24059054ad59de2fb352df5e143bbf9f7bff558224fc673.
v0.1.0a12 — Complete Windows Compatibility
Complete Windows Compatibility
Full end-to-end audit and fix of all Windows-breaking code paths.
Fixed
os.kill(pid, 0)crash — replaced withctypes.windll.kernel32.OpenProcess()for process existence checksignal.SIGTERMhandler crash — guarded registration withhasattr()- ffmpeg video pipe deadlock — replaced manual pipe write with
communicate() - All daemon lifecycle operations (start, check, stop) now Windows-safe
Windows users can now run the full cold-run flow:
neurobrix import Sana/Sana_1600M_1024px_MultiLing --no-keep
neurobrix run --model Sana_1600M_1024px_MultiLing --prompt "a cat in a hat"pip install neurobrix==0.1.0a12v0.1.0a9 — Cross-Platform: Windows + macOS + Lazy Triton
Full Changelog: v0.1.0a7...v0.1.0a9
v0.1.0a8 — Cross-Platform + Lazy Triton
Cross-Platform Support + Lazy Triton
Consolidated release with cross-platform fixes and lazy Triton loading.
(Superseded by v0.1.0a9 which includes additional Windows fixes)
pip install neurobrix==0.1.0a8v0.1.0a7 — SANA-Video 720p
Video Family — SANA-Video 2B 720p
First video model support: SANA-Video 2B at 720p resolution.
Highlights
- 1280×704 resolution, 81 frames, 16fps video generation
- H.264 video codec (QuickTime/browser compatible)
- Per-channel latent denormalization in VAE handler (LTX2Video)
- DPM++ scheduler with flow-matching sigmas
pip install neurobrix==0.1.0a7
neurobrix run --model SANA-Video_2B_720p_diffusers --prompt "A golden retriever running through autumn leaves"v0.1.0a6 — Audio Family (11/11 models)
Audio Family — All 11 Models Working
Complete audio family support with 5 new flow handlers and universal AudioEngine.
Models
Whisper, Whisper V3 Turbo, Parakeet, Orpheus, Canary-Qwen, Kokoro-82M, VibeVoice-1.5B, Voxtral, OpenAudio-S1, Granite Speech, Chatterbox
Highlights
encoder_decoderflow: Whisper-style encoder → cross-attention decoderaudio_llmflow: audio-conditioned LLM (Voxtral, Granite Speech, Canary-Qwen)dual_arflow: DualAR semantic token generation (OpenAudio-S1)rnntflow: frame-by-frame greedy TDT decode (Parakeet)tts_llmflow: text → LM → DDPM diffusion → acoustic decoder (VibeVoice)- Universal hardware auto-detection —
--hardwareflag now optional - GPU detection for 10 vendors (NVIDIA, AMD, Intel, Apple, and more)
- Universal TilingEngine with accumulate-and-divide blending
- CLI
--audioargument for speech-to-text
pip install neurobrix==0.1.0a6v0.1.0a11 — Windows Path Fix + Self-Contained Containers
Windows Path Fix + Self-Contained Containers
Fixed
- Windows path separator bug in weight loader and NBX container —
str(path)gives backslashes on Windows, breaking weight file matching. Replaced with.as_posix()for cross-platform compatibility. - Runtime no longer reaches outside
~/.neurobrix/cache/for tokenizer files — all resources read from the NBX container - TTS output path now uses current working directory instead of hardcoded Linux path
pip install neurobrix==0.1.0a11v0.1.0a10 — Complete dependency audit + lazy imports
Full Changelog: v0.1.0a9...v0.1.0a10
v0.1.0a5 — Universal Audio Engine + ZERO Enforcement
Full Changelog: v0.1.0a4...v0.1.0a5