Releases: NeuroBrix/neurobrix

v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (14/16 cells)

15 May 03:38

NeuroBrix v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (2026-05-15)

162 commits since v0.1.6 (2026-04-20). Minor version bump per SemVer:
the added features are backward compatible, with no breaking API
changes. The main achievement is closing the P-PRISM-NEVER-REFUSE v2
mandate, with 14/16 hardware × mode cells validated end-to-end on
Sana 1.6B 4Kpx diffusion (4096×4096 output).

Highlights

  • Doctrine R34 model-agnostic ratified and audited clean — zero
    active hardcode-by-model-name violations across runtime, kernels,
    strategies, dispatchers.
  • Doctrine R35 "Prism never refuses" implemented and empirically
    validated — cascade strategy covers every legitimate hardware ×
    model combination down to cpu_execution fallback.
  • 14/16 cells of the hardware × mode matrix validated end-to-end on
    Sana 4Kpx (single 32 GiB, single 16 GiB, 2× 16 GiB; compiled,
    sequential, triton, triton_sequential). The 2 ⏸ cells are the CPU
    triton modes, which remain blocked upstream (triton-cpu PyPI wheel
    pending).
  • 18× speedup on 32g triton mode: 1443 s → 79.81 s wall on
    Sana 4Kpx via DC-AE residual chain band-streaming (each of 9 VAE
    chains streams band-by-band, peak transient drops from 6.7 GiB to
    1.6 GiB per chain).

Hardware coverage

Config × Mode compiled sequential triton triton-seq
Single GPU 32 GiB ✓ 23.9 s ✓ 79.8 s
Single GPU 16 GiB ✓ 79.4 s ✓ 84.5 s
Multi-GPU 2× 16 GiB ✓ 81.8 s ✓ 84.1 s
CPU pure ⏸ S3 ⏸ S3

Wall times measured on Sana 1.6B 4Kpx single-step at 4096×4096 spatial
output. CPU triton cells wait on upstream triton-cpu (meta-pytorch
project) shipping PyPI wheels.

Architecture

Hybrid CPU+GPU per-component placement

core/strategies/lazy_sequential.py + core/runtime/executor.py now
correctly route mixed-device plans through strategy.execute_component
so a CPU component consuming a GPU producer no longer stalls in
implicit transfer or raises device-mismatch. Unlocked 16g compiled +
sequential on models whose VAE alone exceeds the GPU budget.

Intra-component multi-device cascade

The single_gpu strategy correctly picks cuda:0 on 2× 16 GiB
hardware when a component fits there post-S5 chain tiling. No solver
extension required — the cascade naturally handles the case.
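
How a cascade with a terminal fallback guarantees that selection never fails can be sketched as follows. Only single_gpu and cpu_execution are names from this release; the Strategy class, the multi_gpu_split entry, and the fit predicates are hypothetical and far simpler than a real solver.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Strategy:
    name: str
    # (required_gib, per-GPU budgets in GiB) -> True if the plan fits
    fits: Callable[[float, List[float]], bool]


# Ordered from most to least preferred. "multi_gpu_split" is a hypothetical
# placeholder; "single_gpu" and "cpu_execution" are named in the release notes.
CASCADE = [
    Strategy("single_gpu", lambda need, gpus: any(need <= g for g in gpus)),
    Strategy("multi_gpu_split", lambda need, gpus: len(gpus) > 1 and need <= sum(gpus)),
    Strategy("cpu_execution", lambda need, gpus: True),  # terminal fallback
]


def select_strategy(required_gib, gpu_budgets_gib):
    """Return the first strategy in the cascade that accepts the plan."""
    for strategy in CASCADE:
        if strategy.fits(required_gib, gpu_budgets_gib):
            return strategy.name
    raise AssertionError("unreachable: cpu_execution accepts every plan")


print(select_strategy(12, [16, 16]))   # fits on one device
print(select_strategy(64, [16, 16]))   # falls through to the CPU fallback
```

Because the last entry accepts everything, the loop always returns; the final raise only documents that invariant.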

DC-AE residual chain tiling — band-streamed (R30 dual-branch)

  • Compiled mode: band_streamed_chain_torch materialises the residual
    chain band-by-band in PyTorch ATen, writes back to T_base in place.
  • Triton mode: band_streamed_chain_nbx (143 lines, R33-pure) mirrors
    the algorithm using NBX wrappers + NBXTensor methods only.
  • Mode-aware dispatcher in core/module/tiling_engine.py selects the
    variant per graph_executor.mode.
  • Chain wrapper is default-ON on triton modes; NBX_TRITON_CHAIN_WRAPPER=0
    reverts to the c9d2581 skip path as emergency rollback.
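
The band-streaming idea can be illustrated with a minimal pure-Python sketch. It assumes pointwise residual blocks, so no halo rows are needed, and it uses nested lists in place of tensors; the function names are hypothetical and are not the band_streamed_chain_* implementations.

```python
def residual_chain(rows, blocks):
    """Apply x <- x + f(x) for each block, building a full-size result per block."""
    for f in blocks:
        rows = [[v + f(v) for v in row] for row in rows]
    return rows


def band_streamed_chain(base, blocks, band_rows):
    """Same result, but only `band_rows` rows are in flight at a time.

    The whole chain runs on one band, then the band is written back into
    `base` in place before the next band is read.
    """
    for top in range(0, len(base), band_rows):
        band = base[top:top + band_rows]      # transient copy of one band
        band = residual_chain(band, blocks)   # run every block on that band
        base[top:top + band_rows] = band      # write back in place
    return base


blocks = [lambda v: 0.5 * v, lambda v: v * v, lambda v: -0.25 * v]
image = [[float(r * 8 + c) for c in range(8)] for r in range(10)]

expected = residual_chain(image, blocks)
streamed = band_streamed_chain([row[:] for row in image], blocks, band_rows=3)
assert streamed == expected
```

The transient working set scales with band_rows rather than with the full height, which is the effect the peak-memory figures above describe. Blocks with spatial kernels would additionally need halo rows at band boundaries.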

Edge-padding + halo-offset wrapper math fix (_tiled_conv2d_spatial_*)

Two coupled bugs in the 4 tiled-conv wrappers (_tiled_conv2d_spatial_torch,
_tiled_conv2d_spatial_nbx, _fused_upsample_conv2d_torch,
_fused_upsample_conv2d_nbx):

  • Edge-band double-padding (max(0, -in_read_start) already provides
    pad_h; the extra + pad_h if is_top_band term over-padded by
    pad_h rows).
  • Internal-frontier halo offset on the conv_band read side
    (output slice was offset by halo_top rows on every internal band).

Validated bit-exact via scripts/microtest_tiled_conv2d_small_scale.py
sweeping (kh ∈ {1,3,5}, pad_h ∈ {0,1,2}, tile_factor ∈ {1,2,4,8}) at
1024² and 2048²: cos=1.0000 max_abs=0.0000 universally.
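
The band arithmetic can be checked on a one-dimensional analogue. The sketch below is an assumption-laden simplification (stride 1, zero padding, a single column, integer data); the function names are hypothetical and it is not the wrapper code. Each band reads exactly the input rows its outputs need, and out-of-range rows count as padding exactly once.

```python
def conv1d_valid(x, k):
    """Valid (unpadded) correlation of list x with kernel k."""
    return [sum(x[i + j] * k[j] for j in range(len(k)))
            for i in range(len(x) - len(k) + 1)]


def conv1d_padded(x, k, pad):
    """Reference: zero-pad both ends, then convolve once."""
    return conv1d_valid([0] * pad + x + [0] * pad, k)


def conv1d_tiled(x, k, pad, tiles):
    """Compute the same output band by band."""
    h, kh = len(x), len(k)
    out_len = h + 2 * pad - kh + 1
    band = -(-out_len // tiles)  # ceiling division
    out = []
    for o0 in range(0, out_len, band):
        o1 = min(o0 + band, out_len)
        in_read_start = o0 - pad          # negative on the top edge band
        in_read_end = o1 - 1 - pad + kh   # exclusive; may exceed h at the bottom
        # Rows outside [0, h) are zero padding; reading by absolute index
        # means no row is padded twice and no band is shifted by its halo.
        window = [x[i] if 0 <= i < h else 0
                  for i in range(in_read_start, in_read_end)]
        out.extend(conv1d_valid(window, k))
    return out


# Small-scale sweep in the spirit of the microtest parameters above.
for h in (16, 32):
    x = [(7 * i) % 11 - 5 for i in range(h)]
    for kh in (1, 3, 5):
        k = list(range(1, kh + 1))
        for pad in (0, 1, 2):
            for tiles in (1, 2, 4, 8):
                assert conv1d_tiled(x, k, pad, tiles) == conv1d_padded(x, k, pad)
```

Integer inputs make the comparison exact, so any off-by-one in the window bounds shows up as a failed assertion rather than a small numerical drift.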

NBXTensor.__getitem__ negative-slice fix — ROOT FIX for chain wrapper

The NBXTensor.__getitem__ slice path used k.start or 0, which silently
left negative starts untouched (-2 or 0 evaluates to -2, not 0).
For t[-N:], the resulting narrow(dim, -N, length) produced an
_offset = parent_offset - N*stride, so data_ptr() pointed BEFORE the
cudaMalloc'd block → Triton's cuPointerGetAttribute rejected the
pointer with "Pointer argument (at 0) cannot be accessed from Triton
(cpu tensor?)".

The Sana 4Kpx VAE chain wrapper's halo_carry[:, :, -top_size:, :]
triggered this at the 3rd iteration of every 4096² chain. Fix restores
the universal Python/torch slice contract:

start = k.start if k.start is not None else 0
stop = k.stop if k.stop is not None else shape[dim]
if start < 0: start = max(0, shape[dim] + start)
if stop < 0:  stop  = shape[dim] + stop

8-line fix, 4 cells unlocked. Latent bug across the codebase, now closed.
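
The pitfall and the slice contract can be reproduced without any tensor library. The helper below is a hypothetical standalone sketch, not the NBXTensor code: it adds clamping beyond the four lines shown above and is cross-checked against Python's built-in slice.indices for step 1.

```python
def normalize_slice_bounds(start, stop, size):
    """Resolve slice bounds for one dimension of length `size` (step 1 only).

    Returns (offset, length), the arguments a narrow()-style call would take.
    """
    start = 0 if start is None else start
    stop = size if stop is None else stop
    if start < 0:
        start = max(0, size + start)
    if stop < 0:
        stop = max(0, size + stop)
    # Extra clamping (an addition in this sketch) keeps the window in [0, size].
    start = min(start, size)
    stop = min(max(stop, start), size)
    return start, stop - start


# The old pattern: -2 is truthy, so `or 0` leaves it negative.
assert (-2 or 0) == -2

# t[-2:] on a dimension of length 10 is the last two elements.
assert normalize_slice_bounds(-2, None, 10) == (8, 2)

# Cross-check against Python's own slice resolution.
for start, stop in [(-2, None), (None, -3), (-50, 4), (3, 100), (7, 2)]:
    lo, hi, _ = slice(start, stop).indices(10)
    assert normalize_slice_bounds(start, stop, 10) == (lo, max(0, hi - lo))
```

With a non-negative offset, the derived pointer stays inside the parent allocation, which is the property the Triton pointer check depends on.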

Triton-CPU integration stage 1

triton-cpu (meta-pytorch project) is not on PyPI today and requires
build-from-source. Stage 1 ships the install gate (raises clean
ImportError with actionable message when missing), coverage docs,
and the activation flow. CPU triton + CPU triton-sequential cells
remain ⏸ until upstream ships PyPI wheels — automatic unblock then.
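
An install gate of this kind might look like the following sketch. The function name, the placeholder module name, and the hint text are hypothetical; only the behaviour (a clean ImportError with an actionable message) comes from the notes above.

```python
import importlib
import importlib.util


def require_backend(module_name, install_hint):
    """Import `module_name`, or raise ImportError with an actionable hint."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"Optional backend '{module_name}' is not installed. {install_hint}"
        )
    return importlib.import_module(module_name)


# A module that exists is returned as usual.
json_module = require_backend("json", "It ships with the standard library.")

# A missing backend fails early with the hint, instead of deep inside a kernel launch.
try:
    require_backend("nbx_missing_backend_example", "Build it from source, then retry.")
except ImportError as exc:
    message = str(exc)
    print(message)
```

Checking with find_spec before importing keeps the failure at selection time, so a missing optional backend never surfaces as an unrelated error later in execution.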

Doctrine ratified

  • R34 model-agnostic strict: structural pattern detection, no
    hardcode-by-model-name in runtime / kernels / strategies / dispatchers.
  • R35 Prism never refuses: cascade fully implemented to
    cpu_execution fallback; every legitimate plan finds a strategy.
  • R30 runtime duality: the 4 execution modes (compiled, sequential,
    triton, triton_sequential) cover the 14 cells symmetrically; the
    band_streamed_chain_* mirror restores chain coverage on the triton
    path.
  • R33 zero torch in triton/: preserved; chain wrapper NBX
    variant uses only NBX wrappers + NBXTensor methods, no torch
    import on the execute path.
  • R29 inspectable artefacts: every cell ✓ has a coherent red
    apple PNG in docs/verdicts/p_triton_chain_cpu_pointer/ and
    docs/verdicts/p_nbx_tiled_conv2d_small_scale/.

Known limitations (transparency)

  • 2/16 cells ⏸: CPU pure × triton and CPU pure × triton_sequential
    are blocked by the absence of a triton-cpu PyPI wheel
    (meta-pytorch/triton-cpu, build-from-source only at v0.2.0 release
    time). Backlog item P-TRITON-CPU-UPSTREAM-WHEEL-FOLLOWUP
    unblocks automatically when upstream ships wheels.
  • Op-level cross-device split (Gap B), needed for models where a
    single op exceeds per-device VRAM, is out of scope of the v2
    mandate. Backlog item P-OP-LEVEL-CROSS-DEVICE-SPLIT opens when a
    concrete model demands it (Sana 4Kpx does not need it now that its
    VAE fits in 16 GiB after S5 tiling).

Validation artefacts

The mandate v2 verdict and per-cell R29 artefacts are committed:

  • docs/verdicts/p_prism_never_refuse_v2_closed.md — full
    verdict, sub-workstream history, commit hashes, anti-regression table.
  • docs/verdicts/p_triton_chain_cpu_pointer/ — 6 coherent red
    apple PNGs (32g compiled / 32g triton / 16g compiled / 16g triton /
    16g triton-seq / 2× 16g triton / 2× 16g triton-seq).
  • docs/verdicts/p_nbx_tiled_conv2d_small_scale/ — microtest
    logs + 16g compiled anti-regression PNG + 32g triton baseline PNG.

Key commits

Sub-workstreams and bugfixes that shaped this release:

Sub-workstream / fix | Commits
S1 hybrid CPU+GPU dispatch | de5fb9e
S2 native sequential CPU (RoPE fix) | 8b4d020
S3 triton-cpu install gate | b3e479f, d0974e6
S5 residual chain detection + wrapper | 198ab1b, 33c6b21
P-S5 depthwise tile-skip | 8af7848
R30 chain skip on triton (interim) | c9d2581
S4 closure (cascade docs) | f58f6cc
JSONL dump O(N) | da484ae
P-NBX-TILED-CONV2D-SMALL-SCALE | 176bc7e, 63edb03
L1 NBX_LIVE_WATERMARK_TRACE | 993181f
L2 per-slot LIVE_DUMP_SLOT | 5a714a5
L4 band_streamed_chain_nbx + dispatcher | f8a8ad8
L4b NBXTensor isinstance + _set_device | f997479
C1 NBX_CHAIN_DIAG per-chain device-state | 9ddcc7b
C3d ROOT FIX — NBXTensor.__getitem__ | 8a6daf2
C4 chain wrapper default-ON triton | 23de696
MANDATE CLOSED verdict + R29 | 77571b7

Tag

  • p-prism-never-refuse-v2-closed posted on commit 77571b7 (mandate
    closure), pushed to origin and gitlab.
  • v0.2.0 semver release tag posted on this commit, pushed to both
    remotes.

Upgrading from v0.1.x

No breaking API changes. The chain wrapper is default-ON on triton
modes (set NBX_TRITON_CHAIN_WRAPPER=0 to fall back to the v0.1.x
behaviour). Other internal changes are transparent to call sites.

NeuroBrix 0.1.6 — Multi-GPU triton hardened + Item 3 closed

20 Apr 14:05

Fifth release since 0.1.5 on April 15. Four substantive commits hardening the multi-GPU Triton path and closing the long-standing Item 3 perf/VRAM trade-off, plus a regression-harness overhaul.

Highlights

  • Multi-GPU Triton hardened — 163 of 211 kernel-launch sites were latently broken. kernels/wrappers.py's documented _set_device contract was honoured at only 48 of 211 production launch sites. On single-GPU this was invisible (active device always matched the tensor), but any multi-GPU Prism plan with a component on a non-active device crashed argmax, bmm, addmm, softmax, rms_norm, and most elementwise wrappers. Fix promotes _set_device to a single source of truth in nbx_tensor.py and inserts the guard at every red site. 211 / 211 green. Phase 2 multi-GPU strategy matrix now 8 / 8 cells green.

  • Zero3 block-wise ratchet pipelining (native + triton). Qwen3-30B-A3B-Thinking-2507 on a single 16 GB V100 now runs under bounded VRAM (two blocks of weights + KV + activations). Overlaps H2D(N+1) with compute(N) on a dedicated transfer stream. Prefill dramatically faster than the per-op slow path. Polymorphic over torch.Tensor + CompiledSequence and NBXTensor + TritonSequence.

  • Item 3 closed — in-kernel fp16 → fp32 tile promotion for mm / bmm / addmm on pre-Ampere (Path A'). New PROMOTE_B: tl.constexpr flag on matmul_kernel / addmm_kernel: the b tile is cast to a.dtype after tl.load, before tl.dot. Bit-exact. On TinyLlama-1.1B v100-16g --triton: peak VRAM 4.74 → 2.54 GB (-46 %), decode 9.94 → 5.04 s on 8 tokens (-49 %), character-identical output across 30-token greedy decode. On Qwen3-30B zero3: -64 MB, -5 %.

  • NBX_FORCE_STRATEGY env var for deterministic Prism strategy selection. Short-circuits the score cascade. Unknown strategy → RuntimeError listing the 9 valid values. Valid but unavailable for the device count → RuntimeError. Fits but under-performs the auto-selected one → still used (zero silent fallback). Enables per-strategy matrix regression testing.

  • Regression harness tracked in git + 4 ::native audio fails resolved. Prior diagnoses had attributed the fails to "tokenizers env-drift" without sourced stderr; this release captured actual stderr and rooted 3 of them in interpreter selection (venv vs system Python user-site) plus one genuine aten::cudnn_batch_norm regression on Kokoro-82M (marked xfail with follow-up scoped in docs/follow-ups/). New pytest_configure hook auto-detects $VIRTUAL_ENV. Harness goes from 4 failed / 11 passed / 11 xfailed / 14 skipped to 0 failed / 14 passed / 12 xfailed / 14 skipped.
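
The NBX_FORCE_STRATEGY rules described above can be sketched as a small resolver. The strategy list here is a hypothetical three-entry stand-in (the release mentions 9 valid values without listing them), and the function signature is an assumption made for illustration.

```python
import os

# Hypothetical subset of strategy names, for illustration only.
VALID = ("single_gpu", "cpu_execution", "example_multi_gpu")
NEEDS_MULTI_GPU = {"example_multi_gpu"}


def resolve_strategy(auto_selected, gpu_count, env=os.environ):
    """Return the forced strategy if NBX_FORCE_STRATEGY is set, else the auto pick."""
    forced = env.get("NBX_FORCE_STRATEGY")
    if not forced:
        return auto_selected
    if forced not in VALID:
        raise RuntimeError(
            f"Unknown strategy {forced!r}; valid values: {', '.join(VALID)}"
        )
    if forced in NEEDS_MULTI_GPU and gpu_count < 2:
        raise RuntimeError(
            f"Strategy {forced!r} is unavailable with {gpu_count} GPU(s)"
        )
    # Used even when the auto-selected strategy would score better:
    # no silent fallback.
    return forced


print(resolve_strategy("single_gpu", 1, env={}))
print(resolve_strategy("single_gpu", 1, env={"NBX_FORCE_STRATEGY": "cpu_execution"}))
```

Passing env explicitly keeps the resolver testable per strategy without mutating the process environment.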

Full changelog

See CHANGELOG.md §0.1.6.

Install

pip install neurobrix==0.1.6

Wheel sha256 7fca8534dd5c87d6dc455ad206618cda95f165460e74e8b6a944e8745a97e410, sdist sha256 be3258ebd8eae326d24059054ad59de2fb352df5e143bbf9f7bff558224fc673.

v0.1.0a12 — Complete Windows Compatibility

20 Mar 00:30

Complete Windows Compatibility

Full end-to-end audit and fix of all Windows-breaking code paths.

Fixed

  • os.kill(pid, 0) crash — replaced with ctypes.windll.kernel32.OpenProcess() for process existence check
  • signal.SIGTERM handler crash — guarded registration with hasattr()
  • ffmpeg video pipe deadlock — replaced manual pipe write with communicate()
  • All daemon lifecycle operations (start, check, stop) now Windows-safe
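
A cross-platform process-existence check along these lines could be written as below. This is a sketch, not the shipped code: the function name is hypothetical, and on Windows a handle can sometimes still be opened for a process that has already exited, so the result is best treated as a hint.

```python
import os
import sys


def pid_exists(pid):
    """Return True if a process with this PID appears to exist."""
    if sys.platform == "win32":
        import ctypes

        PROCESS_QUERY_LIMITED_INFORMATION = 0x1000
        kernel32 = ctypes.windll.kernel32
        kernel32.OpenProcess.restype = ctypes.c_void_p
        kernel32.CloseHandle.argtypes = [ctypes.c_void_p]
        handle = kernel32.OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION, False, pid)
        if not handle:
            return False
        kernel32.CloseHandle(handle)
        return True
    try:
        os.kill(pid, 0)  # signal 0 probes existence and permission only
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


print(pid_exists(os.getpid()))
```

The same platform-guard pattern applies to signal registration: checking hasattr(signal, "SIGTERM") before calling signal.signal avoids referencing constants a platform does not define.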

Windows users can now run the full cold-run flow:

neurobrix import Sana/Sana_1600M_1024px_MultiLing --no-keep
neurobrix run --model Sana_1600M_1024px_MultiLing --prompt "a cat in a hat"

pip install neurobrix==0.1.0a12

v0.1.0a9 — Cross-Platform: Windows + macOS + Lazy Triton

19 Mar 21:43

v0.1.0a8 — Cross-Platform + Lazy Triton

20 Mar 00:07

Cross-Platform Support + Lazy Triton

Consolidated release with cross-platform fixes and lazy Triton loading.

(Superseded by v0.1.0a9 which includes additional Windows fixes)

pip install neurobrix==0.1.0a8

v0.1.0a7 — SANA-Video 720p

19 Mar 05:11

Pre-release

Video Family — SANA-Video 2B 720p

First video model support: SANA-Video 2B at 720p resolution.

Highlights

  • 1280×704 resolution, 81 frames, 16fps video generation
  • H.264 video codec (QuickTime/browser compatible)
  • Per-channel latent denormalization in VAE handler (LTX2Video)
  • DPM++ scheduler with flow-matching sigmas

pip install neurobrix==0.1.0a7

neurobrix run --model SANA-Video_2B_720p_diffusers --prompt "A golden retriever running through autumn leaves"

v0.1.0a6 — Audio Family (11/11 models)

19 Mar 05:11

Audio Family — All 11 Models Working

Complete audio family support with 5 new flow handlers and universal AudioEngine.

Models

Whisper, Whisper V3 Turbo, Parakeet, Orpheus, Canary-Qwen, Kokoro-82M, VibeVoice-1.5B, Voxtral, OpenAudio-S1, Granite Speech, Chatterbox

Highlights

  • encoder_decoder flow: Whisper-style encoder → cross-attention decoder
  • audio_llm flow: audio-conditioned LLM (Voxtral, Granite Speech, Canary-Qwen)
  • dual_ar flow: DualAR semantic token generation (OpenAudio-S1)
  • rnnt flow: frame-by-frame greedy TDT decode (Parakeet)
  • tts_llm flow: text → LM → DDPM diffusion → acoustic decoder (VibeVoice)
  • Universal hardware auto-detection — --hardware flag now optional
  • GPU detection for 10 vendors (NVIDIA, AMD, Intel, Apple, and more)
  • Universal TilingEngine with accumulate-and-divide blending
  • CLI --audio argument for speech-to-text

pip install neurobrix==0.1.0a6
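
Accumulate-and-divide blending, as named in the TilingEngine highlight, can be shown in one dimension. The sketch assumes uniform weights and tile > overlap; the function name and the fn(start, end) callback are hypothetical, not the TilingEngine interface.

```python
def blend_tiles(length, tile, overlap, fn):
    """Run `fn` on overlapping 1-D tiles and average where tiles overlap.

    `fn(start, end)` returns one value per position in [start, end).
    """
    acc = [0.0] * length      # running sum of tile outputs
    weight = [0.0] * length   # number of tiles that covered each position
    step = tile - overlap
    start = 0
    while True:
        end = min(start + tile, length)
        for i, v in zip(range(start, end), fn(start, end)):
            acc[i] += v
            weight[i] += 1.0
        if end == length:
            break
        start += step
    return [a / w for a, w in zip(acc, weight)]


# Two tiles, [0, 4) and [2, 6), that disagree: the overlap gets their mean.
blended = blend_tiles(6, tile=4, overlap=2, fn=lambda s, e: [float(s)] * (e - s))
print(blended)  # [0.0, 0.0, 1.0, 1.0, 2.0, 2.0]
```

Dividing by the accumulated weight at the end means tile order does not matter and any number of overlapping tiles is handled uniformly.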

v0.1.0a11 — Windows Path Fix + Self-Contained Containers

20 Mar 00:07

Windows Path Fix + Self-Contained Containers

Fixed

  • Windows path separator bug in weight loader and NBX container — str(path) gives backslashes on Windows, breaking weight file matching. Replaced with .as_posix() for cross-platform compatibility.
  • Runtime no longer reaches outside ~/.neurobrix/cache/ for tokenizer files — all resources read from the NBX container
  • TTS output path now uses current working directory instead of hardcoded Linux path

pip install neurobrix==0.1.0a11
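
The path-separator bug fixed in this release is easy to reproduce on any OS with pathlib's pure Windows paths. The key names below are hypothetical; the point is only that str() and .as_posix() differ on Windows-flavoured paths.

```python
from pathlib import PureWindowsPath

# Hypothetical container keys, stored with forward slashes.
weight_keys = {"transformer/blocks/0/weight.safetensors"}

p = PureWindowsPath("transformer") / "blocks" / "0" / "weight.safetensors"

# str() uses the platform separator, so the lookup misses on Windows.
assert str(p) == "transformer\\blocks\\0\\weight.safetensors"
assert str(p) not in weight_keys

# as_posix() always uses forward slashes, so the lookup is portable.
assert p.as_posix() in weight_keys
```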

v0.1.0a10 — Complete dependency audit + lazy imports

19 Mar 21:59

v0.1.0a5 — Universal Audio Engine + ZERO Enforcement

11 Mar 03:19
