15 May 03:38

2663c95

v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (14/16 cells) Latest

Latest

NeuroBrix v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (2026-05-15)

162 commits since v0.1.6 (2026-04-20). Minor version bump per SemVer:
features added are rétrocompatibles, no breaking API changes. Major
acquis closes the P-PRISM-NEVER-REFUSE v2 mandate with 14/16 hardware ×
mode cells validated end-to-end on Sana 1.6B 4Kpx diffusion (4096×4096
output).

Highlights

Doctrine R34 model-agnostic ratified and audited clean — zero
active hardcode-by-model-name violations across runtime, kernels,
strategies, dispatchers.
Doctrine R35 "Prism never refuses" implemented and empirically
validated — cascade strategy covers every legitimate hardware ×
model combination down to cpu_execution fallback.
14/16 cells matrice hardware × mode validated end-to-end on
Sana 4Kpx (single 32 GiB, single 16 GiB, 2× 16 GiB; compiled,
sequential, triton, triton_sequential). The 2 ⏸ cells remain on
upstream-blocked CPU triton modes (triton-cpu PyPI wheel pending).
18× speedup on 32g triton mode: 1443 s → 79.81 s wall on
Sana 4Kpx via DC-AE residual chain band-streaming (each of 9 VAE
chains streams band-by-band, peak transient drops from 6.7 GiB to
1.6 GiB per chain).

Hardware coverage

Config × Mode	compiled	sequential	triton	triton-seq
Single GPU 32 GiB	✓ 23.9 s	✓	✓ 79.8 s	✓
Single GPU 16 GiB	✓	✓	✓ 79.4 s	✓ 84.5 s
Multi-GPU 2× 16 GiB	✓	✓	✓ 81.8 s	✓ 84.1 s
CPU pure	✓	✓	⏸ S3	⏸ S3

Wall times measured on Sana 1.6B 4Kpx single-step at 4096×4096 spatial
output. CPU triton cells wait on upstream triton-cpu (meta-pytorch
project) shipping PyPI wheels.

Architecture

Hybrid CPU+GPU per-component placement

core/strategies/lazy_sequential.py + core/runtime/executor.py now
correctly route mixed-device plans through strategy.execute_component
so a CPU component consuming a GPU producer no longer stalls in
implicit transfer or raises device-mismatch. Unlocked 16g compiled +
sequential on models whose VAE alone exceeds the GPU budget.

Intra-component multi-device cascade

The single_gpu strategy correctly picks cuda:0 on 2× 16 GiB
hardware when a component fits there post-S5 chain tiling. No solver
extension required — the cascade naturally handles the case.

DC-AE residual chain tiling — band-streamed (R30 dual-branch)

Compiled mode: band_streamed_chain_torch materialises the residual
chain band-by-band in PyTorch ATen, writes back to T_base in place.
Triton mode: band_streamed_chain_nbx (143 lines, R33-pure) mirrors
the algorithm using NBX wrappers + NBXTensor methods only.
Mode-aware dispatcher in core/module/tiling_engine.py selects the
variant per graph_executor.mode.
Chain wrapper is default-ON on triton modes; NBX_TRITON_CHAIN_WRAPPER=0
reverts to the c9d2581 skip path as emergency rollback.

Edge-padding + halo-offset wrapper math fix (`_tiled_conv2d_spatial_*`)

Two coupled bugs in the 4 tiled-conv wrappers (_tiled_conv2d_spatial_torch,
_tiled_conv2d_spatial_nbx, _fused_upsample_conv2d_torch,
_fused_upsample_conv2d_nbx):

Edge-band double-padding (max(0, -in_read_start) already provides
pad_h; the extra + pad_h if is_top_band term over-padded by
pad_h rows).
Internal-frontier halo offset on the conv_band read side
(output slice was offset by halo_top rows on every internal band).
Validated bit-exact via scripts/microtest_tiled_conv2d_small_scale.py
sweeping (kh ∈ {1,3,5}, pad_h ∈ {0,1,2}, tile_factor ∈ {1,2,4,8}) at
1024² and 2048² — cos=1.0000 max_abs=0.0000 universally.

NBXTensor.getitem negative-slice fix — ROOT FIX for chain wrapper

NBXTensor.__getitem__ slice path used k.start or 0 which silently
left negative starts untouched (-2 or 0 evaluates to -2, not 0).
For t[-N:], the resulting narrow(dim, -N, length) produced an
_offset = parent_offset - N*stride → data_ptr() pointed BEFORE the
cudaMalloc'd block → Triton's cuPointerGetAttribute rejected the
pointer with "Pointer argument (at 0) cannot be accessed from Triton
(cpu tensor?)".

The Sana 4Kpx VAE chain wrapper's halo_carry[:, :, -top_size:, :]
triggered this at the 3rd iteration of every 4096² chain. Fix restores
the universal Python/torch slice contract:

start = k.start if k.start is not None else 0
stop = k.stop if k.stop is not None else shape[dim]
if start < 0: start = max(0, shape[dim] + start)
if stop < 0:  stop  = shape[dim] + stop

8-line fix, 4 cells unlocked. Latent bug across the codebase, now closed.

Triton-CPU integration stage 1

triton-cpu (meta-pytorch project) is not on PyPI today and requires
build-from-source. Stage 1 ships the install gate (raises clean
ImportError with actionable message when missing), coverage docs,
and the activation flow. CPU triton + CPU triton-sequential cells
remain ⏸ until upstream ships PyPI wheels — automatic unblock then.

Doctrine ratified

R34 model-agnostic strict: structural pattern detection, no
hardcode-by-model-name in runtime / kernels / strategies / dispatchers.
R35 Prism never refuses: cascade fully implemented to
cpu_execution fallback; every legitimate plan finds a strategy.
R30 dualité runtime: the 4 execution modes (compiled, sequential,
triton, triton_sequential) cover the 14 cells symmetrically;
band_streamed_chain_* mirror restored chain coverage on triton path.
R33 zero torch in triton/: preserved; chain wrapper NBX
variant uses only NBX wrappers + NBXTensor methods, no torch
import on the execute path.
R29 inspectable artefacts: every cell ✓ has a coherent red
apple PNG in docs/verdicts/p_triton_chain_cpu_pointer/ and
docs/verdicts/p_nbx_tiled_conv2d_small_scale/.

Known limitations (transparency)

2/16 cells ⏸ : CPU pure × triton and CPU pure × triton_sequential
blocked on absence of triton-cpu PyPI wheel
(meta-pytorch/triton-cpu, build-from-source only at v0.2.0 release
time). Backlog item P-TRITON-CPU-UPSTREAM-WHEEL-FOLLOWUP
unblocks automatically when upstream ships wheels.
Op-level cross-device split (Gap B) out of scope of v2 mandate:
for models where a single op exceeds per-device VRAM. Backlog item
P-OP-LEVEL-CROSS-DEVICE-SPLIT opens when a concrete model demands
it (no Sana 4Kpx need it now that VAE fits 16 GiB post-S5 tiling).

Validation artefacts

The mandate v2 verdict and per-cell R29 artefacts are committed:

docs/verdicts/p_prism_never_refuse_v2_closed.md — full
verdict, sub-chantier history, commit hashes, anti-régression table.
docs/verdicts/p_triton_chain_cpu_pointer/ — 6 coherent red
apple PNGs (32g compiled / 32g triton / 16g compiled / 16g triton /
16g triton-seq / 2× 16g triton / 2× 16g triton-seq).
docs/verdicts/p_nbx_tiled_conv2d_small_scale/ — microtest
logs + 16g compiled anti-regression PNG + 32g triton baseline PNG.

Key commits

Sub-chantiers and bugfixes that shaped this release:

Sub-chantier / fix	Commits
S1 hybrid CPU+GPU dispatch	`de5fb9e`
S2 native sequential CPU (RoPE fix)	`8b4d020`
S3 triton-cpu install gate	`b3e479f`, `d0974e6`
S5 residual chain detection + wrapper	`198ab1b` → `33c6b21`
P-S5 depthwise tile-skip	`8af7848`
R30 chain skip on triton (interim)	`c9d2581`
S4 closure (cascade docs)	`f58f6cc`
JSONL dump O(N)	`da484ae`
P-NBX-TILED-CONV2D-SMALL-SCALE	`176bc7e`, `63edb03`
L1 NBX_LIVE_WATERMARK_TRACE	`993181f`
L2 per-slot LIVE_DUMP_SLOT	`5a714a5`
L4 band_streamed_chain_nbx + dispatcher	`f8a8ad8`
L4b NBXTensor isinstance + _set_device	`f997479`
C1 NBX_CHAIN_DIAG per-chain device-state	`9ddcc7b`
C3d ROOT FIX — NBXTensor.getitem	`8a6daf2`
C4 chain wrapper default-ON triton	`23de696`
MANDATE CLOSED verdict + R29	`77571b7`

Upgrading from v0.1.x

No breaking API changes. The chain wrapper is default-ON on triton
modes (set NBX_TRITON_CHAIN_WRAPPER=0 to fall back to the v0.1.x
behaviour). Other internal changes are transparent to call sites.

Assets 3

20 Apr 14:05

benkelaya

v0.1.6

f92755d

NeuroBrix 0.1.6 — Multi-GPU triton hardened + Item 3 closed

Fifth release since 0.1.5 on April 15. Four substantive commits hardening the multi-GPU Triton path and closing the long-standing Item 3 perf/VRAM trade-off, plus a regression-harness overhaul.

Highlights

Multi-GPU Triton hardened — 163 of 211 kernel-launch sites were latently broken. kernels/wrappers.py's documented _set_device contract was honoured at only 48 of 211 production launch sites. On single-GPU this was invisible (active device always matched the tensor), but any multi-GPU Prism plan with a component on a non-active device crashed argmax, bmm, addmm, softmax, rms_norm, and most elementwise wrappers. Fix promotes _set_device to a single source of truth in nbx_tensor.py and inserts the guard at every red site. 211 / 211 green. Phase 2 multi-GPU strategy matrix now 8 / 8 cells green.
Zero3 block-wise ratchet pipelining (native + triton). Qwen3-30B-A3B-Thinking-2507 on a single 16 GB V100 now runs under bounded VRAM (two blocks of weights + KV + activations). Overlaps H2D(N+1) with compute(N) on a dedicated transfer stream. Prefill dramatically faster than the per-op slow path. Polymorphic over torch.Tensor + CompiledSequence and NBXTensor + TritonSequence.
Item 3 closed — in-kernel fp16 → fp32 tile promotion for mm / bmm / addmm on pre-Ampere (Path A'). New PROMOTE_B: tl.constexpr flag on matmul_kernel / addmm_kernel: the b tile is cast to a.dtype after tl.load, before tl.dot. Bit-exact. On TinyLlama-1.1B v100-16g --triton: peak VRAM 4.74 → 2.54 GB (-46 %), decode 9.94 → 5.04 s on 8 tokens (-49 %), character-identical output across 30-token greedy decode. On Qwen3-30B zero3: -64 MB, -5 %.
NBX_FORCE_STRATEGY env var for deterministic Prism strategy selection. Short-circuits the score cascade. Unknown strategy → RuntimeError listing the 9 valid values. Valid but unavailable for the device count → RuntimeError. Fits but under-performs the auto-selected one → still used (zero silent fallback). Enables per-strategy matrix regression testing.
Regression harness tracked in git + 4 ::native audio fails resolved. Prior diagnoses had attributed the fails to "tokenizers env-drift" without sourced stderr; this release captured actual stderr and rooted 3 of them in interpreter selection (venv vs system Python user-site) plus one genuine aten::cudnn_batch_norm regression on Kokoro-82M (marked xfail with follow-up scoped in docs/follow-ups/). New pytest_configure hook auto-detects $VIRTUAL_ENV. Harness goes from 4 failed / 11 passed / 11 xfailed / 14 skipped to 0 failed / 14 passed / 12 xfailed / 14 skipped.

Full changelog

See CHANGELOG.md §0.1.6.

Install

pip install neurobrix==0.1.6

Wheel sha256 7fca8534dd5c87d6dc455ad206618cda95f165460e74e8b6a944e8745a97e410, sdist sha256 be3258ebd8eae326d24059054ad59de2fb352df5e143bbf9f7bff558224fc673.

Assets 4

20 Mar 00:30

benkelaya

v0.1.0a12

cda22a3

v0.1.0a12 — Complete Windows Compatibility Pre-release

Pre-release

Complete Windows Compatibility

Full end-to-end audit and fix of all Windows-breaking code paths.

Fixed

os.kill(pid, 0) crash — replaced with ctypes.windll.kernel32.OpenProcess() for process existence check
signal.SIGTERM handler crash — guarded registration with hasattr()
ffmpeg video pipe deadlock — replaced manual pipe write with communicate()
All daemon lifecycle operations (start, check, stop) now Windows-safe

Windows users can now run the full cold-run flow:

neurobrix import Sana/Sana_1600M_1024px_MultiLing --no-keep
neurobrix run --model Sana_1600M_1024px_MultiLing --prompt "a cat in a hat"

pip install neurobrix==0.1.0a12

Assets 2

19 Mar 21:43

benkelaya

v0.1.0a9

6d190ca

v0.1.0a9 — Cross-Platform: Windows + macOS + Lazy Triton Pre-release

Pre-release

Full Changelog: v0.1.0a7...v0.1.0a9

Assets 2

20 Mar 00:07

benkelaya

v0.1.0a8

e7ad5d2

v0.1.0a8 — Cross-Platform + Lazy Triton Pre-release

Pre-release

Cross-Platform Support + Lazy Triton

Consolidated release with cross-platform fixes and lazy Triton loading.

(Superseded by v0.1.0a9 which includes additional Windows fixes)

pip install neurobrix==0.1.0a8

Assets 2

19 Mar 05:11

benkelaya

v0.1.0a7

4e08999

v0.1.0a7 — SANA-Video 720p Pre-release

Pre-release

Video Family — SANA-Video 2B 720p

First video model support: SANA-Video 2B at 720p resolution.

Highlights

1280×704 resolution, 81 frames, 16fps video generation
H.264 video codec (QuickTime/browser compatible)
Per-channel latent denormalization in VAE handler (LTX2Video)
DPM++ scheduler with flow-matching sigmas

pip install neurobrix==0.1.0a7

neurobrix run --model SANA-Video_2B_720p_diffusers --prompt "A golden retriever running through autumn leaves"

Assets 2

19 Mar 05:11

benkelaya

v0.1.0a6

62dbc9f

v0.1.0a6 — Audio Family (11/11 models) Pre-release

Pre-release

Audio Family — All 11 Models Working

Complete audio family support with 5 new flow handlers and universal AudioEngine.

Models

Whisper, Whisper V3 Turbo, Parakeet, Orpheus, Canary-Qwen, Kokoro-82M, VibeVoice-1.5B, Voxtral, OpenAudio-S1, Granite Speech, Chatterbox

Highlights

encoder_decoder flow: Whisper-style encoder → cross-attention decoder
audio_llm flow: audio-conditioned LLM (Voxtral, Granite Speech, Canary-Qwen)
dual_ar flow: DualAR semantic token generation (OpenAudio-S1)
rnnt flow: frame-by-frame greedy TDT decode (Parakeet)
tts_llm flow: text → LM → DDPM diffusion → acoustic decoder (VibeVoice)
Universal hardware auto-detection — --hardware flag now optional
GPU detection for 10 vendors (NVIDIA, AMD, Intel, Apple, and more)
Universal TilingEngine with accumulate-and-divide blending
CLI --audio argument for speech-to-text

pip install neurobrix==0.1.0a6

Assets 2

20 Mar 00:07

benkelaya

v0.1.0a11

541faa7

v0.1.0a11 — Windows Path Fix + Self-Contained Containers Pre-release

Pre-release

Windows Path Fix + Self-Contained Containers

Fixed

Windows path separator bug in weight loader and NBX container — str(path) gives backslashes on Windows, breaking weight file matching. Replaced with .as_posix() for cross-platform compatibility.
Runtime no longer reaches outside ~/.neurobrix/cache/ for tokenizer files — all resources read from the NBX container
TTS output path now uses current working directory instead of hardcoded Linux path

pip install neurobrix==0.1.0a11

Assets 2

19 Mar 21:59

benkelaya

v0.1.0a10

376d75b

v0.1.0a10 — Complete dependency audit + lazy imports Pre-release

Pre-release

Full Changelog: v0.1.0a9...v0.1.0a10

Assets 2

11 Mar 03:19

benkelaya

v0.1.0a5

db202d9

v0.1.0a5 — Universal Audio Engine + ZERO Enforcement Pre-release

Pre-release

Full Changelog: v0.1.0a4...v0.1.0a5

Assets 2

Releases: NeuroBrix/neurobrix

v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (14/16 cells)

NeuroBrix v0.2.0 — P-PRISM-NEVER-REFUSE v2 closed (2026-05-15)

Highlights

Hardware coverage

Architecture

Hybrid CPU+GPU per-component placement

Intra-component multi-device cascade

DC-AE residual chain tiling — band-streamed (R30 dual-branch)

Edge-padding + halo-offset wrapper math fix (_tiled_conv2d_spatial_*)

NBXTensor.getitem negative-slice fix — ROOT FIX for chain wrapper

Triton-CPU integration stage 1

Doctrine ratified

Known limitations (transparency)

Validation artefacts

Key commits

Tag

Upgrading from v0.1.x

Uh oh!

NeuroBrix 0.1.6 — Multi-GPU triton hardened + Item 3 closed

Highlights

Full changelog

Install

Uh oh!

v0.1.0a12 — Complete Windows Compatibility

Complete Windows Compatibility

Fixed

Uh oh!

v0.1.0a9 — Cross-Platform: Windows + macOS + Lazy Triton

Uh oh!

v0.1.0a8 — Cross-Platform + Lazy Triton

Cross-Platform Support + Lazy Triton

Uh oh!

v0.1.0a7 — SANA-Video 720p

Video Family — SANA-Video 2B 720p

Highlights

Uh oh!

v0.1.0a6 — Audio Family (11/11 models)

Audio Family — All 11 Models Working

Models

Highlights

Uh oh!

v0.1.0a11 — Windows Path Fix + Self-Contained Containers

Windows Path Fix + Self-Contained Containers

Fixed

Uh oh!

v0.1.0a10 — Complete dependency audit + lazy imports

Uh oh!

v0.1.0a5 — Universal Audio Engine + ZERO Enforcement

Uh oh!

Edge-padding + halo-offset wrapper math fix (`_tiled_conv2d_spatial_*`)