feat: Mamba-3 MLX port + pipeline/benchmark correctness fixes by skyoo2003 · Pull Request #14 · skyoo2003/bit-axon

skyoo2003 · 2026-04-20T10:03:16Z

Summary

Mamba-3 MLX block (new): src/bit_axon/layers/mamba3.py — Pure MLX implementation of the Exponential-Trapezoidal 3-term recurrence from arXiv:2603.15569. Complex SSM expressed as real tensors + cumulative 2×2 rotations via a RoPE trick (no complex64 needed). Single SISO/MIMO codepath. Chunked parallel scan reuses Mamba-2 segsum (Appendix A.1). Prefill vs per-token decode numerical parity (diff ~1e-8). AxonSSMBlock / AxonSSMMoEBlock updated to use Mamba3.
Qwen2.5-3B embedding port (new): src/bit_axon/porting/qwen_embedding.py — Truncates Qwen embeddings to d_source with std-rescale to match downstream variance. Skips truncation when d_source_model < 512 (over-aggressive truncation caused NaN).
Pipeline bug fix: The old save_adapter_only + load_and_merge path merged only LoRA weights on top of a random base, losing all trained embedding/norm parameters → merged PPL ~exp(log(vocab)) ~150K. Now uses merge_adapters(model) + save_merged_model(model,...) to preserve the full parameter set. PPL measurement moved before NF4 quantization.
SWA KV cache bug fix: BitAxonModel.__call__ now calls _create_caches() during prefill; SlidingWindowAttention prefill/decode paths separated (prefill seeds cache only, attends with full K/V — trimming to window_size was turning entire causal mask rows to -inf, causing softmax NaN regression).
Evaluation harness: MC extract_answer relaxed to word-boundary [A-D], GSM8K #### format prioritized, GenerateConfig.stop_strings added, math.exp overflow guard in perplexity.py. Qwen2.5-3B baseline remeasured: GSM8K 1% → 71.5%, ARC-c 14.3% → 45.3%, ARC-e 16.7% → 62.7%, MMLU 14.2% → 46%.
Training hparams + UX: LR 1e-3 → 1e-4, LoRA scale 20 → 8, newline progress every 50 steps / every 10% benchmark (print(flush=True) to bypass rich buffering).
Misc: ORPO dataset name fix (ultrafeedback_binarized_cleaned 404), BitAxonConfig.large() preset + 13 new mamba3_* fields, auto-promote vocab_size < tok.vocab_size (prevents OOB NaN), stage-upload CLI + LICENSE/NOTICE copy + param count M/B rendering, .gitignore additions for pipeline_output/ and reference_impl/.

Test plan

Mamba3 block prefill vs decode numerical parity (diff ~1e-8)
BitAxonModel full prefill vs decode parity (including SWA fix, diff ~1e-7)
Small/Medium pipeline end-to-end pass (500/300 SFT + 200/150 ORPO)
Qwen2.5-3B baseline before/after harness fix comparison complete
Large pipeline run (follow-up outside this session)
Post HF upload model card verification

- Mamba-3 SISO+MIMO block (arXiv:2603.15569) in pure MLX; replaces AxonSSM in sandwich architecture. - Pipeline merge bug: save trained full params instead of only LoRA (PPL ~150K -> ~12K). - SWA prefill/decode cache bug: create_caches on prefill; split prefill/decode paths in SWA. - Qwen2.5-3B embedding port (std-rescaled, d_source>=512 only). - Benchmark harness: relax MC extractor, add stop_strings for GSM8K, guard math.exp overflow. - Pipeline UX: per-50-step SFT/ORPO progress, per-10% benchmark progress via print(flush=True). - Config/hparams: LR 1e-3->1e-4, LoRA scale 20->8, vocab_size promotion, large() preset. - ORPO dataset rename: ultrafeedback_binarized_cleaned -> ultrafeedback_binarized. - .gitignore: pipeline_output/, reference_impl/.

sourcery-ai · 2026-04-20T10:03:23Z

Reviewer's Guide

Introduces a new pure-MLX Mamba-3 sequence block and wires it into BitAxon blocks, adds a Qwen2.5-3B embedding port for better initialization, fixes several pipeline and sliding-window attention cache correctness bugs, improves the evaluation harness and perplexity handling, tweaks training hyperparameters and CLI UX (including large config and benchmark flags), and adds a staging-only Hugging Face upload path plus scripts for baselines and full pipeline runs.

File-Level Changes

Change	Details	Files
Replace Axon SSM with new Mamba-3 block and plumb its configuration	Add pure-MLX Mamba3 implementation with chunked scalar-decay scan, RoPE-based complex SSM handling, and optional MIMO rank extension Introduce new mamba3_* configuration fields and a BitAxonConfig.large() preset Wire AxonSSMBlock and AxonSSMMoEBlock to construct Mamba3 instances via a shared helper	`src/bit_axon/layers/mamba3.py` `src/bit_axon/config.py` `src/bit_axon/layers/block.py`
Improve training pipeline initialization, merging semantics, and evaluation ordering	Allow selecting a large config, auto-bump vocab_size to match tokenizer, and optionally port Qwen2.5-3B embeddings into the model with std-rescaling and safety guards Reduce learning rate and LoRA scale consistently across SFT and ORPO stages and add periodic flushed progress logging Fix adapter merge to operate on the live trained model via merge_adapters + save_merged_model, then compute perplexity and benchmarks on the pre-quantized model before finally quantizing for deployment	`src/bit_axon/cli/pipeline.py` `src/bit_axon/cli/main.py` `src/bit_axon/porting/qwen_embedding.py`
Fix sliding-window attention and model cache semantics for prefill vs decode parity	Ensure BitAxonModel allocates KV caches on prefill so subsequent decode sees the same context Split SlidingWindowAttention prefill and decode paths to seed cache without truncating attention context on prefill, avoiding NaNs from fully-masked rows	`src/bit_axon/model.py` `src/bit_axon/layers/swa.py`
Harden evaluation harness, perplexity computation, and MC/Math answer extraction	Add stop_strings support to generation and use it in GSM8K to stop on question delimiters, while improving logging cadence in benchmark evaluation Relax multiple-choice extract_answer to accept word-boundary A–D and extend GSM8K extraction to prioritize the canonical #### answer format Guard compute_perplexity against exp overflow and allow passing benchmark limits/disable flag via CLI	`src/bit_axon/inference/generate.py` `src/bit_axon/evaluation/benchmark.py` `src/bit_axon/evaluation/tasks.py` `src/bit_axon/evaluation/perplexity.py` `src/bit_axon/cli/main.py` `src/bit_axon/cli/pipeline.py`
Add Qwen baseline evaluation script and ORPO dataset fix	Provide a script and adapter to run the Bit-Axon benchmark harness against mlx-lm Qwen2.5-3B for reproducible baselines with configurable limits Fix ORPO preset to point at the correct ultrafeedback dataset id	`scripts/eval_qwen_baseline.py` `src/bit_axon/cli/_datasets.py`
Improve Hugging Face upload staging, model card parameter formatting, and CLI UX	Factor out a stage_upload_dir helper that prepares an upload folder (weights, config, tokenizer, LICENSE/NOTICE, model card) without pushing, and expose it via a new CLI command Change parameter count rendering to scale through K/M/B instead of only B so small configs are represented accurately, and make upload more robust to tokenizer download errors and malformed benchmark inputs	`src/bit_axon/cli/upload.py` `src/bit_axon/cli/main.py`
Add end-to-end pipeline helper script and repo hygiene updates	Introduce a shell script to run small/medium/large pipelines with sensible defaults and stage upload folders for each size Update .gitignore to exclude pipeline_output and reference_impl artifacts	`scripts/pipeline_all.sh` `.gitignore`

Tips and commands

Interacting with Sourcery

Trigger a new review: Comment @sourcery-ai review on the pull request.
Continue discussions: Reply directly to Sourcery's review comments.
Generate a GitHub issue from a review comment: Ask Sourcery to create an
issue from a review comment by replying to it. You can also reply to a
review comment with @sourcery-ai issue to create an issue from it.
Generate a pull request title: Write @sourcery-ai anywhere in the pull
request title to generate a title at any time. You can also comment
@sourcery-ai title on the pull request to (re-)generate the title at any time.
Generate a pull request summary: Write @sourcery-ai summary anywhere in
the pull request body to generate a PR summary at any time exactly where you
want it. You can also comment @sourcery-ai summary on the pull request to
(re-)generate the summary at any time.
Generate reviewer's guide: Comment @sourcery-ai guide on the pull
request to (re-)generate the reviewer's guide at any time.
Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
pull request to resolve all Sourcery comments. Useful if you've already
addressed all the comments and don't want to see them anymore.
Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
request to dismiss all existing Sourcery reviews. Especially useful if you
want to start fresh with a new review - don't forget to comment
@sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

Enable or disable review features such as the Sourcery-generated pull request
summary, the reviewer's guide, and others.
Change the review language.
Add, remove or edit custom review instructions.
Adjust other review settings.

Getting Help

Contact our support team for questions or feedback.
Visit our documentation for detailed guides and information.
Keep in touch with the Sourcery team by following us on X/Twitter, LinkedIn or GitHub.

sourcery-ai

Hey - I've found 2 issues

Prompt for AI Agents

Please address the comments from this code review:

## Individual Comments

### Comment 1
<location path="src/bit_axon/cli/upload.py" line_range="260" />
<code_context>
+    # Copy LICENSE / NOTICE from the repo root so the HF repo carries the
+    # same legal files the project ships with, not just the license field
+    # declared in the model card frontmatter.
+    repo_root = Path(__file__).resolve().parents[3]
+    for fname in ("LICENSE", "NOTICE"):
+        src = repo_root / fname
</code_context>
<issue_to_address>
**issue (bug_risk):** Repository root discovery likely off by one directory level

Given `__file__` is `.../src/bit_axon/cli/upload.py`, `parents[3]` resolves to `src/`, not the project root. That means LICENSE/NOTICE are searched under `src/` and will never be copied from the actual repo root, defeating the intention of this block. Please update to use `parents[4]` (or another explicit project-root resolution) so the legal files are correctly discovered and copied.
</issue_to_address>

### Comment 2
<location path="src/bit_axon/cli/upload.py" line_range="269-273" />
<code_context>
     bench_dict: dict[str, float] | None = None
     if benchmark_results is not None:
         bench_dict = {}
         for pair in benchmark_results.split(","):
+            if "=" not in pair:
+                continue
             name, acc = pair.strip().split("=")
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Silently skipping malformed benchmark entries can make CLI misconfiguration harder to notice

Malformed entries in `--benchmark-results` (missing `=`) are currently dropped, which can hide typos like `mmlu-0.45` instead of `mmlu=0.45`. Please either emit a clear warning for skipped entries or fail fast on the first malformed pair so users know their input wasn’t parsed as intended.

```suggestion
        bench_dict = {}
        for pair in benchmark_results.split(","):
            if "=" not in pair:
                raise ValueError(
                    f"Malformed benchmark result entry: {pair!r}. "
                    "Expected format 'name=score', e.g. 'mmlu=0.45'."
                )
            name, acc = pair.strip().split("=")
```
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

parents[3] resolved to src/ instead of project root; use parents[4]. Silent skip of malformed --benchmark-results replaced with ValueError.

skyoo2003 · 2026-04-20T10:36:30Z

@sourcery-ai review

sourcery-ai

Hey - I've found 1 issue, and left some high level feedback:

The _scalar_decay_scan implementation in mamba3.py falls back to a Python for loop when chunk_size >= T, which could become a performance bottleneck if users set a large chunk_size; consider always using a vectorized chunked path (even for a single chunk) to avoid per-token Python overhead.
In stage_upload_dir, resolving the repo root via Path(__file__).resolve().parents[4] is fairly brittle to directory layout changes; using a more explicit root discovery mechanism (e.g. anchoring on a known package/module or walking up until finding .git/pyproject.toml) would make this more robust.
The benchmark stop_strings handling in _generate_sync decodes the full generated sequence on every token, which is O(n²); given your own comment about this, you might want to implement the suggested optimization (buffering and decoding every K tokens) before this grows into a noticeable bottleneck for longer completions.

Prompt for AI Agents

Please address the comments from this code review:

## Overall Comments
- The `_scalar_decay_scan` implementation in `mamba3.py` falls back to a Python `for` loop when `chunk_size >= T`, which could become a performance bottleneck if users set a large `chunk_size`; consider always using a vectorized chunked path (even for a single chunk) to avoid per-token Python overhead.
- In `stage_upload_dir`, resolving the repo root via `Path(__file__).resolve().parents[4]` is fairly brittle to directory layout changes; using a more explicit root discovery mechanism (e.g. anchoring on a known package/module or walking up until finding `.git`/`pyproject.toml`) would make this more robust.
- The benchmark `stop_strings` handling in `_generate_sync` decodes the full generated sequence on every token, which is O(n²); given your own comment about this, you might want to implement the suggested optimization (buffering and decoding every K tokens) before this grows into a noticeable bottleneck for longer completions.

## Individual Comments

### Comment 1
<location path="src/bit_axon/cli/upload.py" line_range="208-217" />
<code_context>


-def upload_cmd(
+def stage_upload_dir(
     model_path: str,
     repo_id: str,
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Staging upload folders ignores missing tokenizer.json, which can silently produce unusable artifacts.

`stage_upload_dir` now catches all exceptions from `_download_tokenizer_files` but still reports success, and `stage-upload` doesn’t re‑validate `tokenizer.json`. Unlike `upload_cmd` (which validates later), this means `stage-upload` can produce a directory that appears ready but is missing a tokenizer. Please either (a) have `stage_upload_dir` fail or emit a clear warning when `tokenizer.json` is absent, or (b) add a `stage-upload` option to treat missing tokenizer files as an error, so local/no-push workflows don’t quietly create broken repos.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨

_{Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.}

…tokenizer guard - mamba3: remove Python for-loop fallback, always use vectorized chunked scan - upload: replace brittle parents[4] with _find_repo_root() walking for pyproject.toml/.git - upload: fail fast when tokenizer.json is missing after staging - generate: buffer stop_strings decode check to every 16 tokens (O(n) instead of O(n²))

MLX Metal device crashes with SIGABRT after ~130 tests in a single process on CI runners. Run test_generate.py in a fresh process to avoid Metal resource exhaustion from prior test allocations.

… tests - test_enumerate_target_keys_count: 66→78 to match Mamba3's 13 params/SSM (adds B_bias, C_bias, B_norm, C_norm, mimo_x, mimo_z, mimo_o) - CI: also isolate test_compile.py (full model forward) in separate pytest process to prevent MLX Metal NaN flakiness

…n CI - test_generate.py: replace BitAxonModel with FakeModel that returns deterministic mx.array logits without Metal GPU computation - test_compile.py: skip TestModelCompileNoRegression and TestCompileBitExact on CI (MLX Metal forward passes flaky on runners) - ci.yml: restore single pytest invocation with CI=true env var

github-actions · 2026-04-28T00:49:58Z

This PR has had no activity for 7 days and has been marked as stale. If it is still being worked on, please push a commit or add a comment. Otherwise, it will be closed in 14 days.

github-actions · 2026-05-13T00:53:52Z

This PR was closed because it has been inactive for 14 days since being marked as stale. Feel free to reopen if still relevant.

github-actions Bot added bug Something isn't working documentation Improvements or additions to documentation enhancement New feature or request labels Apr 20, 2026

sourcery-ai Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread src/bit_axon/cli/upload.py Outdated

Comment thread src/bit_axon/cli/upload.py

fix: repo root off-by-one and fail-fast on malformed benchmark entries

d7ddf9c

parents[3] resolved to src/ instead of project root; use parents[4]. Silent skip of malformed --benchmark-results replaced with ValueError.

sourcery-ai Bot reviewed Apr 20, 2026

View reviewed changes

Comment thread src/bit_axon/cli/upload.py

skyoo2003 added 2 commits April 20, 2026 19:40

fix(ci): isolate MLX inference tests in separate pytest process

6c6f7fc

MLX Metal device crashes with SIGABRT after ~130 tests in a single process on CI runners. Run test_generate.py in a fresh process to avoid Metal resource exhaustion from prior test allocations.

github-actions Bot added the github-actions GitHub Actions related label Apr 20, 2026

skyoo2003 self-assigned this Apr 20, 2026

skyoo2003 added 2 commits April 20, 2026 19:56

github-actions Bot added the stale No activity for 14+ days label Apr 28, 2026

github-actions Bot closed this May 13, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Mamba-3 MLX port + pipeline/benchmark correctness fixes#14

feat: Mamba-3 MLX port + pipeline/benchmark correctness fixes#14
skyoo2003 wants to merge 6 commits into
mainfrom
feat/mamba3-port-and-pipeline-fixes

skyoo2003 commented Apr 20, 2026 •

edited

Loading

Uh oh!

sourcery-ai Bot commented Apr 20, 2026 •

edited

Loading

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

skyoo2003 commented Apr 20, 2026

Uh oh!

sourcery-ai Bot left a comment

Uh oh!

Uh oh!

github-actions Bot commented Apr 28, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

skyoo2003 commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

sourcery-ai Bot commented Apr 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviewer's Guide

File-Level Changes

Interacting with Sourcery

Customizing Your Experience

Getting Help

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

skyoo2003 commented Apr 20, 2026

Uh oh!

sourcery-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions Bot commented Apr 28, 2026

Uh oh!

github-actions Bot commented May 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

skyoo2003 commented Apr 20, 2026 •

edited

Loading

sourcery-ai Bot commented Apr 20, 2026 •

edited

Loading