Contributing

DwarfStar4 changes should be tested against the failure mode they can realistically affect. The project has two regression tracks: correctness and speed. Please include the commands you ran, the machine/backend, the model quant, and any notable failures in the PR or commit notes.

Do not send PRs that affect one or more inference backends without checking that the resulting code is still correct and fast. The only acceptable speed regression is one required to fix an important correctness bug.

Correctness Regression Tests

Build the default backend first:

make clean
make

The C test runner is ds4_test. Running it without arguments is equivalent to --all:

make test

Useful narrower checks:

./ds4_test --server
./ds4_test --logprob-vectors
./ds4_test --long-context
./ds4_test --tool-call-quality
./ds4_test --metal-kernels

What they cover:

  • --server: request parsing, chat rendering, streaming, tool-call parsing, thinking controls, KV disk-cache bookkeeping, and other server-side logic. This is the best quick check for API and prompt-rendering changes.
  • --logprob-vectors: compares local token bytes and top-logprob slices against official DeepSeek V4 Flash continuation vectors. This catches tokenizer, template, attention, and logits regressions.
  • --long-context: runs a long-context story fact-recall regression from tests/long_context_story_prompt.txt. The model must retrieve spelled-out person-number assignments from a long prose prompt and return Name=number lines that the test parses.
  • --tool-call-quality: exercises actual model behavior for DSML tool-call emission in both fast and exact paths.
  • --metal-kernels: isolated Metal kernel numeric checks.
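As an illustration of the kind of check --long-context performs, here is a minimal Python sketch that parses Name=number lines and reports which assignments the model got wrong. The real check lives in the C test runner; the function names, the accepted name characters, and the sample data below are assumptions made for this example only.

```python
import re

# Assumed line format: a name, '=', and a non-negative integer.
# The actual parser in ds4_test may accept a different grammar.
_LINE_RE = re.compile(r"^\s*([A-Za-z][A-Za-z .'-]*?)\s*=\s*(\d+)\s*$")


def parse_assignments(text):
    """Return {name: number} for every line that looks like Name=number."""
    found = {}
    for line in text.splitlines():
        match = _LINE_RE.match(line)
        if match:
            found[match.group(1)] = int(match.group(2))
    return found


def recall_errors(expected, model_output):
    """List the names whose number is missing or wrong in the model output."""
    got = parse_assignments(model_output)
    return sorted(name for name, number in expected.items()
                  if got.get(name) != number)


if __name__ == "__main__":
    # Hypothetical expected facts and model reply.
    expected = {"Alice": 7, "Bob": 12}
    reply = "Alice=7\nsome chatter\nBob = 13\n"
    print(recall_errors(expected, reply))
```

Lines that do not match the pattern are ignored rather than treated as failures, so extra chatter in the model output does not hide a correct answer.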

The runner defaults to ds4flash.gguf. Override paths when needed:

DS4_TEST_MODEL=/path/to/model.gguf ./ds4_test --logprob-vectors
DS4_TEST_VECTOR_FILE=/path/to/official.vec ./ds4_test --logprob-vectors
DS4_TEST_LONG_PROMPT=/path/to/prompt.txt ./ds4_test --long-context

For CUDA-specific changes, test on a CUDA machine:

make
make cuda-regression

For CPU portability, at least verify that the CPU target still builds:

make cpu

The CPU backend is a reference/debug path, not the production performance target. Be aware that running the CPU path on a Metal (macOS) machine can crash the system because of a macOS kernel bug.

Quality Checks For Quantization Changes

For GGUF or quantization work, use the official-continuation scorer in gguf-tools/quality-testing. The test compares how much probability a local GGUF assigns to official DeepSeek V4 Flash continuations, token by token.

Build the scorer:

make -C gguf-tools quality-score

Then score old and new GGUFs against the same manifest and compare:

gguf-tools/quality-testing/score_official OLD.gguf \
  gguf-tools/quality-testing/data/manifest.tsv /tmp/old.tsv 4096

gguf-tools/quality-testing/score_official NEW.gguf \
  gguf-tools/quality-testing/data/manifest.tsv /tmp/new.tsv 4096

python3 gguf-tools/quality-testing/compare_scores.py /tmp/old.tsv /tmp/new.tsv

Lower avg_nll is better. See gguf-tools/quality-testing/README.md for collecting or refreshing official continuations.
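To make the metric concrete, the sketch below computes an average negative log-likelihood from per-token probabilities using the standard definition. It is not the scorer's code: the log base, the TSV layout, and the sample probabilities are assumptions, and the function name is hypothetical.

```python
import math


def avg_nll(token_probs):
    """Mean negative log-likelihood (in nats) over the continuation tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


# Made-up probabilities that two GGUFs assign to the same official tokens.
old = avg_nll([0.90, 0.50, 0.80])
new = avg_nll([0.95, 0.60, 0.85])

if __name__ == "__main__":
    print(f"old={old:.4f} new={new:.4f} delta={new - old:+.4f}")
```

A GGUF that assigns more probability to each official token gets a smaller value, which is why a negative delta between the old and new files indicates an improvement.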

Speed Regression Tests

Use ds4-bench for throughput regressions. It reports instantaneous prefill and generation speed at context frontiers, not one whole-run average. Prefill is incremental: each row measures only the newly processed suffix since the previous frontier.
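The difference between an incremental measurement and a whole-run average can be sketched as follows. This is an illustration of the idea only, not ds4-bench's implementation; the function name and the timing numbers are hypothetical.

```python
def incremental_prefill_tps(frontiers, elapsed_seconds):
    """Tokens per second for each newly processed suffix.

    frontiers: increasing context sizes, e.g. [2048, 4096, 6144].
    elapsed_seconds: time spent prefilling the suffix ending at each frontier.
    """
    if len(frontiers) != len(elapsed_seconds):
        raise ValueError("need one timing per frontier")
    rates = []
    previous = 0
    for frontier, seconds in zip(frontiers, elapsed_seconds):
        new_tokens = frontier - previous  # only the suffix since the last row
        rates.append(new_tokens / seconds)
        previous = frontier
    return rates


if __name__ == "__main__":
    # Made-up timings: the second 2048 tokens take twice as long as the first.
    print(incremental_prefill_tps([2048, 4096], [1.0, 2.0]))
```

With these made-up timings the second row shows half the speed of the first, whereas a single whole-run average (4096 tokens over 3 seconds) would blur the slowdown at the deeper context.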

Default linear sweep:

./ds4-bench \
  -m ds4flash.gguf \
  --prompt-file speed-bench/promessi_sposi.txt \
  --ctx-start 2048 \
  --ctx-max 65536 \
  --step-incr 2048 \
  --gen-tokens 128 \
  --csv /tmp/ds4-speed.csv

Use the same machine, backend, model file, context sweep, power/thermal state, and background load when comparing two commits. For backend work, produce at least one before/after pair of CSVs and compare both prefill_tps and gen_tps. Generation is greedy and skips EOS so each frontier gets the same number of generated tokens.
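A before/after comparison can be scripted in a few lines. The sketch below assumes only that both CSVs contain prefill_tps and gen_tps columns and that their rows come from the same sweep in the same order; any other columns and the helper names are assumptions, not part of ds4-bench.

```python
import csv
import io


def load_rows(csv_text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def percent_changes(before_text, after_text,
                    columns=("prefill_tps", "gen_tps")):
    """Per-row percent change (after vs. before) for each speed column."""
    before, after = load_rows(before_text), load_rows(after_text)
    if len(before) != len(after):
        raise ValueError("CSVs must come from the same context sweep")
    changes = []
    for old, new in zip(before, after):  # rows are matched by position
        row = {}
        for column in columns:
            base = float(old[column])
            row[column] = 100.0 * (float(new[column]) - base) / base
        changes.append(row)
    return changes


if __name__ == "__main__":
    # Made-up numbers standing in for two ds4-bench runs.
    before = "prefill_tps,gen_tps\n100,20\n200,40\n"
    after = "prefill_tps,gen_tps\n110,18\n200,40\n"
    for change in percent_changes(before, after):
        print(change)
```

To use it on real runs, read each file with open(path).read() and pass the two strings in. Negative values mark rows where the new commit is slower.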

To generate a graph for a CSV:

python3 speed-bench/plot_speed.py /tmp/ds4-speed.csv --title "Machine t/s"

Reporting session bugs

For debugging a failing generation, keep the trace:

./ds4-server --trace /tmp/ds4-trace.txt ...