feat(ENG-12699): TypeScript parity and synced ONNX bundle #8

Merged
hiskudin merged 4 commits into main from feat/python-0.6.1-ts-parity-onnx on Apr 22, 2026

@hiskudin hiskudin commented Apr 21, 2026

Collaborator

Summary

This release aligns stackone-defender Python with the current @stackone/defender TypeScript behavior and refreshes the bundled MiniLM ONNX assets so byte-for-byte hashes match the TS repo.

What changed

  • Config and sanitizer: Dangerous-key filtering, fractional cumulative risk thresholds, and traversal hardening (__proto__, constructor, prototype).
  • Tier 2 and ONNX: Packed-chunk Tier 2 flow, density-adjusted scoring, and memory-bounded ONNX batch chunking (max batch size 32).
  • Optional SFE: use_sfe, bundled sfe/model.ftz, and fasttext-wheel optional extra.
  • Types / API: DefenseResult.fields_dropped, truncated_at_depth, and related create_config merges.
  • Models: minilm-full-aug/ (model_quantized.onnx, config.json, tokenizer.json, tokenizer_config.json) copied from defender so Python matches TS.
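
The dangerous-key filtering and depth cap listed above could look roughly like the sketch below. It is only an illustration: `strip_dangerous_keys`, `MAX_DEPTH`, and the choice to replace over-deep containers with `None` are assumptions, not the package's actual API or behavior.

```python
from typing import Any, List

# Key names taken from the PR description; the constant name is an assumption.
DANGEROUS_KEYS = frozenset({"__proto__", "constructor", "prototype"})
MAX_DEPTH = 64  # hypothetical cap, not the package's real value


def strip_dangerous_keys(value: Any, removed: List[str], depth: int = 0,
                         max_depth: int = MAX_DEPTH) -> Any:
    """Return a copy of `value` with dangerous keys dropped.

    Dropped key names are appended to `removed`. Containers nested at or
    beyond `max_depth` are replaced with None (an assumed truncation policy).
    """
    if isinstance(value, (dict, list)) and depth >= max_depth:
        return None
    if isinstance(value, dict):
        cleaned = {}
        for key, child in value.items():
            if key in DANGEROUS_KEYS:
                removed.append(key)
                continue
            cleaned[key] = strip_dangerous_keys(child, removed, depth + 1, max_depth)
        return cleaned
    if isinstance(value, list):
        return [strip_dangerous_keys(item, removed, depth + 1, max_depth) for item in value]
    return value


removed_keys: List[str] = []
payload = {"a": 1, "__proto__": {"x": 1}, "b": [{"constructor": 2, "c": 3}]}
clean = strip_dangerous_keys(payload, removed_keys)
```

Here `clean` keeps only `a` and the nested `c`, and `removed_keys` records which keys were dropped, which is the kind of information a field like `dangerous_keys_removed` would carry.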

Testing

uv run pytest: 188 passed.

Made with Cursor


Summary by cubic

Aligns stackone-defender Python with @stackone/defender 0.6.1 and syncs the MiniLM ONNX bundle. Meets ENG-12699 parity with packed-chunk Tier 2, optional SFE preprocessing, and traversal hardening for safer, more accurate detection.

  • New Features

    • Tier 2: sentence packing with token-bounded chunks, memory-safe batch chunking (32 max), and density-adjusted scoring.
    • Optional SFE: use_sfe flag with bundled sfe/model.ftz; install via stackone-defender[sfe]; uses fasttext-ng; fails open if unavailable.
    • Sanitizer/traversal: drops dangerous keys (__proto__, constructor, prototype), adds fractional cumulative-risk thresholds and stack-depth cap.
    • API/Types: DefenseResult.fields_dropped, DefenseResult.truncated_at_depth, and SanitizationMetadata.dangerous_keys_removed; improved create_config merges; MiniLM artifacts synced to match TS.
  • Bug Fixes

    • Tier 2 scoping mirrors TS: when tier2_fields is None, scope to Tier 1 risky_field_names when present; otherwise scan all strings.
    • ONNX token counting excludes padding to keep chunk splitting accurate.
    • Cumulative risk thresholds now merge with defaults and support partial custom dicts; SFE predictor loading is thread-safe.
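
As a rough illustration of the sentence packing with token-bounded chunks mentioned above, the sketch below greedily packs sentences until the next one would exceed a token budget. `pack_sentences` and the whitespace token counter are hypothetical stand-ins; the real classifier would count tokens with its own tokenizer, and special tokens would add to the totals.

```python
from typing import Callable, List


def pack_sentences(sentences: List[str],
                   count_tokens: Callable[[str], int],
                   max_tokens: int) -> List[str]:
    """Greedily pack consecutive sentences into chunks of at most `max_tokens`.

    A single sentence longer than the budget becomes its own chunk; this
    sketch does not split it further.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_tokens = 0
    for sentence in sentences:
        n = count_tokens(sentence)
        # Flush the current chunk if adding this sentence would overflow it.
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks


# Whitespace splitting is only a placeholder for a real tokenizer.
def whitespace_count(text: str) -> int:
    return len(text.split())


packed = pack_sentences(["a b c", "d e", "f g h i", "j"], whitespace_count, max_tokens=5)
```

Packing only works if the token counter reports real tokens rather than the padded length, which is why the padding fix in the list above matters for this flow.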

Written for commit bf173ac. Summary will update on new commits.

…ndle

- Port dangerous-key filtering, fractional cumulative risk, and traversal config.
- Add packed-chunk Tier 2 flow, density adjustment, and ONNX batch chunking.
- Add optional SFE (fasttext) with bundled model and extras.
- Sync minilm-full-aug artifacts (quantized ONNX, tokenizer, config) with @stackone/defender.
- Bump version and release metadata; update changelog and README.

Made-with: Cursor
Copilot AI review requested due to automatic review settings April 21, 2026 12:47
@hiskudin hiskudin changed the title Release 0.6.1: TypeScript parity and synced ONNX bundle feat(ENG-12699): TypeScript parity and synced ONNX bundle Apr 21, 2026
fasttext-wheel 0.9.2 has no cp313 wheels; resolving it in the dev group
forced a broken sdist build on GitHub Actions. Remove it from dev deps
(SFE tests use mocks). Gate the [sfe] extra with a Python version marker
and document 3.13 behavior in the README.

Made-with: Cursor

Copilot AI left a comment

Pull request overview

Release 0.6.1 updates the Python stackone-defender package to match the current TypeScript @stackone/defender behavior, including refreshed bundled MiniLM ONNX assets and new preprocessing/scoring behavior.

Changes:

  • Adds optional SFE preprocessing (use_sfe) with bundled FastText model support (fail-open when unavailable).
  • Updates Tier 2 flow to packed-chunk batching, density-adjusted scoring, and ONNX batch chunking to bound memory.
  • Hardens traversal/sanitization: dangerous key filtering (__proto__, constructor, prototype) and fractional cumulative-risk thresholds.
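
The memory-bounding idea behind ONNX batch chunking can be sketched as below: inputs are scored in slices of at most 32 so that no single inference call sees an unbounded batch. `score_in_batches` and the `score_batch` callback are hypothetical names; only the cap of 32 comes from the PR.

```python
from typing import Callable, List, Sequence

MAX_BATCH_SIZE = 32  # cap named in the PR description


def score_in_batches(texts: Sequence[str],
                     score_batch: Callable[[Sequence[str]], List[float]],
                     max_batch_size: int = MAX_BATCH_SIZE) -> List[float]:
    """Score `texts` in slices of at most `max_batch_size`, preserving order."""
    scores: List[float] = []
    for start in range(0, len(texts), max_batch_size):
        scores.extend(score_batch(texts[start:start + max_batch_size]))
    return scores


# Placeholder scorer that records the batch sizes it sees.
seen_batch_sizes: List[int] = []


def fake_score_batch(batch: Sequence[str]) -> List[float]:
    seen_batch_sizes.append(len(batch))
    return [float(len(text)) for text in batch]


all_scores = score_in_batches(["x" * i for i in range(70)], fake_score_batch)
```

With 70 inputs the placeholder scorer is called three times, and the results come back in input order.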

Reviewed changes

Copilot reviewed 20 out of 23 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
| --- | --- |
| uv.lock | Adds fasttext-wheel (and transitive deps) and bumps project version to 0.6.1 with new extras metadata. |
| pyproject.toml | Bumps version to 0.6.1; adds sfe extra and dev dependency for fasttext-wheel. |
| README.md | Documents SFE extra/usage; updates Tier 2 description; documents new DefenseResult fields. |
| CHANGELOG.md | Adds 0.6.1 release notes and breaking-change callouts. |
| .release-please-manifest.json | Updates manifest version to 0.6.1. |
| src/stackone_defender/config.py | Introduces DANGEROUS_KEYS + MAX_TRAVERSAL_DEPTH; deep-copies defaults; adds fractional cumulative-risk thresholds. |
| src/stackone_defender/types.py | Extends config/metadata/result types for fractional thresholds, dangerous-key reporting, and new result fields. |
| src/stackone_defender/core/tool_result_sanitizer.py | Filters dangerous keys during traversal; adjusts cumulative risk accounting to support fractional thresholds. |
| src/stackone_defender/core/prompt_defense.py | Adds use_sfe; switches Tier 2 to chunk prep + batched chunk scoring; reports fields_dropped/truncated_at_depth. |
| src/stackone_defender/sfe/preprocess.py | New SFE preprocessing implementation with predictor caching and depth-bounded traversal. |
| src/stackone_defender/sfe/__init__.py | Exports SFE public API. |
| src/stackone_defender/classifiers/onnx_classifier.py | Adds bounded batch chunking and token counting/max-length helpers. |
| src/stackone_defender/classifiers/tier2_classifier.py | Adds chunk preparation + packed-sentence chunking path; batch chunk passthrough API. |
| src/stackone_defender/__init__.py | Exposes SFE symbols at package top-level. |
| src/stackone_defender/models/minilm-full-aug/config.json | Syncs bundled model metadata with TS assets. |
| src/stackone_defender/models/minilm-full-aug/tokenizer_config.json | Syncs tokenizer config with TS assets. |
| tests/test_tier2_classifier.py | Adds tests for prepare_chunks skipping and chunk-batch passthrough. |
| tests/test_onnx_classifier.py | Adds test coverage for ONNX batch chunking behavior. |
| tests/test_sfe.py | New tests for SFE preprocessing and PromptDefense integration (fields_dropped). |
| tests/test_integration.py | Adds dangerous-key removal test; updates Tier 2 scoping tests to new chunk-based flow. |

Comment thread src/stackone_defender/core/prompt_defense.py Outdated
Comment thread src/stackone_defender/core/tool_result_sanitizer.py
Comment thread tests/test_integration.py Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

7 issues found across 23 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="README.md">

<violation number="1" location="README.md:168">
P2: Use `field(default_factory=list)` instead of `[]` for the dataclass list default in the README snippet.</violation>
</file>

<file name="src/stackone_defender/core/tool_result_sanitizer.py">

<violation number="1" location="src/stackone_defender/core/tool_result_sanitizer.py:277">
P2: Accessing `medium_fraction`/`patterns_fraction` without defaults can raise `KeyError` for valid custom threshold dicts that omit the new keys.</violation>
</file>

<file name="src/stackone_defender/config.py">

<violation number="1" location="src/stackone_defender/config.py:90">
P2: `tool_overrides` is shallow-copied, so nested lists are shared with global defaults and can be mutated across configs.</violation>
</file>

<file name="src/stackone_defender/classifiers/onnx_classifier.py">

<violation number="1" location="src/stackone_defender/classifiers/onnx_classifier.py:134">
P2: `count_tokens` returns padded sequence length, not the actual token count, because tokenizer padding is enabled globally.</violation>
</file>

<file name="src/stackone_defender/sfe/preprocess.py">

<violation number="1" location="src/stackone_defender/sfe/preprocess.py:66">
P2: TOCTOU race: the lock is released between the cache-miss check and the model load, so concurrent threads can each load the model redundantly. Hold the lock across the full check-and-populate block to prevent duplicate expensive loads.</violation>

<violation number="2" location="src/stackone_defender/sfe/preprocess.py:200">
P2: Depth tracking is inconsistent between `_extract_fields` (arrays don't increment `depth`) and `_filter_by_paths` (arrays do increment `depth`). For deeply nested array structures, fields extracted for drop-classification may not be reachable by the filter, so they silently survive. Either both functions should count array levels the same way, or `_filter_by_paths` should mirror `_extract_fields` by using a separate `stack_depth` parameter.</violation>
</file>
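
The TOCTOU fix suggested for the predictor cache amounts to holding one lock across both the cache check and the load. A minimal sketch follows; `get_predictor`, `_load_model`, and the `load_calls` counter are hypothetical stand-ins, not the module's real names.

```python
import threading

_predictor = None
_predictor_lock = threading.Lock()
load_calls = 0  # counts how often the (pretend) expensive load runs


def _load_model() -> object:
    """Hypothetical stand-in for an expensive model load."""
    global load_calls
    load_calls += 1
    return object()


def get_predictor() -> object:
    """Return the cached predictor, loading it at most once.

    The lock is held across the check AND the load, so two threads can
    never both observe an empty cache and load the model twice.
    """
    global _predictor
    with _predictor_lock:
        if _predictor is None:
            _predictor = _load_model()
        return _predictor


results = []
threads = [threading.Thread(target=lambda: results.append(get_predictor())) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off is that every caller takes the lock, even after the model is cached; for a load-once predictor that cost is usually negligible compared with a duplicate load.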

<file name="src/stackone_defender/classifiers/tier2_classifier.py">

<violation number="1" location="src/stackone_defender/classifiers/tier2_classifier.py:133">
P1: `count_tokens` always returns 256 because the tokenizer has `enable_padding(length=256)` set, so `len(encoding.ids)` includes padding tokens. Since `get_max_length()` also returns 256, the condition `total_tokens <= model_max_len` is always true and the entire chunk-splitting branch below is dead code. The same applies to `prepare_chunks` and `_pack_sentences`.

`count_tokens` should strip padding tokens before returning, e.g. by counting non-pad ids or using `len(encoding.tokens)` without padding, or by temporarily disabling padding for the count.</violation>
</file>
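
The padding problem described in the P1 finding can be reproduced without a real tokenizer. In the sketch below, `FakeEncoding` and `fake_encode` are stand-ins that mimic an encoding padded to a fixed length; the point is only that `len(ids)` reports the padded length while summing the attention mask (1 for real tokens, 0 for padding) gives the real count.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeEncoding:
    """Stand-in for a tokenizer encoding padded to a fixed length."""
    ids: List[int]
    attention_mask: List[int]


def fake_encode(text: str, length: int = 8) -> FakeEncoding:
    """Pretend tokenizer: one token per whitespace-separated word, padded with id 0."""
    n = min(len(text.split()), length)
    pad = length - n
    return FakeEncoding(ids=list(range(1, n + 1)) + [0] * pad,
                        attention_mask=[1] * n + [0] * pad)


def count_tokens_padded(encoding: FakeEncoding) -> int:
    # Buggy variant: always equals the padded length.
    return len(encoding.ids)


def count_tokens(encoding: FakeEncoding) -> int:
    # Fixed variant: padding positions have mask 0 and are not counted.
    return sum(encoding.attention_mask)


enc = fake_encode("ignore all previous instructions")
```

With the buggy variant every input appears to be exactly as long as the model maximum, so a check such as `total_tokens <= model_max_len` can never fail and the splitting branch never runs.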

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Comment thread src/stackone_defender/classifiers/tier2_classifier.py
Comment thread README.md Outdated
Comment thread src/stackone_defender/core/tool_result_sanitizer.py
Comment thread src/stackone_defender/config.py Outdated
Comment thread src/stackone_defender/classifiers/onnx_classifier.py Outdated
Comment thread src/stackone_defender/sfe/preprocess.py
Comment thread src/stackone_defender/sfe/preprocess.py Outdated

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 4 files (changes from recent commits).

Requires human review: Auto-approval blocked by 7 unresolved issues from previous reviews.

fasttext-wheel lacks reliable cp313 wheels and fails sdist builds on CI.
fasttext-ng provides the same fasttext import namespace, supports Python 3.11+,
and declares numpy>=2.3. Add it to dev so SFE-related tests run with the
real module when available; refresh lockfile and docs.

Made-with: Cursor
- Tier 2 string extraction: when tier2_fields is None, scope to Tier 1
  risky_field_names when present; else all strings. Align integration test.
- ONNX count_tokens: sum attention_mask so padded length does not disable
  chunk splitting; add regression test.
- Cumulative escalation: merge defaults into sanitizer thresholds; use .get
  with defaults in _should_escalate for partial custom dicts.
- create_config: deep-copy tool_overrides list values.
- SFE: hold predictor lock across import/load; align list depth in filter/compact.
- README DefenseResult snippet: field(default_factory=list).
- Tier2Config docstring: clarify None vs empty list semantics.

Made-with: Cursor
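
The cumulative-escalation fix in this commit (merge defaults, then read with `.get`) might be sketched as follows. The key names `medium_fraction` and `patterns_fraction` come from the review comment; the default values, the function names, and the escalation rule itself are assumptions made for illustration.

```python
from typing import Dict, Optional

# Hypothetical default values; the package's real defaults are not shown in this PR.
DEFAULT_THRESHOLDS: Dict[str, float] = {"medium_fraction": 0.5, "patterns_fraction": 0.25}


def merge_thresholds(custom: Optional[Dict[str, float]]) -> Dict[str, float]:
    """Overlay a partial custom dict on top of the defaults."""
    return {**DEFAULT_THRESHOLDS, **(custom or {})}


def should_escalate(medium_hits: int, pattern_hits: int, total_fields: int,
                    thresholds: Dict[str, float]) -> bool:
    """Assumed rule: escalate when either hit fraction reaches its threshold."""
    if total_fields == 0:
        return False
    # .get with a default means a dict that omits a key cannot raise KeyError.
    medium_limit = thresholds.get("medium_fraction", DEFAULT_THRESHOLDS["medium_fraction"])
    patterns_limit = thresholds.get("patterns_fraction", DEFAULT_THRESHOLDS["patterns_fraction"])
    return (medium_hits / total_fields >= medium_limit
            or pattern_hits / total_fields >= patterns_limit)


merged = merge_thresholds({"medium_fraction": 0.8})
```

Doing both the merge and the `.get` is redundant by design: the merge covers the normal configuration path, and the `.get` protects callers that pass a raw partial dict directly.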

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 5 files (changes from recent commits).

Requires human review: Auto-approval blocked by 7 unresolved issues from previous reviews.

@cubic-dev-ai cubic-dev-ai Bot left a comment

0 issues found across 9 files (changes from recent commits).

Requires human review: Significant alignment PR with core logic changes, new dependencies, and API shape modifications. Not a low-risk or trivial update.

@hiskudin hiskudin merged commit 0449800 into main Apr 22, 2026
8 checks passed
@hiskudin hiskudin deleted the feat/python-0.6.1-ts-parity-onnx branch April 22, 2026 15:12

3 participants