
MemoryManager.remember(sync=True) dominates bulk ingest; YARA p95 plyara tail #72

@rolandpg


Spun out of PR #70 (Phase 3: detection rules first-class); tracks Phase 4
performance-benchmark findings #1 and #2.

Finding #1 — sync=True is the bulk-ingest bottleneck

Both the Sigma and YARA ingest paths currently call
mm.remember(..., sync=True), so each note is persisted, vector-indexed,
and enrichment-flushed inline before control returns to the caller.
Under bulk ingest (SigmaHQ ~3k rules, CCCS-Yara ~400 rules) this is the
dominant cost.

Phase 4 bench numbers (paste from the perf report when wiring this up):

  • Sigma: ingest_rules_dir on 4 fixtures — ~4s wall, ~95% in
    remember(sync=True).
  • YARA: same pattern — parse is microseconds, persistence is seconds.
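The ~95% attribution comes from per-phase wall-clock accounting; a
minimal sketch of such a harness is below (the names `bench_ingest`,
`parse`, and `remember` here are illustrative stand-ins, not the
project's API):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(costs, phase):
    """Accumulate wall time per phase into the costs dict."""
    t0 = time.perf_counter()
    try:
        yield
    finally:
        costs[phase] = costs.get(phase, 0.0) + (time.perf_counter() - t0)

def bench_ingest(rules, parse, remember):
    """Attribute ingest wall time to parsing vs. persistence."""
    costs = {}
    for raw in rules:
        with timed(costs, "parse"):
            parsed = parse(raw)
        with timed(costs, "remember"):
            remember(parsed)  # the sync=True path: persist + index inline
    return costs
```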

Finding #2 — YARA p95 plyara tail

plyara.Plyara().parse_string has a heavy latency tail under repeated
invocations on large rule files: p50 is fine, but p95 can exceed p50 by
10x on ~50 kB multi-rule files (observed on the 3 CCCS-Yara fixtures
under a tight loop).
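A tail like this can be reproduced with a small tight-loop harness
(a sketch; in the real bench, fn would wrap plyara parsing of a ~50 kB
multi-rule fixture, which is an assumption about the setup, not code
from the repo):

```python
import statistics
import time

def latency_percentiles(fn, payload, n=200):
    """Time n repeated calls of fn(payload); return (p50, p95) in seconds."""
    samples = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(payload)
        samples.append(time.perf_counter() - t0)
    cuts = statistics.quantiles(samples, n=100)  # 99 percentile cut points
    return cuts[49], cuts[94]                    # p50, p95
```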

Ask

Benchmark mm.remember(..., sync=False) plus an explicit mm.flush() at
the end of ingest_rules_dir, and/or introduce a bulk=True path on
MemoryManager that defers the vector-index write. Add a CI benchmark (a
pytest bench plugin is fine) that fails if p95 exceeds a threshold on
the fixtures tree.
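The deferred-flush shape can be sketched with a stand-in MemoryManager
(the remember(sync=...) and flush() signatures are assumed from this
issue, not verified against the real class):

```python
class MemoryManager:
    """Minimal stand-in for the real MemoryManager (assumed API)."""
    def __init__(self):
        self.persisted = []   # durable note store
        self.indexed = []     # completed vector-index writes
        self._pending = []    # notes awaiting a batched index write

    def remember(self, note, sync=True):
        self.persisted.append(note)
        if sync:
            self.indexed.append(note)   # current path: index inline, per call
        else:
            self._pending.append(note)  # proposed path: defer until flush()

    def flush(self):
        self.indexed.extend(self._pending)  # one batched index write
        self._pending.clear()

def ingest_rules_dir(mm, rules):
    # Bulk ingest pays only parse/persist cost per rule; the expensive
    # index write happens once at the end.
    for rule in rules:
        mm.remember(rule, sync=False)
    mm.flush()
```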

For plyara: cache the Plyara() instance per directory walk (it is
currently created per call, so we pay the grammar-compile cost ~400x on
CCCS-Yara). Confirm thread-safety before sharing an instance across
workers.
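One shape for the cache, assuming plyara is not thread-safe (unverified),
is a thread-local parser instance; StubParser below is a stand-in with
plyara's parse_string/clear surface so the sketch runs self-contained:

```python
import threading

class ParserCache:
    """One cached parser per thread instead of one per parse call."""
    def __init__(self, factory):
        # factory would be plyara.Plyara in the real code (assumption).
        self._factory = factory
        self._local = threading.local()
        self.creations = 0  # shows we stop paying construction cost per call

    def parse(self, text):
        parser = getattr(self._local, "parser", None)
        if parser is None:
            parser = self._factory()
            self._local.parser = parser
            self.creations += 1
        try:
            return parser.parse_string(text)
        finally:
            # plyara parsers accumulate state across calls, so reset
            # between files when reusing an instance.
            parser.clear()

class StubParser:
    """Stand-in exposing plyara's parse_string/clear surface."""
    def parse_string(self, text):
        return [{"raw": text}]

    def clear(self):
        pass
```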

Deliberately NOT in PR #70

Changing the sync/async boundary risks regressions in existing ingest
paths (OpenCTI sync, enrichment worker), so it needs its own PR with
real before/after numbers and a regression bench.
