Pensar - auto fix for Unbounded N-gram Processing Resource Exhaustion#11

Open: pensarappdev[bot] wants to merge 1 commit into main from pensar-auto-fix-dYkW

pensarappdev (bot) commented May 7, 2025

Secured with Pensar

The only security vulnerability identified was a resource exhaustion (DoS) issue in the create_topics() function, caused by unbounded growth of its n-gram dictionaries. The following changes were made exclusively in that function:

  • Introduced two constants: MAX_NGRAMS_TOTAL (the maximum number of unique n-grams tracked across the corpus) and MAX_NGRAMS_PER_DOC (the maximum number of unique n-grams per document).
  • Before adding a unigram or bigram to the tracking dictionaries, the function checks whether a new unique n-gram would exceed these caps:
    • If the n-gram is already tracked, its count is incremented.
    • If the n-gram is new and the cap has been reached, it is skipped.
    • A per-document counter is incremented as n-grams are added, and additions for that document stop once the per-document cap is reached.
  • This bounds both per-document and global memory use when the input corpus is malicious, extremely large, or highly unique, preventing memory exhaustion and denial of service.
  • Minor: if documents is empty, the function no longer attempts pop(-1), which could raise an error.
  • In the empty-cluster case, rarest_ngrams and rarest_ngram are now reset to their default values, to match the exception-handling tweak in the proposed fix.
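
The capping logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: count_ngrams_capped is a hypothetical helper, the cap values are placeholders (the description names the constants but not their values), tokenization by whitespace is an assumption, and a single dictionary is used here where create_topics() apparently keeps separate unigram and bigram dictionaries. The per-document counter is assumed to count newly added unique n-grams.

```python
from collections import defaultdict

# Placeholder values; the PR description does not state the real ones.
MAX_NGRAMS_TOTAL = 100_000
MAX_NGRAMS_PER_DOC = 1_000


def count_ngrams_capped(documents, max_total=MAX_NGRAMS_TOTAL,
                        max_per_doc=MAX_NGRAMS_PER_DOC):
    """Count unigrams and bigrams while bounding the number of tracked keys."""
    counts = defaultdict(int)
    for doc in documents:
        tokens = doc.split()  # assumption: whitespace tokenization
        ngrams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
        new_in_doc = 0  # unique n-grams this document has added so far
        for ngram in ngrams:
            if ngram in counts:
                # Already tracked: counting it costs no extra memory.
                counts[ngram] += 1
                continue
            if len(counts) >= max_total or new_in_doc >= max_per_doc:
                # New n-gram, but a cap is reached: skip it.
                continue
            counts[ngram] = 1
            new_in_doc += 1
    return dict(counts)


if __name__ == "__main__":
    # With a global cap of 4, only the first four unique n-grams are tracked.
    print(count_ngrams_capped(["a b c", "a b d e"], max_total=4, max_per_doc=10))
```

Because already-tracked n-grams are still counted after a cap is hit, frequent terms keep accurate counts; only the long tail of new keys is dropped. The trade-off is that results depend on document order.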

No other parts of the file were changed, and there were no dependency (package) issues to address. Formatting was preserved.

More Details
| Type | Identifier | Message | Severity | Link |
| --- | --- | --- | --- | --- |
| Application | CWE-400, CWE-248 | create_topics counts every unigram and bigram across all supplied documents without any hard cap or size check. A malicious user can submit an extremely large corpus, or documents containing millions of unique n-grams, to force the function to allocate arbitrarily large defaultdict structures. This leads to excessive CPU and memory consumption, potentially exhausting system resources and crashing the service. The pattern constitutes Uncontrolled Resource Consumption (CWE-400). No existing safeguard, streaming, or pruning logic is present to mitigate this risk. | medium | Link |
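
To make the reported pattern concrete, the sketch below shows how uncapped counting grows with the number of unique n-grams in the input. It is an illustration only: count_ngrams_unbounded is a hypothetical stand-in, not the project's create_topics(), and the whitespace tokenization and synthetic document are assumptions.

```python
from collections import defaultdict


def count_ngrams_unbounded(documents):
    """Count every unigram and bigram with no cap (the vulnerable pattern)."""
    unigrams = defaultdict(int)
    bigrams = defaultdict(int)
    for doc in documents:
        tokens = doc.split()  # assumption: whitespace tokenization
        for token in tokens:
            unigrams[token] += 1
        for pair in zip(tokens, tokens[1:]):
            bigrams[pair] += 1
    return unigrams, bigrams


# A synthetic document in which every token is unique.
doc = " ".join(f"tok{i}" for i in range(10_000))
unigrams, bigrams = count_ngrams_unbounded([doc])
print(len(unigrams), len(bigrams))  # 10000 9999
```

Every unique token adds one unigram key and, on average, one bigram key, so dictionary size scales linearly with attacker-controlled input.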
