Pensar - auto fix for 2 issues (CWE-22, ML02) #30

Open: pensarappstaging[bot] wants to merge 1 commit into main from pensar-auto-fix-Sjb3

@pensarappstaging

Secured with Pensar

  1. Directory Traversal Vulnerability Fix:

    • Added a _sanitize_filename method to the Profile class. This method removes directory traversal sequences (../), slashes, and any characters not in the safe set (a-zA-Z0-9._-). It also truncates overly long usernames.
    • Replaced all filename constructions using user input (in both create_social_profile_tweepy and create_social_profile_sns) with calls to this sanitizer. Now, filenames for storing tweets cannot escape the outdir or contain unsafe characters, mitigating directory traversal.
  2. ML Pipeline Data Poisoning Fix:

    • Introduced a static method _is_valid_tweet_content, which applies a variety of structural and heuristic checks to each tweet before it is accepted for downstream ML processing:
      • Checks content type and length bounds (30–400), rejects excessive URLs, mentions, or hashtags, blocks control and invisible Unicode characters, and filters out repeated content or high character repetition.
      • Rejects text in which more than 30% of the characters are neither Latin nor punctuation.
    • For tweets loaded from disk, only those passing _is_valid_tweet_content are appended to all_tweets; otherwise, they're logged and discarded.
    • For newly scraped tweets, before addition to all_tweets and to disk, content is cleaned and validated; only tweets passing the check are processed further.
    • This mitigates typical data poisoning (ML02) vectors that rely on adversarial tweets.
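
The checks listed for _is_valid_tweet_content can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only the 30–400 length bounds and the 30% ratio come from the description above. The URL/mention/hashtag caps, the repetition thresholds, the exemption for newlines and tabs, and the way "Latin" is detected are all assumptions, and the function and constant names are hypothetical.

```python
import logging
import re
import unicodedata

logger = logging.getLogger(__name__)

MIN_LEN, MAX_LEN = 30, 400       # bounds quoted in the PR description
MAX_NON_LATIN_RATIO = 0.30       # ratio quoted in the PR description
MAX_URLS = MAX_MENTIONS = MAX_HASHTAGS = 3   # assumed caps
MAX_CHAR_RUN = 10                # assumed: longest allowed run of one character
MIN_UNIQUE_WORD_RATIO = 0.5      # assumed: below this, text counts as repeated


def _is_latin_or_punct(ch):
    """Treat ASCII, whitespace, punctuation, and Latin-script letters as allowed."""
    if ch.isascii() or ch.isspace():
        return True
    if unicodedata.category(ch).startswith("P"):
        return True
    return "LATIN" in unicodedata.name(ch, "")


def is_valid_tweet_content(text):
    """Return True only if the text passes every structural and heuristic check."""
    if not isinstance(text, str):
        return False
    if not MIN_LEN <= len(text) <= MAX_LEN:
        return False
    if len(re.findall(r"https?://\S+", text)) > MAX_URLS:
        return False
    if len(re.findall(r"@\w+", text)) > MAX_MENTIONS:
        return False
    if len(re.findall(r"#\w+", text)) > MAX_HASHTAGS:
        return False
    # Control (Cc) and format (Cf, e.g. zero-width) characters, except newline/tab.
    if any(ch not in "\n\t" and unicodedata.category(ch) in ("Cc", "Cf") for ch in text):
        return False
    # A single character repeated more than MAX_CHAR_RUN times in a row.
    if re.search(r"(.)\1{%d,}" % MAX_CHAR_RUN, text):
        return False
    # Mostly repeated words.
    words = text.lower().split()
    if len(words) >= 5 and len(set(words)) / len(words) < MIN_UNIQUE_WORD_RATIO:
        return False
    non_latin = sum(not _is_latin_or_punct(ch) for ch in text)
    return non_latin / len(text) <= MAX_NON_LATIN_RATIO


def filter_tweets(tweets):
    """Keep tweets that pass validation; log and discard the rest."""
    kept = []
    for tweet in tweets:
        if is_valid_tweet_content(tweet):
            kept.append(tweet)
        else:
            logger.warning("Discarding tweet that failed content validation")
    return kept
```

Heuristics like these reduce the attack surface but cannot detect every adversarial input, so the thresholds would need tuning against real data.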
More Details

| Type | Identifier | Message | Severity | Link |
| --- | --- | --- | --- | --- |
| Application | CWE-22 | The file path is constructed by directly concatenating the user-supplied variables outdir and user into an f-string without any sanitisation or normalisation. If either of these variables can be influenced by an attacker (e.g., outdir="../../../../etc"), the code will happily write to arbitrary locations on disk, enabling directory-traversal attacks, overwriting of sensitive files, or creation of rogue files. No checks such as os.path.abspath comparison, whitelist validation, or traversal filtering are performed. | medium | Link |
| Application | ML02 | The pipeline ingests tweet content directly from external and unauthenticated sources (self.utils.user_lookup_sns) and immediately feeds that text into an embedding model (CohereEmbedder) and downstream clustering (KMeans) without any provenance, schema validation, or anomaly detection. An attacker controlling a Twitter account (or compromising one) can submit crafted tweets containing adversarial or back-door triggers that will poison the embedding space and skew clustering/topic generation. This is a textbook OWASP ML Top 10 ‘ML02 – Data Poisoning’ risk. | medium | Link |
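
For the CWE-22 finding, the sanitizer described in the fix summary and the path comparison the finding says was missing might look like the sketch below. It is an illustration under assumptions, not the PR's actual _sanitize_filename: the 64-character cap, the "unnamed" fallback, the stripping of leading and trailing dots, and the .json suffix are hypothetical. It also uses os.path.realpath rather than os.path.abspath so that symlinks are resolved before the comparison.

```python
import os
import re

MAX_NAME_LEN = 64  # assumed cap; the PR only says long usernames are truncated


def sanitize_filename(name):
    """Reduce user input to the safe set a-zA-Z0-9._- and a bounded length."""
    name = name.replace("../", "").replace("..\\", "")   # traversal sequences
    name = name.replace("/", "").replace("\\", "")       # remaining slashes
    name = re.sub(r"[^a-zA-Z0-9._-]", "", name)          # anything outside the safe set
    name = name.strip(".")   # assumption: also avoid ".", "..", and dotfiles
    return name[:MAX_NAME_LEN] or "unnamed"


def is_within(base, path):
    """Return True if path resolves to base itself or to a location inside it."""
    base = os.path.realpath(base)
    path = os.path.realpath(path)
    return os.path.commonpath([base, path]) == base


def safe_tweet_path(outdir, user, suffix=".json"):
    """Build the output path and refuse anything that escapes outdir."""
    candidate = os.path.join(outdir, sanitize_filename(user) + suffix)
    if not is_within(outdir, candidate):
        raise ValueError("refusing to write outside the output directory")
    return os.path.realpath(candidate)
```

The containment check is defense in depth: it should never fire once the username is sanitized, but it still catches a mistake in the sanitizer. Note that it only constrains the filename relative to outdir. If outdir itself is attacker-controlled, as the finding's example suggests, it would need its own validation against a fixed allowed root.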

1. Unsanitized File Path Construction Leading to Directory Traversal (CWE-22)
2. ML Model Data Poisoning via Unvalidated Tweet Content (ML02)