Pensar - auto fix for 2 issues (CWE-22, ML02) by pensarappstaging[bot] · Pull Request #30 · Yuvanesh-ux/Nexus

pensarappstaging · 2025-05-09T16:35:58Z

Directory Traversal Vulnerability Fix:
- Added a _sanitize_filename method to the Profile class. This method removes directory traversal sequences (../), slashes, and any characters not in the safe set (a-zA-Z0-9._-). It also truncates overly long usernames.
- Replaced all filename constructions using user input (in both create_social_profile_tweepy and create_social_profile_sns) with calls to this sanitizer. Now, filenames for storing tweets cannot escape the outdir or contain unsafe characters, mitigating directory traversal.
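For illustration, a sanitizer along the lines described above could look like the following. This is a minimal sketch, not the code from the commit: the function name `sanitize_filename`, the 64-character cap, and the `"unknown"` fallback are assumptions, since the description does not state the actual limit or the behaviour for empty results.

```python
import re

MAX_LEN = 64  # assumed cap; the PR description does not give the real limit


def sanitize_filename(username: str, max_len: int = MAX_LEN) -> str:
    """Reduce a user-supplied name to a single safe path component."""
    # Keep only the safe set named in the description (a-zA-Z0-9._-).
    # This also removes "/" and "\", so no path separator survives.
    name = re.sub(r"[^a-zA-Z0-9._-]", "", username)
    # Collapse runs of dots so ".." cannot remain as a traversal token.
    name = re.sub(r"\.{2,}", ".", name)
    # Strip leading/trailing dots so the result is never "." or hidden.
    name = name.strip(".")[:max_len]
    # Hypothetical fallback when nothing safe is left.
    return name or "unknown"
```

The stated safe set omits the underscore, so under this sketch distinct handles such as `a_b` and `ab` would map to the same filename; whether that matters depends on how the real method treats underscores.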
ML Pipeline Data Poisoning Fix:
- Introduced a static method _is_valid_tweet_content, which applies a variety of structural and heuristic checks to each tweet before it is accepted for downstream ML processing:
  - Checks content type and length bounds (30–400); rejects excessive URLs, mentions, or hashtags; blocks control and invisible Unicode characters; and filters out repeated content or high character repetition.
  - Rejects text in which more than 30% of characters are neither Latin nor punctuation.
- For tweets loaded from disk, only those passing _is_valid_tweet_content are appended to all_tweets; otherwise, they're logged and discarded.
- For newly scraped tweets, before addition to all_tweets and to disk, content is cleaned and validated; only tweets passing the check are processed further.
- Together, these checks mitigate typical data poisoning (ML02) vectors from adversarial tweets.
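The checks listed above can be sketched roughly as follows. Only the length bounds (30–400) and the 30% non-Latin threshold come from the description; the function name, the per-tweet limits on URLs, mentions and hashtags, the character-repetition threshold, and the decision to treat digits, whitespace and symbols as acceptable are all assumptions made for illustration. The repeated-content filter is omitted because the description does not say how it is detected.

```python
import re
import unicodedata
from collections import Counter

MIN_LEN, MAX_LEN = 30, 400                   # bounds stated in the description
MAX_URLS = MAX_MENTIONS = MAX_HASHTAGS = 3   # assumed limits
MAX_CHAR_SHARE = 0.5                         # assumed repetition threshold
MAX_NON_LATIN_SHARE = 0.30                   # threshold stated in the description


def _is_latin_or_neutral(ch: str) -> bool:
    """True for Latin letters, digits, whitespace, punctuation and symbols."""
    if ch.isspace() or ch.isdigit():
        return True
    if unicodedata.category(ch)[0] in ("P", "S"):
        return True
    return "LATIN" in unicodedata.name(ch, "")


def is_valid_tweet_content(text) -> bool:
    # Type and length bounds.
    if not isinstance(text, str) or not (MIN_LEN <= len(text) <= MAX_LEN):
        return False
    # Excessive URLs, mentions or hashtags.
    if len(re.findall(r"https?://\S+", text)) > MAX_URLS:
        return False
    if len(re.findall(r"@\w+", text)) > MAX_MENTIONS:
        return False
    if len(re.findall(r"#\w+", text)) > MAX_HASHTAGS:
        return False
    # Control (Cc) and invisible format (Cf) characters, except newline/tab.
    if any(unicodedata.category(ch) in ("Cc", "Cf")
           for ch in text if ch not in "\n\t"):
        return False
    # High character repetition: one character dominating the text.
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return False
    if Counter(visible).most_common(1)[0][1] / len(visible) > MAX_CHAR_SHARE:
        return False
    # Share of characters that are neither Latin nor punctuation.
    foreign = sum(1 for ch in text if not _is_latin_or_neutral(ch))
    return foreign / len(text) <= MAX_NON_LATIN_SHARE
```

Heuristics of this kind reject obviously malformed input cheaply, but they also discard legitimate non-Latin tweets, which is a trade-off worth reviewing before merging.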

More Details

Type	Identifier	Message	Severity	Link
Application	CWE-22	The file path is constructed by directly concatenating the user-supplied variables `outdir` and `user` into an f-string without any sanitisation or normalisation. If either of these variables can be influenced by an attacker (e.g., outdir="../../../../etc"), the code will happily write to arbitrary locations on disk, enabling directory-traversal attacks, overwriting of sensitive files, or creation of rogue files. No checks such as `os.path.abspath` comparison, whitelist validation, or traversal filtering are performed.	medium	Link
Application	ML02	The pipeline ingests tweet content directly from external and unauthenticated sources (`self.utils.user_lookup_sns`) and immediately feeds that text into an embedding model (`CohereEmbedder`) and downstream clustering (`KMeans`) without any provenance, schema validation, or anomaly detection. An attacker controlling a Twitter account (or compromising one) can submit crafted tweets containing adversarial or back-door triggers that will poison the embedding space and skew clustering/topic generation. This is a textbook OWASP ML Top 10 ‘ML02 – Data Poisoning’ risk.	medium	Link
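The CWE-22 finding mentions that no `os.path.abspath` comparison is performed. As a complement to filename sanitization, a containment check of that kind might be sketched as below. The helper name `resolve_inside` is hypothetical and is not part of the PR; the sketch uses `os.path.realpath` rather than `abspath` so that symlinks are resolved as well, and it assumes `outdir` itself is trusted, so it does not address an attacker-controlled `outdir`.

```python
import os


def resolve_inside(outdir: str, filename: str) -> str:
    """Return the resolved path of filename, or raise if it leaves outdir."""
    base = os.path.realpath(outdir)
    candidate = os.path.realpath(os.path.join(base, filename))
    # commonpath equals base only when candidate is base or lies beneath it.
    if os.path.commonpath([base, candidate]) != base:
        raise ValueError(f"path escapes output directory: {filename!r}")
    return candidate
```

A caller would build the tweet file path with `resolve_inside(outdir, name)` and treat `ValueError` as a rejected username.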


Fix 2 security issues:

1. Unsanitized File Path Construction Leading to Directory Traversal (CWE-22)
2. ML Model Data Poisoning via Unvalidated Tweet Content (ML02)

d4fe91f

pensarappstaging[bot] wants to merge 1 commit into main from pensar-auto-fix-Sjb3
