Module: AI-Powered Criminology Tools
Author: Ahmad Raza
A full-stack forensics application that uses a Naive Bayes AI model to classify suspicious messages as:
| Badge | Verdict | Threshold |
|---|---|---|
| 🔴 | Critical Scam | AI confidence ≥ 70% |
| 🟡 | Suspicious | AI confidence 40–69% |
| 🟢 | Legitimate | AI confidence < 40% |
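
The thresholds above amount to a mapping from the model's scam probability to a verdict. The sketch below is illustrative only: the function name `classify_verdict` is hypothetical, the verdict strings follow the `ai_verdict` values listed in this README, and treating the bands as half-open intervals (so 69.5% still counts as Suspicious) is an assumption about how the boundaries are handled.

```python
def classify_verdict(scam_probability: float) -> str:
    """Map P(scam) in [0, 1] to one of the three verdict labels."""
    if not 0.0 <= scam_probability <= 1.0:
        raise ValueError(f"probability out of range: {scam_probability}")
    if scam_probability >= 0.70:   # 🔴 Critical Scam
        return "CRITICAL SCAM"
    if scam_probability >= 0.40:   # 🟡 Suspicious (assumed: 0.40 up to, not including, 0.70)
        return "SUSPICIOUS"
    return "LEGITIMATE"            # 🟢 Legitimate


for probability in (0.95, 0.55, 0.10):
    print(f"{probability:.2f} -> {classify_verdict(probability)}")
```

Checking the higher threshold first keeps each branch to a single comparison and leaves no gaps between bands.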
Every analysis is logged to a persistent CSV case database, and a formal PDF Forensic Report can be downloaded for each case.
```text
scam_analysis/
│
├── model_trainer.py          # Train the AI model → saves spam_model.pkl
├── database_manager.py       # All CSV/TXT file I/O operations
├── pdf_report_generator.py   # ReportLab PDF forensic report builder
├── app.py                    # Main Streamlit UI (3 pages)
│
├── spam.csv                  # Training dataset (label, message)
├── spam_model.pkl            # Generated by model_trainer.py (do not edit)
├── crime_database.csv        # Auto-generated: append-only case ledger
├── session_report.txt        # Auto-generated: per-session audit summary
│
├── requirements.txt          # Python dependencies
└── README.md                 # This file
```
```bash
python -m venv venv
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate
```

```bash
pip install -r requirements.txt
```

```bash
python model_trainer.py
```

Expected output:
```text
============================================================
AR Forensics & CyberSecurity Labs — Model Training Session
============================================================
[1/5] Loading & validating dataset …
✓ Loaded 5572 records | Dropped 0 nulls
✓ Class distribution:
ham 4825
spam 747
[2/5] Cleaning evidence text …
[3/5] Splitting into training and evaluation sets …
✓ Training samples : 4179
✓ Test samples : 1393
[4/5] Training Forensic AI Pipeline …
[5/5] Evaluating model performance …
ACCURACY : 98.28%
----------------------------------------
...classification report...
✓ Model saved → 'spam_model.pkl'
```
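
The five steps in the log correspond to a conventional scikit-learn workflow. The following is a minimal sketch, not the project's `model_trainer.py`: it assumes a TF-IDF + `MultinomialNB` pipeline (as listed in the dependency table) and a 75/25 split, and it uses a tiny made-up dataset in place of `spam.csv`. The step names, `random_state`, stratified split, and temporary output path are illustrative choices.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny stand-in for spam.csv (hypothetical rows, for illustration only).
messages = [
    "WINNER claim your free prize now",
    "free cash prize call now",
    "urgent claim your free reward",
    "win cash now urgent offer",
    "are you coming to the lecture",
    "see you at lunch tomorrow",
    "can we meet after the lecture",
    "lunch tomorrow sounds good",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Hold out 25% for evaluation; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    messages, labels, test_size=0.25, random_state=42, stratify=labels
)

forensic_ml_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])
forensic_ml_pipeline.fit(X_train, y_train)

accuracy = accuracy_score(y_test, forensic_ml_pipeline.predict(X_test))
print(f"ACCURACY : {accuracy:.2%}")

# Persist the fitted pipeline, then reload it the way an app would.
# A temporary directory is used so a real spam_model.pkl is never overwritten.
with tempfile.TemporaryDirectory() as workdir:
    model_path = os.path.join(workdir, "spam_model.pkl")
    joblib.dump(forensic_ml_pipeline, model_path)
    forensic_ai_model = joblib.load(model_path)

# predict_proba columns follow classes_, so look up the "spam" column by name.
spam_index = list(forensic_ai_model.classes_).index("spam")
scam_probability = forensic_ai_model.predict_proba(
    ["urgent free prize claim now"]
)[0][spam_index]
print(f"P(scam) = {scam_probability:.2f}")
```

Saving the whole `Pipeline` rather than the classifier alone means the vectoriser's vocabulary travels with the model, so the app can call `predict_proba()` on raw text.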
```bash
streamlit run app.py
```

The app opens at http://localhost:8501
- Live KPI cards: Total Cases, Critical Scams, Suspicious, Legitimate
- Pie Chart — verdict distribution across all cases
- Bar Chart — red-flag keyword frequency in the evidence database
- Recent cases table
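
The keyword bar chart comes down to counting how often each red-flag term appears across the stored evidence. A rough sketch of that count, using only the standard library: the keyword list and the function name `count_red_flags` are hypothetical, and the app's real list and tokenisation may differ.

```python
import re
from collections import Counter

# Hypothetical red-flag terms; the project's actual list is not shown here.
RED_FLAG_KEYWORDS = ("winner", "free", "urgent", "prize", "claim")


def count_red_flags(evidence_texts):
    """Count occurrences of each red-flag keyword across all messages."""
    # Start every keyword at zero so absent terms still appear in the result.
    counts = Counter({keyword: 0 for keyword in RED_FLAG_KEYWORDS})
    for text in evidence_texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word in RED_FLAG_KEYWORDS)
    return counts


sample_evidence = [
    "WINNER!! Claim your FREE prize",
    "Urgent: free entry",
    "See you at lunch",
]
keyword_counts = count_red_flags(sample_evidence)
print(keyword_counts.most_common())
```

The keys and values of the returned mapping can then be handed to a plotting call such as matplotlib's `bar()`.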
- Paste any suspicious message text into the evidence field
- Quick-fill examples provided (scam, suspicious, legitimate)
- AI returns colour-coded verdict + confidence gauge
- Download PDF Forensic Report (ReportLab, A4, court-ready format)
- Search cases by keyword in evidence text
- Search by Case ID (e.g., `AR-A3F1B2C4`)
- Filter by verdict type
- Detail viewer: expand any case to inspect full evidence + download its PDF
- Export filtered results as CSV
Each PDF includes:
- AR Forensics & CyberSecurity Labs letterhead
- Case metadata (Case ID, Timestamp, AI Engine, Model version)
- Full evidence text (Exhibit A)
- AI Verdict panel with colour-coded badge
- Analyst interpretation notes
- Digital signature block for physical signing
Run the database manager in isolation to verify I/O:
```bash
python database_manager.py
```

`spam.csv` supports two formats automatically:
Standard format (manual/custom datasets):
```csv
label,message
spam,"WINNER!! You have been selected..."
ham,"Hey, are you coming to the lecture..."
```

Kaggle format (SMS Spam Collection, recommended):

```csv
v1,v2
spam,"WINNER!! You have been selected..."
ham,"Hey, are you coming to the lecture..."
```

The trainer auto-remaps `v1` → `label` and `v2` → `message`, and drops any extra columns (`v3`, `v4`, `v5`).
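
The remapping described above can be sketched as follows. This is a standard-library illustration rather than the trainer's own code (which presumably uses pandas, given the dependency list); the names `load_labelled_rows`, `EXPECTED_COLUMNS`, and `KAGGLE_TO_STANDARD` are hypothetical. It also shows the kind of `ValueError` that lists expected versus found columns.

```python
import csv
import io

EXPECTED_COLUMNS = ["label", "message"]
KAGGLE_TO_STANDARD = {"v1": "label", "v2": "message"}


def load_labelled_rows(csv_text):
    """Return rows as {'label', 'message'} dicts from either CSV layout."""
    reader = csv.DictReader(io.StringIO(csv_text))
    found = reader.fieldnames or []
    # Rename Kaggle headers; standard headers pass through unchanged.
    renamed = {name: KAGGLE_TO_STANDARD.get(name, name) for name in found}
    missing = [c for c in EXPECTED_COLUMNS if c not in renamed.values()]
    if missing:
        raise ValueError(f"Expected columns {EXPECTED_COLUMNS}, found {found}")
    rows = []
    for raw in reader:
        row = {renamed[key]: value for key, value in raw.items() if key in renamed}
        # Keep only the two expected columns (drops extras such as v3, v4, v5).
        rows.append({column: row[column] for column in EXPECTED_COLUMNS})
    return rows


kaggle_csv = (
    "v1,v2,v3\n"
    'spam,"WINNER!! You have been selected...",\n'
    'ham,"Hey, are you coming to the lecture...",\n'
)
rows = load_labelled_rows(kaggle_csv)
print(rows[0])
```

Normalising the headers once at load time lets the rest of the pipeline assume a single `label`/`message` layout.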
To use the Kaggle dataset:
- Download from: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
- Place `spam.csv` in the project root
- Run `python model_trainer.py`; no other changes needed
Expected dataset stats (Kaggle):
| Statistic | Value |
|---|---|
| Total records | 5,572 |
| Ham (legitimate) | 4,825 (87%) |
| Spam | 747 (13%) |
| Training samples (75%) | 4,179 |
| Test samples (25%) | 1,393 |
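
The figures in the table are mutually consistent, which can be checked with a few lines of arithmetic. The rounding rule used here (test size rounded up, remainder assigned to training) is an assumption; for this dataset 25% of 5,572 is exactly 1,393, so the choice does not affect the result.

```python
import math

total_records = 5572
ham_records, spam_records = 4825, 747
test_fraction = 0.25

# Assumed rounding: test size rounded up, remainder used for training.
test_samples = math.ceil(total_records * test_fraction)
train_samples = total_records - test_samples

# Class shares as whole percentages.
ham_share = round(100 * ham_records / total_records)
spam_share = round(100 * spam_records / total_records)

print(train_samples, test_samples)  # 4179 1393
print(ham_share, spam_share)        # 87 13
```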

| Variable | Purpose |
|---|---|
| `crime_evidence_text` | Raw input message from investigator |
| `cleaned_evidence` | Sanitised text after noise removal |
| `forensic_ai_model` | Loaded sklearn Pipeline (TF-IDF + NaiveBayes) |
| `scam_probability` | P(scam) from `predict_proba()` |
| `ai_verdict` | Final classification: CRITICAL SCAM / SUSPICIOUS / LEGITIMATE |
| `assigned_case_id` | UUID-based case identifier (e.g. `AR-A3F1B2C4`) |
| `crime_dataset` | Full pandas DataFrame loaded from CSV |
| `forensic_ml_pipeline` | sklearn Pipeline object |
| `pdf_byte_buffer` | In-memory BytesIO object for PDF generation |
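
A UUID-based identifier in the `AR-A3F1B2C4` style can be produced with the standard `uuid` module. This is one plausible construction, not necessarily the project's: the function name `generate_case_id` and the choice of the first eight hex digits of a random UUID4 are assumptions inferred from the example ID.

```python
import re
import uuid


def generate_case_id(prefix="AR"):
    """Build an ID like 'AR-A3F1B2C4' from the first 8 hex digits of a UUID4."""
    return f"{prefix}-{uuid.uuid4().hex[:8].upper()}"


assigned_case_id = generate_case_id()
print(assigned_case_id)

# Shape check: prefix, hyphen, eight uppercase hex digits.
assert re.fullmatch(r"AR-[0-9A-F]{8}", assigned_case_id)
```

Eight hex digits give 32 bits of randomness, which is ample for a coursework ledger. A larger deployment would want to check new IDs against existing ones or keep more digits.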
| Scenario | Handler |
|---|---|
| Model file not found | st.error() with instructions to run trainer |
| Dataset file missing | FileNotFoundError with descriptive message |
| CSV columns wrong | ValueError listing expected vs found columns |
| Database write fails | IOError surfaced to Streamlit UI |
| PDF generation fails | try-except in Crime Records detail view |
| Empty database reads | Returns empty DataFrame with correct columns |
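
Two rows of the table, the append-only ledger write and the empty-database read, can be illustrated with the standard `csv` module. This is a sketch under stated assumptions: the column names, the sample record, and the helpers `append_case` and `read_cases` are hypothetical, and it returns an empty list where the project returns an empty DataFrame.

```python
import csv
import os
import tempfile

# Hypothetical column layout; the real ledger's columns may differ.
LEDGER_COLUMNS = ["case_id", "timestamp", "evidence", "verdict", "confidence"]


def append_case(path, record):
    """Append one case row, writing the header first if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LEDGER_COLUMNS)
        if is_new:
            writer.writeheader()
        writer.writerow(record)


def read_cases(path):
    """Return all rows, or an empty list when no ledger exists yet."""
    if not os.path.exists(path):
        return []
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as workdir:
    ledger_path = os.path.join(workdir, "crime_database.csv")
    cases_before = read_cases(ledger_path)  # no file yet
    append_case(ledger_path, {
        "case_id": "AR-A3F1B2C4",           # illustrative sample values
        "timestamp": "2024-01-01T12:00:00",
        "evidence": "WINNER!! You have been selected...",
        "verdict": "CRITICAL SCAM",
        "confidence": "0.93",
    })
    cases_after = read_cases(ledger_path)

print(len(cases_before), len(cases_after))
```

Opening the file in append mode means existing rows are never rewritten, which matches the append-only description of `crime_database.csv`.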

| Library | Role |
|---|---|
| `scikit-learn` | ML pipeline: TF-IDF vectorizer + MultinomialNB |
| `joblib` | Model serialisation / deserialisation |
| `streamlit` | Web UI framework |
| `pandas` | Data manipulation and CSV I/O |
| `matplotlib` | Pie chart + bar chart visualisations |
| `reportlab` | Professional A4 PDF report generation |
AI-Driven Criminal Scam Analysis & Case Tracking System — For Academic Use & Learning Purposes Only