ahmad12583719/Spam_Detection

🔬 AI-Driven Criminal Scam Analysis & Case Tracking System

Module: AI-Powered Criminology Tools
Author: Ahmad Raza


📋 Project Overview

A full-stack forensics application that uses a Naive Bayes model to estimate the probability that a suspicious message is a scam, then maps that confidence to one of three verdicts:

| Badge | Verdict       | Threshold            |
|-------|---------------|----------------------|
| 🔴    | Critical Scam | AI confidence ≥ 70%  |
| 🟡    | Suspicious    | AI confidence 40–69% |
| 🟢    | Legitimate    | AI confidence < 40%  |
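
The thresholds above can be sketched as a small mapping function. This is a minimal illustration, not the code from app.py: the function name `classify_verdict` is hypothetical, and it assumes the confidence is P(scam) expressed as a fraction between 0 and 1, with anything from 40% up to (but not including) 70% treated as Suspicious.

```python
def classify_verdict(scam_probability: float) -> str:
    """Map P(scam) in [0, 1] to one of the three verdict labels."""
    if not 0.0 <= scam_probability <= 1.0:
        raise ValueError("scam_probability must be between 0 and 1")
    if scam_probability >= 0.70:
        return "CRITICAL SCAM"
    if scam_probability >= 0.40:
        return "SUSPICIOUS"
    return "LEGITIMATE"


# Hypothetical confidence values, one per verdict band.
for probability in (0.93, 0.55, 0.12):
    print(f"{probability:.0%} -> {classify_verdict(probability)}")
```

Checking the highest threshold first keeps the boundaries unambiguous: a confidence of exactly 70% is Critical Scam, and exactly 40% is Suspicious.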

Every analysis is logged to a persistent CSV case database, and a formal PDF Forensic Report can be downloaded for each case.
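
As a rough sketch of what append-only logging to a CSV ledger can look like, the example below uses only the standard library. The column names and the helper `append_case` are assumptions made for illustration; the actual schema of crime_database.csv and the functions in database_manager.py may differ.

```python
import csv
import os
import tempfile
from datetime import datetime

# Hypothetical column layout; the real crime_database.csv schema may differ.
LEDGER_COLUMNS = ["case_id", "timestamp", "evidence_text", "scam_probability", "verdict"]


def append_case(ledger_path, case_id, evidence_text, scam_probability, verdict):
    """Append one case row, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(ledger_path) or os.path.getsize(ledger_path) == 0
    with open(ledger_path, "a", newline="", encoding="utf-8") as ledger_file:
        writer = csv.DictWriter(ledger_file, fieldnames=LEDGER_COLUMNS)
        if needs_header:
            writer.writeheader()
        writer.writerow({
            "case_id": case_id,
            "timestamp": datetime.now().isoformat(timespec="seconds"),
            "evidence_text": evidence_text,
            "scam_probability": f"{scam_probability:.4f}",
            "verdict": verdict,
        })


# Demo in a temporary directory so no real ledger is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_ledger = os.path.join(tmp_dir, "crime_database.csv")
    append_case(demo_ledger, "AR-A3F1B2C4", "WINNER!! You have been selected...", 0.93, "CRITICAL SCAM")
    append_case(demo_ledger, "AR-1B2C3D4E", "Hey, are you coming to the lecture...", 0.05, "LEGITIMATE")
    with open(demo_ledger, newline="", encoding="utf-8") as ledger_file:
        logged_rows = list(csv.DictReader(ledger_file))

print(len(logged_rows), "cases logged")
```

Opening the file in append mode means earlier rows are never rewritten, which is what makes the ledger append-only.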


📁 File Structure

scam_analysis/
│
├── model_trainer.py        # Train the AI model → saves spam_model.pkl
├── database_manager.py     # All CSV/TXT file I/O operations
├── pdf_report_generator.py # ReportLab PDF forensic report builder
├── app.py                  # Main Streamlit UI (3 pages)
│
├── spam.csv                # Training dataset (label, message)
├── spam_model.pkl          # Generated by model_trainer.py (do not edit)
├── crime_database.csv      # Auto-generated: append-only case ledger
├── session_report.txt      # Auto-generated: per-session audit summary
│
├── requirements.txt        # Python dependencies
└── README.md               # This file

⚙️ Setup Instructions

1. Create a Virtual Environment (Recommended)

python -m venv venv

# Windows
venv\Scripts\activate

# macOS / Linux
source venv/bin/activate

2. Install Dependencies

pip install -r requirements.txt

3. Train the AI Model (run once)

python model_trainer.py

Expected output:

============================================================
  AR Forensics & CyberSecurity Labs — Model Training Session
============================================================

[1/5] Loading & validating dataset …
  ✓ Loaded 5572 records  |  Dropped 0 nulls
  ✓ Class distribution:
    ham     4825
    spam    747

[2/5] Cleaning evidence text …
[3/5] Splitting into training and evaluation sets …
  ✓ Training samples : 4179
  ✓ Test samples     : 1393

[4/5] Training Forensic AI Pipeline …
[5/5] Evaluating model performance …

  ACCURACY  : 98.28%
  ----------------------------------------
  ...classification report...

  ✓ Model saved  →  'spam_model.pkl'
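
The cleaning and training steps in that output can be approximated as follows. This is a sketch under stated assumptions, not the contents of model_trainer.py: the cleaning rules in `clean_evidence`, the six-message toy dataset, and the default TF-IDF settings are all illustrative. The 75/25 split, evaluation, and saving to spam_model.pkl are left out for brevity.

```python
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def clean_evidence(crime_evidence_text: str) -> str:
    """Lowercase, drop URLs and non-letter characters, and collapse whitespace."""
    text = crime_evidence_text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


# Toy stand-in for spam.csv (hypothetical messages, three per class).
messages = [
    "WINNER!! You have been selected for a free prize, claim now",
    "Urgent! Your account is suspended, verify your bank details now",
    "Free entry to win cash, text WIN to claim your prize",
    "Hey, are you coming to the lecture tomorrow?",
    "Can we move our meeting to 3pm?",
    "Thanks for dinner last night, see you soon",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

forensic_ml_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # word counts re-weighted by TF-IDF
    ("nb", MultinomialNB()),       # multinomial Naive Bayes classifier
])
forensic_ml_pipeline.fit([clean_evidence(m) for m in messages], labels)

# predict_proba columns follow the order of classes_, so look up the spam column.
spam_index = list(forensic_ml_pipeline.named_steps["nb"].classes_).index("spam")
cleaned_evidence = clean_evidence("Claim your FREE prize now!!!")
scam_probability = forensic_ml_pipeline.predict_proba([cleaned_evidence])[0][spam_index]
print(f"P(scam) = {scam_probability:.2f}")
```

Wrapping the vectorizer and classifier in one Pipeline means the same text transformation is applied at training and prediction time, so only a single object needs to be serialised.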

4. Launch the Streamlit App

streamlit run app.py

The app opens at http://localhost:8501


🖥️ Application Pages

📊 Dashboard

  • Live KPI cards: Total Cases, Critical Scams, Suspicious, Legitimate
  • Pie Chart — verdict distribution across all cases
  • Bar Chart — red-flag keyword frequency in the evidence database
  • Recent cases table
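
The counts behind the red-flag bar chart could be produced along these lines. The keyword list and the helper `count_red_flags` are hypothetical, since the app's actual vocabulary is not listed here; the sketch counts how many stored messages mention each keyword at least once.

```python
import re
from collections import Counter

# Hypothetical red-flag vocabulary; the app's real keyword list may differ.
RED_FLAG_KEYWORDS = ("winner", "free", "urgent", "prize", "claim", "verify")


def count_red_flags(evidence_texts):
    """Count the messages that mention each red-flag keyword at least once."""
    keyword_counts = Counter({keyword: 0 for keyword in RED_FLAG_KEYWORDS})
    for evidence_text in evidence_texts:
        tokens = set(re.findall(r"[a-z]+", evidence_text.lower()))
        for keyword in RED_FLAG_KEYWORDS:
            if keyword in tokens:
                keyword_counts[keyword] += 1
    return keyword_counts


demo_evidence = [
    "WINNER!! You have been selected for a FREE prize",
    "Urgent: verify your account to claim your prize",
    "Hey, are you coming to the lecture?",
]
keyword_counts = count_red_flags(demo_evidence)
print(keyword_counts.most_common(3))
```

The resulting counter can be passed to a matplotlib bar chart, with keywords on one axis and message counts on the other.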

🔬 New Analysis

  • Paste any suspicious message text into the evidence field
  • Quick-fill examples provided (scam, suspicious, legitimate)
  • AI returns colour-coded verdict + confidence gauge
  • Download PDF Forensic Report (ReportLab, A4, court-ready format)

📁 Crime Records

  • Search cases by keyword in evidence text
  • Search by Case ID (e.g., AR-A3F1B2C4)
  • Filter by verdict type
  • Detail viewer: expand any case to inspect full evidence + download its PDF
  • Export filtered results as CSV

📄 Forensic PDF Report Contents

Each PDF includes:

  1. AR Forensics & CyberSecurity Labs letterhead
  2. Case metadata (Case ID, Timestamp, AI Engine, Model version)
  3. Full evidence text (Exhibit A)
  4. AI Verdict panel with colour-coded badge
  5. Analyst interpretation notes
  6. Signature block for physical signing

🧪 Self-Test Commands

Run the database manager in isolation to verify I/O:

python database_manager.py

🗂️ Dataset Format

The trainer accepts spam.csv in either of two formats and detects which one is in use:

Standard format (manual/custom datasets):

label,message
spam,"WINNER!! You have been selected..."
ham,"Hey, are you coming to the lecture..."

Kaggle format (SMS Spam Collection — recommended):

v1,v2
spam,"WINNER!! You have been selected..."
ham,"Hey, are you coming to the lecture..."

The trainer auto-remaps v1 → label and v2 → message, and drops any extra columns (v3, v4, v5).
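
A minimal sketch of that remapping with pandas is shown below. The helper name `normalise_columns` and the inline sample data are assumptions for illustration, and the real trainer may handle encodings and extra columns differently.

```python
import io

import pandas as pd

REQUIRED_COLUMNS = ["label", "message"]


def normalise_columns(raw_dataset: pd.DataFrame) -> pd.DataFrame:
    """Rename Kaggle-style v1/v2 headers and keep only label and message."""
    renamed = raw_dataset.rename(columns={"v1": "label", "v2": "message"})
    # Selecting the two required columns discards any extra ones.
    return renamed[REQUIRED_COLUMNS].dropna()


# In-memory stand-in for a Kaggle-style file with one extra, empty column.
kaggle_style_csv = io.StringIO(
    "v1,v2,v3\n"
    'spam,"WINNER!! You have been selected...",\n'
    'ham,"Hey, are you coming to the lecture...",\n'
)
crime_dataset = normalise_columns(pd.read_csv(kaggle_style_csv))
print(list(crime_dataset.columns))
```

Because `rename` ignores keys that are not present, the same function also passes a dataset that already uses the standard label/message headers through unchanged.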

To use the Kaggle dataset:

  1. Download from: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset
  2. Place the downloaded spam.csv in the project root
  3. Run python model_trainer.py — no other changes needed

Expected dataset stats (Kaggle):

| Metric                 | Value       |
|------------------------|-------------|
| Total records          | 5,572       |
| Ham (legitimate)       | 4,825 (87%) |
| Spam                   | 747 (13%)   |
| Training samples (75%) | 4,179       |
| Test samples (25%)     | 1,393       |
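
The figures above can be checked with a few lines of arithmetic. The sketch assumes the test set size is rounded up and the training set takes the remainder, which is how scikit-learn's `train_test_split` treats a fractional `test_size`; whether model_trainer.py uses that function is an assumption.

```python
import math

total_records = 5572
ham_records = 4825
spam_records = 747
test_fraction = 0.25

# Assumption: test size is rounded up, training set gets the remainder.
test_samples = math.ceil(total_records * test_fraction)
train_samples = total_records - test_samples

ham_percent = round(100 * ham_records / total_records)
spam_percent = round(100 * spam_records / total_records)

print(f"train={train_samples} test={test_samples}")
print(f"ham={ham_percent}% spam={spam_percent}%")
```

The roughly 87/13 class imbalance is worth keeping in mind when reading the accuracy figure, since a model that labelled every message as ham would already score about 87%.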

🔑 Key Variable Name Reference (Viva Prep)

| Variable             | Purpose                                                        |
|----------------------|----------------------------------------------------------------|
| crime_evidence_text  | Raw input message from investigator                            |
| cleaned_evidence     | Sanitised text after noise removal                             |
| forensic_ai_model    | Loaded sklearn Pipeline (TF-IDF + NaiveBayes)                  |
| scam_probability     | P(scam) from predict_proba()                                   |
| ai_verdict           | Final classification: CRITICAL SCAM / SUSPICIOUS / LEGITIMATE  |
| assigned_case_id     | UUID-based case identifier (e.g. AR-A3F1B2C4)                  |
| crime_dataset        | Full pandas DataFrame loaded from CSV                          |
| forensic_ml_pipeline | sklearn Pipeline object                                        |
| pdf_byte_buffer      | In-memory BytesIO object for PDF generation                    |
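
One plausible way to build a UUID-based identifier in the AR-A3F1B2C4 style is sketched below. The helper `new_case_id` is hypothetical, and taking the first eight hex digits of a random UUID is an assumption inferred from the example ID, not a description of the app's code.

```python
import uuid


def new_case_id(prefix: str = "AR") -> str:
    """Build an ID such as AR-A3F1B2C4 from the first 8 hex digits of a random UUID."""
    return f"{prefix}-{uuid.uuid4().hex[:8].upper()}"


assigned_case_id = new_case_id()
print(assigned_case_id)
```

Truncating to eight hex digits keeps IDs short enough to read aloud or type into the search box, at the cost of a small collision risk as the ledger grows.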

🛡️ Error Handling Coverage

| Scenario             | Handler                                          |
|----------------------|--------------------------------------------------|
| Model file not found | st.error() with instructions to run trainer      |
| Dataset file missing | FileNotFoundError with descriptive message       |
| CSV columns wrong    | ValueError listing expected vs found columns     |
| Database write fails | IOError surfaced to Streamlit UI                 |
| PDF generation fails | try-except in Crime Records detail view          |
| Empty database reads | Returns empty DataFrame with correct columns     |
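
Two of the scenarios above, a missing dataset file and wrong CSV columns, can be illustrated with the standard library alone. The function `validate_dataset` and its messages are hypothetical, and for brevity the check covers only the standard label/message header, not the Kaggle v1/v2 variant.

```python
import csv
import os
import tempfile

EXPECTED_COLUMNS = ["label", "message"]


def validate_dataset(dataset_path):
    """Raise a descriptive error if the dataset is missing or has the wrong header."""
    if not os.path.isfile(dataset_path):
        raise FileNotFoundError(
            f"Dataset file '{dataset_path}' not found. Place spam.csv in the project root."
        )
    with open(dataset_path, newline="", encoding="utf-8") as dataset_file:
        found_columns = next(csv.reader(dataset_file), [])
    if not set(EXPECTED_COLUMNS).issubset(found_columns):
        raise ValueError(f"Expected columns {EXPECTED_COLUMNS}, found {found_columns}")
    return found_columns


column_error = ""
missing_error = ""
with tempfile.TemporaryDirectory() as tmp_dir:
    # A file with the wrong header triggers the ValueError branch.
    bad_path = os.path.join(tmp_dir, "spam.csv")
    with open(bad_path, "w", newline="", encoding="utf-8") as bad_file:
        bad_file.write("sender,body\nalice,hello\n")
    try:
        validate_dataset(bad_path)
    except ValueError as error:
        column_error = str(error)
    # A path that does not exist triggers the FileNotFoundError branch.
    try:
        validate_dataset(os.path.join(tmp_dir, "missing.csv"))
    except FileNotFoundError as error:
        missing_error = str(error)

print(column_error)
print(missing_error)
```

Putting both the expected and the found column names in the message lets the user fix the file without opening the source code.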

📚 Technologies Used

| Library      | Role                                               |
|--------------|----------------------------------------------------|
| scikit-learn | ML pipeline: TF-IDF vectorizer + MultinomialNB     |
| joblib       | Model serialisation / deserialisation              |
| streamlit    | Web UI framework                                   |
| pandas       | Data manipulation and CSV I/O                      |
| matplotlib   | Pie chart + bar chart visualisations               |
| reportlab    | Professional A4 PDF report generation              |

AI-Driven Criminal Scam Analysis & Case Tracking System — For Academic Use & Learning Purposes Only
