Bot Detection System

A production-ready machine learning system for detecting automated social media accounts using behavioral analysis, cryptographic content fingerprinting, and ensemble learning techniques.

Overview

This system identifies bot accounts on social media platforms by analyzing behavioral patterns, account characteristics, and content similarity. On the synthetic test set it reports 95-100% accuracy; on real-world data, 85-95% is the more realistic expectation (see Model Performance). The models are trained with several anti-overfitting measures.

Key Features

Multiple ML models with hyperparameter optimization (Logistic Regression, Random Forest, XGBoost)
Cryptographic content fingerprinting using SHA-256 for bot network detection
Real-time predictions via REST API
Batch processing for high-throughput scenarios
Comprehensive test suite with 36 unit and integration tests
Automated CI/CD pipeline with GitHub Actions
Docker containerization for consistent deployment
Production-ready with proper error handling and logging

What Makes This Project Unique

Detects coordinated bot networks using cryptographic hashing
Implements proper anti-overfitting measures (cross-validation, SMOTE, regularization)
Complete production pipeline from data processing to deployment
Comprehensive testing and continuous integration
Easy deployment to multiple cloud platforms
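
As a rough illustration of how hashing can surface coordinated accounts, the sketch below normalizes post text, hashes it with SHA-256, and groups accounts that posted identical content. The function names, the normalization rule, and the sample posts are assumptions made for this example; they are not taken from src/feature_engineering/features.py.

```python
import hashlib
from collections import defaultdict


def content_fingerprint(text: str) -> str:
    """Return the SHA-256 hex digest of case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_shared_content(posts):
    """Group account IDs by fingerprint and keep fingerprints used by 2+ accounts.

    `posts` is an iterable of (account_id, text) pairs (hypothetical input shape).
    """
    accounts_by_hash = defaultdict(set)
    for account_id, text in posts:
        accounts_by_hash[content_fingerprint(text)].add(account_id)
    return {h: ids for h, ids in accounts_by_hash.items() if len(ids) > 1}


# Made-up sample data: the first two posts differ only in case and spacing.
sample_posts = [
    ("acct_1", "Buy now!  Limited offer"),
    ("acct_2", "buy now! limited offer"),
    ("acct_3", "Just had a great coffee"),
]
shared = find_shared_content(sample_posts)
```

Exact hashing only catches content that is identical after normalization; near-duplicates with small edits would need a different similarity technique.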

Technical Stack

Core Technologies

Python 3.9+
scikit-learn 1.5.2 (Machine Learning)
XGBoost 2.1.3 (Gradient Boosting)
FastAPI 0.115.6 (REST API)
Pandas 2.2.3 (Data Processing)
NumPy 2.x (Numerical Computing)

Development Tools

pytest (Testing Framework)
Docker (Containerization)
GitHub Actions (CI/CD)
Black & Flake8 (Code Quality)

Deployment Platforms

Docker containerized

Project Architecture

System Components

┌─────────────────┐
│  Data Layer     │  - Synthetic data generation
│                 │  - CSV loading and cleaning
└────────┬────────┘
         │
┌────────▼────────┐
│ Feature Layer   │  - 15+ engineered features
│                 │  - SHA-256 content hashing
│                 │  - Behavioral analysis
└────────┬────────┘
         │
┌────────▼────────┐
│  Model Layer    │  - 3 ML models
│                 │  - Hyperparameter tuning
│                 │  - Cross-validation
└────────┬────────┘
         │
┌────────▼────────┐
│   API Layer     │  - FastAPI REST endpoints
│                 │  - Real-time predictions
│                 │  - Batch processing
└─────────────────┘

Data Flow

Raw user data (12 features)
Feature engineering (26 features)
Model prediction (bot probability)
Response with confidence score
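
To make the feature engineering step more concrete, here is a minimal sketch that derives three extra features from the 12 raw fields. The derived feature names and formulas are hypothetical, chosen to mirror the feature categories this README mentions; the project's real 26-feature set may be computed differently.

```python
def engineer_features(user: dict) -> dict:
    """Return a copy of a raw user record with a few derived features added."""
    features = dict(user)
    # +1 in the denominator avoids division by zero for accounts following nobody.
    features["follower_following_ratio"] = (
        user["followers_count"] / (user["following_count"] + 1)
    )
    # Lifetime posting rate; max(..., 1) guards against brand-new accounts.
    features["tweets_per_day_lifetime"] = (
        user["tweet_count"] / max(user["account_age_days"], 1)
    )
    # Fraction of three profile signals that are customized or filled in.
    completeness_signals = [
        user["default_profile"] == 0,
        user["default_profile_image"] == 0,
        user["description_length"] > 0,
    ]
    features["profile_completeness"] = sum(completeness_signals) / len(completeness_signals)
    return features


# Raw record with the 12 input fields used by the API examples in this README.
raw_user = {
    "followers_count": 50, "following_count": 3000, "tweet_count": 10000,
    "account_age_days": 90, "listed_count": 1, "verified": 0,
    "default_profile": 1, "default_profile_image": 1, "geo_enabled": 0,
    "description_length": 10, "avg_tweets_per_day": 100.0, "avg_retweet_ratio": 0.9,
}
engineered = engineer_features(raw_user)
```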

Key Integration Points

Model trained offline, loaded at API startup
Same FeatureEngineer used in training and inference
Docker packages entire application stack
CI/CD automates testing, training, and deployment

Installation

Prerequisites

Python 3.9, 3.10, 3.11, or 3.13
pip package manager
Git
Docker (optional, for containerized deployment)

Local Setup

# Clone the repository
git clone https://github.com/yourusername/bot-detector.git
cd bot-detector

# Create virtual environment
python -m venv .venv

# Activate virtual environment
# Windows PowerShell:
.venv\Scripts\Activate.ps1
# Windows CMD:
.venv\Scripts\activate.bat
# Mac/Linux:
source .venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

Verify Installation

python -c "import sklearn, xgboost, fastapi; print('All dependencies installed')"

Quick Start

Step 1: Train the Model (2-3 minutes)

python train.py

This creates synthetic data, trains models, and saves the best one to models/saved_models/best_model.joblib.

Step 2: Run Tests (Optional but Recommended)

# Windows PowerShell
$env:PYTHONPATH = "$PWD"
pytest

# Mac/Linux
export PYTHONPATH=$PWD
pytest

# Optional: run with a coverage report (from the repository root)
python -m pytest --cov=src --cov-report=html --cov-report=term

# Open the HTML coverage report (target: >80%)
start htmlcov/index.html      # Windows
open htmlcov/index.html       # macOS
xdg-open htmlcov/index.html   # Linux

Expected: 36 tests passing

Step 3: Start the API Server

python src/api/main.py
# or
python run_api.py

The API will be available at http://localhost:8000

Step 4: Test the API

Open your browser and go to http://localhost:8000/docs for interactive API documentation.

Or test via command line:

# Windows PowerShell
Invoke-WebRequest -Uri "http://127.0.0.1:8000/predict" -Method POST `
  -Headers @{"Content-Type"="application/json"} `
  -Body '{"followers_count":50,"following_count":3000,"tweet_count":10000,"account_age_days":90,"listed_count":1,"verified":0,"default_profile":1,"default_profile_image":1,"geo_enabled":0,"description_length":10,"avg_tweets_per_day":100.0,"avg_retweet_ratio":0.9}' `
  -UseBasicParsing

# Mac/Linux/Git Bash
curl -X POST "http://localhost:8000/predict" \
  -H "Content-Type: application/json" \
  -d '{"followers_count":50,"following_count":3000,"tweet_count":10000,"account_age_days":90,"listed_count":1,"verified":0,"default_profile":1,"default_profile_image":1,"geo_enabled":0,"description_length":10,"avg_tweets_per_day":100.0,"avg_retweet_ratio":0.9}'

Expected response:

{
  "is_bot": true,
  "confidence": 0.9999,
  "suspicion_score": 1.0,
  "message": "This account shows bot-like behavior"
}

Usage Guide

Training a New Model

python train.py

This script:

Generates 10,000 synthetic user accounts (30% bots, 70% humans)
Engineers 26 features including cryptographic fingerprints
Trains 3 models with GridSearchCV hyperparameter tuning
Performs 5-fold stratified cross-validation
Applies SMOTE for class balancing
Evaluates on held-out test set
Saves best model based on F1 score
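
The last step, picking the best model by F1 score, can be sketched as follows. The confusion counts are invented purely to exercise the function and are not results from this project; the real script presumably computes its metrics with scikit-learn.

```python
def f1_score_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion counts for the three model families (illustration only).
candidate_counts = {
    "logistic_regression": {"tp": 540, "fp": 30, "fn": 60},
    "random_forest": {"tp": 570, "fp": 20, "fn": 30},
    "xgboost": {"tp": 565, "fp": 15, "fn": 35},
}

scores = {name: f1_score_from_counts(**counts) for name, counts in candidate_counts.items()}
best_model_name = max(scores, key=scores.get)
```

F1 balances precision and recall, which matters here because the classes are imbalanced (30% bots) and plain accuracy can look good even when many bots are missed.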

Training output includes:

Model comparison metrics
Best hyperparameters
Test set performance (Accuracy, Precision, Recall, F1, AUC-ROC)
Feature importance (for tree-based models)

Using the API

Single Prediction

import requests

url = "http://localhost:8000/predict"
data = {
    "followers_count": 150,
    "following_count": 2000,
    "tweet_count": 5000,
    "account_age_days": 180,
    "listed_count": 2,
    "verified": 0,
    "default_profile": 1,
    "default_profile_image": 1,
    "geo_enabled": 0,
    "description_length": 20,
    "avg_tweets_per_day": 50.5,
    "avg_retweet_ratio": 0.85
}

response = requests.post(url, json=data)
print(response.json())

Batch Prediction (up to 100 users)

batch_data = {
    "users": [
        {
            "followers_count": 50,
            "following_count": 3000,
            # ... other fields
        },
        {
            "followers_count": 500,
            "following_count": 300,
            # ... other fields
        }
    ]
}

response = requests.post("http://localhost:8000/batch_predict", json=batch_data)
print(response.json())
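
Because /batch_predict accepts at most 100 users per request, larger lists have to be split on the client side. A minimal sketch, assuming a helper named chunk_users (hypothetical, not part of this project):

```python
from typing import Iterator, List

MAX_BATCH_SIZE = 100  # limit stated for /batch_predict in this README


def chunk_users(users: List[dict], size: int = MAX_BATCH_SIZE) -> Iterator[List[dict]]:
    """Yield consecutive slices of `users`, each with at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(users), size):
        yield users[start:start + size]


# Placeholder records; real ones need all 12 input fields.
users = [{"followers_count": i} for i in range(250)]
batches = list(chunk_users(users))
payloads = [{"users": batch} for batch in batches]
```

Each entry in payloads can then be sent with requests.post as in the batch example above.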

If Using with Custom Data

Replace the synthetic data generation in train.py with real data:

# Instead of:
df = processor.create_sample_dataset(n_samples=10000, bot_ratio=0.3)

# Use:
df = processor.load_data("path/to/your/data.csv")

Ensure your CSV has these columns:

followers_count, following_count, tweet_count
account_age_days, listed_count, verified
default_profile, default_profile_image, geo_enabled
description_length, avg_tweets_per_day, avg_retweet_ratio
label (0 for human, 1 for bot)
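
Before training on your own file, it can help to confirm the header contains every required column. The sketch below uses only the standard library; the helper name missing_columns is an assumption for this example.

```python
import csv
import io

REQUIRED_COLUMNS = {
    "followers_count", "following_count", "tweet_count",
    "account_age_days", "listed_count", "verified",
    "default_profile", "default_profile_image", "geo_enabled",
    "description_length", "avg_tweets_per_day", "avg_retweet_ratio",
    "label",
}


def missing_columns(csv_file) -> set:
    """Return the required column names absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# In-memory sample with only three of the required columns.
sample = io.StringIO("followers_count,following_count,label\n10,20,0\n")
missing = missing_columns(sample)
```

For a real file, pass an open handle instead: with open("path/to/your/data.csv", newline="") as f: print(missing_columns(f)).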

API Documentation

Base URL

Local: http://localhost:8000

Endpoints

GET /

Root endpoint with API information.

Response:

{
  "message": "Bot Detection API",
  "version": "1.0.0",
  "status": "active",
  "model_loaded": true
}

GET /health

Health check endpoint for monitoring.

Response:

{
  "status": "healthy",
  "model_loaded": true
}
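
A deployment script or container check might poll this endpoint until the model has loaded. The following is a minimal sketch under the assumption that the response matches the JSON shown above; the function names, retry count, and delay are illustrative choices.

```python
import json
import time
import urllib.request
from urllib.error import URLError


def fetch_health(url="http://localhost:8000/health", timeout=2.0):
    """GET the health endpoint and return the decoded JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def wait_until_healthy(fetch=fetch_health, attempts=10, delay_seconds=1.0):
    """Return True once the API reports healthy with a loaded model, else False."""
    for attempt in range(attempts):
        try:
            body = fetch()
            if body.get("status") == "healthy" and body.get("model_loaded"):
                return True
        except (URLError, OSError, ValueError):
            # Server not up yet, connection refused, or a non-JSON response.
            pass
        if attempt < attempts - 1:
            time.sleep(delay_seconds)
    return False
```

The fetch parameter is injectable so the retry logic can be exercised without a running server.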

POST /predict

Predict if a single user is a bot.

Request Body:

{
  "followers_count": 150,
  "following_count": 2000,
  "tweet_count": 5000,
  "account_age_days": 180,
  "listed_count": 2,
  "verified": 0,
  "default_profile": 1,
  "default_profile_image": 1,
  "geo_enabled": 0,
  "description_length": 20,
  "avg_tweets_per_day": 50.5,
  "avg_retweet_ratio": 0.85
}

Response:

{
  "is_bot": true,
  "confidence": 0.87,
  "suspicion_score": 0.72,
  "message": "This account shows bot-like behavior"
}

POST /batch_predict

Predict bot status for multiple users (max 100).

Request Body:

{
  "users": [
    { /* user 1 data */ },
    { /* user 2 data */ }
  ]
}

Response:

{
  "predictions": [
    {
      "is_bot": true,
      "confidence": 0.87,
      "suspicion_score": 0.72,
      "message": "Bot-like behavior detected"
    }
  ],
  "total_users": 2,
  "bot_count": 1,
  "human_count": 1
}

GET /model_info

Get information about the loaded model. Note: the response depends on which model was selected as best during training, so it can differ between training runs.

Response:

{
  "model_name": "logisticregression",
  "model_type": "LogisticRegression",
  "feature_importance_available": true
}

Interactive Documentation

FastAPI provides automatic interactive documentation:

Swagger UI: http://localhost:8000/docs
ReDoc: http://localhost:8000/redoc

Testing

Running Tests

# Set Python path (required for imports)
# Windows PowerShell:
$env:PYTHONPATH = "$PWD"

# Mac/Linux:
export PYTHONPATH=$PWD

# Run all tests
pytest

# Run with verbose output
pytest -v

# Run with coverage report
pytest --cov=src --cov-report=html

# Run specific test categories
python -m pytest tests/unit/ -x               # Unit tests only
python -m pytest tests/integration/test_api.py -v --tb=short # Integration tests only
python -m pytest tests/unit/test_model.py # Specific test file

Test Structure

Unit Tests (26 tests): Test individual components
- Data processing: Data loading, cleaning, synthetic generation
- Feature engineering: Feature creation, cryptographic functions
- Model training: Training pipeline, predictions, metrics
Integration Tests (10 tests): Test end-to-end functionality
- API endpoints: All HTTP endpoints
- Full pipeline: Complete workflow from data to prediction
- Model consistency: Reproducibility of predictions

Test Coverage

Target: >80% code coverage
Current: 36/36 tests passing (API tests require a trained model to be loaded)

Continuous Testing

Tests automatically run on every push via GitHub Actions CI/CD pipeline.

Deployment

Docker Deployment

Build Image

docker build -t bot-detector .

Run Container

docker run -d -p 8000:8000 --name bot-detector bot-detector

Check Logs

docker logs bot-detector

Stop Container

docker stop bot-detector
docker rm bot-detector

Using Docker Compose

# Start
docker-compose up -d

# Check status
docker-compose ps

# View logs
docker-compose logs -f

# Stop
docker-compose down

CI/CD Pipeline

GitHub Actions Workflow

The project includes automated CI/CD via .github/workflows/ci-cd.yml:

Stages

Test Stage (runs on every push/PR)
- Installs dependencies
- Runs pytest with coverage
- Lints code with flake8
- Checks formatting with black
Train Stage (runs on main branch push only)
- Trains model with python train.py
- Uploads model as artifact (retained 30 days)
Build Stage (runs on main branch push only)
- Downloads trained model artifact
- Builds Docker image
- Tests Docker container health
Deploy Stage (runs on main branch push only)
- Currently placeholder for deployment
- Can be extended for automatic cloud deployment

Triggering CI/CD

# Push to trigger full pipeline
git push origin main

# Create PR to trigger tests only
git checkout -b feature-branch
git push origin feature-branch
# Create PR on GitHub

Setting Up CI/CD

No additional setup required. The workflow automatically runs when code is pushed to GitHub.

Project Structure

bot-detector/
├── .github/
│   └── workflows/
│       └── ci-cd.yml           # GitHub Actions CI/CD pipeline
├── data/
│   ├── raw/                    # Raw input data (gitignored)
│   └── processed/              # Processed datasets (gitignored)
├── models/
│   ├── saved_models/           # Trained models (gitignored except .gitkeep)
│   └── training/               # Training artifacts (gitignored)
├── src/
│   ├── __init__.py
│   ├── data_processing/
│   │   ├── __init__.py
│   │   └── data_loader.py      # Data loading and preprocessing
│   ├── feature_engineering/
│   │   ├── __init__.py
│   │   └── features.py         # Feature engineering + crypto
│   ├── model/
│   │   ├── __init__.py
│   │   └── train.py            # Model training with anti-overfitting
│   └── api/
│       ├── __init__.py
│       └── main.py             # FastAPI application
├── tests/
│   ├── __init__.py
│   ├── unit/                   # Unit tests (26 tests)
│   │   ├── test_data_processing.py
│   │   ├── test_feature_engineering.py
│   │   └── test_model.py
│   └── integration/            # Integration tests (10 tests)
│       └── test_api.py
├── logs/                       # Application logs (gitignored)
├── .gitignore                  # Git ignore rules
├── Dockerfile                  # Docker configuration
├── docker-compose.yml          # Docker Compose configuration
├── requirements.txt            # Python dependencies
├── pytest.ini                  # Pytest configuration
├── train.py                    # Main training script
└── README.md                   # This file

Model Performance

Training Performance

Typical metrics on test set (synthetic data):

Accuracy: 95-100%
Precision: 95-100%
Recall: 95-100%
F1 Score: 95-100%
AUC-ROC: 99-100%

Note: These high scores reflect synthetic data with clearly separated classes. On real-world data, with noise and edge cases, expect accuracy closer to 85-95%.

API Performance

Single prediction: <100ms response time
Batch prediction (100 users): <500ms response time
Throughput: ~1000 requests/minute
Memory usage: ~500MB with model loaded
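
Figures like these depend on hardware and model choice, so it is worth measuring them locally. Here is a small timing helper as a sketch; the name measure_latency_ms and the percentile calculation are assumptions for illustration, and the lambda stands in for a real prediction call.

```python
import statistics
import time


def measure_latency_ms(call, repeats=50):
    """Time `call()` repeatedly and return summary latencies in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Simple nearest-rank style index for the 95th percentile.
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


# Stand-in workload; replace with a function that posts to /predict.
stats = measure_latency_ms(lambda: sum(range(1000)), repeats=20)
```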

Anti-Overfitting Measures

Train/Validation/Test split (60/20/20)
Stratified 5-fold cross-validation
SMOTE applied only to training data
Regularization (L2 for Logistic Regression, reg_alpha/reg_lambda for XGBoost)
Early stopping for iterative models
Hyperparameter tuning with validation set
Model selection based on validation performance
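
The 60/20/20 stratified split can be sketched in pure Python as shown here. This is an illustration of the idea only; the project presumably relies on scikit-learn utilities, and the function name and seed are assumptions.

```python
import random
from collections import defaultdict


def stratified_split(labels, percents=(60, 20, 20), seed=42):
    """Return (train, val, test) index lists that preserve the class ratio."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train, val, test = [], [], []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        n_train = len(indices) * percents[0] // 100
        n_val = len(indices) * percents[1] // 100
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])  # remainder goes to the test set
    return train, val, test


# 30% bots (label 1) and 70% humans (label 0), scaled down from the README's dataset.
labels = [1] * 300 + [0] * 700
train_idx, val_idx, test_idx = stratified_split(labels)
# Oversampling such as SMOTE would then be applied to train_idx only,
# so synthetic samples never leak into validation or test data.
```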

Feature Importance

Top features for bot detection:

Follower/following ratio
Tweet frequency (tweets per day)
Account age
Retweet ratio
Profile completeness
Cryptographic similarity scores

Development Workflow

Local Development Cycle

Make code changes
Run tests: pytest
Train model: python train.py
Test API: python src/api/main.py or python run_api.py
Commit changes
Push to GitHub (triggers CI/CD)

Adding New Features

Add feature in src/feature_engineering/features.py
Update unit tests in tests/unit/test_feature_engineering.py
Retrain model: python train.py
Verify tests pass: pytest
Commit and push

Modifying Models

Edit src/model/train.py
Update hyperparameter grid
Retrain: python train.py
Check performance metrics
Update tests if needed
Commit and push

Best Practices

Always run tests before committing
Keep model files out of git (use .gitignore)
Document significant changes in commit messages
Use feature branches for major changes
Review CI/CD results before merging to main

Troubleshooting

Common Issues and Solutions

Issue: Model Not Found Error

# Solution: Train model first
python train.py

Issue: API Returns 503 Service Unavailable

Cause: The model was not loaded at API startup.
Solution: Check that models/saved_models/best_model.joblib exists. If it does not, run python train.py, then restart the API.
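
A quick way to check the model file from Python, as a sketch (the helper name model_status is hypothetical; the path is the one given in this README):

```python
from pathlib import Path

MODEL_PATH = Path("models/saved_models/best_model.joblib")


def model_status(path: Path = MODEL_PATH) -> str:
    """Describe whether the saved model file is present and non-empty."""
    if not path.exists():
        return f"missing: {path} (run python train.py)"
    size = path.stat().st_size
    if size == 0:
        return f"empty: {path} (retrain with python train.py)"
    return f"ok: {path} ({size} bytes)"
```

Run print(model_status()) from the repository root so the relative path resolves correctly.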

Issue: Docker Build Fails

# Solution: Ensure all files are present
git status
# Make sure requirements.txt, Dockerfile, and src/ are committed
docker build -t bot-detector .

Issue: High Memory Usage

Solution: Use model quantization or reduce batch size in API

Issue: Slow Predictions

Solutions:

Use smaller model (Logistic Regression instead of XGBoost)
Reduce number of features
Implement caching for repeated requests
Scale horizontally with multiple instances
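
The caching suggestion above can be sketched with functools.lru_cache. Dictionaries are not hashable, so the request is first converted to a tuple in a fixed field order. The wrapper name make_cached_predictor and the cache size are assumptions; predict_fn stands in for whatever function calls the model.

```python
from functools import lru_cache

FIELD_ORDER = (
    "followers_count", "following_count", "tweet_count", "account_age_days",
    "listed_count", "verified", "default_profile", "default_profile_image",
    "geo_enabled", "description_length", "avg_tweets_per_day", "avg_retweet_ratio",
)


def make_cached_predictor(predict_fn, maxsize=1024):
    """Wrap `predict_fn(user_dict)` so identical requests reuse the cached result."""

    @lru_cache(maxsize=maxsize)
    def _cached(key):
        # Rebuild the dict from the hashable tuple key before calling the model.
        return predict_fn(dict(zip(FIELD_ORDER, key)))

    def predict(user):
        return _cached(tuple(user[name] for name in FIELD_ORDER))

    predict.cache_info = _cached.cache_info  # expose hit/miss statistics
    return predict
```

Caching only pays off when the same feature values recur; it should be cleared or rebuilt whenever a retrained model is loaded.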

Getting Help

Check this README thoroughly
Review QUICKSTART.md for setup issues
Check DEPLOYMENT.md for deployment issues
Review test output for specific errors
Check GitHub Actions logs for CI/CD issues
Open an issue on GitHub with error details

Logs Location

Application logs: logs/ directory
API logs: stdout when running python src/api/main.py
Test logs: pytest output
CI/CD logs: GitHub Actions → Workflow run → Logs

Contributing

Contributions are welcome. Please follow these guidelines:

Fork the repository
Create a feature branch (git checkout -b feature/new-feature)
Make your changes
Add tests for new functionality
Ensure all tests pass (pytest)
Commit with clear messages
Push to your fork
Create a Pull Request with description of changes

Code Style

Follow PEP 8 style guide
Use Black for formatting: black src/
Use Flake8 for linting: flake8 src/
Add docstrings to all functions and classes
Write unit tests for new features

License

This project is licensed under the MIT License.

Acknowledgments

This project demonstrates machine learning engineering best practices including data processing, feature engineering, model training with anti-overfitting measures, API development, comprehensive testing, CI/CD automation, and deployment strategies.

Developed as a portfolio project showcasing skills in:

Machine Learning (scikit-learn, XGBoost)
Data Science (Pandas, NumPy)
API Development (FastAPI)
Software Engineering (Testing, CI/CD, Docker)
Cryptography (SHA-256 hashing)
DevOps (Docker, GitHub Actions, Cloud Deployment)

Contact

For questions, issues, or suggestions, please open an issue on GitHub or submit a pull request.

Version History

v1.0.0 (2025-01-30): Initial release
- Complete ML pipeline
- FastAPI REST API
- Docker deployment
- CI/CD with GitHub Actions
- Comprehensive test suite
