Skip to content

atlasruntime/threat-investigation-workbench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Threat Investigation Workbench

Python SQLite Streamlit Synthetic Data

A synthetic trust & safety technical investigation workbench for SQL-driven threat investigations, abuse-detection concepts, entity correlation, timeline reconstruction, graph analysis, escalation triage, and analyst workflow design.

This project is educational. It uses fictional telemetry, fictional entities, fictional scenarios, and generalized detection logic. It is not an operational abuse-detection system.

Table of Contents

Why This Project Exists

Trust & safety and platform integrity work often depend on careful investigation rather than a single detection result. Analysts and investigators need to understand how accounts, infrastructure, reports, content events, and enforcement history fit together over time.

This repository demonstrates that workflow in a safe, synthetic environment:

  • SQL as an explainable investigative artifact.
  • Local telemetry modeling with SQLite.
  • Case triage and escalation workflow.
  • Entity pivoting across users, devices, IPs, reports, and enforcement context.
  • Timeline reconstruction for operational reasoning.
  • Graph/network visualization for cluster review.

The goal is to show investigative thinking and tooling design, not to publish real-world detection tradecraft.

Feature Summary

Area What it demonstrates
Synthetic telemetry Deterministic fictional users, logins, content events, reports, and enforcements
SQL detections Readable, commented SQL for generalized investigation patterns
Risk scoring Deterministic, explainable prioritization without ML or AI scoring
Case generation Consolidates detection results into reviewable case rows
Streamlit workbench Dashboard, case queue, escalation workflow, case detail, notes, timeline, and graph
Timeline reconstruction Chronological review of account, activity, report, linkage, and enforcement events
Entity graph Minimal visual pivoting across users, shared infrastructure, and enforcement context
Documentation Case studies, screenshot checklist, safety disclaimer, and architecture notes

Architecture

flowchart TD
    A["Synthetic telemetry generator<br/>users, logins, content, reports, enforcements"] --> B["CSV files<br/>data/synthetic_raw/"]
    B --> C["SQLite build<br/>db/workbench.sqlite"]
    C --> D["SQL detections<br/>detections/sql/*.sql"]
    D --> E["Detection outputs<br/>data/processed/*_results.csv"]
    E --> F["Deterministic risk scoring<br/>detections/python/risk_scoring.py"]
    F --> G["Generated cases<br/>data/processed/generated_cases.csv"]
    G --> H["Streamlit workbench<br/>dashboard, queue, escalation, detail, timeline, graph"]
    H --> I["Local workflow state<br/>data/processed/case_workflow_state.json"]
Loading

The same diagram is also available as a standalone Mermaid file at docs/architecture_diagram.mmd.

Design principles:

  • Synthetic data only.
  • SQLite remains the local source of truth.
  • SQL detections are readable and explainable.
  • No ML or AI scoring.
  • Workflow state is lightweight and local.
  • Generated data and local state are ignored by Git.

Screenshots

Dashboard overview Case queue Escalation workflow Case detail Timeline reconstruction Investigation graph

Investigation Workflow Walkthrough

  1. Generate synthetic telemetry and build the local SQLite database.
  2. Run SQL detections to produce investigator-readable CSV results.
  3. Generate consolidated cases with deterministic risk scores and recommended priorities.
  4. Open the Streamlit workbench.
  5. Use the Dashboard to understand case volume and priority distribution.
  6. Use the Escalation Queue to filter by priority or recommended action.
  7. Use Escalation Workflow to review routing labels, synthetic review indicators, SLA status, handoff state, and analyst notes.
  8. Open Case Detail to review summary, evidence tags, findings, and workflow state.
  9. Add analyst notes, disposition status, reviewer name, and review timestamp.
  10. Use Entity Correlation and Investigation Graph to pivot across linked users and infrastructure.
  11. Use Timeline Reconstruction to understand event sequence.

Trust & Safety Concepts Demonstrated

  • SQL-driven investigation and detection review.
  • Abuse triage and prioritization.
  • Evidence consolidation across multiple weak signals.
  • Entity correlation and graph-based pivots.
  • Shared infrastructure analysis in synthetic data.
  • Timeline reconstruction and sequence reasoning.
  • Workflow disposition tracking.
  • Escalation queue design.
  • Escalation routing and analyst handoff concepts.
  • Documentation discipline and analyst handoff.
  • Safety-conscious abstraction of investigation concepts.

Synthetic Data

The generator creates deterministic fictional telemetry from a random seed:

  • users.csv
  • login_events.csv
  • content_events.csv
  • reports.csv
  • enforcements.csv

The dataset includes mostly benign synthetic behavior plus intentionally seeded fictional scenarios for portfolio demonstration. Generated CSVs, processed outputs, workflow state, and SQLite databases are ignored by Git.

SQL Detection Examples

The detections/sql/ directory contains generalized, educational SQL:

  • shared_infra.sql: runner-compatible shared-device detection.
  • shared_infrastructure.sql: readable shared-infrastructure portfolio example.
  • ban_evasion.sql: fictional linkage to banned/suspended synthetic accounts.
  • velocity_spike.sql: concentrated activity in fictional young accounts.
  • coordinated_activity.sql: repeated synthetic content-hash correlation.
  • suspicious_prompting.sql: abstract prompt-abuse proxy using fictional labels.

These examples are deliberately simplified. They should not be treated as operational thresholds, production classifiers, or enforcement logic.

Case Studies

Concise synthetic walkthroughs live in docs/case_studies/:

Each case study covers the scenario summary, telemetry involved, detections triggered, analyst reasoning, escalation decision, and recommended action.

Folder Structure

app/
  streamlit_app.py              # Streamlit analyst console
db/
  schema.sql                    # SQLite schema
detections/
  python/risk_scoring.py        # Deterministic scoring helpers
  sql/*.sql                     # Educational SQL detections
docs/
  architecture_diagram.mmd
  case_studies/
  screenshots/
  strategic_disclaimer.md
scripts/
  generate_synthetic_data.py
  build_db.py
  run_detections.py
  generate_cases.py
data/
  synthetic_raw/                # Generated, ignored
  processed/                    # Generated, ignored

Local Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run Commands

python scripts/generate_synthetic_data.py --seed 42 --users 1000
python scripts/build_db.py
python scripts/run_detections.py
python scripts/generate_cases.py
streamlit run app/streamlit_app.py

With the local virtual environment:

.venv/bin/python scripts/generate_synthetic_data.py --seed 42 --users 1000
.venv/bin/python scripts/build_db.py
.venv/bin/python scripts/run_detections.py
.venv/bin/python scripts/generate_cases.py
.venv/bin/streamlit run app/streamlit_app.py

Adjusting Dataset Size

The synthetic telemetry generator supports configurable dataset sizes by adjusting the number of users.

Example (generate 5000 users):

python scripts/generate_synthetic_data.py --users 5000 --seed 42

This allows larger synthetic datasets for testing timeline reconstruction, entity correlation, graph analysis, and SQL detections at different scales.

Deterministic Synthetic Data

The optional --seed parameter makes synthetic data generation reproducible across runs. Changing the seed generates a different synthetic telemetry dataset.

Example (any integer can be chosen for the seed value; in this case 17):

python scripts/generate_synthetic_data.py --seed 17 --users 1000

Ethical and Synthetic-Use Disclaimer

This repository is a synthetic educational project.

  • It contains no real telemetry.
  • It contains no real users, accounts, incidents, investigations, or enforcement decisions.
  • It does not implement real-world operational detection logic.
  • It does not include production thresholds, sensitive investigative tradecraft, or deployable enforcement automation.
  • All labels, entities, scores, cases, and scenarios are fictional and simplified.
  • It is not intended for production deployment, monitoring, surveillance, or real-world platform enforcement.

About

Synthetic threat, trust & safety investigation workbench demonstrating SQL-driven abuse detection, escalation workflows, entity correlation, and analyst tooling.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages