Threat Investigation Workbench

A synthetic trust & safety technical investigation workbench for SQL-driven threat investigations, abuse-detection concepts, entity correlation, timeline reconstruction, graph analysis, escalation triage, and analyst workflow design.

This project is educational. It uses fictional telemetry, fictional entities, fictional scenarios, and generalized detection logic. It is not an operational abuse-detection system.

Why This Project Exists
Feature Summary
Architecture
Screenshots
Investigation Workflow Walkthrough
Trust & Safety Concepts Demonstrated
Synthetic Data
SQL Detection Examples
Case Studies
Folder Structure
Local Setup
Run Commands
Adjusting Dataset Size
Deterministic Synthetic Data
Ethical and Synthetic-Use Disclaimer

Why This Project Exists

Trust & safety and platform integrity work often depend on careful investigation rather than a single detection result. Analysts and investigators need to understand how accounts, infrastructure, reports, content events, and enforcement history fit together over time.

This repository demonstrates that workflow in a safe, synthetic environment:

SQL as an explainable investigative artifact.
Local telemetry modeling with SQLite.
Case triage and escalation workflow.
Entity pivoting across users, devices, IPs, reports, and enforcement context.
Timeline reconstruction for operational reasoning.
Graph/network visualization for cluster review.

The goal is to show investigative thinking and tooling design, not to publish real-world detection tradecraft.

Feature Summary

Area	What it demonstrates
Synthetic telemetry	Deterministic fictional users, logins, content events, reports, and enforcements
SQL detections	Readable, commented SQL for generalized investigation patterns
Risk scoring	Deterministic, explainable prioritization without ML or AI scoring
Case generation	Consolidates detection results into reviewable case rows
Streamlit workbench	Dashboard, case queue, escalation workflow, case detail, notes, timeline, and graph
Timeline reconstruction	Chronological review of account, activity, report, linkage, and enforcement events
Entity graph	Minimal visual pivoting across users, shared infrastructure, and enforcement context
Documentation	Case studies, screenshot checklist, safety disclaimer, and architecture notes

Architecture

flowchart TD
    A["Synthetic telemetry generator<br/>users, logins, content, reports, enforcements"] --> B["CSV files<br/>data/synthetic_raw/"]
    B --> C["SQLite build<br/>db/workbench.sqlite"]
    C --> D["SQL detections<br/>detections/sql/*.sql"]
    D --> E["Detection outputs<br/>data/processed/*_results.csv"]
    E --> F["Deterministic risk scoring<br/>detections/python/risk_scoring.py"]
    F --> G["Generated cases<br/>data/processed/generated_cases.csv"]
    G --> H["Streamlit workbench<br/>dashboard, queue, escalation, detail, timeline, graph"]
    H --> I["Local workflow state<br/>data/processed/case_workflow_state.json"]

The same diagram is also available as a standalone Mermaid file at docs/architecture_diagram.mmd.

Design principles:

Synthetic data only.
SQLite remains the local source of truth.
SQL detections are readable and explainable.
No ML or AI scoring.
Workflow state is lightweight and local.
Generated data and local state are ignored by Git.

Screenshots

Investigation Workflow Walkthrough

Generate synthetic telemetry and build the local SQLite database.
Run SQL detections to produce investigator-readable CSV results.
Generate consolidated cases with deterministic risk scores and recommended priorities.
Open the Streamlit workbench.
Use the Dashboard to understand case volume and priority distribution.
Use the Escalation Queue to filter by priority or recommended action.
Use Escalation Workflow to review routing labels, synthetic review indicators, SLA status, handoff state, and analyst notes.
Open Case Detail to review summary, evidence tags, findings, and workflow state.
Add analyst notes, disposition status, reviewer name, and review timestamp.
Use Entity Correlation and Investigation Graph to pivot across linked users and infrastructure.
Use Timeline Reconstruction to understand event sequence.

Trust & Safety Concepts Demonstrated

SQL-driven investigation and detection review.
Abuse triage and prioritization.
Evidence consolidation across multiple weak signals.
Entity correlation and graph-based pivots.
Shared infrastructure analysis in synthetic data.
Timeline reconstruction and sequence reasoning.
Workflow disposition tracking.
Escalation queue design.
Escalation routing and analyst handoff concepts.
Documentation discipline and analyst handoff.
Safety-conscious abstraction of investigation concepts.

Synthetic Data

The generator creates deterministic fictional telemetry from a random seed:

users.csv
login_events.csv
content_events.csv
reports.csv
enforcements.csv

The dataset includes mostly benign synthetic behavior plus intentionally seeded fictional scenarios for portfolio demonstration. Generated CSVs, processed outputs, workflow state, and SQLite databases are ignored by Git.

SQL Detection Examples

The detections/sql/ directory contains generalized, educational SQL:

shared_infra.sql: runner-compatible shared-device detection.
shared_infrastructure.sql: readable shared-infrastructure portfolio example.
ban_evasion.sql: fictional linkage to banned/suspended synthetic accounts.
velocity_spike.sql: concentrated activity in fictional young accounts.
coordinated_activity.sql: repeated synthetic content-hash correlation.
suspicious_prompting.sql: abstract prompt-abuse proxy using fictional labels.

These examples are deliberately simplified. They should not be treated as operational thresholds, production classifiers, or enforcement logic.

Case Studies

Concise synthetic walkthroughs live in docs/case_studies/:

Each case study covers the scenario summary, telemetry involved, detections triggered, analyst reasoning, escalation decision, and recommended action.

Folder Structure

app/
  streamlit_app.py              # Streamlit analyst console
db/
  schema.sql                    # SQLite schema
detections/
  python/risk_scoring.py        # Deterministic scoring helpers
  sql/*.sql                     # Educational SQL detections
docs/
  architecture_diagram.mmd
  case_studies/
  screenshots/
  strategic_disclaimer.md
scripts/
  generate_synthetic_data.py
  build_db.py
  run_detections.py
  generate_cases.py
data/
  synthetic_raw/                # Generated, ignored
  processed/                    # Generated, ignored

Local Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Run Commands

python scripts/generate_synthetic_data.py --seed 42 --users 1000
python scripts/build_db.py
python scripts/run_detections.py
python scripts/generate_cases.py
streamlit run app/streamlit_app.py

With the local virtual environment:

.venv/bin/python scripts/generate_synthetic_data.py --seed 42 --users 1000
.venv/bin/python scripts/build_db.py
.venv/bin/python scripts/run_detections.py
.venv/bin/python scripts/generate_cases.py
.venv/bin/streamlit run app/streamlit_app.py

Adjusting Dataset Size

The synthetic telemetry generator supports configurable dataset sizes by adjusting the number of users.

Example (generate 5000 users):

python scripts/generate_synthetic_data.py --users 5000 --seed 42

This allows larger synthetic datasets for testing timeline reconstruction, entity correlation, graph analysis, and SQL detections at different scales.

Deterministic Synthetic Data

The optional --seed parameter makes synthetic data generation reproducible across runs. Changing the seed generates a different synthetic telemetry dataset.

Example (any integer can be chosen for the seed value; in this case 17):

python scripts/generate_synthetic_data.py --seed 17 --users 1000

Ethical and Synthetic-Use Disclaimer

This repository is a synthetic educational project.

It contains no real telemetry.
It contains no real users, accounts, incidents, investigations, or enforcement decisions.
It does not implement real-world operational detection logic.
It does not include production thresholds, sensitive investigative tradecraft, or deployable enforcement automation.
All labels, entities, scores, cases, and scenarios are fictional and simplified.
It is not intended for production deployment, monitoring, surveillance, or real-world platform enforcement.

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
.github/workflows		.github/workflows
app		app
db		db
detections		detections
docs		docs
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Threat Investigation Workbench

Table of Contents

Why This Project Exists

Feature Summary

Architecture

Screenshots

Investigation Workflow Walkthrough

Trust & Safety Concepts Demonstrated

Synthetic Data

SQL Detection Examples

Case Studies

Folder Structure

Local Setup

Run Commands

Adjusting Dataset Size

Deterministic Synthetic Data

Ethical and Synthetic-Use Disclaimer

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Threat Investigation Workbench

Table of Contents

Why This Project Exists

Feature Summary

Architecture

Screenshots

Investigation Workflow Walkthrough

Trust & Safety Concepts Demonstrated

Synthetic Data

SQL Detection Examples

Case Studies

Folder Structure

Local Setup

Run Commands

Adjusting Dataset Size

Deterministic Synthetic Data

Ethical and Synthetic-Use Disclaimer

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages