A synthetic trust & safety technical investigation workbench for SQL-driven threat investigations, abuse-detection concepts, entity correlation, timeline reconstruction, graph analysis, escalation triage, and analyst workflow design.
This project is educational. It uses fictional telemetry, fictional entities, fictional scenarios, and generalized detection logic. It is not an operational abuse-detection system.
- Why This Project Exists
- Feature Summary
- Architecture
- Screenshots
- Investigation Workflow Walkthrough
- Trust & Safety Concepts Demonstrated
- Synthetic Data
- SQL Detection Examples
- Case Studies
- Folder Structure
- Local Setup
- Run Commands
- Adjusting Dataset Size
- Deterministic Synthetic Data
- Ethical and Synthetic-Use Disclaimer
Trust & safety and platform integrity work often depend on careful investigation rather than a single detection result. Analysts and investigators need to understand how accounts, infrastructure, reports, content events, and enforcement history fit together over time.
This repository demonstrates that workflow in a safe, synthetic environment:
- SQL as an explainable investigative artifact.
- Local telemetry modeling with SQLite.
- Case triage and escalation workflow.
- Entity pivoting across users, devices, IPs, reports, and enforcement context.
- Timeline reconstruction for operational reasoning.
- Graph/network visualization for cluster review.
The goal is to show investigative thinking and tooling design, not to publish real-world detection tradecraft.
| Area | What it demonstrates |
|---|---|
| Synthetic telemetry | Deterministic fictional users, logins, content events, reports, and enforcements |
| SQL detections | Readable, commented SQL for generalized investigation patterns |
| Risk scoring | Deterministic, explainable prioritization without ML or AI scoring |
| Case generation | Consolidates detection results into reviewable case rows |
| Streamlit workbench | Dashboard, case queue, escalation workflow, case detail, notes, timeline, and graph |
| Timeline reconstruction | Chronological review of account, activity, report, linkage, and enforcement events |
| Entity graph | Minimal visual pivoting across users, shared infrastructure, and enforcement context |
| Documentation | Case studies, screenshot checklist, safety disclaimer, and architecture notes |
flowchart TD
A["Synthetic telemetry generator<br/>users, logins, content, reports, enforcements"] --> B["CSV files<br/>data/synthetic_raw/"]
B --> C["SQLite build<br/>db/workbench.sqlite"]
C --> D["SQL detections<br/>detections/sql/*.sql"]
D --> E["Detection outputs<br/>data/processed/*_results.csv"]
E --> F["Deterministic risk scoring<br/>detections/python/risk_scoring.py"]
F --> G["Generated cases<br/>data/processed/generated_cases.csv"]
G --> H["Streamlit workbench<br/>dashboard, queue, escalation, detail, timeline, graph"]
H --> I["Local workflow state<br/>data/processed/case_workflow_state.json"]
The same diagram is also available as a standalone Mermaid file at docs/architecture_diagram.mmd.
Design principles:
- Synthetic data only.
- SQLite remains the local source of truth.
- SQL detections are readable and explainable.
- No ML or AI scoring.
- Workflow state is lightweight and local.
- Generated data and local state are ignored by Git.
- Generate synthetic telemetry and build the local SQLite database.
- Run SQL detections to produce investigator-readable CSV results.
- Generate consolidated cases with deterministic risk scores and recommended priorities.
- Open the Streamlit workbench.
- Use the Dashboard to understand case volume and priority distribution.
- Use the Escalation Queue to filter by priority or recommended action.
- Use Escalation Workflow to review routing labels, synthetic review indicators, SLA status, handoff state, and analyst notes.
- Open Case Detail to review summary, evidence tags, findings, and workflow state.
- Add analyst notes, disposition status, reviewer name, and review timestamp.
- Use Entity Correlation and Investigation Graph to pivot across linked users and infrastructure.
- Use Timeline Reconstruction to understand event sequence.
- SQL-driven investigation and detection review.
- Abuse triage and prioritization.
- Evidence consolidation across multiple weak signals.
- Entity correlation and graph-based pivots.
- Shared infrastructure analysis in synthetic data.
- Timeline reconstruction and sequence reasoning.
- Workflow disposition tracking.
- Escalation queue design.
- Escalation routing and analyst handoff concepts.
- Documentation discipline and analyst handoff.
- Safety-conscious abstraction of investigation concepts.
The generator creates deterministic fictional telemetry from a random seed:
users.csvlogin_events.csvcontent_events.csvreports.csvenforcements.csv
The dataset includes mostly benign synthetic behavior plus intentionally seeded fictional scenarios for portfolio demonstration. Generated CSVs, processed outputs, workflow state, and SQLite databases are ignored by Git.
The detections/sql/ directory contains generalized, educational SQL:
shared_infra.sql: runner-compatible shared-device detection.shared_infrastructure.sql: readable shared-infrastructure portfolio example.ban_evasion.sql: fictional linkage to banned/suspended synthetic accounts.velocity_spike.sql: concentrated activity in fictional young accounts.coordinated_activity.sql: repeated synthetic content-hash correlation.suspicious_prompting.sql: abstract prompt-abuse proxy using fictional labels.
These examples are deliberately simplified. They should not be treated as operational thresholds, production classifiers, or enforcement logic.
Concise synthetic walkthroughs live in docs/case_studies/:
Each case study covers the scenario summary, telemetry involved, detections triggered, analyst reasoning, escalation decision, and recommended action.
app/
streamlit_app.py # Streamlit analyst console
db/
schema.sql # SQLite schema
detections/
python/risk_scoring.py # Deterministic scoring helpers
sql/*.sql # Educational SQL detections
docs/
architecture_diagram.mmd
case_studies/
screenshots/
strategic_disclaimer.md
scripts/
generate_synthetic_data.py
build_db.py
run_detections.py
generate_cases.py
data/
synthetic_raw/ # Generated, ignored
processed/ # Generated, ignored
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txtpython scripts/generate_synthetic_data.py --seed 42 --users 1000
python scripts/build_db.py
python scripts/run_detections.py
python scripts/generate_cases.py
streamlit run app/streamlit_app.pyWith the local virtual environment:
.venv/bin/python scripts/generate_synthetic_data.py --seed 42 --users 1000
.venv/bin/python scripts/build_db.py
.venv/bin/python scripts/run_detections.py
.venv/bin/python scripts/generate_cases.py
.venv/bin/streamlit run app/streamlit_app.pyThe synthetic telemetry generator supports configurable dataset sizes by adjusting the number of users.
Example (generate 5000 users):
python scripts/generate_synthetic_data.py --users 5000 --seed 42This allows larger synthetic datasets for testing timeline reconstruction, entity correlation, graph analysis, and SQL detections at different scales.
The optional --seed parameter makes synthetic data generation reproducible across runs. Changing the seed generates a different synthetic telemetry dataset.
Example (any integer can be chosen for the seed value; in this case 17):
python scripts/generate_synthetic_data.py --seed 17 --users 1000This repository is a synthetic educational project.
- It contains no real telemetry.
- It contains no real users, accounts, incidents, investigations, or enforcement decisions.
- It does not implement real-world operational detection logic.
- It does not include production thresholds, sensitive investigative tradecraft, or deployable enforcement automation.
- All labels, entities, scores, cases, and scenarios are fictional and simplified.
- It is not intended for production deployment, monitoring, surveillance, or real-world platform enforcement.





