Automatically classify ops logs by severity, generate actionable remediations, and push notifications to Slack and JIRA — all orchestrated by a LangGraph multi-agent pipeline with a live Streamlit dashboard.
The Multi-Agent DevOps Incident Analysis Suite ingests raw operational logs in any format (syslog, JSON, CSV, plain text), routes them through a chain of specialized AI agents, and surfaces structured findings in an interactive dashboard. A central LangGraph orchestrator drives the pipeline: it classifies every log entry by severity, generates root-cause remediations, and then fans out to downstream agents based on how critical the issues are. The result is a single-click workflow that takes unstructured noise and produces a prioritized incident runbook, Slack alerts, and JIRA-style tickets — in under a minute.
The system follows a hub-and-spoke pattern. The orchestrator owns the shared `IncidentState` and decides which agents to invoke after each step.

                ┌───────────────────┐
                │   Streamlit UI    │
                │    (Dashboard)    │
                └────────┬──────────┘
                         │ upload / paste logs
                         ▼
                ┌───────────────────┐
                │   Orchestrator    │
                │    (LangGraph)    │
                │                   │
                │   Shared State:   │
                │   IncidentState   │
                └────────┬──────────┘
                         │
       ┌─────────────────▼──────────────────┐
       │          classifier node           │
       └─────────────────┬──────────────────┘
                         │
       ┌─────────────────▼──────────────────┐
       │          remediation node          │
       └──────┬──────────┬──────────┬───────┘
              │          │          │
    ┌─────────▼──┐  ┌────▼────┐  ┌──▼──────────┐
    │  Cookbook  │  │  Slack  │  │ JIRA Ticket │
    │ Synthesizer│  │Notifier │  │    Agent    │
    └────────────┘  └─────────┘  └─────────────┘

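
As a rough illustration of the shared state, the sketch below declares an `IncidentState` TypedDict whose keys are the agent inputs and outputs named in this README. The element types are simplified to plain dicts because the fields of `LogEntry`, `Remediation`, `SlackMessage`, and `JIRATicket` are not fully specified here, and `initial_state` is a hypothetical helper, not necessarily what `orchestrator/state.py` contains.

```python
from typing import Any, TypedDict


class IncidentState(TypedDict, total=False):
    """Shared state passed between nodes; keys mirror the agent inputs/outputs."""

    raw_logs: str                               # raw text from upload or paste
    classified_entries: list[dict[str, Any]]    # written by the classifier
    remediations: list[dict[str, Any]]          # written by the remediation agent
    cookbook: str                               # markdown runbook
    slack_notifications: list[dict[str, Any]]   # messages plus send status
    jira_tickets: list[dict[str, Any]]          # live or mock tickets


def initial_state(raw_logs: str) -> IncidentState:
    """Hypothetical helper: build the starting state with only raw_logs filled in."""
    return {
        "raw_logs": raw_logs,
        "classified_entries": [],
        "remediations": [],
        "cookbook": "",
        "slack_notifications": [],
        "jira_tickets": [],
    }
```

Each node would then read the keys it needs and return only the keys it updates.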
After the Remediation Agent completes, the orchestrator inspects the highest severity found across all classified entries and fans out accordingly:
| Highest Severity | Agents Invoked |
|---|---|
| CRITICAL or HIGH | Cookbook Synthesizer + Slack Notifier + JIRA Ticket Agent |
| MEDIUM | Cookbook Synthesizer + Slack Notifier |
| LOW | Cookbook Synthesizer only |
All agents in the fan-out receive the full remediation list regardless of the trigger severity.
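
The routing table above can be sketched as a small pure function. This is a minimal sketch, not the contents of `orchestrator/router.py`: the node names (`cookbook`, `slack_notifier`, `jira_ticket`) and the fallback to `LOW` for an empty input are assumptions made for illustration.

```python
from typing import Iterable

# Ordered from most to least severe; the four labels come from this README.
SEVERITY_ORDER = ["CRITICAL", "HIGH", "MEDIUM", "LOW"]

# Hypothetical node names; the real graph may name its nodes differently.
FAN_OUT = {
    "CRITICAL": ["cookbook", "slack_notifier", "jira_ticket"],
    "HIGH": ["cookbook", "slack_notifier", "jira_ticket"],
    "MEDIUM": ["cookbook", "slack_notifier"],
    "LOW": ["cookbook"],
}


def highest_severity(severities: Iterable[str]) -> str:
    """Return the most severe label present (assumption: LOW when none match)."""
    seen = {s.upper() for s in severities}
    for level in SEVERITY_ORDER:
        if level in seen:
            return level
    return "LOW"


def route_after_remediation(severities: Iterable[str]) -> list[str]:
    """Pick the downstream nodes for the highest severity found."""
    return list(FAN_OUT[highest_severity(severities)])
```

A function of this shape could plausibly be handed to LangGraph's `add_conditional_edges` on the remediation node, with the severities read from the classified entries in state.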
- Multi-agent orchestration via LangGraph with a shared `IncidentState` TypedDict that flows through every node
- Severity-based conditional routing: downstream agents are selected dynamically at runtime, not hardcoded
- Multi-format log parsing: handles syslog, JSON arrays, CSV, and plain-text log lines in a single LLM call using few-shot prompting
- Real Slack integration: formats Block Kit messages and posts to a configured channel via `slack-sdk`; send status is captured back to state
- JIRA ticket creation: when `JIRA_*` env vars are configured, creates real JIRA issues via `jira-python`; falls back to structured mock tickets (clearly labelled `MOCK-N`) when credentials are missing, so the dashboard always renders something useful
- Dark-theme Streamlit dashboard: six-tab layout styled after New Relic / GitHub dark, with sidebar severity breakdown and category tags
- Agent execution trace: every node records start time, end time, input summary, and output summary; visualised as a trace bar and detailed table in the dashboard
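
One way to collect the execution trace described in the last bullet is to wrap each node function. The sketch below is an assumption about how this might be done: the `traced` wrapper, the `agent_trace` state key, and the key-list summaries are hypothetical and chosen only for illustration.

```python
import time
from typing import Any, Callable

State = dict[str, Any]


def traced(name: str, node: Callable[[State], State]) -> Callable[[State], State]:
    """Wrap a node so each call appends one trace record to its state update."""

    def wrapper(state: State) -> State:
        started = time.time()
        update = node(state)  # run the real agent
        record = {
            "agent": name,
            "start_time": started,
            "end_time": time.time(),
            # Summaries here are just key lists; a real trace could be richer.
            "input_summary": f"keys={sorted(state)}",
            "output_summary": f"keys={sorted(update)}",
        }
        # "agent_trace" is a hypothetical state key, not confirmed by the source.
        trace = list(state.get("agent_trace", [])) + [record]
        return {**update, "agent_trace": trace}

    return wrapper
```

A node would be registered as `traced("classifier", classify)` instead of `classify`, so the dashboard can read timing and I/O summaries from a single list.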
| Component | Technology | Version |
|---|---|---|
| Orchestration | LangGraph | 0.4.1 |
| LLM framework | LangChain | 0.3.25 |
| LLM provider | OpenAI-compatible via OpenRouter (langchain-openai) | 0.3.12 |
| Dashboard | Streamlit | 1.45.1 |
| Slack integration | slack-sdk | 3.34.0 |
| Configuration | python-dotenv | 1.1.0 |
| Testing | pytest | 8.3.5 |
The application ships a thin `utils/llm.py` wrapper that maps `OPENROUTER_API_KEY` to the standard OpenAI environment variables, so any OpenRouter-hosted model (default: `openai/gpt-4o`) works without code changes.
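
The mapping could look roughly like the sketch below. It is not the actual `utils/llm.py`: the function name `configure_openrouter` is hypothetical, and setting both `OPENAI_BASE_URL` and `OPENAI_API_BASE` is an assumption made so that either the OpenAI SDK or `langchain-openai` would pick up the OpenRouter endpoint.

```python
import os
from typing import MutableMapping

OPENROUTER_BASE_URL = "https://openrouter.ai/api/v1"
DEFAULT_MODEL = "openai/gpt-4o"  # default model named in this README


def configure_openrouter(env: MutableMapping[str, str] = os.environ) -> str:
    """Copy the OpenRouter key into the OpenAI variables and return the model name."""
    key = env.get("OPENROUTER_API_KEY")
    if not key:
        raise RuntimeError("OPENROUTER_API_KEY is not set")
    env["OPENAI_API_KEY"] = key
    # Assumption: both spellings are set to cover the SDK and LangChain lookups.
    env["OPENAI_BASE_URL"] = OPENROUTER_BASE_URL
    env["OPENAI_API_BASE"] = OPENROUTER_BASE_URL
    return env.get("LLM_MODEL") or DEFAULT_MODEL
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real process environment.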

    multi-agent-devops-suite/
    ├── app.py  # Streamlit entry point: sidebar, tab layout, analysis loop
    ├── requirements.txt
    ├── .env.example  # API key template
    │
    ├── agents/
    │   ├── classifier.py  # Log Classifier Agent: parses & classifies log entries
    │   ├── remediation.py  # Remediation Agent: root-cause analysis & fix steps
    │   ├── cookbook.py  # Cookbook Synthesizer: builds a markdown incident runbook
    │   ├── slack_notifier.py  # Slack Notification Agent: posts Block Kit messages
    │   └── jira_ticket.py  # JIRA Ticket Agent (mocked): generates ticket objects
    │
    ├── orchestrator/
    │   ├── graph.py  # LangGraph StateGraph definition & node wiring
    │   ├── state.py  # IncidentState TypedDict and all data models
    │   └── router.py  # Severity-based conditional routing logic
    │
    ├── ui/
    │   ├── components.py  # Reusable Streamlit components (trace bar, severity counts)
    │   ├── tabs.py  # Renderers for all six dashboard tabs
    │   └── theme.css  # Custom dark-theme CSS
    │
    ├── utils/
    │   ├── llm.py  # LLM client factory (OpenRouter → OpenAI-compatible)
    │   ├── log_parser.py  # File reading and format detection helpers
    │   └── slack_client.py  # slack-sdk wrapper
    │
    ├── tests/  # pytest test suite (one file per agent/module)
    │
    ├── sample_logs/
    │   ├── mixed_incident.log  # Mixed-severity syslog-style incident
    │   ├── k8s_crash.json  # Kubernetes crash events in JSON array format
    │   ├── app_errors.csv  # Application errors in CSV format
    │   ├── app_errors_1.csv  # Additional application error sample
    │   ├── LogFile.txt  # Plain-text incident log
    │   └── system_logs_mixed_1000.log  # 1000-line mixed access + system log for load testing
    │
    └── docs/
        └── superpowers/specs/  # Design specifications

1. Clone the repository

       git clone <repo-url>
       cd multi-agent-devops-suite

2. Create and activate a virtual environment

       python -m venv .venv
       # macOS/Linux
       source .venv/bin/activate
       # Windows
       .venv\Scripts\activate

3. Install dependencies

       pip install -r requirements.txt

4. Configure environment variables

       cp .env.example .env

5. Add your API keys to `.env`

       OPENROUTER_API_KEY=openrouter-keys
       LLM_MODEL=llm-model
       SLACK_BOT_TOKEN=slack-bot-token  # e.g. xoxb-......
       SLACK_CHANNEL=slack-channel
       JIRA_INSTANCE_URL=jira-url  # e.g. https://.......atlassian.net
       JIRA_USER_EMAIL=jira-email
       JIRA_API_TOKEN=put-token-to-log-incident
       JIRA_PROJECT_KEY=workspace-or-project-token

   - Get an OpenRouter key at openrouter.ai
   - Create a Slack bot with the `chat:write` scope and invite it to `SLACK_CHANNEL`
   - Log in to Jira and create a `Kanban` or `Scrum` project. Use its Project Key as `JIRA_PROJECT_KEY`.
   - Visit the Atlassian security page and create an API token

6. Run the dashboard

       streamlit run app.py

The app opens at `http://localhost:8501`.

1. Upload or paste logs: use the sidebar file uploader (supports `.log`, `.txt`, `.json`, `.csv`) or paste raw log text directly into the text area
2. Click "Analyze Logs": the multi-agent pipeline runs; a spinner shows progress
3. Explore results across six tabs:

   | Tab | Contents |
   |---|---|
   | Analysis | Classified log entries as severity-coded cards |
   | Remediations | Root-cause cards with numbered fix steps and confidence scores |
   | Cookbook | Rendered markdown incident runbook / checklist |
   | Slack Log | History of Slack messages sent with send status |
   | JIRA Tickets | Mocked ticket cards (title, priority, description, labels) |
   | Agent Trace | Full execution graph with per-node timing and I/O summaries |

4. The sidebar updates after analysis to show a severity breakdown counter and issue-category tags.

Log Classifier Agent (`agents/classifier.py`)

Input: `raw_logs` (raw string from upload or paste)
Output: `classified_entries: list[LogEntry]`

Sends all log lines to the LLM in a single call using few-shot examples. Extracts timestamp, severity (CRITICAL/HIGH/MEDIUM/LOW), category (OOM, timeout, auth_failure, disk, network, …), source, and a one-line summary for every entry.
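
To make the classifier's output concrete, the sketch below parses a JSON array returned by the LLM into `LogEntry` objects carrying the five fields listed above. The dataclass layout, the JSON response format, and the fallback of unknown severities to `LOW` are assumptions for illustration; the real model in `orchestrator/state.py` may differ.

```python
import json
from dataclasses import dataclass

VALID_SEVERITIES = {"CRITICAL", "HIGH", "MEDIUM", "LOW"}


@dataclass
class LogEntry:
    timestamp: str
    severity: str
    category: str
    source: str
    summary: str


def parse_classifier_output(llm_json: str) -> list[LogEntry]:
    """Turn the model's JSON array (assumed format) into LogEntry objects."""
    entries = []
    for item in json.loads(llm_json):
        severity = str(item.get("severity", "")).upper()
        if severity not in VALID_SEVERITIES:
            severity = "LOW"  # assumption: unknown labels degrade to LOW
        entries.append(
            LogEntry(
                timestamp=str(item.get("timestamp", "")),
                severity=severity,
                category=str(item.get("category", "unknown")),
                source=str(item.get("source", "")),
                summary=str(item.get("summary", "")),
            )
        )
    return entries
```

Normalising severity at this boundary means the router only ever sees the four known labels.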
Remediation Agent (`agents/remediation.py`)

Input: `classified_entries`
Output: `remediations: list[Remediation]`

Groups related log entries into issue clusters, then reasons about root cause and generates ordered fix steps with a confidence score. Each remediation links back to the indices of the log entries that triggered it.
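
The clustering step might be approximated as below. The README does not say how entries are judged "related", so grouping by `category` is purely an assumption, and `cluster_entries` and the `entry_indices` key are hypothetical names. The point of the sketch is how each cluster keeps the indices of the entries that triggered it.

```python
from collections import defaultdict
from typing import Any


def cluster_entries(entries: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Group entries by category and remember which indices fed each cluster."""
    by_category: dict[str, list[int]] = defaultdict(list)
    for index, entry in enumerate(entries):
        by_category[entry["category"]].append(index)
    # Dicts preserve insertion order, so clusters appear in first-seen order.
    return [
        {"category": category, "entry_indices": indices}
        for category, indices in by_category.items()
    ]
```

Each cluster could then be sent to the LLM for root-cause reasoning, with `entry_indices` copied onto the resulting remediation.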
Cookbook Synthesizer (`agents/cookbook.py`)

Input: `remediations`
Output: `cookbook: str` (markdown)

Deduplicates and prioritises remediations by severity, then produces a structured incident-response runbook with grouped, numbered steps suitable for on-call engineers.
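
A deterministic version of the deduplicate, prioritise, and render steps is sketched below. The remediation keys (`title`, `severity`, `steps`), the case-insensitive title match used for deduplication, and the markdown layout are all assumptions; the actual agent may use the LLM for some or all of this.

```python
from typing import Any

SEVERITY_RANK = {"CRITICAL": 0, "HIGH": 1, "MEDIUM": 2, "LOW": 3}


def build_cookbook(remediations: list[dict[str, Any]]) -> str:
    """Render a markdown runbook: unique remediations, most severe first."""
    unique: dict[str, dict[str, Any]] = {}
    for rem in remediations:
        # Assumption: two remediations are duplicates if their titles match.
        unique.setdefault(rem["title"].strip().lower(), rem)

    ordered = sorted(
        unique.values(), key=lambda r: SEVERITY_RANK.get(r["severity"], 99)
    )

    lines = ["# Incident Runbook", ""]
    for rem in ordered:
        lines.append(f"## [{rem['severity']}] {rem['title'].strip()}")
        for number, step in enumerate(rem["steps"], start=1):
            lines.append(f"{number}. {step}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Because Python's sort is stable, remediations of equal severity keep the order in which they were produced.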
Slack Notification Agent (`agents/slack_notifier.py`)

Input: `remediations` + severity context from state
Output: `slack_notifications: list[SlackMessage]`

Formats remediations into Slack Block Kit messages with severity badges and fix summaries, posts them to the configured channel via slack-sdk, and records each message's send status (sent / failed) back to state.
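
The message formatting could be sketched as follows. The `header` and `section` block types and the `plain_text` / `mrkdwn` text objects are standard Block Kit; the remediation keys (`title`, `severity`, `root_cause`, `steps`), the emoji badges, and the helper names are assumptions made for this example.

```python
from typing import Any

# Hypothetical severity badges rendered as Slack emoji codes.
BADGES = {
    "CRITICAL": ":red_circle:",
    "HIGH": ":large_orange_diamond:",
    "MEDIUM": ":large_blue_circle:",
    "LOW": ":white_circle:",
}


def build_blocks(remediation: dict[str, Any]) -> list[dict[str, Any]]:
    """Build a two-block Block Kit payload for one remediation."""
    severity = remediation["severity"]
    badge = BADGES.get(severity, ":question:")
    steps = "\n".join(
        f"{n}. {step}" for n, step in enumerate(remediation["steps"], start=1)
    )
    return [
        {
            "type": "header",
            "text": {"type": "plain_text", "text": f"{severity}: {remediation['title']}"},
        },
        {
            "type": "section",
            "text": {
                "type": "mrkdwn",
                "text": f"{badge} *Root cause:* {remediation['root_cause']}\n{steps}",
            },
        },
    ]


def to_slack_message(remediation: dict[str, Any], sent: bool) -> dict[str, Any]:
    """Pair the payload with the send status recorded back to state."""
    return {"blocks": build_blocks(remediation), "status": "sent" if sent else "failed"}
```

With `slack-sdk`, the blocks would typically be posted through `WebClient.chat_postMessage(channel=..., text=..., blocks=...)`, treating a raised `SlackApiError` as a `failed` status.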
JIRA Ticket Agent (`agents/jira_ticket.py`)

Input: `remediations` filtered to CRITICAL/HIGH severity
Output: `jira_tickets: list[JIRATicket]`

Uses the LLM to draft ticket payloads (title, description, priority, labels) from each critical/high remediation, then either:
- Live mode: creates real JIRA issues via `jira-python` when `JIRA_INSTANCE_URL`, `JIRA_USER_EMAIL`, `JIRA_API_TOKEN`, and `JIRA_PROJECT_KEY` are all set. The instance URL must use `https://`.
- Mock mode: returns placeholder tickets keyed `MOCK-1`, `MOCK-2`, … when any credential is missing.
Both modes produce the same `JIRATicket` shape, so the dashboard tab is identical regardless.
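
The choice between the two modes can be sketched as a check on the four environment variables. The helper names are hypothetical, and treating a non-`https://` instance URL as a reason to fall back to mock mode is an assumption: the README only states that the URL must use `https://`, not what happens otherwise.

```python
import os
from typing import Any, Mapping

REQUIRED_VARS = (
    "JIRA_INSTANCE_URL",
    "JIRA_USER_EMAIL",
    "JIRA_API_TOKEN",
    "JIRA_PROJECT_KEY",
)


def jira_live_mode(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when every credential is set and the URL is https."""
    if any(not env.get(name, "").strip() for name in REQUIRED_VARS):
        return False
    # Assumption: a non-https URL is handled by falling back to mock mode.
    return env["JIRA_INSTANCE_URL"].strip().startswith("https://")


def mock_tickets(drafts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Attach MOCK-N keys to drafted ticket payloads, numbering from 1."""
    return [{**draft, "key": f"MOCK-{n}"} for n, draft in enumerate(drafts, start=1)]
```

Keeping the drafted payload unchanged and only adding a `key` is one simple way to give mock and live tickets the same shape.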
Run the full test suite with:

    pytest tests/ -v

Tests cover all five agents, the orchestrator graph, the router, state initialisation, and log-parser utilities.

| File | Format | Contents |
|---|---|---|
| `sample_logs/mixed_incident.log` | Plain-text syslog | Mixed-severity incidents (permissions, timeouts, OOM) |
| `sample_logs/LogFile.txt` | Plain-text log | Mixed CRITICAL/ERROR/WARN incidents |
| `sample_logs/k8s_crash.json` | JSON array | Kubernetes pod failures: CrashLoopBackOff, back-off events |
| `sample_logs/app_errors.csv` | CSV | Application-layer errors: payment gateway timeouts, connection failures |
| `sample_logs/app_errors_1.csv` | CSV | Additional application error sample |
| `sample_logs/system_logs_mixed_1000.log` | Plain-text log | 1000-line dataset combining HTTP access and system events for load testing |

See SECURITY.md for the vulnerability reporting policy and the hygiene practices this project follows (secret redaction filter, input validation, no-secrets policy).
Released under the MIT License.