Skip to content

MARKASCHARAN/WatchTower

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 

Repository files navigation

🔭 Watchtower — AI-Assisted Observability & Incident Intelligence Platform

Real-Time Metrics • Incident Lifecycle • AI Root-Cause Analysis • Production-Ready

Node.js TypeScript WebSockets PostgreSQL Redis Observability AI License


📖 Overview

Watchtower is an AI-assisted cloud observability and incident intelligence platform focused on early detection, fast diagnosis, and reliable resolution of production incidents.

Instead of static dashboards, Watchtower models how real SRE and on-call systems are designed and operated in production environments.

The system demonstrates:

  • Real-time metrics ingestion and streaming
  • Deterministic incident lifecycle management
  • Signal-driven incident detection
  • Metric, event, and deployment correlation
  • AI-assisted root-cause analysis

This project is built to mirror real-world observability and reliability tooling, not toy dashboards.


🎯 Why Watchtower Exists

Production systems fail continuously due to:

  • Traffic spikes
  • Memory leaks
  • Slow queries
  • Misconfigured deployments
  • Cascading service failures

Traditional monitoring answers what is broken, not why.

Watchtower is designed to reduce Mean Time To Resolution (MTTR) by converting telemetry into incident intelligence.


✨ Core Guarantees

Guarantee Implementation
Real-time visibility WebSocket-based metric streaming
Explicit incident lifecycle Finite State Machine
Actionable signals Threshold + anomaly detection
Incident traceability Timestamped event timelines
Faster diagnosis AI-generated root-cause hypotheses
Failure tolerance Backpressure handling + retries

🏗️ Architecture

Event-Driven Observability Platform

🚦 Incident Lifecycle (FSM)

Incidents follow a strict lifecycle:

Rules

  • No implicit transitions
  • Every state change is recorded
  • Incidents are immutable once resolved

This mirrors real on-call and SRE workflows.


📡 Real-Time Metrics Streaming

Metrics are ingested and streamed live:

  • CPU, memory, latency, error rates
  • Per-service and per-environment views
  • Zero polling, push-based updates

This enables immediate detection of production anomalies.


🧠 AI Incident Copilot

When an incident is created, Watchtower:

  • Analyzes metric deviations
  • Correlates recent deployments and events
  • Generates:
    • Root cause hypotheses
    • Debugging suggestions
    • Risk assessment

AI is used as decision support, not blind automation.


🧭 Incident Timeline & Correlation

Each incident maintains a structured timeline:

  • Metric threshold breaches
  • Deployments
  • Manual annotations
  • State transitions

This enables accurate post-incident analysis and learning.


⚙️ Failure-First Engineering

Watchtower is designed assuming failures are normal:

  • Metric ingestion retries
  • Graceful degradation when AI services fail
  • Explicit error handling instead of silent drops

This reflects production-grade reliability thinking.


🧪 Testing Philosophy

Watchtower prioritizes correctness over UI coverage:

  • Incident state transition tests
  • Threshold breach simulations
  • Streaming reconnection tests
  • AI fallback behavior tests

🧠 What This Project Demonstrates

  • ✔ Distributed systems thinking
  • ✔ Event-driven architecture
  • ✔ State machine modeling
  • ✔ Real-time data pipelines
  • ✔ Meaningful AI integration
  • ✔ SRE / observability domain knowledge

This project is intentionally designed to reflect real infrastructure problems, not frontend demos.


📄 License

MIT License © 2026 Watchtower

About

Watchtower is an AI-assisted cloud observability and incident intelligence platform that helps teams detect, understand, and resolve production issues faster

Topics

Resources

License

Stars

Watchers

Forks

Contributors