feat: implement Prometheus monitoring, Grafana dashboards, and Telegram alerting #79

Open
memreo wants to merge 10 commits into main from feat/15/grafana-prometheus

@memreo memreo commented Jun 25, 2026

Collaborator

Summary

This PR adds Prometheus and Grafana monitoring with Telegram alerting to the DevPulse application. The stack is set up for local Docker Compose development and is ready for Kubernetes deployment (Rancher and Azure cloud).

Key Features Implemented:

  • Prometheus Instrumentation: Exposed metrics endpoints for Prometheus to scrape on all three Spring Boot services and the Python microservice.
  • Auto-Provisioned Dashboards: Created three Grafana dashboards (devpulse_dashboard.json, devpulse_resources_dashboard.json, azure_sizing_dashboard.json) loaded automatically at startup.
  • Kubernetes Sizing & Resource Monitoring: Switched the Azure VM sizing estimates and the resource dashboards to query Kubernetes cAdvisor metrics (container_memory_working_set_bytes and container_cpu_usage_seconds_total) from the setops namespace, so that database (postgres) and message broker (rabbitmq) usage is tracked accurately.
  • Namespace-Isolated Alerts: Added alerting rules for downtime, restarts, and OOM kills in the setops namespace, routed through custom HTML templates to a Telegram contact point.
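
For reviewers who want to check the dashboard panels against raw data, here is a minimal Python sketch of the kind of PromQL the cAdvisor bullet describes. Only the two metric names and the setops namespace come from this PR. The helper names, the `container!=""` filter (commonly used to skip pod-level aggregate series), the grouping by `pod`, and the 5m rate window are assumptions for illustration, not the exact queries in the dashboard JSON.

```python
def selector(namespace: str) -> str:
    """Label matcher for one namespace. The container!="" part is an assumed
    convention that drops pod-level aggregate series to avoid double counting."""
    return f'{{namespace="{namespace}", container!=""}}'


def memory_by_pod(namespace: str = "setops") -> str:
    """PromQL string: working-set memory in bytes, summed per pod."""
    return f"sum by (pod) (container_memory_working_set_bytes{selector(namespace)})"


def cpu_cores_by_pod(namespace: str = "setops", window: str = "5m") -> str:
    """PromQL string: CPU cores in use per pod, taken as the per-second rate
    of the cumulative CPU-seconds counter over the given window."""
    return (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        f"{selector(namespace)}[{window}]))"
    )


if __name__ == "__main__":
    print(memory_by_pod())
    print(cpu_cores_by_pod())
```

The printed strings can be pasted into the Prometheus expression browser or a Grafana Explore tab to compare against what the panels show.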

Visual Evidence:

1. System Monitoring & Client Log Ingestion

Split-screen displaying log ingestion throughput/latency metrics on Grafana (left) alongside the DevPulse React UI displaying AI insights (right):
System Monitoring Dashboard

2. Azure VM Sizing & Capacity Planning

Dynamic Azure VM size recommender calculating resource budgets by aggregating actual container memory footprints:
Azure VM Sizing Dashboard
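
The sizing idea above (sum the observed container footprints, then pick a VM that fits) can be sketched in Python as follows. This is not the dashboard's actual logic, which lives in azure_sizing_dashboard.json and is not shown in this PR description. The 30% headroom, the three-entry catalogue, and all names here are assumptions chosen for illustration; check VM specifications against the current Azure catalogue before relying on them.

```python
from typing import Dict, NamedTuple, Optional

GIB = 1024 ** 3


class VmSize(NamedTuple):
    name: str
    vcpus: int
    memory_gib: int


# Illustrative catalogue, ordered smallest first (assumption, not from the PR).
CATALOGUE = [
    VmSize("Standard_B2s", 2, 4),
    VmSize("Standard_B2ms", 2, 8),
    VmSize("Standard_B4ms", 4, 16),
]


def recommend(
    memory_bytes: Dict[str, float],
    cpu_cores: Dict[str, float],
    headroom: float = 0.3,
) -> Optional[VmSize]:
    """Return the smallest catalogue entry that fits the summed container
    usage plus headroom, or None when nothing in the catalogue is big enough."""
    need_mem_gib = sum(memory_bytes.values()) * (1 + headroom) / GIB
    need_cpu = sum(cpu_cores.values()) * (1 + headroom)
    for size in CATALOGUE:
        if size.memory_gib >= need_mem_gib and size.vcpus >= need_cpu:
            return size
    return None


if __name__ == "__main__":
    # Made-up sample readings, keyed by container name.
    mem = {"postgres": 1.5 * GIB, "rabbitmq": 0.5 * GIB, "spring-ingestion": 1.0 * GIB}
    cpu = {"postgres": 0.2, "rabbitmq": 0.1, "spring-ingestion": 0.5}
    print(recommend(mem, cpu))
```

Returning None for an oversized workload, instead of silently picking the largest entry, makes it obvious when the catalogue needs extending.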

3. Service Runtime Resource Utilization

Granular monitoring of JVM heap usage, Hikari connection pools, thread counts, and FastAPI resource consumption:
DevPulse Runtime Resources Dashboard
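
To sanity-check a panel such as JVM heap usage without Grafana, a service's metrics output can be read directly. The sketch below parses Prometheus text exposition format and sums heap usage. It assumes the Spring services export `jvm_memory_used_bytes` with an `area` label, which is the usual Micrometer naming but is not confirmed by this PR. The parser is deliberately small: it skips comment lines and any sample line that carries a trailing timestamp.

```python
import re
from typing import Dict, Iterator, Tuple

# name, optional {labels}, then a single value token
SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> Iterator[Tuple[str, Dict[str, str], float]]:
    """Yield (metric name, labels, value) for each sample line in the text."""
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        match = SAMPLE_LINE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            yield name, dict(LABEL.findall(raw_labels or "")), float(value)


def heap_used_bytes(text: str) -> float:
    """Sum jvm_memory_used_bytes across all heap memory pools."""
    return sum(
        value
        for name, labels, value in parse_samples(text)
        if name == "jvm_memory_used_bytes" and labels.get("area") == "heap"
    )


if __name__ == "__main__":
    # Made-up scrape output for illustration only.
    sample = (
        "# TYPE jvm_memory_used_bytes gauge\n"
        'jvm_memory_used_bytes{area="heap",id="G1 Eden Space"} 1.0e+07\n'
        'jvm_memory_used_bytes{area="heap",id="G1 Old Gen"} 5000000\n'
        'jvm_memory_used_bytes{area="nonheap",id="Metaspace"} 3000000\n'
    )
    print(heap_used_bytes(sample))
```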

4. Custom Telegram Notifications in Action

Telegram channel displaying firing system outage alerts and instant recovery notifications formatted with the custom HTML template:
Telegram Alert Notification
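
As a rough illustration of what an HTML-formatted Telegram alert involves, the Python sketch below builds a message body and the parameters for the Bot API `sendMessage` method. It is not the template added in this PR. The field names passed to `format_alert` and the message layout are hypothetical; only `chat_id`, `text`, and `parse_mode` are real `sendMessage` parameters. The main point is that user-supplied text must have `<`, `>`, and `&` escaped in HTML parse mode.

```python
from html import escape
from typing import Dict


def format_alert(status: str, alert_name: str, namespace: str, summary: str) -> str:
    """Build a Telegram HTML-mode message. Dynamic values are escaped so that
    characters like '<' or '&' in an alert summary do not break the markup."""
    lines = [
        f"<b>{escape(status.upper(), quote=False)}</b>: {escape(alert_name, quote=False)}",
        f"<b>Namespace:</b> <code>{escape(namespace, quote=False)}</code>",
        escape(summary, quote=False),
    ]
    return "\n".join(lines)


def send_message_payload(chat_id: str, text: str) -> Dict[str, str]:
    """Parameters for the Bot API sendMessage call; nothing is sent here."""
    return {"chat_id": chat_id, "text": text, "parse_mode": "HTML"}


if __name__ == "__main__":
    body = format_alert("firing", "PodDown", "setops", "memory > 90% & rising")
    print(send_message_payload("<your-chat-id>", body))
```

Keeping formatting separate from sending makes the message text easy to unit test without a bot token.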

Component

  • Client: client/
  • API contract: api/
  • Spring ingestion: services/spring-ingestion/
  • Spring logbook: services/spring-logbook/
  • Spring alerts: services/spring-alerts/
  • GenAI: services/py-intelligence/
  • Infrastructure: infra/
  • CI/CD: .github/workflows/
  • Documentation

API Impact

  • This changes the API.
  • api/openapi.yaml was updated.
  • This does not change the API.

Testing

  • I tested this locally.
  • I added or updated tests.
  • Tests are not applicable for this change.

Checklist

  • Branch name follows (feat|fix)/(issue_id)/(name_of_issue).
  • The change is limited to the intended component(s).
  • Documentation was updated if needed.

Related Issue

Closes #15

memreo self-assigned this Jun 25, 2026
memreo added labels Jun 25, 2026: feature (New features), ci (Pull request responsible for continuous integration), cd (Pull request responsible for continuous deployment), dependencies (Pull requests that update a dependency file), infra (Pull requests that update infrastructure code.)
tahahundekari requested review from sachmii and tahahundekari and removed request for sachmii June 25, 2026 13:52

@tahahundekari tahahundekari left a comment

Collaborator

LGTM! Thank you!

@sachmii sachmii left a comment

Collaborator

Looks good, LGTM

Development

Successfully merging this pull request may close these issues.

Prometheus & Grafana Integration

3 participants