feat: Add Prometheus/Grafana monitoring and observability stack #3
Open
Implement kube-prometheus-stack for comprehensive cluster monitoring:

**Infrastructure:**
- Helm values configuration (core/monitoring/prometheus-grafana-values.yaml)
- kube-prometheus-stack with Prometheus Operator
- Prometheus Server with 7-day retention and 5Gi storage
- Grafana with pre-built Kubernetes dashboards
- Alertmanager for alert management
- Node Exporter for node metrics
- Kube State Metrics for Kubernetes object metrics

**Components:**
- **Prometheus**: Metrics collection and storage (512Mi-1Gi memory)
- **Grafana**: Visualization dashboards (128Mi-256Mi memory)
- **Alertmanager**: Alert routing and management (64Mi-128Mi memory)
- **Exporters**: Node and kube-state-metrics exporters

**Installation:**
- Installation script (scripts/install-monitoring.sh)
- Uses prometheus-community Helm repository
- Automatic namespace creation
- 15-minute timeout for complete stack deployment

**Developer Experience:**
- Makefile targets: install-monitoring, grafana, prometheus, alertmanager
- Port forwarding helpers for all UIs
- Enhanced status command with monitoring pods and release info
- Default admin credentials (admin/admin)

**Monitoring Capabilities:**
- Kubernetes cluster metrics (nodes, pods, deployments)
- Resource utilization (CPU, memory, disk, network)
- KLDP component metrics (Airflow, MinIO, Spark)
- Custom scrape configs for application monitoring
- Pre-configured alerting rules for critical issues

**Default Dashboards:**
- Kubernetes cluster overview
- Node resource usage
- Pod resource usage
- Persistent volume monitoring
- Network I/O and latency

**Integrations:**
- Airflow metrics scraping (webserver, scheduler)
- MinIO metrics collection
- Spark Operator metrics
- Custom ServiceMonitor support

**Configuration Highlights:**
- Optimized for local development resources
- 7-day data retention
- NodePort services for easy access
- Default Kubernetes dashboards enabled
- Alerting rules for common issues

Enables comprehensive observability:
- Performance monitoring and optimization
- Resource usage tracking
- Troubleshooting and debugging
- Capacity planning
- SLA/SLO tracking

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
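The installation bullets above (prometheus-community repository, automatic namespace creation, 15-minute timeout) could translate into Helm commands roughly like the following. This is a sketch, not the contents of scripts/install-monitoring.sh: the release name `kube-prometheus-stack` and the use of `--wait` are assumptions, and the commands are printed rather than executed so the sketch runs without a cluster.

```shell
#!/bin/sh
# Sketch only: prints the Helm commands instead of running them.
# The values file path and the "monitoring" namespace come from the PR text;
# the release name and --wait are assumptions made for illustration.
VALUES_FILE="core/monitoring/prometheus-grafana-values.yaml"

monitoring_install_commands() {
  echo "helm repo add prometheus-community https://prometheus-community.github.io/helm-charts"
  echo "helm repo update"
  echo "helm upgrade --install kube-prometheus-stack prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace --values $VALUES_FILE --timeout 15m --wait"
}

monitoring_install_commands
```

Piping the output to `sh` would run the commands against the current kubectl context.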
Enhance the CI workflow to validate the installation and functionality of all components:

**Extended Testing Coverage:**
- ✅ Airflow with KubernetesExecutor (existing)
- ✅ MinIO object storage (NEW)
- ✅ Spark Operator (NEW)
- ✅ Prometheus/Grafana monitoring stack (NEW)

**Changes:**
- Increased job timeout from 45 to 90 minutes
- Added MinIO installation and verification
- Added Spark Operator installation and verification
- Added Prometheus/Grafana installation and verification
- Enhanced resource reporting across all namespaces
- Improved failure diagnostics for all components

**MinIO Validation:**
- Helm install with OCI registry
- Pod readiness check (600s timeout)
- Storage namespace verification

**Spark Operator Validation:**
- Helm install from spark-operator repository
- Operator pod readiness check (300s timeout)
- CRD availability verification

**Monitoring Stack Validation:**
- kube-prometheus-stack installation
- Prometheus and Grafana pod readiness (600s each)
- Monitoring namespace verification

**Enhanced Debugging:**
- Resource status for all 4 namespaces (airflow, storage, spark, monitoring)
- Services and PVCs across all namespaces
- Events from all namespaces (last 20 per namespace)
- Failed pod descriptions for all namespaces

**Test Flow:**
1. DAG syntax validation
2. Minikube cluster setup
3. Install Airflow → verify
4. Install MinIO → verify
5. Install Spark Operator → verify
6. Install Monitoring → verify
7. Final component status check

This ensures the complete KLDP stack works end-to-end in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
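The pod readiness checks listed above, with their per-component timeouts, might look something like this with `kubectl wait`. The namespaces and timeouts are taken from the commit message; the label selectors are hypothetical placeholders, since the actual labels depend on each chart. The commands are printed rather than run so the sketch works without a cluster.

```shell
#!/bin/sh
# Sketch only: prints one "kubectl wait" command per component.
# Arguments: namespace, label selector, timeout in seconds.
# The label selectors used below are assumptions, not taken from the workflow.
wait_command() {
  echo "kubectl wait --for=condition=Ready pod --selector=$2 --namespace=$1 --timeout=${3}s"
}

wait_command storage    "app.kubernetes.io/name=minio"          600
wait_command spark      "app.kubernetes.io/name=spark-operator" 300
wait_command monitoring "app.kubernetes.io/name=prometheus"     600
wait_command monitoring "app.kubernetes.io/name=grafana"        600
```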
The specific image tags (bitnami/minio:2025.7.23-debian-12-r3) don't exist, causing ImagePullBackOff errors in CI. Let the Helm chart use its default compatible image versions instead. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Create separate CI values to speed up MinIO installation:
- Disabled persistence (use emptyDir instead of PVC)
- Disabled console to save resources
- Reduced resource limits (128Mi/256Mi)
- Single bucket for testing
- Faster timeout (10m vs 15m)

This avoids PVC provisioning delays and reduces resource usage in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This workflow isolates MinIO installation for diagnostic purposes:
- Tests Docker registry connectivity and DNS resolution
- Inspects Helm chart to verify default image tags
- Attempts to pull various MinIO image tags
- Pre-loads images into Minikube
- Installs MinIO with verbose Helm output
- Comprehensive pod status and event logging
- Tests alternative image sources (quay.io) if Bitnami fails
- Provides final diagnostic summary

Can be triggered manually via workflow_dispatch or auto-runs on feat/add-prometheus-grafana branch for quick iteration.

This is temporary for debugging CI image pull failures.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Replaced Unicode characters (✓, ✗, box drawing chars) with ASCII equivalents ([OK], [FAIL], =) to fix YAML validation issues. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
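A substitution of this kind can be expressed as a small `sed` filter. The sketch below is illustrative only: the commit does not say which box-drawing characters were involved, so `═` and `─` are assumptions, and `to_ascii` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: maps status glyphs to ASCII markers.
# Each character gets its own expression so the filter does not depend on
# multi-byte bracket expressions; the two box-drawing characters are assumed.
to_ascii() {
  sed -e 's/✓/[OK]/g' -e 's/✗/[FAIL]/g' -e 's/═/=/g' -e 's/─/=/g'
}

printf '%s\n' '✓ MinIO pod ready' '✗ image pull failed' '═══' | to_ascii
```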
Updated install-minio.sh to support KLDP_MINIO_VALUES_FILE env var for CI values file override. Both CI workflows now use the script instead of duplicating helm commands.

Benefits:
- Single source of truth for MinIO installation
- Easier to maintain and debug
- Consistent behavior between local and CI
- If it works locally, it works in CI

Environment variables:
- KLDP_MINIO_VERSION: Override MinIO chart version
- KLDP_MINIO_VALUES_FILE: Override values file path (for CI)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
No longer needed since we unified the installation approach. Both local and CI now use scripts/install-minio.sh. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Updated all installation scripts to support environment variable overrides for values files, and updated CI to use scripts instead of inline helm commands.

Scripts updated:
- install-airflow.sh: KLDP_AIRFLOW_VALUES_FILE
- install-spark.sh: KLDP_SPARK_VALUES_FILE
- install-monitoring.sh: KLDP_MONITORING_VALUES_FILE

Benefits:
- Single source of truth for ALL installations
- Consistent behavior between local and CI
- Easier to maintain and debug
- Test locally = works in CI

Environment variables for CI overrides:
- KLDP_AIRFLOW_VALUES_FILE (defaults to core/airflow/values.yaml)
- KLDP_MINIO_VALUES_FILE (defaults to core/storage/minio-values.yaml)
- KLDP_SPARK_VALUES_FILE (defaults to core/compute/spark-operator-values.yaml)
- KLDP_MONITORING_VALUES_FILE (defaults to core/monitoring/prometheus-grafana-values.yaml)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
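The override behavior described in this commit maps naturally onto POSIX default-value expansion. The following is a minimal sketch of the pattern, not the scripts themselves: the variable names and default paths are the ones listed above, while the `values_file_for` helper is hypothetical.

```shell
#!/bin/sh
# Sketch only: resolve the values file for a component, preferring the
# KLDP_*_VALUES_FILE override when it is set and non-empty.
# values_file_for is a hypothetical helper, not a function from the repo.
values_file_for() {
  case "$1" in
    airflow)    echo "${KLDP_AIRFLOW_VALUES_FILE:-core/airflow/values.yaml}" ;;
    minio)      echo "${KLDP_MINIO_VALUES_FILE:-core/storage/minio-values.yaml}" ;;
    spark)      echo "${KLDP_SPARK_VALUES_FILE:-core/compute/spark-operator-values.yaml}" ;;
    monitoring) echo "${KLDP_MONITORING_VALUES_FILE:-core/monitoring/prometheus-grafana-values.yaml}" ;;
    *)          echo "unknown component: $1" >&2; return 1 ;;
  esac
}

values_file_for monitoring
```

Because `:-` also falls back when the variable is set but empty, exporting an empty override in CI would still select the default file.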
Removed CI-specific values files and unified configuration:
- Deleted: core/airflow/values-ci-emptydir.yaml
- Deleted: core/storage/minio-values-ci.yaml
- Optimized main values files to work well in both environments
- Updated CI workflow to use same values as local development

Benefits:
- CI validates the REAL configuration users run locally
- No configuration drift between environments
- Easier debugging: if it works locally, it works in CI
- Fewer files to maintain
- Single source of truth per component

Resource optimizations in core/airflow/values.yaml:
- Reduced memory requests to 256Mi for scheduler/webserver
- Reduced triggerer to 128Mi
- Still functional for local dev and fits in CI (2 CPUs, 4GB RAM)

Updated CLAUDE.md:
- Documented single source of truth approach
- Simplified CI troubleshooting section
- Updated local testing instructions

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Airflow 3.x replaced webserver with api-server component.

Changes:
- Wait for scheduler instead of webserver (webserver no longer exists)
- Update port-forward command to use api-server service
- Update log commands to use correct component labels

This fixes CI failure where script was looking for non-existent webserver component.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
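One way to keep a script working across both major versions is to derive the UI component name from the Airflow version. The sketch below illustrates that idea under stated assumptions: the `airflow_ui_component` helper, the Helm release name `airflow`, the resulting service names, and port 8080 are all hypothetical, and the port-forward command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: pick the UI component for a given Airflow version.
# Airflow 3.x serves the UI from the api-server; 2.x uses the webserver.
airflow_ui_component() {
  major="${1%%.*}"   # "3.0.2" -> "3"
  if [ "$major" -ge 3 ]; then
    echo "api-server"
  else
    echo "webserver"
  fi
}

# Assumed release name "airflow", so the service would be airflow-<component>.
component="$(airflow_ui_component "3.0.2")"
echo "kubectl port-forward svc/airflow-$component 8080:8080 --namespace airflow"
```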
Reduce resource usage in CI by disabling triggerer component which is not critical for basic validation.

Changes:
- Disable triggerer (saves 128Mi RAM, 100m CPU)
- Increase scheduler ready timeout from 5min to 10min
- Gives scheduler more time to stabilize in resource-constrained CI

This should fix scheduler 1/2 ready issue in CI.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>