Skip to content

Docs: reorganize and rewrite content#1894

Closed
aantn wants to merge 55 commits into
masterfrom
docs-reorganize
Closed

Docs: reorganize and rewrite content#1894
aantn wants to merge 55 commits into
masterfrom
docs-reorganize

Conversation

@aantn
Copy link
Copy Markdown
Collaborator

@aantn aantn commented Aug 11, 2025

Main changes:

  1. Move HolmesGPT docs out of Robusta docs
  2. Split Alert Sources docs into Alert Sources and Metrics Providers
  3. Improve docs on Robusta SaaS and the various APIs it exposes

aantn and others added 29 commits July 15, 2025 15:44
- Remove sinks/destinations from integrations overview
- Restructure page to highlight "Forward Alerts to Robusta"
- Add dedicated section for Prometheus, Nagios, and SolarWinds
- Simplify description to focus on alert forwarding and AI analysis

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move sinks reference from Integrations nav to Notifications & Routing nav
- Place sinks as first item in Notifications & Routing section
- Remove sinks from Integrations section to reduce confusion
- Keep logical grouping of alert destinations with notification routing

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Redesign integrations overview to be truly high-level with categories
- Rename "Integrating with Prometheus" to "Alert Sources" for broader scope
- Remove duplicate content between overview and subpages
- Focus overview on integration categories rather than detailed setup
- Update navigation labels to match new page structure
- Make user journey clearer: overview → specific alert source setup

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
… Automation

- Rename "Integrations" section to "Alert Sources" - much clearer purpose
- Create dedicated "AI Analysis" section for Holmes GPT only
- Move KRR, Popeye, and data export to "Automation" section
- Update Alert Sources overview to focus purely on monitoring integrations
- Remove mixed content categories that confused users
- Improve logical flow: Alert Sources → AI Analysis → Notifications → Automation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Create new notifications overview page as entry point
- Reorder navigation: Overview → Configuring Sinks → Sink Reference
- Fix information hierarchy so users start with concepts, not specific sinks
- Prevent sphinx menu expansion pushing basic setup pages down
- Add clear getting started flow and common workflows
- Update sink reference title to match new structure

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move Nagios and SolarWinds from Prometheus subpages to top-level Alert Sources
- Remove confusing navigation hierarchy where non-Prometheus systems appeared under Prometheus
- Clean up Prometheus page to focus purely on Prometheus/AlertManager integrations
- Remove "Other Alerting Systems" section since they're now top-level
- Create clearer navigation: Overview → Prometheus → Nagios → SolarWinds

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove verbose explanations and get straight to the point
- Lead with clear action: "Choose your monitoring system"
- Make card descriptions more direct and useful
- Add practical note about webhook compatibility for other systems
- Cut word count by ~60% while improving clarity

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Change 'Legacy systems' to 'Nagios monitoring'
- Change 'Enterprise monitoring' to 'SolarWinds monitoring'
- Keep descriptions neutral and product-focused
- Avoid terms that might make users feel bad about their tech choices

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Link 'HTTP webhooks' to webhook triggers documentation
- Provides users with concrete next steps if their system isn't listed
- Makes the fallback option actionable rather than just informational

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Link HTTP webhooks to exporting-data.rst which contains POST /api/alerts endpoint
- This is the correct endpoint for sending alerts from any system via HTTP
- Previous link was to webhook triggers which is for different use case

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move exporting-data.rst from Automation to Alert Sources as "Custom Webhooks"
- This contains the POST /api/alerts endpoint for sending alerts TO Robusta
- Makes it discoverable for users wanting to integrate custom monitoring systems
- Logical placement with other alert source integrations

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove navigation.tabs and navigation.tabs.sticky features
- Add navigation.sections and navigation.expand for better sidebar
- This provides more room for additional top-level sections
- Makes navigation more scalable for future section additions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@aantn aantn requested a review from arikalon1 August 11, 2025 07:02
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/configuration/alertmanager-integration/google-managed-prometheus.rst (2)

62-66: Fix code block language and missing closing quote

The “Verify it Works” command is a shell command but marked as YAML, and it’s missing a closing quote. This will render incorrectly and mislead users.

-.. code-block:: yaml
+.. code-block:: bash

-   robusta demo-alert --alertmanager-url='http://alertmanager.gmp-system.svc.cluster.local:9093
+   robusta demo-alert --alertmanager-url='http://alertmanager.gmp-system.svc.cluster.local:9093'

53-56: Fix namespace mismatch: use the same namespace for both commands

The secret is created in gmp-public, but the demo-alert URL points to gmp-system. This inconsistency will confuse users and likely cause failures. Please standardize the namespace across both snippets:

• docs/configuration/alertmanager-integration/google-managed-prometheus.rst:
– Line 54: kubectl create secret generic alertmanager -n gmp-public …
– Line 64: robusta demo-alert --alertmanager-url='http://alertmanager.gmp-system.svc.cluster.local:9093'

Options (pick one):

  1. If AlertManager runs in gmp-system, update the secret command:

    - kubectl create secret generic alertmanager \
    -   -n gmp-public \
    + kubectl create secret generic alertmanager \
    +   -n gmp-system \
        --from-file=alertmanager.yaml
  2. If gmp-public is correct, update the demo-alert URL:

    - robusta demo-alert --alertmanager-url='http://alertmanager.gmp-system.svc.cluster.local:9093'
    + robusta demo-alert --alertmanager-url='http://alertmanager.gmp-public.svc.cluster.local:9093'
♻️ Duplicate comments (4)
docs/configuration/metric-providers-victoria.rst (4)

48-51: Clarify prometheus_auth is sent verbatim as the Authorization header (duplicate of prior feedback).

This avoids ambiguity about including the "Bearer " prefix or using Basic auth.

     globalConfig:
         prometheus_url: "http://vmauth-victoria-metrics.default.svc.cluster.local:8427"
-        prometheus_auth: "Bearer YOUR_TOKEN"
+        # Will be sent as the Authorization header value (e.g., "Bearer <token>" or "Basic <base64(user:pass)>")
+        prometheus_auth: "Bearer YOUR_TOKEN"

59-64: Expand service discovery to include vmauth and vmalertmanager; add common namespace (duplicate of prior feedback).

Covers typical VM deployments and improves discoverability.

     # Find VictoriaMetrics services
-    kubectl get svc -A | grep -E "vmsingle|vmselect|vmalert"
+    kubectl get svc -A | grep -E "vmsingle|vmselect|vmauth|vmalert(manager)?"
     
-    # Check specific namespace
-    kubectl get svc -n victoria-metrics-system
+    # Check common namespaces
+    # Operator-based installs typically use:
+    kubectl get svc -n victoria-metrics-system
+    # Helm chart installs often default to:
+    kubectl get svc -n victoria-metrics

106-117: Document default and startup impact for check_prometheus_flags (duplicate of prior feedback).

Make it explicit that this is read from globalConfig, defaults to true, and that disabling it reduces early validation by skipping /api/v1/status/flags at startup.

    .. code-block:: yaml

        globalConfig:
            prometheus_url: "http://vmsingle-victoria-metrics.default.svc.cluster.local:8429"
            check_prometheus_flags: false

-   Some VictoriaMetrics configurations may not fully implement the Prometheus flags API, which can cause connection issues during Robusta initialization.
+   Some VictoriaMetrics configurations may not fully implement the Prometheus flags API, which can cause connection issues during Robusta initialization.
+   Note: check_prometheus_flags is read from globalConfig (Helm value) and defaults to true. Setting it to false disables the Prometheus flags API check during Robusta startup (skips calls to /api/v1/status/flags), which reduces early validation of the Prometheus endpoint.

127-129: Make kubectl run example portable across kubectl versions (duplicate of prior feedback).

Adding --restart=Never and quoting the URL increases portability and avoids shell parsing issues.

-       kubectl run test-curl --image=curlimages/curl --rm -it -- \
-           curl -v http://vmsingle-victoria-metrics.default.svc.cluster.local:8429/api/v1/query?query=up
+       kubectl run -it --rm --restart=Never vm-curl --image=curlimages/curl -- \
+         curl -v "http://vmsingle-victoria-metrics.default.svc.cluster.local:8429/api/v1/query?query=up"
🧹 Nitpick comments (10)
docs/configuration/exporting/namespace-resources-api.rst (5)

6-8: Replace vague “may be outdated” warning with a concrete, dated stability note (or remove).

As written, this undermines trust in the page. Either pin the docs to a specific “last updated” date and link to a changelog, or remove the warning.

Apply this diff:

-   Due to updates in the Namespace Resources API, these instructions may be outdated.
-   Please contact our team for support on Slack (https://bit.ly/robusta-slack) or by email (support@robusta.dev).
-   We're working on updating the documentation.
+   This API is in active development. The information below reflects the latest schema as of 2025-08-20.
+   For breaking changes, see the Robusta Release Notes. If you encounter issues, contact us on Slack (https://bit.ly/robusta-slack) or by email (support@robusta.dev).

74-87: Avoid embedding secrets in examples; prefer env var for the API key.

Even though this uses a placeholder, examples often get copy-pasted with real keys. Use an environment variable in the Authorization header.

Apply this diff:

-    --header 'Authorization: Bearer API-KEY-HERE' \
+    --header 'Authorization: Bearer '"$ROBUSTA_API_KEY" \

Add this snippet near the example to guide users:

# Export your API key securely before running the curl command
export ROBUSTA_API_KEY='YOUR-API-KEY'

Note: Static analysis flagged the current pattern as a potential secret leak (curl-auth-header).


91-95: Permissions naming and source-of-truth check.

Confirm the exact permission name/path in the UI (e.g., “Clusters → Read” vs “Clusters: Read”) and whether any extra scope is required for namespace resource counts. Update the text to match the UI precisely.


96-146: Tighten response documentation: enumerate possible error codes and status codes.

Current guidance says “varies depending on the error.” If there’s a defined set (e.g., 400 validation errors with specific error_codes like MISSING_FIELD, INVALID_RESOURCE, 403 FORBIDDEN for missing permission, 404 for unknown cluster/namespace, 500 for internal), list them for integrators.

I can help draft a small table of status codes and error_code values if you share the canonical list.


31-57: Clarify validation semantics for the request schema.

  • Is resources allowed to be empty? If not, specify “non-empty array”.
  • How are duplicates handled? First-wins/last-wins/aggregated?
  • Input size limits (number of resources) and rate limits?

Documenting these reduces integration friction.

docs/configuration/metric-providers-aws.rst (4)

25-31: Optional: anticipate session-token in the secret

If users rely on temporary credentials (STS), having an optional session-token makes the flow complete.

-       kubectl create secret generic aws-secret-key -n robusta \
-           --from-literal=access-key=YOUR_ACCESS_KEY \
-           --from-literal=secret-key=YOUR_SECRET_ACCESS_KEY
+       kubectl create secret generic aws-secret-key -n robusta \
+         --from-literal=access-key=YOUR_ACCESS_KEY_ID \
+         --from-literal=secret-key=YOUR_SECRET_ACCESS_KEY \
+         --from-literal=session-token=YOUR_SESSION_TOKEN   # optional

71-79: Align the “Required Environment Variables” list with standard names

Reflect the standard AWS env var names and note the optional session token.

-**Required Environment Variables**:
+**Required Environment Variables** (standard AWS):
 
-- ``PROMETHEUS_SSL_ENABLED``: Always ``"true"`` for AMP
-- ``AWS_SERVICE_NAME``: Always ``"aps"`` for Amazon Prometheus Service
-- ``AWS_REGION``: The AWS region where your workspace is located
+- ``PROMETHEUS_SSL_ENABLED``: Always ``"true"`` for AMP
+- ``AWS_ACCESS_KEY_ID`` and ``AWS_SECRET_ACCESS_KEY`` (or use IRSA)
+- ``AWS_SESSION_TOKEN``: Optional if using temporary credentials
+- ``AWS_SERVICE_NAME``: Always ``"aps"`` for Amazon Prometheus Service
+- ``AWS_REGION``: The AWS region where your workspace is located

86-108: Tighten IAM policy Resource scope

Current Resource uses a wildcard across all workspaces. Prefer least-privilege by scoping to the specific workspace ARN.

-                "Resource": "arn:aws:aps:*:*:workspace/*"
+                "Resource": "arn:aws:aps:us-east-1:123456789012:workspace/ws-12345678"

If you want to keep it generic in docs, add a comment indicating users should replace region, account ID, and workspace ID.


132-135: Clarify alerting path via Amazon Managed Grafana

“AlerManager URL is not needed (AWS handles alerting separately)” can be more concrete. AMP alerting is typically configured via Amazon Managed Grafana. Cross-link the alert-forwarding guide to reduce ambiguity.

-- AlertManager URL is not needed (AWS handles alerting separately)
+- AMP does not expose a built-in AlertManager. Configure alerting via Amazon Managed Grafana
+  and forward alerts to Robusta. See :doc:`/configuration/alertmanager-integration/eks-managed-prometheus`
+  and :doc:`/configuration/alertmanager-integration/grafana-alert-manager`.
docs/configuration/alertmanager-integration/eks-managed-prometheus.rst (1)

10-13: Good separation: alerts via Grafana, metrics via provider doc

Clear guidance that AMP alerting flows through Amazon Managed Grafana, with metric querying deferred to the AWS metric provider docs. Consider adding an inline link to AMG docs for completeness.

-AWS Managed Prometheus uses Amazon Managed Grafana for alerting. To send alerts to Robusta, configure your Grafana instance to forward alerts.
+AWS Managed Prometheus uses Amazon Managed Grafana for alerting. To send alerts to Robusta, configure your Grafana instance to forward alerts.
+See the `Amazon Managed Grafana documentation <https://docs.aws.amazon.com/grafana/latest/userguide/what-is-Amazon-Managed-Grafana.html>`_ for setup details.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 905881c and 6505086.

📒 Files selected for processing (6)
  • docs/configuration/alertmanager-integration/eks-managed-prometheus.rst (1 hunks)
  • docs/configuration/alertmanager-integration/google-managed-prometheus.rst (2 hunks)
  • docs/configuration/exporting/namespace-resources-api.rst (1 hunks)
  • docs/configuration/metric-providers-aws.rst (1 hunks)
  • docs/configuration/metric-providers-google.rst (1 hunks)
  • docs/configuration/metric-providers-victoria.rst (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/configuration/metric-providers-google.rst
🧰 Additional context used
🪛 Gitleaks (8.27.2)
docs/configuration/exporting/namespace-resources-api.rst

74-75: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run_tests
  • GitHub Check: Deploy docs
🔇 Additional comments (6)
docs/configuration/metric-providers-victoria.rst (1)

23-23: Verified Sphinx reference “Simple Upgrade” exists

I confirmed that .. _Simple Upgrade: is defined in docs/setup-robusta/upgrade.rst (line 18), so the cross-reference in metric-providers-victoria.rst is valid. No changes needed.

docs/configuration/exporting/namespace-resources-api.rst (2)

25-27: Cross-reference target exists

The anchor cb-robusta-ui-sink-namespace-config is defined in docs/configuration/sinks/RobustaUI.rst (at line 64), so the reference in namespace-resources-api.rst is valid—no changes needed.


63-66: Fix resource schema examples: use singular Kubernetes Kind and proper reST code markup

The snippet in docs/configuration/exporting/namespace-resources-api.rst currently shows plural Kinds and uses single backticks. Please update it to:

• use singular Kubernetes Kind names (e.g. Deployment)
• wrap all inline literals in double backticks
• explicitly render the core API group as an empty‐string literal: ""

Apply this diff in docs/configuration/exporting/namespace-resources-api.rst (lines 63–66):

-* ``kind`` (e.g., `Deployments`)
-* ``apiGroup`` (e.g., `apps`, or empty string for core group)
-* ``apiVersion`` (e.g., `v1`, `v2`)
+* ``kind`` (e.g., ``Deployment``)
+* ``apiGroup`` (e.g., ``apps``, or empty string ``""`` for the core group)
+* ``apiVersion`` (e.g., ``v1``, ``v2``)

Additionally, please confirm whether the API accepts a core‐group value of an empty string (""), null, or both, and document accordingly.

docs/configuration/alertmanager-integration/google-managed-prometheus.rst (1)

71-71: LGTM: linking metric-querying to the provider doc

Pointing users to the dedicated metric-provider page for querying is the right split of concerns.

docs/configuration/alertmanager-integration/eks-managed-prometheus.rst (2)

26-26: LGTM: deferring metric-querying to the AWS metric provider doc

This keeps alert-forwarding docs focused and avoids duplication.


17-22: Grafana alert-manager cross-link validated
Verified that docs/configuration/alertmanager-integration/grafana-alert-manager.rst exists and is listed in the toctree of docs/configuration/alertmanager-integration/index.rst. No further action needed.

Comment thread docs/configuration/exporting/namespace-resources-api.rst
Comment thread docs/configuration/metric-providers-aws.rst
Comment thread docs/configuration/metric-providers-aws.rst
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
docs/configuration/exporting/namespace-resources-api.rst (1)

57-60: Use singular Kubernetes Kind values across the doc (Deployment, Service, Ingress, CronJob).

Kubernetes Kind is singular. Using plurals is misleading and can cause request validation issues. This appears both in the resource schema example, the request payload, and the response example.

Apply these diffs:

- * ``kind`` (e.g., `Deployments`)
+ * ``kind`` (e.g., ``Deployment``)
-        {"kind": "Deployments", "apiGroup": "apps", "apiVersion": "v1"},
-        {"kind": "Services", "apiGroup": "", "apiVersion": "v1"},
-        {"kind": "Ingresses", "apiGroup": "networking.k8s.io", "apiVersion": "v1"},
-        {"kind": "CronJobs", "apiGroup": "batch", "apiVersion": "v1"}
+        {"kind": "Deployment", "apiGroup": "apps", "apiVersion": "v1"},
+        {"kind": "Service", "apiGroup": "", "apiVersion": "v1"},
+        {"kind": "Ingress", "apiGroup": "networking.k8s.io", "apiVersion": "v1"},
+        {"kind": "CronJob", "apiGroup": "batch", "apiVersion": "v1"}
-                "kind": "Deployments"
+                "kind": "Deployment"
@@
-                "kind": "Services"
+                "kind": "Service"
@@
-                "kind": "Ingresses"
+                "kind": "Ingress"

Also applies to: 76-80, 104-121

🧹 Nitpick comments (4)
docs/configuration/exporting/namespace-resources-api.rst (4)

58-60: Clarify core group handling for apiGroup.

For core resources (e.g., Service, Pod, ConfigMap), confirm whether the API requires an explicit empty string for apiGroup or allows omitting the field. Document the requirement to prevent client-side confusion.

Proposed addition:

 * ``apiGroup`` (e.g., `apps`, or empty string for core group)
+  
+.. note::
+   For core Kubernetes resources (no API group), set ``apiGroup`` to an empty string ``""`` as shown in the examples. Do not omit the field unless the API supports it.

68-71: Avoid secret scanner false positives; improve placeholder clarity.

Use a conventional placeholder and add a short caution to prevent accidental key exposure in copy-pastes.

-    --header 'Authorization: Bearer API-KEY-HERE' \
+    --header 'Authorization: Bearer <API-KEY>' \
-- ``API-KEY-HERE`` with your API Key from **Settings → API Keys → New API Key**.  
+- ``<API-KEY>`` with your API Key from **Settings → API Keys → New API Key**.  
   Make sure the key has **Clusters → Read** permissions to access namespace resource data.

Optional addition:

+.. warning::
+   Never paste real API keys into code snippets committed to the repo or shared publicly.

Also applies to: 83-86


7-10: Define what “active count” means.

“Active” can be interpreted differently per Kind (e.g., CronJobs vs. Jobs vs. Pods). Add one sentence to avoid ambiguity.

-Use this endpoint to retrieve an **active count of specific Kubernetes resources** within a namespace. This is the same data displayed in the **Namespaces** tab of the Robusta UI.
+Use this endpoint to retrieve an **active count of specific Kubernetes resources** within a namespace. This is the same data displayed in the **Namespaces** tab of the Robusta UI.
+
+“Active” refers to currently existing objects in the cluster (not deleted/terminating). It does not include historical objects. If there are Kind-specific nuances, they are documented below.

126-140: Include 401 Unauthorized in error examples (if applicable).

Most auth-protected APIs return 401 when the Authorization header is missing/invalid and 403 when valid but forbidden. If that’s the case here, reflect it in the examples.

-- **Status Code**: Varies depending on the error (e.g., `400`, `403`, `500`)
+- **Status Code**: Varies depending on the error (e.g., `400`, `401`, `403`, `500`)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 6505086 and 862e74e.

📒 Files selected for processing (1)
  • docs/configuration/exporting/namespace-resources-api.rst (1 hunks)
🧰 Additional context used
🪛 Gitleaks (8.27.2)
docs/configuration/exporting/namespace-resources-api.rst

68-69: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run_tests
  • GitHub Check: Deploy docs
🔇 Additional comments (2)
docs/configuration/exporting/namespace-resources-api.rst (2)

22-51: Request schema section is clear and actionable.

Well-structured list-table, good field descriptions, and explicit required flags.


19-21: Cross‐reference Verified
The label cb-robusta-ui-sink-namespace-config is defined in docs/configuration/sinks/RobustaUI.rst (line 64), so the :ref:cb-robusta-ui-sink-namespace-config`` link resolves correctly. No changes required.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
docs/configuration/alertmanager-integration/victoria-metrics.rst (3)

11-12: Use the correct brand: “VictoriaMetrics Alertmanager”

Brand should be “VictoriaMetrics” (one word). This is user-facing text, so we should fix it.

-Add the following to your Victoria Metrics Alertmanager configuration (e.g., Helm values file or VMAlertmanagerConfig CRD):
+Add the following to your VictoriaMetrics Alertmanager configuration (e.g., Helm values file or VMAlertmanagerConfig CRD):

15-30: YAML indentation is invalid; webhook and route fields won’t parse

The current snippet mis-indents webhook_configs and the route list item fields. Alertmanager/VictoriaMetrics Alertmanager will reject this.

Apply this corrected YAML layout:

-    receivers:
-      - name: 'robusta'
-      webhook_configs:
-          - url: 'http://<ROBUSTA-HELM-RELEASE-NAME>-runner.<NAMESPACE>.svc.cluster.local/api/alerts'
-          send_resolved: true # (3)
-
-    route: # (1)
-      routes:
-      - receiver: 'robusta'
-          group_by: [ '...' ]
-          group_wait: 1s
-          group_interval: 1s
-          matchers:
-          - severity =~ ".*"
-          repeat_interval: 4h
-          continue: true # (2)
+    receivers:
+      - name: 'robusta'
+        webhook_configs:
+          - url: 'http://<ROBUSTA-HELM-RELEASE-NAME>-runner.<NAMESPACE>.svc.cluster.local/api/alerts'
+            send_resolved: true # (3)
+
+    route: # (1)
+      routes:
+        - receiver: 'robusta'
+          group_by: ['...']
+          group_wait: 1s
+          group_interval: 1s
+          matchers:
+            - severity =~ ".*"
+          repeat_interval: 4h
+          continue: true # (2)

32-36: Grammar fix in code-annotations

Subject-verb agreement: “routes have …”.

-    1. Put Robusta's route as the first route, to guarantee it receives alerts. If you can't do so, you must guarantee all previous routes has ``continue: true`` set.
+    1. Put Robusta's route as the first route, to guarantee it receives alerts. If you can't do so, you must guarantee all previous routes have ``continue: true`` set.
🧹 Nitpick comments (3)
docs/configuration/alertmanager-integration/victoria-metrics.rst (3)

1-2: Heading underline length nitpick

The underline is longer than the title. reST allows it, but most of this repo’s docs keep it equal for consistency.

Apply this small tweak if you want to align with common style:

-VictoriaMetrics Alerts
-***********************
+VictoriaMetrics Alerts
+**********************

4-4: Clarify direction: “from … to Robusta”

“with Robusta” is ambiguous. “to Robusta” better reflects the alert flow.

-This guide shows how to send alerts from `VictoriaMetrics <https://victoriametrics.com/>`_ with Robusta.
+This guide shows how to send alerts from `VictoriaMetrics <https://victoriametrics.com/>`_ to Robusta.

6-7: Eliminate duplicate cross-reference to VictoriaMetrics

The link to /configuration/metric-providers-victoria appears both near the top (line 6) and again under “Configure Metrics Querying” (lines 39–43). Since the brief note at the top suffices in this alert-forwarding guide, remove the bottom section to avoid repetition.

• File: docs/configuration/alertmanager-integration/victoria-metrics.rst
Remove lines 39–43:

- Configure Metrics Querying
- ====================================
-
- To enable Robusta to query metrics from VictoriaMetrics, see :doc:`/configuration/metric-providers-victoria`.

If you’d rather keep the detailed section instead, drop line 6:

- For configuring metric querying from VictoriaMetrics, see :doc:`/configuration/metric-providers-victoria`.
+ .. (cross-reference moved to a dedicated section below)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 862e74e and 382a929.

📒 Files selected for processing (3)
  • docs/configuration/alertmanager-integration/victoria-metrics.rst (2 hunks)
  • docs/configuration/metric-providers-aws.rst (1 hunks)
  • docs/configuration/metric-providers-victoria.rst (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/configuration/metric-providers-victoria.rst
  • docs/configuration/metric-providers-aws.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run_tests
  • GitHub Check: Deploy docs
🔇 Additional comments (1)
docs/configuration/alertmanager-integration/victoria-metrics.rst (1)

38-38: Include path validated

The file docs/configuration/alertmanager-integration/_testing_integration.rst exists at the specified relative path, so the .. include:: ./_testing_integration.rst directive is correct. No changes needed.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
docs/configuration/alertmanager-integration/victoria-metrics.rst (2)

15-20: YAML indentation for webhook_configs is invalid

webhook_configs must be a key inside the receiver list item, and send_resolved must be nested under the webhook entry. Current indentation will not parse.

-    receivers:
-      - name: 'robusta'
-      webhook_configs:
-          - url: 'http://<ROBUSTA-HELM-RELEASE-NAME>-runner.<NAMESPACE>.svc.cluster.local/api/alerts'
-          send_resolved: true # (3)
+    receivers:
+      - name: 'robusta'
+        webhook_configs:
+          - url: 'http://<ROBUSTA-HELM-RELEASE-NAME>-runner.<NAMESPACE>.svc.cluster.local/api/alerts'
+            send_resolved: true # (3)

Optional: if using Robusta Cloud/SaaS, consider documenting the cloud webhook URL variant as well.


21-30: Fix route block indentation and matcher formatting

List item properties under the - receiver: entry are over-indented. Also, use a consistent matcher string (no spaces around operator) to match common Alertmanager examples.

-    route: # (1)
-      routes:
-      - receiver: 'robusta'
-          group_by: [ '...' ]
-          group_wait: 1s
-          group_interval: 1s
-          matchers:
-          - severity =~ ".*"
-          repeat_interval: 4h
-          continue: true # (2)
+    route: # (1)
+      routes:
+        - receiver: 'robusta'
+          group_by: ['...']
+          group_wait: 1s
+          group_interval: 1s
+          matchers:
+            - severity=~".*"
+          repeat_interval: 4h
+          continue: true # (2)

Nit: 1s group_wait/group_interval is very aggressive and can increase notification noise. Consider more typical values (e.g., group_wait: 30s, group_interval: 5m) in examples.

♻️ Duplicate comments (1)
docs/configuration/exporting/robusta-pro-features.rst (1)

9-16: Fix broken HolmesGPT doc link target

The link points to ../holmesgpt/index which doesn’t exist in this PR. Update to an existing target (main-features). This avoids a broken internal reference in Sphinx.

-:doc:`AI Analysis (HolmesGPT) <../holmesgpt/index>`
+:doc:`AI Analysis (HolmesGPT) <../holmesgpt/main-features>`
🧹 Nitpick comments (3)
docs/configuration/alertmanager-integration/victoria-metrics.rst (2)

4-4: Fix preposition: “with Robusta” → “to Robusta”

This reads like Robusta is co-sending the alerts. “to Robusta” is clearer.

-This guide shows how to send alerts from `VictoriaMetrics <https://victoriametrics.com/>`_ with Robusta.
+This guide shows how to send alerts from `VictoriaMetrics <https://victoriametrics.com/>`_ to Robusta.

32-35: Grammar nit in annotation (plural agreement)

“previous routes has” → “preceding routes have”.

-    1. Put Robusta's route as the first route, to guarantee it receives alerts. If you can't do so, you must guarantee all previous routes has ``continue: true`` set.
+    1. Put Robusta's route as the first route, to guarantee it receives alerts. If you can't do so, you must guarantee all preceding routes have ``continue: true`` set.
docs/configuration/exporting/robusta-pro-features.rst (1)

1-2: Use a more descriptive page title to match the file’s purpose and slug

“Overview” is too generic for a page named robusta-pro-features.rst and used in navigation. Prefer a title that reflects the content (e.g., “Robusta Pro Features” or “Robusta Pro Overview”) to improve clarity and TOC/SEO.

Apply one of the following diffs:

Option A (most explicit):

-Overview
-========
+Robusta Pro Features
+====================

Option B:

-Overview
-========
+Robusta Pro Overview
+====================
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 9ed579a and 608f0cb.

📒 Files selected for processing (6)
  • docs/configuration/alertmanager-integration/_prometheus_flags_check.rst (1 hunks)
  • docs/configuration/alertmanager-integration/victoria-metrics.rst (2 hunks)
  • docs/configuration/exporting/robusta-pro-features.rst (2 hunks)
  • docs/configuration/metric-providers-coralogix.rst (1 hunks)
  • docs/configuration/metric-providers-external.rst (1 hunks)
  • docs/configuration/metric-providers-victoria.rst (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • docs/configuration/alertmanager-integration/_prometheus_flags_check.rst
  • docs/configuration/metric-providers-external.rst
🚧 Files skipped from review as they are similar to previous changes (2)
  • docs/configuration/metric-providers-coralogix.rst
  • docs/configuration/metric-providers-victoria.rst
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run_tests
  • GitHub Check: Deploy docs
🔇 Additional comments (6)
docs/configuration/alertmanager-integration/victoria-metrics.rst (5)

1-2: Heading level and style look good

Top-level title and underline are consistent with RST conventions.


9-9: Sub-heading style is consistent

Good use of a secondary heading underline for the section.


41-41: Section heading looks good

Keeping the metrics querying content minimal here and linking out is the right move.


43-43: Cross-reference phrasing is fine

The link text is clear and matches the new docs split.


6-6: Cross-reference target verified and will build correctly

  • Found docs/configuration/metric-providers-victoria.rst in the repository.
  • Confirmed it’s included in the toctree in docs/index.rst (VictoriaMetrics <configuration/metric-providers-victoria>).

The :doc:/configuration/metric-providers-victoria link will resolve properly in the Sphinx build.

docs/configuration/exporting/robusta-pro-features.rst (1)

41-44: Ensure API docs exist and remove redundant overview links

We verified that exporting-data.rst already includes a toctree for the new API pages (send-alerts-api, configuration-changes-api, alert-export-api, alert-statistics-api), so those docs will be discoverable once present.

Please:

  • Confirm that the following files are created under docs/configuration/exporting/:
    alert-export-api.rst
    alert-statistics-api.rst
    send-alerts-api.rst
    configuration-changes-api.rst

  • Streamline navigation in robusta-pro-features.rst by choosing one of the two approaches:

    1. Keep the overview link to exporting-data.rst and remove the four individual API bullets.
    2. Retain the four API bullets and remove or retitle the duplicate
      :doc:Alert History Import and Export API `
      links (currently on lines 36 and 67).

This will eliminate duplication and ensure a single, clear entry point to your exporting APIs.

pavangudiwada
pavangudiwada previously approved these changes Aug 20, 2025
Copy link
Copy Markdown
Contributor

@pavangudiwada pavangudiwada left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

nherment and others added 6 commits August 20, 2025 18:07
* docs for krr api

* wip

* docs

* Update docs/configuration/resource-recommender.rst

Co-authored-by: moshemorad <moshemorad12340@gmail.com>

---------

Co-authored-by: Arik Alon <alon.arik@gmail.com>
Co-authored-by: moshemorad <moshemorad12340@gmail.com>
* feat: runner binding address configurable using env variable

* chore: leave default bind address to the original value
* Add krr security context

* Add to docs + helm

* Fix comments
@CLAassistant
Copy link
Copy Markdown

CLAassistant commented Aug 20, 2025

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
6 out of 7 committers have signed the CLA.

✅ pavangudiwada
✅ aantn
✅ moshemorad
✅ donch
✅ Avi-Robusta
✅ nherment
❌ RinkiyaKeDad
You have signed the CLA already but the status is still pending? Let us recheck it.

@pavangudiwada pavangudiwada enabled auto-merge (squash) August 20, 2025 12:37
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
src/robusta/integrations/slack/sender.py (2)

370-372: Fix exception logging: star-unpacking into a set will crash for dict blocks

The f-string uses a set literal with a starred list: {*output_blocks,}. Since output_blocks contains dicts, this raises TypeError: unhashable type: 'dict' while handling an exception, masking the root cause.

Apply this diff:

-            logging.error(
-                f"error sending message to slack\ne={e}\ntext={message}\nchannel={channel}\nblocks={*output_blocks,}\nattachment_blocks={*attachment_blocks,}"
-            )
+            logging.error(
+                f"error sending message to slack\ne={e}\ntext={message}\nchannel={channel}\n"
+                f"blocks={output_blocks}\nattachment_blocks={attachment_blocks}"
+            )

290-303: Return type inconsistency can break callers expecting a tuple

prepare_slack_text sometimes returns a string (when message is empty) and elsewhere returns a (message, error_msg) tuple. Callers here always unpack a tuple, so the string path would raise a ValueError.

Apply this diff to always return a tuple:

-        if len(message) == 0:
-            return "empty-message"  # blank messages aren't allowed
+        if len(message) == 0:
+            message = "empty-message"  # blank messages aren't allowed
         message = Transformer.apply_length_limit(message, MAX_BLOCK_CHARS)
 
         if error_files:
             error_msg = (
                 "_Failed to send file(s) "
                 + ", ".join(error_files)
                 + " to slack._\n _See robusta-runner logs for details._"
             )
-            return message, error_msg
+            return message, error_msg
 
-        return message, None
+        return message, None
playbooks/robusta_playbooks/krr.py (2)

5-5: ImportError: pickle has no symbol named NONE

from pickle import NONE will raise an ImportError at module import time. Use the Python built-in None instead.

Apply this diff:

-from pickle import NONE

Also update call sites that pass NONE (see below).


447-447: Use None (built-in) instead of nonexistent pickle.NONE

This aligns with the fix to remove the invalid import.

-    _publish_krr_finding(event=event,krr_json=params.result, scan_id=params.scan_id, start_time=params.start_time, metadata=metadata, timeout=None, config_json=NONE)
+    _publish_krr_finding(event=event, krr_json=params.result, scan_id=params.scan_id, start_time=params.start_time, metadata=metadata, timeout=None, config_json=None)
🧹 Nitpick comments (24)
docs/configuration/alertmanager-integration/grafana-cloud-mimir.rst (12)

53-55: Avoid hardcoding tokens in curl; prefer env vars and caution against leaks

Inline tokens are easy to leak via shell history, screenshots, or logs. Recommend using an environment variable in the example.

Apply this diff:

-    curl -H "Authorization: Bearer YOUR_GLSA_TOKEN" \
+    # Recommended: export your token first to avoid leaking secrets in shell history/logs
+    # export GLSA_TOKEN='<your token>'
+    curl -H "Authorization: Bearer ${GLSA_TOKEN}" \
       "https://YOUR-INSTANCE.grafana.net/api/datasources" | jq

Optionally add a short admonition below the code block reminding users not to commit tokens, redact them in screenshots, and rotate if exposed.


11-14: Clarify prerequisites and terminology (Alertmanager vs datasource; add jq)

Grafana Cloud provides a managed Alertmanager; you don’t necessarily “configure an AlertManager datasource.” Also, the guide uses jq—call that out.

Apply this diff:

-* Prometheus and AlertManager datasources configured in Grafana Cloud
+* A Prometheus datasource configured in Grafana Cloud
+* Grafana Cloud Alerting (managed Alertmanager) enabled
 * Access to create service accounts and API tokens in Grafana Cloud
 * Your Robusta ``account_id`` and ``signing_key`` from ``generated_values.yaml`` 
+* ``jq`` installed locally (used in curl examples)

30-34: Specify minimum token permissions/role for service account

Spell out the least-privilege guidance so users don’t over-scope their token.

Proposed addition:

 2. Create a new service account named "robusta-integration"
 3. Generate a service account token
 4. Save the token (starts with ``glsa_``)
+5. Assign the minimal role needed:
+   - Viewer role is typically sufficient to query datasources via the Grafana proxy.
+   - If you plan to manage alerting configuration via API, an Editor role may be required.
+   - Ensure the token has permission to query the Prometheus datasource and read Alertmanager APIs.

Please confirm the exact minimum role requirements for your Robusta/Grafana setup and adjust accordingly.


38-40: Fix minor grammar and spacing in the cluster-name section

Small readability polish.

Apply this diff:

-If your grafana setup covers  multiple clusters, the cluster name is required and used to 
-identify your specific cluster in Prometheus queries:
+If your Grafana setup covers multiple clusters, the cluster name is required and is used to
+identify your specific cluster in Prometheus queries:

56-56: Tighten wording on datasource UID

Duplicated “UID” in the sentence.

Apply this diff:

-Note the UID for Prometheus datasource UID (typically ``grafanacloud-prom``)
+Note the Prometheus datasource UID (typically ``grafanacloud-prom``)

77-83: Consistency: use “Alertmanager” (Grafana’s preferred casing) and clarify base URL

Consistent casing improves clarity, and a short note helps users avoid appending /api paths manually.

Apply this diff:

-# Grafana Cloud AlertManager Configuration 
+# Grafana Cloud Alertmanager Configuration 
 alertmanager_url: https://YOUR-INSTANCE.grafana.net
 alertmanager_flavor: grafana
 alertmanager_auth: Bearer YOUR_GLSA_TOKEN

Consider adding this note after the config block:

+.. note::
+
+    When using ``alertmanager_flavor: grafana``, set ``alertmanager_url`` to your Grafana base URL
+    (e.g., ``https://YOUR-INSTANCE.grafana.net``). Robusta will call the appropriate
+    Alertmanager-compatible API paths under this base URL. Do not append ``/api/alertmanager`` yourself.

66-87: Security note: avoid committing tokens in generated_values.yaml; prefer secrets

These values often end up in repos or CI logs. Encourage secret management.

Proposed addition just after the YAML block:

+.. warning::
+
+   The values shown above contain long-lived credentials. Avoid committing ``generated_values.yaml``
+   to source control. Where possible, store tokens in Kubernetes Secrets and reference them via the
+   chart's supported secret values or use your secret manager/CI to inject them at deploy time. Rotate
+   tokens if exposure is suspected.

If the Helm chart supports referencing secrets directly for these fields, consider adding an example.


99-109: Generalize namespace in kubectl/helm commands

Many users don’t run in “default”. Use a placeholder to reduce copy-paste errors.

Apply this diff:

-    helm upgrade robusta robusta/robusta -f generated_values.yaml -n default
+    # Replace <NAMESPACE> with the namespace where Robusta is installed (e.g., robusta or default)
+    helm upgrade robusta robusta/robusta -f generated_values.yaml -n <NAMESPACE>
-    kubectl rollout restart deployment/robusta-runner -n default
+    kubectl rollout restart deployment/robusta-runner -n <NAMESPACE>

128-129: Generalize namespace for Holmes restart as well

Keep consistency with earlier commands.

Apply this diff:

-    helm upgrade robusta robusta/robusta -f generated_values.yaml -n default
-    kubectl rollout restart deployment/robusta-holmes -n default
+    helm upgrade robusta robusta/robusta -f generated_values.yaml -n <NAMESPACE>
+    kubectl rollout restart deployment/robusta-holmes -n <NAMESPACE>

119-123: Holmes settings: call out why these flags matter for Mimir

A brief justification reduces misconfiguration churn.

Proposed rewording:

-* Set ``fetch_labels_with_labels_api: false`` (important for Mimir compatibility)
-* Set ``fetch_metadata_with_series_api: true`` (important for Mimir compatibility)
+* Set ``fetch_labels_with_labels_api: false`` (avoids Mimir labels API incompatibilities)
+* Set ``fetch_metadata_with_series_api: true`` (uses the series API which Mimir supports reliably)

145-151: Optional: suggest a quick query to validate Prometheus connectivity

Give users a concrete smoke test.

Proposed addition:

 2. Check if CPU and memory graphs are displayed
 3. If graphs are shown, the metrics integration is working correctly
+4. Alternatively, run a quick query via the proxy to validate connectivity:
+
+   .. code-block:: bash
+
+       curl -H "Authorization: Bearer ${GLSA_TOKEN}" \
+         "https://YOUR-INSTANCE.grafana.net/api/datasources/proxy/uid/PROMETHEUS_DATASOURCE_UID/api/v1/query?query=up%5B5m%5D" | jq '.status'

171-176: Troubleshooting Holmes: include 403/401 guidance

Add actionable hints for common auth failures.

Proposed additions:

 * Verify ``fetch_metadata_with_series_api`` is set to ``true``
 * Check that the Holmes deployment has restarted after configuration changes
 * Review Holmes logs for authentication errors: ``kubectl logs -n default deployment/robusta-holmes``
+* For 401/403 errors:
+  - Confirm the service account role (Viewer or above) and that the token is valid (not expired/revoked).
+  - Verify the Prometheus datasource UID is correct and accessible to the token.
+  - Ensure the Authorization header is sent to the Grafana proxy (``Bearer <token>``).
src/robusta/integrations/slack/sender.py (1)

646-651: Nit: capitalize “Holmes” and tighten punctuation for clarity

Small copy improvements for consistency with product naming and readability.

Apply this diff:

-            return MarkdownBlock("_Ask AI questions about this alert, by connecting <https://platform.robusta.dev/create-account|Robusta SaaS> and tagging holmes._")
+            return MarkdownBlock("_Ask AI questions about this alert by connecting <https://platform.robusta.dev/create-account|Robusta SaaS> and tagging Holmes._")
@@
-            return MarkdownBlock("_Ask AI questions about this alert, by adding holmes to your <https://docs.robusta.dev/master/configuration/holmesgpt/index.html#enable-holmes-in-slack-in-the-platform|Slack>._")
+            return MarkdownBlock("_Ask AI questions about this alert by adding Holmes to your <https://docs.robusta.dev/master/configuration/holmesgpt/index.html#enable-holmes-in-slack-in-the-platform|Slack>._")
@@
-            return MarkdownBlock("_Ask AI questions about this alert, by tagging holmes in a threaded reply_")
+            return MarkdownBlock("_Ask AI questions about this alert by tagging Holmes in a threaded reply_")
helm/robusta/values.yaml (1)

712-714: Expose runner bind address via Helm values

You added RUNNER_BIND_ADDR support in code, but it's not configurable via Helm. If you want users to override it (e.g., bind only to 127.0.0.1 in certain setups), add a Helm value and wire it into the template.

Apply this diff here to introduce the value:

   # Enable hardened filesystem security (read-only root filesystem with writable volume mounts)
   hardenedFs: false
   setKRRSecurityContext: false
+  # Bind address for the runner web server (consumed by RUNNER_BIND_ADDR)
+  bindAddr: "0.0.0.0"

Then add the env var in helm/robusta/templates/runner.yaml alongside the other runner envs (see my comment there for the exact patch).

src/robusta/runner/web.py (1)

69-72: Log the bind address/port on startup for easier ops

Tiny traceability improvement: log the bind host and port before starting Flask so it’s visible in pod logs.

     @staticmethod
     def run():
-        app.run(host=RUNNER_BIND_ADDR, port=PORT, use_reloader=False)
+        logging.info("Starting runner web server on %s:%s", RUNNER_BIND_ADDR, PORT)
+        app.run(host=RUNNER_BIND_ADDR, port=PORT, use_reloader=False)
helm/robusta/templates/runner.yaml (1)

121-126: Also wire RUNNER_BIND_ADDR from values

Since RUNNER_BIND_ADDR is now supported in code, expose it here so users can override the default bind address via Helm.

           - name: CLUSTER_DOMAIN
             value: {{ .Values.global.clusterDomain }}
           - name: SET_KRR_SECURITY_CONTEXT
             value: {{ .Values.runner.setKRRSecurityContext | quote }}
+          - name: RUNNER_BIND_ADDR
+            value: {{ .Values.runner.bindAddr | default "0.0.0.0" | quote }}
           {{- if .Values.openshift.enabled }}
           - name: IS_OPENSHIFT
             value: "True"
docs/configuration/resource-recommender.rst (1)

228-235: Avoid embedding secrets directly in curl examples

To reduce accidental leakage (shell history, copy/paste), prefer an env var for the API key in the example.

-.. code-block:: bash
-
-    curl -X GET "https://api.robusta.dev/api/query/krr/recommendations?account_id=YOUR_ACCOUNT_ID&cluster_id=my-cluster&namespace=default" \
-      -H "Authorization: Bearer YOUR_API_KEY" \
-      -H "Content-Type: application/json"
+.. code-block:: bash
+
+    export ROBUSTA_API_KEY="YOUR_API_KEY"
+    curl -X GET "https://api.robusta.dev/api/query/krr/recommendations?account_id=YOUR_ACCOUNT_ID&cluster_id=my-cluster&namespace=default" \
+      -H "Authorization: Bearer $ROBUSTA_API_KEY" \
+      -H "Content-Type: application/json"

This also addresses the static secret-header warning from scanners like gitleaks.

playbooks/robusta_playbooks/krr.py (3)

504-514: Good least-privilege securityContext; consider adding runAsUser/readOnlyRootFilesystem (optional)

The new SecurityContext is solid: non-root, drop ALL, no privilege escalation, RuntimeDefault seccomp. Two optional hardening tweaks you may consider (ensure the image supports them):

  • Set runAsUser to a non-root UID (e.g., 65532) to satisfy clusters enforcing MustRunAsNonRoot regardless of image metadata.
  • Set readOnlyRootFilesystem=True if KRR doesn’t need to write to the container FS.

Apply only if you’ve validated the krr image does not require root or a writable root FS.

Here is a minimal diff for the context creation (optional):

-    security_context = SecurityContext(
-        runAsNonRoot=True, 
-        capabilities=Capabilities(
-            drop=["ALL"]
-        ),
-        allowPrivilegeEscalation=False,
-        seccompProfile=SeccompProfile(type="RuntimeDefault")
-    )
+    security_context = SecurityContext(
+        runAsNonRoot=True,
+        runAsUser=65532,  # optional: only if image supports non-root UID
+        readOnlyRootFilesystem=True,  # optional: only if image doesn't write to root FS
+        capabilities=Capabilities(drop=["ALL"]),
+        allowPrivilegeEscalation=False,
+        seccompProfile=SeccompProfile(type="RuntimeDefault"),
+    )

525-525: Avoid passing None for securityContext to keep the serialized spec clean

Passing securityContext=None may serialize an explicit null (depending on the client), which is noisy. Only set the field when enabled.

Apply this refactor:

-    spec = PodSpec(
-        serviceAccountName=params.serviceAccountName,
-        containers=[
-            Container(
-                name="krr",
-                image=IMAGE,
-                imagePullPolicy="Always",
-                command=["/bin/sh", "-c", python_command],
-                env=env_var,
-                resources=resources,
-                securityContext=security_context,
-            )
-        ],
-        restartPolicy="Never",
-        **params.krr_job_spec,
-    )
+    container = Container(
+        name="krr",
+        image=IMAGE,
+        imagePullPolicy="Always",
+        command=["/bin/sh", "-c", python_command],
+        env=env_var,
+        resources=resources,
+    )
+    if security_context:
+        container.securityContext = security_context
+
+    spec = PodSpec(
+        serviceAccountName=params.serviceAccountName,
+        containers=[container],
+        restartPolicy="Never",
+        **params.krr_job_spec,
+    )

43-43: Type hint mismatch: KRR_PUSH_SCAN is a bool, not a str

load_bool returns a boolean. Adjust the annotation to match.

-KRR_PUSH_SCAN: str = load_bool("KRR_PUSH_SCAN", True)
+KRR_PUSH_SCAN: bool = load_bool("KRR_PUSH_SCAN", True)
src/robusta/utils/error_codes.py (1)

40-40: Confirmed: Proper Usage of HOLMES_RATE_LIMIT_EXCEEDED; Documentation Update Needed

I ran the suggested search and found the new error code is only referenced in the centralized handler:

  • src/robusta/utils/error_codes.py (line 40)
  • src/robusta/core/playbooks/internal/ai_integration.py (line 58)

However, there are no mentions of HOLMES_RATE_LIMIT_EXCEEDED or 5204 in the docs or error-code listings. Please add an entry (e.g., in your error-codes documentation or README) to surface this new Holmes rate-limit code.

src/robusta/core/playbooks/internal/ai_integration.py (3)

51-62: Harden the centralized Holmes error mapping; add Timeout handling, guard HTTPError without response, fix typos, and include status info.

  • Handle requests.Timeout as a connection class error.
  • Guard against HTTPError with no attached response (defensive).
  • Surface HTTP status in messages for better operability; include Retry-After on 429 when present.
  • Fix typos (“occurred”, “API”).

Apply this diff:

-def handle_holmes_error(e: Exception) -> NoReturn:
-    if isinstance(e, requests.ConnectionError):
-        raise ActionException(ErrorCodes.HOLMES_CONNECTION_ERROR, "Holmes endpoint is currently unreachable.")
-    elif isinstance(e, requests.HTTPError):
-        if e.response.status_code == 401 and "invalid_api_key" in e.response.text:
-            raise ActionException(ErrorCodes.HOLMES_REQUEST_ERROR, "Holmes invalid api key.")
-        elif e.response.status_code == 429:
-            raise ActionException(ErrorCodes.HOLMES_RATE_LIMIT_EXCEEDED, "Holmes rate limit exceeded.")
-        raise ActionException(ErrorCodes.HOLMES_REQUEST_ERROR, "Holmes internal configuration error.")
-    else:
-        raise ActionException(ErrorCodes.HOLMES_UNEXPECTED_ERROR, "An unexpected error occured.")
+def handle_holmes_error(e: Exception) -> NoReturn:
+    # Connection and timeout errors
+    if isinstance(e, (requests.ConnectionError, requests.Timeout)):
+        raise ActionException(
+            ErrorCodes.HOLMES_CONNECTION_ERROR,
+            "Holmes endpoint is unreachable or timed out.",
+        )
+    # HTTP status errors
+    elif isinstance(e, requests.HTTPError):
+        resp = getattr(e, "response", None)
+        if resp is not None:
+            status = resp.status_code
+            if status == 401:
+                # Prefer a clear message; avoid brittle body matching
+                if "invalid_api_key" in (resp.text or ""):
+                    msg = "Holmes unauthorized: invalid API key (HTTP 401)."
+                else:
+                    msg = "Holmes unauthorized (HTTP 401). Check API key."
+                raise ActionException(ErrorCodes.HOLMES_REQUEST_ERROR, msg)
+            elif status == 429:
+                retry_after = resp.headers.get("Retry-After")
+                msg = "Holmes rate limit exceeded (HTTP 429)."
+                if retry_after:
+                    msg += f" Retry after {retry_after}s."
+                raise ActionException(ErrorCodes.HOLMES_RATE_LIMIT_EXCEEDED, msg)
+            else:
+                raise ActionException(
+                    ErrorCodes.HOLMES_REQUEST_ERROR,
+                    f"Holmes API request failed (HTTP {status}).",
+                )
+        # Fallback when no response is attached
+        raise ActionException(ErrorCodes.HOLMES_REQUEST_ERROR, "Holmes API request failed (no response).")
+    else:
+        raise ActionException(ErrorCodes.HOLMES_UNEXPECTED_ERROR, "An unexpected error occurred.")

170-174: Reduce repetition: extract a small helper to build FindingSubject from ResourceInfo.

The same conditional subject construction appears in multiple places. A tiny helper would reduce duplication and make future changes safer.

For example, add this helper once in the module:

def _make_subject(resource: ResourceInfo | None) -> FindingSubject:
    return FindingSubject(
        name=resource.name if resource else "",
        namespace=resource.namespace if resource else "",
        subject_type=FindingSubjectType.from_kind(resource.kind) if resource else FindingSubjectType.TYPE_NONE,
        node=resource.node if resource else "",
        container=resource.container if resource else "",
    )

Then replace repeated inline constructions with:

subject=_make_subject(params.resource)

Also applies to: 234-238, 311-315, 439-443


342-344: Consider sending JSON via requests’ json= parameter (and add a timeout).

Using data=holmes_req.json() works but sends a raw string without automatic Content-Type. json=holmes_req.dict() sets headers correctly and avoids double serialization. Also consider a sensible timeout to avoid indefinite hangs (esp. in streaming).

You could refactor like:

holmes_req = HolmesChatRequest(
    ask=params.ask,
    conversation_history=params.conversation_history,
    model=params.model,
    stream=params.stream,
)

# Streaming
with requests.post(url, json=holmes_req.dict(), stream=True, headers={"Connection": "keep-alive"}, timeout=30) as resp:
    ...

# Non-streaming
result = requests.post(url, json=holmes_req.dict(), timeout=30)

Apply similarly to other requests in this module.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 28629f2 and af986c0.

⛔ Files ignored due to path filters (3)
  • docs/images/krr-api-key.png is excluded by !**/*.png
  • docs/images/new-krr-api-key.png is excluded by !**/*.png
  • poetry.lock is excluded by !**/*.lock
📒 Files selected for processing (13)
  • docs/configuration/alertmanager-integration/grafana-alert-manager.rst (3 hunks)
  • docs/configuration/alertmanager-integration/grafana-cloud-mimir.rst (1 hunks)
  • docs/configuration/alertmanager-integration/index.rst (2 hunks)
  • docs/configuration/resource-recommender.rst (1 hunks)
  • docs/setup-robusta/installation/dev-setup.rst (2 hunks)
  • helm/robusta/templates/runner.yaml (1 hunks)
  • helm/robusta/values.yaml (1 hunks)
  • playbooks/robusta_playbooks/krr.py (4 hunks)
  • src/robusta/core/model/env_vars.py (2 hunks)
  • src/robusta/core/playbooks/internal/ai_integration.py (16 hunks)
  • src/robusta/integrations/slack/sender.py (1 hunks)
  • src/robusta/runner/web.py (2 hunks)
  • src/robusta/utils/error_codes.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
  • docs/configuration/alertmanager-integration/index.rst
  • docs/setup-robusta/installation/dev-setup.rst
🚧 Files skipped from review as they are similar to previous changes (1)
  • docs/configuration/alertmanager-integration/grafana-alert-manager.rst
🧰 Additional context used
🧬 Code Graph Analysis (3)
src/robusta/integrations/slack/sender.py (1)
src/robusta/core/reporting/blocks.py (1)
  • MarkdownBlock (42-61)
playbooks/robusta_playbooks/krr.py (1)
src/robusta/core/model/env_vars.py (1)
  • load_bool (7-9)
src/robusta/core/playbooks/internal/ai_integration.py (3)
src/robusta/utils/error_codes.py (2)
  • ActionException (42-47)
  • ErrorCodes (5-40)
src/robusta/core/reporting/consts.py (2)
  • FindingSubjectType (42-72)
  • from_kind (54-72)
src/robusta/core/reporting/holmes.py (1)
  • HolmesChatRequest (37-41)
🪛 Gitleaks (8.27.2)
docs/configuration/alertmanager-integration/grafana-cloud-mimir.rst

53-53: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

docs/configuration/resource-recommender.rst

232-233: Discovered a potential authorization token provided in a curl command header, which could compromise the curl accessed resource.

(curl-auth-header)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: run_tests
  • GitHub Check: Deploy docs
🔇 Additional comments (17)
docs/configuration/alertmanager-integration/grafana-cloud-mimir.rst (1)

1-4: Overall: clear, actionable integration doc—nice work

Structure, flow, and cross-references are solid. The proxy-based approach for Mimir is correctly highlighted, and the Holmes-specific flags are called out.

Also applies to: 58-66, 110-117, 142-188

src/robusta/integrations/slack/sender.py (2)

646-651: LGTM: copy updated to remove the Slack “@” mention

The new wording aligns with platform-agnostic docs (no implicit Slack-only syntax). Functionality unchanged.


648-648: Doc link confirmed canonical
The URL returns HTTP 200 with no redirects, so it remains valid and doesn’t need updating.

helm/robusta/values.yaml (1)

713-713: New KRR securityContext flag looks good

The new runner.setKRRSecurityContext toggle with a safe default (false) is clear and aligns with the Helm/env wiring in the PR.

src/robusta/runner/web.py (1)

13-15: Importing RUNNER_BIND_ADDR is correct and used below

The new env var is properly imported and consumed. No issues.

helm/robusta/templates/runner.yaml (1)

121-123: KRR security context env wiring is correct

SET_KRR_SECURITY_CONTEXT is correctly surfaced from values and quoted as a string for consistent parsing.

src/robusta/core/model/env_vars.py (2)

91-92: RUNNER_BIND_ADDR addition is correct and non-breaking

Defaulting to 0.0.0.0 preserves existing behavior while enabling overrides. Looks good.


142-142: SET_KRR_SECURITY_CONTEXT toggle added cleanly

Boolean toggle via load_bool with a safe default. Matches Helm/env wiring.

docs/configuration/resource-recommender.rst (2)

160-166: Docs for enabling KRR security context are accurate

The key matches the Helm value (runner.setKRRSecurityContext). Clear and actionable.


187-189: Confirm and align KRR API endpoint path

The heading at line 187 shows
GET /api/krr/recommendations
but the curl example at lines 232–235 calls
GET /api/query/krr/recommendations

Please verify which path the server actually exposes and update both locations in docs/configuration/resource-recommender.rst accordingly:

• Line 187: replace the “GET …” heading
• Lines 232–235: update the curl example URL

playbooks/robusta_playbooks/krr.py (2)

11-11: Hikaru imports for SecurityContext are correct and align with k8s 1.26 models

Importing SecurityContext, Capabilities, and SeccompProfile from hikaru 1.26 is appropriate for attaching a container-level securityContext.


33-33: No Action Needed: SET_KRR_SECURITY_CONTEXT Is Already a Boolean

Verified that in src/robusta/core/model/env_vars.py:

SET_KRR_SECURITY_CONTEXT = load_bool("SET_KRR_SECURITY_CONTEXT", False)

This ensures SET_KRR_SECURITY_CONTEXT is a boolean, so the existing if SET_KRR_SECURITY_CONTEXT: check behaves as intended.

src/robusta/core/playbooks/internal/ai_integration.py (5)

4-4: Type annotation import (NoReturn) is appropriate.

Import is correct and matches the handler’s always-raise contract.


136-136: Verified: All exception handlers delegate to handle_holmes_error

Every except Exception as e block in src/robusta/core/playbooks/internal/ai_integration.py (lines 132, 186, 251, 328, 409, 456) correctly calls handle_holmes_error(e). This centralizes error handling and eliminates duplication.


83-83: Confirmed: HolmesRequest defines a model field, so passing model=params.model on line 83 is valid.
Approved.


298-299: HolmesIssueChatRequest already supports model
The model: Optional[str] = None field is declared on the parent HolmesChatRequest (line 40 in src/robusta/core/reporting/holmes.py), so HolmesIssueChatRequest inherits it and will include model without validation errors.


426-427: Confirmed: model field is supported via inheritance

The model: Optional[str] attribute is declared on HolmesChatRequest (line 40) and inherited by HolmesWorkloadHealthRequest, so no changes are needed.

@pavangudiwada
Copy link
Copy Markdown
Contributor

Merged with a new PR #1902

auto-merge was automatically disabled August 21, 2025 12:36

Pull request was closed

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants