feat(flow-collector): horizontal scale via NATS KV TemplateStore + Da… by mikemiles-dev · Pull Request #3148 · carverauto/serviceradar

mikemiles-dev · 2026-05-03T21:26:06Z

…emonSet

Wires the new netflow_parser 1.0.3 TemplateStore extension point into the flow-collector and switches the K8s deployment from a single Deployment to a per-node DaemonSet behind a Service: LoadBalancer with externalTrafficPolicy: Local. Together these two changes let the flow collector scale horizontally with node count while preserving exporter source IP and sharing learned NetFlow / IPFIX templates across pods.

Why

A single flow-collector Deployment was a singleton bottleneck for UDP ingest. Scaling it required either source-IP-affinity routing (so every exporter always lands on the same pod and templates stay local) or a shared template store so any pod could decode any flow. We picked the shared-store path because it doesn't depend on the LB layer.

The original plan was to front this with Envoy Gateway UDPRoute, but a research spike found that Envoy Gateway's UDPRoute does not preserve client source IP (https://gateway.envoyproxy.io/docs/tasks/traffic/udp-routing/) — upstreams see the gateway pod's IP, which would collapse AutoScopedParser's per-(source_ip,source_id) scoping. Service+DaemonSet +externalTrafficPolicy: Local is the workable alternative on MetalLB (L2 mode preserves source IP through to the pod). Gateway API stays available for the rest of the stack (HTTP/gRPC/mTLS).

Code changes

Cargo.toml: bump netflow_parser 1.0.0 -> 1.0.3 (TemplateStore API); add bytes for the NATS KV payload type.
src/template_store.rs (new): NatsKvTemplateStore implements the parser's TemplateStore trait. async_nats is async-only and the trait is sync, so we bridge via tokio::task::block_in_place + Handle::block_on. Safe under the multi_thread runtime the collector already uses. NATS KV keys are restricted to [A-Za-z0-9._=/-]; the scope produced by AutoScopedParser ("v9:1.2.3.4:2055/0", IPv6 with brackets, etc.) is sanitized via underscore replacement. Five render tests cover IPv4, IPv6, empty scope, distinct-scopes and distinct-kinds.
src/netflow/mod.rs: NetflowHandler::new accepts an optional Arc<dyn TemplateStore> and threads it through NetflowParserBuilder::with_template_store(...). Adds the new TemplateEvent::Restored arm to the existing event-callback (logs at info). Adds update_template_store_metrics() which iterates each per-source NetflowParser via AutoScopedParser::{ipfix,v9,legacy}_info(), sums template_store_* counters from each CacheMetrics, and writes the totals into ListenerMetrics atomics. Called under the parser lock inside parse_datagram so the snapshot is always fresh on the next scrape.
src/listener.rs: build_handler now takes Option<Arc<dyn TemplateStore>>. sFlow ignores it (template-less).
src/config.rs: new optional TemplateStoreConfig { kv_bucket, kv_history, kv_ttl_secs }. Two deserialization tests.
src/main.rs: new bootstrap_template_store() opens an independent NATS connection (separate from the publisher's), gets-or-creates the KV bucket idempotently, and wraps it in NatsKvTemplateStore. Kept on its own connection so KV-bucket flakiness can't backpressure publish and vice versa.
src/metrics.rs: ListenerMetrics gains template_store_restored, template_store_codec_errors, template_store_backend_errors, source_count atomics. New render_prometheus() produces text exposition format (8 metrics, HELP/TYPE headers, protocol + listen_addr labels). New run_prometheus_server() is a hand-rolled HTTP/1.1 server on tokio's TcpListener — GET /metrics returns the exposition; anything else 404s. No new deps. Spawned from main.rs when metrics_addr is set; runs independently of the log reporter and the publisher so a scrape failure can never stall ingestion.

K8s manifest changes

k8s/demo/base/serviceradar-flow-collector.yaml: Deployment ->
DaemonSet, replicas dropped, unused PVC removed (the Rust binary writes nothing to disk). ConfigMap JSON gains the template_store block. Service was already externalTrafficPolicy: Local — kept. Added RollingUpdate with maxUnavailable: 1.
helm/serviceradar/templates/flow-collector.yaml: same kind flip, PVC removal, plus nodeSelector/tolerations knobs.
helm/serviceradar/values.yaml: drop replicaCount and data.*, add flowCollector.config.template_store block, default service.type: LoadBalancer and service.externalTrafficPolicy: Local, add the IPFIX (4739) and metrics (50046) ports.

Tests

48 tests pass:

5 NatsKvTemplateStore key-render tests
2 TemplateStoreConfig deserialization tests
1 Prometheus exposition format test
40 pre-existing flow-collector tests (unchanged) clippy --all-targets -D warnings: clean

What's left to do

Code:

Prometheus scrape annotations on the Service so the existing Prometheus / ServiceMonitor picks up the new /metrics endpoint automatically. Both manifests need this.
PrometheusRule alerts: template_store_backend_errors > 0 sustained

5min = NATS unhealthy; template_store_codec_errors > 0 ever =
wire-format mismatch (drain bucket).
Optional: TemplateStoreConfig.kv_replicas hint so the bucket is created with the right JetStream replication for HA. Currently defaults to 1 — operators can pre-create the bucket with nats kv add flow_templates --replicas=3 to override.

Operational (no cluster access from here):

Run the source-IP smoke test against staging: LB_IP=$(kubectl get svc serviceradar-flow-collector \ -o jsonpath='{.status.loadBalancer.ingress[0].ip}') kubectl debug $POD -it --image=nicolaka/netshoot -- tcpdump -nn -i any port 2055 # send a packet to $LB_IP:2055 from a known IP # tcpdump should show that source IP, not MetalLB / node IP
Run the cross-pod template-share smoke test: send template+data, delete the receiving pod, send same data record, verify "Template restored from secondary store" log on the new pod and that NATS KV nats kv ls flow_templates shows scoped keys.
Decide JetStream stream replicas for HA on multi-AZ clusters.
Decide TTL on KV entries (default 0 = forever; 24h is reasonable if exporters re-announce frequently).

Decisions still open:

Whether to ship Prometheus scrape annotations + PrometheusRule in this branch or a follow-up.
Whether to expose kv_replicas in TemplateStoreConfig or rely on pre-create.

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Issue ticket number and link

Code checklist before requesting a review

I have signed the DCO?
The build completes without errors?
All tests are passing when running make test?

qodo-code-review · 2026-05-03T21:26:26Z

Review Summary by Qodo

(Agentic_describe updated until commit `d8e67a9`)

Horizontal scale flow collector via NATS KV TemplateStore and DaemonSet deployment

✨ Enhancement

Walkthroughs

Description

• Enable horizontal scaling of flow collector via NATS KV shared template store
• Switch K8s deployment from single Deployment to per-node DaemonSet with LoadBalancer service
• Preserve exporter source IP through load balancer using externalTrafficPolicy: Local
• Add Prometheus metrics endpoint with hand-rolled HTTP server and template-store counters
• Implement background metrics ticker to aggregate per-source template-store statistics

Diagram

flowchart LR
  exporters["Exporters<br/>UDP NetFlow/IPFIX/sFlow"]
  lb["Cloud L4 LB<br/>externalTrafficPolicy: Local"]
  daemonset["DaemonSet<br/>one pod per node"]
  nats_kv["NATS JetStream KV<br/>flow_templates bucket"]
  prometheus["Prometheus<br/>metrics endpoint"]
  
  exporters -->|source IP preserved| lb
  lb -->|distribute across nodes| daemonset
  daemonset -->|read/write templates| nats_kv
  daemonset -->|expose metrics| prometheus

File Changes

1. rust/flow-collector/Cargo.toml Dependencies +2/-1

Bump netflow_parser and add bytes dependency

rust/flow-collector/Cargo.toml

2. rust/flow-collector/src/config.rs ⚙️ Configuration changes +98/-0

Add TemplateStoreConfig with NATS KV settings

rust/flow-collector/src/config.rs

3. rust/flow-collector/src/nats_client.rs ✨ Enhancement +109/-0

Shared NATS connection helpers for TLS/creds

rust/flow-collector/src/nats_client.rs

View more (11)

4. rust/flow-collector/src/template_store.rs ✨ Enhancement +220/-0

NATS KV-backed TemplateStore implementation

rust/flow-collector/src/template_store.rs

5. rust/flow-collector/src/main.rs ✨ Enhancement +82/-6

Bootstrap template store and spawn Prometheus server

rust/flow-collector/src/main.rs

6. rust/flow-collector/src/listener.rs ✨ Enhancement +6/-0

Thread template store through handler construction

rust/flow-collector/src/listener.rs

7. rust/flow-collector/src/netflow/mod.rs ✨ Enhancement +262/-5

Integrate TemplateStore and add metrics ticker

rust/flow-collector/src/netflow/mod.rs

8. rust/flow-collector/src/publisher.rs ✨ Enhancement +19/-38

Refactor to use shared NATS connection helpers

rust/flow-collector/src/publisher.rs

9. rust/flow-collector/src/metrics.rs ✨ Enhancement +561/-4

Add template-store counters and Prometheus HTTP server

rust/flow-collector/src/metrics.rs

10. helm/serviceradar/templates/flow-collector.yaml ⚙️ Configuration changes +31/-49

Convert Deployment to DaemonSet with LoadBalancer service

helm/serviceradar/templates/flow-collector.yaml

11. helm/serviceradar/templates/NOTES.txt 📝 Documentation +29/-0

Add upgrade notes for DaemonSet migration

helm/serviceradar/templates/NOTES.txt

12. helm/serviceradar/values.yaml ⚙️ Configuration changes +56/-8

Update defaults for DaemonSet, template store, and metrics

helm/serviceradar/values.yaml

13. k8s/demo/base/serviceradar-flow-collector.yaml ⚙️ Configuration changes +35/-30

Convert demo manifest to DaemonSet with template store

k8s/demo/base/serviceradar-flow-collector.yaml

14. docs/docs/flow-collector-scaling.md 📝 Documentation +210/-0

New scaling guide with sizing rules and monitoring

docs/docs/flow-collector-scaling.md

qodo-code-review · 2026-05-03T21:26:27Z

Code Review by Qodo

🐞 Bugs (7) 📘 Rule violations (1)

1. Non-ASCII in scaling guide 📘 Rule violation ⚙ Maintainability ⭐ New

Description

The new Markdown doc includes non-ASCII characters (e.g., em dash —, ≤, arrows), violating the
ASCII-only documentation requirement. This can break tooling/portability expectations for docs
content.

Code

docs/docs/flow-collector-scaling.md[2]
+title: Flow Collector — Scaling Guide

Evidence
PR Compliance ID 6 requires Markdown content to be ASCII-only. The added doc contains multiple
non-ASCII characters, such as — in the title and ≤ in section headings.
AGENTS.md
docs/docs/flow-collector-scaling.md[2-2]
docs/docs/flow-collector-scaling.md[56-56]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`docs/docs/flow-collector-scaling.md` contains non-ASCII characters (e.g., `—`, `≤`, `→`). Compliance requires ASCII-only Markdown.

## Issue Context
The scaling guide is newly added documentation under `docs/docs/`, so it must adhere to the ASCII-only constraint.

## Fix Focus Areas
- docs/docs/flow-collector-scaling.md[1-210]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. IPFIX port not listened 🐞 Bug ≡ Correctness ⭐ New

Description

The Helm/Kustomize manifests expose UDP port 4739 (ipfix) by default, but the default
flow-collector config does not bind any listener on 4739, so IPFIX datagrams sent to the Service’s
IPFIX port will be dropped silently.

Code

helm/serviceradar/values.yaml[R902-906]

+      ipfix:
+        enabled: true
+        port: 4739
+        targetPort: 4739
+        protocol: UDP

Evidence
values.yaml enables the Service’s ipfix port on 4739, but the default
flowCollector.config.listeners only binds 0.0.0.0:2055 (netflow) and 0.0.0.0:6343 (sflow). The
kustomize demo manifest has the same mismatch: it exposes ipfix:4739 in the Service and container
ports, but the embedded flow-collector.json has no listener on 4739.
helm/serviceradar/values.yaml[865-873]
helm/serviceradar/values.yaml[891-906]
k8s/demo/base/serviceradar-flow-collector.yaml[21-56]
k8s/demo/base/serviceradar-flow-collector.yaml[173-189]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The chart/manifests advertise IPFIX ingestion on UDP/4739, but the flow-collector config does not bind to 4739 by default. This causes silent IPFIX data loss.

### Issue Context
`ListenerConfig` only has `netflow`/`sflow` variants; to receive IPFIX on 4739 you must add a second `protocol: netflow` listener with `listen_addr: 0.0.0.0:4739` (the netflow parser can decode IPFIX datagrams on any UDP port).

### Fix Focus Areas
- helm/serviceradar/values.yaml[865-906]
- k8s/demo/base/serviceradar-flow-collector.yaml[21-56]

### What to change
- Either (preferred) add a default listener for 4739 in Helm values and the demo manifest, or
- Set `service.ports.ipfix.enabled: false` by default and clearly document that enabling it requires adding a matching listener.
- Optionally add config validation at startup that warns/errors when a Service port is enabled but no listener is configured for that port (Helm-only check if you don’t want runtime coupling).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Metrics probe always enabled 🐞 Bug ☼ Reliability ⭐ New

Description

The DaemonSet liveness/readiness probes always hit GET /metrics on port 50046, but the metrics
server only starts when config.metrics_addr is set; disabling metrics (as in values-demo.yaml)
will make pods fail probes and crashloop.

Code

helm/serviceradar/templates/flow-collector.yaml[R87-104]

+        # HTTP probe against the Prometheus endpoint — also confirms the
+        # async runtime is alive (a deadlocked-but-running collector
+        # would pass a `pgrep` check but fail this).
        livenessProbe:
-          exec:
-            command:
-            - /bin/sh
-            - -c
-            - "pgrep -f flow-collector > /dev/null"
+          httpGet:
+            path: /metrics
+            port: {{ .Values.flowCollector.service.ports.metrics.targetPort | default 50046 }}
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
-          exec:
-            command:
-            - /bin/sh
-            - -c
-            - "pgrep -f flow-collector > /dev/null"
+          httpGet:
+            path: /metrics
+            port: {{ .Values.flowCollector.service.ports.metrics.targetPort | default 50046 }}
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5

Evidence

The Helm template hardwires both probes to /metrics on the metrics port. In the binary, the
Prometheus server is only spawned when config.metrics_addr is Some(_). The demo values
explicitly disable the metrics port and do not set flowCollector.config.metrics_addr, which means
nothing will be listening on 50046 but probes will still run.

helm/serviceradar/templates/flow-collector.yaml[87-105]
rust/flow-collector/src/main.rs[140-150]
helm/serviceradar/values-demo.yaml[489-533]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Kubernetes probes depend on `/metrics`, but `/metrics` is optional at runtime. When operators disable metrics (or omit `metrics_addr`), pods will fail readiness/liveness and crashloop.

### Issue Context
- The binary only starts the Prometheus server when `config.metrics_addr` is set.
- Helm currently always configures HTTP probes.

### Fix Focus Areas
- helm/serviceradar/templates/flow-collector.yaml[84-105]
- rust/flow-collector/src/main.rs[140-150]
- helm/serviceradar/values-demo.yaml[489-533]

### What to change
Implement one of:
1) **Helm fix (recommended):** Wrap liveness/readiness probes in a conditional, e.g. only enable HTTP probes when `.Values.flowCollector.service.ports.metrics.enabled` AND `.Values.flowCollector.config.metrics_addr` are set; otherwise fall back to an `exec` probe (`pgrep`) or disable probes.
2) **App fix:** Always start the metrics server on a fixed port (or always set a default `metrics_addr`) so `/metrics` is guaranteed to exist if probes expect it.

Also update `values-demo.yaml` to match the chosen behavior so demo installs don’t regress.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

View more (2)

4. Template store skips creds 🐞 Bug ≡ Correctness

Description

bootstrap_template_store() connects with async_nats::connect(nats_url) and ignores
Config.nats_creds_file and Config.security (mTLS), so enabling template_store will fail to connect
on authenticated/mTLS NATS clusters (or connect without the intended client identity). This is
inconsistent with Publisher::connect_once(), which explicitly applies credentials and TLS options.

Code

rust/flow-collector/src/main.rs[R185-188]

+    let client = async_nats::connect(nats_url)
+        .await
+        .with_context(|| format!("connecting to NATS at {} for template store", nats_url))?;
+    let js = jetstream::new(client);

Evidence
The flow-collector already requires NATS creds/mTLS for publishing, and Helm mounts/waits for the
creds file; however the template-store connection is created without those options, making it unable
to authenticate in the same environments.
rust/flow-collector/src/main.rs[181-189]
rust/flow-collector/src/publisher.rs[88-115]
rust/flow-collector/src/config.rs[6-47]
helm/serviceradar/values.yaml[829-872]
helm/serviceradar/templates/flow-collector.yaml[109-170]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`bootstrap_template_store()` creates a NATS connection via `async_nats::connect(nats_url)` and does not apply the same credentials-file and mTLS configuration used by the publisher. In clusters where NATS requires creds and/or client certs, enabling `template_store` will prevent startup (connection/auth failure) or connect with the wrong security posture.
### Issue Context
`Publisher::connect_once()` already builds `ConnectOptions` from `Config.security` and `Config.nats_creds_file`. Helm also mounts and waits for `/etc/serviceradar/creds/platform.creds`, suggesting creds are required in real deployments.
### Fix Focus Areas
- rust/flow-collector/src/main.rs[181-207]
- rust/flow-collector/src/publisher.rs[88-115]
- rust/flow-collector/src/config.rs[6-47]
### Suggested fix
- Change `bootstrap_template_store()` to accept `&Config` (or the needed security fields), and build `async_nats::ConnectOptions` the same way as the publisher (root certs + client cert + creds file).
- Consider factoring the shared NATS connection setup into a helper (e.g., `fn nats_connect_options(cfg: &Config) -> Result<ConnectOptions>`), used by both publisher and template store.
- (Optional but consistent) add the same retry/backoff behavior as `connect_with_retry()` if template_store should tolerate NATS startup ordering.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

5. Metrics server slowloris DoS 🐞 Bug ⛨ Security

Description

run_prometheus_server() spawns an unbounded task per accepted TCP connection and serve_one()
performs a read with no timeout, allowing clients to hold connections open indefinitely and exhaust
tasks/file descriptors. This can degrade or crash the flow-collector when metrics_addr is enabled.

Code

rust/flow-collector/src/metrics.rs[R206-236]

+    loop {
+        let (mut conn, peer) = match socket.accept().await {
+            Ok(v) => v,
+            Err(e) => {
+                warn!("metrics accept error: {}", e);
+                continue;
+            }
+        };
+        let listeners = listeners.clone();
+        tokio::spawn(async move {
+            if let Err(e) = serve_one(&mut conn, &listeners).await {
+                warn!("metrics conn from {} error: {}", peer, e);
+            }
+        });
+    }
+}
+
+async fn serve_one(
+    conn: &mut tokio::net::TcpStream,
+    listeners: &[Arc<ListenerMetrics>],
+) -> io::Result<()> {
+    // We don't parse the request rigorously — Prometheus and curl both
+    // send a small HTTP/1.1 GET. Read up to the end of the request line
+    // so we can branch on the path.
+    let mut buf = [0u8; 1024];
+    let n = conn.read(&mut buf).await?;
+    let request_line = std::str::from_utf8(&buf[..n])
+        .ok()
+        .and_then(|s| s.lines().next())
+        .unwrap_or("");
+    let path = request_line.split_whitespace().nth(1).unwrap_or("/");

Evidence

The server accepts connections in a loop, clones listener state, and spawns a task that awaits
conn.read() without any deadline; an attacker (or misbehaving scraper) can keep many connections
open and consume resources.

rust/flow-collector/src/metrics.rs[194-253]
helm/serviceradar/values.yaml[836-840]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Prometheus HTTP server has no read/write timeouts and spawns unbounded tasks per connection. A slowloris-style client can hold many connections open, consuming memory and file descriptors.
### Issue Context
The server is intentionally hand-rolled (no hyper/axum), so it must implement basic defensive measures itself.
### Fix Focus Areas
- rust/flow-collector/src/metrics.rs[194-253]
### Suggested fix
- Wrap `conn.read(...)` and `conn.write_all(...)` in `tokio::time::timeout` (e.g., 1-5s for request read, 5-10s for response write).
- Add a small concurrency limit (e.g., `Semaphore`) around per-connection handling to bound simultaneous in-flight connections.
- Optionally: read until `\r\n` for the request line (or `\r\n\r\n` for headers) with a strict max size, to avoid partial-read misclassification.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

6. Sources metric never updates 🐞 Bug ◔ Observability ⭐ New

Description

flow_collector_sources is documented and exported, but ListenerMetrics.source_count is only
updated by the template-store metrics ticker which is spawned only when template_store is
configured, leaving flow_collector_sources stuck at 0 for local-only deployments.

Code

rust/flow-collector/src/netflow/mod.rs[R130-143]

+        // Spawn a background ticker that aggregates per-source CacheMetrics
+        // into the listener-level Prometheus counters once a second. This
+        // keeps the parse_datagram hot path free of O(sources) work and
+        // (combined with retired-source accounting in the ticker) makes
+        // the surfaced counters monotonically increasing — required for
+        // Prometheus rate() semantics. Only spawned when a template store
+        // is configured, since the counters are no-ops otherwise.
+        if store_enabled {
+            let parser_for_ticker = Arc::clone(&parser);
+            let metrics_for_ticker = Arc::clone(&metrics);
+            tokio::runtime::Handle::current().spawn(async move {
+                run_metrics_ticker(parser_for_ticker, metrics_for_ticker).await;
+            });
        }

Evidence

render_prometheus() always emits flow_collector_sources for netflow listeners, and the new
scaling guide explicitly tells operators to monitor it. However, NetflowHandler::new() only spawns
the ticker (which is where source_count gets written) when store_enabled is true, so when
template_store is omitted (a supported/mentioned mode) the gauge stays at its initial 0 value
indefinitely.

rust/flow-collector/src/netflow/mod.rs[130-143]
rust/flow-collector/src/metrics.rs[189-196]
docs/docs/flow-collector-scaling.md[83-96]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The `flow_collector_sources` gauge is exported and documented, but it remains 0 unless `template_store` is enabled.

### Issue Context
`source_count` is independent of the template store; it’s useful even in local-only deployments for capacity planning (`max_sources` proximity).

### Fix Focus Areas
- rust/flow-collector/src/netflow/mod.rs[130-171]
- rust/flow-collector/src/metrics.rs[189-196]

### What to change
Choose one:
- **Low overhead (recommended):** Update `metrics.source_count` inside `parse_datagram` while the parser mutex is already held (e.g., right after `iter_packets_from_source(...)`), so it’s accurate regardless of template store.
- Or spawn a lightweight 1Hz ticker unconditionally for `source_count`, and keep the heavier per-source template-store aggregation only when `template_store` is enabled.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

7. Metrics exposed via LoadBalancer 🐞 Bug ⛨ Security

Description

Helm defaults set flow-collector Service type to LoadBalancer and enable the TCP metrics port while
metrics_addr binds to 0.0.0.0, making an unauthenticated HTTP /metrics endpoint externally reachable
wherever the LoadBalancer is reachable. This leaks operational data and increases attack surface.

Code

helm/serviceradar/values.yaml[R839-883]

+    metrics_addr: "0.0.0.0:50046"
+
+    # Optional: shared template store. When set, every flow-collector pod
+    # writes through learned NetFlow/IPFIX templates to this NATS KV bucket
+    # and consults it on cache miss, so a packet routed to a fresh pod can
+    # decode templates other pods have already seen. Comment out the
+    # whole block to disable (each pod uses only its in-process LRU).
+    template_store:
+      kv_bucket: "flow_templates"
+      # Number of historical revisions per key. Templates rarely change so
+      # 1 is fine; larger values support audit / debug. NATS KV history is
+      # capped at 64.
+      kv_history: 1
+      # Optional TTL (seconds) for entries. 0 disables expiration; NATS
+      # will then keep templates indefinitely (or until withdrawn).
+      kv_ttl_secs: 0
+
   listeners:
     - protocol: "sflow"
       listen_addr: "0.0.0.0:6343"
       subject: "flows.raw.sflow"
     - protocol: "netflow"
       listen_addr: "0.0.0.0:2055"
       subject: "flows.raw.netflow"
+        max_templates: 2000
+        max_template_fields: 10000
   security:
     mode: "mtls"
     cert_dir: "/etc/serviceradar/certs"
     tls:
       cert_file: "flow-collector.pem"
       key_file: "flow-collector-key.pem"
       ca_file: "ca.crt"
-  data:
-    enabled: true
-    existingClaim: ""
-    size: 1Gi
-    storageClassName: ""
+
 service:
-    type: ClusterIP
+    # LoadBalancer + Local preserves the exporter's source IP through the
+    # LB → kube-proxy → pod path, which AutoScopedParser depends on for
+    # per-source template scoping.
+    type: LoadBalancer
   annotations: {}
   loadBalancerIP: ""
-    externalTrafficPolicy: Cluster
+    externalTrafficPolicy: Local
   sessionAffinity: None
   ports:
     netflow:

Evidence
values.yaml enables metrics_addr and the metrics service port by default, and the chart defaults the
Service to LoadBalancer; the metrics server serves plaintext without auth/TLS.
helm/serviceradar/values.yaml[836-902]
helm/serviceradar/templates/flow-collector.yaml[173-225]
rust/flow-collector/src/metrics.rs[194-253]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
With the new Prometheus server, the Helm defaults now expose `/metrics` over a public-ish LoadBalancer by default (service type LoadBalancer + metrics port enabled + bind 0.0.0.0).
### Issue Context
Even if many clusters use internal LBs, chart defaults should avoid unintentionally exposing unauthenticated HTTP endpoints.
### Fix Focus Areas
- helm/serviceradar/values.yaml[836-902]
- helm/serviceradar/templates/flow-collector.yaml[173-225]
### Suggested fix
Choose one (or combine):
- Default `service.ports.metrics.enabled: false`.
- Bind metrics to localhost by default (e.g., `metrics_addr: "127.0.0.1:50046"`) and require explicit override to `0.0.0.0`.
- Create a separate `ClusterIP` Service for metrics (and keep the UDP ingest Service as LoadBalancer).
- Document/optionally template NetworkPolicy guidance for restricting metrics access.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

8. Per-packet O(sources) scan 🐞 Bug ➹ Performance

Description

NetflowHandler::parse_datagram calls update_template_store_metrics() on every UDP datagram while
holding the parser Mutex, and the function iterates per-source info maps to sum metrics. This adds
work proportional to (packet_rate × source_count) and increases lock hold time in the parsing hot
path.

Code

rust/flow-collector/src/netflow/mod.rs[R145-159]

       let packets: Vec<_> = {
           let mut parser = self.parser.lock().unwrap();
-            match parser.iter_packets_from_source(peer, buf) {
+            let result = match parser.iter_packets_from_source(peer, buf) {
               Ok(iter) => iter.collect(),
               Err(e) => {
                   warn!("Failed to parse NetFlow packet from {}: {:?}", peer, e);
                   self.metrics.parse_errors.fetch_add(1, Ordering::Relaxed);
                   return vec![];
               }
-            }
+            };
+            // Refresh the listener-level aggregates while we still hold the
+            // parser lock — atomic stores are cheap and this keeps the
+            // Prometheus snapshot fresh on every datagram.
+            update_template_store_metrics(&parser, &self.metrics);
+            result

Evidence
Listener::run invokes parse_datagram for every recv_from() call, and parse_datagram holds the parser
Mutex while calling update_template_store_metrics(), which loops all sources via
parser.{ipfix,v9,legacy}_info().
rust/flow-collector/src/listener.rs[98-134]
rust/flow-collector/src/netflow/mod.rs[145-160]
rust/flow-collector/src/netflow/mod.rs[190-239]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Template-store metrics aggregation currently scans all sources on every datagram while holding the parser lock, increasing CPU usage and lock time as `source_count` grows.
### Issue Context
Prometheus scrapes happen on a much lower cadence than packet ingestion. Aggregation does not need to run per packet to be useful.
### Fix Focus Areas
- rust/flow-collector/src/netflow/mod.rs[145-160]
- rust/flow-collector/src/netflow/mod.rs[190-239]
### Suggested fix
- Move aggregation to a periodic task (e.g., every 5-30s) that locks the parser, computes totals once, and updates atomics.
- Or, only refresh aggregates on demand (e.g., when serving `/metrics`), if you can safely access the parser from the metrics server.
- If keeping it inline, add sampling (e.g., refresh at most once per N milliseconds) to bound overhead.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Previous review results

Review updated until commit d8e67a9

Results up to commit 1f09dd8

🐞 Bugs (4) 📘 Rule violations (0)

1. Template store skips creds 🐞 Bug ≡ Correctness

Description

bootstrap_template_store() connects with async_nats::connect(nats_url) and ignores
Config.nats_creds_file and Config.security (mTLS), so enabling template_store will fail to connect
on authenticated/mTLS NATS clusters (or connect without the intended client identity). This is
inconsistent with Publisher::connect_once(), which explicitly applies credentials and TLS options.

Code

rust/flow-collector/src/main.rs[R185-188]

+    let client = async_nats::connect(nats_url)
+        .await
+        .with_context(|| format!("connecting to NATS at {} for template store", nats_url))?;
+    let js = jetstream::new(client);

Evidence
The flow-collector already requires NATS creds/mTLS for publishing, and Helm mounts/waits for the
creds file; however the template-store connection is created without those options, making it unable
to authenticate in the same environments.
rust/flow-collector/src/main.rs[181-189]
rust/flow-collector/src/publisher.rs[88-115]
rust/flow-collector/src/config.rs[6-47]
helm/serviceradar/values.yaml[829-872]
helm/serviceradar/templates/flow-collector.yaml[109-170]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`bootstrap_template_store()` creates a NATS connection via `async_nats::connect(nats_url)` and does not apply the same credentials-file and mTLS configuration used by the publisher. In clusters where NATS requires creds and/or client certs, enabling `template_store` will prevent startup (connection/auth failure) or connect with the wrong security posture.

### Issue Context
`Publisher::connect_once()` already builds `ConnectOptions` from `Config.security` and `Config.nats_creds_file`. Helm also mounts and waits for `/etc/serviceradar/creds/platform.creds`, suggesting creds are required in real deployments.

### Fix Focus Areas
- rust/flow-collector/src/main.rs[181-207]
- rust/flow-collector/src/publisher.rs[88-115]
- rust/flow-collector/src/config.rs[6-47]

### Suggested fix
- Change `bootstrap_template_store()` to accept `&Config` (or the needed security fields), and build `async_nats::ConnectOptions` the same way as the publisher (root certs + client cert + creds file).
- Consider factoring the shared NATS connection setup into a helper (e.g., `fn nats_connect_options(cfg: &Config) -> Result<ConnectOptions>`), used by both publisher and template store.
- (Optional but consistent) add the same retry/backoff behavior as `connect_with_retry()` if template_store should tolerate NATS startup ordering.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. Metrics server slowloris DoS 🐞 Bug ⛨ Security

Description

run_prometheus_server() spawns an unbounded task per accepted TCP connection and serve_one()
performs a read with no timeout, allowing clients to hold connections open indefinitely and exhaust
tasks/file descriptors. This can degrade or crash the flow-collector when metrics_addr is enabled.

Code

rust/flow-collector/src/metrics.rs[R206-236]

+    loop {
+        let (mut conn, peer) = match socket.accept().await {
+            Ok(v) => v,
+            Err(e) => {
+                warn!("metrics accept error: {}", e);
+                continue;
+            }
+        };
+        let listeners = listeners.clone();
+        tokio::spawn(async move {
+            if let Err(e) = serve_one(&mut conn, &listeners).await {
+                warn!("metrics conn from {} error: {}", peer, e);
+            }
+        });
+    }
+}
+
+async fn serve_one(
+    conn: &mut tokio::net::TcpStream,
+    listeners: &[Arc<ListenerMetrics>],
+) -> io::Result<()> {
+    // We don't parse the request rigorously — Prometheus and curl both
+    // send a small HTTP/1.1 GET. Read up to the end of the request line
+    // so we can branch on the path.
+    let mut buf = [0u8; 1024];
+    let n = conn.read(&mut buf).await?;
+    let request_line = std::str::from_utf8(&buf[..n])
+        .ok()
+        .and_then(|s| s.lines().next())
+        .unwrap_or("");
+    let path = request_line.split_whitespace().nth(1).unwrap_or("/");

Evidence

The server accepts connections in a loop, clones listener state, and spawns a task that awaits
conn.read() without any deadline; an attacker (or misbehaving scraper) can keep many connections
open and consume resources.

rust/flow-collector/src/metrics.rs[194-253]
helm/serviceradar/values.yaml[836-840]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The Prometheus HTTP server has no read/write timeouts and spawns unbounded tasks per connection. A slowloris-style client can hold many connections open, consuming memory and file descriptors.

### Issue Context
The server is intentionally hand-rolled (no hyper/axum), so it must implement basic defensive measures itself.

### Fix Focus Areas
- rust/flow-collector/src/metrics.rs[194-253]

### Suggested fix
- Wrap `conn.read(...)` and `conn.write_all(...)` in `tokio::time::timeout` (e.g., 1-5s for request read, 5-10s for response write).
- Add a small concurrency limit (e.g., `Semaphore`) around per-connection handling to bound simultaneous in-flight connections.
- Optionally: read until `\r\n` for the request line (or `\r\n\r\n` for headers) with a strict max size, to avoid partial-read misclassification.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

3. Metrics exposed via LoadBalancer 🐞 Bug ⛨ Security

Description

Helm defaults set flow-collector Service type to LoadBalancer and enable the TCP metrics port while
metrics_addr binds to 0.0.0.0, making an unauthenticated HTTP /metrics endpoint externally reachable
wherever the LoadBalancer is reachable. This leaks operational data and increases attack surface.

Code

helm/serviceradar/values.yaml[R839-883]

+    metrics_addr: "0.0.0.0:50046"
+
+    # Optional: shared template store. When set, every flow-collector pod
+    # writes through learned NetFlow/IPFIX templates to this NATS KV bucket
+    # and consults it on cache miss, so a packet routed to a fresh pod can
+    # decode templates other pods have already seen. Comment out the
+    # whole block to disable (each pod uses only its in-process LRU).
+    template_store:
+      kv_bucket: "flow_templates"
+      # Number of historical revisions per key. Templates rarely change so
+      # 1 is fine; larger values support audit / debug. NATS KV history is
+      # capped at 64.
+      kv_history: 1
+      # Optional TTL (seconds) for entries. 0 disables expiration; NATS
+      # will then keep templates indefinitely (or until withdrawn).
+      kv_ttl_secs: 0
+
    listeners:
      - protocol: "sflow"
        listen_addr: "0.0.0.0:6343"
        subject: "flows.raw.sflow"
      - protocol: "netflow"
        listen_addr: "0.0.0.0:2055"
        subject: "flows.raw.netflow"
+        max_templates: 2000
+        max_template_fields: 10000
    security:
      mode: "mtls"
      cert_dir: "/etc/serviceradar/certs"
      tls:
        cert_file: "flow-collector.pem"
        key_file: "flow-collector-key.pem"
        ca_file: "ca.crt"
-  data:
-    enabled: true
-    existingClaim: ""
-    size: 1Gi
-    storageClassName: ""
+
  service:
-    type: ClusterIP
+    # LoadBalancer + Local preserves the exporter's source IP through the
+    # LB → kube-proxy → pod path, which AutoScopedParser depends on for
+    # per-source template scoping.
+    type: LoadBalancer
    annotations: {}
    loadBalancerIP: ""
-    externalTrafficPolicy: Cluster
+    externalTrafficPolicy: Local
    sessionAffinity: None
    ports:
      netflow:

Evidence
values.yaml enables metrics_addr and the metrics service port by default, and the chart defaults the
Service to LoadBalancer; the metrics server serves plaintext without auth/TLS.
helm/serviceradar/values.yaml[836-902]
helm/serviceradar/templates/flow-collector.yaml[173-225]
rust/flow-collector/src/metrics.rs[194-253]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
With the new Prometheus server, the Helm defaults now expose `/metrics` over a public-ish LoadBalancer by default (service type LoadBalancer + metrics port enabled + bind 0.0.0.0).

### Issue Context
Even if many clusters use internal LBs, chart defaults should avoid unintentionally exposing unauthenticated HTTP endpoints.

### Fix Focus Areas
- helm/serviceradar/values.yaml[836-902]
- helm/serviceradar/templates/flow-collector.yaml[173-225]

### Suggested fix
Choose one (or combine):
- Default `service.ports.metrics.enabled: false`.
- Bind metrics to localhost by default (e.g., `metrics_addr: "127.0.0.1:50046"`) and require explicit override to `0.0.0.0`.
- Create a separate `ClusterIP` Service for metrics (and keep the UDP ingest Service as LoadBalancer).
- Document/optionally template NetworkPolicy guidance for restricting metrics access.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

4. Per-packet O(sources) scan 🐞 Bug ➹ Performance

Description

NetflowHandler::parse_datagram calls update_template_store_metrics() on every UDP datagram while
holding the parser Mutex, and the function iterates per-source info maps to sum metrics. This adds
work proportional to (packet_rate × source_count) and increases lock hold time in the parsing hot
path.

Code

rust/flow-collector/src/netflow/mod.rs[R145-159]

        let packets: Vec<_> = {
            let mut parser = self.parser.lock().unwrap();
-            match parser.iter_packets_from_source(peer, buf) {
+            let result = match parser.iter_packets_from_source(peer, buf) {
                Ok(iter) => iter.collect(),
                Err(e) => {
                    warn!("Failed to parse NetFlow packet from {}: {:?}", peer, e);
                    self.metrics.parse_errors.fetch_add(1, Ordering::Relaxed);
                    return vec![];
                }
-            }
+            };
+            // Refresh the listener-level aggregates while we still hold the
+            // parser lock — atomic stores are cheap and this keeps the
+            // Prometheus snapshot fresh on every datagram.
+            update_template_store_metrics(&parser, &self.metrics);
+            result

Evidence
Listener::run invokes parse_datagram for every recv_from() call, and parse_datagram holds the parser
Mutex while calling update_template_store_metrics(), which loops all sources via
parser.{ipfix,v9,legacy}_info().
rust/flow-collector/src/listener.rs[98-134]
rust/flow-collector/src/netflow/mod.rs[145-160]
rust/flow-collector/src/netflow/mod.rs[190-239]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Template-store metrics aggregation currently scans all sources on every datagram while holding the parser lock, increasing CPU usage and lock time as `source_count` grows.

### Issue Context
Prometheus scrapes happen on a much lower cadence than packet ingestion. Aggregation does not need to run per packet to be useful.

### Fix Focus Areas
- rust/flow-collector/src/netflow/mod.rs[145-160]
- rust/flow-collector/src/netflow/mod.rs[190-239]

### Suggested fix
- Move aggregation to a periodic task (e.g., every 5-30s) that locks the parser, computes totals once, and updates atomics.
- Or, only refresh aggregates on demand (e.g., when serving `/metrics`), if you can safely access the parser from the metrics server.
- If keeping it inline, add sampling (e.g., refresh at most once per N milliseconds) to bound overhead.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

…emonSet Wire the netflow_parser 1.0.3 TemplateStore extension point into the flow collector and switch the K8s deployment from a single Deployment to a per-node DaemonSet behind `Service: LoadBalancer` with `externalTrafficPolicy: Local`. Together these let the collector scale horizontally with node count while preserving exporter source IP and sharing learned NetFlow / IPFIX templates across pods via NATS JetStream KV. Why --- A single flow-collector Deployment was a singleton bottleneck for UDP ingest. Scaling required either source-IP-affinity routing (so every exporter always lands on the same pod and templates stay local) or a shared template store so any pod could decode any flow. We picked the shared-store path because it doesn't depend on the LB layer. The original plan was to front this with Envoy Gateway UDPRoute, but a research spike found UDPRoute does not preserve client source IP today — upstreams see the gateway pod's IP, which would collapse AutoScopedParser's per-(source_ip,source_id) scoping. Service + DaemonSet + `externalTrafficPolicy: Local` is the workable alternative on MetalLB (L2 mode preserves source IP through to the pod). Gateway API stays available for the rest of the stack (HTTP/gRPC/mTLS). Code ---- * `Cargo.toml`: bump netflow_parser 1.0.0 -> 1.0.3 (TemplateStore API); add `bytes` for the NATS KV payload type. * `src/nats_client.rs` (new): Shared `connect_once` and `connect_with_retry` helpers. Both publisher and template-store bootstrap go through these, applying TLS root CA + client cert + creds file from `Config.security` and `Config.nats_creds_file`. Eliminates the easy-to-miss bug where one path used bare `async_nats::connect(url)` and silently broke mTLS deployments. Bounded backoff (60 attempts, 0.5s -> 30s) so neither path crash-loops the pod when NATS is slow to come up. * `src/template_store.rs` (new): `NatsKvTemplateStore` implements the parser's sync `TemplateStore` trait. async_nats is async-only so we bridge via `tokio::task::block_in_place` + `Handle::block_on`. NATS KV keys are restricted to `[A-Za-z0-9._=/-]`; the scope produced by AutoScopedParser (`v9:1.2.3.4:2055/0`, IPv6 with brackets, etc.) is sanitized via underscore replacement. `kind_tag` wildcard arm warns once via `std::sync::Once` if a future netflow_parser version ships a new `TemplateKind` variant. * `src/netflow/mod.rs`: `NetflowHandler::new` accepts an optional `Arc<dyn TemplateStore>` and threads it through `NetflowParserBuilder::with_template_store`. New `TemplateEvent:: Restored` arm in the event-callback (logs at info). When a store is configured, spawns a 1Hz background ticker that aggregates per-source `CacheMetrics` into the listener-level Prometheus counters via a `DeltaState` that tracks per-source last-known values plus a `retired` total — so listener counters never decrease when sources are LRU-evicted. Off the parse_datagram hot path. * `src/listener.rs`: `build_handler` takes `Option<Arc<dyn TemplateStore>>`. sFlow ignores it (template-less). * `src/config.rs`: new optional `TemplateStoreConfig` with `kv_bucket`, `kv_history` (u8, validated `1..=64`), `kv_ttl_secs`, and optional `nats_url` override for split-fault-domain setups. * `src/main.rs`: `bootstrap_template_store(&Config, &TemplateStoreConfig)` uses `nats_client::connect_with_retry` (so TLS/creds/retry match the publisher), then `create_or_update_key_value` for genuinely idempotent KV bucket bootstrap across config drift. Independent NATS connection from the publisher's so KV failures cannot stall publishing and vice versa. * `src/metrics.rs`: `ListenerMetrics` gains `template_store_restored`, `template_store_codec_errors`, `template_store_backend_errors`, `source_count` atomics. New `render_prometheus()` produces text exposition format with HELP/TYPE headers and properly escaped label values (`\` -> `\\`, `"` -> `\"`, `\n` -> `\n`). sFlow listeners are filtered out of `template_store_*` and `flow_collector_sources` rows since those are structurally zero for sFlow. New `run_prometheus_server()` is a hand-rolled HTTP/1.1 server on tokio's TcpListener — `GET /metrics` returns the exposition; anything else 404s. No new deps. Defensive measures against slowloris-style attacks: per-line read timeout (3s), response write timeout (5s), 8 KiB max-request-bytes cap via `AsyncReadExt::take`, and a `Semaphore`-bounded concurrency cap (64) that fast-fails excess connections rather than queueing them so an attacker cannot exhaust file descriptors. Spawned from main.rs when `metrics_addr` is set; runs independently of the log reporter and the publisher so a scrape failure can never stall ingestion. K8s manifests ------------- * `k8s/demo/base/serviceradar-flow-collector.yaml`: `Deployment` -> `DaemonSet`, replicas dropped, unused PVC removed (the Rust binary writes nothing to disk). ConfigMap JSON gains the `template_store` block. Service was already `externalTrafficPolicy: Local` — kept. Added `RollingUpdate` with `maxUnavailable: 1`. HTTP probes on the `/metrics` endpoint replace the old `pgrep` exec probes (catches deadlocked-but-running scenarios). * `helm/serviceradar/templates/flow-collector.yaml`: same kind flip, PVC removal, plus `nodeSelector` / `tolerations` knobs. `templates/NOTES.txt` warns operators about the `replicaCount` and `data.*` value removals and prints the explicit `kubectl delete deployment / pvc` commands needed to clean up orphaned objects from a pre-DaemonSet install. * `helm/serviceradar/values.yaml`: drop `replicaCount` and `data.*`, add `flowCollector.config.template_store` block, default `service.type: LoadBalancer` and `service.externalTrafficPolicy: Local`, add the IPFIX (4739) and metrics (50046) ports. Tests ----- 55 tests pass; `clippy --all-targets -D warnings` clean. New tests: * 5 `NatsKvTemplateStore` key-render tests (IPv4, IPv6, empty scope, distinct scopes, distinct kinds). * 2 `TemplateStoreConfig` deserialization tests. * 2 `DeltaState` monotonic-counter tests covering source eviction and re-emergence and multiple-sources / multiple-kinds aggregation. * `escape_label_handles_quotes_backslashes_newlines`, `template_store_metrics_only_emit_for_netflow_listeners`. * `http_server_serves_metrics_and_404` — end-to-end including a deliberate split-read of the request line that the old single-`read()` code would have failed. * `http_server_drops_idle_connection_within_read_timeout` — slow client closed within 2x READ_TIMEOUT (was: would hang forever). * `http_server_concurrency_cap_fast_fails_excess_connections` — saturate with 64 idle conns, verify a fresh request doesn't hang past 2s (was: would queue and starve real scrapers). Docs ---- `docs/docs/flow-collector-scaling.md`: deployment model with diagram and rationale for `externalTrafficPolicy: Local`; configuration knobs table; small / medium / large / very-large sizing rules of thumb; Prometheus metrics with healthy-vs-trouble interpretations and alerting suggestions; signals you've outgrown the design plus the sharding strategy when you do; "Upgrading from a Pre-DaemonSet Install" section with explicit `kubectl delete deployment / pvc` commands; "Rolling Upgrade Behavior" section explaining how `externalTrafficPolicy: Local` interacts with DaemonSet rolling updates and why `maxUnavailable: 1` is the floor; deploy verification checklist (LB IP, source-IP smoke test, cross-pod template share check, Prometheus scrape). Deliberately avoids speculative throughput numbers — flagged as "benchmark on your hardware" — until real load tests fill them in.

qodo-code-review · 2026-05-03T22:04:55Z

Persistent review updated to latest commit d8e67a9

…aling doc - Add a default netflow listener on 0.0.0.0:4739 in Helm values and the demo manifest so the advertised IPFIX Service port has a real UDP socket behind it instead of silently dropping datagrams. - Gate the flow-collector liveness/readiness probes on service.ports.metrics.enabled AND config.metrics_addr; fall back to a pgrep exec probe when metrics are off so disabling /metrics no longer crashloops the pod. Document the implication near the demo values' metrics.enabled toggle. - Replace non-ASCII characters (em/en dashes, arrows, box drawing, less-than-or-equal) in docs/docs/flow-collector-scaling.md with ASCII equivalents to satisfy the docs ASCII-only constraint.

mikemiles-dev requested a review from mfreeman451 as a code owner May 3, 2026 21:26

qodo-code-review Bot reviewed May 3, 2026

View reviewed changes

Comment thread rust/flow-collector/src/main.rs Outdated

Comment thread rust/flow-collector/src/metrics.rs Outdated

mikemiles-dev marked this pull request as draft May 3, 2026 21:33

mikemiles-dev force-pushed the feat/flow-collector-template-store branch from 5eac517 to d8e67a9 Compare May 3, 2026 22:03

mikemiles-dev marked this pull request as ready for review May 3, 2026 22:04

qodo-code-review Bot reviewed May 3, 2026

View reviewed changes

Comment thread docs/docs/flow-collector-scaling.md Outdated

Comment thread helm/serviceradar/values.yaml

Comment thread helm/serviceradar/templates/flow-collector.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(flow-collector): horizontal scale via NATS KV TemplateStore + Da…#3148

feat(flow-collector): horizontal scale via NATS KV TemplateStore + Da…#3148
mikemiles-dev wants to merge 2 commits into
carverauto:stagingfrom
mikemiles-dev:feat/flow-collector-template-store

mikemiles-dev commented May 3, 2026

Uh oh!

qodo-code-review Bot commented May 3, 2026 •

edited

Loading

Uh oh!

qodo-code-review Bot commented May 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

qodo-code-review Bot commented May 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

mikemiles-dev commented May 3, 2026

Why

Code changes

K8s manifest changes

Tests

What's left to do

IMPORTANT: Please sign the Developer Certificate of Origin

Describe your changes

Issue ticket number and link

Code checklist before requesting a review

Uh oh!

qodo-code-review Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review Summary by Qodo

(Agentic_describe updated until commit d8e67a9)

Walkthroughs

File Changes

Uh oh!

qodo-code-review Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Previous review results

Uh oh!

Uh oh!

Uh oh!

qodo-code-review Bot commented May 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

qodo-code-review Bot commented May 3, 2026 •

edited

Loading

(Agentic_describe updated until commit `d8e67a9`)

qodo-code-review Bot commented May 3, 2026 •

edited

Loading

qodo-code-review Bot commented May 3, 2026 •

edited

Loading