Skip to content

feat(flow-collector): horizontal scale via NATS KV TemplateStore + Da…#3148

Open
mikemiles-dev wants to merge 2 commits into
carverauto:stagingfrom
mikemiles-dev:feat/flow-collector-template-store
Open

feat(flow-collector): horizontal scale via NATS KV TemplateStore + Da…#3148
mikemiles-dev wants to merge 2 commits into
carverauto:stagingfrom
mikemiles-dev:feat/flow-collector-template-store

Conversation

@mikemiles-dev

Copy link
Copy Markdown
Collaborator

…emonSet

Wires the new netflow_parser 1.0.3 TemplateStore extension point into the flow-collector and switches the K8s deployment from a single Deployment to a per-node DaemonSet behind a Service: LoadBalancer with externalTrafficPolicy: Local. Together these two changes let the flow collector scale horizontally with node count while preserving exporter source IP and sharing learned NetFlow / IPFIX templates across pods.

Why

A single flow-collector Deployment was a singleton bottleneck for UDP ingest. Scaling it required either source-IP-affinity routing (so every exporter always lands on the same pod and templates stay local) or a shared template store so any pod could decode any flow. We picked the shared-store path because it doesn't depend on the LB layer.

The original plan was to front this with Envoy Gateway UDPRoute, but a research spike found that Envoy Gateway's UDPRoute does not preserve client source IP (https://gateway.envoyproxy.io/docs/tasks/traffic/udp-routing/) — upstreams see the gateway pod's IP, which would collapse AutoScopedParser's per-(source_ip,source_id) scoping. Service+DaemonSet +externalTrafficPolicy: Local is the workable alternative on MetalLB (L2 mode preserves source IP through to the pod). Gateway API stays available for the rest of the stack (HTTP/gRPC/mTLS).

Code changes

  • Cargo.toml: bump netflow_parser 1.0.0 -> 1.0.3 (TemplateStore API); add bytes for the NATS KV payload type.

  • src/template_store.rs (new): NatsKvTemplateStore implements the parser's TemplateStore trait. async_nats is async-only and the trait is sync, so we bridge via tokio::task::block_in_place + Handle::block_on. Safe under the multi_thread runtime the collector already uses. NATS KV keys are restricted to [A-Za-z0-9._=/-]; the scope produced by AutoScopedParser ("v9:1.2.3.4:2055/0", IPv6 with brackets, etc.) is sanitized via underscore replacement. Five render tests cover IPv4, IPv6, empty scope, distinct-scopes and distinct-kinds.

  • src/netflow/mod.rs: NetflowHandler::new accepts an optional Arc<dyn TemplateStore> and threads it through NetflowParserBuilder::with_template_store(...). Adds the new TemplateEvent::Restored arm to the existing event-callback (logs at info). Adds update_template_store_metrics() which iterates each per-source NetflowParser via AutoScopedParser::{ipfix,v9,legacy}_info(), sums template_store_* counters from each CacheMetrics, and writes the totals into ListenerMetrics atomics. Called under the parser lock inside parse_datagram so the snapshot is always fresh on the next scrape.

  • src/listener.rs: build_handler now takes Option<Arc<dyn TemplateStore>>. sFlow ignores it (template-less).

  • src/config.rs: new optional TemplateStoreConfig { kv_bucket, kv_history, kv_ttl_secs }. Two deserialization tests.

  • src/main.rs: new bootstrap_template_store() opens an independent NATS connection (separate from the publisher's), gets-or-creates the KV bucket idempotently, and wraps it in NatsKvTemplateStore. Kept on its own connection so KV-bucket flakiness can't backpressure publish and vice versa.

  • src/metrics.rs: ListenerMetrics gains template_store_restored, template_store_codec_errors, template_store_backend_errors, source_count atomics. New render_prometheus() produces text exposition format (8 metrics, HELP/TYPE headers, protocol + listen_addr labels). New run_prometheus_server() is a hand-rolled HTTP/1.1 server on tokio's TcpListener — GET /metrics returns the exposition; anything else 404s. No new deps. Spawned from main.rs when metrics_addr is set; runs independently of the log reporter and the publisher so a scrape failure can never stall ingestion.

K8s manifest changes

  • k8s/demo/base/serviceradar-flow-collector.yaml: Deployment ->
    DaemonSet, replicas dropped, unused PVC removed (the Rust binary writes nothing to disk). ConfigMap JSON gains the template_store block. Service was already externalTrafficPolicy: Local — kept. Added RollingUpdate with maxUnavailable: 1.

  • helm/serviceradar/templates/flow-collector.yaml: same kind flip, PVC removal, plus nodeSelector/tolerations knobs.

  • helm/serviceradar/values.yaml: drop replicaCount and data.*, add flowCollector.config.template_store block, default service.type: LoadBalancer and service.externalTrafficPolicy: Local, add the IPFIX (4739) and metrics (50046) ports.

Tests

48 tests pass:

  • 5 NatsKvTemplateStore key-render tests
  • 2 TemplateStoreConfig deserialization tests
  • 1 Prometheus exposition format test
  • 40 pre-existing flow-collector tests (unchanged) clippy --all-targets -D warnings: clean

What's left to do

Code:

  • Prometheus scrape annotations on the Service so the existing Prometheus / ServiceMonitor picks up the new /metrics endpoint automatically. Both manifests need this.
  • PrometheusRule alerts: template_store_backend_errors > 0 sustained

    5min = NATS unhealthy; template_store_codec_errors > 0 ever =
    wire-format mismatch (drain bucket).

  • Optional: TemplateStoreConfig.kv_replicas hint so the bucket is created with the right JetStream replication for HA. Currently defaults to 1 — operators can pre-create the bucket with nats kv add flow_templates --replicas=3 to override.

Operational (no cluster access from here):

  • Run the source-IP smoke test against staging: LB_IP=$(kubectl get svc serviceradar-flow-collector \ -o jsonpath='{.status.loadBalancer.ingress[0].ip}') kubectl debug $POD -it --image=nicolaka/netshoot -- tcpdump -nn -i any port 2055 # send a packet to $LB_IP:2055 from a known IP # tcpdump should show that source IP, not MetalLB / node IP
  • Run the cross-pod template-share smoke test: send template+data, delete the receiving pod, send same data record, verify "Template restored from secondary store" log on the new pod and that NATS KV nats kv ls flow_templates shows scoped keys.
  • Decide JetStream stream replicas for HA on multi-AZ clusters.
  • Decide TTL on KV entries (default 0 = forever; 24h is reasonable if exporters re-announce frequently).

Decisions still open:

  • Whether to ship Prometheus scrape annotations + PrometheusRule in this branch or a follow-up.
  • Whether to expose kv_replicas in TemplateStoreConfig or rely on pre-create.

IMPORTANT: Please sign the Developer Certificate of Origin

Thank you for your contribution to ServiceRadar. Please note, when contributing, the developer must include
a DCO sign-off statement indicating the DCO acceptance in one commit message. Here
is an example DCO Signed-off-by line in a commit message:

Signed-off-by: J. Doe <j.doe@domain.com>

Describe your changes

Issue ticket number and link

Code checklist before requesting a review

  • I have signed the DCO?
  • The build completes without errors?
  • All tests are passing when running make test?

@mikemiles-dev mikemiles-dev requested a review from mfreeman451 as a code owner May 3, 2026 21:26
@qodo-code-review

qodo-code-review Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

Review Summary by Qodo

(Agentic_describe updated until commit d8e67a9)

Horizontal scale flow collector via NATS KV TemplateStore and DaemonSet deployment

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Enable horizontal scaling of flow collector via NATS KV shared template store
• Switch K8s deployment from single Deployment to per-node DaemonSet with LoadBalancer service
• Preserve exporter source IP through load balancer using externalTrafficPolicy: Local
• Add Prometheus metrics endpoint with hand-rolled HTTP server and template-store counters
• Implement background metrics ticker to aggregate per-source template-store statistics
Diagram
flowchart LR
  exporters["Exporters<br/>UDP NetFlow/IPFIX/sFlow"]
  lb["Cloud L4 LB<br/>externalTrafficPolicy: Local"]
  daemonset["DaemonSet<br/>one pod per node"]
  nats_kv["NATS JetStream KV<br/>flow_templates bucket"]
  prometheus["Prometheus<br/>metrics endpoint"]
  
  exporters -->|source IP preserved| lb
  lb -->|distribute across nodes| daemonset
  daemonset -->|read/write templates| nats_kv
  daemonset -->|expose metrics| prometheus
Loading

Grey Divider

File Changes

1. rust/flow-collector/Cargo.toml Dependencies +2/-1

Bump netflow_parser and add bytes dependency

rust/flow-collector/Cargo.toml


2. rust/flow-collector/src/config.rs ⚙️ Configuration changes +98/-0

Add TemplateStoreConfig with NATS KV settings

rust/flow-collector/src/config.rs


3. rust/flow-collector/src/nats_client.rs ✨ Enhancement +109/-0

Shared NATS connection helpers for TLS/creds

rust/flow-collector/src/nats_client.rs


View more (11)
4. rust/flow-collector/src/template_store.rs ✨ Enhancement +220/-0

NATS KV-backed TemplateStore implementation

rust/flow-collector/src/template_store.rs


5. rust/flow-collector/src/main.rs ✨ Enhancement +82/-6

Bootstrap template store and spawn Prometheus server

rust/flow-collector/src/main.rs


6. rust/flow-collector/src/listener.rs ✨ Enhancement +6/-0

Thread template store through handler construction

rust/flow-collector/src/listener.rs


7. rust/flow-collector/src/netflow/mod.rs ✨ Enhancement +262/-5

Integrate TemplateStore and add metrics ticker

rust/flow-collector/src/netflow/mod.rs


8. rust/flow-collector/src/publisher.rs ✨ Enhancement +19/-38

Refactor to use shared NATS connection helpers

rust/flow-collector/src/publisher.rs


9. rust/flow-collector/src/metrics.rs ✨ Enhancement +561/-4

Add template-store counters and Prometheus HTTP server

rust/flow-collector/src/metrics.rs


10. helm/serviceradar/templates/flow-collector.yaml ⚙️ Configuration changes +31/-49

Convert Deployment to DaemonSet with LoadBalancer service

helm/serviceradar/templates/flow-collector.yaml


11. helm/serviceradar/templates/NOTES.txt 📝 Documentation +29/-0

Add upgrade notes for DaemonSet migration

helm/serviceradar/templates/NOTES.txt


12. helm/serviceradar/values.yaml ⚙️ Configuration changes +56/-8

Update defaults for DaemonSet, template store, and metrics

helm/serviceradar/values.yaml


13. k8s/demo/base/serviceradar-flow-collector.yaml ⚙️ Configuration changes +35/-30

Convert demo manifest to DaemonSet with template store

k8s/demo/base/serviceradar-flow-collector.yaml


14. docs/docs/flow-collector-scaling.md 📝 Documentation +210/-0

New scaling guide with sizing rules and monitoring

docs/docs/flow-collector-scaling.md


Grey Divider

Qodo Logo

@qodo-code-review

qodo-code-review Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

Code Review by Qodo

🐞 Bugs (7) 📘 Rule violations (1)

Grey Divider


Action required

1. Non-ASCII in scaling guide 📘 Rule violation ⚙ Maintainability ⭐ New
Description
The new Markdown doc includes non-ASCII characters (e.g., em dash , , arrows), violating the
ASCII-only documentation requirement. This can break tooling/portability expectations for docs
content.
Code

docs/docs/flow-collector-scaling.md[2]

+title: Flow Collector — Scaling Guide
Evidence
PR Compliance ID 6 requires Markdown content to be ASCII-only. The added doc contains multiple
non-ASCII characters, such as  in the title and  in section headings.

AGENTS.md
docs/docs/flow-collector-scaling.md[2-2]
docs/docs/flow-collector-scaling.md[56-56]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`docs/docs/flow-collector-scaling.md` contains non-ASCII characters (e.g., `—`, `≤`, `→`). Compliance requires ASCII-only Markdown.

## Issue Context
The scaling guide is newly added documentation under `docs/docs/`, so it must adhere to the ASCII-only constraint.

## Fix Focus Areas
- docs/docs/flow-collector-scaling.md[1-210]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. IPFIX port not listened 🐞 Bug ≡ Correctness ⭐ New
Description
The Helm/Kustomize manifests expose UDP port 4739 (ipfix) by default, but the default
flow-collector config does not bind any listener on 4739, so IPFIX datagrams sent to the Service’s
IPFIX port will be dropped silently.
Code

helm/serviceradar/values.yaml[R902-906]

+      ipfix:
+        enabled: true
+        port: 4739
+        targetPort: 4739
+        protocol: UDP
Evidence
values.yaml enables the Service’s ipfix port on 4739, but the default
flowCollector.config.listeners only binds 0.0.0.0:2055 (netflow) and 0.0.0.0:6343 (sflow). The
kustomize demo manifest has the same mismatch: it exposes ipfix:4739 in the Service and container
ports, but the embedded flow-collector.json has no listener on 4739.

helm/serviceradar/values.yaml[865-873]
helm/serviceradar/values.yaml[891-906]
k8s/demo/base/serviceradar-flow-collector.yaml[21-56]
k8s/demo/base/serviceradar-flow-collector.yaml[173-189]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The chart/manifests advertise IPFIX ingestion on UDP/4739, but the flow-collector config does not bind to 4739 by default. This causes silent IPFIX data loss.

### Issue Context
`ListenerConfig` only has `netflow`/`sflow` variants; to receive IPFIX on 4739 you must add a second `protocol: netflow` listener with `listen_addr: 0.0.0.0:4739` (the netflow parser can decode IPFIX datagrams on any UDP port).

### Fix Focus Areas
- helm/serviceradar/values.yaml[865-906]
- k8s/demo/base/serviceradar-flow-collector.yaml[21-56]

### What to change
- Either (preferred) add a default listener for 4739 in Helm values and the demo manifest, or
- Set `service.ports.ipfix.enabled: false` by default and clearly document that enabling it requires adding a matching listener.
- Optionally add config validation at startup that warns/errors when a Service port is enabled but no listener is configured for that port (Helm-only check if you don’t want runtime coupling).

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


3. Metrics probe always enabled 🐞 Bug ☼ Reliability ⭐ New
Description
The DaemonSet liveness/readiness probes always hit GET /metrics on port 50046, but the metrics
server only starts when config.metrics_addr is set; disabling metrics (as in values-demo.yaml)
will make pods fail probes and crashloop.
Code

helm/serviceradar/templates/flow-collector.yaml[R87-104]

+        # HTTP probe against the Prometheus endpoint — also confirms the
+        # async runtime is alive (a deadlocked-but-running collector
+        # would pass a `pgrep` check but fail this).
        livenessProbe:
-          exec:
-            command:
-            - /bin/sh
-            - -c
-            - "pgrep -f flow-collector > /dev/null"
+          httpGet:
+            path: /metrics
+            port: {{ .Values.flowCollector.service.ports.metrics.targetPort | default 50046 }}
          initialDelaySeconds: 30
          periodSeconds: 30
          timeoutSeconds: 5
          failureThreshold: 3
        readinessProbe:
-          exec:
-            command:
-            - /bin/sh
-            - -c
-            - "pgrep -f flow-collector > /dev/null"
+          httpGet:
+            path: /metrics
+            port: {{ .Values.flowCollector.service.ports.metrics.targetPort | default 50046 }}
          initialDelaySeconds: 10
          periodSeconds: 10
          timeoutSeconds: 5
Evidence
The Helm template hardwires both probes to /metrics on the metrics port. In the binary, the
Prometheus server is only spawned when config.metrics_addr is Some(_). The demo values
explicitly disable the metrics port and do not set flowCollector.config.metrics_addr, which means
nothing will be listening on 50046 but probes will still run.

helm/serviceradar/templates/flow-collector.yaml[87-105]
rust/flow-collector/src/main.rs[140-150]
helm/serviceradar/values-demo.yaml[489-533]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Kubernetes probes depend on `/metrics`, but `/metrics` is optional at runtime. When operators disable metrics (or omit `metrics_addr`), pods will fail readiness/liveness and crashloop.

### Issue Context
- The binary only starts the Prometheus server when `config.metrics_addr` is set.
- Helm currently always configures HTTP probes.

### Fix Focus Areas
- helm/serviceradar/templates/flow-collector.yaml[84-105]
- rust/flow-collector/src/main.rs[140-150]
- helm/serviceradar/values-demo.yaml[489-533]

### What to change
Implement one of:
1) **Helm fix (recommended):** Wrap liveness/readiness probes in a conditional, e.g. only enable HTTP probes when `.Values.flowCollector.service.ports.metrics.enabled` AND `.Values.flowCollector.config.metrics_addr` are set; otherwise fall back to an `exec` probe (`pgrep`) or disable probes.
2) **App fix:** Always start the metrics server on a fixed port (or always set a default `metrics_addr`) so `/metrics` is guaranteed to exist if probes expect it.

Also update `values-demo.yaml` to match the chosen behavior so demo installs don’t regress.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


View more (2)
4. Template store skips creds 🐞 Bug ≡ Correctness
Description
bootstrap_template_store() connects with async_nats::connect(nats_url) and ignores
Config.nats_creds_file and Config.security (mTLS), so enabling template_store will fail to connect
on authenticated/mTLS NATS clusters (or connect without the intended client identity). This is
inconsistent with Publisher::connect_once(), which explicitly applies credentials and TLS options.
Code

rust/flow-collector/src/main.rs[R185-188]

+    let client = async_nats::connect(nats_url)
+        .await
+        .with_context(|| format!("connecting to NATS at {} for template store", nats_url))?;
+    let js = jetstream::new(client);
Evidence
The flow-collector already requires NATS creds/mTLS for publishing, and Helm mounts/waits for the
creds file; however the template-store connection is created without those options, making it unable
to authenticate in the same environments.

rust/flow-collector/src/main.rs[181-189]
rust/flow-collector/src/publisher.rs[88-115]
rust/flow-collector/src/config.rs[6-47]
helm/serviceradar/values.yaml[829-872]
helm/serviceradar/templates/flow-collector.yaml[109-170]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`bootstrap_template_store()` creates a NATS connection via `async_nats::connect(nats_url)` and does not apply the same credentials-file and mTLS configuration used by the publisher. In clusters where NATS requires creds and/or client certs, enabling `template_store` will prevent startup (connection/auth failure) or connect with the wrong security posture.
### Issue Context
`Publisher::connect_once()` already builds `ConnectOptions` from `Config.security` and `Config.nats_creds_file`. Helm also mounts and waits for `/etc/serviceradar/creds/platform.creds`, suggesting creds are required in real deployments.
### Fix Focus Areas
- rust/flow-collector/src/main.rs[181-207]
- rust/flow-collector/src/publisher.rs[88-115]
- rust/flow-collector/src/config.rs[6-47]
### Suggested fix
- Change `bootstrap_template_store()` to accept `&Config` (or the needed security fields), and build `async_nats::ConnectOptions` the same way as the publisher (root certs + client cert + creds file).
- Consider factoring the shared NATS connection setup into a helper (e.g., `fn nats_connect_options(cfg: &Config) -> Result<ConnectOptions>`), used by both publisher and template store.
- (Optional but consistent) add the same retry/backoff behavior as `connect_with_retry()` if template_store should tolerate NATS startup ordering.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


5. Metrics server slowloris DoS 🐞 Bug ⛨ Security
Description
run_prometheus_server() spawns an unbounded task per accepted TCP connection and serve_one()
performs a read with no timeout, allowing clients to hold connections open indefinitely and exhaust
tasks/file descriptors. This can degrade or crash the flow-collector when metrics_addr is enabled.
Code

rust/flow-collector/src/metrics.rs[R206-236]

+    loop {
+        let (mut conn, peer) = match socket.accept().await {
+            Ok(v) => v,
+            Err(e) => {
+                warn!("metrics accept error: {}", e);
+                continue;
+            }
+        };
+        let listeners = listeners.clone();
+        tokio::spawn(async move {
+            if let Err(e) = serve_one(&mut conn, &listeners).await {
+                warn!("metrics conn from {} error: {}", peer, e);
+            }
+        });
+    }
+}
+
+async fn serve_one(
+    conn: &mut tokio::net::TcpStream,
+    listeners: &[Arc<ListenerMetrics>],
+) -> io::Result<()> {
+    // We don't parse the request rigorously — Prometheus and curl both
+    // send a small HTTP/1.1 GET. Read up to the end of the request line
+    // so we can branch on the path.
+    let mut buf = [0u8; 1024];
+    let n = conn.read(&mut buf).await?;
+    let request_line = std::str::from_utf8(&buf[..n])
+        .ok()
+        .and_then(|s| s.lines().next())
+        .unwrap_or("");
+    let path = request_line.split_whitespace().nth(1).unwrap_or("/");
Evidence
The server accepts connections in a loop, clones listener state, and spawns a task that awaits
conn.read() without any deadline; an attacker (or misbehaving scraper) can keep many connections
open and consume resources.

rust/flow-collector/src/metrics.rs[194-253]
helm/serviceradar/values.yaml[836-840]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The Prometheus HTTP server has no read/write timeouts and spawns unbounded tasks per connection. A slowloris-style client can hold many connections open, consuming memory and file descriptors.
### Issue Context
The server is intentionally hand-rolled (no hyper/axum), so it must implement basic defensive measures itself.
### Fix Focus Areas
- rust/flow-collector/src/metrics.rs[194-253]
### Suggested fix
- Wrap `conn.read(...)` and `conn.write_all(...)` in `tokio::time::timeout` (e.g., 1-5s for request read, 5-10s for response write).
- Add a small concurrency limit (e.g., `Semaphore`) around per-connection handling to bound simultaneous in-flight connections.
- Optionally: read until `\r\n` for the request line (or `\r\n\r\n` for headers) with a strict max size, to avoid partial-read misclassification.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended

6. Sources metric never updates 🐞 Bug ◔ Observability ⭐ New
Description
flow_collector_sources is documented and exported, but ListenerMetrics.source_count is only
updated by the template-store metrics ticker which is spawned only when template_store is
configured, leaving flow_collector_sources stuck at 0 for local-only deployments.
Code

rust/flow-collector/src/netflow/mod.rs[R130-143]

+        // Spawn a background ticker that aggregates per-source CacheMetrics
+        // into the listener-level Prometheus counters once a second. This
+        // keeps the parse_datagram hot path free of O(sources) work and
+        // (combined with retired-source accounting in the ticker) makes
+        // the surfaced counters monotonically increasing — required for
+        // Prometheus rate() semantics. Only spawned when a template store
+        // is configured, since the counters are no-ops otherwise.
+        if store_enabled {
+            let parser_for_ticker = Arc::clone(&parser);
+            let metrics_for_ticker = Arc::clone(&metrics);
+            tokio::runtime::Handle::current().spawn(async move {
+                run_metrics_ticker(parser_for_ticker, metrics_for_ticker).await;
+            });
        }
Evidence
render_prometheus() always emits flow_collector_sources for netflow listeners, and the new
scaling guide explicitly tells operators to monitor it. However, NetflowHandler::new() only spawns
the ticker (which is where source_count gets written) when store_enabled is true, so when
template_store is omitted (a supported/mentioned mode) the gauge stays at its initial 0 value
indefinitely.

rust/flow-collector/src/netflow/mod.rs[130-143]
rust/flow-collector/src/metrics.rs[189-196]
docs/docs/flow-collector-scaling.md[83-96]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The `flow_collector_sources` gauge is exported and documented, but it remains 0 unless `template_store` is enabled.

### Issue Context
`source_count` is independent of the template store; it’s useful even in local-only deployments for capacity planning (`max_sources` proximity).

### Fix Focus Areas
- rust/flow-collector/src/netflow/mod.rs[130-171]
- rust/flow-collector/src/metrics.rs[189-196]

### What to change
Choose one:
- **Low overhead (recommended):** Update `metrics.source_count` inside `parse_datagram` while the parser mutex is already held (e.g., right after `iter_packets_from_source(...)`), so it’s accurate regardless of template store.
- Or spawn a lightweight 1Hz ticker unconditionally for `source_count`, and keep the heavier per-source template-store aggregation only when `template_store` is enabled.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


7. Metrics exposed via LoadBalancer 🐞 Bug ⛨ Security
Description
Helm defaults set flow-collector Service type to LoadBalancer and enable the TCP metrics port while
metrics_addr binds to 0.0.0.0, making an unauthenticated HTTP /metrics endpoint externally reachable
wherever the LoadBalancer is reachable. This leaks operational data and increases attack surface.
Code

helm/serviceradar/values.yaml[R839-883]

+    metrics_addr: "0.0.0.0:50046"
+
+    # Optional: shared template store. When set, every flow-collector pod
+    # writes through learned NetFlow/IPFIX templates to this NATS KV bucket
+    # and consults it on cache miss, so a packet routed to a fresh pod can
+    # decode templates other pods have already seen. Comment out the
+    # whole block to disable (each pod uses only its in-process LRU).
+    template_store:
+      kv_bucket: "flow_templates"
+      # Number of historical revisions per key. Templates rarely change so
+      # 1 is fine; larger values support audit / debug. NATS KV history is
+      # capped at 64.
+      kv_history: 1
+      # Optional TTL (seconds) for entries. 0 disables expiration; NATS
+      # will then keep templates indefinitely (or until withdrawn).
+      kv_ttl_secs: 0
+
   listeners:
     - protocol: "sflow"
       listen_addr: "0.0.0.0:6343"
       subject: "flows.raw.sflow"
     - protocol: "netflow"
       listen_addr: "0.0.0.0:2055"
       subject: "flows.raw.netflow"
+        max_templates: 2000
+        max_template_fields: 10000
   security:
     mode: "mtls"
     cert_dir: "/etc/serviceradar/certs"
     tls:
       cert_file: "flow-collector.pem"
       key_file: "flow-collector-key.pem"
       ca_file: "ca.crt"
-  data:
-    enabled: true
-    existingClaim: ""
-    size: 1Gi
-    storageClassName: ""
+
 service:
-    type: ClusterIP
+    # LoadBalancer + Local preserves the exporter's source IP through the
+    # LB → kube-proxy → pod path, which AutoScopedParser depends on for
+    # per-source template scoping.
+    type: LoadBalancer
   annotations: {}
   loadBalancerIP: ""
-    externalTrafficPolicy: Cluster
+    externalTrafficPolicy: Local
   sessionAffinity: None
   ports:
     netflow:
Evidence
values.yaml enables metrics_addr and the metrics service port by default, and the chart defaults the
Service to LoadBalancer; the metrics server serves plaintext without auth/TLS.

helm/serviceradar/values.yaml[836-902]
helm/serviceradar/templates/flow-collector.yaml[173-225]
rust/flow-collector/src/metrics.rs[194-253]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
With the new Prometheus server, the Helm defaults now expose `/metrics` over a public-ish LoadBalancer by default (service type LoadBalancer + metrics port enabled + bind 0.0.0.0).
### Issue Context
Even if many clusters use internal LBs, chart defaults should avoid unintentionally exposing unauthenticated HTTP endpoints.
### Fix Focus Areas
- helm/serviceradar/values.yaml[836-902]
- helm/serviceradar/templates/flow-collector.yaml[173-225]
### Suggested fix
Choose one (or combine):
- Default `service.ports.metrics.enabled: false`.
- Bind metrics to localhost by default (e.g., `metrics_addr: "127.0.0.1:50046"`) and require explicit override to `0.0.0.0`.
- Create a separate `ClusterIP` Service for metrics (and keep the UDP ingest Service as LoadBalancer).
- Document/optionally template NetworkPolicy guidance for restricting metrics access.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


8. Per-packet O(sources) scan 🐞 Bug ➹ Performance
Description
NetflowHandler::parse_datagram calls update_template_store_metrics() on every UDP datagram while
holding the parser Mutex, and the function iterates per-source info maps to sum metrics. This adds
work proportional to (packet_rate × source_count) and increases lock hold time in the parsing hot
path.
Code

rust/flow-collector/src/netflow/mod.rs[R145-159]

       let packets: Vec<_> = {
           let mut parser = self.parser.lock().unwrap();
-            match parser.iter_packets_from_source(peer, buf) {
+            let result = match parser.iter_packets_from_source(peer, buf) {
               Ok(iter) => iter.collect(),
               Err(e) => {
                   warn!("Failed to parse NetFlow packet from {}: {:?}", peer, e);
                   self.metrics.parse_errors.fetch_add(1, Ordering::Relaxed);
                   return vec![];
               }
-            }
+            };
+            // Refresh the listener-level aggregates while we still hold the
+            // parser lock — atomic stores are cheap and this keeps the
+            // Prometheus snapshot fresh on every datagram.
+            update_template_store_metrics(&parser, &self.metrics);
+            result
Evidence
Listener::run invokes parse_datagram for every recv_from() call, and parse_datagram holds the parser
Mutex while calling update_template_store_metrics(), which loops all sources via
parser.{ipfix,v9,legacy}_info().

rust/flow-collector/src/listener.rs[98-134]
rust/flow-collector/src/netflow/mod.rs[145-160]
rust/flow-collector/src/netflow/mod.rs[190-239]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
Template-store metrics aggregation currently scans all sources on every datagram while holding the parser lock, increasing CPU usage and lock time as `source_count` grows.
### Issue Context
Prometheus scrapes happen on a much lower cadence than packet ingestion. Aggregation does not need to run per packet to be useful.
### Fix Focus Areas
- rust/flow-collector/src/netflow/mod.rs[145-160]
- rust/flow-collector/src/netflow/mod.rs[190-239]
### Suggested fix
- Move aggregation to a periodic task (e.g., every 5-30s) that locks the parser, computes totals once, and updates atomics.
- Or, only refresh aggregates on demand (e.g., when serving `/metrics`), if you can safely access the parser from the metrics server.
- If keeping it inline, add sampling (e.g., refresh at most once per N milliseconds) to bound overhead.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Previous review results

Review updated until commit d8e67a9

Results up to commit 1f09dd8


🐞 Bugs (4) 📘 Rule violations (0)


Action required
1. Template store skips creds 🐞 Bug ≡ Correctness
Description
bootstrap_template_store() connects with async_nats::connect(nats_url) and ignores
Config.nats_creds_file and Config.security (mTLS), so enabling template_store will fail to connect
on authenticated/mTLS NATS clusters (or connect without the intended client identity). This is
inconsistent with Publisher::connect_once(), which explicitly applies credentials and TLS options.
Code

rust/flow-collector/src/main.rs[R185-188]

+    let client = async_nats::connect(nats_url)
+        .await
+        .with_context(|| format!("connecting to NATS at {} for template store", nats_url))?;
+    let js = jetstream::new(client);
Evidence
The flow-collector already requires NATS creds/mTLS for publishing, and Helm mounts/waits for the
creds file; however the template-store connection is created without those options, making it unable
to authenticate in the same environments.

rust/flow-collector/src/main.rs[181-189]
rust/flow-collector/src/publisher.rs[88-115]
rust/flow-collector/src/config.rs[6-47]
helm/serviceradar/values.yaml[829-872]
helm/serviceradar/templates/flow-collector.yaml[109-170]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
`bootstrap_template_store()` creates a NATS connection via `async_nats::connect(nats_url)` and does not apply the same credentials-file and mTLS configuration used by the publisher. In clusters where NATS requires creds and/or client certs, enabling `template_store` will prevent startup (connection/auth failure) or connect with the wrong security posture.

### Issue Context
`Publisher::connect_once()` already builds `ConnectOptions` from `Config.security` and `Config.nats_creds_file`. Helm also mounts and waits for `/etc/serviceradar/creds/platform.creds`, suggesting creds are required in real deployments.

### Fix Focus Areas
- rust/flow-collector/src/main.rs[181-207]
- rust/flow-collector/src/publisher.rs[88-115]
- rust/flow-collector/src/config.rs[6-47]

### Suggested fix
- Change `bootstrap_template_store()` to accept `&Config` (or the needed security fields), and build `async_nats::ConnectOptions` the same way as the publisher (root certs + client cert + creds file).
- Consider factoring the shared NATS connection setup into a helper (e.g., `fn nats_connect_options(cfg: &Config) -> Result<ConnectOptions>`), used by both publisher and template store.
- (Optional but consistent) add the same retry/backoff behavior as `connect_with_retry()` if template_store should tolerate NATS startup ordering.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Metrics server slowloris DoS 🐞 Bug ⛨ Security
Description
run_prometheus_server() spawns an unbounded task per accepted TCP connection and serve_one()
performs a read with no timeout, allowing clients to hold connections open indefinitely and exhaust
tasks/file descriptors. This can degrade or crash the flow-collector when metrics_addr is enabled.
Code

rust/flow-collector/src/metrics.rs[R206-236]

+    loop {
+        let (mut conn, peer) = match socket.accept().await {
+            Ok(v) => v,
+            Err(e) => {
+                warn!("metrics accept error: {}", e);
+                continue;
+            }
+        };
+        let listeners = listeners.clone();
+        tokio::spawn(async move {
+            if let Err(e) = serve_one(&mut conn, &listeners).await {
+                warn!("metrics conn from {} error: {}", peer, e);
+            }
+        });
+    }
+}
+
+async fn serve_one(
+    conn: &mut tokio::net::TcpStream,
+    listeners: &[Arc<ListenerMetrics>],
+) -> io::Result<()> {
+    // We don't parse the request rigorously — Prometheus and curl both
+    // send a small HTTP/1.1 GET. Read up to the end of the request line
+    // so we can branch on the path.
+    let mut buf = [0u8; 1024];
+    let n = conn.read(&mut buf).await?;
+    let request_line = std::str::from_utf8(&buf[..n])
+        .ok()
+        .and_then(|s| s.lines().next())
+        .unwrap_or("");
+    let path = request_line.split_whitespace().nth(1).unwrap_or("/");
Evidence
The server accepts connections in a loop, clones listener state, and spawns a task that awaits
conn.read() without any deadline; an attacker (or misbehaving scraper) can keep many connections
open and consume resources.

rust/flow-collector/src/metrics.rs[194-253]
helm/serviceradar/values.yaml[836-840]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
The Prometheus HTTP server has no read/write timeouts and spawns unbounded tasks per connection. A slowloris-style client can hold many connections open, consuming memory and file descriptors.

### Issue Context
The server is intentionally hand-rolled (no hyper/axum), so it must implement basic defensive measures itself.

### Fix Focus Areas
- rust/flow-collector/src/metrics.rs[194-253]

### Suggested fix
- Wrap `conn.read(...)` and `conn.write_all(...)` in `tokio::time::timeout` (e.g., 1-5s for request read, 5-10s for response write).
- Add a small concurrency limit (e.g., `Semaphore`) around per-connection handling to bound simultaneous in-flight connections.
- Optionally: read until `\r\n` for the request line (or `\r\n\r\n` for headers) with a strict max size, to avoid partial-read misclassification.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools



Remediation recommended
3. Metrics exposed via LoadBalancer 🐞 Bug ⛨ Security
Description
Helm defaults set flow-collector Service type to LoadBalancer and enable the TCP metrics port while
metrics_addr binds to 0.0.0.0, making an unauthenticated HTTP /metrics endpoint externally reachable
wherever the LoadBalancer is reachable. This leaks operational data and increases attack surface.
Code

helm/serviceradar/values.yaml[R839-883]

+    metrics_addr: "0.0.0.0:50046"
+
+    # Optional: shared template store. When set, every flow-collector pod
+    # writes through learned NetFlow/IPFIX templates to this NATS KV bucket
+    # and consults it on cache miss, so a packet routed to a fresh pod can
+    # decode templates other pods have already seen. Comment out the
+    # whole block to disable (each pod uses only its in-process LRU).
+    template_store:
+      kv_bucket: "flow_templates"
+      # Number of historical revisions per key. Templates rarely change so
+      # 1 is fine; larger values support audit / debug. NATS KV history is
+      # capped at 64.
+      kv_history: 1
+      # Optional TTL (seconds) for entries. 0 disables expiration; NATS
+      # will then keep templates indefinitely (or until withdrawn).
+      kv_ttl_secs: 0
+
    listeners:
      - protocol: "sflow"
        listen_addr: "0.0.0.0:6343"
        subject: "flows.raw.sflow"
      - protocol: "netflow"
        listen_addr: "0.0.0.0:2055"
        subject: "flows.raw.netflow"
+        max_templates: 2000
+        max_template_fields: 10000
    security:
      mode: "mtls"
      cert_dir: "/etc/serviceradar/certs"
      tls:
        cert_file: "flow-collector.pem"
        key_file: "flow-collector-key.pem"
        ca_file: "ca.crt"
-  data:
-    enabled: true
-    existingClaim: ""
-    size: 1Gi
-    storageClassName: ""
+
  service:
-    type: ClusterIP
+    # LoadBalancer + Local preserves the exporter's source IP through the
+    # LB → kube-proxy → pod path, which AutoScopedParser depends on for
+    # per-source template scoping.
+    type: LoadBalancer
    annotations: {}
    loadBalancerIP: ""
-    externalTrafficPolicy: Cluster
+    externalTrafficPolicy: Local
    sessionAffinity: None
    ports:
      netflow:
Evidence
values.yaml enables metrics_addr and the metrics service port by default, and the chart defaults the
Service to LoadBalancer; the metrics server serves plaintext without auth/TLS.

helm/serviceradar/values.yaml[836-902]
helm/serviceradar/templates/flow-collector.yaml[173-225]
rust/flow-collector/src/metrics.rs[194-253]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
With the new Prometheus server, the Helm defaults now expose `/metrics` over a public-ish LoadBalancer by default (service type LoadBalancer + metrics port enabled + bind 0.0.0.0).

### Issue Context
Even if many clusters use internal LBs, chart defaults should avoid unintentionally exposing unauthenticated HTTP endpoints.

### Fix Focus Areas
- helm/serviceradar/values.yaml[836-902]
- helm/serviceradar/templates/flow-collector.yaml[173-225]

### Suggested fix
Choose one (or combine):
- Default `service.ports.metrics.enabled: false`.
- Bind metrics to localhost by default (e.g., `metrics_addr: "127.0.0.1:50046"`) and require explicit override to `0.0.0.0`.
- Create a separate `ClusterIP` Service for metrics (and keep the UDP ingest Service as LoadBalancer).
- Document/optionally template NetworkPolicy guidance for restricting metrics access.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


4. Per-packet O(sources) scan 🐞 Bug ➹ Performance
Description
NetflowHandler::parse_datagram calls update_template_store_metrics() on every UDP datagram while
holding the parser Mutex, and the function iterates per-source info maps to sum metrics. This adds
work proportional to (packet_rate × source_count) and increases lock hold time in the parsing hot
path.
Code

rust/flow-collector/src/netflow/mod.rs[R145-159]

        let packets: Vec<_> = {
            let mut parser = self.parser.lock().unwrap();
-            match parser.iter_packets_from_source(peer, buf) {
+            let result = match parser.iter_packets_from_source(peer, buf) {
                Ok(iter) => iter.collect(),
                Err(e) => {
                    warn!("Failed to parse NetFlow packet from {}: {:?}", peer, e);
                    self.metrics.parse_errors.fetch_add(1, Ordering::Relaxed);
                    return vec![];
                }
-            }
+            };
+            // Refresh the listener-level aggregates while we still hold the
+            // parser lock — atomic stores are cheap and this keeps the
+            // Prometheus snapshot fresh on every datagram.
+            update_template_store_metrics(&parser, &self.metrics);
+            result
Evidence
Listener::run invokes parse_datagram for every recv_from() call, and parse_datagram holds the parser
Mutex while calling update_template_store_metrics(), which loops all sources via
parser.{ipfix,v9,legacy}_info().

rust/flow-collector/src/listener.rs[98-134]
rust/flow-collector/src/netflow/mod.rs[145-160]
rust/flow-collector/src/netflow/mod.rs[190-239]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

### Issue description
Template-store metrics aggregation currently scans all sources on every datagram while holding the parser lock, increasing CPU usage and lock time as `source_count` grows.

### Issue Context
Prometheus scrapes happen on a much lower cadence than packet ingestion. Aggregation does not need to run per packet to be useful.

### Fix Focus Areas
- rust/flow-collector/src/netflow/mod.rs[145-160]
- rust/flow-collector/src/netflow/mod.rs[190-239]

### Suggested fix
- Move aggregation to a periodic task (e.g., every 5-30s) that locks the parser, computes totals once, and updates atomics.
- Or, only refresh aggregates on demand (e.g., when serving `/metrics`), if you can safely access the parser from the metrics server.
- If keeping it inline, add sampling (e.g., refresh at most once per N milliseconds) to bound overhead.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Qodo Logo

Comment thread rust/flow-collector/src/main.rs Outdated
Comment thread rust/flow-collector/src/metrics.rs Outdated
@mikemiles-dev mikemiles-dev marked this pull request as draft May 3, 2026 21:33
…emonSet

Wire the netflow_parser 1.0.3 TemplateStore extension point into the
flow collector and switch the K8s deployment from a single Deployment
to a per-node DaemonSet behind `Service: LoadBalancer` with
`externalTrafficPolicy: Local`. Together these let the collector scale
horizontally with node count while preserving exporter source IP and
sharing learned NetFlow / IPFIX templates across pods via NATS
JetStream KV.

Why
---
A single flow-collector Deployment was a singleton bottleneck for UDP
ingest. Scaling required either source-IP-affinity routing (so every
exporter always lands on the same pod and templates stay local) or a
shared template store so any pod could decode any flow. We picked the
shared-store path because it doesn't depend on the LB layer.

The original plan was to front this with Envoy Gateway UDPRoute, but a
research spike found UDPRoute does not preserve client source IP today
— upstreams see the gateway pod's IP, which would collapse
AutoScopedParser's per-(source_ip,source_id) scoping. Service +
DaemonSet + `externalTrafficPolicy: Local` is the workable alternative
on MetalLB (L2 mode preserves source IP through to the pod). Gateway
API stays available for the rest of the stack (HTTP/gRPC/mTLS).

Code
----
* `Cargo.toml`: bump netflow_parser 1.0.0 -> 1.0.3 (TemplateStore API);
  add `bytes` for the NATS KV payload type.

* `src/nats_client.rs` (new): Shared `connect_once` and
  `connect_with_retry` helpers. Both publisher and template-store
  bootstrap go through these, applying TLS root CA + client cert +
  creds file from `Config.security` and `Config.nats_creds_file`.
  Eliminates the easy-to-miss bug where one path used bare
  `async_nats::connect(url)` and silently broke mTLS deployments.
  Bounded backoff (60 attempts, 0.5s -> 30s) so neither path
  crash-loops the pod when NATS is slow to come up.

* `src/template_store.rs` (new): `NatsKvTemplateStore` implements the
  parser's sync `TemplateStore` trait. async_nats is async-only so we
  bridge via `tokio::task::block_in_place` + `Handle::block_on`. NATS
  KV keys are restricted to `[A-Za-z0-9._=/-]`; the scope produced by
  AutoScopedParser (`v9:1.2.3.4:2055/0`, IPv6 with brackets, etc.)
  is sanitized via underscore replacement. `kind_tag` wildcard arm
  warns once via `std::sync::Once` if a future netflow_parser version
  ships a new `TemplateKind` variant.

* `src/netflow/mod.rs`: `NetflowHandler::new` accepts an optional
  `Arc<dyn TemplateStore>` and threads it through
  `NetflowParserBuilder::with_template_store`. New `TemplateEvent::
  Restored` arm in the event-callback (logs at info). When a store
  is configured, spawns a 1Hz background ticker that aggregates
  per-source `CacheMetrics` into the listener-level Prometheus
  counters via a `DeltaState` that tracks per-source last-known
  values plus a `retired` total — so listener counters never decrease
  when sources are LRU-evicted. Off the parse_datagram hot path.

* `src/listener.rs`: `build_handler` takes
  `Option<Arc<dyn TemplateStore>>`. sFlow ignores it (template-less).

* `src/config.rs`: new optional `TemplateStoreConfig` with
  `kv_bucket`, `kv_history` (u8, validated `1..=64`), `kv_ttl_secs`,
  and optional `nats_url` override for split-fault-domain setups.

* `src/main.rs`: `bootstrap_template_store(&Config, &TemplateStoreConfig)`
  uses `nats_client::connect_with_retry` (so TLS/creds/retry match
  the publisher), then `create_or_update_key_value` for genuinely
  idempotent KV bucket bootstrap across config drift. Independent
  NATS connection from the publisher's so KV failures cannot stall
  publishing and vice versa.

* `src/metrics.rs`: `ListenerMetrics` gains `template_store_restored`,
  `template_store_codec_errors`, `template_store_backend_errors`,
  `source_count` atomics. New `render_prometheus()` produces text
  exposition format with HELP/TYPE headers and properly escaped
  label values (`\` -> `\\`, `"` -> `\"`, `\n` -> `\n`). sFlow
  listeners are filtered out of `template_store_*` and
  `flow_collector_sources` rows since those are structurally zero
  for sFlow.

  New `run_prometheus_server()` is a hand-rolled HTTP/1.1 server on
  tokio's TcpListener — `GET /metrics` returns the exposition;
  anything else 404s. No new deps. Defensive measures against
  slowloris-style attacks: per-line read timeout (3s), response
  write timeout (5s), 8 KiB max-request-bytes cap via
  `AsyncReadExt::take`, and a `Semaphore`-bounded concurrency cap
  (64) that fast-fails excess connections rather than queueing them
  so an attacker cannot exhaust file descriptors.

  Spawned from main.rs when `metrics_addr` is set; runs
  independently of the log reporter and the publisher so a scrape
  failure can never stall ingestion.

K8s manifests
-------------
* `k8s/demo/base/serviceradar-flow-collector.yaml`: `Deployment` ->
  `DaemonSet`, replicas dropped, unused PVC removed (the Rust binary
  writes nothing to disk). ConfigMap JSON gains the `template_store`
  block. Service was already `externalTrafficPolicy: Local` — kept.
  Added `RollingUpdate` with `maxUnavailable: 1`. HTTP probes on the
  `/metrics` endpoint replace the old `pgrep` exec probes (catches
  deadlocked-but-running scenarios).

* `helm/serviceradar/templates/flow-collector.yaml`: same kind flip,
  PVC removal, plus `nodeSelector` / `tolerations` knobs.
  `templates/NOTES.txt` warns operators about the `replicaCount` and
  `data.*` value removals and prints the explicit
  `kubectl delete deployment / pvc` commands needed to clean up
  orphaned objects from a pre-DaemonSet install.

* `helm/serviceradar/values.yaml`: drop `replicaCount` and `data.*`,
  add `flowCollector.config.template_store` block, default
  `service.type: LoadBalancer` and
  `service.externalTrafficPolicy: Local`, add the IPFIX (4739) and
  metrics (50046) ports.

Tests
-----
55 tests pass; `clippy --all-targets -D warnings` clean. New tests:

* 5 `NatsKvTemplateStore` key-render tests (IPv4, IPv6, empty scope,
  distinct scopes, distinct kinds).
* 2 `TemplateStoreConfig` deserialization tests.
* 2 `DeltaState` monotonic-counter tests covering source eviction and
  re-emergence and multiple-sources / multiple-kinds aggregation.
* `escape_label_handles_quotes_backslashes_newlines`,
  `template_store_metrics_only_emit_for_netflow_listeners`.
* `http_server_serves_metrics_and_404` — end-to-end including a
  deliberate split-read of the request line that the old
  single-`read()` code would have failed.
* `http_server_drops_idle_connection_within_read_timeout` — slow
  client closed within 2x READ_TIMEOUT (was: would hang forever).
* `http_server_concurrency_cap_fast_fails_excess_connections` —
  saturate with 64 idle conns, verify a fresh request doesn't hang
  past 2s (was: would queue and starve real scrapers).

Docs
----
`docs/docs/flow-collector-scaling.md`: deployment model with diagram
and rationale for `externalTrafficPolicy: Local`; configuration knobs
table; small / medium / large / very-large sizing rules of thumb;
Prometheus metrics with healthy-vs-trouble interpretations and
alerting suggestions; signals you've outgrown the design plus the
sharding strategy when you do; "Upgrading from a Pre-DaemonSet
Install" section with explicit `kubectl delete deployment / pvc`
commands; "Rolling Upgrade Behavior" section explaining how
`externalTrafficPolicy: Local` interacts with DaemonSet rolling
updates and why `maxUnavailable: 1` is the floor; deploy verification
checklist (LB IP, source-IP smoke test, cross-pod template share
check, Prometheus scrape).

Deliberately avoids speculative throughput numbers — flagged as
"benchmark on your hardware" — until real load tests fill them in.
@mikemiles-dev mikemiles-dev force-pushed the feat/flow-collector-template-store branch from 5eac517 to d8e67a9 Compare May 3, 2026 22:03
@mikemiles-dev mikemiles-dev marked this pull request as ready for review May 3, 2026 22:04
@qodo-code-review

qodo-code-review Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

Persistent review updated to latest commit d8e67a9

Comment thread docs/docs/flow-collector-scaling.md Outdated
Comment thread helm/serviceradar/values.yaml
Comment thread helm/serviceradar/templates/flow-collector.yaml
…aling doc

- Add a default netflow listener on 0.0.0.0:4739 in Helm values and the
  demo manifest so the advertised IPFIX Service port has a real UDP
  socket behind it instead of silently dropping datagrams.
- Gate the flow-collector liveness/readiness probes on
  service.ports.metrics.enabled AND config.metrics_addr; fall back to
  a pgrep exec probe when metrics are off so disabling /metrics no
  longer crashloops the pod. Document the implication near the demo
  values' metrics.enabled toggle.
- Replace non-ASCII characters (em/en dashes, arrows, box drawing,
  less-than-or-equal) in docs/docs/flow-collector-scaling.md with
  ASCII equivalents to satisfy the docs ASCII-only constraint.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant