feat(snmp-discovery): serialize diode ingest via queued client#419
Conversation
Wrap diode.Client with a buffered channel and single consumer goroutine to prevent concurrent OAuth re-auth storms from parallel crawl jobs. Co-authored-by: Cursor <cursoragent@cursor.com>
Add --ingest-buffer-size (default 256), wrap the diode client at startup, and close the queued client on shutdown. Co-authored-by: Cursor <cursoragent@cursor.com>
Read ingest_buffer_size from snmp_discovery backend YAML (default 256) and forward it as --ingest-buffer-size when starting the backend. Co-authored-by: Cursor <cursoragent@cursor.com>
Cover serial ingest execution, context cancellation, Close behavior, and agent YAML forwarding of ingest_buffer_size to the subprocess. Co-authored-by: Cursor <cursoragent@cursor.com>
Describe the serial ingest queue, default size, and tuning guidance in the backend README and config samples. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review |
|
Go test coverage
|
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dfb8a64b6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Signal shutdown via shutdownCh only; the consumer drains pending requests and exits without closing the requests channel, eliminating the race where Close could close the channel while a producer was still sending. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
Pull request overview
This PR introduces an ingest-serialization layer for SNMP discovery by wrapping the Diode client with a buffered, single-consumer queue to prevent concurrent OAuth re-auth storms when many crawls finish at once. It also wires the queue capacity through CLI/config and updates tests and docs accordingly.
Changes:
- Add
ingest.QueuedClientto serializediode.Clientingest calls via a buffered channel + consumer goroutine. - Expose queue capacity via
--ingest-buffer-size(snmp-discovery) andorb.backends.snmp_discovery.ingest_buffer_size(agent backend). - Add unit tests for queue behavior and backend CLI forwarding; update SNMP discovery backend documentation and config samples.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| orb-discovery/snmp-discovery/ingest/queued_client.go | Implements QueuedClient to serialize ingest and manage shutdown semantics. |
| orb-discovery/snmp-discovery/ingest/queued_client_test.go | Adds tests covering serialization, cancellation while queued, and Close behavior. |
| orb-discovery/snmp-discovery/cmd/main.go | Adds --ingest-buffer-size flag and wraps/closes the ingest client on shutdown. |
| agent/backend/snmpdiscovery/snmp_discovery.go | Reads ingest_buffer_size and forwards --ingest-buffer-size to the subprocess. |
| agent/backend/snmpdiscovery/snmp_discovery_test.go | Verifies --ingest-buffer-size CLI arg forwarding when configured. |
| docs/config_samples.md | Documents orb.backends.snmp_discovery.ingest_buffer_size in sample config. |
| docs/backends/snmp_discovery/README.md | Documents backend-level ingest queue configuration and behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Vulnerability Scan: PassedImage: No vulnerabilities found. Commit: 1733852 |
Reject zero, negative, non-integer, and unsupported types at configure time instead of failing opaquely when the subprocess starts. Co-authored-by: Cursor <cursoragent@cursor.com>
Warn when a completed or drained ingest result cannot be sent because the caller is no longer waiting on the result channel. Co-authored-by: Cursor <cursoragent@cursor.com>
Verify queued ingests still in the buffer receive ErrIngestQueueClosed when the client shuts down while another ingest is in flight. Co-authored-by: Cursor <cursoragent@cursor.com>
Align agent, subprocess flag default, and docs with the recommended value for large subnet bursts. Co-authored-by: Cursor <cursoragent@cursor.com>
After Close signals shutdown, the consumer drains buffered requests without executing them. In-flight ingest still completes before exit. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Addressed all review feedback in the latest push:
Also includes default buffer size bump to 512 ( |
|
@codex review |
|
@copilot review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 15869d845e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Serialize enqueue with Close via an RWMutex gate so a closed shutdownCh cannot race with a buffered send that would orphan the request. Also watch the consumer done channel while waiting for ingest results. Addresses Codex review on PR #419. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6593155644
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Drop the enqueue gate lock so Close can signal shutdown while producers wait for buffer space. Recheck shutdown before executing dequeued work so buffered requests are failed instead of run after Close starts. Addresses Codex review on PR #419. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Addressed latest Codex feedback:
Pushed on latest commit; ingest tests pass. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 0a74ab2b88
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
leoparente
left a comment
There was a problem hiding this comment.
Overall LGTM. Just get codex reply with no new findings. And recommend a e2e testing with orb-test-lab.
|
Nice work — the serialization approach is clean and the shutdown semantics are well thought out. Two suggestions:
|
davidlanouette
left a comment
There was a problem hiding this comment.
I'll finish looking tomorrow - with fresh eyes.
| defer close(c.done) | ||
|
|
||
| for { | ||
| if c.shutdownSignaled() { |
There was a problem hiding this comment.
This looks like it duplicates the select/case <-c.shutdownCh logic. Is this duplication serving a purpose?
There was a problem hiding this comment.
Yes — the non-blocking shutdownSignaled() check and the blocking select on shutdownCh serve different moments in the loop. The helper catches shutdown that happens during execute() or between iterations; the select wakes the consumer when it is blocked waiting for work. Added comments in the latest push to spell this out.
There was a problem hiding this comment.
Won't line 90 catch all shutdown events from execute too?
If you always want to handle shutdown events before you check the request, wouldn't a cleaner solution be to add a "do nothing" default to the select on L80? That would remove the duplication and accomplish the same thing. (unless it's too early for me.)
Although I think you just need the single select that you have (and remove the non-async check, something like this would remove duplication and ensure that shutdown is checked before c.requests each loop:
func (c *QueuedClient) consumer() {
defer close(c.done)
for {
// Non-blocking check: Close may have run during execute() on the prior
// iteration. The select below handles shutdown while blocked for work.
if c.shutdownSignaled() {
c.drainPendingFailures()
return
}
select {
case req := <-c.requests:
c.execute(req)
default:
// continue the for loop if c.requests isn't ready
}
}
}Although, I think this is significantly simpler and just as correct.
func (c *QueuedClient) consumer() {
defer close(c.done)
for {
select {
case req := <-c.requests:
c.execute(req)
case <-c.done:
c.drainPendingFailures()
return
}
}
}Wire policy manager to a cancellable root context and call cancel before QueuedClient.Close so in-flight Diode ingests can abort on SIGTERM. Document shutdown channel roles and why shutdownSignaled coexists with blocking shutdownCh selects. Co-authored-by: Cursor <cursoragent@cursor.com>
|
@codex review |
Apply a 30s deadline when callers have no context deadline, derived from Grafana production diode-ingester Ingest latency (p99 ~938ms, max bucket 10s). Prevents a hung Diode server from blocking the ingest consumer. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ad91b4f168
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Cancel rootCtx before server.Stop() so in-flight Diode ingests abort promptly during SIGTERM instead of waiting on scheduler stop timeouts. Use a fresh bounded context for metrics.Shutdown so OTLP can flush after the application context is canceled. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Addressed the two remaining Codex threads in the latest push:
Also consolidated both shutdown paths into a shared Re: @jimjkelly's suggestions — the per-call timeout safety net is already in |
Add //go:build integration tests that run snmp-discovery in dry-run against four lab SNMP simulators, assert multi-target ingest output, and verify SIGTERM shutdown. Expose make test-integration (Docker network) and test-integration-local targets. Co-authored-by: Cursor <cursoragent@cursor.com>
| closeDone := make(chan struct{}) | ||
| go func() { | ||
| require.NoError(t, client.Close()) | ||
| close(closeDone) | ||
| }() |
There was a problem hiding this comment.
What's going on here? The "infrastructure around client.Close() isn't obvious.
It looks like you are trying to async close the client. But, then line 272 is waiting for the go routine to finish.
Is there a reason you don't just close the client and continue - without the closeDone chan and the go routine?
| case <-c.done: | ||
| select { | ||
| case res := <-req.result: | ||
| return res.resp, res.err | ||
| default: | ||
| return nil, ErrIngestQueueClosed | ||
| } |
There was a problem hiding this comment.
Won't this almost always hit the default condition. If there was a value in req.result, it would be caught on line 150 - before this case was selected. Therefore, line 156 won't be run.
Note: if both req.result and c.done are ready at the exact same time, then the runtime will select one at random. So, technically, line 156 could pass. But, the chances of that are pretty low (like thousands to one). Is it worth the complexity?
| defer close(c.done) | ||
|
|
||
| for { | ||
| if c.shutdownSignaled() { |
There was a problem hiding this comment.
Won't line 90 catch all shutdown events from execute too?
If you always want to handle shutdown events before you check the request, wouldn't a cleaner solution be to add a "do nothing" default to the select on L80? That would remove the duplication and accomplish the same thing. (unless it's too early for me.)
Although I think you just need the single select that you have (and remove the non-async check, something like this would remove duplication and ensure that shutdown is checked before c.requests each loop:
func (c *QueuedClient) consumer() {
defer close(c.done)
for {
// Non-blocking check: Close may have run during execute() on the prior
// iteration. The select below handles shutdown while blocked for work.
if c.shutdownSignaled() {
c.drainPendingFailures()
return
}
select {
case req := <-c.requests:
c.execute(req)
default:
// continue the for loop if c.requests isn't ready
}
}
}Although, I think this is significantly simpler and just as correct.
func (c *QueuedClient) consumer() {
defer close(c.done)
for {
select {
case req := <-c.requests:
c.execute(req)
case <-c.done:
c.drainPendingFailures()
return
}
}
}| case err := <-done: | ||
| if err != nil { | ||
| // SIGTERM is expected; only fail on unexpected wait errors. | ||
| if _, ok := err.(*exec.ExitError); !ok { |
There was a problem hiding this comment.
NIT: use if errors.As(err, &exec.ExitError) { instead of casting.
|
🎉 This PR is included in version snmp-discovery-v1.34.0 🎉 The release is available on GitHub release Your semantic-release bot 📦🚀 |
Summary
QueuedClientwrapper arounddiode.Clientthat serializes ingest through a buffered channel with a single consumer goroutine, preventing concurrent OAuth re-auth storms when many SNMP crawl jobs complete at once.--ingest-buffer-size(default 512) and expose it from the agent backend asorb.backends.snmp_discovery.ingest_buffer_size.What changed
orb-discovery/snmp-discovery/ingest/QueuedClientimplementingdiode.Clientwith enqueue backpressure, context cancellation, and idempotent shutdown.cmd/main.gowraps the real diode client at startup and closes it on shutdown.snmp_discoverybackend readsingest_buffer_sizefrom YAML (validated at configure time) and passes--ingest-buffer-sizeto the subprocess.docs/backends/snmp_discovery/README.mdanddocs/config_samples.md.How tested
cd orb-discovery/snmp-discovery && go test ./ingest/... ./policy/...go test ./agent/backend/snmpdiscovery/...make fix-lintFollow-ups (out of scope)
diode-sdk-go: singleflight onauthenticate(), backoff on 502/429.