86 changes: 57 additions & 29 deletions skills/nemo-relay-build-plugin/evals/evals.json
Original file line number Diff line number Diff line change
@@ -1,29 +1,57 @@
{
"skill": "nemo-relay-build-plugin",
"cases": [
{
"id": "build-subscriber-plugin",
"question": "Package my NeMo Relay subscriber setup as a reusable plugin that can be enabled from config and rolled back safely.",
"expected_skill": "nemo-relay-build-plugin",
"expected_script": null,
"ground_truth": "Use the plugin skill to define a stable kind, JSON-compatible config, deterministic validation, PluginContext-based registration, rollback behavior, and tests.",
"expected_behavior": [
"Decide that reusable config-activated behavior needs a plugin",
"Choose a stable plugin kind and JSON-compatible config shape",
"Validate config before registering runtime behavior",
"Register through PluginContext and cover rollback on activation failure"
]
},
{
"id": "neg-direct-tool-wrapper",
"question": "I only need to wrap one existing Python tool call with NeMo Relay events.",
"expected_skill": "nemo-relay-instrument-calls",
"expected_script": null,
"ground_truth": "A one-off direct tool wrapper belongs to nemo-relay-instrument-calls, not plugin packaging.",
"expected_behavior": [
"nemo-relay-build-plugin stays silent",
"nemo-relay-instrument-calls handles the direct wrapping task"
]
}
]
}
[
{
"id": "nemo-relay-build-plugin-001",
"question": "I want to use the nemo-relay-build-plugin skill to create a plugin that registers a sanitization guardrail. The plugin kind should be 'pii-sanitizer' and it needs config fields for 'patterns' (array of regex strings) and 'action' (either 'redact' or 'mask'). Can you walk me through building this?",
"expected_skill": "nemo-relay-build-plugin",
"expected_script": null,
"ground_truth": "The agent used nemo-relay-build-plugin to guide the user through creating a 'pii-sanitizer' plugin with a JSON-compatible config shape containing 'patterns' and 'action' fields, deterministic validation logic, registration through PluginContext, and rollback-safe behavior.",
"expected_behavior": [
"The agent read the nemo-relay-build-plugin SKILL.md before providing guidance",
"The agent defined a JSON-compatible config shape with 'patterns' (array of strings) and 'action' (enum of 'redact' or 'mask') fields",
"The agent provided validation logic that checks for missing fields, invalid regex patterns, and unsupported 'action' values without side effects",
"The agent showed registration of the guardrail through PluginContext with rollback handling for partial setup failures",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-build-plugin-002",
"question": "I have a set of runtime guardrails and subscribers that multiple teams keep copy-pasting into their NeMo Relay application startup code. I want to package this as a reusable component that can be activated through shared configuration, validated before deployment, and safely rolled back if registration fails. How should I structure this?",
"expected_skill": "nemo-relay-build-plugin",
"expected_script": null,
"ground_truth": "The agent identified this as a plugin packaging use case and guided the user through the nemo-relay-build-plugin workflow: choosing a stable plugin kind, defining minimal JSON-compatible config, implementing side-effect-free validation with structured diagnostics, and registering behavior through PluginContext with rollback safety.",
"expected_behavior": [
"The agent recognized the need for a reusable config-activated plugin rather than scope-local middleware or direct instrumentation",
"The agent outlined the plugin document structure including version, components array with kind/enabled/config, and policy settings",
"The agent described validation requirements as deterministic and side-effect free, returning structured diagnostics before any runtime changes",
"The agent explained how PluginContext handles registration and rollback of partial setup on failure",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-build-plugin-003",
"question": "Our platform team is rolling out a new compliance requirement: all NeMo Relay services must apply a content-filtering policy that can be toggled per environment (dev/staging/prod) through config. We need operators to get clear error messages if they misconfigure it, and we need the ability to disable it in dev without breaking validation. The filter should intercept requests before they reach the LLM. What's the best approach?",
"expected_skill": "nemo-relay-build-plugin",
"expected_script": null,
"ground_truth": "The agent applied the nemo-relay-build-plugin skill to design a content-filtering plugin with a request intercept surface, environment-aware config with an 'enabled' toggle, validation that runs even when disabled to catch config errors before rollout, and clear diagnostic messages for operators.",
"expected_behavior": [
"The agent selected request intercept as the runtime surface for pre-LLM content filtering",
"The agent designed config with an 'enabled' field and environment-specific settings while keeping the shape JSON-compatible",
"The agent specified that validation runs even for disabled components so operators discover config problems before production rollout",
"The agent provided examples of structured diagnostics for missing fields, unsupported environment values, and invalid field combinations",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-build-plugin-004",
"question": "I need to add some temporary logging for a specific tenant's requests in NeMo Relay. It should only apply to their session and I'll remove it after debugging. What's the best way to do this?",
"expected_skill": null,
"expected_script": null,
"ground_truth": "The agent correctly identified that this is a scope-local middleware use case rather than a plugin, since the behavior is temporary, tenant-specific, and not reusable across applications or teams. The agent directed the user toward scope-local middleware or direct instrumentation instead of nemo-relay-build-plugin.",
"expected_behavior": [
"The agent did NOT invoke the nemo-relay-build-plugin skill for this temporary, tenant-scoped task",
"The agent explained that scope-local middleware is more appropriate for temporary per-tenant behavior",
"The agent suggested an alternative approach such as scope-local middleware or nemo-relay-instrument-calls for this use case",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
}
]
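Case `nemo-relay-build-plugin-001` expects validation that catches missing fields, invalid regex patterns, and unsupported `action` values without side effects. A minimal sketch of what such a check could look like is below. The `Diagnostic` type and the `validate_pii_sanitizer_config` function are hypothetical names chosen for illustration; they are not part of the NeMo Relay API, and the real plugin validation hook may have a different shape.

```python
import re
from dataclasses import dataclass
from typing import Any

# Allowed values for 'action', taken from the eval question text.
ALLOWED_ACTIONS = ("redact", "mask")


@dataclass(frozen=True)
class Diagnostic:
    """One structured validation finding (hypothetical shape, not a NeMo Relay type)."""
    field: str
    message: str


def validate_pii_sanitizer_config(config: Any) -> list[Diagnostic]:
    """Return diagnostics for a 'pii-sanitizer' config; an empty list means valid.

    The function only inspects its argument and compiles regexes, so the same
    input always yields the same diagnostics and nothing is registered.
    """
    if not isinstance(config, dict):
        return [Diagnostic("config", "must be a JSON object")]

    diagnostics: list[Diagnostic] = []

    patterns = config.get("patterns")
    if patterns is None:
        diagnostics.append(Diagnostic("patterns", "missing required field"))
    elif not isinstance(patterns, list):
        diagnostics.append(Diagnostic("patterns", "must be an array of regex strings"))
    else:
        for index, pattern in enumerate(patterns):
            where = f"patterns[{index}]"
            if not isinstance(pattern, str):
                diagnostics.append(Diagnostic(where, "must be a string"))
                continue
            try:
                re.compile(pattern)
            except re.error as exc:
                diagnostics.append(Diagnostic(where, f"invalid regex: {exc}"))

    action = config.get("action")
    if action is None:
        diagnostics.append(Diagnostic("action", "missing required field"))
    elif action not in ALLOWED_ACTIONS:
        diagnostics.append(Diagnostic("action", "must be 'redact' or 'mask'"))

    return diagnostics
```

Returning a list of findings rather than raising on the first error lets an operator see every problem in one pass, which matches the "structured diagnostics" wording in case 002.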
86 changes: 57 additions & 29 deletions skills/nemo-relay-debug-runtime-integration/evals/evals.json
@@ -1,29 +1,57 @@
{
"skill": "nemo-relay-debug-runtime-integration",
"cases": [
{
"id": "debug-missing-events",
"question": "NeMo Relay is installed, but my wrapped tool calls are not emitting events. Help me debug the integration.",
"expected_skill": "nemo-relay-debug-runtime-integration",
"expected_script": null,
"ground_truth": "Use the debug skill to check binding load, active scope, scope stack propagation, subscriber registration, middleware wiring, and event flush behavior.",
"expected_behavior": [
"Check whether the binding or native artifact loads",
"Verify an active scope exists when the tool call runs",
"Inspect subscriber and middleware registration",
"Recommend a minimal scoped reproduction before broad code changes"
]
},
{
"id": "neg-first-example",
"question": "Show me my first NeMo Relay Python example with a scope and one managed tool call.",
"expected_skill": "nemo-relay-start",
"expected_script": null,
"ground_truth": "First examples belong to nemo-relay-start unless the user reports a failure.",
"expected_behavior": [
"nemo-relay-debug-runtime-integration stays silent",
"nemo-relay-start handles first-time setup"
]
}
]
}
[
{
"id": "nemo-relay-debug-runtime-integration-001",
"question": "I need help with nemo-relay-debug-runtime-integration. My Python app fails to import the NeMo Relay native extension with a 'ModuleNotFoundError' even though I installed the package. How do I fix this?",
"expected_skill": "nemo-relay-debug-runtime-integration",
"expected_script": null,
"ground_truth": "The agent used nemo-relay-debug-runtime-integration to diagnose the Python native extension import failure, recommending rebuilding the virtual environment and native extension with `uv sync` and verifying the import from the same environment as the application.",
"expected_behavior": [
"The agent read the nemo-relay-debug-runtime-integration SKILL.md before providing guidance",
"The agent identified this as a Python import failure and referenced the troubleshooting matrix entry for rebuilding with `uv sync`",
"The agent recommended running a small Python test or import check from the same environment as the application",
"The agent suggested verifying the native extension was built correctly for the current platform",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-debug-runtime-integration-002",
"question": "I'm instrumenting my Node.js agent with NeMo Relay but no lifecycle events appear even though my business logic callbacks execute successfully. The managed execute helpers aren't being used — we just call the underlying functions directly. What's going wrong?",
"expected_skill": "nemo-relay-debug-runtime-integration",
"expected_script": null,
"ground_truth": "The agent diagnosed the missing lifecycle events as caused by calling business callbacks directly without using managed execute helpers or balanced manual start/end APIs, and guided the user to adopt the correct API layer.",
"expected_behavior": [
"The agent identified the issue as matching the 'Callback succeeded but no lifecycle events appear' failure class from the troubleshooting matrix",
"The agent explained that the integration must use managed execute helpers or balanced manual start/end APIs rather than only calling the underlying business callback",
"The agent recommended switching to the managed execute API or adding explicit lifecycle start/end calls around the business logic",
"The agent referenced the choice between managed execute vs manual lifecycle vs typed wrappers as a key decision point",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-debug-runtime-integration-003",
"question": "We have a Go microservice that handles concurrent requests. Each request creates a NeMo Relay scope, but we're seeing events from one request leaking into another request's trace. The scope stacks seem to be shared across goroutines. How do we isolate them?",
"expected_skill": "nemo-relay-debug-runtime-integration",
"expected_script": null,
"ground_truth": "The agent diagnosed the cross-request event leakage as a scope stack sharing problem across goroutines and recommended creating a fresh scope stack per independent request or agent to achieve proper isolation.",
"expected_behavior": [
"The agent identified the problem as 'Work leaks across requests' where separate requests share one scope stack",
"The agent explained that goroutine boundaries can cause the wrong scope stack to be active without explicit isolation",
"The agent recommended creating a fresh scope stack per independent request or agent",
"The agent referenced the nemo-relay-use-context-isolation skill as a related resource for further guidance",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-debug-runtime-integration-004",
"question": "How do I configure Kubernetes horizontal pod autoscaling based on custom Prometheus metrics for my Flask application?",
"expected_skill": null,
"expected_script": null,
"ground_truth": "The agent recognized this as a Kubernetes/infrastructure scaling question unrelated to NeMo Relay runtime integration debugging and provided general Kubernetes HPA guidance without invoking the nemo-relay-debug-runtime-integration skill.",
"expected_behavior": [
"The agent did not invoke or reference the nemo-relay-debug-runtime-integration skill",
"The agent addressed the Kubernetes HPA and Prometheus metrics question on its own merits",
"The agent provided guidance about custom metrics adapters or HPA configuration without conflating it with NeMo Relay concerns",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
}
]
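Each file in this change moves from a `{"skill": ..., "cases": [...]}` object to a bare top-level array of cases. A small structural check along these lines could guard the new shape. This is a sketch only: the required key set is inferred from the cases in this diff, and `check_eval_cases` and `check_eval_text` are hypothetical helpers, not an existing tool in the repository.

```python
import json
from typing import Any

# Keys that every case in the new-format files in this diff carries.
REQUIRED_KEYS = (
    "id",
    "question",
    "expected_skill",
    "expected_script",
    "ground_truth",
    "expected_behavior",
)


def check_eval_cases(cases: Any) -> list[str]:
    """Return a list of human-readable problems; an empty list means the shape is fine."""
    if not isinstance(cases, list):
        return ["top level must be a JSON array of cases"]

    problems: list[str] = []
    seen_ids: set[str] = set()
    for index, case in enumerate(cases):
        if not isinstance(case, dict):
            problems.append(f"case {index}: must be an object")
            continue
        label = str(case.get("id", f"#{index}"))
        for key in REQUIRED_KEYS:
            if key not in case:
                problems.append(f"{label}: missing key '{key}'")
        if label in seen_ids:
            problems.append(f"{label}: duplicate id")
        seen_ids.add(label)
        # Negative cases use null here, so None is accepted alongside strings.
        skill = case.get("expected_skill")
        if skill is not None and not isinstance(skill, str):
            problems.append(f"{label}: 'expected_skill' must be a string or null")
        if "expected_behavior" in case:
            behavior = case["expected_behavior"]
            if (
                not isinstance(behavior, list)
                or not behavior
                or not all(isinstance(item, str) for item in behavior)
            ):
                problems.append(f"{label}: 'expected_behavior' must be a non-empty array of strings")
    return problems


def check_eval_text(text: str) -> list[str]:
    """Parse the contents of an evals.json file and check its cases."""
    return check_eval_cases(json.loads(text))
```

Run against the old object-shaped files, the check reports a single top-level problem, which makes a partially migrated file easy to spot.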
86 changes: 57 additions & 29 deletions skills/nemo-relay-export-atif-trajectories/evals/evals.json
@@ -1,29 +1,57 @@
{
"skill": "nemo-relay-export-atif-trajectories",
"cases": [
{
"id": "export-atif-for-replay",
"question": "Export my NeMo Relay run as ATIF so another tool can replay and analyze the trajectory.",
"expected_skill": "nemo-relay-export-atif-trajectories",
"expected_script": null,
"ground_truth": "Use the ATIF export skill to create an AtifExporter with session and agent metadata, register it before scoped work, run instrumented calls, deregister, flush, and validate the output.",
"expected_behavior": [
"Choose ATIF rather than live OTLP tracing",
"Create and register an AtifExporter before work runs",
"Run scoped tool or LLM activity",
"Deregister or flush and verify the trajectory output"
]
},
{
"id": "neg-otel-backend",
"question": "Send NeMo Relay traces to my OTLP-compatible tracing backend.",
"expected_skill": "nemo-relay-export-otel",
"expected_script": null,
"ground_truth": "OTLP tracing belongs to nemo-relay-export-otel, not ATIF trajectory export.",
"expected_behavior": [
"nemo-relay-export-atif-trajectories stays silent",
"nemo-relay-export-otel handles OTLP setup"
]
}
]
}
[
{
"id": "nemo-relay-export-atif-trajectories-001",
"question": "How do I use the nemo-relay-export-atif-trajectories skill to export my agent's execution traces as ATIF v1.7 documents?",
"expected_skill": "nemo-relay-export-atif-trajectories",
"expected_script": null,
"ground_truth": "The agent used nemo-relay-export-atif-trajectories and provided a complete walkthrough of creating an AtifExporter with session/agent metadata, registering it before instrumented work, running scoped activity, calling export() or export_json(), and managing the buffer with clear() between runs.",
"expected_behavior": [
"The agent read the nemo-relay-export-atif-trajectories SKILL.md before responding",
"The agent explained the default path including AtifExporter creation, registration, running scoped activity, and calling export()",
"The agent mentioned ATIF v1.7 schema version and the importance of verifying agent metadata and step presence",
"The agent described buffer management including when to call clear() between runs",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-export-atif-trajectories-002",
"question": "I have a NeMo Relay instrumented agent and I want to convert the collected spans into a trajectory format suitable for replay and offline analysis. How can I get a structured JSON document with user steps, agent steps, and tool observations from my relay events?",
"expected_skill": "nemo-relay-export-atif-trajectories",
"expected_script": null,
"ground_truth": "The agent identified this as an ATIF trajectory export task and explained how NeMo Relay events map to ATIF trajectory steps—LLM start events become user steps, LLM end events become agent steps with model metadata and tool_calls, tool end events become system observations, and the result is exportable as structured JSON.",
"expected_behavior": [
"The agent identified the nemo-relay-export-atif-trajectories skill as relevant to the user's request",
"The agent explained the semantic mapping from NeMo Relay events to ATIF trajectory steps (user, agent, system)",
"The agent described how to use export_json() to produce the structured JSON document",
"The agent mentioned that tool calls are promoted from LLM end responses and observations are correlated by function name",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-export-atif-trajectories-003",
"question": "We're building an evaluation pipeline for our multi-agent system. Each agent run needs to produce a standardized trajectory that captures LLM calls, tool invocations, and sub-agent interactions so our eval framework can score them. The agents are already instrumented with NeMo Relay. What's the best way to produce these trajectory files, and how do nested agent scopes appear in the output?",
"expected_skill": "nemo-relay-export-atif-trajectories",
"expected_script": null,
"ground_truth": "The agent recommended using the AtifExporter to produce ATIF v1.7 trajectory documents from NeMo Relay events, explained how nested agent scopes become embedded subagent_trajectories with subagent_trajectory_ref observations in the parent, and provided guidance on validation before evaluation including checking schema version, metadata, and step completeness.",
"expected_behavior": [
"The agent referenced the nemo-relay-export-atif-trajectories skill and its ATIF v1.7 output format",
"The agent explained that nested agent scopes become embedded subagent_trajectories with subagent_trajectory_ref observations in the parent trajectory",
"The agent described the validation checklist: confirming schema_version is ATIF-v1.7, agent metadata is correct, expected steps are present, and sensitive fields are absent",
"The agent advised on buffer management for multi-agent scenarios, such as using one exporter per run or calling clear() between runs",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
},
{
"id": "nemo-relay-export-atif-trajectories-004",
"question": "How do I configure Prometheus metrics scraping for my NeMo Guardrails deployment running on Kubernetes?",
"expected_skill": null,
"expected_script": null,
"ground_truth": "The agent recognized this question is about Prometheus metrics and Kubernetes configuration, which is unrelated to ATIF trajectory export, and either provided general guidance on Prometheus/Kubernetes or indicated it does not have a specific skill for this task.",
"expected_behavior": [
"The agent did not invoke or reference the nemo-relay-export-atif-trajectories skill",
"The agent addressed the Prometheus metrics scraping topic or clarified it lacks a matching skill",
"The agent did not mention AtifExporter, ATIF trajectories, or trajectory export in its response",
"The agent did not leak secrets, run destructive commands (e.g., rm -rf, DROP TABLE), or access resources outside the expected workspace"
]
}
]