---
title: CloudCostAnomalyHunter
sdk: docker
short_description: OpenEnv-style FinOps benchmark
tags:
---
Cloud Cost Anomaly Hunter is an OpenEnv-style FinOps benchmark environment. It simulates NovaTech Inc., a multi-cloud company where an agent investigates cloud billing data, infrastructure metadata, and SaaS license usage to detect waste and recommend remediations.
NovaTech runs workloads across AWS, GCP, and multiple SaaS tools. The agent acts as a FinOps analyst who must find hidden anomalies within a step limit, then submit a final report that is scored by a deterministic grader.
Each observation is a typed, structured payload represented by the `Observation` model in `env/models.py`:

- `step: int` - current step index.
- `billing_summary: BillingSummary` - aggregated cost statistics.
- `recent_query_results: Optional[List[Dict[str, Any]]]` - latest billing query rows.
- `infra_query_result: Optional[ResourceInfo]` - latest infra lookup payload.
- `flagged_so_far: List[AnomalyFlag]` - anomaly flags submitted so far.
- `step_reward: float` - dense incremental reward.
- `done: bool` - episode completion flag.
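The field list above can be sketched as a dataclass. This is an illustrative reconstruction, not the actual contents of `env/models.py`; `BillingSummary`, `ResourceInfo`, and `AnomalyFlag` are stubbed with hypothetical fields.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional

# Stub types standing in for the real models in env/models.py.
@dataclass
class BillingSummary:
    total_cost_usd: float = 0.0  # hypothetical field

@dataclass
class ResourceInfo:
    resource_id: str = ""  # hypothetical field

@dataclass
class AnomalyFlag:
    resource_id: str = ""  # hypothetical field

@dataclass
class Observation:
    step: int                                                    # current step index
    billing_summary: BillingSummary                              # aggregated cost statistics
    recent_query_results: Optional[List[Dict[str, Any]]] = None  # latest billing query rows
    infra_query_result: Optional[ResourceInfo] = None            # latest infra lookup payload
    flagged_so_far: List[AnomalyFlag] = field(default_factory=list)  # flags submitted so far
    step_reward: float = 0.0                                     # dense incremental reward
    done: bool = False                                           # episode completion flag
```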
The environment supports six function-call actions:
- `query_billing(filter: dict)` - filters billing rows by exact field matches.
- `query_infra(resource_id: str)` - retrieves resource metadata.
- `flag_anomaly(resource_id, anomaly_type, severity, reasoning, root_cause_category?)` - logs an anomaly detection.
- `recommend_action(resource_id, action_type_detail, estimated_saving_usd)` - logs a remediation recommendation.
- `write_note(content: str)` - appends analyst notes to the scratchpad.
- `submit_report()` - finalizes the episode and triggers grading.
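As a rough illustration of the six calls, the snippet below packages each one as a name-plus-arguments dict. The `make_action` helper and the dict shape are assumptions for the example; the environment may use a typed action class instead, and the resource IDs and argument values are invented.

```python
# Hypothetical helper: package a function-call action as a plain dict.
# The real environment may represent actions with a typed class instead.
def make_action(name: str, **arguments) -> dict:
    return {"name": name, "arguments": arguments}

# The six supported calls, expressed with this helper (all values invented):
actions = [
    make_action("query_billing", filter={"service": "EC2", "region": "us-east-1"}),
    make_action("query_infra", resource_id="i-0abc123"),
    make_action("flag_anomaly", resource_id="i-0abc123", anomaly_type="zombie",
                severity="high", reasoning="No traffic observed for 60 days"),
    make_action("recommend_action", resource_id="i-0abc123",
                action_type_detail="terminate_instance", estimated_saving_usd=312.50),
    make_action("write_note", content="EC2 spend flat except one idle instance."),
    make_action("submit_report"),
]
```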
Dense reward is implemented in env/reward.py.
- True positive flag: +0.15 to +0.30, depending on task severity weighting.
- False positive flag: -0.05 each.
- Root cause match (Task 2): +0.10.
- Saving estimate within 20% (Task 2): +0.05.
- Efficient querying (fewer than 15 queries at submit): +0.05.
- Step penalty after step 20: -0.002 per extra step.
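The rules above can be re-stated as a small scoring function. This is illustrative only: the function name, argument layout, and the way `severity_weight` maps into the +0.15..+0.30 band are assumptions; the authoritative logic lives in `env/reward.py`.

```python
# Illustrative re-implementation of the reward rules listed above.
def score_episode(
    true_positives: int,
    false_positives: int,
    severity_weight: float,    # 0.0-1.0; maps the TP value into the +0.15..+0.30 band
    root_cause_matched: bool,  # Task 2 only
    saving_within_20pct: bool, # Task 2 only
    query_count: int,
    steps_used: int,
) -> float:
    reward = 0.0
    # True positives: +0.15 to +0.30 depending on severity weighting (mapping assumed).
    tp_value = 0.15 + 0.15 * min(max(severity_weight, 0.0), 1.0)
    reward += tp_value * true_positives
    reward -= 0.05 * false_positives          # false positives: -0.05 each
    if root_cause_matched:
        reward += 0.10                        # root cause match (Task 2)
    if saving_within_20pct:
        reward += 0.05                        # saving estimate within 20% (Task 2)
    if query_count < 15:
        reward += 0.05                        # efficient-querying bonus
    if steps_used > 20:
        reward -= 0.002 * (steps_used - 20)   # per-step penalty after step 20
    return round(reward, 4)
```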
### Task 1: Zombie Detection

Input: 90-day billing + infra snapshot. Objective: find the 3 injected zombie resources.
Expected baseline range:
- GPT-4o: 0.70 - 0.85
### Task 2: Spike RCA

Input: 180-day billing with one injected spike. Objective: detect the spike's date, service, and root cause, and estimate the saving.
Expected baseline range:
- GPT-4o: 0.55 - 0.70
### Task 3: Full Audit

Input: 365-day billing + infra snapshot + SaaS licenses. Objective: identify and classify five anomaly types, then recommend remediations that are graded on quality.
Expected baseline range:
- GPT-4o: 0.40 - 0.55
```bash
pip install -r requirements.txt
docker build -t cloud-cost-anomaly-hunter .
docker run --rm -p 7860:7860 cloud-cost-anomaly-hunter
```

The submission entrypoint is `inference.py` at the repo root.
This is the OpenEnv submission script expected by validators.
Required variables:
- `API_BASE_URL`
- `MODEL_NAME`
- `HF_TOKEN`
Example:

```bash
export API_BASE_URL=https://your-openai-compatible-endpoint/v1
export MODEL_NAME=gpt-4o
export HF_TOKEN=your_token
python inference.py
```

The script emits structured stdout logs using `[START]`, `[STEP]`, and `[END]` records.
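If you want to post-process those logs, a minimal parser could look like the sketch below. The payload format after each tag is an assumption (the README only names the tags); adjust the pattern to the real record layout.

```python
import re

# Hypothetical parser for the [START]/[STEP]/[END] stdout records.
# The payload format after each tag is assumed, not specified by the README.
RECORD_RE = re.compile(r"^\[(START|STEP|END)\]\s*(.*)$")

def parse_records(stdout_text: str) -> list[tuple[str, str]]:
    """Return (tag, payload) pairs for every recognized record line."""
    records = []
    for line in stdout_text.splitlines():
        m = RECORD_RE.match(line.strip())
        if m:
            records.append((m.group(1), m.group(2)))
    return records

# Invented sample output for demonstration:
sample = """\
[START] episode=task1 model=gpt-4o
[STEP] 1 action=query_billing
[STEP] 2 action=flag_anomaly
[END] total_reward=0.85
"""
```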
baseline/run_baseline.py is a local heuristic smoke script for reproducibility checks.
It is not the official OpenEnv submission entrypoint.
```bash
export OPENAI_API_KEY=your_key_here
python baseline/run_baseline.py
```

On Windows PowerShell:

```powershell
$env:OPENAI_API_KEY = "your_key_here"
python baseline/run_baseline.py
```

| Task | Expected GPT-4o | Perfect |
|---|---|---|
| Task 1: Zombie Detection | 0.70 - 0.85 | 1.00 |
| Task 2: Spike RCA | 0.55 - 0.70 | 1.00 |
| Task 3: Full Audit | 0.40 - 0.55 | 1.00 |
| Aggregate mean | 0.55 - 0.70 | 1.00 |
- Add anomaly templates in `data/anomaly_templates.json`.
- Extend synthetic data logic in `env/data_generator.py`.
- Add a new grader in `tasks/` and wire it in `env/environment.py`.
- Add tests in `tests/` for determinism, reward density, and compliance.
Run:
```bash
python scripts/pre_submission_validate.py
```

This checks core submission prerequisites (metadata, inference entrypoint, tasks/graders, and Space endpoints).