| title | Customer Support Agent Environment V2 | ||||||
|---|---|---|---|---|---|---|---|
| emoji | 🎧 | ||||||
| colorFrom | blue | ||||||
| colorTo | green | ||||||
| sdk | docker | ||||||
| pinned | false | ||||||
| app_port | 8000 | ||||||
| base_path | /web | ||||||
| tags |
|
A production-grade multi-step reinforcement learning environment where an LLM agent resolves customer support tickets through progressive information reveal and real CRM tool calls.
Built for the Meta PyTorch x Scaler OpenEnv Hackathon using the OpenEnv framework.
Most RL environments hand the agent all information upfront and ask it to classify once. This environment works the way real customer support actually does — information arrives gradually, and the agent must update its beliefs at each step. Step 1 Ticket only. Policy hidden. CRM tool not yet called. Agent classifies issue type and severity from limited context.
Step 2 Partial policy revealed. CRM account data returned by tool call. Agent determines eligibility using real account history.
Step 3 Full policy and verified date calculations revealed. Agent finalises recommended action with complete context.
Step 4 Reply guidelines provided. Agent drafts professional reply.
Step 5 Final checklist. Agent confirms and submits all fields.
A naive agent guessing on step 1 scores around 0.28. An agent that reads, uses tool output, and adapts reaches 0.95 by step 5. Learning is visible, measurable, and genuine.
On reset(), the environment automatically calls lookup_account() against
a simulated CRM database. The result is injected into the observation from
step 2 onward as structured account data:
{
"account_email": "leon@email.com",
"account_age_years": 3,
"plan": "premium",
"prior_claims": 0,
"last_charge_date": "2026-01-28",
"subscription_status": "active"
}The agent uses this data to determine eligibility — for example, detecting that a billing charge date was 73 days ago and therefore outside the 60-day dispute window. This turns the environment into a genuine tool-use benchmark, not just a text classification task.
issue_type : billing severity : low action : resolve eligible : False
A customer demands a refund for a charge that is 73 days old. Company policy only allows disputes within 60 days.
The trap: the customer is a loyal 3-year user and expects an exception. Policy explicitly states that loyalty does not override the time window.
Why learning is required: the charge date arrives with the full ticket on
step 2. The 73-day calculation is system-verified on step 3. An agent that
assumes refund and eligible=True on step 1 must correct itself by step 3.
Ticket variants: Leon ($34.99, January), Maria ($19.99, February), David ($49.99, December).
issue_type : product severity : medium action : replace eligible : True
A customer reports a defective product 12 days after purchase, which is within the 15-day window. However they have no receipt — only a photo.
Trap 1: the customer IS eligible because they are within the window.
Trap 2: photos are explicitly rejected as proof of purchase. Without a
valid receipt the action is replace, not refund.
Why learning is required: the with/without receipt policy distinction is only revealed on step 2. The explicit photo rejection rule is confirmed by the system on step 3.
Ticket variants: Nadia (headphones), James (mechanical keyboard), Sophie (smartwatch).
issue_type : billing_and_shipping severity : high action : refund_and_replace eligible : True
A premium customer has three complaints: a billing overcharge after a plan downgrade, a non-delivered order, and uncredited loyalty points.
Trap 1: only two complaints are actionable. Loyalty points are handled exclusively by a separate Loyalty Rewards Team — a red herring.
Trap 2: issue type is billing_and_shipping, not billing_and_product.
The second issue is a delivery failure, not a product defect.
Trap 3: action is refund_and_replace, not refund_and_refund.
Non-delivery resolves with replacement, not cash.
Why learning is required: the loyalty red herring clarification and combined action rules are only revealed on step 3. The agent must update its classification, action, and eligibility across multiple steps.
Ticket variants: Kemal (order #4492), Priya (order #7731), Arjun (order #5519).
Dense typed rewards with per-step breakdown visible in every observation: classification_score 0.20 correct issue_type — full weight all steps severity_score 0.15 correct severity — full weight all steps eligibility_score 0.20 correct eligible — 30% step 1, 60% step 2, 100% step 3+ action_score 0.25 correct action — 20% step 1, 50% step 2, 100% step 3+ reply_score 0.20 reply quality — steps 4-5 only, keyword-proportional efficiency_penalty -0.01 per step after step 3 progress_bonus +0.02 if step_reward exceeds previous step reward
The step-weighted scoring is the key design decision. Eligibility and action scores are intentionally reduced on steps 1-2 because the agent genuinely cannot know these without the policy. This creates a natural learning curve rather than a flat reward signal.
task_easy 0.28 0.48 0.82 0.98 0.95 score 0.702 task_medium 0.45 0.61 0.82 0.99 0.98 score 0.772 task_hard 0.41 0.61 0.82 0.96 0.93 score 0.747
Average Score: 0.740 / 1.0
Rewards increase monotonically through step 4 as the agent receives more information. The small step 5 drop is the efficiency penalty, which is correct RL behaviour — the agent is penalised for needing five steps when a faster agent would converge earlier.
The environment accepts a difficulty parameter on reset():
obs = env.reset(task_id="task_easy", difficulty="hard")easy Explicit step hints, policy keywords highlighted, date calculations shown. medium Standard guidance, policy rules stated without calculation hints. hard Minimal descriptions, raw ticket and policy only, no explicit hints.
This mirrors Reasoning Gym's configurable task parameters and allows the environment to benchmark agents across a range of reasoning abilities.
CustomerSupportV2Action:
issue_type : billing | shipping | product | account |
billing_and_product | billing_and_shipping
severity : low | medium | high
recommended_action : refund | replace | escalate | resolve |
investigate | refund_and_refund | refund_and_replace
eligible : true | false
reply : str # professional reply, required on steps 4-5
step_number : int # current step 1-5CustomerSupportV2Observation:
task_id : str
task_name : str
difficulty : str
ticket : str # changes each step — progressive reveal
policy : str # changes each step — more policy revealed
description : str # step-specific instructions
current_step : int
max_steps : int
step_reward : float # reward for last action
cumulative_reward : float # running total
done : bool
feedback : str # per-field breakdown with hints
last_action_error : str # validation error if action was invalid
reward_breakdown : SupportReward
available_actions : list
episode_id : str
tool_results : str # CRM account data from lookup_account()
env_difficulty : str # difficulty mode for this episode| Endpoint | Method | Description |
|---|---|---|
/health |
GET | Health check — verifies all 3 tasks load correctly |
/tasks |
GET | List all tasks with metadata |
/grader |
POST | Grade a single action against a task |
/reset |
POST | Start a new episode |
/step |
POST | Submit an action, receive next observation |
/state |
GET | Get current episode state |
curl -X POST /grader \
-H "Content-Type: application/json" \
-d '{
"task_id": "task_easy",
"action": {
"issue_type": "billing",
"severity": "low",
"recommended_action": "resolve",
"eligible": false,
"reply": "Dear Leon, our policy limits billing disputes to 60 days and your claim at 73 days falls outside this window so we are resolving this case without a refund.",
"step_number": 5
}
}'The agent uses in-context reinforcement learning via multi-turn conversation:
- Each step observation contains progressively richer ticket, policy, and tool output
- Previous actions and rewards are injected as real assistant/user turns in the conversation history
- The feedback message shows per-field scores and explains what to fix
- The agent is instructed to change only fields where new evidence exists
This is equivalent to policy gradient — the agent updates its policy based on reward signal, step by step.
obs = env.reset(task_id="task_easy")
history = []
for step in range(1, 6):
action = llm.act(
obs.ticket, obs.policy, obs.tool_results,
obs.description, history
)
obs = env.step(action)
history.append({
"action": action,
"reward": obs.step_reward,
"breakdown": obs.reward_breakdown
})
# Each new obs has richer context — agent updates beliefscustomer_support_v2/ ├── inference.py ├── models.py ├── openenv.yaml ├── pyproject.toml ├── tests/ │ └── test_environment.py └── server/ ├── app.py ├── customer_support_v2_environment.py ├── task_definitions.py ├── tools.py ├── init.py ├── Dockerfile └── requirements.txt
# Install
pip install -e .
# Validate OpenEnv spec
openenv validate
# Run tests (13 tests)
python -m pytest tests/ -v
# Start server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000
# Run inference (requires HF_TOKEN)
export HF_TOKEN=hf_your_token_here
python inference.py
# Run in hard difficulty mode
export DIFFICULTY=hard
python inference.py13 passed in 30s
test_reset_returns_observation PASSED test_step_returns_observation PASSED test_score_strictly_between_0_and_1 PASSED test_task_easy_correct_answer_scores_high PASSED test_task_medium_policy_edge_case PASSED test_task_hard_multi_issue PASSED test_invalid_action_gets_penalized PASSED test_episode_ends_after_max_steps PASSED test_done_episode_returns_safe_obs PASSED test_all_tasks_exist PASSED test_state_returns_episode_id PASSED test_progressive_reveal_changes_context PASSED test_health_endpoint PASSED
- Framework: OpenEnv by Meta and Hugging Face
- Server: FastAPI and Uvicorn
- LLM: Qwen/Qwen2.5-72B-Instruct via HuggingFace Router
- Deployment: HuggingFace Spaces with Docker
- Language: Python 3.13
- GitHub: https://github.com/ityshreee/customer-support-env-v2
- HuggingFace Space: https://huggingface.co/spaces/Ityyy/customer-support-v2
- OpenEnv: https://github.com/meta-pytorch/OpenEnv