Skip to content

ityshreee/customer_support_V2

Repository files navigation

title Customer Support Agent Environment V2
emoji 🎧
colorFrom blue
colorTo green
sdk docker
pinned false
app_port 8000
base_path /web
tags
openenv
reinforcement-learning
customer-support
llm-agent
tool-use
progressive-reveal

Customer Support Agent Environment V2

A production-grade multi-step reinforcement learning environment where an LLM agent resolves customer support tickets through progressive information reveal and real CRM tool calls.

Built for the Meta PyTorch x Scaler OpenEnv Hackathon using the OpenEnv framework.


What Makes This Environment Different

Most RL environments hand the agent all information upfront and ask it to classify once. This environment works the way real customer support actually does — information arrives gradually, and the agent must update its beliefs at each step. Step 1 Ticket only. Policy hidden. CRM tool not yet called. Agent classifies issue type and severity from limited context.

Step 2 Partial policy revealed. CRM account data returned by tool call. Agent determines eligibility using real account history.

Step 3 Full policy and verified date calculations revealed. Agent finalises recommended action with complete context.

Step 4 Reply guidelines provided. Agent drafts professional reply.

Step 5 Final checklist. Agent confirms and submits all fields.

A naive agent guessing on step 1 scores around 0.28. An agent that reads, uses tool output, and adapts reaches 0.95 by step 5. Learning is visible, measurable, and genuine.


Real Tool Use — CRM Account Lookup

On reset(), the environment automatically calls lookup_account() against a simulated CRM database. The result is injected into the observation from step 2 onward as structured account data:

{
  "account_email": "leon@email.com",
  "account_age_years": 3,
  "plan": "premium",
  "prior_claims": 0,
  "last_charge_date": "2026-01-28",
  "subscription_status": "active"
}

The agent uses this data to determine eligibility — for example, detecting that a billing charge date was 73 days ago and therefore outside the 60-day dispute window. This turns the environment into a genuine tool-use benchmark, not just a text classification task.


Tasks

task_easy — Late Billing Claim Outside Policy Window

issue_type : billing severity : low action : resolve eligible : False

A customer demands a refund for a charge that is 73 days old. Company policy only allows disputes within 60 days.

The trap: the customer is a loyal 3-year user and expects an exception. Policy explicitly states that loyalty does not override the time window.

Why learning is required: the charge date arrives with the full ticket on step 2. The 73-day calculation is system-verified on step 3. An agent that assumes refund and eligible=True on step 1 must correct itself by step 3.

Ticket variants: Leon ($34.99, January), Maria ($19.99, February), David ($49.99, December).


task_medium — Defective Product Without a Valid Receipt

issue_type : product severity : medium action : replace eligible : True

A customer reports a defective product 12 days after purchase, which is within the 15-day window. However they have no receipt — only a photo.

Trap 1: the customer IS eligible because they are within the window. Trap 2: photos are explicitly rejected as proof of purchase. Without a valid receipt the action is replace, not refund.

Why learning is required: the with/without receipt policy distinction is only revealed on step 2. The explicit photo rejection rule is confirmed by the system on step 3.

Ticket variants: Nadia (headphones), James (mechanical keyboard), Sophie (smartwatch).


task_hard — Billing Overcharge + Missing Delivery + Loyalty Red Herring

issue_type : billing_and_shipping severity : high action : refund_and_replace eligible : True

A premium customer has three complaints: a billing overcharge after a plan downgrade, a non-delivered order, and uncredited loyalty points.

Trap 1: only two complaints are actionable. Loyalty points are handled exclusively by a separate Loyalty Rewards Team — a red herring.

Trap 2: issue type is billing_and_shipping, not billing_and_product. The second issue is a delivery failure, not a product defect.

Trap 3: action is refund_and_replace, not refund_and_refund. Non-delivery resolves with replacement, not cash.

Why learning is required: the loyalty red herring clarification and combined action rules are only revealed on step 3. The agent must update its classification, action, and eligibility across multiple steps.

Ticket variants: Kemal (order #4492), Priya (order #7731), Arjun (order #5519).


Reward Function

Dense typed rewards with per-step breakdown visible in every observation: classification_score 0.20 correct issue_type — full weight all steps severity_score 0.15 correct severity — full weight all steps eligibility_score 0.20 correct eligible — 30% step 1, 60% step 2, 100% step 3+ action_score 0.25 correct action — 20% step 1, 50% step 2, 100% step 3+ reply_score 0.20 reply quality — steps 4-5 only, keyword-proportional efficiency_penalty -0.01 per step after step 3 progress_bonus +0.02 if step_reward exceeds previous step reward

The step-weighted scoring is the key design decision. Eligibility and action scores are intentionally reduced on steps 1-2 because the agent genuinely cannot know these without the policy. This creates a natural learning curve rather than a flat reward signal.


Observed Learning Curve

task_easy 0.28 0.48 0.82 0.98 0.95 score 0.702 task_medium 0.45 0.61 0.82 0.99 0.98 score 0.772 task_hard 0.41 0.61 0.82 0.96 0.93 score 0.747

Average Score: 0.740 / 1.0

Rewards increase monotonically through step 4 as the agent receives more information. The small step 5 drop is the efficiency penalty, which is correct RL behaviour — the agent is penalised for needing five steps when a faster agent would converge earlier.


Configurable Difficulty

The environment accepts a difficulty parameter on reset():

obs = env.reset(task_id="task_easy", difficulty="hard")

easy Explicit step hints, policy keywords highlighted, date calculations shown. medium Standard guidance, policy rules stated without calculation hints. hard Minimal descriptions, raw ticket and policy only, no explicit hints.

This mirrors Reasoning Gym's configurable task parameters and allows the environment to benchmark agents across a range of reasoning abilities.


Action Space

CustomerSupportV2Action:
    issue_type          : billing | shipping | product | account |
                          billing_and_product | billing_and_shipping
    severity            : low | medium | high
    recommended_action  : refund | replace | escalate | resolve |
                          investigate | refund_and_refund | refund_and_replace
    eligible            : true | false
    reply               : str   # professional reply, required on steps 4-5
    step_number         : int   # current step 1-5

Observation Space

CustomerSupportV2Observation:
    task_id             : str
    task_name           : str
    difficulty          : str
    ticket              : str    # changes each step — progressive reveal
    policy              : str    # changes each step — more policy revealed
    description         : str    # step-specific instructions
    current_step        : int
    max_steps           : int
    step_reward         : float  # reward for last action
    cumulative_reward   : float  # running total
    done                : bool
    feedback            : str    # per-field breakdown with hints
    last_action_error   : str    # validation error if action was invalid
    reward_breakdown    : SupportReward
    available_actions   : list
    episode_id          : str
    tool_results        : str    # CRM account data from lookup_account()
    env_difficulty      : str    # difficulty mode for this episode

API Endpoints

Endpoint Method Description
/health GET Health check — verifies all 3 tasks load correctly
/tasks GET List all tasks with metadata
/grader POST Grade a single action against a task
/reset POST Start a new episode
/step POST Submit an action, receive next observation
/state GET Get current episode state

Grader Example

curl -X POST /grader \
  -H "Content-Type: application/json" \
  -d '{
    "task_id": "task_easy",
    "action": {
      "issue_type": "billing",
      "severity": "low",
      "recommended_action": "resolve",
      "eligible": false,
      "reply": "Dear Leon, our policy limits billing disputes to 60 days and your claim at 73 days falls outside this window so we are resolving this case without a refund.",
      "step_number": 5
    }
  }'

How the LLM Agent Learns

The agent uses in-context reinforcement learning via multi-turn conversation:

  1. Each step observation contains progressively richer ticket, policy, and tool output
  2. Previous actions and rewards are injected as real assistant/user turns in the conversation history
  3. The feedback message shows per-field scores and explains what to fix
  4. The agent is instructed to change only fields where new evidence exists

This is equivalent to policy gradient — the agent updates its policy based on reward signal, step by step.

obs = env.reset(task_id="task_easy")
history = []

for step in range(1, 6):
    action = llm.act(
        obs.ticket, obs.policy, obs.tool_results,
        obs.description, history
    )
    obs = env.step(action)
    history.append({
        "action": action,
        "reward": obs.step_reward,
        "breakdown": obs.reward_breakdown
    })
    # Each new obs has richer context — agent updates beliefs

Project Structure

customer_support_v2/ ├── inference.py ├── models.py ├── openenv.yaml ├── pyproject.toml ├── tests/ │ └── test_environment.py └── server/ ├── app.py ├── customer_support_v2_environment.py ├── task_definitions.py ├── tools.py ├── init.py ├── Dockerfile └── requirements.txt


Setup and Running

# Install
pip install -e .

# Validate OpenEnv spec
openenv validate

# Run tests (13 tests)
python -m pytest tests/ -v

# Start server locally
uvicorn server.app:app --host 0.0.0.0 --port 8000

# Run inference (requires HF_TOKEN)
export HF_TOKEN=hf_your_token_here
python inference.py

# Run in hard difficulty mode
export DIFFICULTY=hard
python inference.py

Test Results

13 passed in 30s

test_reset_returns_observation PASSED test_step_returns_observation PASSED test_score_strictly_between_0_and_1 PASSED test_task_easy_correct_answer_scores_high PASSED test_task_medium_policy_edge_case PASSED test_task_hard_multi_issue PASSED test_invalid_action_gets_penalized PASSED test_episode_ends_after_max_steps PASSED test_done_episode_returns_safe_obs PASSED test_all_tasks_exist PASSED test_state_returns_episode_id PASSED test_progressive_reveal_changes_context PASSED test_health_endpoint PASSED


Tech Stack

  • Framework: OpenEnv by Meta and Hugging Face
  • Server: FastAPI and Uvicorn
  • LLM: Qwen/Qwen2.5-72B-Instruct via HuggingFace Router
  • Deployment: HuggingFace Spaces with Docker
  • Language: Python 3.13

Links

About

Multi-task RL environment for training LLM customer support agents. Features progressive information reveal across 5 tasks × 3 variants, continuous rewards, OpenEnv validated, 17 tests passing. Built for Meta PyTorch × Scaler OpenEnv Hackathon.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors