ICML 2026
Xiang Li1, Liping Yi1, Mingze Kong2, Ming Zhang3, Zhongxiang Dai2, Qinghua Hu1
1Tianjin University 2The Chinese University of Hong Kong, Shenzhen 3East China Normal University
[Project Page] [Paper] [arXiv] [Code]
- May 2026: ALSO was accepted to ICML 2026.
ALSO studies online strategy optimization for LLM-based social agents in multi-turn social simulation. In environments such as Sotopia, agents face evolving dialogue contexts and non-stationary opponents, so a static persona or fixed behavioral instruction can lead to repeated deadlocks and poor goal completion.
ALSO formulates turn-level strategy adaptation as an adversarial bandit problem. At each dialogue turn, the system selects a persona-strategy arm, injects the selected social strategy into the agent prompt, observes reward feedback from the interaction, and updates a lightweight neural surrogate for sample-efficient online adaptation. No model weights are fine-tuned.
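The select, inject, observe, update loop described above can be illustrated with textbook EXP3, a standard algorithm for adversarial bandits. This is a minimal sketch under assumptions, not the ALSO implementation: the strategy texts, `inject_strategy`, and the constant rewards are hypothetical placeholders, rewards are assumed to be rescaled to [0, 1], and the neural surrogate is left out.

```python
import math
import random


class Exp3:
    """Textbook EXP3 for adversarial bandits; rewards must lie in [0, 1]."""

    def __init__(self, n_arms: int, gamma: float = 0.1, seed: int = 0) -> None:
        self.n_arms = n_arms
        self.gamma = gamma  # exploration rate in (0, 1]
        self.weights = [1.0] * n_arms
        self.rng = random.Random(seed)

    def probabilities(self) -> list:
        # Mix the weight-proportional distribution with uniform exploration.
        total = sum(self.weights)
        return [
            (1 - self.gamma) * w / total + self.gamma / self.n_arms
            for w in self.weights
        ]

    def select(self) -> int:
        probs = self.probabilities()
        return self.rng.choices(range(self.n_arms), weights=probs, k=1)[0]

    def update(self, arm: int, reward: float) -> None:
        # Importance weighting keeps the reward estimate unbiased even though
        # only the selected arm's reward is observed.
        estimate = reward / self.probabilities()[arm]
        self.weights[arm] *= math.exp(self.gamma * estimate / self.n_arms)
        # Rescale so the weights cannot overflow; probabilities are unchanged.
        top = max(self.weights)
        self.weights = [w / top for w in self.weights]


# Hypothetical strategy arms, for illustration only.
STRATEGIES = [
    "concede on a minor point",
    "ask a clarifying question",
    "restate the shared goal",
]


def inject_strategy(base_prompt: str, strategy: str) -> str:
    """Append the selected strategy to the agent prompt (hypothetical format)."""
    return f"{base_prompt}\n\nStrategy for this turn: {strategy}"


bandit = Exp3(n_arms=len(STRATEGIES))
for turn in range(300):
    arm = bandit.select()
    prompt = inject_strategy("You are negotiating a price.", STRATEGIES[arm])
    # Placeholder feedback: a real run would score the resulting dialogue turn.
    reward = 0.9 if arm == 2 else 0.2
    bandit.update(arm, reward)

print([round(p, 3) for p in bandit.probabilities()])
```

Because EXP3 makes no stationarity assumption about the rewards, the same update still applies when the opponent changes behavior mid-dialogue.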
This repository contains the Sotopia-based implementation used for the paper experiments, including the main ALSO runner, strategy spaces, bandit baselines, evaluation scripts, and focused regression tests.
- Online adaptation: adapts within a single multi-turn interaction instead of relying on offline retraining.
- Adversarial bandit formulation: does not assume a stationary or cooperative opponent.
- Strategy injection: optimizes high-level behavioral strategies at prompt time without fine-tuning LLMs.
- Neural reward surrogate: predicts per-arm rewards from interaction context to reduce exploration cost.
- Sotopia evaluation: supports all/hard scenario splits, bilateral optimization, static baselines, and evolutionary prompt-optimization baselines.
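To make the reward-surrogate idea concrete, the sketch below predicts a reward for every arm from a context vector and learns only from the arm that was actually played. It is a deliberately simplified stand-in: a per-arm linear model trained by SGD replaces the neural network, and `LinearSurrogate`, the two-dimensional context, and the reward values are all hypothetical.

```python
class LinearSurrogate:
    """Per-arm linear reward predictor trained by SGD (stand-in for a neural net)."""

    def __init__(self, n_arms: int, dim: int, lr: float = 0.1) -> None:
        self.lr = lr
        self.weights = [[0.0] * dim for _ in range(n_arms)]

    def predict(self, arm: int, context: list) -> float:
        return sum(w * x for w, x in zip(self.weights[arm], context))

    def predict_all(self, context: list) -> list:
        # Predictions for every arm, including arms not yet tried in this context.
        return [self.predict(arm, context) for arm in range(len(self.weights))]

    def update(self, arm: int, context: list, reward: float) -> None:
        # One SGD step on the squared error for the arm that was played.
        error = self.predict(arm, context) - reward
        self.weights[arm] = [
            w - self.lr * error * x for w, x in zip(self.weights[arm], context)
        ]


# Hypothetical context features: [bias term, "dialogue is deadlocked" flag].
context = [1.0, 0.0]
surrogate = LinearSurrogate(n_arms=2, dim=2)
for _ in range(50):
    surrogate.update(0, context, 0.2)  # arm 0 is observed to pay about 0.2
    surrogate.update(1, context, 0.8)  # arm 1 is observed to pay about 0.8

predictions = surrogate.predict_all(context)
best_arm = max(range(len(predictions)), key=predictions.__getitem__)
print(predictions, best_arm)
```

Sharing a predictor across contexts lets the selector rank arms it has not yet tried in the current situation, which is where the saving in exploration comes from.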
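The project page reports that ALSO is best or near-best on Sotopia-All and Sotopia-Hard in the bilateral optimization setting. In the Overall score summary below, ALSO ranks first in three of the four rows and second to Instinct (3.648 vs. 3.666) for Qwen2.5-72B on Sotopia-Hard.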
| Model | Split | Original | Instinct | OPRO | EvoPrompt | ALSO |
|---|---|---|---|---|---|---|
| DeepSeek-V3.2 | Sotopia-All | 3.619 | 3.851 | 3.787 | 3.737 | 3.889 |
| DeepSeek-V3.2 | Sotopia-Hard | 3.025 | 3.427 | 3.344 | 3.292 | 3.527 |
| Qwen2.5-72B | Sotopia-All | 3.676 | 3.848 | 3.689 | 3.825 | 3.882 |
| Qwen2.5-72B | Sotopia-Hard | 3.347 | 3.666 | 3.242 | 3.491 | 3.648 |
See the project page for additional ablations, strategy drift analysis, heterogeneous model pairing, and case studies.
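The improvement of ALSO over the Original (no optimization) baseline can be recomputed directly from the Overall scores in the table. The snippet below is only arithmetic on those eight numbers; the variable names are illustrative.

```python
# Overall scores copied from the table: (Original, ALSO) per model and split.
scores = {
    ("DeepSeek-V3.2", "Sotopia-All"): (3.619, 3.889),
    ("DeepSeek-V3.2", "Sotopia-Hard"): (3.025, 3.527),
    ("Qwen2.5-72B", "Sotopia-All"): (3.676, 3.882),
    ("Qwen2.5-72B", "Sotopia-Hard"): (3.347, 3.648),
}

gains = {}
for (model, split), (original, also) in scores.items():
    absolute = round(also - original, 3)
    relative = round(100 * (also - original) / original, 1)
    gains[(model, split)] = (absolute, relative)
    print(f"{model:<14} {split:<13} +{absolute:.3f} ({relative:.1f}%)")
```

For both models the absolute gain is larger on Sotopia-Hard (0.502 and 0.301) than on Sotopia-All (0.270 and 0.206).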
.
├── sotopia/                                   # Sotopia package code
├── tests/                                     # Upstream Sotopia tests
└── experiments/also/                          # ALSO paper artifact
    ├── core/                                  # Bandits, strategy spaces, dynamic envs, evaluators
    ├── conf/main_experiments/                 # Tmuxinator configs for smoke and paper runs
    ├── generated_strategies/                  # Small strategy pools used by the strategy loader
    ├── scripts/generate_strategy_cache.py     # Strategy embedding cache generation
    ├── tests/                                 # Focused artifact tests
    ├── calculate_cost.py
    ├── evaluate_by_tag.py
    └── run_bandit_simulation_context.py
Generated runtime artifacts are intentionally excluded from git:
- experiments/also/outputs/
- experiments/also/cache/
- experiments/also/results/
- embedding caches, figures, spreadsheets, and historical intermediate datasets
Use Python 3.10-3.12 and uv.
git clone https://github.com/Babylonehy/ALSO.git
cd ALSO
uv sync --extra api --extra test --extra paper
uv run sotopia install
This initializes the Sotopia runtime data needed by the experiment runner. The full paper runs also require access to model APIs and, when using database-backed evaluation, the Sotopia/Redis setup expected by the base Sotopia package.
Copy the template and edit the values for your machine:
cp .env.example .env
At minimum, set the model-provider keys you plan to use:
OPENROUTER_API_KEY=replace_with_your_openrouter_key
OPENAI_API_KEY=replace_with_your_openai_key_if_using_openai_models
If you use a remote or password-protected Redis service, also set REDIS_OM_URL in .env. Keep .env private; it is ignored by git.
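A quick sanity check of the edited .env can catch leftover placeholders before a long run. The sketch below uses only the standard library; `parse_env` and `check_env` are hypothetical helpers, not part of the repository, and the check assumes the template placeholders start with `replace_with_` as in the lines above.

```python
from urllib.parse import urlsplit

PROVIDER_KEYS = ("OPENROUTER_API_KEY", "OPENAI_API_KEY")


def parse_env(text: str) -> dict:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env(values: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    usable = [
        key
        for key in PROVIDER_KEYS
        if values.get(key) and not values[key].startswith("replace_with_")
    ]
    if not usable:
        problems.append("set at least one model-provider key")
    url = values.get("REDIS_OM_URL")
    if url and urlsplit(url).scheme not in ("redis", "rediss"):
        problems.append("REDIS_OM_URL should start with redis:// or rediss://")
    return problems


# Example with inline text; in practice read the file, e.g. open(".env").read().
sample = """
OPENROUTER_API_KEY=sk-example
OPENAI_API_KEY=replace_with_your_openai_key_if_using_openai_models
REDIS_OM_URL=redis://default:password@host:6379
"""
print(check_env(parse_env(sample)))
```

The check is intentionally loose: it does not validate the keys against the providers, it only flags values that were never filled in.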
ALSO needs the Sotopia Redis database for scenario, agent, and episode records. If uv run sotopia install created the Docker service successfully, redis-stack should be running on port 6379.
docker ps | grep redis-stack
docker exec redis-stack redis-cli ping
Expected output:
PONG
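The same liveness check can be scripted from Python without going through Docker. This is a sketch under assumptions: it expects an unauthenticated Redis on localhost:6379 that accepts inline commands, and `redis_ping` is a hypothetical helper rather than a Sotopia function. A password-protected server answers with an error instead of PONG, so the helper returns False in that case.

```python
import socket


def redis_ping(host: str = "localhost", port: int = 6379, timeout: float = 2.0) -> bool:
    """Send an inline PING and report whether the server answered +PONG."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.sendall(b"PING\r\n")
            return conn.recv(64).startswith(b"+PONG")
    except OSError:
        # Connection refused, timeout, or unreachable host.
        return False


if __name__ == "__main__":
    print("redis reachable:", redis_ping())
```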
If the container exists but is stopped, restart it:
docker start redis-stack
If the container does not exist yet, run a non-interactive Sotopia install with Docker and published data:
uv run sotopia install \
--use-docker \
--load-database \
--redis-data-path "$(pwd)" \
--overwrite-existing-data
Then verify that Sotopia can query the loaded data:
uv run python - <<'PY'
from sotopia.database import AgentProfile, EnvironmentProfile
from sotopia.database.env_agent_combo_storage import EnvAgentComboStorage
print("agents:", len(list(AgentProfile.all_pks())))
print("environments:", len(list(EnvironmentProfile.all_pks())))
print("env_agent_combos:", len(list(EnvAgentComboStorage.all_pks())))
PY
If Redis is remote or password-protected, set REDIS_OM_URL in .env before running Python scripts:
REDIS_OM_URL=redis://default:password@host:6379
If local proxy variables should not be used for API calls:
unset ALL_PROXY all_proxy
Paper-scale configs are written as tmuxinator files, so install tmuxinator first (the command below is for Debian or Ubuntu):
sudo apt install tmuxinator
Run a one-scenario, two-turn smoke test from the repository root:
tmuxinator start \
-p experiments/also/conf/main_experiments/smoke_test.yml \
project_root=$(pwd)
Equivalent direct command:
cd experiments/also
uv run python run_bandit_simulation_context.py \
--batch \
--subset hard \
--max-episodes 1 \
--batch-size 1 \
--selection-mode strategy \
--strategy-version v3 \
--model openrouter/openai/gpt-4o-mini \
--env-model openrouter/openai/gpt-4o-mini \
--reward-eval-model openrouter/openai/gpt-4o-mini \
--bandit-type adversarial \
--optimize both \
--max-turns 2 \
--tag smoke_test \
--output outputs/smoke_test.json
Expected output:
experiments/also/outputs/smoke_test.json
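A small script can confirm that the smoke test actually wrote a readable file. The sketch below makes no assumption about the result schema, which is not documented here: it only checks that the file exists, parses as JSON, and reports the top-level shape. `load_result` and `summarize` are hypothetical helper names.

```python
from __future__ import annotations

import json
from pathlib import Path


def load_result(path: str | Path):
    """Load a JSON result file, failing loudly if it is missing."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"expected smoke-test output at {path}")
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def summarize(data) -> str:
    """Describe the top-level JSON value without assuming a schema."""
    if isinstance(data, dict):
        return f"object with keys: {sorted(data)}"
    if isinstance(data, list):
        return f"list with {len(data)} items"
    return type(data).__name__


if __name__ == "__main__":
    # Path relative to the repository root, as listed above.
    print(summarize(load_result("experiments/also/outputs/smoke_test.json")))
```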
Full paper runs use strategy mode with the V3 strategy space. Generate the cache once before launching batch experiments:
cd experiments/also
uv run python scripts/generate_strategy_cache.py \
--subset hard \
--strategy-version v3 \
--cache-dir cache/strategy_embeddings_v3_slim \
--skip-existing
From the repository root:
tmuxinator start \
-p experiments/also/conf/main_experiments/adversarial_v3_hard.yml \
project_root=$(pwd) \
batch=40 \
eta=0.5
The main config launches P1-only, P2-only, and bilateral optimization panes for the hard split.
| Method | Config |
|---|---|
| Original / no optimization | experiments/also/conf/main_experiments/baseline_v3.yml |
| ALSO / adversarial bandit | experiments/also/conf/main_experiments/adversarial_v3_hard.yml |
| OPRO | experiments/also/conf/main_experiments/opro_v3.yml |
| EvoPrompt | experiments/also/conf/main_experiments/evoprompt_v3.yml |
| PromptBreeder | experiments/also/conf/main_experiments/promptbreeder_v3.yml |
| Neural UCB | experiments/also/conf/main_experiments/neural_ucb_no_ctx_v3.yml |
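When launching several methods in sequence, the table can be turned into a lookup that builds the tmuxinator command line. This is a convenience sketch, not a script shipped with the repository: the short method keys and `tmuxinator_command` are hypothetical, and only the `project_root`, `batch`, and `eta` settings appear in this README, so any other setting name would need to be checked against the config file.

```python
import shlex

CONFIG_DIR = "experiments/also/conf/main_experiments"

# Short keys are illustrative; file names are taken from the table.
CONFIGS = {
    "original": "baseline_v3.yml",
    "also": "adversarial_v3_hard.yml",
    "opro": "opro_v3.yml",
    "evoprompt": "evoprompt_v3.yml",
    "promptbreeder": "promptbreeder_v3.yml",
    "neural_ucb": "neural_ucb_no_ctx_v3.yml",
}


def tmuxinator_command(method: str, project_root: str, **settings) -> list:
    """Build the argument list for `tmuxinator start` for one method."""
    if method not in CONFIGS:
        raise KeyError(f"unknown method {method!r}; choose from {sorted(CONFIGS)}")
    args = [
        "tmuxinator",
        "start",
        "-p",
        f"{CONFIG_DIR}/{CONFIGS[method]}",
        f"project_root={project_root}",
    ]
    # Extra key=value settings are passed through to the tmuxinator config.
    args += [f"{key}={value}" for key, value in settings.items()]
    return args


command = tmuxinator_command("opro", "/path/to/ALSO", batch=40)
print(shlex.join(command))
# tmuxinator start -p experiments/also/conf/main_experiments/opro_v3.yml project_root=/path/to/ALSO batch=40
```

Returning a list rather than a string makes it safe to hand the command to `subprocess.run` without shell quoting issues.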
Example:
tmuxinator start \
-p experiments/also/conf/main_experiments/opro_v3.yml \
project_root=$(pwd) \
batch=40
For a smaller command-line run without tmuxinator:
cd experiments/also
uv run python run_bandit_simulation_context.py \
--selection-mode strategy \
--strategy-version v3 \
--context-embedding \
--embedding-model qwen/qwen3-embedding-8b \
--context-embedding-dim 4096 \
--batch \
--subset hard_small \
--batch-size 14 \
--no-mask-unselected-scores \
--model openrouter/deepseek/deepseek-v3.2 \
--reward-eval-model openrouter/deepseek/deepseek-v3.2 \
--bandit-type adversarial \
--optimize both \
--eta 10 \
--depth 2 \
--max-turns 20 \
--push-to-db \
--strategy-cache-dir cache/strategy_embeddings_v3_slim \
--tag-prefix reproduction
List available experiment tags:
cd experiments/also
uv run python evaluate_by_tag.py --list-tags
Evaluate one run:
uv run python evaluate_by_tag.py \
--tag reproduction_bandit_adversarial_both_hard_small \
--eval-set hard
Compare multiple runs and export tables:
uv run python evaluate_by_tag.py \
--tags tag_a tag_b tag_c \
--output results/comparison.csv \
--output-xlsx results/comparison.xlsx \
--export-csv results/tables \
--save-all
Run focused artifact tests:
uv run pytest experiments/also/tests -q
Check retained entrypoints:
uv run python -m py_compile \
experiments/also/run_bandit_simulation_context.py \
experiments/also/evaluate_by_tag.py \
experiments/also/calculate_cost.py \
experiments/also/scripts/generate_strategy_cache.py
The experiment environment is built on the Sotopia social simulation framework. The project-page style follows common academic project-page conventions and links to the public ALSO page for figures, ablations, and qualitative examples.