This project is a web-based application that pits a Large Language Model (LLM) against the classic Tower of Hanoi puzzle. You can choose from different LLM providers (Ollama, Gemini, Anthropic, OpenAI) and models to see how they perform. The application provides a visual representation of the game and a chat interface to observe the LLM's reasoning process.
These instructions will guide you through setting up and running the project on your local machine.
First, you need to clone this repository to your computer. Open your terminal or command prompt and run the following command:
git clone https://github.com/borebot/tower_of_hanoi_vs_llm.git
cd tower_of_hanoi_vs_llmIt's highly recommended to use a virtual environment to keep the project's dependencies separate from your system's Python installation.
On macOS and Linux:
python3 -m venv venv
source venv/bin/activateOn Windows:
python -m venv venv
.\venv\Scripts\activateWith your virtual environment activated, install the required Python packages using pip:
pip install -r requirements.txtThe project uses a .env file to manage secret keys and other settings.
- Find the file named
.env.examplein the main directory. - Make a copy of this file and rename it to
.env. - Open the new
.envfile in a text editor and fill in the required values.
Here's a breakdown of the variables:
LLM_PROVIDER: This is set in the UI. (ollama,gemini,claude, oropenai).LLM_MODEL: Set this in the UI as well. The list should auto-populate for Ollama, and there are hardcoded lists of modesl for the cloud providers that you might have to edit if this repo is outdated.OLLAMA_API_URL: If you are using Ollama, set the URL for your Ollama server (e.g.,http://localhost:11434).GEMINI_API_KEY,CLAUDE_API_KEY,OPENAI_API_KEY: Add your API keys for the respective services you want to use. You only need to fill in the key for the provider you are using.
You can change the prompts used to instruct the LLM by editing the config/prompts.json file.
system_prompt: This is the main instruction given to the LLM at the start of the game. It sets the rules and the expected output format.user_prompts: These are the prompts sent to the LLM during the game to inform it of the current state or errors.
The application parses the LLM's response to find a valid move using a regular expression (regex). This is crucial for the game to function correctly.
- Where to find it: The regex is located in the
parse_movemethod within thellm_interface.pyfile. - How it works: The current regex
([A-C])\s+to\s+([A-C])\s*$is designed to find a move in the format "A to C" at the very end of the LLM's output. It's case-insensitive and allows for extra spaces. - How to adjust it: If you find that your chosen LLM provides moves in a different format, you may need to adjust this regex. For example, if the LLM outputs "Move from A to C", you might change the regex to something like
Move from ([A-C]) to ([A-C]). The current system prompt tries to force a certain style of output, but YMMV.
Once you have everything configured, you can start the web server. Run the following command in your terminal (make sure your virtual environment is still active):
flask runThen, open your web browser and go to http://127.0.0.1:5000 (or the address shown in your terminal).
- Provider Testing: This application has been primarily tested with Ollama and Gemini. While support for OpenAI and Anthropic is included, the experience may vary. You might need to adjust the prompts or the regex for optimal performance with these providers.
- Model Performance: The ability of the LLM to play the game correctly heavily depends on the model you choose. Some models may struggle to follow the rules or provide moves in the correct format.
At the end of a run, the full message payload (containing the back-and-forth) will be saved in game_logs. (The log ends at the last message the LLM sends back.)
The UI is equipped with a pause function if you feel the need to nudge the LLM in a certain direction. If you press pause, you can enter a message. Chances are the LLM will ignore the message if the system prompt is strong enough, but it really depends on the model.
Have fun!