An LLM-powered agent that streamlines the discovery and analysis of prominent computational circuits in large language models, built on circuit-tracer.
git clone https://github.com/vedpatwardhan/llm-interpret-agent.git
cd llm-interpret-agent
uv venv
uv sync
git submodule update --init
mkdir -p circuit-tracer/graph_files
cp graph-metadata.json circuit-tracer/graph_files
uv pip install -e circuit_tracer
Go to Neuronpedia, generate a new graph or pick an existing one, and download its JSON (through the Graph Info option) into the circuit-tracer/graph_files folder.
Then copy the "metadata" object from the downloaded JSON into circuit-tracer/graph_files/graph-metadata.json as a list item under "graphs".
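The copy step can also be scripted. The following is a minimal sketch, not part of the repository: the helper name register_graph is hypothetical, and it assumes graph-metadata.json holds a JSON object with a "graphs" list and that the downloaded file has a top-level "metadata" key, as described above.

```python
import json
from pathlib import Path


def register_graph(downloaded_path, metadata_path):
    """Append the downloaded graph's "metadata" object to the "graphs" list.

    Hypothetical helper: assumes the downloaded JSON has a top-level
    "metadata" key and the metadata file is an object with a "graphs" list.
    """
    downloaded = json.loads(Path(downloaded_path).read_text(encoding="utf-8"))
    metadata_file = Path(metadata_path)
    index = json.loads(metadata_file.read_text(encoding="utf-8"))

    entry = downloaded["metadata"]
    graphs = index.setdefault("graphs", [])
    if entry not in graphs:  # skip graphs that were already registered
        graphs.append(entry)

    metadata_file.write_text(json.dumps(index, indent=2), encoding="utf-8")
    return index
```

For example, register_graph("circuit-tracer/graph_files/my-graph.json", "circuit-tracer/graph_files/graph-metadata.json"), where my-graph.json stands in for whatever file name you downloaded.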
Start the circuit-tracer server:
uv run start_server.py
To run the analysis, first view the attribution graph for your chosen example through the server.
Then identify the input and output nodes you care about, along with the overall goal, and edit main.py accordingly (use node ids in the format of the clickedId that appears in the URL after you click a node).
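To avoid copying node ids by hand, you can read the clickedId straight out of the URL. This is a small sketch under one assumption: that clickedId appears as an ordinary query parameter. If the server places it elsewhere in the URL (for example in the fragment), the parsing needs adjusting. The function name and the sample URL in the note below are illustrative only.

```python
from urllib.parse import parse_qs, urlparse


def clicked_id_from_url(url):
    """Return the clickedId query parameter from a graph URL, or None.

    Assumes clickedId is a regular query parameter of the URL.
    """
    query = parse_qs(urlparse(url).query)
    values = query.get("clickedId")
    return values[0] if values else None
```

Calling clicked_id_from_url("http://localhost:8000/?slug=demo&clickedId=some_node_id") returns the string to paste into main.py; the host, port, and slug shown here are placeholders.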
uv run main.py
Once done, it outputs a URL where you can view the grouping (provided the circuit-tracer server is running).
- Select the input and output nodes that we want to analyze, along with the overall goal we're trying to achieve with the analysis.
- Select the top feature nodes associated with either the input or the output nodes, ranked by influence.
- Check each such node for its relevance to the input-output behaviour of the model, based on the top 1% of examples.
- Recursively break down the nodes into separate groups, directed by their contents and the overall goal we're trying to achieve. After every grouping step, classify the nodes in that group among its sub-groups and iterate on those that still have more than 5 nodes.
- Generate the url to view the attribution graph with those nodes pinned and grouped.
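The recursive grouping step can be sketched as follows. This is an illustration of the control flow only, not the agent's actual code: split stands in for the LLM call that proposes sub-groups and assigns nodes to them, and the max_depth guard and the "no progress" check are assumptions added so the sketch always terminates.

```python
MAX_GROUP_SIZE = 5  # groups with more nodes than this are split again


def group_recursively(nodes, split, max_size=MAX_GROUP_SIZE, depth=0, max_depth=5):
    """Split nodes into nested groups until every leaf has at most max_size nodes.

    split(nodes) is a hypothetical callable returning {label: [node, ...]};
    in the agent this role is played by the LLM, guided by the overall goal.
    """
    if len(nodes) <= max_size or depth >= max_depth:
        return list(nodes)

    subgroups = split(nodes)
    # Stop if the split made no progress, to avoid recursing forever.
    if len(subgroups) <= 1 or any(len(m) == len(nodes) for m in subgroups.values()):
        return list(nodes)

    return {
        label: group_recursively(members, split, max_size, depth + 1, max_depth)
        for label, members in subgroups.items()
    }


def split_in_half(nodes):
    """Toy stand-in for the LLM: split the list into two halves."""
    mid = len(nodes) // 2
    return {"left": nodes[:mid], "right": nodes[mid:]}
```

With twelve nodes and the toy splitter, group_recursively(nodes, split_in_half) produces two groups of six, each of which is split again into two groups of three.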
(The localhost url for this particular graph is here)
- The first section of this example demonstrates how interventions on certain features can drastically change the output of the model.
- My primary goal here was to measure how much intervening on the nodes involved in recognizing Michael Jordan affects the identification of the sport.
- I wanted to have a repeatable process to generate those supernodes in order to quickly find more such nodes that would affect the output.
- Using some of the nodes selected in the graph, I was able to significantly reduce the probability of the identified sport (available in demo.ipynb).
- There seems to be some redundancy: other features hold similar information to the intervened ones but aren't activated at first, so an iterative process for identifying new nodes that show this behaviour would be useful.
- The agent has no control over the actual interventions and cannot observe their effects, so giving it end-to-end access to the full process could improve performance further.

