A research project into generative models that refine text using Vi command sequences.
ViLM is not an editor. It is a research project to design a new type of generative model.
The core idea is to move beyond standard autoregressive generation (which is forward-only) and create a model that generates text by refining it, much like a human writer. This model learns to build its output by generating a sequence of Vi commands, allowing it to go back, delete, insert, and move.
This approach is inspired by diffusion models, where an output is iteratively refined.
- Standard LLM:
  prompt -> token 1 -> token 2 -> token 3 ... (cannot go back)
- ViLM:
  prompt -> iHello<Esc> -> Oworld<Esc> -> ggkP<Esc> -> ...
The ViLM model uses a classical transformer architecture, but it is trained as a stateful agent. It learns a policy to edit a text buffer. The "refinement steps" are Vi commands, giving the model the physical ability to navigate and edit its own output in a non-linear way.
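To make the idea of an editing policy concrete, here is a minimal sketch of a text buffer driven by a handful of Vi-style keys. It is a toy written for illustration, not the vi_gym implementation: the class name `ToyBuffer` and the supported subset (`i`, `o`, `O`, `j`, `k`, `x`, `<Esc>`) are assumptions, and real Vi has many more rules.

```python
import re

# Split a command string into single keys, keeping <Esc> as one key.
KEY_RE = re.compile(r"<Esc>|.", re.DOTALL)


class ToyBuffer:
    """Hypothetical toy buffer covering a tiny subset of Vi."""

    def __init__(self):
        self.lines = [""]
        self.row = 0
        self.col = 0
        self.mode = "Normal"

    def feed(self, commands):
        """Apply a command string key by key and return self."""
        for key in KEY_RE.findall(commands):
            if self.mode == "Insert":
                self._insert_key(key)
            else:
                self._normal_key(key)
        return self

    def _clamp_col(self):
        # In Normal mode the cursor rests on a character, never past the end.
        self.col = min(self.col, max(0, len(self.lines[self.row]) - 1))

    def _insert_key(self, key):
        if key == "<Esc>":
            self.mode = "Normal"
            # Like Vi, step back one column when leaving Insert mode.
            self.col = max(0, self.col - 1)
        else:
            line = self.lines[self.row]
            self.lines[self.row] = line[:self.col] + key + line[self.col:]
            self.col += 1

    def _normal_key(self, key):
        if key == "i":
            self.mode = "Insert"
        elif key == "o":  # open a line below and start inserting
            self.row += 1
            self.lines.insert(self.row, "")
            self.col, self.mode = 0, "Insert"
        elif key == "O":  # open a line above and start inserting
            self.lines.insert(self.row, "")
            self.col, self.mode = 0, "Insert"
        elif key == "j":
            self.row = min(self.row + 1, len(self.lines) - 1)
            self._clamp_col()
        elif key == "k":
            self.row = max(self.row - 1, 0)
            self._clamp_col()
        elif key == "x":  # delete the character under the cursor
            line = self.lines[self.row]
            self.lines[self.row] = line[:self.col] + line[self.col + 1:]
            self._clamp_col()
        # Any other key is ignored in this sketch.


buf = ToyBuffer().feed("iHello<Esc>Oworld<Esc>")
print(buf.lines)  # ['world', 'Hello']
```

The second command inserts a line above text that was already written, which a forward-only token stream cannot do.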
The goal is to research a model that can refine text, fix its own mistakes, and "think" in terms of actions rather than just predicting the next token.
This repository is organized logically across environment, model, and data:
- ./vi_gym/: The Gym Environment. A minimal, from-scratch Rust application simulating the Vi editor deterministically. It exposes an HTTP API (/init_session, /get_state, /act) that renders an XML-like observation of the notepad and cursor and accepts Vi actions. This acts as our low-latency backend.
- ./train_utils/: The Agent & Training Scripts. Python tools for the local inference loop connecting to the Rust server (agent.py), causal dataset generation for ASCII art tasks (generate_causal_dataset.py), and Supervised Fine-Tuning.
- ./data/: Datasets. Stores raw and processed training outputs, like our structural ASCII learning dataset.
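As a rough illustration of how a Python client might talk to the three endpoints, the sketch below builds requests with the standard library. Only the endpoint paths come from this README. The host and port, the use of JSON, the choice of POST versus GET, and the `session_id` and `command` field names are all assumptions; check the Rust server source for the actual contract.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumption: placeholder host and port


def build_request(endpoint, payload=None, base_url=BASE_URL):
    """Build a request for one of the documented endpoints.

    Assumption: the server accepts JSON bodies. With a payload the request
    is sent as POST; without one it is sent as GET.
    """
    data = None
    headers = {}
    if payload is not None:
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(base_url + endpoint, data=data, headers=headers)


def send(request, timeout=5.0):
    """Send a request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Hypothetical payload shape: field names are illustrative only.
act_request = build_request("/act", {"session_id": "demo", "command": "3G"})
print(act_request.get_method(), act_request.full_url)
```

Keeping request construction separate from `send` makes the payload shape easy to inspect and adjust without a running server.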
We are currently building in phases. Phase 1 (SFT) is complete.
We trained an experimental 0.8B-parameter model (ViLM-0.8b), fine-tuned from Qwen/Qwen3.5-0.8B-Base. It acts as a behavioral cloning baseline for grid navigation on ASCII-based manipulation tasks.
- Successes: The model learned the strict grammar of Vi. It can navigate (h, j, k, l), use counts (10j, 12o), and understands the relationship between Insert Mode (i, a, o) and escaping (<Esc>). It developed an emergent "Canvas Builder" routine, autonomously opening blank lines and padding them with spaces to scaffold drawings.
- Limitations: As a 0.8B SFT model, it lacks zero-shot spatial reasoning. It often falls into "macro-loops" (e.g., repeatedly trying to move up when already at the top boundary): the deterministic environment rejects the move and returns an identical state, so the model sees the same input and emits the same command again. As a current mitigation, inference uses temperature sampling.
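The macro-loop failure and its mitigation can be sketched in a few lines. Because the environment is deterministic, a rejected move shows up as an unchanged observation, and temperature sampling gives the model a chance to pick a different action next time. The function names below are hypothetical and the real agent.py may do this differently.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng=random):
    """Sample an index from logits after temperature scaling (softmax sampling)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def is_noop(previous_observation, next_observation):
    """True when an action left the environment state unchanged."""
    return previous_observation == next_observation
```

A low temperature concentrates probability on the highest logit, which reproduces the loop; a higher temperature flattens the distribution so that alternative actions are sampled more often.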
To break the model out of purely imitating patterns and teach it actual spatial reasoning and task completion, we will transition to RL (GRPO/PPO). This step is not fully worked out yet.
The model is trained on a precise, XML-based communication format.
The LLM's vocabulary must be extended to include:
- <BOS>: (Input only) Signals the start of a prompt.
- </command>: (Output only) Signals the LLM has finished its command sequence.
- <Esc>: (Output only) The token for the Esc key.
- <Enter>: (Output only) The token for the Enter/Return key (used for literal newline insertion).
- <Tab>: (Output only) The token for the Tab key.
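The sketch below shows one way the output-side tokens could be separated from literal characters when turning a model output string into key presses. It works at the character level for clarity, whereas the actual model uses a subword tokenizer with these tokens added to its vocabulary. The function name `split_keys` is hypothetical.

```python
import re

SPECIAL_KEYS = ("<Esc>", "<Enter>", "<Tab>")
END_OF_COMMAND = "</command>"

# Try the special tokens first, then fall back to any single character.
_TOKEN_RE = re.compile(
    "|".join(re.escape(t) for t in SPECIAL_KEYS + (END_OF_COMMAND,)) + "|.",
    re.DOTALL,
)


def split_keys(model_output):
    """Split raw model output into key presses, stopping at </command>."""
    keys = []
    for token in _TOKEN_RE.findall(model_output):
        if token == END_OF_COMMAND:
            break
        keys.append(token)
    return keys


print(split_keys("ihi<Esc></command>"))  # ['i', 'h', 'i', '<Esc>']
```

Anything after `</command>` is discarded, which matches its role as the end-of-sequence signal.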
The LLM generates non-interactive Vi commands based on its knowledge of the state.
Input (from Environment → LLM):
<BOS>
<notepad>
1 |<cursor>const name = "world";
2 |
3 |function hello() {
4 | console.log("Hello, " + name);
5 |}
6 |
7 |function goodbye() {
8 | console.log("Goodbye, " + name);
9 |}
10|
</notepad>
<mode>Normal</mode>
<prompt>Find the `hello` function, copy the whole function, and paste it below the `goodbye` function.</prompt>
<command>
Output (from LLM → Environment):
3GV$%y9Gp</command>
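For reference, the input layout in the example above can be produced by a small rendering function. This is a sketch inferred from that example only: the two-character left-aligned line number followed by a pipe, and the function name `render_observation`, are assumptions rather than the Rust server's actual code.

```python
def render_observation(lines, cursor_row, cursor_col, mode, prompt):
    """Render buffer state in the XML-like layout shown in the example."""
    out = ["<BOS>", "<notepad>"]
    for index, text in enumerate(lines):
        if index == cursor_row:
            # Mark the cursor position inside the current line.
            text = text[:cursor_col] + "<cursor>" + text[cursor_col:]
        # Line numbers are left-aligned to two characters, then a pipe.
        out.append(f"{index + 1:<2}|{text}")
    out += [
        "</notepad>",
        f"<mode>{mode}</mode>",
        f"<prompt>{prompt}</prompt>",
        "<command>",  # left open: the model completes the command
    ]
    return "\n".join(out)


print(render_observation(["const a = 1;", ""], 0, 0, "Normal", "Delete line 1."))
```

The observation ends with an open `<command>` tag, so the model's continuation is the command sequence closed by `</command>`.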
The LLM generates literal text and must explicitly use <Esc> to return to Normal mode.
Input (from Environment → LLM):
<BOS>
<notepad>
1 |const name = "world";
2 |
3 |function hello() {
4 | console.log("Hello, " + name);<cursor>
5 |}
...
</notepad>
<mode>Insert</mode>
<prompt>Add a new line that says 'This is an example.' and then stop.</prompt>
<command>
Output (from LLM → Environment):
<Enter> This is an example.<Esc></command>
If you are looking to test the model dynamically, you need both the Rust backend server and the Python inference loop running.
- Start the Environment (Rust Server):

      cd vi_gym
      cargo run --release -- --serve

- Run the Agent (Python). Note: ensure you have uv installed, as we use uv.lock for dependency management.

      cd train_utils
      uv run agent.py
This setup hooks the LLM to the deterministic Vi environment, letting you observe it taking actions in real time!
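Conceptually, the loop that connects the model to the environment looks like the sketch below. It is not the code in agent.py: the function `run_episode` and its callable parameters are hypothetical stand-ins for fetching the state, generating a command with the model, and sending the action.

```python
def run_episode(get_observation, generate_command, act, is_done, max_steps=20):
    """Observe, generate, act; stop when done or after max_steps."""
    history = []
    for _ in range(max_steps):
        observation = get_observation()
        if is_done(observation):
            break
        command = generate_command(observation)
        act(command)
        history.append((observation, command))
    return history


# Stand-in environment: the "buffer" is just the list of commands so far.
state = []
steps = run_episode(
    get_observation=lambda: "".join(state),
    generate_command=lambda observation: "x",
    act=state.append,
    is_done=lambda observation: len(observation) >= 3,
)
print(len(steps))  # 3
```

Passing the environment and model in as callables keeps the loop independent of how the server is reached or which model produces the commands.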
If you expect something even vaguely useful for now, be ready to be disappointed. The current SFT model is very much a proof of concept and struggles with basic instructions, even ones very close to its training data. The next phase of RL training is where we hope to see significant improvements in the model's ability to reason spatially and complete tasks.