# Practice implementation of a LangChain-based FFmpeg agent

Project period: Jan 13, 2026 – present

This project is under active development. Some components are incomplete or unstable, as the current focus is on validating the overall agent pipeline design rather than full execution.
The goal of this practice project is not to build a fully automated video editing system. Instead, it focuses on exploring an agent-based pipeline that translates high-level, natural language content creation goals (e.g., “Create a YouTube Shorts video”) into executable commands.
The primary objective is to observe where and why the pipeline succeeds or fails, and to analyze failure cases not as simple errors, but as signals indicating missing structural information, insufficient intermediate representations, or inadequate learning signals.
At the current stage, the focus is on validating the agent flow and chain structure rather than on model performance itself. Therefore, the entire pipeline is initially implemented with a capable general-purpose language model (e.g., gpt-4o-mini) accessed via API. Once the behavior and failure points of the current agent pipeline have been sufficiently observed, the model will be replaced with locally executable open-weight LLMs or SLMs (e.g., via Ollama or Hugging Face).
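To make that swap cheap, the model choice can be isolated behind a single configuration switch. The sketch below is hypothetical (the field names, defaults, and `make_llm_config` helper are illustrative, not the project's actual `configs/llm.py`), but it shows the idea: chains depend on a config object, never on a concrete provider.

```python
# Hypothetical sketch of a switchable LLM configuration. Backend names,
# models, and defaults are illustrative assumptions, not the project's
# actual configs/llm.py.

from dataclasses import dataclass

@dataclass
class LLMConfig:
    backend: str            # "openai" (API) or "ollama" (local)
    model: str              # e.g. "gpt-4o-mini" or an open-weight model
    temperature: float = 0.0

def make_llm_config(backend: str = "openai") -> LLMConfig:
    """Return a config for the chosen backend; swapping backends later
    should not require touching any chain code."""
    if backend == "openai":
        return LLMConfig(backend="openai", model="gpt-4o-mini")
    if backend == "ollama":
        return LLMConfig(backend="ollama", model="llama3")
    raise ValueError(f"unknown backend: {backend}")
```

With this shape, replacing the API model with a local one is a one-argument change at the pipeline entry point.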
## a_chain: Goal Decomposition and Task Planning

- The purpose of `a_chain` is to transform a user’s high-level, natural language goal into an ordered sequence of concrete editing tasks.
- Formally, the chain is defined as:

```
a_chain  = a_chain1 | a_chain2
a_chain1 = user_goal | LLM | planning
a_chain2 = planning  | LLM | task_sequence
```
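The two-stage flow above can be sketched with plain callables, using a stub in place of the real LLM; the prompts and canned responses here are illustrative only.

```python
# Sketch of a_chain's two stages with a stubbed LLM, mirroring the
# user_goal -> planning -> task_sequence flow. The stub's prompts and
# canned responses are placeholders for real LLM calls.

def stub_llm(prompt: str) -> str:
    # Stand-in for an LLM call; returns canned text for illustration.
    if "Plan the edits" in prompt:
        return "1. trim intro; 2. crop to 9:16; 3. add captions"
    return "trim intro | crop to 9:16 | add captions"

def a_chain1(user_goal: str) -> str:
    """user_goal | LLM | planning"""
    return stub_llm(f"Plan the edits for: {user_goal}")

def a_chain2(planning: str) -> str:
    """planning | LLM | task_sequence"""
    return stub_llm(f"Turn this plan into ordered tasks: {planning}")

def a_chain(user_goal: str) -> list[str]:
    """a_chain1 | a_chain2, split into a concrete task list."""
    return [t.strip() for t in a_chain2(a_chain1(user_goal)).split("|")]
```

In the actual project these stages would be LangChain runnables composed with `|`; the sketch keeps the same composition without the dependency.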
## b_chain: Capability Analysis and Intermediate Representation

- The `b_chain` is responsible for interpreting each task generated by `a_chain` and determining whether it can be directly executed using FFmpeg.
- It consists of two parallel sub-chains:

```
b_chain  = b_chain1 ⊕ b_chain2
b_chain1 = task | LLM | capability_analysis
b_chain2 = task | LLM | structured_representation
```

### b_chain1 (Capability Analysis)

- Determines whether a given task is executable via FFmpeg without additional human input.
- For example, tasks such as “review the video” or “decide the clip order” are explicitly classified as non-executable.

### b_chain2 (Structured Representation)

- For tasks deemed executable, the LLM generates a loosely structured intermediate representation.
- Notably, no fixed DSL schema is predefined.
- This design choice allows observation of whether FFmpeg commands can still be generated from inconsistent or partially structured representations.
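A minimal sketch of the two parallel sub-chains over one task, with a keyword heuristic standing in for the LLM's executability judgment (the heuristic, field names, and merge logic are assumptions for illustration):

```python
# Sketch of b_chain: two sub-chains over the same task, one classifying
# executability, the other emitting a loose intermediate representation.
# The keyword heuristic is a stand-in for the LLM's judgment.

NON_EXECUTABLE_HINTS = ("review", "decide", "choose")

def b_chain1(task: str) -> dict:
    """task | LLM | capability_analysis (stubbed with a keyword heuristic)."""
    executable = not any(h in task.lower() for h in NON_EXECUTABLE_HINTS)
    return {"task": task, "executable": executable}

def b_chain2(task: str) -> dict:
    """task | LLM | structured_representation; deliberately loose, no fixed schema."""
    return {"operation": task.split()[0].lower(), "raw_task": task}

def b_chain(task: str) -> dict:
    """Run both sub-chains on the same task and merge their outputs."""
    result = b_chain1(task)
    if result["executable"]:
        result["representation"] = b_chain2(task)
    return result
```

Note how the representation stays schema-free on purpose: downstream command synthesis has to cope with whatever shape comes out.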
## c_chain: Command Synthesis

- The final stage, `c_chain`, is responsible for synthesizing executable FFmpeg commands.

```
c_chain = execution_context | LLM | ffmpeg_command
```

The input to `c_chain` is an execution context, which aggregates:

- the original task,
- the capability analysis result,
- the structured intermediate representation,
- and information about previously generated files.

Using this aggregated context, the LLM generates a raw FFmpeg command without executing it.
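The execution context can be sketched as a plain aggregation step plus a prompt renderer; the function and field names below are illustrative, not the project's actual `context_packing.py` interface.

```python
# Sketch of the execution context fed into c_chain: a dict aggregating the
# task, capability analysis, intermediate representation, and prior outputs.
# Function and field names are illustrative assumptions.

def pack_execution_context(task, capability, representation, prior_files):
    """Aggregate everything c_chain needs into one structure."""
    return {
        "task": task,
        "capability_analysis": capability,
        "structured_representation": representation,
        "previous_files": list(prior_files),
    }

def c_chain_prompt(context: dict) -> str:
    """Render the context into a single prompt. The LLM's reply would be the
    raw FFmpeg command; nothing is executed at this stage."""
    return (
        f"Task: {context['task']}\n"
        f"Capability: {context['capability_analysis']}\n"
        f"Representation: {context['structured_representation']}\n"
        f"Existing files: {', '.join(context['previous_files'])}\n"
        "Output a single FFmpeg command."
    )
```

Keeping synthesis separate from execution makes failed commands observable artifacts rather than silent runtime errors.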
## Project Structure

```
Practice_LangChain/
├── agent/
│   ├── a_chain.py            # Task planning
│   ├── b_chain.py            # Capability analysis & intermediate representation
│   ├── c_chain.py            # FFmpeg command synthesis
│   └── agent_runner.py       # Orchestrates the full multi-chain pipeline
│
├── prompts/
│   ├── a_chain_prompts.py    # Prompts for planning and task generation
│   ├── b_chain_prompts.py    # Prompts for capability analysis & structuring
│   └── c_chain_prompts.py    # Prompts for FFmpeg command generation
│
├── utils/
│   ├── task_parser.py        # Parses task sequences from a_chain output
│   └── context_packing.py    # Aggregates execution context across chains
│
├── tools/
│   └── ffmpeg_executor.py    # Executes FFmpeg from generated commands
│
├── configs/
│   └── llm.py                # LLM configuration (API / local model switchable)
│
└── README.md
```
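Since the generated commands come from an LLM, the executor is the natural place to guard them. This is a hypothetical sketch of what `ffmpeg_executor.py` could look like (the `run_ffmpeg` name and `dry_run` flag are assumptions): it parses the raw command string without shell interpolation and can log instead of run while the pipeline is still unstable.

```python
# Hypothetical sketch of an FFmpeg executor: run an LLM-generated command
# via subprocess without shell interpolation. dry_run lets the pipeline
# log parsed commands instead of executing them.

import shlex
import subprocess

def run_ffmpeg(command: str, dry_run: bool = True):
    """Split the raw command string into argv and optionally execute it."""
    argv = shlex.split(command)
    if not argv or argv[0] != "ffmpeg":
        raise ValueError(f"refusing to run non-ffmpeg command: {command!r}")
    if dry_run:
        return argv  # inspect/log the parsed command instead of running it
    return subprocess.run(argv, capture_output=True, text=True, check=False)
```

Rejecting anything that is not an `ffmpeg` invocation keeps a malformed or hallucinated command from reaching the shell.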