A local AI video-editing automation toolkit for viral short-form clips, YouTube clipping, masked/inverted captions, transcript-aware B-roll, standalone B-roll discovery, rerenders, and Remotion-based review renders.
Useful search terms this project is built around: AI video editor, YouTube shorts generator, TikTok captions, Reels captions, Remotion captions, automatic B-roll, viral clip finder, faceless video generator, AI shorts automation, podcast clipper, transcript-based video editing, and contextual movie-scene B-roll.
Clone it anywhere:
git clone https://github.com/jongan69/ClipCaptionAI.git
cd ClipCaptionAI
npm install
cp .env.example .env- Make sure
.envcontainsOPENAI_API_KEY=.... - Add
YOUTUBE_API_KEY=...too if you want stronger YouTube B-roll discovery. - Double-click
RUN.command. - Pick a workflow from the menu.
From Terminal:
npm run menuRun a quick local health check:
npm run doctorThe everyday command surface is clipkit:
npm run clipkit -- auto-clips --links links.txt --max-clips 6 --padding-seconds 2
npm run clipkit -- caption --video "/path/to/video.mp4"
npm run clipkit -- enhance --video "/path/to/already-edited.mp4"
npm run clipkit -- broll --prompts broll-prompts.txt --max-downloads 8
npm run clipkit -- rerender --clip 03-your-website-is-leaking-moneyShortcut aliases:
| Command | Use |
|---|---|
npm run menu |
Open the interactive workflow menu. |
npm run doctor |
Check Node, npm, ffmpeg, ffprobe, yt-dlp, .env, and keys. |
npm run clip:auto |
Auto-clip YouTube videos from a links file. |
npm run caption:auto |
Caption any existing video without picking new clips. |
npm run video:enhance |
Add contextual B-roll and captions to an existing edit. |
npm run broll:find |
Find standalone B-roll from text prompts. |
npm run rerender:clip |
Rerender a generated clip after text/style fixes. |
More detailed walkthroughs live in docs/WORKFLOWS.md. GitHub-safe publishing notes live in docs/GITHUB.md.
Each run creates a fresh folder:
outputs/run-YYYY-MM-DD-HHMMSS/
links.txt
manifest.json
caption-style.json
downloads/
generated-assets/
captioned-clips/
The current outputs folder is kept clean by separating every run into its own dated folder. Temporary Remotion media staging is cleaned after each full run finishes.
Run the complete YouTube auto-clipping workflow from links.txt:
npm run processForce AI to pick new clips instead of reusing an existing selection.json:
npm run process -- --reselectLimit the number of clips per source video:
npm run process -- --max-clips 6The double-click RUN.command uses MAX_CLIPS=6 by default. To temporarily change the one-click cap from Terminal:
MAX_CLIPS=10 /Users/jonathangan/Desktop/ClipCaptionAI/RUN.commandAdd more lead-in and tail padding around each AI-selected clip:
npm run process -- --padding-seconds 3Render all selected clips as 9:16 while keeping the full horizontal video visible with black bars:
npm run process -- --vertical-containUse a different links file:
npm run process -- --links "/path/to/links.txt"Useful process options:
| Option | Meaning |
|---|---|
--links FILE |
Links file. Defaults to local links.txt, then /Users/jonathangan/Desktop/Full-Vids/links.txt. |
--out-dir DIR |
Output root. Defaults to outputs. |
--run-name NAME |
Custom run folder name instead of run-YYYY-MM-DD-HHMMSS. |
--max-clips N |
Clips to select per video. Default 3. |
--min-seconds N |
Minimum AI-selected core clip length. |
--max-seconds N |
Maximum AI-selected core clip length. |
--padding-seconds N |
Extra seconds before and after the selected moment. Default 2. |
--review-width N |
Width used when cutting intermediate clips. Default 1280. |
--review-fps N |
Render FPS for review clips. Default 15. |
--selection-model ID |
OpenAI model used for selecting clips. |
--style-config FILE |
Caption style JSON. Defaults to caption-style.json. |
--scene-library DIR |
Folder of tagged scene clips used for context-matched cutaways. |
--context-scenes |
Force-enable transcript-matched scene inserts for this run. |
--disable-context-scenes |
Force-disable scene inserts for this run. |
--reselect |
Ignore existing AI selections and choose again. |
--vertical |
Render as 1080x1920 with video cropped to fill. |
--vertical-contain |
Render as 1080x1920 with full video contained and black bars. |
List clips in the latest run:
npm run rerender:clip -- --listList clips in an older run:
npm run rerender:clip -- --run "/Users/jonathangan/Desktop/ClipCaptionAI/outputs/run-2026-06-20-182812" --listOpen the listed .captions.json file, edit only the "text" values, then rerender:
npm run rerender:clip -- --clip 1Rerender a named clip from an older run:
npm run rerender:clip -- \
--run "/Users/jonathangan/Desktop/ClipCaptionAI/outputs/run-2026-06-20-182812" \
--clip "03-your-website-is-leaking-money"Rerender that older clip as 9:16 contain with black bars:
npm run rerender:clip -- \
--run "/Users/jonathangan/Desktop/ClipCaptionAI/outputs/run-2026-06-20-182812" \
--clip "03-your-website-is-leaking-money" \
--vertical-containBy default, rerenders write *.corrected.mp4 next to the original. To overwrite the original captioned clip:
npm run rerender:clip -- --clip 1 --replaceUseful rerender:clip options:
| Option | Meaning |
|---|---|
--run DIR |
Run folder to use. Defaults to latest outputs/run-*. |
--clip ID |
Clip number, title fragment, slug, or full .captions.json path. |
--list |
Print editable clips for a run. |
--replace |
Overwrite *.captioned.mp4 instead of creating *.corrected.mp4. |
--out FILE |
Write to a custom output path. |
--vertical |
Rerender as 1080x1920 cropped fill. |
--vertical-contain |
Rerender as 1080x1920 contained with black bars. |
--foreground-video FILE |
Optional transparent foreground/subject layer rendered above captions. |
--position NAME |
Override caption position for this render. |
--style-config FILE |
Use a different style JSON. |
--highlight-words CSV |
Override highlighted words for this render. |
You can optionally mix in tagged cutaway footage so the clip bounces between the original speaker footage and context-matched cinematic scenes. By default, the same scene clip is used at most once inside a generated short.
This works with a local curated scene library, and it can also auto-build that library from YouTube videos.
One-off mix command:
npm run scene:mix -- \
--video "/path/to/raw-clip.mp4" \
--captions "/path/to/raw-clip.captions.json" \
--out "/path/to/raw-clip.scene-mix.mp4"Enable it for the full pipeline:
npm run process -- --context-scenesThe same mixed source is reused automatically by rerender:clip when a *.scene-mix.mp4 exists next to the raw clip.
One-off YouTube ingest:
npm run scene:ingest:youtube-cc -- \
--query "money motivation movie scene" \
--max-downloads 2 \
--max-duration-seconds 60If caption-style.json has contextScenes.youtubeIngest.enabled: true, the mixer can also auto-ingest matching clips while it plans cutaways from the transcript.
Use this when you are editing manually and only want related B-roll clips, without running transcription, AI clip selection, captions, or rendering.
- Put one B-roll idea per line in
broll-prompts.txt. - Double-click
BROLL.command, or run:
cd /Users/jonathangan/Desktop/ClipCaptionAI
npm run broll:findEvery run creates a separate folder:
outputs/broll-run-YYYY-MM-DD-HHMMSS/
prompts.txt
manifest.json
01-first-prompt/
clips/
01-01-example.mp4
01-01-example.mp4.scene.json
02-second-prompt/
clips/
The clips are also cached in scene-library/, so repeated prompts do not redownload the same YouTube video when it already exists locally.
Useful broll:find options:
| Option | Meaning |
|---|---|
--prompts FILE |
Prompt text file. Defaults to broll-prompts.txt. |
--out-dir DIR |
Output root. Defaults to outputs. |
--run-name NAME |
Custom output folder name. |
--scene-library DIR |
Reusable clip cache. Defaults to scene-library. |
--max-results N |
YouTube results searched per prompt. |
--max-downloads N |
Clips selected/copied per prompt. |
--max-duration-seconds N |
Reject source videos longer than this. Default 60. |
--min-candidate-score N |
Search score cutoff. Defaults to 5 for manual B-roll finding. |
--max-expanded-queries N |
Search variants per prompt. Defaults to 5. |
--movie-scenes |
Search movie/TV/pop-culture scene queries instead of stock B-roll queries. |
--channel-id ID |
Restrict YouTube search to one channel. |
--no-copy |
Fill/update scene-library only, without creating prompt clip copies. |
Examples:
npm run broll:find -- --prompts "/path/to/ideas.txt" --max-downloads 5npm run broll:find -- \
--prompts broll-prompts.txt \
--run-name broll-money-scenes \
--max-results 12 \
--max-downloads 4 \
--max-duration-seconds 45Movie/TV scene style:
npm run broll:find -- \
--prompts broll-prompts-budapest-movie-scenes.txt \
--run-name broll-budapest-movie-scenes \
--movie-scenes \
--max-results 8 \
--max-downloads 2 \
--max-duration-seconds 240Movie-scene results are candidate references, not rights-cleared assets. Review source, quality, and permissions before using them in a public post.
Use this when you already have a mostly edited video and want ClipCaptionAI to add timed B-roll cutaways plus captions on top. It does not select/cut a new short from a long source. It keeps the full base video timeline and audio, then adds visual inserts where the transcript context benefits from motion or movie/TV-style references.
cd /Users/jonathangan/Desktop/ClipCaptionAI
npm run broll:enhance -- --video "/path/to/already-edited.mp4"Every run creates:
outputs/enhance-run-YYYY-MM-DD-HHMMSS/
manifest.json
assets/
original-name.base-1080x1920.mp4
original-name.captions.json
original-name.broll-mix.mp4
original-name.broll-mix.scene-plan.json
original-name.broll-mix.pop-culture-scenes.json
final/
original-name.broll-captioned.mp4
Useful broll:enhance options:
| Option | Meaning |
|---|---|
--video FILE |
Already-edited base video to enhance. |
--captions FILE |
Use an existing captions JSON instead of transcribing. |
--run-name NAME |
Custom output folder name. |
--max-insertions N |
Override how many B-roll cutaways can be planned. |
--fps N |
Final render FPS. Default 24. |
--fit cover|contain |
Normalize source into 9:16 frame. Default contain. |
--transcription-prompt TEXT |
Helpful words/names for transcription accuracy. |
--disable-youtube-ingest |
Use only clips already in scene-library. |
--movie-scenes |
Prefer movie/TV scene B-roll. This is now the default for broll:enhance. |
--stock-broll |
Use the older literal/stock-style B-roll search for this run. |
--pop-culture-research |
Force movie/TV reference query enrichment. |
--no-render |
Stop after transcription and B-roll mix. |
Example:
npm run broll:enhance -- \
--video "/Users/jonathangan/Desktop/0702.MP4" \
--run-name budapest-existing-edit \
--max-insertions 12 \
--pop-culture-researchThe scene planner can use iconic movie/TV/cartoon/anime/reality/sports-doc references to improve the actual B-roll YouTube searches. For each planned insertion it infers the emotional meaning, finds recognizable scene concepts, then injects those scene searches into the same query list used by YouTube ingest and scene scoring.
The JSON trace files, and optional Markdown trace files, show why certain movie/TV search queries were added.
It runs automatically when caption-style.json has:
"contextScenes": {
"popCultureResearch": {
"enabled": true,
"model": "gpt-4.1",
"candidatesPerSegment": 8,
"useForYoutubeQueries": true,
"maxQueriesPerInsertion": 4,
"minQueryConfidence": 9,
"writeMarkdown": false
}
}Each planned scene mix writes trace files like:
01-example.scene-mix.pop-culture-scenes.json
01-example.scene-mix.pop-culture-scenes.md
Run the pop-culture query pass manually for an existing scene plan:
npm run scene:research-pop-culture -- \
--scene-plan "/path/to/clip.scene-mix.scene-plan.json"Useful options:
| Option | Meaning |
|---|---|
--pop-culture-research |
Force-enable movie/TV candidate research for a scene mix. |
--disable-pop-culture-research |
Skip movie/TV candidate research for a scene mix. |
--model ID |
Model for the manual research command. |
--candidates N |
Candidate scenes per segment, from 5 to 10. |
--json-only |
Skip the companion Markdown report. |
Rights note: public YouTube availability is not treated as clearance. The query system prefers official clip/trailer/promo searches when possible, but you should still use official/licensed/owned/public-domain/stock/AI-generated footage, or manually review rights before using any movie or TV clip.
Drop sound files into sfx-library/, then standardize and index them:
npm run sfx:standardizeFull pipeline runs automatically add low-volume contextual SFX when caption-style.json has soundEffects.enabled: true. The final render uses *.sfx-mix.mp4 as its source, and each mix writes a *.sfx-plan.json next to it so you can inspect exactly which sounds were chosen. By default, the same SFX file is used at most once inside a generated short.
One-off SFX mix:
npm run sfx:mix -- \
--video "/path/to/clip-or-scene-mix.mp4" \
--captions "/path/to/clip.captions.json" \
--out "/path/to/clip.sfx-mix.mp4"Useful full-run options:
| Option | Meaning |
|---|---|
--sound-effects |
Force-enable automatic SFX for this run. |
--disable-sound-effects |
Skip SFX mixing for this run. |
--sfx-library DIR |
Use a different indexed SFX library folder. |
Transcribe one video or clip:
npm run transcribe -- \
--video "/path/to/clip.mp4" \
--out "work/clip.captions.json"Render one clip with an existing captions file:
npm run render:clip -- \
--video "/path/to/clip.mp4" \
--captions "work/clip.captions.json" \
--out "outputs/clip.captioned.mp4"Render only a small frame range for a fast proof:
npm run render:clip -- \
--video "/path/to/clip.mp4" \
--captions "work/clip.captions.json" \
--out "outputs/proof.mp4" \
--frames 140-180Render one clip as 9:16 contain:
npm run render:clip -- \
--video "/path/to/clip.mp4" \
--captions "work/clip.captions.json" \
--out "outputs/clip.vertical-contain.mp4" \
--vertical-containUseful render:clip options:
| Option | Meaning |
|---|---|
--video FILE |
Required source video. |
--captions FILE |
Required caption JSON. |
--out FILE |
Required rendered mp4 path. |
--width N / --height N |
Force output dimensions. |
--fps N |
Force output FPS. |
--vertical |
1080x1920 cropped fill. |
--vertical-contain |
1080x1920 contained with black bars. |
--foreground-video FILE |
Optional transparent foreground/subject layer rendered above captions. |
--fit cover|contain |
CSS video fit. Normally controlled by caption-style.json. |
--position NAME |
left-hook, right-hook, lower-left, center-bottom, or center-impact. |
--combine-ms N |
Caption grouping window. |
--highlight-words CSV |
Words to render in the alternate font. |
--text-opacity N |
Caption fill opacity from 0 to 1. |
--uppercase |
Force caption text uppercase. |
--frames START-END |
Render only a frame range for proofing. |
Edit:
caption-style.json
The one-click runner, npm run process, npm run smart:clips, npm run render:clip, and npm run rerender:clip all read this file automatically unless you pass --style-config.
| Field | Example | What it controls |
|---|---|---|
position |
"center-impact" |
Caption preset. Supported: left-hook, right-hook, lower-left, center-bottom, center-impact. |
customPosition |
{ "right": "9%", "top": "48%" } |
Overrides preset CSS positioning. Use percentages or CSS lengths. |
verticalContain |
true |
Exports 1080x1920 and keeps the full source visible with black bars. |
outputAspect |
"9:16" |
Makes render commands treat the output as vertical. Use "source" for source aspect. |
fit |
"contain" |
Video object fit. contain shows the full video; cover fills/crops. |
videoFilter |
"contrast(1.08) saturate(1.14)" |
CSS filter applied to the video before captions. Use null for none. |
videoBorderRadius |
"38px" |
Rounds the video corners, useful for a repost/screen-recording feel. |
backgroundOverlay |
CSS gradient or null |
Optional readability overlay above video and behind captions. |
For full horizontal video inside 9:16:
{
"verticalContain": true,
"outputAspect": "9:16",
"fit": "contain"
}For normal source-aspect exports:
{
"verticalContain": false,
"outputAspect": "source",
"fit": "cover"
}caption-style.json can also drive transcript-matched scene inserts:
| Field | Example | What it controls |
|---|---|---|
contextScenes.enabled |
true |
Turns scene mixing on for process / smart:clips. |
contextScenes.libraryDir |
"./scene-library" |
Folder containing your scene clips or an index.json manifest. |
contextScenes.planningModel |
"gpt-4.1-mini" |
OpenAI model used to choose cutaway timing/query ideas. |
contextScenes.maxInsertionsPerClip |
10 |
Maximum scene inserts in one short clip. Higher values create faster visual pacing. |
contextScenes.minInsertionSeconds |
0.7 |
Minimum cutaway duration. |
contextScenes.maxInsertionSeconds |
2.6 |
Maximum cutaway duration. |
contextScenes.minGapSeconds |
0.2 |
Minimum gap between cutaways. |
contextScenes.edgeBufferSeconds |
0.6 |
Keeps cutaways away from the first/last part of the clip. |
contextScenes.targetCoverageRatio |
0.5 |
Planner target for how much of the finished short should be cutaway/B-roll footage. |
contextScenes.maxCoverageRatio |
0.55 |
Hard cap for how much of the clip can be replaced by scene inserts. |
contextScenes.transcriptChunkWords |
8 |
Transcript chunk size sent to the planner. |
contextScenes.allowSceneReuseWithinClip |
false |
When false, each scene clip can be used only once inside a generated short. If no unused match is strong enough, that insert is skipped. |
contextScenes.popCultureResearch.enabled |
true |
Adds iconic movie/TV/cartoon/anime/reality/sports-doc scene concepts to each cutaway's YouTube searches. |
contextScenes.popCultureResearch.maxQueriesPerInsertion |
4 |
Maximum pop-culture scene searches injected into each cutaway query set. |
contextScenes.popCultureResearch.minQueryConfidence |
9 |
Minimum confidence for a pop-culture scene query to be used automatically. |
contextScenes.queryStyle.queriesPerInsertion |
3 |
Number of distinct AI-generated YouTube search phrases requested for each planned cutaway. |
contextScenes.queryStyle.maxExpandedQueriesPerBase |
1 |
Number of search variants to run for each AI query after style expansion. Keep low to avoid too much generic stock footage. |
contextScenes.queryStyle.minCandidateScore |
20 |
Minimum metadata score before a searched video can be downloaded. Raise this to be pickier. |
contextScenes.queryStyle.movieSceneMinCandidateScore |
20 |
Optional stricter score floor used when --movie-scenes is active. |
contextScenes.queryStyle.minCoreQueryMatches |
2 |
Required number of non-style query ideas, such as homeless or car, that should match candidate metadata before download. |
contextScenes.queryStyle.preferMotion |
true |
Tells the planner/search expander to favor moving shots and visible action. |
contextScenes.queryStyle.preferCinematic |
true |
Tells the planner/search expander to favor cinematic/commercial-looking clips. |
contextScenes.queryStyle.preferMovieScenes |
true |
Biases search expansion and scoring toward official movie/TV scene clips instead of generic stock footage. |
contextScenes.queryStyle.avoidTalkingHeads |
true |
Downranks podcasts, interviews, reactions, lectures, and similar low-cutaway-value results. |
contextScenes.queryStyle.officialClipBoost |
12 |
Ranking boost for results that look like official clips or known scene channels. |
contextScenes.queryStyle.movieSceneBoost |
14 |
Ranking boost for titles/descriptions that explicitly look like movie, film, TV, or iconic scene clips. |
contextScenes.queryStyle.stockFootagePenalty |
18 |
Penalty applied to stock, royalty-free, no-copyright, generic B-roll, and ad-like results when movie-scene mode is on. |
contextScenes.queryStyle.watermarkPenalty |
55 |
Strong penalty for metadata suggesting watermarks, preview-only footage, or stock-library previews. |
contextScenes.queryStyle.trailerPenalty |
14 |
Penalty for trailers/teasers/promos when you want actual scene inserts. |
contextScenes.queryStyle.lowQualityPenalty |
16 |
Penalty for top-10 lists, recaps, essays, reactions, fan edits, and Shorts-style reposts. |
contextScenes.queryStyle.nonScenePenalty |
22 |
Penalty when movie-scene mode is active but a result does not look like an actual movie/TV scene result. |
contextScenes.queryStyle.styleModifiers |
["cinematic", "4k"] |
Search words added to query variants and boosted during candidate ranking. |
contextScenes.queryStyle.themeBoosts |
["money", "discipline"] |
Theme words that get a small ranking boost when found in candidate metadata. |
contextScenes.queryStyle.avoidTerms |
["podcast", "slideshow"] |
Words/phrases that downrank candidate videos before download. |
contextScenes.youtubeIngest.enabled |
true |
Lets the mixer auto-ingest YouTube clips into scene-library. |
contextScenes.youtubeIngest.maxResultsPerQuery |
4 |
YouTube search results fetched for each planner query. |
contextScenes.youtubeIngest.maxDownloadsPerQuery |
1 |
Maximum new scene clip downloaded for each planner query. Keeping this low spreads downloads across more distinct queries. |
contextScenes.youtubeIngest.maxDurationSeconds |
60 |
Skips videos longer than this. |
contextScenes.youtubeIngest.channelId |
null |
Optional channel restriction for scene ingest. |
Example:
{
"contextScenes": {
"enabled": true,
"libraryDir": "./scene-library",
"planningModel": "gpt-4.1-mini",
"maxInsertionsPerClip": 10,
"minInsertionSeconds": 0.7,
"maxInsertionSeconds": 2.6,
"minGapSeconds": 0.2,
"edgeBufferSeconds": 0.6,
"targetCoverageRatio": 0.5,
"maxCoverageRatio": 0.55,
"transcriptChunkWords": 8,
"allowSceneReuseWithinClip": false,
"popCultureResearch": {
"enabled": true,
"model": "gpt-4.1",
"candidatesPerSegment": 8,
"useForYoutubeQueries": true,
"maxQueriesPerInsertion": 4,
"minQueryConfidence": 9,
"writeMarkdown": false
},
"queryStyle": {
"queriesPerInsertion": 3,
"maxExpandedQueriesPerBase": 1,
"minCandidateScore": 12,
"minCoreQueryMatches": 2,
"preferMotion": true,
"preferCinematic": true,
"avoidTalkingHeads": true,
"styleModifiers": ["cinematic", "4k", "close up", "dramatic", "slow motion", "commercial", "b roll"],
"themeBoosts": ["money", "discipline", "faith", "urgency", "luxury", "transformation", "motivation"],
"avoidTerms": ["podcast", "interview", "reaction", "slideshow", "lyrics", "compilation", "news", "talk show", "meme", "anime", "cartoon", "gameplay", "music video", "trailer", "funny", "recreates", "ishowspeed", "streamer", "vlog", "prank", "challenge", "shorts", "instruction", "instructions", "tutorial", "review", "unboxing", "toy", "charging", "how to", "product", "killed", "kills", "stabbed", "stabbing", "shooting", "shot", "carjacking", "sheriff", "deputies", "deputy", "county", "hcso", "says", "police", "crime", "suspect", "arrested", "dead", "death", "homicide", "subscribe", "report", "reporter", "breaking", "cbs", "fox", "abc", "nbc", "ktla", "couple", "romantic", "romance", "kissing", "kiss", "sound", "effect", "effects", "sfx"]
},
"youtubeIngest": {
"enabled": true,
"maxResultsPerQuery": 4,
"maxDownloadsPerQuery": 1,
"maxDurationSeconds": 60,
"channelId": null
}
}
}| Field | Example | What it controls |
|---|---|---|
soundEffects.enabled |
true |
Turns automatic SFX mixing on for process / smart:clips. |
soundEffects.libraryDir |
"./sfx-library" |
Folder containing standardized SFX files and index.json. |
soundEffects.volume |
0.065 |
Base SFX volume. Keep this low so the speaker stays dominant. |
soundEffects.originalAudioVolume |
1 |
Original clip audio volume before SFX are mixed in. |
soundEffects.maxEffectsPerClip |
8 |
Maximum SFX events in one generated short. |
soundEffects.minGapSeconds |
2.2 |
Minimum spacing between SFX events. |
soundEffects.edgeBufferSeconds |
0.45 |
Avoids SFX right at the first/last frame. |
soundEffects.maxSfxDurationSeconds |
1.2 |
Trims long sounds so effects stay punchy. |
soundEffects.sceneTransitionSfxEnabled |
true |
Adds transition-style SFX at context-scene cutaway starts. |
soundEffects.captionKeywordSfxEnabled |
true |
Adds SFX on caption words that match configured context keywords. |
soundEffects.allowReuseWithinClip |
false |
When false, each SFX file can be used only once inside a generated short. If no unused sound is available, the event is skipped. |
soundEffects.transitionVolumeMultiplier |
0.78 |
Makes cutaway transition sounds quieter/louder than base volume. |
soundEffects.keywordVolumeMultiplier |
1 |
Makes caption keyword sounds quieter/louder than base volume. |
soundEffects.contextKeywords |
{ "money": ["cash"] } |
Category-to-keyword map used to pick matching SFX. Add terms here to steer context matching. |
Scene library options:
- Drop short scene clips anywhere inside
scene-library/. - Add a sidecar file next to a clip like
my-scene.mp4.scene.json. - Or create a top-level
scene-library/index.json.
When youtubeIngest.enabled is on and YOUTUBE_API_KEY is present, the pipeline can also add new YouTube clips into this library automatically from transcript-matched search queries.
Sidecar example:
{
"title": "Trading floor celebration",
"source": "The Wolf of Wall Street",
"description": "High-energy money, winning, status, excess",
"tags": ["money", "wealth", "winning", "power", "celebration"],
"startSeconds": 0,
"endSeconds": 2.8
}Index example:
{
"scenes": [
{
"id": "wolf-office-01",
"file": "wolf/trading-floor-01.mp4",
"title": "Trading floor celebration",
"source": "The Wolf of Wall Street",
"description": "Money, winning, power, ambition",
"tags": ["money", "winning", "power", "ambition"],
"startSeconds": 0,
"endSeconds": 2.8
}
]
}| Field | Example | What it controls |
|---|---|---|
normalFontFamily |
"\"Arial Rounded MT Bold\", \"Avenir Next\", sans-serif" |
Font stack for normal words. |
highlightFontFamily |
"\"SignPainter\", \"Snell Roundhand\", cursive" |
Font stack for highlighted keywords. |
normalFontWeight |
950 |
Weight for normal words. |
highlightFontWeight |
400 |
Weight for highlighted words. |
normalFontStyle |
"normal" |
Style for normal words. |
highlightFontStyle |
"normal" or "italic" |
Style for highlighted words. |
uppercase |
false |
Converts displayed caption text to uppercase. |
Highlighted words are chosen in this order:
--highlight-wordsif passed to a render command.- AI-selected
highlightWordsfromselection.json. highlightedWordsfromcaption-style.json.- Automatic strongest-word highlighting for the visible caption tokens.
You can force certain words globally:
{
"highlightedWords": ["worth", "money", "discipline", "purpose"]
}| Field | Example | What it controls |
|---|---|---|
combineTokensWithinMilliseconds |
620 |
Groups nearby words into the same caption moment. Higher values keep more words together. |
captionLayout |
"inline-wrap" |
Caption token layout. Supported: stacked, inline, inline-wrap. |
visibleTokensBefore |
1 |
Number of previous words to keep visible beside the active word. |
visibleTokensAfter |
1 |
Number of upcoming words to preview beside the active word. |
motionPreset |
"center-pop" |
Built-in movement. Supported: static, center-pop, center-to-left, center-to-right, float. |
motionKeyframes |
See below | Custom movement over each caption beat. Overrides motionPreset when present. |
baseFontSizeRatio |
0.086 |
Caption size as a ratio of the smaller output dimension. |
minFontSize |
42 |
Minimum caption font size in pixels. |
lineHeight |
0.82 |
Vertical rhythm when words stack. |
gapRatio |
0.05 |
Gap between stacked words as a ratio of font size. |
letterSpacing |
"0.01em" |
CSS letter spacing for caption text. |
maxCaptionWidth |
"92%" |
Maximum caption block width. Useful for huge centered captions. |
activeScale |
1 |
Scale for the currently spoken token. |
inactiveScale |
0.62 |
Scale for the previous token still visible. |
highlightScale |
1.62 |
Extra scale for highlighted keywords. |
activePopStartScale |
0.72 |
Start scale for the pop-in animation. Lower means more pop. |
motionKeyframes run from at: 0 to at: 1 during each caption beat. xPercent and yPercent are viewport percentages, so -24 moves the caption left by about 24vw.
{
"position": "center-impact",
"captionLayout": "inline-wrap",
"motionKeyframes": [
{ "at": 0, "xPercent": 0, "yPercent": 0, "scale": 0.78, "opacity": 0 },
{ "at": 0.16, "xPercent": 0, "yPercent": 0, "scale": 1.16, "opacity": 1 },
{ "at": 0.58, "xPercent": -24, "yPercent": 0, "scale": 0.92, "opacity": 1 },
{ "at": 1, "xPercent": -24, "yPercent": 0, "scale": 0.9, "opacity": 0.98 }
]
}| Field | Example | What it controls |
|---|---|---|
textColor |
"#ffffff" |
Caption fill color. Hex colors respect textOpacity. |
textOpacity |
0.92 |
Caption fill opacity. Keep around 0.88-0.95 for translucent but readable text. |
normalTextColor |
"#f5f1ea" |
Optional color override for non-highlighted words. |
highlightTextColor |
"#f8f4ed" |
Optional color override for highlighted words. |
normalTextOpacityMultiplier |
1 |
Multiplies textOpacity for normal words. |
highlightTextOpacityMultiplier |
1.04 |
Multiplies textOpacity for highlighted words. |
textBlendMode |
"difference" |
Global CSS mix-blend-mode for all caption text. |
normalTextBlendMode |
"difference" |
Blend mode override for normal words. |
highlightTextBlendMode |
"difference" |
Blend mode override for highlighted words. |
normalTextFilterCss |
"contrast(1.08) saturate(0.92)" |
Extra CSS filter for normal words. |
highlightTextFilterCss |
"contrast(1.12) saturate(0.96)" |
Extra CSS filter for highlighted words. |
shadowColor |
"rgba(0, 0, 0, 0.55)" |
Main stroke/shadow color. |
normalStrokeRatio |
0.045 |
Stroke width for normal words relative to font size. |
highlightStrokeRatio |
0.012 |
Stroke width for highlighted words. |
minStrokePx |
1 |
Minimum text stroke width in pixels. |
normalTextShadow |
null or CSS |
Custom text-shadow for normal words. |
highlightTextShadow |
null or CSS |
Custom text-shadow for highlighted words. |
dropShadow |
null or CSS filter |
Custom CSS filter, usually drop-shadow(...). |
Shadow fields can use {fontSize} as a template value:
{
"normalTextShadow": "0 8px 12px rgba(0,0,0,0.75)",
"highlightTextShadow": "0 6px 10px rgba(0,0,0,0.55), 0 0 18px rgba(255,255,255,0.18)",
"dropShadow": "drop-shadow(0 5px 3px rgba(0,0,0,0.35))"
}Use null to let the renderer use its defaults:
{
"normalTextShadow": null,
"highlightTextShadow": null,
"dropShadow": null
}For the “letters react to the footage underneath” look, use blend modes:
{
"textColor": "#f5f1ea",
"textOpacity": 0.62,
"textBlendMode": "difference",
"normalTextFilterCss": "contrast(1.08) saturate(0.92)",
"highlightTextFilterCss": "contrast(1.12) saturate(0.96)"
}The renderer supports an optional foreground/alpha video above captions:
npm run rerender:clip -- \
--clip "/path/to/clip.captions.json" \
--foreground-video "/path/to/transparent-subject-layer.webm"This makes the layer order: background video, captions, foreground subject. If the foreground video has transparency, captions appear to pass behind the subject. The pipeline does not automatically generate subject masks yet; that requires a segmentation step that outputs a transparent foreground video for each clip.
Run AI selection on a local video:
npm run smart:clips -- \
--video "/path/to/original.mp4" \
--out-dir "outputs/smart-clips" \
--max-clips 6The selector looks for clips that are viral-worthy, motivational, inspirational, spiritually interesting, concrete, emotionally resonant, or high-retention. It avoids bland intros and tries to choose complete thought arcs.
Useful smart:clips options:
| Option | Meaning |
|---|---|
--video FILE |
Required source video. |
--out-dir DIR |
Final captioned clip folder. |
--work-dir DIR |
Intermediate clips/transcripts folder. |
--max-clips N |
Number of clips to create. |
--min-seconds N |
Minimum selected core length. Default 18. |
--max-seconds N |
Maximum selected core length. Default 55. |
--padding-seconds N |
Extra seconds before/after. Default 2. |
--reselect |
Ask AI to choose clips again. |
--vertical |
1080x1920 cropped fill. |
--vertical-contain |
1080x1920 contained with black bars. |
--selection-model ID |
OpenAI model for selection. |
npm run transcribe extracts mono 16kHz MP3 audio, splits long audio into chunks, retries transient OpenAI errors, and writes both caption tokens and the full transcription response.
Useful options:
| Option | Meaning |
|---|---|
--model ID |
Transcription model. Default whisper-1. |
--prompt TEXT |
Context words to improve transcription. |
--retries N |
Retry count for transient failures. Default 5. |
--audio-bitrate RATE |
Temporary audio bitrate. Default 48k. |
--chunk-seconds N |
Chunk length for longer audio. Default 180. |
Caption files can be either a raw array or an object with a captions array:
{
"captions": [
{
"text": " worth",
"startMs": 4900,
"endMs": 5400,
"timestampMs": 5150,
"confidence": null
}
]
}When fixing transcription manually, edit the "text" values. Keep startMs, endMs, and timestampMs unchanged unless you intentionally want to retime the captions.
npm run sample:props
npm run studioThen open the CaptionedClip composition in Remotion Studio and load work/sample-props.json as props.