feat(media): image/video/audio project kinds via `od media generate` #11
Conversation
- Updated project name in package.json, package-lock.json, and README files.
- Changed CLI commands and references from "ocd" to "od".
- Adjusted file structure references in documentation and code to reflect new naming conventions.
- Enhanced .gitignore to include new runtime data files.
- Updated metadata in LICENSE file to match new project name.
- Introduced CONTRIBUTING.md and CONTRIBUTING.zh-CN.md to provide clear instructions for contributors.
- Outlined contribution types, local setup instructions, and merging criteria for skills and design systems.
- Enhanced README files to reference the new contributing guidelines.
- Clarified DECK_FRAMEWORK_DIRECTIVE description in both English and Chinese README files to specify conditions for deck kind without a skill seed.
- Added detailed workflow instructions in deck-framework.ts to emphasize the importance of copying the framework before adding content.
- Enhanced discovery.ts to reinforce the framework-first approach for deck projects.
- Updated system.ts to ensure proper handling of deck projects with and without bound skills, preventing re-authorship of scaling and navigation logic.
… into feat/optimize-naming
- Added a "Star us" section in both English and Chinese README files to encourage users to star the project on GitHub.
- Included a new image asset for the star promotion.
- Introduced a new HTML file for a dedicated star promotion page.
- Updated .gitignore to exclude new cursor-related files.
… generate dispatcher

Extends Open Design from web-only to a multi-modal creation tool. The unifying contract is one code-agent loop driven by skills + project metadata + prompt constraints; for non-web surfaces the agent shells out to a single dispatcher (`od media generate`) that the daemon routes per (surface, model).

- Types: new Surface union, MediaAspect / AudioKind, image/video/audio ProjectKind + ProjectMetadata fields, video/audio ProjectFileKind.
- NewProjectPanel: top-level surface picker + Image / Video / Audio forms with model, aspect, length, duration, voice, audio-kind pickers.
- ExamplesTab + DesignSystemsTab: surface filter row that scopes before mode / scenario / category filters.
- FileViewer / FileWorkspace: native `<video>` and `<audio>` previews and matching tab icons.
- Daemon: parses `od.surface` and `> Surface:` blockquotes; recognises mp4 / webm / mov / mp3 / wav / ogg / m4a / flac extensions; spawns agents with OD_BIN / OD_DAEMON_URL / OD_PROJECT_ID / OD_PROJECT_DIR env so any code-agent CLI with shell access can call the dispatcher.
- daemon/media.js + daemon/media-models.js: surface-agnostic dispatcher with stub providers that emit deterministic placeholder bytes (1x1 PNG, valid mp4 ftyp, mp3 frame / silent WAV) so the framework works without API keys; real provider integrations slot in later.
- daemon/cli.js: `od media generate --surface ... --model ...` subcommand routes to POST /api/projects/:id/media/generate and prints one JSON line for the agent to parse.
- prompts/media-contract.ts: hard contract pinned LAST in the system prompt for image/video/audio surfaces — env vars, exact invocation, registered model IDs per surface, six workflow rules. system.ts metadata block updated to point at the contract.
- Seed skills: image-poster, video-shortform, audio-jingle each ship a SKILL.md with `mode/surface: image|video|audio` and a stylized example.html preview, and instruct the agent to dispatch via the contract.

Made-with: Cursor
Introduce non-web media surfaces (image, video, audio) as first-class project kinds. The unifying contract is "skill workflow + project metadata tell the agent WHAT to make; one shell command — `od media generate` — is HOW bytes are produced", so any code-agent CLI with shell access can drive it without bespoke tools.

- Frontend: New Project panel gains Image/Video/Audio tabs with model picker, aspect/length/duration controls, and audio kind/voice selection. Examples and Design Systems tabs gain layered sections. FileViewer renders the generated image/video/audio files.
- Shared registry: src/media/models.ts is the single source of truth for image/video/audio model IDs, aspects, and defaults — consumed by the picker AND the daemon dispatcher.
- Prompts: media-contract.ts is pinned LAST in the system prompt for media surfaces so its hard rules (call `od media generate`, don't emit binary in `<artifact>`, allowed model IDs) win over softer earlier wording.
- Daemon: new media.js dispatcher + media-models.js JSON view of the registry; cli.js gets the `od media generate` subcommand wired up via server.js / projects.js so the daemon writes files back into the project dir.
- Skills: audio-jingle, image-poster, video-shortform seed examples for the three surfaces.

Made-with: Cursor
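Since the dispatcher "prints one JSON line for the agent to parse", the agent-side half of that contract can be sketched as below. The helper name and the response fields (`ok`, `file`, `providerNote`) are illustrative assumptions, not the actual wire format:

```javascript
// Hypothetical sketch of how a code agent might parse the single JSON line
// printed by `od media generate`. Field names are assumptions for
// illustration, not the repo's actual output shape.
function parseDispatcherLine(stdout) {
  // The dispatcher prints exactly one JSON line; take the last non-empty
  // line so any stray earlier output doesn't break parsing.
  const lines = stdout.split('\n').map((l) => l.trim()).filter(Boolean);
  return JSON.parse(lines[lines.length - 1]);
}

// Example against a fabricated dispatcher response:
const result = parseDispatcherLine(
  'warning: provider is a stub\n{"ok":true,"file":"poster.png","providerNote":"stub-png"}\n'
);
console.log(result.file); // poster.png
```

Parsing only the final line keeps the agent robust if a provider ever logs before the JSON result.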
Bring in the parallel media-surfaces branch from PR #12. Tree is already identical to HEAD (same od media generate work landed independently), so this is a history-only merge to consolidate the two branches.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 976a6eadf2
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
lefarcen
left a comment
Review summary (COMMENT — not approving)
Headline: PR #11 and PR #12 are byte-identical duplicates. I ran diff <(gh pr diff 11) <(gh pr diff 12) — same 3747-line diff, same 28 files, same +2902/-78. Both branches (cursor/289994c1, cursor/47ca13ab) are Cursor worktrees from @pftom, opened ~14 seconds apart. The only meaningful difference is metadata: PR #11 has the richer description (composition diagram, architecture explanation, hyperframes-composition note), and its head is more recent — its last commit 8719c082 (Apr 28 14:46Z) merges PR #12's branch into this one to consolidate. Recommendation: land #11 (or close both and rebase a single fresh PR), close #12 as duplicate. No issue is linked; please add one (or note that the work follows the implicit non-web-surfaces direction).
Test plan: every checkbox in this PR's description is [ ] unchecked — including pnpm typecheck / pnpm build / smoke test. PR #12 reports those green ([x]). Worth either copying #12's test results across or actually re-running before merge so reviewers can see green on the chosen PR.
Core architecture is sound. The media/models.ts ↔ daemon/media-models.js registry, the media-contract.ts pinned-last-prompt, the od media generate dispatcher, and the OD_BIN/OD_PROJECT_ID env injection compose cleanly. Stub providers emit valid byte signatures (PNG, mp4 ftyp, mp3 frame, RIFF WAV) so the round-trip is testable without API keys.
Top concerns (inline below): (1) hand-mirrored registry has no enforcement that JS/TS stay in sync — the comment promises 'tests in verify' but I see none in the diff; (2) POST /api/projects/:id/media/generate has no auth/rate-limit and accepts an agent-supplied output filename (sanitized but unbounded re-writes); (3) the env-injection only covers the spawn path — confirm no other agent-spawn site is missed; (4) the contract's stub-provider disclaimer can mislead users into thinking they got real bytes; (5) prompt-side metadata duplicates the contract's 'no `<artifact>`' rule three times; (6) i18n is bilingual-complete (good).
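The "valid byte signatures" claim above is mechanically checkable. A sketch, not the repo's verify test; the 1x1 PNG here is a generic placeholder, and only the PNG and RIFF/WAV magic bytes are shown:

```javascript
// Sketch (assumed helper, not repo code): verify stub-provider output
// starts with the correct magic bytes for its surface.
const SIGNATURES = {
  png: Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]),
  wav: Buffer.from('RIFF'), // RIFF container; "WAVE" follows the size field
};

function hasSignature(bytes, kind) {
  const sig = SIGNATURES[kind];
  return bytes.subarray(0, sig.length).equals(sig);
}

// A minimal 1x1 transparent PNG, the kind of placeholder the review
// describes the stub image provider emitting:
const onePxPng = Buffer.from(
  'iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8z8BQDwAEhQGAhKmMIQAAAABJRU5ErkJggg==',
  'base64'
);
console.log(hasSignature(onePxPng, 'png')); // true
```

A check like this in verify would also catch the TS/JS registry drift concern if each registered surface's stub output were asserted.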
lefarcen
left a comment
Small correction to my earlier review.
I claimed this PR's body is richer than #12's — that's backwards. #12's description is the more detailed one (ASCII architecture diagram, file-by-file breakdown of frontend / daemon / skills sections, explicit composition note about the upcoming hyperframes worktree). This PR's body is the shorter "skill workflow + project metadata tells the agent WHAT, od media generate is HOW" framing plus a provider-stub disclaimer.
The recommendation still stands: keep this PR (#11) as the keeper because its HEAD 8719c082 is a merge of #12's branch into this one — i.e. this branch is the consolidated history, and #12 is the one to close. But consider lifting #12's body onto this PR before closing, since it's a better artifact for future archaeology.
All the technical concerns from my earlier review (registry mirror without a sync test, --output overwrite path, /api/projects/:id/media/generate rate-limit / size-cap / CORS posture, OD_DAEMON_URL hard-coded loopback, stub-provider disclaimer not flowing through to the user, anti-<artifact> rule duplicated four times) all still apply.
Apologies for the body-comparison slip.
…t dedupe)
- Surface-aware model validation in generateMedia: reject mismatched
(surface, audioKind, model) tuples up-front so an audio model id can
no longer route through the image path.
- Drop hidden designSystemId / inspirations when the New Project panel
surface is image / video / audio so a stale web-tab pick can't bleed
into media projects (the picker is hidden, so users couldn't see or
clear it).
- Single source of truth for the media model registry: src/media/
models.data.json, consumed by both src/media/models.ts and the
daemon's media-models.js. No more hand-mirrored arrays drifting.
- Collision-safe writes: generateMedia auto-suffixes
poster.png -> poster-2.png on filename collision instead of silently
clobbering an existing artifact.
- Harden /api/projects/:id/media/generate:
- 64KB body cap applied by the dispatching JSON middleware for this route (vs the global parser's 4MB elsewhere)
- explicit project-id regex check (with decode round-trip)
- reject cross-origin POSTs whose Origin header does not match the
daemon
- cap prompt / output / voice string lengths inside generateMedia
- distinguish 413 from 400 in the route handler
- Derive OD_DAEMON_URL from a single DAEMON_HOST constant shared with
the listen() bind, so changing the bind host can't drift the agent's
callback URL silently.
- Add a SOLE-spawn-site comment so future agent-launch paths don't
forget the OD env injection block.
- New workflow rule #7 in MEDIA_GENERATION_CONTRACT: agent must
surface stub providerNote ("stub-png", "stub-mp4", ...) to the user
rather than narrating placeholder bytes as a real generation.
- Drop the duplicated "Do NOT emit <artifact>" lines from each per-
surface metadata block in renderMetadataBlock — the canonical rule
lives only in the contract block now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
👋 Thanks @pftom — extending OD to image/video/audio with a single `od media generate` dispatcher. This is the keeper of the two media PRs (its HEAD is the merge of #12 into this branch). Architecture is the right shape. ✅ Inline concerns are mostly the "before real providers land" hardening:
lefarcen
left a comment
Hey @pftom! 🎉 Wow, this is a substantial feature — adding image/video/audio surfaces with a clean, tool-agnostic contract. The od media generate dispatcher is elegant, and I love how it works across any code-agent CLI without custom tool definitions. Pinning the contract LAST in the system prompt is clever — hard rules win.
Found 8 items worth attention (mix of P2 verification + P3 polish). Most are edge-case hardening + reasoning gaps in the new SKILL.md files. No P1 blockers.
See inline comments below 👇
```js
const ctx = {
  surface,
  model,
  prompt: prompt || '',
```
P2 TOCTOU race: uniqueFilename checks await pathExists(target), then later await writeFile(target, bytes). If two concurrent od media generate calls pick the same name (same timestamp), both see "file doesn't exist" and write to the same path — second one silently overwrites the first. Rare in practice (requires sub-millisecond collision), but the function comment promises collision safety. Fix: use fs.promises.open(target, 'wx') (exclusive write) and catch EEXIST to retry.
```js
const body = {
  surface,
  model: flags.model,
```
P3 Missing validation: the CLI parses --length and --duration as Number(flags.length) but doesn't validate they're positive integers. A malicious/confused agent could pass --length=-5 or --length=banana, silently getting NaN in the POST body. The dispatcher checks typeof but Number('banana') is number (albeit NaN). Suggest: const len = Number(flags.length); if (!Number.isFinite(len) || len <= 0) { console.error('--length must be a positive number'); process.exit(2); }
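The suggested guard as a tiny runnable sketch (helper name assumed):

```javascript
// Assumed helper name; mirrors the inline suggestion above.
// Number('banana') is NaN (still typeof 'number'), so Number.isFinite
// is the right gate, and it rejects Infinity too.
function parsePositiveNumber(flagName, raw) {
  const value = Number(raw);
  if (!Number.isFinite(value) || value <= 0) {
    throw new Error(`--${flagName} must be a positive number, got "${raw}"`);
  }
  return value;
}

console.log(parsePositiveNumber('length', '10')); // 10
// parsePositiveNumber('length', 'banana') and ('length', '-5') both throw.
```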
```js
const expectedLocal = `http://localhost:${port}`;
if (origin !== expected && origin !== expectedLocal) {
  return res.status(403).json({ error: 'cross-origin denied' });
}
```
P3 Comment stale: says "The 64kb body-size cap for this route is applied by the dispatching JSON middleware in startServer() above" — but the middleware is 100+ lines earlier and not obviously "above" when reading this route. Consider: // Body size cap: see the jsonSmall middleware ~100 lines up, applied per-route before parsing.
```
- \`OD_DAEMON_URL\` — base URL of the local daemon, e.g. \`http://127.0.0.1:7456\`.

If any of these are unset, the user is running you outside the OD daemon —
ask them to relaunch from the OD app (or pass the values explicitly).
```
P3 Reasoning gap (Lens B): the contract says "verify with `echo`" but doesn't show how (new users might type `echo OD_BIN` instead of `echo "$OD_BIN"`). Add one concrete example: (verify with `echo "$OD_PROJECT_ID"` — it should print the project UUID).
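Beyond the echo example, an agent-side preflight could make the same check programmatic. A sketch (helper name assumed; the four variable names come from the contract):

```javascript
// Sketch (assumed helper, not repo code): fail fast when the
// daemon-injected environment is missing, per the contract's
// "ask the user to relaunch" rule.
const REQUIRED = ['OD_BIN', 'OD_DAEMON_URL', 'OD_PROJECT_ID', 'OD_PROJECT_DIR'];

function missingOdEnv(env = process.env) {
  return REQUIRED.filter((name) => !env[name]);
}

// Example against a fabricated environment:
console.log(missingOdEnv({ OD_BIN: '/usr/local/bin/od' }));
// -> ['OD_DAEMON_URL', 'OD_PROJECT_ID', 'OD_PROJECT_DIR']
```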
```
`audioKind`, `audioModel`, `audioDuration` (seconds), and (for speech)
`voice`. Branch by `audioKind` and use the values verbatim — no
clarifying form unless something is marked `(unknown — ask)`.
```
P2 Reasoning gap (Lens B — unstated assumption): Step 0 says "use the values verbatim" but Step 2 says "Compose the prompt ... Use the format the upstream model prefers." What if the metadata's audioModel is suno-v5 but the user's chat message says "make it Udio style"? The skill doesn't say which wins. Suggest adding a tiebreaker rule: "Metadata is authoritative unless the user's current message explicitly contradicts it (e.g. 'switch to Udio')." (This matches the contract's intent but the skill should state it.)
```
3. **Palette + textures** — hex anchors when the user gave a brand
   palette; otherwise a 3-word mood tag (e.g. "muted ochre + ink").
4. **Camera / lens** — only if the user wants photographic realism
   ("85mm portrait, shallow DOF") or a specific film stock.
```
P3 Reasoning gap (Lens B — quantification missing): Step 1 prescribes a 5-point prompt structure but doesn't say how long each section should be. A junior user might write 2 sentences per point = 10-sentence prompt = way over the model's comfort zone. Add a rough token budget: e.g. "Aim for 1-2 sentences per point; total ~100-150 words. Longer prompts don't improve quality for most image models."
```
`videoModel`, `videoLength` (seconds), `videoAspect`. These are
hard-locks — clamp the prompt to whatever the chosen model supports
(Seedance 2 caps at 10s; Kling 4 supports up to 10s + image-to-video;
```
P2 Reasoning gap (Lens B — failure mode not covered): Step 1 has a shotlist table with "Motion" = "What moves, at what pace? Subject motion vs camera motion." — but doesn't warn that most current text-to-video models struggle with complex multi-object motion. A user planning "character walks left while car drives right while leaves blow" will get disappointing results. Add a constraint note: "Current models (Seedance 2, Kling 3/4, Veo 3) handle 1-2 motion elements well; 3+ often drift or freeze. Prioritize the key motion."
```jsx
<label className="newproj-label">{t('newproj.videoLengthLabel')}</label>
<div className="pill-grid">
  {lengths.map((s) => (
    <button
```
P3 (minor): pickDefaultSkill logic prefers s.surface === surface && s.mode === surface, falls back to s.mode === surface. The comment says "legacy skills authored without `surface` still get picked up" — but this assumes mode was always set correctly. What if a skill has mode: 'prototype', surface: 'image' (authoring error)? It would never match the image surface (first condition fails on mode, second condition fails on mode !== 'image'). Not a real-world issue today (no such skills ship), but the fallback could be more robust: const modeMatch = skills.find(s => s.mode === surface || s.surface === surface); (match either field).
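The more tolerant fallback, sketched with an assumed skill shape (only `mode` and `surface` fields, per the comment):

```javascript
// Sketch (assumed shape, not repo code): match a skill when EITHER its
// mode or its surface names the target surface, so an authoring slip like
// { mode: 'prototype', surface: 'image' } still resolves for 'image'.
function pickDefaultSkill(skills, surface) {
  return (
    // Prefer an exact mode+surface match, as the current code does...
    skills.find((s) => s.surface === surface && s.mode === surface) ||
    // ...then fall back to matching either field.
    skills.find((s) => s.mode === surface || s.surface === surface) ||
    null
  );
}

const skills = [
  { id: 'image-poster', mode: 'prototype', surface: 'image' }, // authoring error
  { id: 'video-shortform', mode: 'video', surface: 'video' },
];
console.log(pickDefaultSkill(skills, 'image').id); // image-poster
```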
Dismissed — accidental empty approval; defer to the prior COMMENTED review.
Summary
Adds non-web media surfaces (image, video, audio) as first-class project
kinds. The unifying contract is: skill workflow + project metadata tell the agent WHAT to make; one shell command (`od media generate`) is HOW bytes are produced.
This keeps the design tool-agnostic: any code-agent CLI with shell access
(Claude Code, Codex, Gemini, OpenCode, Cursor Agent, Qwen, …) can drive
media generation without bespoke tool integrations.
Changes
Frontend
- New Project panel gains Image / Video / Audio tabs with model picker, aspect / length / duration controls, audio-kind and voice selection.
- Examples and Design Systems tabs gain surface sections; the new media skills sit alongside prototype / slides / interactive video.
- FileViewer renders `image/*`, `video/*`, and `audio/*` files inline (next to the existing HTML preview / source views).
Shared registry
`src/media/models.ts` is the single source of truth for image / video / audio model IDs, aspects, and defaults. Both the picker and the daemon dispatcher consume it so they cannot drift.
Prompts
`src/prompts/media-contract.ts` is pinned last in the system prompt for media surfaces. Its hard rules (call `od media generate`, do not embed binary in `<artifact>`, allowed model IDs per surface) override any softer wording earlier in the prompt stack.
Daemon
`daemon/media.js` dispatcher + `daemon/media-models.js` JSON view of the registry.
`daemon/cli.js` exposes `od media generate` as a subcommand, wired through `server.js` / `projects.js` so the daemon writes generated files back into the project dir and the FileViewer picks them up automatically.
Skills
`audio-jingle`, `image-poster`, `video-shortform` — each with a `SKILL.md` workflow and a representative `example.html` thumbnail.
Provider note
The provider integrations behind specific model IDs (gpt-image-2,
seedance-2, suno-v5, …) may still be stubs — the dispatcher returns
success and a placeholder file. The contract stays the same; bytes get
sharper as real provider integrations land.
Test plan
- [ ] `pnpm install` and `pnpm typecheck` pass after the media additions.
- [ ] `pnpm dev:all` boots; new project panel shows Image / Video / Audio tabs.
- [ ] A new media project's system prompt ends with the media contract.
- [ ] `od media generate` returns a JSON line and writes a file under `OD_PROJECT_DIR`; FileViewer renders it.
- [ ] Examples and Design Systems tabs show the new layered sections.
Made with Cursor