
feat(media): image/video/audio project kinds via od media generate #11

Open
pftom wants to merge 14 commits into main from cursor/289994c1

@pftom pftom commented Apr 28, 2026

Summary

Adds non-web media surfaces (image, video, audio) as first-class project
kinds. The unifying contract is:

skill workflow + project metadata tell the agent what to make;
one shell command — od media generate — is how bytes are produced.

This keeps the design tool-agnostic: any code-agent CLI with shell access
(Claude Code, Codex, Gemini, OpenCode, Cursor Agent, Qwen, …) can drive
media generation without bespoke tool integrations.

Changes

Frontend

  • New Project panel gains Image / Video / Audio tabs with model picker,
    aspect / length / duration controls, audio-kind and voice selection.
  • Examples and Design Systems tabs gain layered sections so the
    new media skills sit alongside prototype / slides / interactive video.
  • FileViewer renders generated image/*, video/*, and audio/*
    files inline (next to the existing HTML preview / source views).
  • Icons and i18n strings (en + zh-CN) added for the new surfaces.

Shared registry

  • `src/media/models.ts` is the single source of truth for image / video /
    audio model IDs, aspects, and defaults. Both the picker and the daemon
    dispatcher consume it so they cannot drift.

Prompts

  • `src/prompts/media-contract.ts` is pinned last in the system prompt for
    media surfaces. Its hard rules (call `od media generate`, do not embed
    binary in `<artifact>`, allowed model IDs per surface) override any
    softer wording earlier in the prompt stack.

Daemon

  • New `daemon/media.js` dispatcher + `daemon/media-models.js` JSON view of the
    registry.
  • `daemon/cli.js` exposes `od media generate` as a subcommand, wired through
    `server.js` / `projects.js` so the daemon writes generated files back into
    the project dir and the FileViewer picks them up automatically.

Skills

  • Seed skills for the three surfaces: `audio-jingle`, `image-poster`,
    `video-shortform`, each with a `SKILL.md` workflow and a representative
    `example.html` thumbnail.

Provider note

The provider integrations behind specific model IDs (gpt-image-2,
seedance-2, suno-v5, …) may still be stubs: in that case the dispatcher
returns success and writes a placeholder file. The contract stays the same;
placeholder bytes are replaced by real output as provider integrations land.

Test plan

  • `pnpm install` and `pnpm typecheck` pass after the media additions.
  • `pnpm dev:all` boots; new project panel shows Image / Video / Audio tabs.
  • Creating an image / video / audio project lands the agent in a
    project where the system prompt ends with the media contract.
  • `od media generate` returns a JSON line and writes a file under
    `OD_PROJECT_DIR`; FileViewer renders it.
  • Examples tab and Design Systems tab still render correctly with
    the new layered sections.
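
The `od media generate` item in the test plan can be smoke-tested from a script. The following is a minimal sketch only. It assumes that `OD_BIN` points at something directly executable, that the JSON result is the last non-empty line on stdout, and that a `--prompt` flag exists (only `--surface` and `--model` are shown in this PR). The helper names and the shape of the parsed result are hypothetical.

```javascript
const { execFile } = require('node:child_process');

// Build the argv for `od media generate`. --surface and --model appear in the
// PR; --prompt is an ASSUMED flag name used here for illustration only.
function buildGenerateArgs({ surface, model, prompt }) {
  const args = ['media', 'generate', '--surface', surface, '--model', model];
  if (prompt) args.push('--prompt', prompt);
  return args;
}

// The CLI is described as printing one JSON line; take the last non-empty
// stdout line and parse it, so stray log lines before it are tolerated.
function parseResultLine(stdout) {
  const lines = stdout.split('\n').map((l) => l.trim()).filter(Boolean);
  if (lines.length === 0) throw new Error('no output from od media generate');
  return JSON.parse(lines[lines.length - 1]);
}

// Run the dispatcher through the binary exported in OD_BIN (hypothetical helper).
function runOdMediaGenerate(opts, env = process.env) {
  return new Promise((resolve, reject) => {
    if (!env.OD_BIN) return reject(new Error('OD_BIN is not set'));
    execFile(env.OD_BIN, buildGenerateArgs(opts), { env }, (err, stdout) => {
      if (err) return reject(err);
      try {
        resolve(parseResultLine(stdout));
      } catch (parseErr) {
        reject(parseErr);
      }
    });
  });
}
```

Keeping argument building and output parsing as pure functions means both can be checked without a running daemon.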

Made with Cursor

pftom added 12 commits April 28, 2026 14:48
- Updated project name in package.json, package-lock.json, and README files.
- Changed CLI commands and references from "ocd" to "od".
- Adjusted file structure references in documentation and code to reflect new naming conventions.
- Enhanced .gitignore to include new runtime data files.
- Updated metadata in LICENSE file to match new project name.
- Introduced CONTRIBUTING.md and CONTRIBUTING.zh-CN.md to provide clear instructions for contributors.
- Outlined contribution types, local setup instructions, and merging criteria for skills and design systems.
- Enhanced README files to reference the new contributing guidelines.
- Clarified DECK_FRAMEWORK_DIRECTIVE description in both English and Chinese README files to specify conditions for deck kind without a skill seed.
- Added detailed workflow instructions in deck-framework.ts to emphasize the importance of copying the framework before adding content.
- Enhanced discovery.ts to reinforce the framework-first approach for deck projects.
- Updated system.ts to ensure proper handling of deck projects with and without bound skills, preventing re-authorship of scaling and navigation logic.
- Added a "Star us" section in both English and Chinese README files to encourage users to star the project on GitHub.
- Included a new image asset for the star promotion.
- Introduced a new HTML file for a dedicated star promotion page.
- Updated .gitignore to exclude new cursor-related files.
… generate dispatcher

Extends Open Design from web-only to a multi-modal creation tool. The
unifying contract is one code-agent loop driven by skills + project
metadata + prompt constraints; for non-web surfaces the agent shells
out to a single dispatcher (`od media generate`) that the daemon
routes per (surface, model).

- Types: new Surface union, MediaAspect / AudioKind, image/video/audio
  ProjectKind + ProjectMetadata fields, video/audio ProjectFileKind.
- NewProjectPanel: top-level surface picker + Image / Video / Audio
  forms with model, aspect, length, duration, voice, audio-kind pickers.
- ExamplesTab + DesignSystemsTab: surface filter row that scopes
  before mode / scenario / category filters.
- FileViewer / FileWorkspace: native <video> and <audio> previews and
  matching tab icons.
- Daemon: parses `od.surface` and `> Surface:` blockquotes; recognises
  mp4 / webm / mov / mp3 / wav / ogg / m4a / flac extensions; spawns
  agents with OD_BIN / OD_DAEMON_URL / OD_PROJECT_ID / OD_PROJECT_DIR
  env so any code-agent CLI with shell access can call the dispatcher.
- daemon/media.js + daemon/media-models.js: surface-agnostic dispatcher
  with stub providers that emit deterministic placeholder bytes
  (1x1 PNG, valid mp4 ftyp, mp3 frame / silent WAV) so the framework
  works without API keys; real provider integrations slot in later.
- daemon/cli.js: `od media generate --surface ... --model ...`
  subcommand routes to POST /api/projects/:id/media/generate and
  prints one JSON line for the agent to parse.
- prompts/media-contract.ts: hard contract pinned LAST in the system
  prompt for image/video/audio surfaces — env vars, exact invocation,
  registered model IDs per surface, six workflow rules. system.ts
  metadata block updated to point at the contract.
- Seed skills: image-poster, video-shortform, audio-jingle each ship a
  SKILL.md with `mode/surface: image|video|audio` and a stylized
  example.html preview, and instruct the agent to dispatch via the
  contract.

Made-with: Cursor
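
The commit above describes stub providers that emit deterministic placeholder bytes, including a silent WAV. As an illustration of that one case, here is a minimal sketch of a silent PCM WAV builder. The function name `silentWav`, the 8 kHz mono 16-bit format, and the duration parameter are assumptions for this example and are not taken from `daemon/media.js`.

```javascript
// Build a minimal PCM WAV file containing silence.
// Layout: 12-byte RIFF header, 24-byte "fmt " chunk, 8-byte "data" header, samples.
function silentWav(durationSeconds, sampleRate = 8000) {
  const channels = 1;
  const bytesPerSample = 2; // 16-bit PCM
  const dataSize = Math.round(durationSeconds * sampleRate) * channels * bytesPerSample;
  const buf = Buffer.alloc(44 + dataSize); // zero-filled, so every sample is silence

  buf.write('RIFF', 0, 'ascii');
  buf.writeUInt32LE(36 + dataSize, 4); // file size minus the first 8 bytes
  buf.write('WAVE', 8, 'ascii');

  buf.write('fmt ', 12, 'ascii');
  buf.writeUInt32LE(16, 16); // fmt chunk size for PCM
  buf.writeUInt16LE(1, 20); // audio format 1 = PCM
  buf.writeUInt16LE(channels, 22);
  buf.writeUInt32LE(sampleRate, 24);
  buf.writeUInt32LE(sampleRate * channels * bytesPerSample, 28); // byte rate
  buf.writeUInt16LE(channels * bytesPerSample, 32); // block align
  buf.writeUInt16LE(bytesPerSample * 8, 34); // bits per sample

  buf.write('data', 36, 'ascii');
  buf.writeUInt32LE(dataSize, 40);
  return buf;
}
```

Because the output depends only on the arguments, the same call always yields the same bytes, which is the property the stubs rely on for a testable round-trip without API keys.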
Introduce non-web media surfaces (image, video, audio) as first-class
project kinds. The unifying contract is "skill workflow + project
metadata tell the agent WHAT to make; one shell command — od media
generate — is HOW bytes are produced", so any code-agent CLI with
shell access can drive it without bespoke tools.

- Frontend: New Project panel gains Image/Video/Audio tabs with model
  picker, aspect/length/duration controls, and audio kind/voice
  selection. Examples and Design Systems tabs gain layered sections.
  FileViewer renders the generated image/video/audio files.
- Shared registry: src/media/models.ts is the single source of truth
  for image/video/audio model IDs, aspects, and defaults — consumed
  by the picker AND the daemon dispatcher.
- Prompts: media-contract.ts is pinned LAST in the system prompt for
  media surfaces so its hard rules (call od media generate, don't
  emit binary in <artifact>, allowed model IDs) win over softer
  earlier wording.
- Daemon: new media.js dispatcher + media-models.js JSON view of the
  registry; cli.js gets the `od media generate` subcommand wired up
  via server.js / projects.js so the daemon writes files back into
  the project dir.
- Skills: audio-jingle, image-poster, video-shortform seed examples
  for the three surfaces.

Made-with: Cursor
@pftom pftom marked this pull request as ready for review April 28, 2026 14:44
Bring in the parallel media-surfaces branch from PR #12. Tree is
already identical to HEAD (same od media generate work landed
independently), so this is a history-only merge to consolidate the
two branches.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 976a6eadf2

Comment thread daemon/media.js Outdated
Comment thread src/components/NewProjectPanel.tsx

@lefarcen lefarcen left a comment

Review summary (COMMENT — not approving)

Headline: PR #11 and PR #12 are byte-identical duplicates. I ran diff <(gh pr diff 11) <(gh pr diff 12) — same 3747-line diff, same 28 files, same +2902/-78. Both branches (cursor/289994c1, cursor/47ca13ab) are Cursor worktrees from @pftom, opened ~14 seconds apart. The only meaningful difference is metadata: PR #11 has the richer description (composition diagram, architecture explanation, hyperframes-composition note), and its head is more recent — its last commit 8719c082 (Apr 28 14:46Z) merges PR #12's branch into this one to consolidate. Recommendation: land #11 (or close both and rebase a single fresh PR), close #12 as duplicate. No issue is linked; please add one (or note that the work follows the implicit non-web-surfaces direction).

Test plan: every checkbox in this PR's description is [ ] unchecked — including pnpm typecheck / pnpm build / smoke test. PR #12 reports those green ([x]). Worth either copying #12's test results across or actually re-running before merge so reviewers can see green on the chosen PR.

Core architecture is sound. The media/models.ts / daemon/media-models.js registry, the media-contract.ts pinned-last prompt, the od media generate dispatcher, and the OD_BIN/OD_PROJECT_ID env injection compose cleanly. Stub providers emit valid byte signatures (PNG, mp4 ftyp, mp3 frame, RIFF WAV) so the round-trip is testable without API keys.

Top concerns (inline below): (1) the hand-mirrored registry has no enforcement that JS/TS stay in sync (the comment promises 'tests in verify' but I see none in the diff); (2) POST /api/projects/:id/media/generate has no auth/rate-limit and accepts an agent-supplied output filename (sanitized but unbounded re-writes); (3) the env-injection only covers the spawn path, so confirm no other agent-spawn site is missed; (4) the contract's stub-provider disclaimer can mislead users into thinking they got real bytes; (5) prompt-side metadata duplicates the contract's 'no <artifact>' rule three times; (6) i18n is bilingual-complete (good).

Comment thread daemon/media-models.js Outdated
Comment thread daemon/media.js Outdated
Comment thread daemon/server.js
Comment thread daemon/server.js
Comment thread src/prompts/media-contract.ts Outdated
Comment thread src/prompts/system.ts Outdated

@lefarcen lefarcen left a comment

Small correction to my earlier review.

I claimed this PR's body is richer than #12's — that's backwards. #12's description is the more detailed one (ASCII architecture diagram, file-by-file breakdown of frontend / daemon / skills sections, explicit composition note about the upcoming hyperframes worktree). This PR's body is the shorter "skill workflow + project metadata tells the agent WHAT, od media generate is HOW" framing plus a provider-stub disclaimer.

The recommendation still stands: keep this PR (#11) as the keeper because its HEAD 8719c082 is a merge of #12's branch into this one — i.e. this branch is the consolidated history, and #12 is the one to close. But consider lifting #12's body onto this PR before closing, since it's a better artifact for future archaeology.

The technical concerns from my earlier review (registry mirror without a sync test, --output overwrite path, /api/projects/:id/media/generate rate-limit / size-cap / CORS posture, OD_DAEMON_URL hard-coded loopback, stub-provider disclaimer not flowing through to the user, anti-<artifact> rule duplicated four times) all still apply.

Apologies for the body-comparison slip.

…t dedupe)

- Surface-aware model validation in generateMedia: reject mismatched
  (surface, audioKind, model) tuples up-front so an audio model id can
  no longer route through the image path.
- Drop hidden designSystemId / inspirations when the New Project panel
  surface is image / video / audio so a stale web-tab pick can't bleed
  into media projects (the picker is hidden, so users couldn't see or
  clear it).
- Single source of truth for the media model registry: src/media/
  models.data.json, consumed by both src/media/models.ts and the
  daemon's media-models.js. No more hand-mirrored arrays drifting.
- Collision-safe writes: generateMedia auto-suffixes
  poster.png -> poster-2.png on filename collision instead of silently
  clobbering an existing artifact.
- Harden /api/projects/:id/media/generate:
  - 64KB body cap dispatched at the global JSON parser (vs 4MB elsewhere)
  - explicit project-id regex check (with decode round-trip)
  - reject cross-origin POSTs whose Origin header does not match the
    daemon
  - cap prompt / output / voice string lengths inside generateMedia
  - distinguish 413 from 400 in the route handler
- Derive OD_DAEMON_URL from a single DAEMON_HOST constant shared with
  the listen() bind, so changing the bind host can't drift the agent's
  callback URL silently.
- Add a SOLE-spawn-site comment so future agent-launch paths don't
  forget the OD env injection block.
- New workflow rule #7 in MEDIA_GENERATION_CONTRACT: agent must
  surface stub providerNote ("stub-png", "stub-mp4", ...) to the user
  rather than narrating placeholder bytes as a real generation.
- Drop the duplicated "Do NOT emit <artifact>" lines from each per-
  surface metadata block in renderMetadataBlock — the canonical rule
  lives only in the contract block now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
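
The first bullet of the commit above (rejecting mismatched (surface, audioKind, model) tuples up-front) can be sketched as follows. This is an illustration under assumptions: the registry shape, the `music` audio kind, and the function name `validateModel` are hypothetical, and the real `models.data.json` and `generateMedia` may be structured differently. The model IDs are the ones named in the PR description.

```javascript
// HYPOTHETICAL registry shape; the real models.data.json may differ.
const REGISTRY = {
  image: ['gpt-image-2'],
  video: ['seedance-2'],
  audio: { music: ['suno-v5'] }, // 'music' is an assumed audioKind value
};

// Return null when the tuple is valid, otherwise a human-readable reason.
function validateModel(registry, { surface, audioKind, model }) {
  if (!Object.hasOwn(registry, surface)) return `unknown surface: ${surface}`;
  let allowed = registry[surface];
  if (surface === 'audio') {
    // Audio models are scoped by kind, so the kind must be known first.
    if (!audioKind || !Object.hasOwn(allowed, audioKind)) {
      return `unknown audioKind: ${audioKind}`;
    }
    allowed = allowed[audioKind];
  }
  return allowed.includes(model)
    ? null
    : `model ${model} is not registered for ${surface}`;
}
```

Validating before dispatch means an audio model ID fails with a clear message instead of being routed through the image path.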
@lefarcen

👋 Thanks @pftom — extending OD to image/video/audio with a single od media generate contract is a really clean architectural bet. The single-source-of-truth registry (src/media/models.ts) consumed by both picker and dispatcher, and the media-contract pinned LAST so it overrides earlier prompt layers, are both nicely thought out. 🎨🙏

This is the keeper of the two media PRs (its HEAD is the merge of #12 into this branch).

Inline concerns are mostly the "before real providers land" hardening:

  • ⚠️ Stub providers ship unconditionally — would gate behind OD_MEDIA_ALLOW_STUBS for prod
  • 🔒 POST /api/projects/:id/media/generate has no rate-limit / size-cap; agent-supplied --output can clobber existing files (no overwrite:false guard)
  • 💡 daemon/media-models.js mirrors src/media/models.ts by hand — codegen or a sync test would prevent drift

Architecture is the right shape. ✅

@lefarcen lefarcen added the enhancement New feature or request label Apr 29, 2026
lefarcen previously approved these changes May 2, 2026

@lefarcen lefarcen left a comment

Approved.

@lefarcen lefarcen added the feature New feature or enhancement label May 2, 2026
@lefarcen lefarcen self-requested a review May 2, 2026 03:21

@lefarcen lefarcen left a comment

Hey @pftom! 🎉 Wow, this is a substantial feature — adding image/video/audio surfaces with a clean, tool-agnostic contract. The od media generate dispatcher is elegant, and I love how it works across any code-agent CLI without custom tool definitions. Pinning the contract LAST in the system prompt is clever — hard rules win.

Found 8 items worth attention (mix of P2 verification + P3 polish). Most are edge-case hardening + reasoning gaps in the new SKILL.md files. No P1 blockers.

See inline comments below 👇

Comment thread daemon/media.js
const ctx = {
  surface,
  model,
  prompt: prompt || '',

P2 TOCTOU race: uniqueFilename checks await pathExists(target), then later await writeFile(target, bytes). If two concurrent od media generate calls pick the same name (same timestamp), both see "file doesn't exist" and write to the same path — second one silently overwrites the first. Rare in practice (requires sub-millisecond collision), but the function comment promises collision safety. Fix: use fs.promises.open(target, 'wx') (exclusive write) and catch EEXIST to retry.
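
The exclusive-write fix suggested above could look roughly like this. It is a sketch, not the PR's code: `writeUnique`, its parameters, and the attempt limit are hypothetical, and the `-2`, `-3` suffix scheme mirrors the `poster.png -> poster-2.png` behavior described in the follow-up commit.

```javascript
const fs = require('node:fs/promises');
const path = require('node:path');

// Write `bytes` to dir/name without ever overwriting an existing file.
// The 'wx' flag makes open() fail with EEXIST if the path already exists, so
// the existence check and the file creation happen in one atomic step.
async function writeUnique(dir, name, bytes, maxAttempts = 100) {
  const { name: stem, ext } = path.parse(name);
  for (let i = 1; i <= maxAttempts; i += 1) {
    const candidate = i === 1 ? name : `${stem}-${i}${ext}`; // poster.png, poster-2.png, ...
    let handle;
    try {
      handle = await fs.open(path.join(dir, candidate), 'wx');
    } catch (err) {
      if (err.code === 'EEXIST') continue; // name taken, try the next suffix
      throw err;
    }
    try {
      await handle.writeFile(bytes);
    } finally {
      await handle.close();
    }
    return candidate;
  }
  throw new Error(`no free filename for ${name} after ${maxAttempts} attempts`);
}
```

With this shape there is no separate `pathExists` call to race against: two concurrent callers asking for the same name each get a distinct file.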

Comment thread daemon/cli.js

const body = {
  surface,
  model: flags.model,

P3 Missing validation: the CLI parses --length and --duration as Number(flags.length) but doesn't validate they're positive integers. A malicious/confused agent could pass --length=-5 or --length=banana, silently getting NaN in the POST body. The dispatcher checks typeof but Number('banana') is number (albeit NaN). Suggest: const len = Number(flags.length); if (!Number.isFinite(len) || len <= 0) { console.error('--length must be a positive number'); process.exit(2); }
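
A slightly more reusable form of the check suggested above, as a sketch. The helper name `parsePositiveNumber` and its return shape are hypothetical; it assumes flag values reach the CLI as strings, and it deliberately rejects anything else (for example a bare `--length` parsed as `true`).

```javascript
// Parse a CLI flag that must be a positive, finite number.
// Returns { ok: true, value } or { ok: false, error } so the caller decides
// how to exit; an undefined input means the flag was not passed at all.
function parsePositiveNumber(flagName, raw) {
  if (raw === undefined) return { ok: true, value: undefined };
  // Only non-empty strings are converted; Number('') would otherwise be 0
  // and Number(true) would be 1.
  const value = typeof raw === 'string' && raw.trim() !== '' ? Number(raw) : NaN;
  if (!Number.isFinite(value) || value <= 0) {
    return { ok: false, error: `--${flagName} must be a positive number, got "${raw}"` };
  }
  return { ok: true, value };
}
```

A caller in the CLI would run this for `--length` and `--duration`, print `error` to stderr and exit non-zero when `ok` is false, and otherwise place `value` in the POST body.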

Comment thread daemon/server.js
const expectedLocal = `http://localhost:${port}`;
if (origin !== expected && origin !== expectedLocal) {
  return res.status(403).json({ error: 'cross-origin denied' });
}

P3 Comment stale: says "The 64kb body-size cap for this route is applied by the dispatching JSON middleware in startServer() above" — but the middleware is 100+ lines earlier and not obviously "above" when reading this route. Consider: // Body size cap: see the jsonSmall middleware ~100 lines up, applied per-route before parsing.

- \`OD_DAEMON_URL\` — base URL of the local daemon, e.g. \`http://127.0.0.1:7456\`.

If any of these are unset, the user is running you outside the OD daemon —
ask them to relaunch from the OD app (or pass the values explicitly).

P3 Reasoning gap (Lens B): the contract says "verify with `echo`" but doesn't show how (new users might type `echo OD_BIN` instead of `echo "$OD_BIN"`). Add one concrete example: (verify with `echo "$OD_PROJECT_ID"`; it should print the project UUID)
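
For callers that script against the daemon rather than typing `echo` by hand, the same "are the variables set" check can be done programmatically. A minimal sketch: the four variable names come from this PR's commit message, while the helper name `missingOdEnv` is hypothetical.

```javascript
// Environment variables the daemon is described as injecting when it spawns an agent.
const REQUIRED_OD_ENV = ['OD_BIN', 'OD_DAEMON_URL', 'OD_PROJECT_ID', 'OD_PROJECT_DIR'];

// Return the names of required variables that are unset or empty.
function missingOdEnv(env = process.env) {
  return REQUIRED_OD_ENV.filter((name) => !env[name]);
}
```

An empty result means the process was launched by the OD daemon (or the values were passed explicitly); a non-empty result lists exactly which variables to ask the user about.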


`audioKind`, `audioModel`, `audioDuration` (seconds), and (for speech)
`voice`. Branch by `audioKind` and use the values verbatim — no
clarifying form unless something is marked `(unknown — ask)`.

P2 Reasoning gap (Lens B — unstated assumption): Step 0 says "use the values verbatim" but Step 2 says "Compose the prompt ... Use the format the upstream model prefers." What if the metadata's audioModel is suno-v5 but the user's chat message says "make it Udio style"? The skill doesn't say which wins. Suggest adding a tiebreaker rule: "Metadata is authoritative unless the user's current message explicitly contradicts it (e.g. 'switch to Udio')." (This matches the contract's intent but the skill should state it.)

3. **Palette + textures** — hex anchors when the user gave a brand
palette; otherwise a 3-word mood tag (e.g. "muted ochre + ink").
4. **Camera / lens** — only if the user wants photographic realism
("85mm portrait, shallow DOF") or a specific film stock.

P3 Reasoning gap (Lens B — quantification missing): Step 1 prescribes a 5-point prompt structure but doesn't say how long each section should be. A junior user might write 2 sentences per point = 10-sentence prompt = way over the model's comfort zone. Add a rough token budget: e.g. "Aim for 1-2 sentences per point; total ~100-150 words. Longer prompts don't improve quality for most image models."


`videoModel`, `videoLength` (seconds), `videoAspect`. These are
hard-locks — clamp the prompt to whatever the chosen model supports
(Seedance 2 caps at 10s; Kling 4 supports up to 10s + image-to-video;

P2 Reasoning gap (Lens B — failure mode not covered): Step 1 has a shotlist table with "Motion" = "What moves, at what pace? Subject motion vs camera motion." — but doesn't warn that most current text-to-video models struggle with complex multi-object motion. A user planning "character walks left while car drives right while leaves blow" will get disappointing results. Add a constraint note: "Current models (Seedance 2, Kling 3/4, Veo 3) handle 1-2 motion elements well; 3+ often drift or freeze. Prioritize the key motion."

<label className="newproj-label">{t('newproj.videoLengthLabel')}</label>
<div className="pill-grid">
  {lengths.map((s) => (
    <button

P3 (minor): pickDefaultSkill logic prefers s.surface === surface && s.mode === surface, falls back to s.mode === surface. The comment says "legacy skills authored without `surface` still get picked up" — but this assumes mode was always set correctly. What if a skill has mode: 'prototype', surface: 'image' (authoring error)? It would never match the image surface (first condition fails on mode, second condition fails on mode !== 'image'). Not a real-world issue today (no such skills ship), but the fallback could be more robust: const modeMatch = skills.find(s => s.mode === surface || s.surface === surface); (match either field).
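
One possible variant of the more robust fallback, sketched under assumptions: skills are plain objects with optional `mode` and `surface` fields, and the `id` field used in the usage note is hypothetical. Unlike the single either-field `find` suggested above, this version keeps a preference order, so an exact match still wins over a partial one.

```javascript
// Pick the default skill for a surface, preferring the strongest match:
//   1. surface and mode both equal the requested surface
//   2. surface matches (mode may be mis-authored, e.g. 'prototype')
//   3. mode matches and surface is absent (legacy skills without `surface`)
// Returns null when nothing matches.
function pickDefaultSkill(skills, surface) {
  return (
    skills.find((s) => s.surface === surface && s.mode === surface) ||
    skills.find((s) => s.surface === surface) ||
    skills.find((s) => s.mode === surface && s.surface === undefined) ||
    null
  );
}
```

For example, given `[{ id: 'a', mode: 'prototype', surface: 'image' }, { id: 'b', mode: 'image' }]`, a request for `'image'` selects `a` through the second tier, which the original two-step lookup would have skipped in favor of `b`.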

@lefarcen lefarcen dismissed their stale review May 2, 2026 03:26

Dismissed — accidental empty approval; defer to the prior COMMENTED review.
