threatcode · pull · Jun 12, 2026
diff --git a/infra/experimental/agent-skills/fuzzing-go-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-go-expert/SKILL.md
@@ -87,6 +87,10 @@ compile_native_go_fuzzer github.com/owner/repo/pkg2 FuzzBar fuzz_bar
 - Dictionaries go in `$OUT/<fuzzer_name>.dict` as plaintext token files.
 - Alternatively, add seeds directly via `f.Add(...)` in the harness — these
   are compiled in and used as the initial corpus.
+- For targets that parse a structured format, generating seeds with a script
+  beats hand-picking a few files — random mutation rarely passes the parser's
+  early checks. See the [structured seed generation
+  reference](../oss-fuzz-engineer/references/structured_seed_generation.md).
 
 ## Characteristics of good Go fuzzing harnesses
 

diff --git a/infra/experimental/agent-skills/fuzzing-jvm-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-jvm-expert/SKILL.md
@@ -143,6 +143,10 @@ and adjust JAR paths accordingly.
 
 - Zip seed files to `$OUT/<fuzzer_name>_seed_corpus.zip`.
 - Place dictionaries at `$OUT/<fuzzer_name>.dict`.
+- For targets that parse a structured format, generating seeds with a script
+  beats hand-picking a few files — random mutation rarely passes the parser's
+  early checks. See the [structured seed generation
+  reference](../oss-fuzz-engineer/references/structured_seed_generation.md).
 
 ## Characteristics of good JVM fuzzing harnesses
 

diff --git a/infra/experimental/agent-skills/fuzzing-memory-unsafe-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-memory-unsafe-expert/SKILL.md
@@ -55,4 +55,22 @@ python3 infra/helper.py check_build <project_name>
 - Always document the rationale for design decisions in the fuzzing harness, and the rationale for why the harness is expected to find bugs. This can be done in a markdown file in the same directory as the fuzzing harness, or in comments in the code of the fuzzing harness itself.
 - Look for function entrypoints that are exposed to untrusted input, and try to design fuzzing harnesses that target these entrypoints. This is often the most effective way to find security bugs.
 - When extending existing fuzzing harnesses, always validate that the existing code coverage does not digress. You should empirically evaluate this and give a justification that no digression has happened, or if it has happened then you should give a justification for why the digression is acceptable in light of the achieved extension.
-- When extending fuzzing harnesses you should give justification for the impact of bugs that they will find.
+- When extending fuzzing harnesses you should give justification for the impact of bugs that they will find.
+
+### Seed corpus and structured generation
+
+A good harness needs a good initial corpus. Place seed files in
+`$OUT/<fuzzer_name>_seed_corpus.zip` and dictionaries in
+`$OUT/<fuzzer_name>.dict`.
+
+For targets that parse a structured format (binary containers like ELF/PE,
+codec/network bitstreams, or text grammars), a few hand-picked sample files
+are rarely enough: random mutation almost never gets past the parser's magic /
+length / checksum checks, so the deep parsing code stays dark. The most
+effective approach is a **script that constructs structurally-valid inputs
+from scratch**, run from `build.sh` and appended to the corpus. It is
+reproducible, needs no external samples, and lets you target specific
+dark-but-reachable code identified from coverage. See the OSS-Fuzz engineer
+skill's [structured seed generation
+reference](../oss-fuzz-engineer/references/structured_seed_generation.md) for
+the full workflow and `projects/vlc/generate_seeds.py` for a worked example.
diff --git a/infra/experimental/agent-skills/fuzzing-python-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-python-expert/SKILL.md
@@ -112,6 +112,10 @@ produces an executable in `$OUT` named after the `.py` file.
   `$OUT/<fuzzer_name>_seed_corpus.zip`.
 - Dictionaries go to `$OUT/<fuzzer_name>.dict` — especially valuable for
   text-format parsers (JSON, XML, YAML, CSV, etc.).
+- For targets that parse a structured format, generating seeds with a script
+  beats hand-picking a few files — random mutation rarely passes the parser's
+  early checks. See the [structured seed generation
+  reference](../oss-fuzz-engineer/references/structured_seed_generation.md).
 
 ## Characteristics of good Python fuzzing harnesses
 

diff --git a/infra/experimental/agent-skills/fuzzing-rust-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-rust-expert/SKILL.md
@@ -115,6 +115,11 @@ ENV RUSTUP_TOOLCHAIN=nightly-2025-07-03
   automatically picked up by cargo-fuzz and can be zipped for OSS-Fuzz.
 - To ship a corpus with OSS-Fuzz copy a zip to `$OUT/<target_name>_seed_corpus.zip`.
 - Dictionaries go to `$OUT/<target_name>.dict`.
+- For targets that parse a structured format, generating seeds with a script
+  beats hand-picking a few files — random mutation rarely passes the parser's
+  early checks (note: the `arbitrary` crate, used with cargo-fuzz, is the better
+  route when the target takes typed data rather than a byte format). See the
+  [structured seed generation reference](../oss-fuzz-engineer/references/structured_seed_generation.md).
 
 ## Characteristics of good Rust fuzzing harnesses
 

diff --git a/infra/experimental/agent-skills/oss-fuzz-engineer/SKILL.md b/infra/experimental/agent-skills/oss-fuzz-engineer/SKILL.md
@@ -47,6 +47,8 @@ A useful approach for extending a project is to study the latest code coverage r
 
 Reading the source code and identifying "important-looking" functions is not sufficient — important functions are frequently already covered. Coverage data from `summary.json` is the authoritative source of truth for what needs work.
 
+**Structured seed generation.** Adding a new harness is not the only way to extend coverage: often the existing harnesses already reach dark code, but the corpus never produces inputs valid enough to enter it. When a target parses a structured format (binary containers, codec/network bitstreams, text grammars), a script that constructs structurally-valid inputs from scratch is frequently the highest-leverage, lowest-review-cost improvement: random bytes rarely pass a parser's early magic/length/checksum checks, so the deep logic stays dark until seeded. Drive this the same coverage-first way: pick reachable files that are dark in `summary.json`, generate seeds that target them, validate that each one actually parses, append them to the existing corpora (never replace), and confirm that coverage on the union does not regress. See the [structured seed generation reference](references/structured_seed_generation.md) for the full workflow, construction techniques, per-fuzzer tailoring, and pitfalls, and `projects/vlc/generate_seeds.py` for a worked example.
+
 Use the local code coverage feature of the `python3 infra/helper.py` tool to generate code coverage reports for fuzz targets locally, for example to validate the code coverage achieved by a new fuzz target. This can be done by running `python3 infra/helper.py introspector --coverage-only PROJECT_NAME` and then studying the generated report in e.g. build/out/PROJECT_NAME/report. Some examples of this include:
 
 ```

diff --git a/...imental/agent-skills/oss-fuzz-engineer/references/structured_seed_generation.md b/...imental/agent-skills/oss-fuzz-engineer/references/structured_seed_generation.md
@@ -0,0 +1,206 @@
+# Structured seed generation
+
+Many fuzz targets parse a structured format: a binary container (ELF, PE,
+Mach-O, archives), a network/codec bitstream (MPEG-TS, HEIF, DV), or a text
+grammar (assembly, a config/definition language). For these, random bytes
+almost never get past the parser's first validity checks (magic numbers,
+length fields, checksums), so the fuzzer wastes effort at the entrance and the
+deep parsing code stays dark.
+
+A small script that **constructs structurally-valid inputs from scratch** is
+the highest-leverage fix: it gives libFuzzer starting points that already pass
+the early checks, so mutation explores the real logic. This is far more
+effective than a handful of hand-picked sample files, and it is reproducible,
+self-contained (no external corpus), and easy to extend.
+
+The canonical example in this repository is
+[`projects/vlc/generate_seeds.py`](../../../../../projects/vlc/generate_seeds.py),
+which builds MPEG-TS, HEIF, DV, VC-1, CDG and MUS streams from first
+principles. Study it before writing your own.
+
+## When to use this
+
+Use a generator script when **coverage shows reachable-but-dark parser code**
+and the format is structured. Do not write seeds for code that is already
+well covered, or for code that is unreachable for reasons a seed cannot fix
+(see "Seed-limited vs harness-limited" below).
+
+## Workflow
+
+1. **Select targets from coverage, not intuition.** Fetch the project's
+   public `summary.json` (see [code_coverage.md](code_coverage.md)), parse the
+   per-file line percentages, and pick files that are **reachable by an
+   existing harness** but sit at low coverage (e.g. < 30%). The production
+   report reflects the full accumulated corpus, so it is the authoritative
+   "what is still dark" signal.
+
+2. **Construct seeds with a script.** Write a `generate_seeds.py` that emits
+   one file per structural variant into a `seeds/<group>/` tree. See
+   "Construction techniques" below.
+
+3. **Validate every seed actually parses — and reaches the target.** A seed
+   that fails the magic/header check yields *zero* coverage. Check each one
+   with the real tool first — e.g. `readelf`/`objdump`/`file` for object files,
+   or run the harness binary on it and confirm it is processed rather than
+   rejected. Then confirm with a coverage run that the seed actually moves the
+   *intended* dark file's coverage; "it parses" is necessary but not
+   sufficient.
+
+4. **Wire it into `build.sh`, appending — never replacing.** Run the script at
+   build time and **add** the seeds to the existing corpus zips so no original
+   seed is lost:
+
+   ```sh
+   python3 $SRC/generate_seeds.py $SRC/generated_seeds
+   for t in target_a target_b; do
+     zip -j $OUT/fuzz_${t}_seed_corpus.zip $SRC/generated_seeds/seeds/<group>/*
+   done
+   ```
+
+   Copy the script in via the `Dockerfile` (`COPY generate_seeds.py $SRC/`).
+
+5. **Measure: no regression, and quantify the gain.** Run coverage on the
+   union (baseline corpus + generated seeds) and confirm it is **>= baseline**
+   (appending guarantees this; verify it). To show the seeds reach genuinely
+   new code, compare per-file covered-line *counts* against the production
+   report: if a generated seed covers more lines of a file than the whole
+   production corpus does, at least the difference is provably new (pigeonhole).
+
+6. **Iterate.** Re-read coverage after adding seeds, find the next
+   dark-but-reachable branch, and add a variant for it. A few rounds of
+   generate -> measure -> target-the-next-gap typically unlock far more than
+   one large batch, and keep each change easy to review.
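The target-selection (step 1) and gain-measurement (step 5) steps above can be sketched as follows. This is a minimal sketch, assuming `summary.json` follows the `llvm-cov export` summary layout (`data[0].files[].summary.lines` with `count`, `covered` and `percent`). The helper names `load_summary`, `dark_files` and `provably_new_lines` are hypothetical and not part of OSS-Fuzz. The script cannot tell whether a file is reachable by an existing harness; that still has to be checked by hand.

```python
import json


def load_summary(path):
    """Load a coverage summary.json from disk."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


def dark_files(summary, max_percent=30.0, min_lines=50):
    """Return (filename, covered, count) for low-coverage files.

    Assumes the llvm-cov export summary layout. Tiny files are skipped
    because they rarely justify a dedicated seed group. Results are sorted
    by the number of uncovered lines, largest first.
    """
    rows = []
    for entry in summary['data'][0]['files']:
        lines = entry['summary']['lines']
        if lines['count'] >= min_lines and lines['percent'] < max_percent:
            rows.append((entry['filename'], lines['covered'], lines['count']))
    return sorted(rows, key=lambda r: r[2] - r[1], reverse=True)


def provably_new_lines(seed_covered, production_covered):
    """Pigeonhole lower bound on newly covered lines in one file.

    If a seed covers more lines than the whole production corpus does,
    at least the difference cannot have been covered before.
    """
    return max(0, seed_covered - production_covered)
```

A typical use would be `dark_files(load_summary('summary.json'))`, followed by a manual check that each listed file is reachable from an existing harness before any seed is written for it.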
+
+## Construction techniques (from `projects/vlc/generate_seeds.py`)
+
+- **Build the framing exactly.** Honor packet boundaries, box/section length
+  fields, and alignment. An off-by-one length usually makes the parser bail
+  before the interesting code.
+- **Compute checksums in the script.** Formats that carry a CRC/hash reject
+  inputs with a wrong one at the header. Implement the checksum (e.g. VLC's
+  `crc32_mpeg`) so sections validate and parsing continues.
+- **Pack fields with `struct`.** Use explicit endianness and the format's
+  reserved-bit conventions, e.g. `struct.pack('>H', 0xE000 | pid)`.
+- **Compose small builders.** Build primitives that nest into larger
+  structures (packet -> PES -> table -> stream); this keeps the script
+  readable and lets you produce many variants cheaply.
+- **Emit multiple variants per format.** Different header values, versions,
+  optional sections and edge-case sizes hit different branches. One
+  parameterized builder over many variants (e.g. one ELF builder over dozens
+  of `e_machine` values) can unlock a whole family of per-target backends.
+- **Map each seed group to the code it targets** in comments, and note what
+  the previous corpus failed to reach — this is the rationale a reviewer needs.
+- **Keep seeds small.** libFuzzer favours small inputs; a minimal-but-valid
+  seed mutates faster and more usefully than a large one. Build the smallest
+  structure that reaches the target code.
+- **Be deterministic.** The script runs on every build, so the corpus must be
+  byte-identical each time — no timestamps, no RNG, no wall-clock. Vary
+  outputs by an explicit index/parameter, not randomness.
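The framing, checksum and `struct` techniques above can be illustrated with an MPEG-TS Program Association Table. This is a minimal sketch and not the code from `projects/vlc/generate_seeds.py`: the function names (`crc32_mpeg2`, `pat_section`, `ts_packet`) are hypothetical, and the field layout follows the standard PAT section format (CRC-32/MPEG-2, polynomial 0x04C11DB7, initial value 0xFFFFFFFF, no final XOR).

```python
import struct


def crc32_mpeg2(data: bytes) -> int:
    """CRC-32/MPEG-2: MSB-first, poly 0x04C11DB7, init 0xFFFFFFFF, no xorout."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc


def pat_section(ts_id, programs, version=0):
    """Build a PAT section; `programs` is a list of (program_number, pmt_pid)."""
    # Each entry: 16-bit program number, 3 reserved bits set, 13-bit PID.
    body = b''.join(struct.pack('>HH', number, 0xE000 | pid)
                    for number, pid in programs)
    # section_length counts everything after the length field, CRC included.
    section_length = 5 + len(body) + 4
    header = struct.pack('>BHHBBB',
                         0x00,                     # table_id: PAT
                         0xB000 | section_length,  # syntax bit + reserved bits
                         ts_id,
                         0xC1 | (version << 1),    # reserved, version, current
                         0, 0)                     # section / last section no.
    payload = header + body
    return payload + struct.pack('>I', crc32_mpeg2(payload))


def ts_packet(pid, section, continuity=0):
    """Wrap one section in a single 188-byte TS packet, padded with 0xFF."""
    header = struct.pack('>BHB', 0x47, 0x4000 | pid, 0x10 | (continuity & 0x0F))
    data = header + b'\x00' + section              # pointer_field = 0
    if len(data) > 188:
        raise ValueError('section does not fit in one packet')
    return data + b'\xff' * (188 - len(data))
```

Because the CRC is appended big-endian with no final XOR, running `crc32_mpeg2` over the whole section (CRC included) yields 0, which is a cheap self-check to run in the generator before writing each seed.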
+
+## Minimal skeleton
+
+`projects/vlc/generate_seeds.py` is the full reference, but it is large; start
+from this shape and grow it. The script takes a corpus root and writes one
+file per variant under `seeds/<group>/`:
+
+```python
+#!/usr/bin/env python3
+import os, struct, sys
+
+def make_widget(variant):
+    # Build the smallest structurally-valid input that reaches the target.
+    # Honor magic, length fields and checksums; vary by `variant`.
+    body = struct.pack('<I', variant)              # ... real structure here
+    return b'WDGT' + struct.pack('<I', len(body)) + body
+
+def main(root):
+    out = os.path.join(root, 'seeds', 'widget')
+    os.makedirs(out, exist_ok=True)
+    for v in range(4):                             # deterministic variants
+        with open(os.path.join(out, f'widget-{v}.bin'), 'wb') as f:
+            f.write(make_widget(v))
+
+if __name__ == '__main__':
+    main(sys.argv[1])
+```
+
+Wire it into `build.sh` (and `COPY generate_seeds.py $SRC/` in the Dockerfile):
+
+```sh
+python3 $SRC/generate_seeds.py $SRC/generated_seeds
+zip -j $OUT/fuzz_widget_seed_corpus.zip $SRC/generated_seeds/seeds/widget/*
+```
+
+## Per-fuzzer tailoring
+
+Tailor seeds to a specific fuzzer **only when its input contract differs** from
+a generic parser:
+
+- A harness gated on a specific target/architecture (it rejects non-matching
+  inputs) should receive only matching seeds — anything else is inert.
+- A harness that exercises a narrow path (e.g. one that only follows
+  separate-debug-file links, not full debug-section dumping) wants seeds for
+  *that* path, not the general format.
+
+For the common case — several harnesses that all parse the same format — a
+single shared, diverse corpus is correct; splitting it per fuzzer adds
+maintenance for no gain (libFuzzer cross-pollinates, and variety helps all of
+them).
+
+## Dictionaries
+
+A generator is a natural place to also emit libFuzzer dictionaries
+(`$OUT/<fuzzer>.dict`) — magic bytes, tag names, keywords. Dictionaries help
+the mutator synthesize tokens it would rarely discover byte-by-byte. VLC emits
+both seeds and `dictionaries/*.dict` from the same script.
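Emitting a dictionary from the generator can be sketched as below. This is a minimal sketch, assuming the usual libFuzzer dictionary syntax (one quoted token per line, with `\\`, `\"` and `\xNN` escapes); `dict_entry` and `render_dictionary` are hypothetical helper names, not taken from the VLC script.

```python
def dict_entry(token: bytes) -> str:
    """Render one token as a quoted libFuzzer dictionary line (no newline)."""
    out = []
    for b in token:
        ch = chr(b)
        if ch in ('"', '\\'):
            out.append('\\' + ch)          # escape quote and backslash
        elif 0x20 <= b < 0x7F:
            out.append(ch)                 # printable ASCII stays literal
        else:
            out.append('\\x%02x' % b)      # everything else as \xNN
    return '"' + ''.join(out) + '"'


def render_dictionary(tokens):
    """Return the full dictionary text, sorted and de-duplicated.

    Sorting keeps the output byte-identical across builds.
    """
    return ''.join(dict_entry(t) + '\n' for t in sorted(set(tokens)))
```

The result of `render_dictionary([b'\x7fELF', b'.text', b'.debug_info'])` could then be written to `$OUT/<fuzzer>.dict` by the same script that writes the seeds.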
+
+## Seed-limited vs harness-limited code
+
+Before generating seeds, confirm the dark code is actually reachable by an
+existing harness. Some code cannot be reached by any input:
+
+- Options disabled in the harness (a `// dump_x` left commented out).
+- Build-time exclusions (e.g. a project built with `--disable-ld` cannot reach
+  linker code).
+- Format ambiguity where the tool refuses to pick a target and bails.
+
+If the code is harness-limited, no seed will help — that needs a harness
+change, which is out of scope for seed work. Note the distinction explicitly
+rather than generating seeds that cannot move coverage.
+
+## Measurement pitfalls
+
+- **Validate the header first.** The most common waste is a seed the parser
+  rejects immediately; it contributes nothing.
+- **Some harnesses break the coverage tooling.** Targets that call `exit()` on
+  bad input or leak memory can make libFuzzer's `-merge` coverage step produce
+  no profile, especially on small or mixed corpora. This is a tooling
+  limitation, not a seed defect; measure such targets on a homogeneous,
+  valid-only corpus, and rely on per-seed validation plus the established
+  principle that a structured starting corpus helps a previously-unseeded
+  harness.
+- **Do not mutate a coverage build's `$OUT`.** Manually removing or copying
+  files inside `build/out/<project>` of a coverage build corrupts its state and
+  makes `helper.py coverage` fail for *all* corpora; rebuild if that happens.
+  Use `helper.py coverage --corpus-dir <dir>` on a clean build to measure a
+  specific corpus.
+
+## When a generator is not enough
+
+A static seed corpus gets the fuzzer past the front door, but for formats with
+deep internal structure (length-prefixed trees, checksummed sub-records) the
+mutator can still corrupt structure faster than it explores logic. If coverage
+plateaus despite good seeds, the next step is structure-aware fuzzing — a
+libFuzzer custom mutator, `FuzzedDataProvider` to split the input, or a
+grammar/`protobuf`-based mutator. That is harness/tooling work beyond seed
+generation, but the seeds you built remain a valuable starting corpus for it.
+
+## Checklist
+
+- [ ] Targets chosen from `summary.json` (reachable, low coverage), not intuition.
+- [ ] Confirmed the dark code is seed-limited, not harness-limited.
+- [ ] Generator is deterministic and emits small, minimal-but-valid seeds.
+- [ ] Each seed validated: it parses *and* moves the intended file's coverage.
+- [ ] Seeds appended to existing corpora (never replaced); script copied in via Dockerfile.
+- [ ] Union coverage measured: no regression, gain quantified vs production.
+- [ ] Each seed group's target code and rationale documented in comments.
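The "deterministic" and "small" checklist items can be checked mechanically. The following is a minimal sketch with hypothetical helper names (`tree_digest`, `oversized_seeds`): run the generator twice into two empty directories, compare the digests, and flag any seed above a size limit of your choosing.

```python
import hashlib
import os


def tree_digest(root):
    """SHA-256 over relative paths and contents of every file under root."""
    h = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # fix the traversal order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            with open(path, 'rb') as f:
                data = f.read()
            # Length-prefix the content so file boundaries are unambiguous.
            h.update(rel.encode() + b'\0')
            h.update(str(len(data)).encode() + b'\0' + data)
    return h.hexdigest()


def oversized_seeds(root, limit):
    """Return sorted relative paths of files larger than `limit` bytes."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.getsize(path) > limit:
                found.append(os.path.relpath(path, root))
    return sorted(found)
```

If `tree_digest` differs between two runs of the same generator, something in it depends on time, randomness or iteration order and should be replaced by an explicit parameter.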
diff --git a/projects/binutils/Dockerfile b/projects/binutils/Dockerfile
@@ -16,9 +16,10 @@
 
 FROM gcr.io/oss-fuzz-base/base-builder
 RUN apt-get update && apt-get install -y make texinfo libgmp-dev libmpfr-dev
-RUN apt-get install -y flex bison
+RUN apt-get update && apt-get install -y flex bison
 RUN git clone --depth=1 https://github.com/DavidKorczynski/binary-samples binary-samples
 RUN git clone --recursive --depth 1 git://sourceware.org/git/binutils-gdb.git binutils-gdb
 WORKDIR $SRC
 COPY build.sh $SRC/
 COPY fuzz_*.c $SRC/
+COPY generate_seeds.py $SRC/
diff --git a/projects/binutils/build.sh b/projects/binutils/build.sh
@@ -175,6 +175,25 @@ fi
 for fuzzname in readelf_pef readelf_elf32_csky readelf_elf64_mmix readelf_elf32_littlearm readelf_elf32_bigarm objdump objdump_safe nm objcopy bfd windres addr2line dwarf; do
   cp $SRC/binary-samples/oss-fuzz-binutils/general_seeds.zip $OUT/fuzz_${fuzzname}_seed_corpus.zip
 done
+
+# Generate structured seeds (see generate_seeds.py) and append them to the
+# relevant corpora; existing seeds are retained.
+python3 $SRC/generate_seeds.py $SRC/generated_seeds
+
+# Object-file seeds -> object-consuming fuzzers.
+GEN_OBJ_SEEDS=$(find $SRC/generated_seeds/seeds/elf_reloc \
+    $SRC/generated_seeds/seeds/dwarf $SRC/generated_seeds/seeds/elf_meta \
+    $SRC/generated_seeds/seeds/archive -type f)
+for fuzzname in readelf readelf_pef readelf_elf32_csky readelf_elf64_mmix \
+    readelf_elf32_littlearm readelf_elf32_bigarm objdump objdump_safe nm \
+    objcopy bfd addr2line dwarf; do
+  zip -j $OUT/fuzz_${fuzzname}_seed_corpus.zip $GEN_OBJ_SEEDS
+done
+
+# Format-specific seeds for the otherwise-unseeded fuzz_as and fuzz_dlltool.
+zip -j $OUT/fuzz_as_seed_corpus.zip $SRC/generated_seeds/seeds/gas/seed.s
+zip -j $OUT/fuzz_dlltool_seed_corpus.zip \
+    $SRC/generated_seeds/seeds/dlltool/seed.def
 # Seed targeted the pef file format
 cp $SRC/binary-samples/oss-fuzz-binutils/fuzz_bfd_ext_seed_corpus.zip $OUT/fuzz_bfd_ext_seed_corpus.zip
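The append-only intent of the `zip -j` calls above can also be expressed in Python. This is a minimal sketch, not part of the change: `append_seeds` is a hypothetical helper. Since `zip -j` stores base names only and replaces an existing entry of the same name, same-named files from different seed groups would overwrite one another; the sketch skips colliding names instead, so original seeds are never replaced.

```python
import os
import zipfile


def append_seeds(zip_path, seed_paths):
    """Append seeds to a corpus zip by base name; return the names added.

    Entries already present in the archive are left untouched. A fixed
    timestamp keeps the archive reproducible across builds.
    """
    added = []
    with zipfile.ZipFile(zip_path, 'a', zipfile.ZIP_DEFLATED) as zf:
        existing = set(zf.namelist())
        for path in sorted(seed_paths):
            name = os.path.basename(path)
            if name in existing:
                continue                     # never replace an existing seed
            info = zipfile.ZipInfo(name, date_time=(1980, 1, 1, 0, 0, 0))
            info.compress_type = zipfile.ZIP_DEFLATED
            with open(path, 'rb') as f:
                zf.writestr(info, f.read())
            existing.add(name)
            added.append(name)
    return added
```

Giving every generated file a group-specific name (for example `elf_reloc-0.bin` rather than `seed-0.bin`) avoids collisions in the first place and keeps the shell version safe as well.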