diff --git a/infra/experimental/agent-skills/fuzzing-go-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-go-expert/SKILL.md index 4c4fcae50174..d325c8932f1d 100644 --- a/infra/experimental/agent-skills/fuzzing-go-expert/SKILL.md +++ b/infra/experimental/agent-skills/fuzzing-go-expert/SKILL.md @@ -87,6 +87,10 @@ compile_native_go_fuzzer github.com/owner/repo/pkg2 FuzzBar fuzz_bar - Dictionaries go in `$OUT/.dict` as plaintext token files. - Alternatively, add seeds directly via `f.Add(...)` in the harness — these are compiled in and used as the initial corpus. +- For targets that parse a structured format, generating seeds with a script + beats hand-picking a few files — random mutation rarely passes the parser's + early checks. See the [structured seed generation + reference](../oss-fuzz-engineer/references/structured_seed_generation.md). ## Characteristics of good Go fuzzing harnesses diff --git a/infra/experimental/agent-skills/fuzzing-jvm-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-jvm-expert/SKILL.md index 7b9a7ce5b2cd..67e1324d3c6f 100644 --- a/infra/experimental/agent-skills/fuzzing-jvm-expert/SKILL.md +++ b/infra/experimental/agent-skills/fuzzing-jvm-expert/SKILL.md @@ -143,6 +143,10 @@ and adjust JAR paths accordingly. - Zip seed files to `$OUT/_seed_corpus.zip`. - Place dictionaries at `$OUT/.dict`. +- For targets that parse a structured format, generating seeds with a script + beats hand-picking a few files — random mutation rarely passes the parser's + early checks. See the [structured seed generation + reference](../oss-fuzz-engineer/references/structured_seed_generation.md). 
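As a rough illustration of "generating seeds with a script", the sketch below builds a few variants of a toy record and packs them into a zip archive. Everything named here is an assumption for illustration, not part of any project: the `DEMO` format, `make_record`, `build_seed_zip` and the file names. Entry timestamps are pinned so the archive should be byte-identical from run to run on the same platform.

```python
import io
import struct
import zipfile


def make_record(variant: int) -> bytes:
    """Toy format (hypothetical): 4-byte magic, u16 version, u32 length, payload."""
    payload = bytes([variant]) * (variant + 1)
    return b"DEMO" + struct.pack(">HI", 1, len(payload)) + payload


def build_seed_zip(variants) -> bytes:
    """Pack one seed per variant into an in-memory zip with fixed timestamps."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        for v in variants:
            # A pinned date_time keeps the archive independent of the wall clock.
            info = zipfile.ZipInfo("demo_v%d.bin" % v,
                                   date_time=(1980, 1, 1, 0, 0, 0))
            zf.writestr(info, make_record(v))
    return buf.getvalue()
```

A build script could write the bytes returned by `build_seed_zip(range(4))` to the seed-corpus zip path described in the list above.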
## Characteristics of good JVM fuzzing harnesses diff --git a/infra/experimental/agent-skills/fuzzing-memory-unsafe-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-memory-unsafe-expert/SKILL.md index f90af7da7684..fbdcd587fafc 100644 --- a/infra/experimental/agent-skills/fuzzing-memory-unsafe-expert/SKILL.md +++ b/infra/experimental/agent-skills/fuzzing-memory-unsafe-expert/SKILL.md @@ -55,4 +55,22 @@ python3 infra/helper.py check_build - Always document the rationale for design decisions in the fuzzing harness, and the rationale for why the harness is expected to find bugs. This can be done in a markdown file in the same directory as the fuzzing harness, or in comments in the code of the fuzzing harness itself. - Look for function entrypoints that are exposed to untrusted input, and try to design fuzzing harnesses that target these entrypoints. This is often the most effective way to find security bugs. - When extending existing fuzzing harnesses, always validate that the existing code coverage does not digress. You should empirically evaluate this and give a justification that no digression has happened, or if it has happened then you should give a justification for why the digression is acceptable in light of the achieved extension. -- When extending fuzzing harnesses you should give justification for the impact of bugs that they will find. \ No newline at end of file +- When extending fuzzing harnesses you should give justification for the impact of bugs that they will find. + +### Seed corpus and structured generation + +A good harness needs a good initial corpus. Place seed files in +`$OUT/_seed_corpus.zip` and dictionaries in +`$OUT/.dict`. + +For targets that parse a structured format (binary containers like ELF/PE, or +codec/network bitstreams, or text grammars), a few hand-picked sample files +are rarely enough: random mutation almost never gets past the parser's magic / +length / checksum checks, so the deep parsing code stays dark. 
The most +effective approach is a **script that constructs structurally-valid inputs +from scratch**, run from `build.sh` and appended to the corpus. It is +reproducible, needs no external samples, and lets you target specific +dark-but-reachable code identified from coverage. See the OSS-Fuzz engineer +skill's [structured seed generation +reference](../oss-fuzz-engineer/references/structured_seed_generation.md) for +the full workflow and `projects/vlc/generate_seeds.py` for a worked example. \ No newline at end of file diff --git a/infra/experimental/agent-skills/fuzzing-python-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-python-expert/SKILL.md index b1fc8f1557de..e656bc2674a4 100644 --- a/infra/experimental/agent-skills/fuzzing-python-expert/SKILL.md +++ b/infra/experimental/agent-skills/fuzzing-python-expert/SKILL.md @@ -112,6 +112,10 @@ produces an executable in `$OUT` named after the `.py` file. `$OUT/_seed_corpus.zip`. - Dictionaries go to `$OUT/.dict` — especially valuable for text-format parsers (JSON, XML, YAML, CSV, etc.). +- For targets that parse a structured format, generating seeds with a script + beats hand-picking a few files — random mutation rarely passes the parser's + early checks. See the [structured seed generation + reference](../oss-fuzz-engineer/references/structured_seed_generation.md). ## Characteristics of good Python fuzzing harnesses diff --git a/infra/experimental/agent-skills/fuzzing-rust-expert/SKILL.md b/infra/experimental/agent-skills/fuzzing-rust-expert/SKILL.md index 904e4d1c5e3e..7a06335a4375 100644 --- a/infra/experimental/agent-skills/fuzzing-rust-expert/SKILL.md +++ b/infra/experimental/agent-skills/fuzzing-rust-expert/SKILL.md @@ -115,6 +115,11 @@ ENV RUSTUP_TOOLCHAIN=nightly-2025-07-03 automatically picked up by cargo-fuzz and can be zipped for OSS-Fuzz. - To ship a corpus with OSS-Fuzz copy a zip to `$OUT/_seed_corpus.zip`. - Dictionaries go to `$OUT/.dict`. 
+- For targets that parse a structured format, generating seeds with a script + beats hand-picking a few files — random mutation rarely passes the parser's + early checks (note: cargo-fuzz's `arbitrary` is the better route when the + target takes typed data rather than a byte format). See the [structured seed + generation reference](../oss-fuzz-engineer/references/structured_seed_generation.md). ## Characteristics of good Rust fuzzing harnesses diff --git a/infra/experimental/agent-skills/oss-fuzz-engineer/SKILL.md b/infra/experimental/agent-skills/oss-fuzz-engineer/SKILL.md index 1eab52e063fa..86281b4b72bd 100644 --- a/infra/experimental/agent-skills/oss-fuzz-engineer/SKILL.md +++ b/infra/experimental/agent-skills/oss-fuzz-engineer/SKILL.md @@ -47,6 +47,8 @@ A useful approach for extending a project is to study the latest code coverage r Reading the source code and identifying "important-looking" functions is not sufficient — important functions are frequently already covered. Coverage data from `summary.json` is the authoritative source of truth for what needs work. +**Structured seed generation.** Adding a new harness is not the only way to extend coverage — often the existing harnesses already reach dark code, but the corpus never produces inputs valid enough to enter it. When a target parses a structured format (binary containers, codec/network bitstreams, text grammars), a script that constructs structurally-valid inputs from scratch is frequently the highest-leverage, lowest-review-cost improvement: random bytes rarely pass a parser's early magic/length/checksum checks, so the deep logic stays dark until seeded. Drive this the same coverage-first way: pick reachable files that are dark in `summary.json`, generate seeds that target them, validate each one actually parses, append them to the existing corpora (never replace), and confirm the union does not digress. 
See the [structured seed generation reference](references/structured_seed_generation.md) for the full workflow, construction techniques, per-fuzzer tailoring, and pitfalls, and `projects/vlc/generate_seeds.py` for a worked example. + Use the local code coverage feature of the `python3 infra/helper.py` tool to generate code coverage reports for fuzz targets locally, for example to validate the code coverage achieved by a new fuzz target. This can be done by running `python3 infra/helper.py introspector --coverage-only PROJECT_NAME` and then studying the generated report in e.g. build/out/PROJECT_NAME/report. Some examples of this include: ``` diff --git a/infra/experimental/agent-skills/oss-fuzz-engineer/references/structured_seed_generation.md b/infra/experimental/agent-skills/oss-fuzz-engineer/references/structured_seed_generation.md new file mode 100644 index 000000000000..f424cf9b634e --- /dev/null +++ b/infra/experimental/agent-skills/oss-fuzz-engineer/references/structured_seed_generation.md @@ -0,0 +1,206 @@ +# Structured seed generation + +Many fuzz targets parse a structured format: a binary container (ELF, PE, +Mach-O, archives), a network/codec bitstream (MPEG-TS, HEIF, DV), or a text +grammar (assembly, a config/definition language). For these, random bytes +almost never get past the parser's first validity checks (magic numbers, +length fields, checksums), so the fuzzer wastes effort at the entrance and the +deep parsing code stays dark. + +A small script that **constructs structurally-valid inputs from scratch** is +the highest-leverage fix: it gives libFuzzer starting points that already pass +the early checks, so mutation explores the real logic. This is far more +effective than a handful of hand-picked sample files, and it is reproducible, +self-contained (no external corpus), and easy to extend. 
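To make the "early checks" point concrete, here is a minimal sketch under stated assumptions: a hypothetical `TOY1` format with a magic, a length field and a CRC-32 trailer, plus a stand-in for a parser's front door. It uses `zlib.crc32` purely for illustration; a real format may use a different checksum (the VLC script implements its own MPEG CRC).

```python
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical 4-byte magic, not a real format


def build_input(payload: bytes) -> bytes:
    """Frame as: magic | u32 LE length | payload | u32 LE CRC-32 of payload."""
    return (MAGIC + struct.pack("<I", len(payload)) + payload
            + struct.pack("<I", zlib.crc32(payload)))


def passes_early_checks(data: bytes) -> bool:
    """Stand-in for a parser's first checks: magic, then length, then checksum."""
    if len(data) < 12 or data[:4] != MAGIC:
        return False
    (length,) = struct.unpack_from("<I", data, 4)
    if len(data) != 8 + length + 4:
        return False
    (crc,) = struct.unpack_from("<I", data, 8 + length)
    return crc == zlib.crc32(data[8:8 + length])
```

A constructed input passes all three checks, while arbitrary bytes, or a valid input with a single corrupted payload byte, are rejected before any deeper logic would run. That is the gap a generator script is meant to close.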
+ +The canonical example in this repository is +[`projects/vlc/generate_seeds.py`](../../../../../projects/vlc/generate_seeds.py), +which builds MPEG-TS, HEIF, DV, VC-1, CDG and MUS streams from first +principles. Study it before writing your own. + +## When to use this + +Use a generator script when **coverage shows reachable-but-dark parser code** +and the format is structured. Do not write seeds for code that is already +well covered, or for code that is unreachable for reasons a seed cannot fix +(see "Seed-limited vs harness-limited" below). + +## Workflow + +1. **Select targets from coverage, not intuition.** Fetch the project's + public `summary.json` (see [code_coverage.md](code_coverage.md)), parse the + per-file line percentages, and pick files that are **reachable by an + existing harness** but sit at low coverage (e.g. < 30%). The production + report reflects the full accumulated corpus, so it is the authoritative + "what is still dark" signal. + +2. **Construct seeds with a script.** Write a `generate_seeds.py` that emits + one file per structural variant into a `seeds//` tree. See + "Construction techniques" below. + +3. **Validate every seed actually parses — and reaches the target.** A seed + that fails the magic/header check yields *zero* coverage. Check each one + with the real tool first — e.g. `readelf`/`objdump`/`file` for object files, + or run the harness binary on it and confirm it is processed rather than + rejected. Then confirm with a coverage run that the seed actually moves the + *intended* dark file's coverage; "it parses" is necessary but not + sufficient. + +4. 
**Wire it into `build.sh`, appending — never replacing.** Run the script at + build time and **add** the seeds to the existing corpus zips so no original + seed is lost: + + ```sh + python3 $SRC/generate_seeds.py $SRC/generated_seeds + for t in target_a target_b; do + zip -j $OUT/fuzz_${t}_seed_corpus.zip $SRC/generated_seeds/seeds//* + done + ``` + + Copy the script in via the `Dockerfile` (`COPY generate_seeds.py $SRC/`). + +5. **Measure: no digression, and quantify the gain.** Run coverage on the + union (baseline corpus + generated seeds) and confirm it is **>= baseline** + (appending guarantees this; verify it). To show the seeds reach genuinely + new code, compare per-file covered-line *counts* against the production + report: if a generated seed covers more lines of a file than the whole + production corpus does, those extra lines are provably new (pigeonhole). + +6. **Iterate.** Re-read coverage after adding seeds, find the next dark-but- + reachable branch, and add a variant for it. A few rounds of generate -> + measure -> target-the-next-gap typically unlock far more than one large + batch, and keep each change easy to review. + +## Construction techniques (from `projects/vlc/generate_seeds.py`) + +- **Build the framing exactly.** Honor packet boundaries, box/section length + fields, and alignment. An off-by-one length usually makes the parser bail + before the interesting code. +- **Compute checksums in the script.** Formats that carry a CRC/hash reject + inputs with a wrong one at the header. Implement the checksum (e.g. VLC's + `crc32_mpeg`) so sections validate and parsing continues. +- **Pack fields with `struct`.** Use explicit endianness and the format's + reserved-bit conventions, e.g. `struct.pack('>H', 0xE000 | pid)`. +- **Compose small builders.** Build primitives that nest into larger + structures (packet -> PES -> table -> stream); this keeps the script + readable and lets you produce many variants cheaply. 
+- **Emit multiple variants per format.** Different header values, versions,
+  optional sections and edge-case sizes hit different branches. One
+  parameterized builder over many variants (e.g. one ELF builder over dozens
+  of `e_machine` values) can unlock a whole family of per-target backends.
+- **Map each seed group to the code it targets** in comments, and note what
+  the previous corpus failed to reach; this is the rationale a reviewer needs.
+- **Keep seeds small.** libFuzzer favours small inputs; a minimal-but-valid
+  seed mutates faster and more usefully than a large one. Build the smallest
+  structure that reaches the target code.
+- **Be deterministic.** The script runs on every build, so the corpus must be
+  byte-identical each time: no timestamps, no RNG, no wall-clock. Vary
+  outputs by an explicit index/parameter, not randomness.
+
+## Minimal skeleton
+
+`projects/vlc/generate_seeds.py` is the full reference, but it is large; start
+from this shape and grow it. The script takes a corpus root and writes one
+file per variant under `seeds//`:
+
+```python
+#!/usr/bin/env python3
+import os, struct, sys
+
+def make_widget(variant):
+    # Build the smallest structurally-valid input that reaches the target.
+    # Honor magic, length fields and checksums; vary by `variant`.
+    # Assumed layout, for illustration only: magic, u16 version, u16 variant.
+    return struct.pack('<4sHH', b'WDGT', 1, variant)
+
+# Hypothetical group name "widget"; one file per variant under the corpus root.
+out = os.path.join(sys.argv[1], 'seeds', 'widget')
+os.makedirs(out, exist_ok=True)
+for variant in range(4):
+    with open(os.path.join(out, 'widget_%d.bin' % variant), 'wb') as f:
+        f.write(make_widget(variant))
+```
+
+Pair the seeds with a dictionary (`$OUT/.dict`): magic bytes, tag names,
+keywords. Dictionaries help the mutator synthesize tokens it would rarely
+discover byte-by-byte. VLC emits both seeds and `dictionaries/*.dict` from the
+same script.
+
+## Seed-limited vs harness-limited code
+
+Before generating seeds, confirm the dark code is actually reachable by an
+existing harness. Some code cannot be reached by any input:
+
+- Options disabled in the harness (a `// dump_x` left commented out).
+- Build-time exclusions (e.g. a project built with `--disable-ld` cannot reach
+  linker code).
+- Format ambiguity where the tool refuses to pick a target and bails.
+
+If the code is harness-limited, no seed will help; that needs a harness
+change, which is out of scope for seed work. Note the distinction explicitly
+rather than generating seeds that cannot move coverage.
+
+## Measurement pitfalls
+
+- **Validate the header first.** The most common waste is a seed the parser
+  rejects immediately; it contributes nothing.
+- **Some harnesses break the coverage tooling.** Targets that call `exit()` on
+  bad input or leak memory can make libFuzzer's `-merge` coverage step produce
+  no profile, especially on small or mixed corpora. This is a tooling
+  limitation, not a seed defect; measure such targets on a homogeneous,
+  valid-only corpus, and rely on per-seed validation plus the established
+  principle that a structured starting corpus helps a previously-unseeded
+  harness.
+- **Do not mutate a coverage build's `$OUT`.** Manually removing or copying
+  files inside `build/out/` of a coverage build corrupts its state and
+  makes `helper.py coverage` fail for *all* corpora; rebuild if that happens.
+  Use `helper.py coverage --corpus-dir ` on a clean build to measure a
+  specific corpus.
+
+## When a generator is not enough
+
+A static seed corpus gets the fuzzer past the front door, but for formats with
+deep internal structure (length-prefixed trees, checksummed sub-records) the
+mutator can still corrupt structure faster than it explores logic. If coverage
+plateaus despite good seeds, the next step is structure-aware fuzzing: a
+libFuzzer custom mutator, `FuzzedDataProvider` to split the input, or a
+grammar/`protobuf`-based mutator. That is harness/tooling work beyond seed
+generation, but the seeds you built remain a valuable starting corpus for it.
+
+## Checklist
+
+- [ ] Targets chosen from `summary.json` (reachable, low coverage), not intuition.
+- [ ] Confirmed the dark code is seed-limited, not harness-limited.
+- [ ] Generator is deterministic and emits small, minimal-but-valid seeds.
+- [ ] Each seed validated: it parses *and* moves the intended file's coverage. +- [ ] Seeds appended to existing corpora (never replaced); script copied in via Dockerfile. +- [ ] Union coverage measured: no digression, gain quantified vs production. +- [ ] Each seed group's target code and rationale documented in comments. diff --git a/projects/binutils/Dockerfile b/projects/binutils/Dockerfile index 1327819c93e4..aa553a538334 100644 --- a/projects/binutils/Dockerfile +++ b/projects/binutils/Dockerfile @@ -16,9 +16,10 @@ FROM gcr.io/oss-fuzz-base/base-builder RUN apt-get update && apt-get install -y make texinfo libgmp-dev libmpfr-dev -RUN apt-get install -y flex bison +RUN apt-get update && apt-get install -y flex bison RUN git clone --depth=1 https://github.com/DavidKorczynski/binary-samples binary-samples RUN git clone --recursive --depth 1 git://sourceware.org/git/binutils-gdb.git binutils-gdb WORKDIR $SRC COPY build.sh $SRC/ COPY fuzz_*.c $SRC/ +COPY generate_seeds.py $SRC/ diff --git a/projects/binutils/build.sh b/projects/binutils/build.sh index 7a2603b18d1a..1e22bcabbe83 100755 --- a/projects/binutils/build.sh +++ b/projects/binutils/build.sh @@ -175,6 +175,25 @@ fi for fuzzname in readelf_pef readelf_elf32_csky readelf_elf64_mmix readelf_elf32_littlearm readelf_elf32_bigarm objdump objdump_safe nm objcopy bfd windres addr2line dwarf; do cp $SRC/binary-samples/oss-fuzz-binutils/general_seeds.zip $OUT/fuzz_${fuzzname}_seed_corpus.zip done + +# Generate structured seeds (see generate_seeds.py) and append them to the +# relevant corpora; existing seeds are retained. +python3 $SRC/generate_seeds.py $SRC/generated_seeds + +# Object-file seeds -> object-consuming fuzzers. 
+GEN_OBJ_SEEDS=$(find $SRC/generated_seeds/seeds/elf_reloc \ + $SRC/generated_seeds/seeds/dwarf $SRC/generated_seeds/seeds/elf_meta \ + $SRC/generated_seeds/seeds/archive -type f) +for fuzzname in readelf readelf_pef readelf_elf32_csky readelf_elf64_mmix \ + readelf_elf32_littlearm readelf_elf32_bigarm objdump objdump_safe nm \ + objcopy bfd addr2line dwarf; do + zip -j $OUT/fuzz_${fuzzname}_seed_corpus.zip $GEN_OBJ_SEEDS +done + +# Format-specific seeds for the otherwise-unseeded fuzz_as and fuzz_dlltool. +zip -j $OUT/fuzz_as_seed_corpus.zip $SRC/generated_seeds/seeds/gas/seed.s +zip -j $OUT/fuzz_dlltool_seed_corpus.zip \ + $SRC/generated_seeds/seeds/dlltool/seed.def # Seed targeted the pef file format cp $SRC/binary-samples/oss-fuzz-binutils/fuzz_bfd_ext_seed_corpus.zip $OUT/fuzz_bfd_ext_seed_corpus.zip diff --git a/projects/binutils/generate_seeds.py b/projects/binutils/generate_seeds.py new file mode 100644 index 000000000000..1aa9d6165ed0 --- /dev/null +++ b/projects/binutils/generate_seeds.py @@ -0,0 +1,949 @@ +#!/usr/bin/env python3 +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +# Generates structured seed corpora for the binutils OSS-Fuzz fuzz targets, +# built from scratch (no cross-toolchain). The historical general_seeds.zip is +# x86-dominated and ships only fully-linked binaries, leaving the per-arch BFD +# ELF backends, the DWARF readers, ELF metadata paths, and the unseeded +# fuzz_as/fuzz_dlltool harnesses dark. 
This script emits, under seeds//: +# +# elf_reloc/ relocatable ELF across ~50 architectures, each .rela/.rel +# spanning the arch's reloc types -> elfNN-.c howtos +# dwarf/ ELF with a wide set of .debug_* sections (v4/v5) -> dwarf.c +# elf_meta/ object attributes, notes, symbol versioning, .dynamic +# archive/ ar archives wrapping the above +# gas/ i386 assembly (.s) -> fuzz_as +# dlltool/ module-definition (.def) -> fuzz_dlltool +# +# Usage: generate_seeds.py + +import os +import struct +import sys + + +# ────────────────────────────────────────────────────────────────────────── +# ELF constants +# ────────────────────────────────────────────────────────────────────────── +ELFCLASS32, ELFCLASS64 = 1, 2 +ELFDATA2LSB, ELFDATA2MSB = 1, 2 +ET_REL = 1 +EV_CURRENT = 1 + +SHT_PROGBITS = 1 +SHT_SYMTAB = 2 +SHT_STRTAB = 3 +SHT_RELA = 4 +SHT_NOBITS = 8 +SHT_REL = 9 + +SHF_WRITE = 0x1 +SHF_ALLOC = 0x2 +SHF_EXECINSTR = 0x4 + +STB_GLOBAL = 1 +STT_NOTYPE, STT_OBJECT, STT_FUNC, STT_SECTION = 0, 1, 2, 3 + +# EM_*, ELF class, ELF data (endianness), and the highest relocation type +# number worth emitting for each architecture (from include/elf/.h). 
+# Format: name -> (e_machine, elfclass, data, use_rela, max_reloc, e_flags) +ARCHES = { + "riscv64": (243, ELFCLASS64, ELFDATA2LSB, True, 65, 0), + "riscv32": (243, ELFCLASS32, ELFDATA2LSB, True, 65, 0), + "loongarch64": (258, ELFCLASS64, ELFDATA2LSB, True, 130, 0), + "loongarch32": (258, ELFCLASS32, ELFDATA2LSB, True, 130, 0), + "csky": (252, ELFCLASS32, ELFDATA2LSB, True, 64, 0), + "aarch64": (183, ELFCLASS64, ELFDATA2LSB, True, 600, 0), + "ppc64": (21, ELFCLASS64, ELFDATA2LSB, True, 254, 0), + "ppc": (20, ELFCLASS32, ELFDATA2MSB, True, 255, 0), + "mips": (8, ELFCLASS32, ELFDATA2MSB, False, 254, 0), + "mips64": (8, ELFCLASS64, ELFDATA2MSB, True, 254, 0), + "arm": (40, ELFCLASS32, ELFDATA2LSB, False, 255, 0), + "s390": (22, ELFCLASS64, ELFDATA2MSB, True, 90, 0), + "sparc": (2, ELFCLASS32, ELFDATA2MSB, True, 252, 0), + "sparcv9": (43, ELFCLASS64, ELFDATA2MSB, True, 252, 0), + "sh": (42, ELFCLASS32, ELFDATA2LSB, True, 255, 0), + "m68k": (4, ELFCLASS32, ELFDATA2MSB, True, 68, 0), + "microblaze": (189, ELFCLASS32, ELFDATA2MSB, True, 33, 0), + # Additional architectures whose BFD ELF backends are fully dark (0%) or + # near-dark in the production OSS-Fuzz coverage report, simply because the + # corpus contains no object of that machine type. e_machine + endianness + # must match the canonical target or BFD will not select the backend. 
+ "pru": (144, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "ip2k": (101, ELFCLASS32, ELFDATA2MSB, True, 14, 0), + "fr30": (84, ELFCLASS32, ELFDATA2MSB, True, 12, 0), + "m68hc11": (70, ELFCLASS32, ELFDATA2MSB, True, 24, 0), + "xstormy16": (0xad45, ELFCLASS32, ELFDATA2LSB, True, 129, 0), + "epiphany": (0x1223, ELFCLASS32, ELFDATA2LSB, True, 16, 0), + "ft32": (222, ELFCLASS32, ELFDATA2LSB, True, 14, 0), + "moxie": (223, ELFCLASS32, ELFDATA2MSB, True, 4, 0), + "rx": (173, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "rl78": (197, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "mn10300": (89, ELFCLASS32, ELFDATA2LSB, True, 36, 0), + "cr16": (177, ELFCLASS32, ELFDATA2LSB, True, 32, 0), + "crx": (114, ELFCLASS32, ELFDATA2LSB, True, 20, 0), + "mep": (0xF00D, ELFCLASS32, ELFDATA2MSB, True, 24, 0), + "nds32": (167, ELFCLASS32, ELFDATA2LSB, True, 60, 0), + "or1k": (92, ELFCLASS32, ELFDATA2MSB, True, 56, 0), + "m32r": (88, ELFCLASS32, ELFDATA2MSB, True, 50, 0), + "tilegx": (191, ELFCLASS64, ELFDATA2LSB, True, 130, 0), + "tilepro": (188, ELFCLASS32, ELFDATA2LSB, True, 90, 0), + "metag": (174, ELFCLASS32, ELFDATA2LSB, True, 62, 0), + "vax": (75, ELFCLASS32, ELFDATA2LSB, True, 12, 0), + "frv": (0x5441, ELFCLASS32, ELFDATA2MSB, True, 60, 0), + # Iteration 3: further dark ELF backends (elf32-xtensa.c 6893 lines @0%, + # elf32/64-kvx.c, elf32-arc.c, elf32-v850.c, elf32-score.c, ...). 
+ "xtensa": (94, ELFCLASS32, ELFDATA2LSB, True, 60, 0), + "arc": (93, ELFCLASS32, ELFDATA2LSB, True, 60, 0), # EM_ARC_COMPACT + "avr": (83, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "cris": (76, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "d10v": (85, ELFCLASS32, ELFDATA2MSB, True, 20, 0), + "h8300": (46, ELFCLASS32, ELFDATA2MSB, True, 30, 0), + "iq2000": (0xFEBA, ELFCLASS32, ELFDATA2MSB, True, 20, 0), + "kvx": (256, ELFCLASS64, ELFDATA2LSB, True, 100, 0), + "lm32": (138, ELFCLASS32, ELFDATA2MSB, True, 30, 0), + "m32c": (120, ELFCLASS32, ELFDATA2LSB, True, 30, 0), + "msp430": (105, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "mt": (0x2530, ELFCLASS32, ELFDATA2MSB, True, 12, 0), + "score": (135, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "v850": (87, ELFCLASS32, ELFDATA2LSB, True, 40, 0), + "bpf": (247, ELFCLASS64, ELFDATA2LSB, True, 12, 0), + # Matching seeds for the arch-targeted readelf fuzzers (mmix, big-endian + # arm); csky and little-endian arm are already covered above. + "mmix": (80, ELFCLASS64, ELFDATA2MSB, True, 40, 0), + "armbe": (40, ELFCLASS32, ELFDATA2MSB, False, 255, 0), +} + + +class StringTable: + """An ELF string table: index 0 is the empty string.""" + + def __init__(self): + self.buf = bytearray(b"\x00") + self.offsets = {"": 0} + + def add(self, s): + if s in self.offsets: + return self.offsets[s] + off = len(self.buf) + self.offsets[s] = off + self.buf += s.encode() + b"\x00" + return off + + def bytes(self): + return bytes(self.buf) + + +class ElfObject: + """Builds a minimal but structurally valid ET_REL ELF object. + + Sections are appended in order; offsets and the section header table are + laid out by build(). Section index 0 is the reserved SHN_UNDEF entry. 
+ """ + + def __init__(self, e_machine, elfclass=ELFCLASS64, data=ELFDATA2LSB, + e_flags=0): + self.machine = e_machine + self.elfclass = elfclass + self.data = data + self.e_flags = e_flags + self.end = "<" if data == ELFDATA2LSB else ">" + self.is64 = elfclass == ELFCLASS64 + self.shstrtab = StringTable() + # Each section: dict(name, type, flags, link, info, addralign, + # entsize, data) + self.sections = [dict(name="", type=0, flags=0, link=0, info=0, + addralign=0, entsize=0, data=b"")] + + def add_section(self, name, stype, data=b"", flags=0, link=0, info=0, + addralign=1, entsize=0): + self.shstrtab.add(name) + self.sections.append(dict(name=name, type=stype, flags=flags, + link=link, info=info, addralign=addralign, + entsize=entsize, data=bytes(data))) + return len(self.sections) - 1 + + def section_index(self, name): + for i, s in enumerate(self.sections): + if s["name"] == name: + return i + return 0 + + def build(self): + e = self.end + # Append the section-header string table as the final section. + shstr_idx = len(self.sections) + self.shstrtab.add(".shstrtab") + self.sections.append(dict(name=".shstrtab", type=SHT_STRTAB, + flags=0, link=0, info=0, addralign=1, + entsize=0, data=self.shstrtab.bytes())) + + ehsize = 64 if self.is64 else 52 + shentsize = 64 if self.is64 else 40 + + # Lay out section payloads after the ELF header. + offset = ehsize + for s in self.sections: + if s["type"] == 0 or s["type"] == SHT_NOBITS: + s["offset"] = 0 if s["type"] == 0 else offset + continue + align = max(s["addralign"], 1) + if offset % align: + offset += align - (offset % align) + s["offset"] = offset + offset += len(s["data"]) + + # Section header table goes after all payloads, 8-byte aligned. + if offset % 8: + offset += 8 - (offset % 8) + shoff = offset + + # ELF header. 
+ ident = bytearray(16) + ident[0:4] = b"\x7fELF" + ident[4] = self.elfclass + ident[5] = self.data + ident[6] = EV_CURRENT + out = bytearray(ident) + if self.is64: + out += struct.pack(e + "HHIQQQIHHHHHH", + ET_REL, self.machine, EV_CURRENT, + 0, 0, shoff, self.e_flags, + ehsize, 0, 0, shentsize, + len(self.sections), shstr_idx) + else: + out += struct.pack(e + "HHIIIIIHHHHHH", + ET_REL, self.machine, EV_CURRENT, + 0, 0, shoff, self.e_flags, + ehsize, 0, 0, shentsize, + len(self.sections), shstr_idx) + + # Section payloads. + for s in self.sections: + if s["type"] == 0 or s["type"] == SHT_NOBITS: + continue + while len(out) < s["offset"]: + out += b"\x00" + out += s["data"] + + # Section header table. + while len(out) < shoff: + out += b"\x00" + for s in self.sections: + name_off = self.shstrtab.offsets[s["name"]] + size = 0 if s["type"] == 0 else len(s["data"]) + if self.is64: + out += struct.pack(e + "IIQQQQIIQQ", + name_off, s["type"], s["flags"], 0, + s["offset"], size, s["link"], s["info"], + s["addralign"], s["entsize"]) + else: + out += struct.pack(e + "IIIIIIIIII", + name_off, s["type"], s["flags"], 0, + s["offset"], size, s["link"], s["info"], + s["addralign"], s["entsize"]) + return bytes(out) + + # ── symbol / relocation helpers ────────────────────────────────────── + + def sym(self, name_off, info, shndx, value=0, size=0): + e = self.end + if self.is64: + return struct.pack(e + "IBBHQQ", name_off, info, 0, shndx, + value, size) + return struct.pack(e + "IIIBBH", name_off, value, size, info, 0, + shndx) + + def r_info(self, symidx, rtype): + if self.is64: + return (symidx << 32) | (rtype & 0xffffffff) + return (symidx << 8) | (rtype & 0xff) + + def rela(self, offset, symidx, rtype, addend=0): + e = self.end + if self.is64: + return struct.pack(e + "QQq", offset, self.r_info(symidx, rtype), + addend) + return struct.pack(e + "IIi", offset, self.r_info(symidx, rtype), + addend) + + def rel(self, offset, symidx, rtype): + e = self.end + if self.is64: + 
return struct.pack(e + "QQ", offset, self.r_info(symidx, rtype)) + return struct.pack(e + "II", offset, self.r_info(symidx, rtype)) + + +# ────────────────────────────────────────────────────────────────────────── +# Multi-architecture relocatable objects +# ────────────────────────────────────────────────────────────────────────── +def make_reloc_object(arch): + """A relocatable ELF object whose relocation section spans the + architecture's reloc type range, exercising elfNN-.c howto lookups + and the generic reloc readers in objdump/readelf.""" + machine, elfclass, data, use_rela, max_reloc, eflags = ARCHES[arch] + obj = ElfObject(machine, elfclass, data, eflags) + + # .text with a little content for relocs to point at. + text = obj.add_section(".text", SHT_PROGBITS, b"\x00" * 256, + flags=SHF_ALLOC | SHF_EXECINSTR, addralign=4) + obj.add_section(".data", SHT_PROGBITS, b"\x00" * 64, + flags=SHF_ALLOC | SHF_WRITE, addralign=4) + + # Symbol + string tables: one local section symbol, a few named globals. + strtab = StringTable() + syms = [obj.sym(0, 0, 0)] # null symbol + syms.append(obj.sym(0, STT_SECTION, text)) # local binding (STB_LOCAL is 0), so it sits below first_global + names = ["foo", "bar", "_start", "data_sym"] + first_global = len(syms) + for n in names: + no = strtab.add(n) + syms.append(obj.sym(no, (STB_GLOBAL << 4) | STT_FUNC, text, 0, 4)) + strtab_idx = obj.add_section(".strtab", SHT_STRTAB, strtab.bytes()) + symtab_idx = obj.add_section( + ".symtab", SHT_SYMTAB, b"".join(syms), link=strtab_idx, + info=first_global, addralign=8, + entsize=24 if obj.is64 else 16) + + # Relocation section spanning the architecture's reloc types. 
+ nsyms = len(syms) + entries = [] + off = 0 + for rtype in range(0, max_reloc + 1): + symidx = 1 + (rtype % (nsyms - 1)) if nsyms > 1 else 0 + if use_rela: + entries.append(obj.rela(off % 256, symidx, rtype, addend=rtype)) + else: + entries.append(obj.rel(off % 256, symidx, rtype)) + off += 4 + if use_rela: + obj.add_section(".rela.text", SHT_RELA, b"".join(entries), + link=symtab_idx, info=text, addralign=8, + entsize=24 if obj.is64 else 12) + else: + obj.add_section(".rel.text", SHT_REL, b"".join(entries), + link=symtab_idx, info=text, addralign=4, + entsize=16 if obj.is64 else 8) + return obj.build() + + +# ────────────────────────────────────────────────────────────────────────── +# DWARF debug information +# ────────────────────────────────────────────────────────────────────────── +def _uleb(v): + out = bytearray() + while True: + b = v & 0x7f + v >>= 7 + if v: + out.append(b | 0x80) + else: + out.append(b) + break + return bytes(out) + + +# DWARF constants +DW_TAG_compile_unit = 0x11 +DW_TAG_subprogram = 0x2e +DW_TAG_base_type = 0x24 +DW_TAG_variable = 0x34 +DW_CHILDREN_yes, DW_CHILDREN_no = 1, 0 +DW_AT_name = 0x03 +DW_AT_producer = 0x25 +DW_AT_language = 0x13 +DW_AT_low_pc = 0x11 +DW_AT_high_pc = 0x12 +DW_AT_comp_dir = 0x1b +DW_AT_stmt_list = 0x10 +DW_AT_byte_size = 0x0b +DW_AT_encoding = 0x3e +DW_AT_type = 0x49 +DW_FORM_addr = 0x01 +DW_FORM_data1 = 0x0b +DW_FORM_data2 = 0x05 +DW_FORM_data4 = 0x06 +DW_FORM_string = 0x08 +DW_FORM_strp = 0x0e +DW_FORM_ref4 = 0x13 +DW_FORM_sec_offset = 0x17 + + +def make_dwarf_object(version=4, is64=True): + """ELF object with hand-built DWARF .debug_* sections, exercising the + DWARF readers in binutils/dwarf.c and bfd/dwarf2.c.""" + machine = 62 if is64 else 3 # x86-64 / i386 host arches + elfclass = ELFCLASS64 if is64 else ELFCLASS32 + obj = ElfObject(machine, elfclass, ELFDATA2LSB) + obj.add_section(".text", SHT_PROGBITS, b"\x90" * 64, + flags=SHF_ALLOC | SHF_EXECINSTR, addralign=16) + + dstr = StringTable() + p_off = 
dstr.add("GNU C generated-seed " + str(version)) + n_off = dstr.add("seed.c") + cd_off = dstr.add("/seed") + + # .debug_abbrev: one CU abbrev (code 1) + one base_type (code 2). + abbrev = bytearray() + abbrev += _uleb(1) + _uleb(DW_TAG_compile_unit) + bytes([DW_CHILDREN_yes]) + for at, form in [(DW_AT_producer, DW_FORM_strp), + (DW_AT_language, DW_FORM_data2), + (DW_AT_name, DW_FORM_strp), + (DW_AT_comp_dir, DW_FORM_strp), + (DW_AT_low_pc, DW_FORM_addr), + (DW_AT_high_pc, DW_FORM_data4), + (DW_AT_stmt_list, DW_FORM_sec_offset)]: + abbrev += _uleb(at) + _uleb(form) + abbrev += _uleb(0) + _uleb(0) + abbrev += _uleb(2) + _uleb(DW_TAG_base_type) + bytes([DW_CHILDREN_no]) + for at, form in [(DW_AT_byte_size, DW_FORM_data1), + (DW_AT_encoding, DW_FORM_data1), + (DW_AT_name, DW_FORM_string)]: + abbrev += _uleb(at) + _uleb(form) + abbrev += _uleb(0) + _uleb(0) + abbrev += _uleb(0) # end of abbrev table + + addr_size = 8 if is64 else 4 + # .debug_info CU body (after the unit-length + header fields). + body = bytearray() + body += _uleb(1) # abbrev code 1 (CU) + body += struct.pack(" .debug_line + body += _uleb(2) # abbrev code 2 (base) + body += bytes([4, 5]) + b"int\x00" # byte_size, enc, name + body += _uleb(0) # end of children + + info = bytearray() + if version >= 5: + # DWARF5 CU header: unit_length, version, unit_type, addr_size, + # debug_abbrev_offset. 
+ hdr = struct.pack(">= 7 + if (v == 0 and not (b & 0x40)) or (v == -1 and (b & 0x40)): + more = False + else: + b |= 0x80 + out.append(b) + return bytes(out) + + +def _extra_dwarf_sections(version, is64): + """A wide set of minimal-but-parseable DWARF sections, one per + display_debug_* reader in dwarf.c (a tolerant dumper).""" + asz = 8 if is64 else 4 + A = (lambda v: struct.pack(" + llbody += bytes([0x08]) + A(0x1000) + _uleb(0x40) # DW_LLE_start_length + llbody += _uleb(len(expr)) + expr + llbody += bytes([0x00]) # DW_LLE_end_of_list + ll += struct.pack("= 3 else 1]) + b"\x00" # ver, aug + if version >= 4: + cie_body = bytes([4]) + b"\x00" + bytes([asz, 0]) # +asz,seg + cie_body += _uleb(1) + _sleb(-4) + _uleb(0) # caf, daf, ret_reg + cie_body += bytes([0x0c, 0x07, 0x00]) # DW_CFA_def_cfa r7,0 + cie = struct.pack("= 5: + # v5 directory/file tables use entry-format descriptors. + dir_fmt = _uleb(1) + _uleb(1) + _uleb(DW_FORM_string) # DW_LNCT_path + dirs = _uleb(1) + b"/seed\x00" + file_fmt = (_uleb(2) + _uleb(1) + _uleb(DW_FORM_string) + + _uleb(2) + _uleb(DW_FORM_data1)) # path, dir idx + files = _uleb(1) + b"seed.c\x00" + _uleb(0) + pre = struct.pack("= 5: + prologue += dir_fmt + dirs + file_fmt + files + else: + prologue += b"\x00" # end of include_dirs + prologue += b"seed.c\x00" + _uleb(0) + _uleb(0) + _uleb(0) + prologue += b"\x00" # end of file_names + + # A tiny line-number program: set address, advance line, copy, end seq. 
+ prog = bytearray() + prog += bytes([0, 9, 2]) + struct.pack(" ZERO\n" + "\tjne .L1\n" + "\t.else\n" + "\tje .L1\n" + "\t.endif\n" + "\tPUSHALL %edx, TWO\n" + "\t.rept 3\n" + "\tnop\n" + "\t.endr\n" + "\tpaddd %xmm0, %xmm1\n" + "\tmovaps %xmm2, %xmm3\n" + "\trep movsb\n" + "\tlock incl (%eax)\n" + ".L1:\n" + "\tleave\n" + "\tret\n" + "\t.cfi_endproc\n" + "\t.size _start, .-_start\n" + ).encode() + + +def make_def_seed(): + """A Windows module-definition (.def) file exercising the dlltool def + grammar (LIBRARY/EXPORTS/IMPORTS/SECTIONS/... in binutils/defparse.y).""" + return ( + "LIBRARY \"seed.dll\" BASE=0x10000000\n" + "EXPORTS\n" + " AddNumbers @1\n" + " SubNumbers @2 NONAME\n" + " GetData = internal_get_data\n" + " globalState @4 DATA\n" + " PrivateFn @5 PRIVATE\n" + " ColdFn @6 == realname\n" + "IMPORTS\n" + " helper = other.dll.helper_impl\n" + " by_ord = other.dll.7\n" + "SECTIONS\n" + " .text EXECUTE READ\n" + " .data READ WRITE\n" + " .shared SHARED\n" + "DESCRIPTION \"seed module-definition file\"\n" + "STACKSIZE 0x100000, 0x1000\n" + "HEAPSIZE 0x100000, 0x1000\n" + "VERSION 1.2\n" + ).encode() + + +# ────────────────────────────────────────────────────────────────────────── +# Separate-debug-file links (for fuzz_dwarf, which only loads these) +# ────────────────────────────────────────────────────────────────────────── +def make_debuglink_object(): + """ELF object carrying .gnu_debuglink, .gnu_debugaltlink and .debug_sup, + the sections fuzz_dwarf's load_separate_debug_files actually parses.""" + obj = ElfObject(62, ELFCLASS64, ELFDATA2LSB) # x86-64 host + obj.add_section(".text", SHT_PROGBITS, b"\x00" * 16, + flags=SHF_ALLOC | SHF_EXECINSTR, addralign=4) + + name = b"seed.debug\x00" + link = name + b"\x00" * ((-len(name)) % 4) + b"\x01\x02\x03\x04" # +CRC32 + obj.add_section(".gnu_debuglink", SHT_PROGBITS, link) + + altname = b"seed.alt.debug\x00" + obj.add_section(".gnu_debugaltlink", SHT_PROGBITS, + altname + b"\x11" * 20) # filename + build-id + + 
sup = struct.pack("\n") + for name, data in members: + nm = (name + "/")[:16].ljust(16) + hdr = "%s%-12d%-6d%-6d%-8s%-10d`\n" % (nm, 0, 0, 0, "100644", + len(data)) + out += hdr.encode() + out += data + if len(data) % 2: + out += b"\n" + return bytes(out) + + +# ────────────────────────────────────────────────────────────────────────── +# Driver +# ────────────────────────────────────────────────────────────────────────── +def write(path, data): + with open(path, "wb") as f: + f.write(data) + + +def main(root): + seeds = os.path.join(root, "seeds") + + reloc_dir = os.path.join(seeds, "elf_reloc") + os.makedirs(reloc_dir, exist_ok=True) + for arch in ARCHES: + write(os.path.join(reloc_dir, "reloc-%s.o" % arch), + make_reloc_object(arch)) + + dwarf_dir = os.path.join(seeds, "dwarf") + os.makedirs(dwarf_dir, exist_ok=True) + for ver in (4, 5): + for bits, is64 in (("64", True), ("32", False)): + write(os.path.join(dwarf_dir, "dwarf%d-%s.o" % (ver, bits)), + make_dwarf_object(ver, is64)) + + meta_dir = os.path.join(seeds, "elf_meta") + os.makedirs(meta_dir, exist_ok=True) + write(os.path.join(meta_dir, "meta-gnu-x86_64.o"), + make_elf_meta_object(62, ELFCLASS64, ELFDATA2LSB, "gnu")) + write(os.path.join(meta_dir, "meta-aeabi-arm.o"), + make_elf_meta_object(40, ELFCLASS32, ELFDATA2LSB, "aeabi")) + write(os.path.join(meta_dir, "meta-gnu-aarch64.o"), + make_elf_meta_object(183, ELFCLASS64, ELFDATA2LSB, "gnu")) + + # Text seeds for the otherwise-unseeded fuzz_as and fuzz_dlltool harnesses. + gas_dir = os.path.join(seeds, "gas") + os.makedirs(gas_dir, exist_ok=True) + write(os.path.join(gas_dir, "seed.s"), make_gas_asm_seed()) + + def_dir = os.path.join(seeds, "dlltool") + os.makedirs(def_dir, exist_ok=True) + write(os.path.join(def_dir, "seed.def"), make_def_seed()) + + # Separate-debug-link seed for fuzz_dwarf. 
+    dl_dir = os.path.join(seeds, "debuglink")
+    os.makedirs(dl_dir, exist_ok=True)
+    write(os.path.join(dl_dir, "debuglink.o"), make_debuglink_object())
+
+    arc_dir = os.path.join(seeds, "archive")
+    os.makedirs(arc_dir, exist_ok=True)
+    members = [("reloc-%s.o" % a, make_reloc_object(a))
+               for a in ("riscv64", "aarch64", "ppc64")]
+    members.append(("dwarf4.o", make_dwarf_object(4, True)))
+    write(os.path.join(arc_dir, "multiarch.a"), make_archive(members))
+
+    # Count seeds in every output directory, not just reloc/dwarf/archive.
+    n = sum(len(files) for _, _, files in os.walk(seeds))
+    print("generate_seeds.py: wrote %d seeds under %s" % (n, seeds))
+
+
+if __name__ == "__main__":
+    if len(sys.argv) != 2:
+        sys.stderr.write("usage: generate_seeds.py \n")
+        sys.exit(1)
+    main(sys.argv[1])
diff --git a/projects/ghostscript/Dockerfile b/projects/ghostscript/Dockerfile
index 3e405585bdce..f887b450387b 100644
--- a/projects/ghostscript/Dockerfile
+++ b/projects/ghostscript/Dockerfile
@@ -28,4 +28,4 @@ COPY dicts $SRC/dicts
 
 WORKDIR ghostpdl
 COPY *.cc *.options *.h $SRC/
-COPY build.sh $SRC/
+COPY build.sh generate_seeds.py $SRC/
diff --git a/projects/ghostscript/build.sh b/projects/ghostscript/build.sh
index 6c6e651f6378..8ca5a29ad91d 100755
--- a/projects/ghostscript/build.sh
+++ b/projects/ghostscript/build.sh
@@ -116,12 +116,25 @@ for f in examples/ridt91.eps examples/snowflak.ps $SRC/pdf_seeds/pdf.pdf; do
 done
 zip -j "$OUT/gs_device_pdfwrite_opts_fuzzer_seed_corpus.zip" "$WORK"/gs_device_pdfwrite_opts_fuzzer_seeds/*
 
+# Generate structured PostScript / PCL-XL / PCL / XPS seeds (see generate_seeds.py).
+# The stock examples lean on DeviceRGB/Gray and basic operators; the generated
+# PostScript exercises the colour-space / CIE / ICC machinery (zcolor.c,
+# gsicc_create.c, zcie.c), smooth shadings, the PDF1.4 transparency compositor,
+# halftones (zht2.c), images and DSC parsing (dscparse.c) across every device.
+python3 $SRC/generate_seeds.py "$WORK/generated_gs_seeds"
+
 # Create seeds for gstoraster_fuzzer
 mkdir -p "$WORK/seeds"
 for f in examples/*.{ps,pdf}; do
   s=$(sha1sum "$f" | awk '{print $1}')
   cp "$f" "$WORK/seeds/$s"
 done
+# Add the generated PostScript seeds so they propagate to every device fuzzer
+# corpus copied from gstoraster_fuzzer below.
+for f in "$WORK"/generated_gs_seeds/ps/*.ps; do
+  s=$(sha1sum "$f" | awk '{print $1}')
+  cp "$f" "$WORK/seeds/$s"
+done
 
 # Create corpus for gstoraster_fuzzer
 zip -j "$OUT/gstoraster_fuzzer_seed_corpus.zip" "$WORK"/seeds/*
@@ -157,6 +170,7 @@ for f in pcl/examples/*.pcl; do
   s=$(sha1sum "$f" | awk '{print $1}')
   cp "$f" "$WORK/pcl_seeds/$s"
 done
+cp "$WORK"/generated_gs_seeds/pcl/* "$WORK/pcl_seeds/" 2>/dev/null || true
 zip -j "$OUT/gs_pcl_fuzzer_seed_corpus.zip" "$WORK"/pcl_seeds/*
 
 # Create PXL seed corpus from example PXL files
@@ -165,6 +179,7 @@ for f in pcl/examples/*.pxl pcl/examples/*.px3; do
   s=$(sha1sum "$f" | awk '{print $1}')
   cp "$f" "$WORK/pxl_seeds/$s"
 done
+cp "$WORK"/generated_gs_seeds/pxl/* "$WORK/pxl_seeds/" 2>/dev/null || true
 zip -j "$OUT/gs_pxl_fuzzer_seed_corpus.zip" "$WORK"/pxl_seeds/*
 
 # Create XPS seed corpus from example XPS files
@@ -175,6 +190,7 @@ for f in pcl/examples/*.xps xps/tools/*.xps; do
     cp "$f" "$WORK/xps_seeds/$s"
   fi
 done
+cp "$WORK"/generated_gs_seeds/xps/* "$WORK/xps_seeds/" 2>/dev/null || true
 zip -j "$OUT/gs_xps_fuzzer_seed_corpus.zip" "$WORK"/xps_seeds/*
 
 # Copy dictionaries for new fuzzers
diff --git a/projects/ghostscript/generate_seeds.py b/projects/ghostscript/generate_seeds.py
new file mode 100644
index 000000000000..55ca7c607e9b
--- /dev/null
+++ b/projects/ghostscript/generate_seeds.py
@@ -0,0 +1,748 @@
+#!/usr/bin/env python3
+# Copyright 2026 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+"""Structured PostScript/PCL/PCL-XL/XPS seed generation for the ghostscript
+OSS-Fuzz fuzzers.
+
+Every `gs_device_*` / `gstoraster_*` fuzzer feeds the input to Ghostscript on
+stdin and renders it to a device (see gs_fuzzlib.h: the args end in `-_`). The
+shipped seed corpus is the stock `examples/*.ps` and `examples/*.pdf`, which
+leans almost entirely on DeviceRGB/DeviceGray and basic operators.
+
+This script emits small, valid PostScript programs that each drive one
+under-covered subsystem (colour spaces, shadings, transparency, halftones,
+images, DSC), plus PCL-XL, PCL and XPS seeds for the gs_pxl / gs_pcl / gs_xps
+fuzzers. Pure Python stdlib; Ghostscript is not needed at generation time.
+ +Usage: python3 generate_seeds.py +""" + +import os +import sys +import struct +import zlib +import io +import zipfile + + +def w(d, name, data): + os.makedirs(d, exist_ok=True) + with open(os.path.join(d, name), "wb") as f: + f.write(data if isinstance(data, bytes) else data.encode("latin-1")) + + +PS_HEADER = "%!PS-Adobe-3.0\n" + + +# -------------------------------------------------------------------------- +# PostScript colour spaces -> zcolor.c +# -------------------------------------------------------------------------- +def ps_colorspaces(): + return PS_HEADER + r"""%%Title: colour space torture +%%EndComments +% --- CIEBasedABC +[ /CIEBasedABC << + /DecodeLMN [ {dup mul} bind {dup mul} bind {dup mul} bind ] + /MatrixLMN [0.41 0.21 0.02 0.36 0.72 0.12 0.18 0.07 0.95] + /WhitePoint [0.9505 1.0 1.089] + /BlackPoint [0 0 0] + /RangeABC [0 1 0 1 0 1] +>> ] setcolorspace +0.3 0.5 0.7 setcolor 20 20 80 80 rectfill +% --- CIEBasedA +[ /CIEBasedA << /DecodeA {dup mul} bind /MatrixA [1 1 1] + /WhitePoint [0.9505 1.0 1.089] >> ] setcolorspace +0.6 setcolor 120 20 60 60 rectfill +% --- CIEBasedDEF +[ /CIEBasedDEF << /DecodeDEF [ {} bind {} bind {} bind ] + /RangeDEF [0 1 0 1 0 1] /RangeHIJ [0 1 0 1 0 1] + /Table [2 2 2 (\000\000\000\377\377\377\200\200\200\100\100\100)] + /DecodeABC [ {} bind {} bind {} bind ] + /MatrixABC [1 0 0 0 1 0 0 0 1] + /WhitePoint [0.9505 1.0 1.089] >> ] setcolorspace +0.2 0.4 0.6 setcolor 20 120 60 60 rectfill +% --- Separation +[ /Separation /Spot /DeviceCMYK { dup 0 0 0 4 1 roll } bind ] setcolorspace +0.8 setcolor 120 120 60 60 rectfill +% --- DeviceN +[ /DeviceN [/C1 /C2] /DeviceRGB { 0.5 mul exch 0.5 mul exch 0 } bind ] + setcolorspace +0.4 0.7 setcolor 200 20 60 60 rectfill +% --- Indexed over DeviceRGB +[ /Indexed /DeviceRGB 3 ] setcolorspace +2 setcolor 200 120 60 60 rectfill +% --- legacy operators +0.5 setgray 0 0 10 10 rectfill +0.1 0.2 0.3 setrgbcolor 10 0 10 10 rectfill +0.1 0.2 0.3 0.4 setcmykcolor 20 0 10 10 rectfill +0.5 0.6 
0.7 sethsbcolor 30 0 10 10 rectfill +currentcolor pop currentcolorspace pop currentgray pop +showpage +""" + + +# -------------------------------------------------------------------------- +# More colour spaces + colour-rendering dictionary -> gscie/gscrd/gsciemap +# -------------------------------------------------------------------------- +def ps_color_rendering(): + return PS_HEADER + r"""%%Title: colour rendering + extra colour spaces +% --- CIEBasedDEFG (4-input CIE, e.g. CMYK device link) +[ /CIEBasedDEFG << + /DecodeDEFG [ {} bind {} bind {} bind {} bind ] + /RangeDEFG [0 1 0 1 0 1 0 1] /RangeHIJK [0 1 0 1 0 1 0 1] + /Table [2 2 2 2 (\000\000\000\377\377\377\200\200\200\100\100\100 + \040\040\040\140\140\140\240\240\240\300\300\300 + \020\020\020\060\060\060\120\120\120\160\160\160 + \220\220\220\260\260\260\320\320\320\360\360\360)] + /DecodeABC [ {} bind {} bind {} bind ] + /MatrixABC [1 0 0 0 1 0 0 0 1] + /WhitePoint [0.9505 1.0 1.089] >> ] setcolorspace +0.2 0.4 0.6 0.1 setcolor 20 20 60 60 rectfill +% --- a colour-rendering dictionary (CRD, type 1) +<< /ColorRenderingType 1 + /WhitePoint [0.9505 1.0 1.089] /BlackPoint [0 0 0] + /MatrixPQR [1 0 0 0 1 0 0 0 1] + /RangePQR [-0.5 2 -0.5 2 -0.5 2] + /TransformPQR [ {3 -1 roll pop pop} bind {3 -1 roll pop pop} bind + {3 -1 roll pop pop} bind ] + /MatrixLMN [1 0 0 0 1 0 0 0 1] + /EncodeLMN [ {} bind {} bind {} bind ] + /RangeLMN [0 1 0 1 0 1] + /MatrixABC [1 0 0 0 1 0 0 0 1] + /EncodeABC [ {} bind {} bind {} bind ] + /RangeABC [0 1 0 1 0 1] + /RenderTable null >> setcolorrendering +% --- black generation + undercolour removal (CMYK) +{ dup dup dup pop pop pop } bind setblackgeneration +{ 0.5 mul } bind setundercolorremoval +0.1 0.2 0.3 0.4 setcmykcolor 90 20 60 60 rectfill +% --- DeviceN with an attributes dict +[ /DeviceN [/Cyan /Magenta /Spot] /DeviceCMYK + { 0 4 1 roll 0 } bind + << /Subtype /NChannel /Colorants << /Spot + [ /Separation /Spot /DeviceCMYK {0 0 0 4 1 roll} bind ] >> >> ] + setcolorspace +0.3 
0.4 0.5 setcolor 20 90 60 60 rectfill +% --- Separation /All +[ /Separation /All /DeviceCMYK { dup dup dup } bind ] setcolorspace +0.7 setcolor 90 90 60 60 rectfill +showpage +""" + + +# -------------------------------------------------------------------------- +# Smooth shading (shfill, all ShadingType values) -> gxshade*.c +# -------------------------------------------------------------------------- +def ps_shadings(): + return PS_HEADER + r"""%%Title: shadings +% Type 1 function-based shading +<< /ShadingType 1 /ColorSpace /DeviceRGB + /Function << /FunctionType 2 /Domain [0 1] /C0 [1 0 0] /C1 [0 0 1] /N 1 >> + /Domain [0 1 0 1] >> shfill +% Type 2 axial +gsave 0 0 100 100 rectclip +<< /ShadingType 2 /ColorSpace /DeviceRGB /Coords [0 0 200 200] + /Extend [true true] + /Function << /FunctionType 2 /Domain [0 1] /C0 [1 1 0] /C1 [0 1 1] /N 1 >> +>> shfill grestore +% Type 3 radial +gsave 100 0 100 100 rectclip +<< /ShadingType 3 /ColorSpace /DeviceRGB /Coords [150 50 0 150 50 60] + /Extend [true true] + /Function << /FunctionType 2 /Domain [0 1] /C0 [1 0 1] /C1 [0 0 0] /N 1 >> +>> shfill grestore +% Type 4 free-form Gouraud (inline data via DataSource string) +<< /ShadingType 4 /ColorSpace /DeviceRGB /BitsPerCoordinate 8 + /BitsPerComponent 8 /BitsPerFlag 8 /Decode [0 255 0 255 0 1 0 1 0 1] + /DataSource <00 00 00 ff 00 00 00 ff 00 00 ff 00 00 80 80 00 00 ff> +>> shfill +% Type 5 lattice Gouraud +<< /ShadingType 5 /ColorSpace /DeviceRGB /BitsPerCoordinate 8 + /BitsPerComponent 8 /VerticesPerRow 2 /Decode [0 255 0 255 0 1 0 1 0 1] + /DataSource <00 00 ff0000 ff 00 00ff00 00 ff 0000ff ff ff ffffff> +>> shfill +% Type 6 Coons patch +<< /ShadingType 6 /ColorSpace /DeviceRGB /BitsPerCoordinate 8 + /BitsPerComponent 8 /BitsPerFlag 8 + /Decode [0 255 0 255 0 1 0 1 0 1] + /DataSource <00 + 00 00 20 00 40 00 60 00 60 20 60 40 60 60 40 60 20 60 00 60 00 40 00 20 + ff0000 00ff00 0000ff ffff00> +>> shfill +showpage +""" + + +# 
-------------------------------------------------------------------------- +# Transparency: groups, blend modes, soft masks, alpha -> gdevp14 / gxblend +# -------------------------------------------------------------------------- +def ps_transparency(): + modes = ["Normal", "Multiply", "Screen", "Overlay", "Darken", "Lighten", + "ColorDodge", "ColorBurn", "HardLight", "SoftLight", "Difference", + "Exclusion", "Hue", "Saturation", "Color", "Luminosity"] + body = [PS_HEADER, "%%Title: transparency\n"] + body.append(".setblendmode where { pop } if\n") + x = 10 + for i, m in enumerate(modes): + body.append( + "gsave /%s .setblendmode 0.7 .setfillconstantalpha\n" + "1 0 0 setrgbcolor %d 20 40 40 rectfill\n" + "0 0 1 setrgbcolor %d 40 40 40 rectfill grestore\n" + % (m, x, x)) + x += 12 + # transparency group + body.append( + "<< /Subtype /Group /CS /DeviceRGB /I true /K false >>\n" + "1 .begintransparencygroup\n" + "0 1 0 setrgbcolor 60 100 80 80 rectfill\n" + ".endtransparencygroup\n") + body.append("showpage\n") + return "".join(body) + + +# -------------------------------------------------------------------------- +# Images: image / colorimage / imagemask, indexed, multiple bit depths +# -------------------------------------------------------------------------- +def ps_images(): + return PS_HEADER + r"""%%Title: images +% 8-bit grayscale image +gsave 0 0 translate 100 100 scale +8 8 8 [8 0 0 8 0 0] +{ <0011223344556677> } image +grestore +% RGB colorimage +gsave 100 0 translate 100 100 scale +4 4 8 [4 0 0 4 0 0] +{ } +false 3 colorimage +grestore +% imagemask +gsave 0 100 translate 60 60 scale +0 0 0 setrgbcolor +8 8 false [8 0 0 8 0 0] { <8142241818244281> } imagemask +grestore +% indexed image via image dict + Interpolate (gxiscale) +gsave 100 100 translate 80 80 scale +[ /Indexed /DeviceRGB 3 ] setcolorspace +<< /ImageType 1 /Width 2 /Height 2 /BitsPerComponent 2 + /Decode [0 3] /Interpolate true + /ImageMatrix [2 0 0 2 0 0] + /DataSource <00 40 80 c0> >> image 
+grestore +showpage +""" + + +# -------------------------------------------------------------------------- +# Halftones / transfer functions -> gshtscr / gxht +# -------------------------------------------------------------------------- +def ps_halftones(): + return PS_HEADER + r"""%%Title: halftones +% Type 1 spot halftone +<< /HalftoneType 1 /Frequency 60 /Angle 45 + /SpotFunction { 180 mul cos exch 180 mul cos add 2 div } bind >> +sethalftone +% Type 3 threshold halftone +<< /HalftoneType 3 /Width 2 /Height 2 /Thresholds <00 55 aa ff> >> +sethalftone +% legacy setscreen + transfer +60 30 { 180 mul cos exch 180 mul cos add 2 div } bind setscreen +{ 1 exch sub } bind settransfer +{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer +0.5 setgray 0 0 100 100 rectfill +showpage +""" + + +# -------------------------------------------------------------------------- +# DSC-rich document -> dscparse.c (used by ps2write / eps handling) +# -------------------------------------------------------------------------- +def ps_dsc(): + return r"""%!PS-Adobe-3.0 +%%Title: DSC torture +%%Creator: seedgen +%%CreationDate: 2026 +%%BoundingBox: 0 0 200 200 +%%HiResBoundingBox: 0.0 0.0 200.0 200.0 +%%DocumentMedia: Default 200 200 80 white () +%%DocumentData: Clean7Bit +%%LanguageLevel: 3 +%%Orientation: Portrait +%%PageOrder: Ascend +%%Pages: 2 +%%DocumentNeededResources: font Helvetica +%%DocumentSuppliedResources: procset Seed 1.0 0 +%%EndComments +%%BeginProlog +%%BeginResource: procset Seed 1.0 0 +/box { newpath 0 0 moveto 50 0 rlineto 0 50 rlineto -50 0 rlineto closepath } def +%%EndResource +%%EndProlog +%%BeginSetup +/Helvetica findfont 12 scalefont setfont +%%EndSetup +%%Page: one 1 +%%BeginPageSetup +gsave +%%EndPageSetup +20 20 moveto box 0.5 setgray fill +20 100 moveto (DSC page one) show +grestore +showpage +%%Page: two 2 +gsave +0.2 0.4 0.6 setrgbcolor 30 30 box fill +grestore +showpage +%%Trailer +%%EOF +""" + + +# 
-------------------------------------------------------------------------- +# Fonts / text: Type 3 font, show variants, clipping +# -------------------------------------------------------------------------- +def ps_fonts_text(): + return PS_HEADER + r"""%%Title: fonts and text +% Type 3 user-defined font +8 dict dup begin + /FontType 3 def + /FontMatrix [0.01 0 0 0.01 0 0] def + /FontBBox [0 0 100 100] def + /Encoding 256 array def + Encoding 65 /A put + /CharProcs 2 dict def + CharProcs begin + /A { 0 0 moveto 100 0 lineto 50 100 lineto closepath fill } bind def + /.notdef { } bind def + end + /BuildGlyph { exch /CharProcs get exch 2 copy known not { pop /.notdef } if + get exec } bind def +end +/SeedType3 exch definefont pop +/SeedType3 findfont 24 scalefont setfont +20 150 moveto (AAA) show +% standard font show variants +/Helvetica findfont 14 scalefont setfont +20 120 moveto (kerned) 0 0 (k) 0 0 ashow +20 100 moveto (widthshow) 1 0 32 widthshow +20 80 moveto [3 2] 0 setdash 0 0 1 setrgbcolor (dashed clip) show +% text as clip path +20 40 moveto /Helvetica findfont 30 scalefont setfont +(CLIP) true charpath clip +0 0 200 200 8 { pop 0 1 0 setrgbcolor 0 0 200 200 rectfill } repeat +showpage +""" + + +# -------------------------------------------------------------------------- +# PCL-XL (PCL6) seed for gs_pxl_fuzzer +# -------------------------------------------------------------------------- +def pclxl_seed(): + # PCL-XL big-endian protocol; a minimal valid stream: + # ) HP-PCL XL;2;0 header, BeginSession, OpenDataSource, BeginPage, + # SetColorSpace, a rectangle, EndPage, CloseDataSource, EndSession. + out = bytearray() + out += b") HP-PCL XL;2;0;Comment Seed\n" + + def ubyte(tag, v): + return bytes([0xc0, v, tag]) # ubyte attr + attr-id + + def uint16(v): + return bytes([0xc1]) + struct.pack(">H", v) + + def attr(idbyte): + return bytes([0xf8, idbyte]) # attribute id tag + + # BeginSession: UnitsPerMeasure (uint16 xy), MeasureName (ubyte), ... 
+ out += bytes([0xc0, 0x00, 0xf8, 0x29]) # ProtocolClass? skip + # Simpler: use documented operator bytes. + # UnitsPerMeasure = [600 600] + out += bytes([0xc1]) + struct.pack(">H", 600) + out += bytes([0xc1]) + struct.pack(">H", 600) + out += attr(0x88) # UnitsPerMeasure + out += bytes([0xc0, 0x00]) + attr(0x86) # MeasureName=eInch + out += bytes([0x41]) # BeginSession + out += bytes([0xc0, 0x01]) + attr(0x1c) # SourceType + out += bytes([0x42]) # OpenDataSource + out += bytes([0xc0, 0x02]) + attr(0x28) # ColorSpace=eRGB + out += bytes([0x6a]) # SetColorSpace + out += bytes([0xc0, 0x00]) + attr(0x29) # Orientation + out += bytes([0xc0, 0x02]) + attr(0x25) # MediaSize=Letter + out += bytes([0x43]) # BeginPage + # rectangle + out += uint16(100) + attr(0x53) # not strictly valid + out += bytes([0x44]) # EndPage + out += bytes([0x45]) # CloseDataSource-ish + out += bytes([0x46]) # EndSession-ish + return bytes(out) + + +# -------------------------------------------------------------------------- +# PCL5 seed for gs_pcl_fuzzer +# -------------------------------------------------------------------------- +def pcl5_seed(): + ESC = b"\x1b" + out = bytearray() + out += ESC + b"E" # printer reset + out += ESC + b"&l1O" # orientation landscape + out += ESC + b"&l2A" # page size letter + out += ESC + b"(s1p12v0s0b4099T" # font selection + out += ESC + b"&a100h200V" # cursor position + out += b"Hello PCL5 seed\r\n" + out += ESC + b"*c100a100b0P" # fill rectangle + out += ESC + b"*v1S" # set source + # raster graphics + out += ESC + b"*t100R" # raster resolution + out += ESC + b"*r0A" # start raster + out += ESC + b"*b4W" + b"\xff\x00\xff\x00" + out += ESC + b"*rB" # end raster + out += ESC + b"E" + return bytes(out) + + +# -------------------------------------------------------------------------- +# Complex paths / stroking / clipping -> gxfill.c, gxstroke.c, gxclip.c +# -------------------------------------------------------------------------- +def ps_paths(): + return 
PS_HEADER + r"""%%Title: paths, stroking, clipping +% self-intersecting star, even-odd vs nonzero winding +/star { newpath 100 190 moveto 140 20 lineto 10 130 lineto 190 130 lineto + 60 20 lineto closepath } def +gsave star 1 0 0 setrgbcolor eofill grestore +gsave 0 200 translate star 0 0 1 setrgbcolor fill grestore +% many overlapping subpaths in one fill (winding accumulation) +newpath 0 1 20 { dup 10 mul dup 5 add exch 80 add 40 0 360 arc } for +0 0.6 0 setrgbcolor fill +% bezier curves +newpath 10 250 moveto 60 380 140 380 190 250 curveto +30 300 lineto 100 350 170 300 closepath 0.4 0.2 0.8 setrgbcolor fill +% stroking: caps, joins, miter, dashes, zero-length dots +2 setlinecap 1 setlinejoin 8 setmiterlimit +[6 3 2 3] 1 setdash 5 setlinewidth +newpath 220 20 moveto 380 20 lineto 300 120 lineto stroke +0 setlinecap 0 setlinejoin [] 0 setdash +1 setlinewidth newpath 220 150 moveto 380 150 lineto stroke +% zero-length stroke with round caps -> dots +1 setlinecap 10 setlinewidth +newpath 240 200 moveto 240 200 lineto stroke +% complex clip then fill a big region +gsave newpath 220 220 moveto 380 220 lineto 300 380 lineto closepath +clip 0 1 1 setrgbcolor 200 200 200 200 rectfill grestore +% rectclip + eoclip +gsave 50 400 100 80 rectclip 0.9 0.5 0.1 setrgbcolor +0 0 600 600 rectfill grestore +showpage +""" + + +# -------------------------------------------------------------------------- +# Image scaling / interpolation / mask types -> gxiscale.c, gxdownscale.c +# -------------------------------------------------------------------------- +def ps_image_scaling(): + return PS_HEADER + r"""%%Title: image scaling, interpolation, mask types +% interpolated upscale of a tiny image (drives gxiscale) +gsave 0 0 translate 180 180 scale +<< /ImageType 1 /Width 4 /Height 4 /BitsPerComponent 8 /Interpolate true + /Decode [0 1 0 1 0 1] /ImageMatrix [4 0 0 4 0 0] + /DataSource >> image +grestore +% large downscale (small dest from big source) -> downscaling path +gsave 200 200 
translate 30 30 scale +<< /ImageType 1 /Width 32 /Height 32 /BitsPerComponent 1 /Interpolate false + /Decode [0 1] /ImageMatrix [32 0 0 32 0 0] + /DataSource { } >> image +grestore +% 16-bit grayscale image +gsave 0 200 translate 90 90 scale +<< /ImageType 1 /Width 2 /Height 2 /BitsPerComponent 16 + /Decode [0 1] /ImageMatrix [2 0 0 2 0 0] + /DataSource <0000ffffffff0000> >> image +grestore +% ImageType 4 colour-key masked image +gsave 200 0 translate 90 90 scale +<< /ImageType 4 /Width 2 /Height 2 /BitsPerComponent 8 /MaskColor [255] + /Decode [0 1] /ImageMatrix [2 0 0 2 0 0] + /DataSource <00ff80ff> >> image +grestore +% ImageType 3 explicit-mask image +gsave 100 100 translate 80 80 scale +<< /ImageType 3 /InterleaveType 3 + /DataDict << /ImageType 1 /Width 2 /Height 2 /BitsPerComponent 8 + /Decode [0 1 0 1 0 1] /ImageMatrix [2 0 0 2 0 0] + /DataSource >> + /MaskDict << /ImageType 1 /Width 2 /Height 2 /BitsPerComponent 1 + /Decode [0 1] /ImageMatrix [2 0 0 2 0 0] + /DataSource <40> >> >> image +grestore +showpage +""" + + +# -------------------------------------------------------------------------- +# setpagedevice parameter dictionaries -> gsparaml.c, gsdparam.c +# -------------------------------------------------------------------------- +def ps_pagedevice_params(): + return PS_HEADER + r"""%%Title: page device parameters +<< /PageSize [200 200] /Margins [0 0] /HWResolution [72 72] + /ImagingBBox null /Orientation 0 /Policies << /PageSize 3 /Policy 0 >> + /BeginPage { pop } /EndPage { pop pop true } + /Install {} /UseCIEColor true >> setpagedevice +currentpagedevice /PageSize get aload pop pop pop +% nested dict + array param types +<< /PageSize [200 200] + /InputAttributes << 0 << /PageSize [200 200] >> /Priority [0] >> + /OutputAttributes << 0 << >> >> + /Deferred true /DeviceRenderingInfo << /MaxSeparations 4 >> >> +setpagedevice +% gsave/grestore of gstate + clippath/initclip +gsave clippath pathbbox 4 array astore pop grestore initclip +0.5 setgray 10 10 
180 180 rectfill +showpage +""" + + +# -------------------------------------------------------------------------- +# XPS (OpenXPS) packages -> xps/*.c + expat XML parser + (image) pngread +# The gs_xps_fuzzer feeds the file to gpdl, which detects the OPC ZIP and runs +# the XPS interpreter. The stock corpus has only a couple of sample .xps files. +# -------------------------------------------------------------------------- +XPS_NS = "http://schemas.microsoft.com/xps/2005/06" +REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships" + + +def _png_rgb(w, h): + def chunk(t, d): + return (struct.pack(">I", len(d)) + t + d + + struct.pack(">I", zlib.crc32(t + d) & 0xffffffff)) + ihdr = struct.pack(">IIBBBBB", w, h, 8, 2, 0, 0, 0) # 8-bit RGB + raw = b"" + for y in range(h): + raw += b"\x00" + bytes([(x * 32 + y * 16) & 255 + for x in range(w) for _ in range(3)]) + return (b"\x89PNG\r\n\x1a\n" + chunk(b"IHDR", ihdr) + + chunk(b"IDAT", zlib.compress(raw)) + chunk(b"IEND", b"")) + + +def _tiff_rgb(w, h): + """A minimal baseline (uncompressed RGB, single strip) TIFF, little-endian. 
+ Reachable as an XPS ImageBrush source -> xps/xpstiff.c + libtiff read.""" + tags = [] # (tag, type, count, value-or-inline-bytes) + strip = bytes([(x * 20 + y * 10 + c * 5) & 255 + for y in range(h) for x in range(w) for c in range(3)]) + # out-of-line areas: BitsPerSample (3 shorts), strip data + hdr_len = 8 + ifd_count = 11 + ifd_len = 2 + ifd_count * 12 + 4 + bps_off = hdr_len + ifd_len + strip_off = bps_off + 6 + bps = struct.pack("off + ifd += e(259, 3, 1, struct.pack("' + '' + '' + '' + '' + '' + '' + '' + '') + rels = ('' + '' + '' % REL_NS) + fdseq = ('' + '' + '' % XPS_NS) + fdoc = ('' + '' % XPS_NS) + z = io.BytesIO() + with zipfile.ZipFile(z, "w", zipfile.ZIP_DEFLATED) as zf: + zf.writestr("[Content_Types].xml", ct) + zf.writestr("_rels/.rels", rels) + zf.writestr("FixedDocSeq.fdseq", fdseq) + zf.writestr("Documents/1/FixedDoc.fdoc", fdoc) + zf.writestr("Documents/1/Pages/1.fpage", fpage_xml) + if with_png: + zf.writestr("Resources/img.png", _png_rgb(16, 16)) + if with_tiff: + zf.writestr("Resources/img.tif", _tiff_rgb(16, 16)) + return z.getvalue() + + +def xps_tiff(): + """FixedPage whose ImageBrush references a TIFF part -> xps/xpstiff.c and + the libtiff read path (tif_getimage / tif_read), both dark in production.""" + fpage = ( + '' % XPS_NS + + '' + '' + '') + return _xps_package(fpage, with_tiff=True) + + +def xps_vector(): + """FixedPage of vector content: canvases (transform/clip/opacity), paths + with complex geometry, solid/linear/radial/image/visual brushes.""" + fpage = ( + '' % XPS_NS + + '' + '' + '' + '' + # solid fill path + '' + # stroked, dashed path with caps/joins + '' + # linear gradient fill via Path.Fill + '' + '' + '' + '' + '' + '' + '' + '' + # radial gradient + '' + '' + '' + '' + '' + '' + '' + # image brush referencing the PNG part + '' + '' + # explicit complex PathGeometry with arc + bezier segments + '' + '' + '' + '' + '' + '' + '' + '') + return _xps_package(fpage, with_png=True) + + +def xps_glyphs(): + """FixedPage 
with Glyphs elements (text) + opacity mask + visual brush. + Glyphs reference a font part; even when the font fails to load the XML is + fully parsed (expat) and the xpsglyphs dispatch runs.""" + fpage = ( + '' % XPS_NS + + '' + '' + '' + '' + '' + '' + '' + '' + '' + '' + '' + '') + # include a (minimal, likely-unparseable) font part so the loader path runs + z = io.BytesIO() + pkg = _xps_package(fpage, with_png=False) + zin = zipfile.ZipFile(io.BytesIO(pkg), "r") + with zipfile.ZipFile(z, "w", zipfile.ZIP_DEFLATED) as zf: + for n in zin.namelist(): + zf.writestr(n, zin.read(n)) + zf.writestr("Resources/font.ttf", b"\x00\x01\x00\x00" + b"\x00" * 64) + return z.getvalue() + + +# -------------------------------------------------------------------------- +def main(): + out = sys.argv[1] if len(sys.argv) > 1 else "gs_seeds" + os.makedirs(out, exist_ok=True) + ps_dir = os.path.join(out, "ps") # for gstoraster / device fuzzers + pxl_dir = os.path.join(out, "pxl") + pcl_dir = os.path.join(out, "pcl") + xps_dir = os.path.join(out, "xps") # for gs_xps_fuzzer + for name, gen in [ + ("colorspaces.ps", ps_colorspaces), + ("color_rendering.ps", ps_color_rendering), + ("shadings.ps", ps_shadings), + ("transparency.ps", ps_transparency), + ("images.ps", ps_images), + ("image_scaling.ps", ps_image_scaling), + ("paths.ps", ps_paths), + ("pagedevice_params.ps", ps_pagedevice_params), + ("halftones.ps", ps_halftones), + ("dsc.ps", ps_dsc), + ("fonts_text.ps", ps_fonts_text), + ]: + w(ps_dir, name, gen()) + for name, gen in [ + ("vector.xps", xps_vector), + ("glyphs.xps", xps_glyphs), + ("tiff.xps", xps_tiff), + ]: + w(xps_dir, name, gen()) + w(pxl_dir, "seed.bin", pclxl_seed()) + w(pcl_dir, "seed.pcl", pcl5_seed()) + total = sum(len(files) for _, _, files in os.walk(out)) + sys.stderr.write("generate_seeds.py: wrote %d seeds under %s\n" + % (total, out)) + + +if __name__ == "__main__": + main() diff --git a/projects/gstreamer/Dockerfile b/projects/gstreamer/Dockerfile index 
d5c4493be875..1a560a9fb29e 100644 --- a/projects/gstreamer/Dockerfile +++ b/projects/gstreamer/Dockerfile @@ -29,4 +29,4 @@ RUN pip3 install --disable-pip-version-check --no-cache-dir \ RUN git clone --depth 1 --recursive https://gitlab.freedesktop.org/gstreamer/gstreamer.git gstreamer WORKDIR gstreamer -COPY build.sh $SRC/ +COPY build.sh generate_seeds.py $SRC/ diff --git a/projects/gstreamer/build.sh b/projects/gstreamer/build.sh index e7154f3e89f3..25c55c70fd95 100755 --- a/projects/gstreamer/build.sh +++ b/projects/gstreamer/build.sh @@ -29,3 +29,25 @@ if grep -q -F "20.04" /etc/os-release ; then fi $SRC/gstreamer/ci/fuzzing/build-oss-fuzz.sh + +# Append structured seeds (see generate_seeds.py). The upstream fuzz targets +# ship with only a handful of corpus files each (and the push-based `typefind` +# target ships none), so we add structurally valid inputs that reach the +# parsing code directly: +# gst-codec-utils H.264/H.265/H.266 PTL, AV1 av1C, Opus headers +# gst-tag ID3v1/ID3v2 frames, EXIF IFDs, XMP, Vorbis comments +# gst-subparse SubRip/WebVTT/MicroDVD/SubViewer/MPL2/SAMI/... +# typefind magic headers for many container/codec formats +# Existing corpora are retained; generated seeds are merged into the zips. 
+python3 $SRC/generate_seeds.py $SRC/generated_seeds +for target in gst-codec-utils gst-tag gst-subparse typefind gst-discoverer; do + seeddir="$SRC/generated_seeds/$target" + if [ -d "$seeddir" ]; then + zip -j -q "$OUT/${target}_seed_corpus.zip" "$seeddir"/* + fi +done + +for ft in gst-tag; do + echo "[libfuzzer]" > $OUT/${ft}.options + echo "detect_leaks=0" >> $OUT/${ft}.options +done diff --git a/projects/gstreamer/generate_seeds.py b/projects/gstreamer/generate_seeds.py new file mode 100644 index 000000000000..7e8d377af5ad --- /dev/null +++ b/projects/gstreamer/generate_seeds.py @@ -0,0 +1,776 @@ +#!/usr/bin/env python3 +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Structured seed generation for the GStreamer OSS-Fuzz fuzz targets. + +The fuzz targets in ci/fuzzing each parse a well-defined binary or textual +format, but ship with only a handful of corpus files (2-12 each, and the +push-based `typefind` target has none). This script emits structurally valid +seeds for each target so the fuzzer starts from inputs that actually reach the +parsing code instead of having to rediscover the container layouts. 
+ +Targets and the code they exercise (see the public summary.json for the +weakly-covered files): + + gst-codec-utils pbutils/codec-utils.c H.264/H.265/H.266 profile-tier-level, + AV1 av1C, Opus header parsing + gst-tag gst/tag/*.c ID3v1/ID3v2 frames, EXIF IFD, XMP, + Vorbis comments + gst-subparse subparse element SubRip, WebVTT, MicroDVD, SubViewer, + MPL2, SAMI, TMPlayer, LRC + typefind gst/typefind + plugins magic-based type detection for many + container/codec formats + gst-discoverer ogg/theora/vorbis (kept minimal -- needs real media) + +Each format is emitted into a per-target subdirectory: + <output_dir>/<target>/<file> +so the build script can zip each subdirectory into the matching +<target>_seed_corpus.zip. + +Only the Python standard library is used. + +Usage: python3 generate_seeds.py [output_dir] +""" + +import os +import sys +import struct +import zlib + + +def w(d, name, data): + os.makedirs(d, exist_ok=True) + with open(os.path.join(d, name), "wb") as f: + f.write(data if isinstance(data, bytes) else data.encode("latin-1")) + + +# ========================================================================== +# gst-codec-utils +# ========================================================================== +def gen_codec_utils(base): + d = os.path.join(base, "gst-codec-utils") + + # --- H.264: codec_data is read as [profile_idc, constraints, level_idc]. + # AVCDecoderConfigurationRecord (as carried in MP4 'avcC').
+ for name, prof, lvl in [("baseline", 66, 30), ("main", 77, 31), + ("high", 100, 40), ("high10", 110, 50), + ("high422", 122, 51), ("high444", 244, 52)]: + sps = bytes([0x67, prof, 0x00, lvl, 0xAC, 0xB2, 0x00, 0x07]) + pps = bytes([0x68, 0xCE, 0x3C, 0x80]) + avcc = bytes([0x01, prof, 0x00, lvl, 0xFF, 0xE1]) + \ + struct.pack(">H", len(sps)) + sps + b"\x01" + \ + struct.pack(">H", len(pps)) + pps + w(d, "h264_avcc_%s.bin" % name, avcc) + # the bare 3-byte profile/flags/level slice too + w(d, "h264_ptl_%s.bin" % name, bytes([prof, 0x00, lvl])) + + # --- H.265 profile_tier_level (12 bytes minimum to set level too). + def hevc_ptl(profile_idc, tier, level_idc, compat_bit=1): + b = bytearray(12) + b[0] = (profile_idc & 0x1f) # space=0 tier=0 profile_idc + if tier: + b[0] |= 0x20 + # general_profile_compatibility_flags (32 bits) at bytes 1..4 + compat = 1 << (31 - compat_bit) + b[1:5] = struct.pack(">I", compat) + # constraint flags bytes 5..10 (leave general_progressive etc set) + b[5] = 0x90 + b[11] = level_idc # general_level_idc + return bytes(b) + # A few profile/tier/level variants. (Empirically the harness's h264/h265 + # profile helpers extract fields without per-value line branches, so a + # larger matrix adds ~no coverage -- keep this set small.) + for name, p, t, l in [("main", 1, 0, 120), ("main10", 2, 0, 123), + ("main_still", 3, 0, 90), ("high_tier", 1, 1, 150), + ("rext4", 4, 0, 153)]: + w(d, "h265_ptl_%s.bin" % name, hevc_ptl(p, t, l)) + + # --- H.266/VVC profile_tier_level-ish payload. + for name, byte0, lvl in [("main10", 0x01, 51), ("main10_444", 0x21, 67)]: + w(d, "h266_ptl_%s.bin" % name, + bytes([byte0, 0x00, 0x00, 0x00, lvl, 0x00, 0x00, 0x00])) + + # --- AV1 codec configuration record (av1C). 
+ # marker(1)=1 version(7)=1 -> 0x81 ; seq_profile(3) seq_level_idx(5) + for name, prof, lvl in [("main_l30", 0, 1), ("high_l40", 1, 8), + ("pro_l50", 2, 16)]: + seqhdr = bytes([0x0A, 0x0B, 0x00, 0x00, 0x00, 0x24, 0xCF, 0xBF, + 0x1B, 0xE0, 0x01, 0x40]) # tiny OBU_SEQUENCE_HEADER + av1c = bytes([0x81, (prof << 5) | lvl, 0x00, 0x00]) + seqhdr + w(d, "av1c_%s.bin" % name, av1c) + + # --- Opus header tail (harness prepends "OpusHead\x01"). + # layout after version: channels(1) pre_skip(2) rate(4) gain(2) mapping(1) + for name, ch, family in [("mono", 1, 0), ("stereo", 2, 0), + ("surround", 6, 1)]: + tail = bytes([ch]) + struct.pack("> 21) & 0x7f, (len(data) >> 14) & 0x7f, + (len(data) >> 7) & 0x7f, len(data) & 0x7f]) + else: # 2.3 plain size + sz = struct.pack(">I", len(data)) + return fid + sz + b"\x00\x00" + data + + +def _id3v2_tag(frames, version=4): + body = b"".join(frames) + size = len(body) + ssize = bytes([(size >> 21) & 0x7f, (size >> 14) & 0x7f, + (size >> 7) & 0x7f, size & 0x7f]) + return b"ID3" + bytes([version, 0, 0]) + ssize + body + + +def _tiff_exif_multi(): + """A TIFF/EXIF buffer with IFD0 + an Exif sub-IFD (0x8769) + a GPS sub-IFD + (0x8825), little-endian, covering ~all tags gstexiftag.c maps so each tag's + deserializer branch runs. Layout: 8-byte TIFF header, then the three IFDs + laid out back-to-back, with all out-of-line values in a trailing data + area. Types: 2=ASCII 3=SHORT 4=LONG 5=RATIONAL 7=UNDEFINED 10=SRATIONAL.""" + import struct as _s + HDR = 8 + + def build_ifd(entries, ifd_offset, next_off=0): + # entries: list of (tag, typ, count, raw_value_bytes_or_inline) + n = len(entries) + ifd_size = 2 + n * 12 + 4 + data_off = ifd_offset + ifd_size + body = _s.pack(" parse_split_strings + _id3v2_frame(b"TPE1", text + b"Artist A\x00Artist B\x00Artist C"), + ] + w(d, "id3v2_4_full.bin", _id3v2_tag(frames, 4)) + + # --- ID3v2.3 variant. 
+ frames3 = [ + _id3v2_frame(b"TIT2", b"\x00Title v23", 3), + _id3v2_frame(b"TPE1", b"\x01\xff\xfeA\x00r\x00t\x00", 3), # UTF-16 + _id3v2_frame(b"APIC", b"\x00image/jpeg\x00\x03\x00" + b"\xff\xd8" * 4, + 3), + ] + w(d, "id3v2_3.bin", _id3v2_tag(frames3, 3)) + + # --- ID3v1 (exactly 128 bytes). + v1 = b"TAG" + b"Title".ljust(30, b"\x00") + b"Artist".ljust(30, b"\x00") \ + + b"Album".ljust(30, b"\x00") + b"2026" \ + + b"Comment".ljust(28, b"\x00") + b"\x00" + b"\x03" + bytes([17]) + w(d, "id3v1.bin", v1) + + # --- EXIF with a TIFF header (little-endian) + an IFD of common tags. + def tiff_exif(): + entries = [ + (0x010F, 2, b"Make\x00"), # Make (ASCII) + (0x0110, 2, b"Model XYZ\x00"), # Model + (0x0112, 3, struct.pack("' + '' + '' + '' + 'Fuzz Title' + '' + 'An Author' + '' + '') + w(d, "xmp.bin", xmp.encode("utf-8")) + + # --- Vorbis comment buffer. + def vorbiscomment(): + vendor = b"fuzz libVorbis" + comments = [b"TITLE=Vorbis Title", b"VERSION=remaster", + b"ARTIST=Vorbis Artist", b"PERFORMER=The Performer", + b"ALBUM=Album", b"DATE=2026-06-12", b"GENRE=Electronic", + b"TRACKNUMBER=4", b"TRACKTOTAL=12", b"DISCNUMBER=1", + b"COPYRIGHT=(c) 2026", b"LICENSE=CC-BY", + b"ORGANIZATION=Label", b"DESCRIPTION=a track", + b"LOCATION=Studio", b"CONTACT=info@example.com", + b"ISRC=US-XXX-26-00001", b"COMPOSER=A Composer", + b"REPLAYGAIN_TRACK_GAIN=-2.1 dB", + b"REPLAYGAIN_TRACK_PEAK=0.98", + b"REPLAYGAIN_ALBUM_GAIN=-1.5 dB", + b"REPLAYGAIN_ALBUM_PEAK=0.99", + b"MUSICBRAINZ_TRACKID=abc-123", + b"MUSICBRAINZ_ARTISTID=def-456", + b"BPM=128", b"LANGUAGE=eng", + b"METADATA_BLOCK_PICTURE=AAAAAA=="] + out = struct.pack(" 00:00:04,000\nHello bold world\n\n" + "2\n00:00:05,500 --> 00:00:08,250\nSecond line\nwith two rows\n\n" + "3\n00:01:02,100 --> 00:01:05,000\nitalic {\\an8}top\n") + w(d, "webvtt.vtt", + "WEBVTT - Some title\n\nNOTE a comment\n\n" + "1\n00:00:00.000 --> 00:00:02.000 line:0 position:50%\n" + "Hello\n\n" + "00:00:02.000 --> 00:00:04.000\nstyled text\n") + w(d, 
"microdvd.sub", + "{1}{1}29.970\n{0}{60}First subtitle|second row\n" + "{75}{120}{y:i}Italic line\n{150}{200}Another\n") + w(d, "subviewer.sub", + "[INFORMATION]\n[TITLE]Fuzz\n[END INFORMATION]\n" + "00:00:01.00,00:00:03.00\nFirst caption\n\n" + "00:00:04.00,00:00:06.00\nSecond caption\n") + w(d, "mpl2.txt", + "[10][30]First mpl2 line\n[35][60]Second line|next row\n") + w(d, "sami.smi", + "Fuzz" + "" + "

First

" + "

Second

\n") + w(d, "tmplayer.txt", + "00:00:01:First TMPlayer line\n00:00:04:Second line\n") + w(d, "lrc.lrc", + "[ti:Song]\n[ar:Artist]\n[00:01.00]First lyric\n[00:04.50]Second\n") + w(d, "qttext.txt", + "{QTtext}{font:Geneva}{size:12}\n[00:00:01.00]\nFirst caption\n") + + +# ========================================================================== +# typefind (push-based type detection -- only the leading magic matters) +# ========================================================================== +def _png(): + sig = b"\x89PNG\r\n\x1a\n" + ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0) + def chunk(t, d): + return struct.pack(">I", len(d)) + t + d + \ + struct.pack(">I", zlib.crc32(t + d) & 0xffffffff) + idat = zlib.compress(b"\x00\xff\x00\x00") + return sig + chunk(b"IHDR", ihdr) + chunk(b"IDAT", idat) + \ + chunk(b"IEND", b"") + + +def _riff_wav(): + fmt = struct.pack("I", 0x200) + brand + b"mp42" + ftyp = struct.pack(">I", len(ftyp) + 4) + ftyp + mdat = struct.pack(">I", 16) + b"mdat" + b"\x00" * 8 + return ftyp + mdat + + +def _matroska(): + # EBML header declaring a Matroska/WebM doctype. + def vint(n): + return bytes([0x80 | n]) + ebml = (b"\x1aE\xdf\xa3") # EBML id + doctype = b"\x42\x82" + vint(8) + b"matroska" + body = (b"\x42\x86" + vint(1) + b"\x01" + # EBMLVersion + b"\x42\xf7" + vint(1) + b"\x01" + # EBMLReadVersion + doctype + + b"\x42\x87" + vint(1) + b"\x02" + # DocTypeVersion + b"\x42\x85" + vint(1) + b"\x02") # DocTypeReadVersion + hdr = ebml + vint(len(body)) + body + # a Segment id so demux start is plausible + seg = b"\x18\x53\x80\x67" + b"\x01\x00\x00\x00\x00\x00\x10\x00" + return hdr + seg + b"\x00" * 16 + + +def _ogg(): + # OggS page header (version 0, BOS) + a vorbis id header start. 
+ hdr = b"OggS" + bytes([0, 0x02]) + b"\x00" * 8 + b"\x00" * 4 + \ + struct.pack("I", 0x00000022) + b"\x10\x00\x10\x00" + \ + b"\x00\x00\x00" + b"\x00\x00\x00" + b"\x0a\xc4\x42\xf0" + b"\x00" * 16 + return b"fLaC" + streaminfo + + +def _mp3_id3(): + # MPEG-1 Layer III, 128 kbit/s, 44.1 kHz -> frame length 417 bytes. + # The typefinder requires GST_MP3_TYPEFIND_MIN_HEADERS (2) consecutive + # consistent frames, so emit several so the frame-scan loop confirms. + frame = b"\xff\xfb\x90\x00" + b"\x00" * 413 + return _id3v2_tag([_id3v2_frame(b"TIT2", b"\x00mp3")], 4) + frame * 6 + + +def _adts_aac(): + # ADTS, AAC-LC, 44.1 kHz, stereo; aac_frame_length = 32 bytes per frame. + # Repeat so the ADTS frame-walk confirms instead of bailing after one. + fl = 32 + hdr = bytes([0xFF, 0xF1, 0x50, 0x80 | ((fl >> 11) & 3), + (fl >> 3) & 0xFF, ((fl & 7) << 5) | 0x1F, 0xFC]) + return (hdr + b"\x00" * (fl - len(hdr))) * 6 + + +def _mpegts(): + # 12 transport packets of 188 bytes, sync 0x47; first is the PAT (PID 0). + # mpeg_ts typefind needs >= 4 sync bytes spaced by a valid packet size. + pat = bytes([0x47, 0x40, 0x00, 0x10, 0x00]) + bytes([0x00, 0xB0, 0x0D, + 0x00, 0x01, 0xC1, 0x00, 0x00, 0x00, 0x01, 0xF0, 0x00]) + pat = pat.ljust(188, b"\xff") + null = bytes([0x47, 0x1F, 0xFF, 0x10]) + b"\x00" * 184 + return pat + null * 11 + + +def _quicktime_full(): + # ftyp + moov (with mvhd) + mdat -> have_moov && have_mdat confirms, + # and the atom-walk loop runs over several nested atoms.
+ def atom(typ, payload=b""): + return struct.pack(">I", 8 + len(payload)) + typ + payload + ftyp = atom(b"ftyp", b"qt " + struct.pack(">I", 0x200) + b"qt ") + mvhd = atom(b"mvhd", b"\x00" * 100) + tkhd = atom(b"tkhd", b"\x00" * 84) + trak = atom(b"trak", tkhd) + moov = atom(b"moov", mvhd + trak) + mdat = atom(b"mdat", b"\x00" * 32) + return ftyp + moov + mdat + + +def gen_typefind(base): + d = os.path.join(base, "typefind") + w(d, "png.png", _png()) + w(d, "jpeg.jpg", b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00" + + b"\x00" * 16 + b"\xff\xd9") + w(d, "gif.gif", b"GIF89a" + struct.pack("IIIII", 24, 16, 1, 8000, 1)) + w(d, "midi.mid", b"MThd" + struct.pack(">IHHH", 6, 0, 1, 96) + + b"MTrk" + struct.pack(">I", 4) + b"\x00\xff\x2f\x00") + w(d, "flv.flv", b"FLV\x01\x05" + struct.pack(">I", 9) + + struct.pack(">I", 0) + b"\x00" * 16) + w(d, "mpegts.ts", _mpegts()) + w(d, "quicktime_full.mov", _quicktime_full()) + w(d, "mpeg_ps.mpg", b"\x00\x00\x01\xba" + b"\x44\x00\x04\x00\x04\x01" + + b"\x00\x03\xf8" + b"\x00\x00\x01\xbb" + b"\x00" * 16) + w(d, "h264.h264", b"\x00\x00\x00\x01\x67\x42\x00\x1f" + b"\x00" * 8 + + b"\x00\x00\x00\x01\x68\xce\x3c\x80") + w(d, "caf.caf", b"caff\x00\x01\x00\x00" + b"\x00" * 16) + w(d, "ico.ico", b"\x00\x00\x01\x00\x01\x00\x10\x10\x00\x00" + b"\x00" * 16) + w(d, "tiff.tiff", b"II*\x00" + struct.pack("") + w(d, "html.html", b"") + w(d, "svg.svg", b"") + w(d, "bzip2.bz2", b"BZh91AY&SY" + b"\x00" * 16) + w(d, "gzip.gz", b"\x1f\x8b\x08\x00" + b"\x00" * 16) + w(d, "zip.zip", b"PK\x03\x04\x14\x00" + b"\x00" * 26) + w(d, "elf.bin", b"\x7fELF\x02\x01\x01\x00" + b"\x00" * 16) + w(d, "utf8text.txt", b"plain ascii then unicode \xc3\xa9\xc3\xa8 text\n") + + pad = b"\x00" * 64 + + # ---- ISO-BMFF brand variants (distinct quicktime/mj2/3gp/heif paths) ---- + w(d, "mj2.mj2", _isobmff(b"mjp2")) + w(d, "3gp.3gp", _isobmff(b"3gp4")) + w(d, "avif.avif", _isobmff(b"avif")) + w(d, "m4a.m4a", _isobmff(b"M4A ")) + # bare QuickTime atoms (no ftyp): moov-first / 
mdat-first / wide / free + for atom in (b"moov", b"mdat", b"wide", b"free", b"skip", b"pnot"): + w(d, "qt_%s.mov" % atom.decode(), struct.pack(">I", 16) + atom + pad) + + # ---- RIFF / IFF family ---- + def iff(form): + body = form + pad + return b"FORM" + struct.pack(">I", len(body)) + body + w(d, "aiff.aiff", iff(b"AIFF")) + w(d, "aifc.aifc", iff(b"AIFC")) + w(d, "iff_8svx.iff", iff(b"8SVX")) + w(d, "iff_16sv.iff", iff(b"16SV")) + w(d, "iff_ilbm.iff", iff(b"ILBM")) + + # ---- container / stream formats by leading magic ---- + table = { + # audio + "aac_adif.aac": b"ADIF" + pad, + "shorten.shn": b"ajkg" + b"\x02" + pad, + "ape_tag.apetag": b"APETAGEX" + struct.pack("I", 18) + pad, + "asf.asf": b"\x30\x26\xb2\x75\x8e\x66\xcf\x11\xa6\xd9\x00\xaa" + b"\x00\x62\xce\x6c" + pad, + "swf.swf": b"FWS\x09" + struct.pack("", + "ttml.ttml": b"", + "smil.smil": b"", + # ar / tar already-ish; ar archive + "ar.a": b"!\n" + b"foo/ 0 0 0 100644 4 `\n" + b"\x00\x00\x00\x00", + # tracker module formats (4cc / signature) + "mod_xm.xm": b"Extended Module: seed" + b"\x00" * 37 + b"\x1a", + "mod_it.it": b"IMPM" + pad, + "mod_dbm.dbm": b"DBM0" + pad, + "mod_dsm.dsm": b"DSMF" + pad, + "mod_far.far": b"FAR\xfe" + pad, + "mod_mmd.med": b"MMD0" + pad, + "mod_okta.okt": b"OKTASONG" + pad, + "mod_psm.psm": b"PSM " + pad, + "digibooster.dbm": b"DIGI Booster module\x00" + pad, + # codec elementary streams (syncword / NAL) + "ac3.ac3": (b"\x0b\x77\x00\x00\x3c\x00" + b"\x00" * 250) * 3, + "dts.dts": (b"\x7f\xfe\x80\x01\x00\x00\x00\x00" + b"\x00" * 120) * 3, + "eac3.eac3": (b"\x0b\x77\x18\x00" + b"\x00" * 200) * 3, + "h263.h263": b"\x00\x00\x80\x02" + pad, + "h265.h265": b"\x00\x00\x00\x01\x40\x01" + b"\x00" * 8 + + b"\x00\x00\x00\x01\x42\x01" + b"\x00" * 8 + + b"\x00\x00\x00\x01\x44\x01" + pad, + "h266.h266": b"\x00\x00\x00\x01\x00\x79" + b"\x00" * 8 + + b"\x00\x00\x00\x01\x00\x81" + pad, + "mpeg_es.mpv": b"\x00\x00\x01\xb3\x16\x01\x20\xc4" + b"\x00" * 16 + + b"\x00\x00\x01\xb8" + pad, + 
"mpeg4_es.m4v": b"\x00\x00\x01\xb0\x01\x00\x00\x01\xb5" + pad, + "av1.obu": b"\x12\x00\x0a\x0b\x00\x00\x00\x24\xcf\xbf\x1b\xe0\x01\x40" + + pad, + "dv.dv": b"\x1f\x07\x00\x3f" + b"\x00" * 76 + b"\x1f\x07\x01\x3f" + pad, + "pva.pva": b"AV\x01\x00" + pad, + } + for name, data in table.items(): + w(d, name, data) + + # ---- Ogg-wrapped codec identification packets ---- + def ogg_packet(payload, serial=1): + nseg = (len(payload) + 254) // 255 + segtab = bytes([255] * (nseg - 1) + + [len(payload) - 255 * (nseg - 1)]) if nseg else b"\x00" + return (b"OggS" + bytes([0, 0x02]) + b"\x00" * 8 + + struct.pack(" 1 else "gst_seeds" + os.makedirs(out, exist_ok=True) + gen_codec_utils(out) + gen_tag(out) + gen_subparse(out) + gen_typefind(out) + gen_discoverer(out) + total = 0 + for root, _, files in os.walk(out): + total += len(files) + sys.stderr.write("generate_seeds.py: wrote %d seeds under %s\n" + % (total, out)) + + +if __name__ == "__main__": + main() diff --git a/projects/libheif/Dockerfile b/projects/libheif/Dockerfile index d6df05b487a1..32bdf4e05304 100644 --- a/projects/libheif/Dockerfile +++ b/projects/libheif/Dockerfile @@ -24,4 +24,4 @@ RUN git clone \ WORKDIR libheif -COPY build.sh $SRC/ +COPY build.sh generate_seeds.py $SRC/ diff --git a/projects/libheif/build.sh b/projects/libheif/build.sh index 14f8a016f37d..c66cc8457084 100755 --- a/projects/libheif/build.sh +++ b/projects/libheif/build.sh @@ -17,3 +17,18 @@ # Delegate actual building to the script provided by libheif. ./scripts/build-oss-fuzz.sh + +# Structured HEIF/ISOBMFF seeds (see generate_seeds.py). These exercise the +# box / item-property / derivation (grid, iovl) parsing in box.cc and the +# movie-box path in seq_boxes.cc that the shipped .heic corpus does not carry. +# box_fuzzer and tile_fuzzer ship no corpus at all, so they gain the most. 
+python3 "$SRC/generate_seeds.py" "$SRC/generated_heif_seeds" + +# Name corpora with the underscore convention OSS-Fuzz actually loads +# (_seed_corpus.zip); the binaries are file_fuzzer / box_fuzzer / +# tile_fuzzer. +zip -j -q "$OUT/box_fuzzer_seed_corpus.zip" "$SRC"/generated_heif_seeds/*.heif +zip -j -q "$OUT/tile_fuzzer_seed_corpus.zip" "$SRC"/generated_heif_seeds/*.heif +# file_fuzzer: the stock .heic corpus plus the generated container seeds. +cp "$SRC"/libheif/fuzzing/data/corpus/*.heic "$SRC"/generated_heif_seeds/ 2>/dev/null || true +zip -j -q "$OUT/file_fuzzer_seed_corpus.zip" "$SRC"/generated_heif_seeds/* diff --git a/projects/libheif/generate_seeds.py b/projects/libheif/generate_seeds.py new file mode 100644 index 000000000000..e79aede25bd6 --- /dev/null +++ b/projects/libheif/generate_seeds.py @@ -0,0 +1,524 @@ +#!/usr/bin/env python3 +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""Structured HEIF/ISOBMFF seed generation for the libheif OSS-Fuzz targets. + +libheif builds six fuzzers; only `file-fuzzer` ships a seed corpus. The +`box-fuzzer` (which loops over `Box::read` and dumps every box) and the other +targets start from nothing. 
The public coverage report and box.cc show a very +large set of recognised box / item-property types (clap, irot, imir, ispe, +pixi, colr, pasp, auxC, clli, mdcv, cclv, amve, a1lx, a1op, lsel, uncC, cmpC, +grid/iovl derivations, iref reference types, ...), most of which the existing +.heic corpus never carries. + +This script synthesises structurally valid ISOBMFF/HEIF files that exercise the +container, item-property and derivation parsing paths. No external codec is +required: the files are about *box structure*, which `box-fuzzer` parses in +full and `file-fuzzer` walks before attempting any decode. We also emit an +uncompressed-codec (`uncC`/`unci`) image, which libheif can actually decode +from pure container data, reaching the decode + colour-conversion paths. + +Pure Python standard library only. + +Usage: python3 generate_seeds.py +""" + +import os +import sys +import struct + + +def w(d, name, data): + os.makedirs(d, exist_ok=True) + with open(os.path.join(d, name), "wb") as f: + f.write(data) + + +# -------------------------------------------------------------------------- +# ISOBMFF box helpers +# -------------------------------------------------------------------------- +def box(typ, payload=b""): + assert len(typ) == 4 + return struct.pack(">I", 8 + len(payload)) + typ.encode("latin-1") + payload + + +def fullbox(typ, version, flags, payload=b""): + hdr = struct.pack(">I", (version << 24) | (flags & 0xFFFFFF)) + return box(typ, hdr + payload) + + +# --- item property boxes (live inside ipco) ------------------------------- +def p_ispe(wd, ht): + return fullbox("ispe", 0, 0, struct.pack(">II", wd, ht)) + + +def p_pixi(channels): + return fullbox("pixi", 0, 0, bytes([len(channels)]) + bytes(channels)) + + +def p_irot(angle): # 0,1,2,3 -> 0/90/180/270 + return box("irot", bytes([angle & 3])) + + +def p_imir(axis): # 0 vertical, 1 horizontal + return box("imir", bytes([axis & 1])) + + +def p_clap(): + # cleanAperture: 
widthN,widthD,heightN,heightD,horizOffN/D,vertOffN/D + return box("clap", struct.pack(">iiiiiiii", 32, 1, 32, 1, 0, 1, 0, 1)) + + +def p_pasp(): + return box("pasp", struct.pack(">II", 1, 1)) + + +def p_colr_nclx(): + return box("colr", b"nclx" + struct.pack(">HHH", 1, 13, 1) + bytes([0x80])) + + +def p_colr_ricc(): + icc = b"\x00\x00\x00\x0cseed-iccprof" + return box("colr", b"rICC" + icc) + + +def p_auxC(uri): + return fullbox("auxC", 0, 0, uri.encode("latin-1") + b"\x00") + + +def p_clli(): + return box("clli", struct.pack(">HH", 1000, 50)) + + +def p_mdcv(): + return box("mdcv", struct.pack(">HHHHHHHHII", + 13250, 34500, 7500, 3000, 34000, 16000, 15635, 16450, + 10000000, 50)) + + +def p_hvcC(): + # Minimal HEVCDecoderConfigurationRecord header (no nal arrays). + return box("hvcC", bytes([0x01, 0x01, 0x60, 0x00, 0x00, 0x00, 0x90, 0x00, + 0x00, 0x00, 0x00, 0x00, 0x3C, 0xF0, 0x00, 0xFC, + 0xFD, 0xF8, 0xF8, 0x00, 0x00, 0x00]) + + bytes([0x00])) + + +def p_av1C(): + return box("av1C", bytes([0x81, 0x00, 0x0C, 0x00])) + + +def p_a1lx(): + # one flags byte (0) followed by three 32-bit layer sizes + return box("a1lx", bytes([0]) + struct.pack(">III", 16, 16, 16)) + + +def p_a1op(): + return box("a1op", bytes([1])) + + +def p_lsel(): + return box("lsel", struct.pack(">H", 0)) + + +def p_uncC_unci(): + # uncompressed-frame config (uncC v1 'tiled' minimal) + component defs.
+ cmpd = box("cmpd", struct.pack(">I", 3) + + struct.pack(">H", 4) + struct.pack(">H", 5) + struct.pack(">H", 6)) + uncC = fullbox("uncC", 1, 0, b"\x00\x00\x00\x00") + return cmpd + uncC + + +def p_pymd(): + return box("pymd", struct.pack(">HH", 1, 1) + b"\x00" * 4) + + +def p_mskC(): + return fullbox("mskC", 0, 0, bytes([8]) + b"\x00\x00\x00") + + +# -------------------------------------------------------------------------- +# meta-box assembly +# -------------------------------------------------------------------------- +def make_meta(items, properties, associations, irefs=b"", idat=b"", + primary=1): + """items: list of (item_id, item_type, extra_infe_payload) + properties: list of property box bytes (1-based index order) + associations: list of (item_id, [(prop_index, essential), ...]) + irefs: pre-built iref payload (concatenated reference boxes)""" + hdlr = fullbox("hdlr", 0, 0, + struct.pack(">I", 0) + b"pict" + b"\x00" * 12 + + b"libheif-seed\x00") + pitm = fullbox("pitm", 0, 0, struct.pack(">H", primary)) + + # iinf with infe entries + infes = b"" + for iid, ityp, extra in items: + payload = struct.pack(">HH", iid, 0) + ityp.encode("latin-1") + \ + b"seed\x00" + extra + infes += fullbox("infe", 2, 0, payload) + iinf = fullbox("iinf", 0, 0, struct.pack(">H", len(items)) + infes) + + # iprp = ipco (properties) + ipma (associations) + ipco = box("ipco", b"".join(properties)) + ipma_body = struct.pack(">I", len(associations)) + for iid, props in associations: + ipma_body += struct.pack(">H", iid) + bytes([len(props)]) + for idx, essential in props: + ipma_body += bytes([((0x80 if essential else 0) | (idx & 0x7F))]) + ipma = fullbox("ipma", 0, 0, ipma_body) + iprp = box("iprp", ipco + ipma) + + # iloc: place every item inside the idat box (construction_method=1) + # offset_size=4 and length_size=4 (matching the 4-byte extent offset and + # length packed below), base_offset_size=0, index_size=0 + iloc_body = bytes([(4 << 4) | 4, (0 << 4) | 0]) + iloc_body += struct.pack(">H", len(items)) + off = 0 + for iid, ityp, extra in items: + ln = 16 + iloc_body +=
struct.pack(">H", iid) + struct.pack(">H", 1) # method=1 idat + iloc_body += struct.pack(">H", 0) # data_ref_index + iloc_body += struct.pack(">H", 1) # extent_count + iloc_body += struct.pack(">I", off) + struct.pack(">I", ln) + off += ln + iloc = fullbox("iloc", 1, 0, iloc_body) + + idat_box = box("idat", idat) if idat else b"" + iref_box = fullbox("iref", 0, 0, irefs) if irefs else b"" + + body = hdlr + pitm + iinf + iref_box + iprp + iloc + idat_box + return fullbox("meta", 0, 0, body) + + +def iref_entry(ref_type, from_id, to_ids): + payload = struct.pack(">H", from_id) + struct.pack(">H", len(to_ids)) + for t in to_ids: + payload += struct.pack(">H", t) + return box(ref_type, payload) + + +FTYP = box("ftyp", b"heic" + struct.pack(">I", 0) + b"mif1heic") + + +# -------------------------------------------------------------------------- +def seed_comprehensive(): + """A file carrying a large variety of item properties + several items.""" + props = [ + p_ispe(32, 32), p_pixi([8, 8, 8]), p_irot(1), p_imir(0), p_clap(), + p_pasp(), p_colr_nclx(), p_colr_ricc(), p_auxC("urn:mpeg:hevc:aux:alpha"), + p_clli(), p_mdcv(), p_hvcC(), p_av1C(), p_a1lx(), p_a1op(), p_lsel(), + p_mskC(), + ] + items = [ + (1, "hvc1", b""), # primary coded image + (2, "av01", b""), # av1 coded image + (3, "Exif", b""), # metadata item + (4, "mime", b"\x00application/rdf+xml\x00"), + ] + assoc = [ + (1, [(1, False), (2, False), (3, False), (12, True), (7, False), + (10, False), (11, False)]), + (2, [(1, False), (2, False), (13, True), (14, False), (15, False)]), + ] + irefs = (iref_entry("cdsc", 3, [1]) + iref_entry("thmb", 2, [1]) + + iref_entry("auxl", 2, [1])) + idat = b"\x00" * (16 * len(items)) + meta = make_meta(items, props, assoc, irefs, idat) + return FTYP + meta + box("mdat", b"\x00" * 32) + + +def seed_grid(): + """A 2x2 'grid' derived image referencing four coded tiles.""" + # grid item payload: version, flags, rows-1, cols-1, output W, H (16-bit) + grid_data = bytes([0, 0, 1, 1]) 
+ struct.pack(">HH", 64, 64) + props = [p_ispe(64, 64), p_pixi([8, 8, 8]), p_hvcC(), p_colr_nclx()] + items = [(1, "grid", b"")] + [(i, "hvc1", b"") for i in range(2, 6)] + assoc = [(1, [(1, False), (4, False)])] + \ + [(i, [(1, False), (3, True), (4, False)]) for i in range(2, 6)] + irefs = iref_entry("dimg", 1, [2, 3, 4, 5]) + idat = grid_data.ljust(16, b"\x00") + b"\x00" * (16 * 4) + meta = make_meta(items, props, assoc, irefs, idat) + return FTYP + meta + box("mdat", b"\x00" * 64) + + +def seed_overlay(): + """An 'iovl' overlay derived image.""" + # iovl: version/flags, canvas_fill (4x16), output W,H, then per-image x,y + iovl = bytes([0, 0]) + struct.pack(">HHHH", 0, 0, 0, 0xFFFF) + iovl += struct.pack(">HH", 64, 64) + iovl += struct.pack(">hh", 0, 0) + struct.pack(">hh", 16, 16) + props = [p_ispe(64, 64), p_pixi([8, 8, 8]), p_hvcC()] + items = [(1, "iovl", b""), (2, "hvc1", b""), (3, "hvc1", b"")] + assoc = [(1, [(1, False)]), (2, [(1, False), (3, True)]), + (3, [(1, False), (3, True)])] + irefs = iref_entry("dimg", 1, [2, 3]) + idat = iovl.ljust(16, b"\x00") + b"\x00" * 32 + meta = make_meta(items, props, assoc, irefs, idat) + return FTYP + meta + box("mdat", b"\x00" * 64) + + +def _heif_single_item(item_type, props, assoc, item_data, extra_props_meta=b""): + """Build a complete HEIF with one item whose data lives in an mdat box, + with a *correctly sized* iloc extent so the item actually decodes.""" + hdlr = fullbox("hdlr", 0, 0, struct.pack(">I", 0) + b"pict" + + b"\x00" * 12 + b"seed\x00") + pitm = fullbox("pitm", 0, 0, struct.pack(">H", 1)) + infe = fullbox("infe", 2, 0, struct.pack(">HH", 1, 0) + + item_type.encode("latin-1") + b"\x00") + iinf = fullbox("iinf", 0, 0, struct.pack(">H", 1) + infe) + ipco = box("ipco", b"".join(props)) + ipma_body = struct.pack(">I", 1) + struct.pack(">H", 1) + bytes([len(assoc)]) + for idx, ess in assoc: + ipma_body += bytes([(0x80 if ess else 0) | (idx & 0x7F)]) + ipma = fullbox("ipma", 0, 0, ipma_body) + iprp =
box("iprp", ipco + ipma) + + # iloc v1: offset_size=4 length_size=4, base_offset_size=0 index_size=0, + # one item, construction_method=0 (file offset), one extent. The data lives + # in an mdat box appended after the meta box, so the absolute offset is + # len(FTYP)+len(meta)+8. The offset field width is fixed, so the meta size + # is the same whether we use a placeholder or the real offset. + def make_iloc(offset): + il = bytes([(4 << 4) | 4, 0]) + struct.pack(">H", 1) + il += struct.pack(">H", 1) + struct.pack(">H", 0) + struct.pack(">H", 0) + il += struct.pack(">H", 1) + struct.pack(">I", offset) + \ + struct.pack(">I", len(item_data)) + return fullbox("iloc", 1, 0, il) + + meta = fullbox("meta", 0, 0, hdlr + pitm + iinf + iprp + make_iloc(0)) + data_offset = len(FTYP) + len(meta) + 8 # +8 = mdat box header + meta = fullbox("meta", 0, 0, + hdlr + pitm + iinf + iprp + make_iloc(data_offset)) + return FTYP + meta + box("mdat", item_data) + + +def _unci_image(w, h, comp_types, interleave, bit_depth=8, comp_format=0, + comp_align=0, flags=0, pixel_size=0, row_align=0, + tile_align=0, tile_cols=1, tile_rows=1, sampling=0, + block_size=0): + """A decodable ISO 23001-17 uncompressed image. 
interleave: 0=component, + 1=pixel, 2=mixed, 3=row, 4=tile-component, 5=multi-Y.""" + nc = len(comp_types) + bpc = (bit_depth + 7) // 8 + + def val(x, y, c): + return (x * 37 + y * 17 + c * 53) & ((1 << bit_depth) - 1) + + def emit(v): + if bpc == 1: + return bytes([v & 0xFF]) + if flags & 0x80: # components_little_endian + return bytes([v & 0xFF, (v >> 8) & 0xFF]) + return bytes([(v >> 8) & 0xFF, v & 0xFF]) + + data = bytearray() + if interleave == 1 or interleave == 2: # pixel / mixed + for y in range(h): + for x in range(w): + for c in range(nc): + data += emit(val(x, y, c)) + if pixel_size: + while len(data) % pixel_size: + data += b"\x00" + elif interleave == 3: # row + for y in range(h): + for c in range(nc): + for x in range(w): + data += emit(val(x, y, c)) + else: # component / tile-component + for c in range(nc): + for y in range(h): + for x in range(w): + data += emit(val(x, y, c)) + pixels = bytes(data) + + cmpd = box("cmpd", struct.pack(">I", nc) + + b"".join(struct.pack(">H", t) for t in comp_types)) + u = struct.pack(">II", 0, nc) + for i in range(nc): + u += struct.pack(">HBBB", i, bit_depth - 1, comp_format, comp_align) + u += struct.pack(">BBBB", sampling, interleave, block_size, flags) + u += struct.pack(">IIIII", pixel_size, row_align, tile_align, + tile_cols - 1, tile_rows - 1) + # uncC version 0 = explicit component configuration (profile=0). Version 1 + # is the compact form that requires a known profile 4cc and omits the + # component array, so it must NOT be used with an explicit component list. + uncC = fullbox("uncC", 0, 0, u) + ispe = fullbox("ispe", 0, 0, struct.pack(">II", w, h)) + props = [ispe, p_pixi([bit_depth] * nc), cmpd, uncC] + assoc = [(1, False), (2, False), (3, True), (4, True)] + return _heif_single_item("unci", props, assoc, pixels) + + +# Component types (ISO 23001-17): 0=mono 1=Y 2=Cb 3=Cr 4=R 5=G 6=B 7=alpha. 
+def seed_unci_rgb_pixel(): + return _unci_image(8, 8, [4, 5, 6], interleave=1) + + +def seed_unci_rgb_planar(): + return _unci_image(8, 8, [4, 5, 6], interleave=0) + + +def seed_unci_rgb_row(): + return _unci_image(8, 8, [4, 5, 6], interleave=3) + + +def seed_unci_rgba_pixel(): + return _unci_image(8, 8, [4, 5, 6, 7], interleave=1) + + +def seed_unci_mono(): + return _unci_image(8, 8, [0], interleave=0) + + +def seed_unci_yuv(): + return _unci_image(8, 8, [1, 2, 3], interleave=1, sampling=0) + + +def seed_unci_rgb16le(): + return _unci_image(8, 8, [4, 5, 6], interleave=1, bit_depth=16, flags=0x80) + + +def seed_unci_rgb16be(): + return _unci_image(8, 8, [4, 5, 6], interleave=0, bit_depth=16, flags=0x00) + + +def seed_unci_rgb_tiled(): + return _unci_image(8, 8, [4, 5, 6], interleave=1, tile_cols=2, tile_rows=2) + + +def seed_unci_rgb_pixsize(): + return _unci_image(8, 8, [4, 5, 6], interleave=1, pixel_size=4, row_align=4) + + +def seed_unci_rgb_compalign(): + return _unci_image(8, 8, [4, 5, 6], interleave=0, comp_align=2) + + +# Block-based decoders: only selected when block_size == pixel_size != 0 (the +# non-block decoders reject block_size!=0 via check_common_requirements). +def seed_unci_block_pixel(): + return _unci_image(8, 8, [4, 5, 6], interleave=1, pixel_size=4, + block_size=4) + + +def seed_unci_block_pixel_le(): + # block_little_endian (0x20) + return _unci_image(8, 8, [4, 5, 6], interleave=1, pixel_size=4, + block_size=4, flags=0x20) + + +def seed_unci_block_pixel_rev(): + # block_reversed (0x10) + return _unci_image(8, 8, [4, 5, 6], interleave=1, pixel_size=4, + block_size=4, flags=0x10) + + +def seed_unci_block_pixel_padlsb(): + # block_pad_lsb (0x40) + return _unci_image(8, 8, [4, 5, 6], interleave=1, pixel_size=4, + block_size=4, flags=0x40) + + +def seed_unci_block_component(): + # block_component requires block_bits/2 < bit_depth <= block_bits, so + # block_size=1 (8 bits) pairs with 8-bit components. 
+ return _unci_image(8, 8, [4, 5, 6], interleave=0, block_size=1) + + +def seed_unci_block_component16(): + # block_size=2 (16 bits) pairs with 16-bit components. + return _unci_image(8, 8, [4, 5, 6], interleave=0, block_size=2, + bit_depth=16) + + +def seed_unci_block_component_rev(): + return _unci_image(8, 8, [4, 5, 6], interleave=0, block_size=1, flags=0x10) + + +def seed_unci_block_component_padlsb(): + return _unci_image(8, 8, [4, 5, 6], interleave=0, block_size=1, flags=0x40) + + +def seed_moov_skeleton(): + """A file with a moov/trak/mdia/minf/stbl skeleton so the (rarely used) + movie-box parsers in box.cc get walked as top-level boxes too.""" + mvhd = fullbox("mvhd", 0, 0, b"\x00" * 96) + tkhd = fullbox("tkhd", 0, 7, b"\x00" * 80) + mdhd = fullbox("mdhd", 0, 0, b"\x00" * 20) + hdlr = fullbox("hdlr", 0, 0, struct.pack(">I", 0) + b"vide" + b"\x00" * 12 + + b"seed\x00") + vmhd = fullbox("vmhd", 0, 1, b"\x00" * 8) + dref = fullbox("dref", 0, 0, struct.pack(">I", 1) + + fullbox("url ", 0, 1, b"")) + dinf = box("dinf", dref) + stsd = fullbox("stsd", 0, 0, struct.pack(">I", 0)) + stts = fullbox("stts", 0, 0, struct.pack(">I", 0)) + stsc = fullbox("stsc", 0, 0, struct.pack(">I", 0)) + stsz = fullbox("stsz", 0, 0, struct.pack(">II", 0, 0)) + stco = fullbox("stco", 0, 0, struct.pack(">I", 0)) + stbl = box("stbl", stsd + stts + stsc + stsz + stco) + minf = box("minf", vmhd + dinf + stbl) + mdia = box("mdia", mdhd + hdlr + minf) + trak = box("trak", tkhd + mdia) + moov = box("moov", mvhd + trak) + return FTYP + moov + box("free", b"seedfree") + box("mdat", b"\x00" * 16) + + +# -------------------------------------------------------------------------- +def main(): + out = sys.argv[1] if len(sys.argv) > 1 else "libheif_seeds" + os.makedirs(out, exist_ok=True) + gens = [ + ("comprehensive.heif", seed_comprehensive), + ("grid.heif", seed_grid), + ("overlay.heif", seed_overlay), + ("moov_skeleton.heif", seed_moov_skeleton), + # Decodable ISO 23001-17 uncompressed images 
(no external codec + # needed) -> the uncompressed decoder variants + colour conversion. + ("unci_rgb_pixel.heif", seed_unci_rgb_pixel), + ("unci_rgb_planar.heif", seed_unci_rgb_planar), + ("unci_rgb_row.heif", seed_unci_rgb_row), + ("unci_rgba_pixel.heif", seed_unci_rgba_pixel), + ("unci_mono.heif", seed_unci_mono), + ("unci_yuv.heif", seed_unci_yuv), + ("unci_rgb16le.heif", seed_unci_rgb16le), + ("unci_rgb16be.heif", seed_unci_rgb16be), + ("unci_rgb_tiled.heif", seed_unci_rgb_tiled), + ("unci_rgb_pixsize.heif", seed_unci_rgb_pixsize), + ("unci_rgb_compalign.heif", seed_unci_rgb_compalign), + ("unci_block_pixel.heif", seed_unci_block_pixel), + ("unci_block_pixel_le.heif", seed_unci_block_pixel_le), + ("unci_block_pixel_rev.heif", seed_unci_block_pixel_rev), + ("unci_block_pixel_padlsb.heif", seed_unci_block_pixel_padlsb), + ("unci_block_component.heif", seed_unci_block_component), + ("unci_block_component16.heif", seed_unci_block_component16), + ("unci_block_component_rev.heif", seed_unci_block_component_rev), + ("unci_block_component_padlsb.heif", seed_unci_block_component_padlsb), + ] + n = 0 + for name, fn in gens: + try: + data = fn() + except Exception as e: # keep build robust + sys.stderr.write("seed %s failed: %s\n" % (name, e)) + continue + w(out, name, data) + n += 1 + sys.stderr.write("generate_seeds.py: wrote %d HEIF seeds to %s\n" + % (n, out)) + + +if __name__ == "__main__": + main() diff --git a/projects/onnx/Dockerfile b/projects/onnx/Dockerfile new file mode 100644 index 000000000000..9149b8155f7a --- /dev/null +++ b/projects/onnx/Dockerfile @@ -0,0 +1,31 @@ +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +################################################################################ + +FROM gcr.io/oss-fuzz-base/base-builder-python +RUN apt-get update && apt-get install -y cmake protobuf-compiler +RUN pip install "scikit-build-core>=0.11" "nanobind==2.12.0" +RUN git clone --depth 1 https://github.com/onnx/onnx.git onnx && \ + cd onnx && \ + git submodule update --init --recursive +# Pre-download FetchContent dependencies (protobuf + abseil) so the build +# works offline inside the container without network access at cmake time. +RUN mkdir -p /deps/protobuf /deps/abseil-cpp && \ + wget -qO- https://github.com/protocolbuffers/protobuf/releases/download/v25.1/protobuf-25.1.tar.gz | \ + tar -xz --strip-components=1 -C /deps/protobuf && \ + wget -qO- https://github.com/abseil/abseil-cpp/releases/download/20250127.0/abseil-cpp-20250127.0.tar.gz | \ + tar -xz --strip-components=1 -C /deps/abseil-cpp +WORKDIR $SRC +COPY build.sh $SRC/ diff --git a/projects/onnx/build.sh b/projects/onnx/build.sh new file mode 100644 index 000000000000..be0d6c0edb27 --- /dev/null +++ b/projects/onnx/build.sh @@ -0,0 +1,58 @@ +#!/bin/bash -eu +# Copyright 2026 Google LLC +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# +################################################################################ + +cd $SRC/onnx + +# Build ONNX's own protobuf from source so it is compiled with -fPIC, +# which is required to link into the Python extension .so. +# Point FetchContent at pre-downloaded sources so cmake needs no network access. +export CMAKE_ARGS="-DONNX_BUILD_CUSTOM_PROTOBUF=ON \ + -DFETCHCONTENT_SOURCE_DIR_PROTOBUF=/deps/protobuf \ + -DFETCHCONTENT_SOURCE_DIR_ABSL=/deps/abseil-cpp \ + -DFETCHCONTENT_FULLY_DISCONNECTED=ON" + +# Build the Python extension with a clean compiler environment. +# The OSS-Fuzz CFLAGS contain -fsanitize=fuzzer-no-link which references +# __sancov_lowest_stack — a symbol only provided by libFuzzer at runtime — +# causing ImportError when plain Python imports the .so. Atheris handles +# instrumentation at the Python level, so the extension does not need these +# flags. This follows the same pattern used by numpy, pyyaml, and others. +unset CFLAGS CXXFLAGS LIB_FUZZING_ENGINE +pip3 install --no-build-isolation . + +python3 $SRC/onnx/onnx/fuzz/make_seed_corpus.py \ + $OUT/fuzz_version_converter_seed_corpus.zip \ + $OUT/fuzz_parser_seed_corpus.zip \ + $OUT/fuzz_checker_seed_corpus.zip \ + $OUT/fuzz_shape_inference_seed_corpus.zip + +# Coverage builds: compile_python_fuzzer prepends a stub containing real Python +# statements (import atexit, import coverage ...) before each fuzzer file. +# Any 'from __future__' import then appears after those statements and causes +# SyntaxError. Strip them from the in-container copies only. 
+if [[ "$SANITIZER" == "coverage" ]]; then + for f in $(find $SRC/onnx/onnx/fuzz -maxdepth 1 -name 'fuzz_*.py'); do + sed -i '/^from __future__ import/d' "$f" + done +fi + +# Build fuzzers in $OUT. +# --collect-all numpy bundles all numpy C extensions including numpy._core.* +# which PyInstaller 6.x does not pick up automatically with numpy 2.x. +for fuzzer in $(find $SRC/onnx/onnx/fuzz -maxdepth 1 -name 'fuzz_*.py'); do + compile_python_fuzzer $fuzzer --collect-all numpy +done diff --git a/projects/onnx/project.yaml b/projects/onnx/project.yaml new file mode 100644 index 000000000000..3f82c6574cc7 --- /dev/null +++ b/projects/onnx/project.yaml @@ -0,0 +1,14 @@ +fuzzing_engines: +- libfuzzer +homepage: https://onnx.ai/ +language: python +main_repo: https://github.com/onnx/onnx +# TODO: add a dedicated oss-fuzz security contact for the ONNX project +# (e.g. onnx-security@lists.lfaidata.foundation or a new alias) once one is +# established; until then crash reports go only to the addresses below. +primary_contact: a.fehlner@googlemail.com +auto_ccs: + - fehlner@arcor.de +sanitizers: +- address +- undefined
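
Reviewer note on the libheif seed generator earlier in this diff: every seed is composed from `box(...)` and `fullbox(...)` calls, and `_heif_single_item` builds the `meta` box twice to learn the absolute offset of the `mdat` payload. The sketch below illustrates both ideas under the assumption of the standard ISOBMFF layout (32-bit big-endian size that includes the 8-byte header, a four-character type, and for a FullBox a 1-byte version plus 24-bit flags). The helper bodies and the `file_with_item_offset` function are illustrative assumptions, not copied from the patch.

```python
import struct


def box(box_type, payload):
    """Plain ISOBMFF box: 32-bit big-endian size (header included) + 4-char type."""
    return struct.pack(">I", 8 + len(payload)) + box_type.encode("latin-1") + payload


def fullbox(box_type, version, flags, payload):
    """FullBox: a box whose payload starts with a 1-byte version and 24-bit flags."""
    header = bytes([version]) + flags.to_bytes(3, "big")
    return box(box_type, header + payload)


def file_with_item_offset(prefix, item_data):
    """Two-pass layout: size the meta box with a placeholder offset, then
    rebuild it with the real absolute offset of the mdat payload."""
    def meta_with(offset):
        # Hypothetical stand-in for a real meta box: only the fixed-width
        # offset/length pair matters for the size calculation.
        return fullbox("meta", 0, 0, struct.pack(">II", offset, len(item_data)))

    # The offset field has a fixed width, so the placeholder build has the
    # same length as the final one.
    data_offset = len(prefix) + len(meta_with(0)) + 8  # +8 = mdat box header
    blob = prefix + meta_with(data_offset) + box("mdat", item_data)
    return blob, data_offset


ftyp = box("ftyp", b"heic" + struct.pack(">I", 0) + b"mif1heic")
blob, data_offset = file_with_item_offset(ftyp, b"PIXELS")
# The recorded offset points exactly at the first payload byte.
assert blob[data_offset:data_offset + 6] == b"PIXELS"
```

The two-pass build only works because the offset field is fixed-width; with a variable-length encoding, the placeholder and final sizes could differ and the offset would need to be iterated until stable.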