jseook11/codex-pdf-ocr-to-markdown-skill

PDF OCR to Markdown

A Codex skill that converts PDFs and images into user-facing Markdown sidecar files, alongside internal JSON and quality reports.

The skill preserves source files. It does not add an OCR layer to PDFs, rewrite PDFs, or create searchable replacement PDFs. Instead, it extracts visible or embedded document content into <file>_ocr.md next to the source, while JSON and quality artifacts stay under the hidden .ocr_work/ directory.

Quick Install

Copy and paste this into Codex:

Install the Codex skill from https://github.com/jseook11/codex-pdf-ocr-to-markdown-skill. Copy the repository's `pdf-ocr-to-markdown` skill folder into my user Codex skills directory, preferably `~/.agents/skills/pdf-ocr-to-markdown`. If my Codex setup uses `~/.codex/skills`, use that instead. After copying, run `python3 scripts/setup_dependencies.py --install` from the installed skill folder once, then verify it with `python3 scripts/setup_dependencies.py --check`. If package installation needs approval, ask me during installation. Verify that the installed `SKILL.md` has `name: pdf-ocr-to-markdown`, reload skills if needed, and tell me when it is ready to use.

Features

  • Handles PDFs and common image files.
  • Extracts embedded PDF text when available.
  • Renders pages for visual inspection when layout or OCR quality requires it.
  • Routes pages as text_only, visual_focus, verify_text_and_visual, or full_vision_ocr.
  • Produces a Markdown sidecar, internal JSON, page artifacts, route reports, and quality reports.
  • Keeps internal run artifacts under .ocr_work/ and leaves only the final Markdown sidecar next to the source.
  • Uses bundled scripts instead of ad hoc OCR code.

Requirements

  • Codex CLI with user skill support.
  • Python 3.
  • Local filesystem access to the PDFs or images being processed.
  • Python package installation permission if using scripts/setup_dependencies.py --install.

The exact Python dependencies are managed by scripts/setup_dependencies.py.

Installation

The easiest installation path is to paste the prompt above into Codex.

Manual installation:

git clone https://github.com/jseook11/codex-pdf-ocr-to-markdown-skill.git
cd codex-pdf-ocr-to-markdown-skill

INSTALL_DIR="$HOME/.agents/skills/pdf-ocr-to-markdown"

if [ -e "$INSTALL_DIR" ]; then
  mv "$INSTALL_DIR" "${INSTALL_DIR}.backup.$(date +%Y%m%d_%H%M%S)"
fi

mkdir -p "$(dirname "$INSTALL_DIR")"
cp -R pdf-ocr-to-markdown "$INSTALL_DIR"

python3 "$INSTALL_DIR/scripts/setup_dependencies.py" --install
python3 "$INSTALL_DIR/scripts/setup_dependencies.py" --check

If your Codex installation still uses the legacy ~/.codex/skills location, copy the folder there instead:

INSTALL_DIR="$HOME/.codex/skills/pdf-ocr-to-markdown"

if [ -e "$INSTALL_DIR" ]; then
  mv "$INSTALL_DIR" "${INSTALL_DIR}.backup.$(date +%Y%m%d_%H%M%S)"
fi

mkdir -p "$(dirname "$INSTALL_DIR")"
cp -R pdf-ocr-to-markdown "$INSTALL_DIR"

python3 "$INSTALL_DIR/scripts/setup_dependencies.py" --install
python3 "$INSTALL_DIR/scripts/setup_dependencies.py" --check

After installation, ask Codex to use $pdf-ocr-to-markdown on PDFs or images.

Agent Installation Instructions

When a user asks an agent to install this skill from https://github.com/jseook11/codex-pdf-ocr-to-markdown-skill, the agent should:

  1. Clone or download the repository into a temporary location.
  2. Copy the repository's pdf-ocr-to-markdown/ directory to ~/.agents/skills/pdf-ocr-to-markdown/, unless the user's Codex setup uses ~/.codex/skills.
  3. If that destination already exists, preserve a timestamped backup or ask before replacing it.
  4. From the installed skill directory, run python3 scripts/setup_dependencies.py --install, then python3 scripts/setup_dependencies.py --check.
  5. If package installation requires network or pip approval, ask during installation rather than deferring dependency setup to every skill invocation.
  6. Verify that the installed SKILL.md exists and its frontmatter contains name: pdf-ocr-to-markdown.
  7. Do not copy generated .ocr_work/, *_ocr.md, legacy *_ocr.json, legacy *_ocr_quality.md, __pycache__/, or .DS_Store files.
  8. Report the installed path and the dependency-check result, and tell the user they can invoke $pdf-ocr-to-markdown.
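
The copy and verification steps above can be sketched in Python. This is a minimal illustration, not part of the skill's bundled scripts: the helper names copy_skill, read_frontmatter_name, and skill_is_installed are hypothetical, and the frontmatter parsing assumes a simple `name: value` line inside a leading `---` block.

```python
import shutil
from pathlib import Path
from typing import Optional

EXPECTED_NAME = "pdf-ocr-to-markdown"

# Generated or OS-specific files that should not be copied (step 7).
SKIP_GENERATED = shutil.ignore_patterns(
    ".ocr_work", "*_ocr.md", "*_ocr.json", "*_ocr_quality.md",
    "__pycache__", ".DS_Store",
)


def copy_skill(src: Path, dest: Path) -> None:
    """Copy the skill folder, skipping generated artifacts.

    copytree raises FileExistsError if dest already exists, so an existing
    install has to be backed up or moved first (step 3).
    """
    shutil.copytree(src, dest, ignore=SKIP_GENERATED)


def read_frontmatter_name(skill_md_text: str) -> Optional[str]:
    """Return the `name` value from a leading `---` frontmatter block."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter; ignore anything in the body
        key, sep, value = line.partition(":")
        if sep and key.strip() == "name":
            return value.strip().strip("'\"")
    return None


def skill_is_installed(skill_dir: Path) -> bool:
    """Check that SKILL.md exists and declares the expected name (step 6)."""
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return False
    text = skill_md.read_text(encoding="utf-8")
    return read_frontmatter_name(text) == EXPECTED_NAME
```

For example, skill_is_installed(Path.home() / ".agents" / "skills" / "pdf-ocr-to-markdown") would check the preferred install location.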

Usage

From Codex:

Use $pdf-ocr-to-markdown to OCR these files into Markdown.

Examples:

Use $pdf-ocr-to-markdown to convert ./docs/sample.pdf into a Markdown sidecar file.
Use $pdf-ocr-to-markdown on these scanned Korean lecture notes. Preserve headings, reading order, and tables when possible.

Direct script invocation from the installed skill folder:

python3 scripts/batch_ocr.py path/to/file.pdf

The default language hint is kor+eng, which covers Korean and English documents.

Outputs

For an input named sample.pdf, final user-facing files are written next to the source:

sample.pdf
sample_ocr.md

Dots inside the source filename are normalized to underscores in generated sidecar names. For example, sample.v1.pdf writes sample_v1_ocr.md, not sample.v1.ocr.md.

If multiple files in the same folder normalize to the same sidecar stem, later files are disambiguated with _2, _3, and so on. For example, a.b.png and a_b.png produce a_b_ocr.md and a_b_2_ocr.md.
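
The naming rules above can be sketched as follows. This is an illustration of the described behavior, not the skill's actual implementation: the function names are hypothetical, and it assumes files are disambiguated in the order they are given.

```python
from pathlib import Path


def sidecar_stem(source_name: str) -> str:
    """Drop the extension and replace remaining dots with underscores."""
    return Path(source_name).stem.replace(".", "_")


def assign_sidecar_names(source_names):
    """Map each source filename to a unique `<stem>_ocr.md` sidecar name."""
    used = set()
    result = {}
    for name in source_names:
        stem = sidecar_stem(name)
        candidate = stem
        counter = 2
        # Later files that collide get _2, _3, ... appended to the stem.
        while candidate in used:
            candidate = f"{stem}_{counter}"
            counter += 1
        used.add(candidate)
        result[name] = f"{candidate}_ocr.md"
    return result


print(assign_sidecar_names(["sample.v1.pdf", "a.b.png", "a_b.png"]))
# {'sample.v1.pdf': 'sample_v1_ocr.md', 'a.b.png': 'a_b_ocr.md', 'a_b.png': 'a_b_2_ocr.md'}
```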

Internal artifacts are written under hidden work directories:

.ocr_work/
  run_YYYYMMDD_HHMMSS/
  ocr_output/run_YYYYMMDD_HHMMSS/

Internal final.json and quality_report.md files are kept under .ocr_work/ocr_output/run_YYYYMMDD_HHMMSS/files/.../. Codex should ask before deleting hidden work files after the final Markdown sidecar exists.
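
As a rough sketch of the layout above, the run directory names can be derived from a timestamp. The helper run_directories is hypothetical, and placing .ocr_work/ in the source file's folder is an assumption made for this example.

```python
from datetime import datetime
from pathlib import Path


def run_directories(source_dir, now=None):
    """Return (run_dir, output_dir) for one run under .ocr_work/.

    Assumes .ocr_work/ sits in the same folder as the source files.
    """
    now = now or datetime.now()
    run_name = now.strftime("run_%Y%m%d_%H%M%S")
    work_root = Path(source_dir) / ".ocr_work"
    return work_root / run_name, work_root / "ocr_output" / run_name


run_dir, output_dir = run_directories("docs", datetime(2025, 1, 2, 3, 4, 5))
print(run_dir.as_posix())     # docs/.ocr_work/run_20250102_030405
print(output_dir.as_posix())  # docs/.ocr_work/ocr_output/run_20250102_030405
```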

Vision Honesty

The Python scripts cannot attach page images to a vision-capable model by themselves. For visual routes, Codex must inspect the rendered image through an image-capable path before writing raw visual OCR output. The skill must report text_extraction_complete_visual_pending when visual pages remain unanalyzed.
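
The reporting rule can be illustrated with a small sketch. Only the text_extraction_complete_visual_pending status and the route names come from this README. The run_status helper, the "complete" status string, and the choice to treat every route other than text_only as visual are assumptions for illustration.

```python
# Assumption: every route except text_only needs a visual pass.
VISUAL_ROUTES = {"visual_focus", "verify_text_and_visual", "full_vision_ocr"}


def run_status(page_routes, visually_analyzed):
    """Return (status, pending_pages) for a run.

    page_routes maps page number -> route name; visually_analyzed is the
    set of page numbers whose rendered image was actually inspected.
    """
    pending = sorted(
        page
        for page, route in page_routes.items()
        if route in VISUAL_ROUTES and page not in visually_analyzed
    )
    if pending:
        return "text_extraction_complete_visual_pending", pending
    return "complete", pending  # "complete" is a placeholder status name


routes = {1: "text_only", 2: "visual_focus", 3: "full_vision_ocr"}
print(run_status(routes, {2}))
# ('text_extraction_complete_visual_pending', [3])
```

The point of the design is that the status is computed from what was actually inspected, so a run cannot be reported as finished while visual pages are still unanalyzed.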

Limitations

  • This is a workflow-oriented Codex skill, not a general-purpose OCR engine.
  • The skill does not modify source PDFs or create searchable replacement PDFs.
  • Vision-based OCR requires Codex to inspect rendered page images through an image-capable path.
  • OCR quality depends on scan resolution, page rotation, handwriting quality, and document layout.
  • Complex tables, diagrams, formulas, and handwritten notes may require manual correction.
  • Password-protected or encrypted PDFs must be made readable before processing.

Future Improvements

  • Add a faster mode for frequent use that reduces vision-pass latency.
  • Generate prompts only for pages that actually need visual inspection.
  • Make diagnostic artifacts such as detailed JSON, quality reports, and generated prompts opt-in debug outputs.
  • Support a Markdown-first visual pass so routine OCR tasks do not spend extra time producing structured internal review files.

Development

Run all tests:

python3 -m unittest discover -s tests -v

Run individual tests:

python3 tests/test_batch_ocr.py -v
python3 tests/test_merge_outputs.py -v

Compile-check scripts:

PYTHONPYCACHEPREFIX=/private/tmp/pdf-ocr-to-markdown-pycache python3 -m py_compile \
  pdf-ocr-to-markdown/scripts/batch_ocr.py \
  pdf-ocr-to-markdown/scripts/classify_pages.py \
  pdf-ocr-to-markdown/scripts/cleanup_workdir.py \
  pdf-ocr-to-markdown/scripts/extract_pdf_text.py \
  pdf-ocr-to-markdown/scripts/merge_outputs.py \
  pdf-ocr-to-markdown/scripts/render_pages.py \
  pdf-ocr-to-markdown/scripts/setup_dependencies.py

Repository Layout

pdf-ocr-to-markdown/
  SKILL.md
  agents/openai.yaml
  prompts/
  scripts/
  examples/
tests/
docs/

prompts/ contains route-specific transcription instructions used by the skill, such as text_only, visual_focus, verify_text_and_visual, and full_vision_ocr.

Attribution

This project was originally created by jseook11.

If you copy, modify, or redistribute substantial portions of this project, please retain the original copyright notice, the LICENSE file, and the NOTICE file if present.

License

MIT. See LICENSE.
