A tool to convert between ORACC and CDLI cuneiform transliteration formats, under the FactGrid Cuneiform system.
Achieves 99.96% accuracy on the oldassyrian-lines dataset.
Status: Converts ORACC → CDLI_clean. Full round-trip to CDLI is limited by physical damage markers and editorial corrections in the source data.
- `PIPELINE_GUIDE.md` — Step-by-step guide to building a cleaned word-level training dataset (`word_level_cleaned.csv`) from raw source CSVs. Start here if you want to generate the dataset yourself.
- `SUGGESTED_MODIFICATIONS.md` — Non-binding suggestions for future refactoring: reference data location, module naming, output directories, CLI layout.
- `src/preprocessing/dataset_quality_results/` — Quality analysis outputs: classification of misaligned vs conversion-gap rows, cleaning filter analysis, dated reports.
Requires Python ≥ 3.14. Install with uv:
```
uv sync
```

Or with pip:

```
pip install -r requirements.txt
```

Core dependencies: pandas, rapidfuzz. EDA scripts also use numpy.
```
python3 src/oracc_to_cdli.py convert <input_file> <output_file> [--has-label]
```

Use `--has-label` when each line starts with an ID/label (e.g. `P359065:obverse.1.1`) followed by the word.
```
python3 src/cdli_to_oracc.py convert <input_file> <output_file> [--has-label]
```

```
python3 src/oracc_to_cdli.py clean <input_file> <output_file>
# or
python3 src/cdli_to_oracc.py clean <input_file> <output_file>
```

The optional `--mapping <path>` flag overrides the default `data/reference/ATF_Character_Conventions.csv`.
Input/output format examples: see `data/Q499899-*.txt`.
```
python3 src/utils/validate.py <predicted_file> <test_file>
```

Compares predicted lines to the reference CDLI file (by line ID) and prints the match rate and mismatches.
```
python3 examples/example.py
```

Loads the mapping, reads `data/Q499899-oracc.txt`, converts to CDLI, and writes `data/Q499899-converted.txt`.
See PIPELINE_GUIDE.md for the full walkthrough. The short version:
```
transliteration.csv + finaldf.csv
        ↓  load_to_db.py
data/oracc2cdli.db
        ↓  build_word_table.py
word_level table (DB)
        ↓  export_word_level.py
data/word_level.csv (~4.5M rows)
        ↓  clean_word_level.py
data/word_level_cleaned.csv
```
The pipeline requires two source CSVs in `data/` (gitignored — not included in the repo):

- `data/transliteration.csv` — CDLI-format transliterations (`id_text`, `transliteration`)
- `data/finaldf.csv` — ORACC word data (`id_text`, `id_word`, `form`, …)
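The first pipeline step (`load_to_db.py`) streams both CSVs into SQLite in chunks rather than loading them whole. A minimal self-contained sketch of that chunked-load technique, using only the standard library — the function name, chunk size, and column handling below are illustrative, not the script's actual API:

```python
import csv
import itertools
import sqlite3

def load_csv_chunked(db_path: str, csv_path: str, table: str,
                     chunk_size: int = 50_000) -> None:
    """Stream a large CSV into a SQLite table in fixed-size chunks."""
    con = sqlite3.connect(db_path)
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)  # first row holds column names
        cols = ", ".join(f'"{c}"' for c in header)
        placeholders = ", ".join("?" for _ in header)
        con.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        # Pull chunk_size rows at a time; never holds the full file in memory.
        while chunk := list(itertools.islice(reader, chunk_size)):
            con.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', chunk)
            con.commit()
    con.close()

# e.g. load_csv_chunked("data/oracc2cdli.db", "data/transliteration.csv",
#                       "transliteration")
```

Committing per chunk keeps memory flat and makes a partial load resumable by inspecting the table's row count.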
Run unit and dataset (chunked) word conversion tests from the project root:

```
python3 src/tests/run_word_conversion_tests.py
```

Optional: write a dated Markdown report:

```
python3 src/tests/run_word_conversion_tests.py --report src/tests/results/conversion_report.md
```

Supports `--csv`, `--chunk`, `--max-rows`, `--no-roundtrip`, and `--roundtrip-sample` to tune the run.
```
oracc2cdli/
├── PIPELINE_GUIDE.md              # Step-by-step dataset pipeline guide
├── SUGGESTED_MODIFICATIONS.md     # Non-binding refactoring suggestions
├── data/
│   ├── Q499899-cdli.txt           # Example CDLI-format input
│   ├── Q499899-oracc.txt          # Example ORACC-format input
│   ├── Q499899-cli-output.txt     # Example CLI conversion output
│   ├── word_level_cleaned_subset.csv  # Cleaned subset (first N rows) for benchmarking
│   └── reference/
│       ├── ATF_Character_Conventions.csv  # ORACC ↔ CDLI character mapping (committed)
│       └── transliteration.csv    # Reference copy of transliteration.csv
│   (gitignored: transliteration.csv, finaldf.csv, word_level.csv,
│    word_level_cleaned.csv, oracc2cdli.db — generated by pipeline)
├── examples/
│   ├── README.md
│   └── example.py                 # Example ORACC→CDLI conversion (no CLI args)
└── src/
    ├── oracc_to_cdli.py           # CLI: ORACC → CDLI convert & clean
    ├── cdli_to_oracc.py           # CLI: CDLI → ORACC convert & clean
    ├── utils/
    │   ├── __init__.py            # Re-exports conversion, mapping, word_conversion, validate
    │   ├── utils.py               # Character mapping + line-level conversion (ORACC↔CDLI)
    │   ├── word_conversion.py     # Atomic word-level conversion (ORACC↔CDLI); cached mappings, compiled regex
    │   └── validate.py            # Clean CDLI lines; compare predicted vs reference file
    ├── preprocessing/
    │   ├── __init__.py
    │   ├── load_to_db.py          # Load transliteration.csv + finaldf.csv → SQLite
    │   ├── build_word_table.py    # Build word-level table (id_text, tr_oracc, tr_cdli) in DB
    │   ├── export_word_level.py   # Export word_level table from DB → data/word_level.csv
    │   ├── clean_word_level.py    # Filter word_level.csv → data/word_level_cleaned.csv (optimized for ~4.5M rows)
    │   ├── clean_word_level_subset.py  # Same as clean_word_level on first N rows → word_level_cleaned_subset.csv
    │   ├── analyze_dataset_quality.py  # Sample word_level.csv; classify misalignment vs conversion; summary + JSON
    │   ├── dataset_quality_findings.md # Narrative findings on misalignment and conversion gaps
    │   ├── preprocess_old.py      # [Legacy] Merge/dedupe transliteration+finaldf; superseded by build_word_table
    │   └── dataset_quality_results/    # Quality analysis outputs
    │       ├── dataset_quality_findings.md    # Causes, fixes, and recommendations
    │       ├── cleaning_filter_analysis.md    # Analysis of cleaning filter boundary behavior
    │       ├── dataset_quality_2026-02-19.md  # Dated quality report
    │       ├── dataset_quality_2026-02-24.md  # Dated quality report
    │       └── analysis_summary.json          # Summary stats (latest run)
    ├── eda/
    │   ├── transliteration_eda.py # EDA for transliteration.csv → results/transliteration_eda.md
    │   ├── finaldf_eda.py         # Chunked EDA for finaldf.csv → results/finaldf_eda.md
    │   ├── word_level_eda.py      # Chunked EDA for word_level.csv → results/word_level_eda.md
    │   ├── word_level_cleaned_eda.py  # Chunked EDA for word_level_cleaned.csv → results/word_level_cleaned_eda.md
    │   └── results/               # Generated EDA reports (*.md)
    └── tests/
        ├── __init__.py
        ├── test_word_conversion.py      # Reusable API: unit + dataset (chunked) word conversion tests
        ├── run_word_conversion_tests.py # Runner: unit tests + dataset tests; optional --report path
        └── results/                     # Dated conversion reports (e.g. conversion_report_2-18.md, conversion_report_2-19.md)
```
| Path | Purpose |
|---|---|
| src/oracc_to_cdli.py | CLI: convert ORACC → CDLI, or clean an input file. Subcommands: convert, clean. |
| src/cdli_to_oracc.py | CLI: convert CDLI → ORACC, or clean an input file. Subcommands: convert, clean. |
| src/utils/utils.py | Character mapping (load_character_mapping, load_reverse_character_mapping); line-level conversion (convert_line_oracc_to_cdli, convert_line_cdli_to_oracc); validate_conversion for CSV accuracy. |
| src/utils/word_conversion.py | Atomic word-level conversion (word_oracc_to_cdli, word_cdli_to_oracc). Handles subscripts, determinatives, ellipsis. Mappings cached; single-pass compiled regex. |
| src/utils/validate.py | clean_line_cdli: normalise a line to CDLI_clean. validate: compare predicted vs reference file by line ID. CLI entry point. |
| src/preprocessing/load_to_db.py | Load transliteration.csv and finaldf.csv (chunked) into SQLite data/oracc2cdli.db. |
| src/preprocessing/build_word_table.py | Build word-level table from transliteration + finaldf; write table word_level to DB. Run after load_to_db. |
| src/preprocessing/export_word_level.py | Export the word_level table from SQLite to data/word_level.csv. |
| src/preprocessing/clean_word_level.py | Filter word_level.csv: drop misaligned rows and garbage tokens; write data/word_level_cleaned.csv. Keeps exact/high/conversion_issue, drops likely_misaligned (<30%) and garbage. Optimized for ~4.5M rows. |
| src/preprocessing/clean_word_level_subset.py | Same as clean_word_level on first N rows only; for benchmarking/timing. Output: word_level_cleaned_subset.csv. |
| src/preprocessing/analyze_dataset_quality.py | Sample word_level.csv, run CDLI↔ORACC conversion, classify rows (exact / high / conversion_issue / likely_misaligned). Writes dated report + JSON to dataset_quality_results/. |
| src/preprocessing/preprocess_old.py | [Legacy] Dedupe transliteration/finaldf, join on id_text, write merged table. Superseded by build_word_table. |
| src/eda/transliteration_eda.py | EDA for transliteration.csv (full load). Writes src/eda/results/transliteration_eda.md. |
| src/eda/finaldf_eda.py | Chunked EDA for finaldf.csv. Writes src/eda/results/finaldf_eda.md. |
| src/eda/word_level_eda.py | Chunked EDA for word_level.csv. Writes src/eda/results/word_level_eda.md. |
| src/eda/word_level_cleaned_eda.py | Chunked EDA for word_level_cleaned.csv with comparison against uncleaned stats. Writes src/eda/results/word_level_cleaned_eda.md. |
| src/tests/test_word_conversion.py | Reusable test API: unit tests (empty, None, malformed, edge cases, round-trip) and chunked dataset tests. Returns result dicts. |
| src/tests/run_word_conversion_tests.py | Runner: unit + dataset tests; optional --report <path>. Supports --csv, --chunk, --max-rows, --no-roundtrip, --roundtrip-sample. |
| examples/example.py | Example: load mapping, read ORACC file, convert to CDLI, save output. No CLI arguments. |
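The row classification done by `analyze_dataset_quality.py` can be sketched with a simple similarity score. This toy version uses stdlib `difflib` (the real pipeline depends on rapidfuzz); only the <30% misalignment cutoff comes from the scripts above — the "high" boundary here is an assumption:

```python
from difflib import SequenceMatcher

def classify_row(converted: str, reference: str) -> str:
    """Toy classifier mirroring the exact / high / conversion_issue /
    likely_misaligned buckets. Thresholds other than the <30% cutoff
    are illustrative, not the script's actual values."""
    if converted == reference:
        return "exact"
    # 0–100 similarity between the converted word and its paired reference.
    score = 100.0 * SequenceMatcher(None, converted, reference).ratio()
    if score >= 90.0:   # assumed "high similarity" boundary
        return "high"
    if score < 30.0:    # documented misalignment cutoff
        return "likely_misaligned"
    return "conversion_issue"

print(classify_row("lugal", "lugal"))   # exact
print(classify_row("lugal", "xxxxxx"))  # likely_misaligned
```

The key idea is that a near-zero similarity indicates two unrelated words paired by position (misalignment), while a middling score suggests a genuine conversion gap worth keeping for analysis.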
About 57% of rows in word_level.csv are exact matches after bidirectional conversion. ~28% are likely misaligned (different words paired by position). The cleaning pipeline drops misaligned rows and garbage tokens, yielding word_level_cleaned.csv.
For details see:
- `src/preprocessing/dataset_quality_results/dataset_quality_findings.md` — causes, classification breakdown, and recommended fixes
- `src/preprocessing/dataset_quality_results/cleaning_filter_analysis.md` — analysis of threshold boundary behavior
- `src/preprocessing/dataset_quality_results/analysis_summary.json` — latest run stats
clean_word_level.py processes ~4.5M rows. Key optimizations in word_conversion.py and the cleaning scripts:
- Cached character mappings — loaded from CSV once at module level; previously re-read on every word conversion call (~9M+ reads for a full run).
- Single-pass compiled regex — `_apply_mapping()` uses one compiled `re.Pattern` (longest-first) instead of N× `str.replace` loops.
- Pre-compiled regexes — subscript-digit and determinative regexes compiled once at module level.
- `str.translate()` for digit→subscript — C-level translation table instead of a Python generator.
- Cached non-digit sub-mapping — `word_cdli_to_oracc` no longer rebuilds a filtered dict on every call.
- Stripping once — input strings stripped once in vectorized chunk preprocessing; redundant `.strip()` calls removed from the hot path.
- Reused `ProcessPoolExecutor` — pool created once and reused across all chunks instead of per-chunk.
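The single-pass regex and `str.translate()` optimizations can be illustrated in a few lines. The two-entry mapping below is a toy stand-in for the real `ATF_Character_Conventions.csv` table, and the function names are illustrative rather than the module's actual internals:

```python
import re

# Toy mapping; the real table comes from ATF_Character_Conventions.csv.
MAPPING = {"sz": "š", "s,": "ṣ", "t,": "ṭ"}

# Longest-first alternation so longer keys win over shorter overlapping ones,
# applied in ONE pass instead of N sequential str.replace calls.
_PATTERN = re.compile(
    "|".join(re.escape(k) for k in sorted(MAPPING, key=len, reverse=True))
)

def apply_mapping(word: str) -> str:
    return _PATTERN.sub(lambda m: MAPPING[m.group(0)], word)

# C-level digit → Unicode-subscript translation table, built once.
_SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def subscript_digits(word: str) -> str:
    return word.translate(_SUBSCRIPTS)

print(apply_mapping("szu"))      # šu
print(subscript_digits("du3"))   # du₃
```

Building the pattern and translation table at module level is what makes the hot path cheap: each word conversion is then a single regex pass plus a single `translate` call.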
| Metric | Value |
|---|---|
| Subset (10,000 rows) classify time | 1.4 s |
| Subset (10,000 rows) total time | 1.8 s |
| Full dataset (4,546,052 rows) estimated | ~10–14 min |