ORACC2CDLI

A tool to convert between ORACC and CDLI cuneiform transliteration formats, under the FactGrid Cuneiform system.
Achieves 99.96% accuracy on the oldassyrian-lines dataset.

Status: Converts ORACC → CDLI_clean. Full round-trip to CDLI is limited by physical damage markers and editorial corrections in the source data.


Documentation

  • PIPELINE_GUIDE.md — Step-by-step guide to building a cleaned word-level training dataset (word_level_cleaned.csv) from raw source CSVs. Start here if you want to generate the dataset yourself.
  • SUGGESTED_MODIFICATIONS.md — Non-binding suggestions for future refactoring: reference data location, module naming, output directories, CLI layout.
  • src/preprocessing/dataset_quality_results/ — Quality analysis outputs: classification of misaligned vs conversion-gap rows, cleaning filter analysis, dated reports.

Quick start

Install dependencies

Requires Python ≥ 3.14. Install with uv:

uv sync

Or with pip:

pip install -r requirements.txt

Core dependencies: pandas, rapidfuzz. EDA scripts also use numpy.

Convert a file (ORACC → CDLI)

python3 src/oracc_to_cdli.py convert <input_file> <output_file> [--has-label]

Use --has-label when each line starts with an ID/label (e.g. P359065:obverse.1.1) followed by the word.
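
As an illustration of the input shape --has-label implies, each line can be split on the first whitespace into a label and a word. This is a sketch only, not the CLI's actual parsing code (which lives in src/oracc_to_cdli.py), and the word a-na is a placeholder:

```python
def split_label(line: str) -> tuple[str, str]:
    # Split "P359065:obverse.1.1 a-na" into ("P359065:obverse.1.1", "a-na").
    # If the line has no whitespace, the whole line is treated as the label.
    label, _, word = line.strip().partition(" ")
    return label, word
```

Without --has-label, the whole line is treated as the word to convert.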

Convert a file (CDLI → ORACC)

python3 src/cdli_to_oracc.py convert <input_file> <output_file> [--has-label]

Clean (strip lines; either format)

python3 src/oracc_to_cdli.py clean <input_file> <output_file>
# or
python3 src/cdli_to_oracc.py clean <input_file> <output_file>

Optional --mapping <path> overrides the default data/reference/ATF_Character_Conventions.csv.

Input/output format examples: see data/Q499899-*.txt.

Validate predicted output against reference CDLI

python3 src/utils/validate.py <predicted_file> <test_file>

Compares predicted lines to the reference CDLI file (by line ID) and prints match rate and mismatches.
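
Conceptually the comparison is a keyed exact-match rate. A minimal stdlib sketch (the function name and dict shapes here are assumptions, not validate.py's real API):

```python
def match_rate(predicted: dict[str, str],
               reference: dict[str, str]) -> tuple[float, list[str]]:
    """Fraction of reference line IDs matched exactly, plus the mismatched IDs."""
    mismatches = [line_id for line_id, ref_line in reference.items()
                  if predicted.get(line_id) != ref_line]
    rate = 1 - len(mismatches) / len(reference) if reference else 0.0
    return rate, mismatches
```

For example, with reference {"o.1": "a-na", "o.2": "szu"} and predicted {"o.1": "a-na", "o.2": "su"}, this returns a 0.5 match rate and ["o.2"] as the mismatch list.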

Run the example (no CLI args)

python3 examples/example.py

Loads the mapping, reads data/Q499899-oracc.txt, converts to CDLI, and writes data/Q499899-converted.txt.


Building the dataset

See PIPELINE_GUIDE.md for the full walkthrough. The short version:

transliteration.csv + finaldf.csv
        ↓  load_to_db.py
   data/oracc2cdli.db
        ↓  build_word_table.py
   word_level table (DB)
        ↓  export_word_level.py
   data/word_level.csv  (~4.5M rows)
        ↓  clean_word_level.py
   data/word_level_cleaned.csv

The pipeline requires two source CSVs in data/ (gitignored — not included in the repo):

  • data/transliteration.csv — CDLI-format transliterations (id_text, transliteration)
  • data/finaldf.csv — ORACC word data (id_text, id_word, form, …)

Word conversion tests

Run unit and dataset (chunked) word conversion tests from project root:

python3 src/tests/run_word_conversion_tests.py

Optional: write a dated Markdown report:

python3 src/tests/run_word_conversion_tests.py --report src/tests/results/conversion_report.md

Supports --csv, --chunk, --max-rows, --no-roundtrip, --roundtrip-sample to tune the run.


Repository structure

oracc2cdli/
├── PIPELINE_GUIDE.md                 # Step-by-step dataset pipeline guide
├── SUGGESTED_MODIFICATIONS.md        # Non-binding refactoring suggestions
├── data/
│   ├── Q499899-cdli.txt              # Example CDLI-format input
│   ├── Q499899-oracc.txt             # Example ORACC-format input
│   ├── Q499899-cli-output.txt        # Example CLI conversion output
│   ├── word_level_cleaned_subset.csv # Cleaned subset (first N rows) for benchmarking
│   └── reference/
│       ├── ATF_Character_Conventions.csv   # ORACC ↔ CDLI character mapping (committed)
│       └── transliteration.csv             # Reference copy of transliteration.csv
│   (gitignored: transliteration.csv, finaldf.csv, word_level.csv,
│                word_level_cleaned.csv, oracc2cdli.db — generated by pipeline)
├── examples/
│   ├── README.md
│   └── example.py                    # Example ORACC→CDLI conversion (no CLI args)
└── src/
    ├── oracc_to_cdli.py              # CLI: ORACC → CDLI convert & clean
    ├── cdli_to_oracc.py              # CLI: CDLI → ORACC convert & clean
    ├── utils/
    │   ├── __init__.py               # Re-exports conversion, mapping, word_conversion, validate
    │   ├── utils.py                  # Character mapping + line-level conversion (ORACC↔CDLI)
    │   ├── word_conversion.py        # Atomic word-level conversion (ORACC↔CDLI); cached mappings, compiled regex
    │   └── validate.py               # Clean CDLI lines; compare predicted vs reference file
    ├── preprocessing/
    │   ├── __init__.py
    │   ├── load_to_db.py             # Load transliteration.csv + finaldf.csv → SQLite
    │   ├── build_word_table.py       # Build word-level table (id_text, tr_oracc, tr_cdli) in DB
    │   ├── export_word_level.py      # Export word_level table from DB → data/word_level.csv
    │   ├── clean_word_level.py       # Filter word_level.csv → data/word_level_cleaned.csv (optimized for ~4.5M rows)
    │   ├── clean_word_level_subset.py # Same as clean_word_level on first N rows → word_level_cleaned_subset.csv
    │   ├── analyze_dataset_quality.py # Sample word_level.csv; classify misalignment vs conversion; summary + JSON
    │   ├── dataset_quality_findings.md # Narrative findings on misalignment and conversion gaps
    │   ├── preprocess_old.py         # [Legacy] Merge/dedupe transliteration+finaldf; superseded by build_word_table
    │   └── dataset_quality_results/  # Quality analysis outputs
    │       ├── dataset_quality_findings.md   # Causes, fixes, and recommendations
    │       ├── cleaning_filter_analysis.md   # Analysis of cleaning filter boundary behavior
    │       ├── dataset_quality_2026-02-19.md # Dated quality report
    │       ├── dataset_quality_2026-02-24.md # Dated quality report
    │       └── analysis_summary.json         # Summary stats (latest run)
    ├── eda/
    │   ├── transliteration_eda.py    # EDA for transliteration.csv → results/transliteration_eda.md
    │   ├── finaldf_eda.py            # Chunked EDA for finaldf.csv → results/finaldf_eda.md
    │   ├── word_level_eda.py         # Chunked EDA for word_level.csv → results/word_level_eda.md
    │   ├── word_level_cleaned_eda.py # Chunked EDA for word_level_cleaned.csv → results/word_level_cleaned_eda.md
    │   └── results/                  # Generated EDA reports (*.md)
    └── tests/
        ├── __init__.py
        ├── test_word_conversion.py   # Reusable API: unit + dataset (chunked) word conversion tests
        ├── run_word_conversion_tests.py  # Runner: unit tests + dataset tests; optional --report path
        └── results/                  # Dated conversion reports (e.g. conversion_report_2-18.md, conversion_report_2-19.md)

Script map

| Path | Purpose |
| --- | --- |
| src/oracc_to_cdli.py | CLI: convert ORACC → CDLI, or clean an input file. Subcommands: convert, clean. |
| src/cdli_to_oracc.py | CLI: convert CDLI → ORACC, or clean an input file. Subcommands: convert, clean. |
| src/utils/utils.py | Character mapping (load_character_mapping, load_reverse_character_mapping); line-level conversion (convert_line_oracc_to_cdli, convert_line_cdli_to_oracc); validate_conversion for CSV accuracy. |
| src/utils/word_conversion.py | Atomic word-level conversion (word_oracc_to_cdli, word_cdli_to_oracc). Handles subscripts, determinatives, ellipsis. Mappings cached; single-pass compiled regex. |
| src/utils/validate.py | clean_line_cdli: normalise a line to CDLI_clean. validate: compare predicted vs reference file by line ID. CLI entry point. |
| src/preprocessing/load_to_db.py | Load transliteration.csv and finaldf.csv (chunked) into SQLite data/oracc2cdli.db. |
| src/preprocessing/build_word_table.py | Build word-level table from transliteration + finaldf; write table word_level to DB. Run after load_to_db. |
| src/preprocessing/export_word_level.py | Export the word_level table from SQLite to data/word_level.csv. |
| src/preprocessing/clean_word_level.py | Filter word_level.csv: drop misaligned rows and garbage tokens; write data/word_level_cleaned.csv. Keeps exact/high/conversion_issue, drops likely_misaligned (<30%) and garbage. Optimized for ~4.5M rows. |
| src/preprocessing/clean_word_level_subset.py | Same as clean_word_level on first N rows only; for benchmarking/timing. Output: word_level_cleaned_subset.csv. |
| src/preprocessing/analyze_dataset_quality.py | Sample word_level.csv, run CDLI↔ORACC conversion, classify rows (exact / high / conversion_issue / likely_misaligned). Writes dated report + JSON to dataset_quality_results/. |
| src/preprocessing/preprocess_old.py | [Legacy] Dedupe transliteration/finaldf, join on id_text, write merged table. Superseded by build_word_table. |
| src/eda/transliteration_eda.py | EDA for transliteration.csv (full load). Writes src/eda/results/transliteration_eda.md. |
| src/eda/finaldf_eda.py | Chunked EDA for finaldf.csv. Writes src/eda/results/finaldf_eda.md. |
| src/eda/word_level_eda.py | Chunked EDA for word_level.csv. Writes src/eda/results/word_level_eda.md. |
| src/eda/word_level_cleaned_eda.py | Chunked EDA for word_level_cleaned.csv with comparison against uncleaned stats. Writes src/eda/results/word_level_cleaned_eda.md. |
| src/tests/test_word_conversion.py | Reusable test API: unit tests (empty, None, malformed, edge cases, round-trip) and chunked dataset tests. Returns result dicts. |
| src/tests/run_word_conversion_tests.py | Runner: unit + dataset tests; optional --report <path>. Supports --csv, --chunk, --max-rows, --no-roundtrip, --roundtrip-sample. |
| examples/example.py | Example: load mapping, read ORACC file, convert to CDLI, save output. No CLI arguments. |

Dataset quality

About 57% of rows in word_level.csv are exact matches after bidirectional conversion. ~28% are likely misaligned (different words paired by position). The cleaning pipeline drops misaligned rows and garbage tokens, yielding word_level_cleaned.csv.
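
The row classification can be sketched with a string-similarity ratio. This stdlib sketch uses difflib, whereas the real pipeline uses rapidfuzz; the 30% likely_misaligned cutoff is the one from the cleaning step, while the 0.9 boundary between high and conversion_issue is an assumed placeholder:

```python
from difflib import SequenceMatcher

def classify_row(converted: str, reference: str,
                 high: float = 0.9, misaligned: float = 0.3) -> str:
    """Classify a converted word against the word it was paired with."""
    if converted == reference:
        return "exact"
    ratio = SequenceMatcher(None, converted, reference).ratio()
    if ratio >= high:
        return "high"
    if ratio >= misaligned:
        return "conversion_issue"
    return "likely_misaligned"
```

Rows classified likely_misaligned are the ones dropped by clean_word_level.py.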

For details, see src/preprocessing/dataset_quality_results/.

Performance (cleaning pipeline)

clean_word_level.py processes ~4.5M rows. Key optimizations in word_conversion.py and the cleaning scripts:

  • Cached character mappings — loaded from CSV once at module level; previously re-read on every word conversion call (~9M+ reads for a full run).
  • Single-pass compiled regex — _apply_mapping() uses one compiled re.Pattern (longest-first) instead of N × str.replace loops.
  • Pre-compiled regexes — subscript-digit and determinative regexes compiled once at module level.
  • str.translate() for digit→subscript — C-level translation table instead of a Python generator.
  • Cached non-digit sub-mapping — word_cdli_to_oracc no longer rebuilds a filtered dict on every call.
  • Stripping once — input strings stripped once in vectorized chunk preprocessing; redundant .strip() calls removed from hot path.
  • Reused ProcessPoolExecutor — pool created once and reused across all chunks instead of per-chunk.

| Metric | Value |
| --- | --- |
| Subset (10,000 rows) classify time | 1.4 s |
| Subset (10,000 rows) total time | 1.8 s |
| Full dataset (4,546,052 rows), estimated | ~10–14 min |
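
The compiled-regex and str.translate() techniques can be shown in isolation. The mapping pairs below are toy assumptions for the CDLI → ORACC direction; the real table is loaded once from data/reference/ATF_Character_Conventions.csv:

```python
import re

# Toy mapping pairs (assumptions, not the real convention table).
MAPPING = {"sz": "š", "s,": "ṣ", "t,": "ṭ", "h": "ḫ"}

# One alternation compiled at module level, longest keys first so a
# short key can never shadow a longer one sharing its prefix.
_PATTERN = re.compile("|".join(
    re.escape(key) for key in sorted(MAPPING, key=len, reverse=True)))

def apply_mapping(word: str) -> str:
    # Single pass over the string instead of one str.replace per entry.
    return _PATTERN.sub(lambda m: MAPPING[m.group(0)], word)

# C-level digit -> subscript translation table, built once.
_SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def digits_to_subscripts(sign: str) -> str:
    return sign.translate(_SUBSCRIPTS)
```

For example, apply_mapping("szu") yields "šu" and digits_to_subscripts("du3") yields "du₃".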
