ORACC2CDLI

A tool to convert between ORACC and CDLI cuneiform transliteration formats, under the FactGrid Cuneiform system.
Achieves 99.96% accuracy on the oldassyrian-lines dataset.

Status: Converts ORACC → CDLI_clean. Full round-trip to CDLI is limited by physical damage markers and editorial corrections in the source data.


Documentation

  • PIPELINE_GUIDE.md — Step-by-step guide to building a cleaned word-level training dataset (word_level_cleaned.csv) from raw source CSVs. Start here if you want to generate the dataset yourself.
  • SUGGESTED_MODIFICATIONS.md — Non-binding suggestions for future refactoring: reference data location, module naming, output directories, CLI layout.
  • src/preprocessing/dataset_quality_results/ — Quality analysis outputs: classification of misaligned vs conversion-gap rows, cleaning filter analysis, dated reports.

Quick start

Install dependencies

Requires Python ≥ 3.14. Install with uv:

uv sync

Or with pip:

pip install -r requirements.txt

Core dependencies: pandas, rapidfuzz. EDA scripts also use numpy.

Convert a file (ORACC → CDLI)

python3 src/oracc_to_cdli.py convert <input_file> <output_file> [--has-label]

Use --has-label when each line starts with an ID/label (e.g. P359065:obverse.1.1) followed by the word.
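
As an illustration of the input shape --has-label implies, each line can be split on the first whitespace into a label and a word. This is a sketch only, not the CLI's actual parsing code (which lives in src/oracc_to_cdli.py), and the word a-na is a placeholder:

```python
def split_label(line: str) -> tuple[str, str]:
    # Split "P359065:obverse.1.1 a-na" into ("P359065:obverse.1.1", "a-na").
    # If the line has no whitespace, the whole line is treated as the label.
    label, _, word = line.strip().partition(" ")
    return label, word
```

Without --has-label, the whole line is treated as the word to convert.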

Convert a file (CDLI → ORACC)

python3 src/cdli_to_oracc.py convert <input_file> <output_file> [--has-label]

Clean (strip lines; either format)

python3 src/oracc_to_cdli.py clean <input_file> <output_file>
# or
python3 src/cdli_to_oracc.py clean <input_file> <output_file>

Optional --mapping <path> overrides the default data/reference/ATF_Character_Conventions.csv.

Input/output format examples: see data/Q499899-*.txt.

Validate predicted output against reference CDLI

python3 src/utils/validate.py <predicted_file> <test_file>

Compares predicted lines to the reference CDLI file (by line ID) and prints match rate and mismatches.
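
Conceptually the comparison is a keyed exact-match rate. A minimal stdlib sketch (the function name and dict shapes here are assumptions, not validate.py's real API):

```python
def match_rate(predicted: dict[str, str],
               reference: dict[str, str]) -> tuple[float, list[str]]:
    """Fraction of reference line IDs matched exactly, plus the mismatched IDs."""
    mismatches = [line_id for line_id, ref_line in reference.items()
                  if predicted.get(line_id) != ref_line]
    rate = 1 - len(mismatches) / len(reference) if reference else 0.0
    return rate, mismatches
```

For example, with reference {"o.1": "a-na", "o.2": "szu"} and predicted {"o.1": "a-na", "o.2": "su"}, this returns a 0.5 match rate and ["o.2"] as the mismatch list.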

Run the example (no CLI args)

python3 examples/example.py

Loads the mapping, reads data/Q499899-oracc.txt, converts to CDLI, and writes data/Q499899-converted.txt.


Building the dataset

See PIPELINE_GUIDE.md for the full walkthrough. The short version:

transliteration.csv + finaldf.csv
        ↓  load_to_db.py
   data/oracc2cdli.db
        ↓  build_word_table.py
   word_level table (DB)
        ↓  export_word_level.py
   data/word_level.csv  (~4.5M rows)
        ↓  clean_word_level.py
   data/word_level_cleaned.csv

The pipeline requires two source CSVs in data/ (gitignored — not included in the repo):

  • data/transliteration.csv — CDLI-format transliterations (id_text, transliteration)
  • data/finaldf.csv — ORACC word data (id_text, id_word, form, …)

Word conversion tests

Run unit and dataset (chunked) word conversion tests from project root:

python3 src/tests/run_word_conversion_tests.py

Optional: write a dated Markdown report:

python3 src/tests/run_word_conversion_tests.py --report src/tests/results/conversion_report.md

Supports --csv, --chunk, --max-rows, --no-roundtrip, --roundtrip-sample to tune the run.


Repository structure

oracc2cdli/
├── PIPELINE_GUIDE.md                 # Step-by-step dataset pipeline guide
├── SUGGESTED_MODIFICATIONS.md        # Non-binding refactoring suggestions
├── data/
│   ├── Q499899-cdli.txt              # Example CDLI-format input
│   ├── Q499899-oracc.txt             # Example ORACC-format input
│   ├── Q499899-cli-output.txt        # Example CLI conversion output
│   ├── word_level_cleaned_subset.csv # Cleaned subset (first N rows) for benchmarking
│   └── reference/
│       ├── ATF_Character_Conventions.csv   # ORACC ↔ CDLI character mapping (committed)
│       └── transliteration.csv             # Reference copy of transliteration.csv
│   (gitignored: transliteration.csv, finaldf.csv, word_level.csv,
│                word_level_cleaned.csv, oracc2cdli.db — generated by pipeline)
├── examples/
│   ├── README.md
│   └── example.py                    # Example ORACC→CDLI conversion (no CLI args)
└── src/
    ├── oracc_to_cdli.py              # CLI: ORACC → CDLI convert & clean
    ├── cdli_to_oracc.py              # CLI: CDLI → ORACC convert & clean
    ├── utils/
    │   ├── __init__.py               # Re-exports conversion, mapping, word_conversion, validate
    │   ├── utils.py                  # Character mapping + line-level conversion (ORACC↔CDLI)
    │   ├── word_conversion.py        # Atomic word-level conversion (ORACC↔CDLI); cached mappings, compiled regex
    │   └── validate.py               # Clean CDLI lines; compare predicted vs reference file
    ├── preprocessing/
    │   ├── __init__.py
    │   ├── load_to_db.py             # Load transliteration.csv + finaldf.csv → SQLite
    │   ├── build_word_table.py       # Build word-level table (id_text, tr_oracc, tr_cdli) in DB
    │   ├── export_word_level.py      # Export word_level table from DB → data/word_level.csv
    │   ├── clean_word_level.py       # Filter word_level.csv → data/word_level_cleaned.csv (optimized for ~4.5M rows)
    │   ├── clean_word_level_subset.py # Same as clean_word_level on first N rows → word_level_cleaned_subset.csv
    │   ├── analyze_dataset_quality.py # Sample word_level.csv; classify misalignment vs conversion; summary + JSON
    │   ├── dataset_quality_findings.md # Narrative findings on misalignment and conversion gaps
    │   ├── preprocess_old.py         # [Legacy] Merge/dedupe transliteration+finaldf; superseded by build_word_table
    │   └── dataset_quality_results/  # Quality analysis outputs
    │       ├── dataset_quality_findings.md   # Causes, fixes, and recommendations
    │       ├── cleaning_filter_analysis.md   # Analysis of cleaning filter boundary behavior
    │       ├── dataset_quality_2026-02-19.md # Dated quality report
    │       ├── dataset_quality_2026-02-24.md # Dated quality report
    │       └── analysis_summary.json         # Summary stats (latest run)
    ├── eda/
    │   ├── transliteration_eda.py    # EDA for transliteration.csv → results/transliteration_eda.md
    │   ├── finaldf_eda.py            # Chunked EDA for finaldf.csv → results/finaldf_eda.md
    │   ├── word_level_eda.py         # Chunked EDA for word_level.csv → results/word_level_eda.md
    │   ├── word_level_cleaned_eda.py # Chunked EDA for word_level_cleaned.csv → results/word_level_cleaned_eda.md
    │   └── results/                  # Generated EDA reports (*.md)
    └── tests/
        ├── __init__.py
        ├── test_word_conversion.py   # Reusable API: unit + dataset (chunked) word conversion tests
        ├── run_word_conversion_tests.py  # Runner: unit tests + dataset tests; optional --report path
        └── results/                  # Dated conversion reports (e.g. conversion_report_2-18.md, conversion_report_2-19.md)

Script map

| Path | Purpose |
| --- | --- |
| src/oracc_to_cdli.py | CLI: convert ORACC → CDLI, or clean an input file. Subcommands: convert, clean. |
| src/cdli_to_oracc.py | CLI: convert CDLI → ORACC, or clean an input file. Subcommands: convert, clean. |
| src/utils/utils.py | Character mapping (load_character_mapping, load_reverse_character_mapping); line-level conversion (convert_line_oracc_to_cdli, convert_line_cdli_to_oracc); validate_conversion for CSV accuracy. |
| src/utils/word_conversion.py | Atomic word-level conversion (word_oracc_to_cdli, word_cdli_to_oracc). Handles subscripts, determinatives, ellipsis. Mappings cached; single-pass compiled regex. |
| src/utils/validate.py | clean_line_cdli: normalise a line to CDLI_clean. validate: compare predicted vs reference file by line ID. CLI entry point. |
| src/preprocessing/load_to_db.py | Load transliteration.csv and finaldf.csv (chunked) into SQLite data/oracc2cdli.db. |
| src/preprocessing/build_word_table.py | Build word-level table from transliteration + finaldf; write table word_level to DB. Run after load_to_db. |
| src/preprocessing/export_word_level.py | Export the word_level table from SQLite to data/word_level.csv. |
| src/preprocessing/clean_word_level.py | Filter word_level.csv: drop misaligned rows and garbage tokens; write data/word_level_cleaned.csv. Keeps exact/high/conversion_issue, drops likely_misaligned (<30%) and garbage. Optimized for ~4.5M rows. |
| src/preprocessing/clean_word_level_subset.py | Same as clean_word_level on first N rows only; for benchmarking/timing. Output: word_level_cleaned_subset.csv. |
| src/preprocessing/analyze_dataset_quality.py | Sample word_level.csv, run CDLI↔ORACC conversion, classify rows (exact / high / conversion_issue / likely_misaligned). Writes dated report + JSON to dataset_quality_results/. |
| src/preprocessing/preprocess_old.py | [Legacy] Dedupe transliteration/finaldf, join on id_text, write merged table. Superseded by build_word_table. |
| src/eda/transliteration_eda.py | EDA for transliteration.csv (full load). Writes src/eda/results/transliteration_eda.md. |
| src/eda/finaldf_eda.py | Chunked EDA for finaldf.csv. Writes src/eda/results/finaldf_eda.md. |
| src/eda/word_level_eda.py | Chunked EDA for word_level.csv. Writes src/eda/results/word_level_eda.md. |
| src/eda/word_level_cleaned_eda.py | Chunked EDA for word_level_cleaned.csv with comparison against uncleaned stats. Writes src/eda/results/word_level_cleaned_eda.md. |
| src/tests/test_word_conversion.py | Reusable test API: unit tests (empty, None, malformed, edge cases, round-trip) and chunked dataset tests. Returns result dicts. |
| src/tests/run_word_conversion_tests.py | Runner: unit + dataset tests; optional --report <path>. Supports --csv, --chunk, --max-rows, --no-roundtrip, --roundtrip-sample. |
| examples/example.py | Example: load mapping, read ORACC file, convert to CDLI, save output. No CLI arguments. |

Dataset quality

About 57% of rows in word_level.csv are exact matches after bidirectional conversion. ~28% are likely misaligned (different words paired by position). The cleaning pipeline drops misaligned rows and garbage tokens, yielding word_level_cleaned.csv.
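
The row classification can be sketched with a string-similarity ratio. This stdlib sketch uses difflib, whereas the real pipeline uses rapidfuzz; the 30% likely_misaligned cutoff is the one from the cleaning step, while the 0.9 boundary between high and conversion_issue is an assumed placeholder:

```python
from difflib import SequenceMatcher

def classify_row(converted: str, reference: str,
                 high: float = 0.9, misaligned: float = 0.3) -> str:
    """Classify a converted word against the word it was paired with."""
    if converted == reference:
        return "exact"
    ratio = SequenceMatcher(None, converted, reference).ratio()
    if ratio >= high:
        return "high"
    if ratio >= misaligned:
        return "conversion_issue"
    return "likely_misaligned"
```

Rows classified likely_misaligned are the ones dropped by clean_word_level.py.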

For details, see src/preprocessing/dataset_quality_results/.

Performance (cleaning pipeline)

clean_word_level.py processes ~4.5M rows. Key optimizations in word_conversion.py and the cleaning scripts:

  • Cached character mappings — loaded from CSV once at module level; previously re-read on every word conversion call (~9M+ reads for a full run).
  • Single-pass compiled regex — _apply_mapping() uses one compiled re.Pattern (longest-first) instead of N × str.replace loops.
  • Pre-compiled regexes — subscript-digit and determinative regexes compiled once at module level.
  • str.translate() for digit→subscript — C-level translation table instead of a Python generator.
  • Cached non-digit sub-mapping — word_cdli_to_oracc no longer rebuilds a filtered dict on every call.
  • Stripping once — input strings stripped once in vectorized chunk preprocessing; redundant .strip() calls removed from hot path.
  • Reused ProcessPoolExecutor — pool created once and reused across all chunks instead of per-chunk.

| Metric | Value |
| --- | --- |
| Subset (10,000 rows) classify time | 1.4 s |
| Subset (10,000 rows) total time | 1.8 s |
| Full dataset (4,546,052 rows), estimated | ~10–14 min |
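
The compiled-regex and str.translate() techniques can be shown in isolation. The mapping pairs below are toy assumptions for the CDLI → ORACC direction; the real table is loaded once from data/reference/ATF_Character_Conventions.csv:

```python
import re

# Toy mapping pairs (assumptions, not the real convention table).
MAPPING = {"sz": "š", "s,": "ṣ", "t,": "ṭ", "h": "ḫ"}

# One alternation compiled at module level, longest keys first so a
# short key can never shadow a longer one sharing its prefix.
_PATTERN = re.compile("|".join(
    re.escape(key) for key in sorted(MAPPING, key=len, reverse=True)))

def apply_mapping(word: str) -> str:
    # Single pass over the string instead of one str.replace per entry.
    return _PATTERN.sub(lambda m: MAPPING[m.group(0)], word)

# C-level digit -> subscript translation table, built once.
_SUBSCRIPTS = str.maketrans("0123456789", "₀₁₂₃₄₅₆₇₈₉")

def digits_to_subscripts(sign: str) -> str:
    return sign.translate(_SUBSCRIPTS)
```

For example, apply_mapping("szu") yields "šu" and digits_to_subscripts("du3") yields "du₃".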
