
loadfile

eDiscovery load file conversion, validation, and normalization CLI.

The problem: eDiscovery load files come in multiple formats with inconsistent delimiters, encodings, date formats, and line endings. Converting between formats and validating data integrity before importing into review tools (Relativity, Everlaw, etc.) currently requires either ReadySuite ($1,599/year, Windows-only) or painful manual work.

loadfile is a Python CLI that handles all of this. It auto-detects formats and dialects, converts between them, validates structural integrity, and normalizes the messy edge cases that cause silent import failures.

Quick Start

pip install loadfile
# What format and encoding is this file?
loadfile inspect production.dat

# Convert encoding from CP1252 to UTF-8
loadfile convert production.dat --to utf-8

# Convert all files in a directory
loadfile batch ./case_files --to utf-8 --recursive

# Convert between formats (e.g., DII → DAT)
loadfile transform metadata.dii --to dat

# Validate before importing
loadfile validate production.dat --cross-file images.opt

# Check if fields will be truncated in Relativity
loadfile check-lengths production.dat

# Normalize mixed date formats to ISO
loadfile dates production.dat --normalize iso

# Analyze Bates numbering for gaps
loadfile bates production.dat

Supported Formats

| Format | Description |
| --- | --- |
| Concordance DAT | Metadata load file (9 dialect variants) |
| Opticon OPT | Image-level load file |
| IPRO LFP | Image load file (command-code format) |
| Summation DII | Token-based metadata load file |
| EDRM XML | Industry-standard XML schema |

DAT Dialect Auto-Detection

Most tools produce slightly different DAT files. loadfile auto-detects and handles all of them:

| Dialect | Separator | Qualifier | Used By |
| --- | --- | --- | --- |
| concordance | þ | ® | Concordance, IPRO, most vendors |
| pipe_caret | \| | ^ | Relativity default |
| pipe_quote | \| | " | Everlaw, some vendors |
| pipe_bare | \| | (none) | Simple exports |
| tab_quote | \t | " | Spreadsheet exports |
| tab_bare | \t | (none) | TSV-style exports |
| comma_quote | , | " | CSV-style .dat files |
| pipe_thorn | \| | þ | Rare vendor variant |
| ascii_ctrl | \x14 | \xFE | Legacy ASCII control |
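The detection internals aren't documented here, but the core idea can be sketched in a few lines: score each candidate separator against the header line and use the qualifier to break ties. This is an illustrative stdlib sketch under assumed heuristics, not loadfile's actual algorithm:

```python
# Illustrative sketch of DAT dialect detection (NOT loadfile's real code):
# count how often each candidate separator appears in the header line,
# and give a bonus when the expected qualifier opens the first field.

DIALECTS = {
    "concordance": ("\u00fe", "\u00ae"),   # þ / ®
    "pipe_caret":  ("|", "^"),
    "pipe_quote":  ("|", '"'),
    "tab_quote":   ("\t", '"'),
    "comma_quote": (",", '"'),
}

def guess_dialect(header_line: str) -> str:
    best, best_score = "pipe_bare", 0
    for name, (sep, qual) in DIALECTS.items():
        score = header_line.count(sep)
        if score and header_line.lstrip().startswith(qual):
            score += 1  # qualifier match breaks ties between same-separator dialects
        if score > best_score:
            best, best_score = name, score
    return best
```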

Commands

inspect — View file structure

Auto-detects format, encoding, and dialect. Works with all 5 formats.

loadfile inspect production.dat
File:       production.dat
Size:       45,678 bytes
Encoding:   cp1252 (73%)
BOM:        no
Format:     dat

Variant:    concordance (sep='þ', qual='®')
Fields:     12
Records:    1,847

Headers:
    1. DOCID
    2. BEGBATES
    3. CUSTODIAN
    ...

Options:

  • --max-rows — Number of records to preview (default: 5)
  • --format — File format: auto (default), dat, opt, lfp, dii, edrm_xml

detect — Identify encoding

loadfile detect production.dat

Options:

  • --sample-size — Bytes to sample for detection (default: 65536)

convert — Change encoding

Supports UTF-8, UTF-8 BOM, CP1252, Latin-1, UTF-16 LE/BE. Auto-detects source encoding.

loadfile convert production.dat --to utf-8
loadfile convert production.dat --from cp1252 --to utf-8 -o output.dat
loadfile convert production.dat --to utf-8-sig   # with BOM

Options:

  • -t, --to — Target encoding (required)
  • -f, --from — Source encoding (auto-detected if omitted)
  • -o, --output — Output file path (default: <input>.converted)
  • --add-bom — Add UTF-8 BOM to output
  • --strip-bom — Remove BOM from output
  • --on-error — How to handle unmappable characters: strict (default, raises error), replace, ignore
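The --on-error modes map directly onto Python's codec error handlers; a minimal stdlib sketch of the same behavior (illustrative, not loadfile's implementation):

```python
# Illustrative encoding conversion using only the stdlib. The --on-error
# modes correspond to Python's codec error handlers: "strict" raises on
# unmappable characters, "replace" substitutes a placeholder, "ignore"
# silently drops them.

def convert_bytes(data: bytes, src: str, dst: str, on_error: str = "strict") -> bytes:
    text = data.decode(src, errors=on_error)
    return text.encode(dst, errors=on_error)
```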

batch — Convert a directory

loadfile batch ./case_files --to utf-8 --recursive --dry-run   # preview
loadfile batch ./case_files --to utf-8 --recursive              # execute
loadfile batch ./data --to utf-8 --pattern "*.dat" --recursive

Skips files already in the target encoding. Preserves subdirectory structure.

Options:

  • -t, --to — Target encoding (required)
  • -f, --from — Source encoding (auto-detected per file if omitted)
  • -o, --output-dir — Output directory (default: <input_dir>/converted/)
  • --pattern — Glob pattern for matching files (default: *)
  • -r, --recursive — Search subdirectories
  • --add-bom — Add UTF-8 BOM to output
  • --strip-bom — Remove BOM from output
  • --on-error — How to handle unmappable characters: strict (default, raises error), replace, ignore
  • --dry-run — Detect encodings only, don't convert

transform — Convert between formats

Metadata formats (DAT ↔ DII ↔ EDRM XML) and image formats (OPT ↔ LFP) convert to each other via an intermediate data model.

loadfile transform metadata.dii --to dat
loadfile transform production.dat --to edrm_xml
loadfile transform images.opt --to lfp

Options:

  • -t, --to — Target format: dat, dii, edrm_xml, opt, lfp (required)
  • -f, --from — Source format (auto-detected if omitted)
  • -o, --output — Output file path (default: input file with new extension)

validate — Check for errors

Runs three layers of validation:

# Single file — structural + format-specific checks
loadfile validate production.dat

# Cross-file checks (BATES in DAT vs OPT)
loadfile validate production.dat --cross-file images.opt

# With image path verification
loadfile validate production.dat --cross-file images.opt --image-dir ./IMAGES/

Structural checks: field count mismatches, duplicate IDs, empty required fields, empty records, trailing whitespace.

Format-specific checks: DAT field count per row, OPT/LFP missing initial DocBreak, duplicate image IDs, empty paths, DII missing DOCID, EDRM empty tags and missing file references.

Cross-file checks: BATES numbers in DAT exist in OPT, OPT document starts match DAT, image file existence on disk.

Options:

  • --format — File format: auto (default), dat, opt, lfp, dii, edrm_xml
  • --cross-file — Image load file (OPT/LFP) for cross-file validation
  • --image-dir — Directory to verify image file existence
  • --bates-field — Field name for BATES numbers (auto-detected if omitted)
  • --id-field — Field name for document ID (auto-detected if omitted)
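The cross-file Bates comparison boils down to set differences between the two files; an illustrative sketch (function and key names are assumptions, not loadfile's API):

```python
# Illustrative cross-file check: every BEGBATES value in the metadata
# (DAT) records should have a matching document start in the OPT file,
# and vice versa. Names are assumed for illustration.

def cross_check(dat_bates: list[str], opt_doc_starts: list[str]) -> dict:
    dat_set, opt_set = set(dat_bates), set(opt_doc_starts)
    return {
        "missing_images": sorted(dat_set - opt_set),  # metadata with no image
        "orphan_images":  sorted(opt_set - dat_set),  # image with no metadata
    }
```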

dates — Analyze and normalize dates

Auto-detects date fields and shows format distribution. Detects 11 date patterns including MM/DD/YYYY, YYYY-MM-DD, DD-MMM-YYYY, M/D/YY, and variants with time. Warns on mixed formats, unparseable values, and 2-digit years. Currently supports DAT format.

loadfile dates production.dat                        # analyze
loadfile dates production.dat --normalize iso         # normalize to YYYY-MM-DD
loadfile dates production.dat --normalize relativity   # normalize to DD-MMM-YYYY
loadfile dates data.dat --field DATESENT --field DATECREATED --normalize iso

Options:

  • --field — Date field name(s) to check (repeatable, auto-detected if omitted)
  • --normalize — Normalize dates to a format: iso (YYYY-MM-DD), us (MM/DD/YYYY), relativity (DD-MMM-YYYY), compact (YYYYMMDD)
  • -o, --output — Output file path (default: <input>.normalized.<ext>)
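Normalization amounts to trying known patterns in order and re-emitting in the target format; a stdlib sketch covering a few of the patterns (the pattern list here is abbreviated and illustrative, not loadfile's actual table of 11):

```python
from datetime import datetime

# Illustrative date normalization: try a handful of the patterns the
# README lists and re-emit as ISO (YYYY-MM-DD).

PATTERNS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y", "%m/%d/%y", "%Y%m%d"]

def to_iso(value: str):
    for pat in PATTERNS:
        try:
            return datetime.strptime(value.strip(), pat).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable -- loadfile would flag this as a warning
```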

check-lengths — Prevent truncation on import

Pre-validates field lengths against review tool limits. Knows Relativity defaults (400 chars for fixed-length, 450 for identifiers, 10MB for long text). Currently supports DAT format.

loadfile check-lengths production.dat                      # Relativity defaults
loadfile check-lengths production.dat --target summation    # Summation limits
loadfile check-lengths production.dat --limit 200           # custom limit

Options:

  • --target — Target review tool: relativity (default), summation, concordance
  • --limit — Override default field length limit for all fields
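The check itself is a straightforward comparison of value lengths against per-field caps; an illustrative sketch using the Relativity defaults quoted above (function and parameter names are assumptions):

```python
# Illustrative field-length pre-check. Relativity defaults quoted in the
# README: 400 chars for fixed-length fields, 450 for identifier fields.

def overlong_values(headers, rows, limit=400, id_fields=(), id_limit=450):
    hits = []
    for row in rows:
        for name, value in zip(headers, row):
            cap = id_limit if name in id_fields else limit
            if len(value) > cap:
                hits.append((name, len(value), cap))  # field, actual, limit
    return hits
```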

bates — Analyze Bates numbering

Parses Bates patterns (prefix + zero-padded number), detects gaps, prefix changes mid-production, and padding inconsistencies. Supports DAT, DII, and EDRM XML.

loadfile bates production.dat
loadfile bates production.dat --field BEGBATES
loadfile bates metadata.dii --format dii

Options:

  • --field — Field name containing BATES numbers (auto-detected if omitted)
  • --format — File format: auto (default), dat, dii, edrm_xml
  • --max-gaps — Maximum number of gaps to report (default: 100)
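Gap detection splits each Bates number into a prefix and a zero-padded integer, then looks for non-consecutive values within a prefix; an illustrative sketch (not loadfile's implementation):

```python
import re

# Illustrative Bates gap detection: parse prefix + number, then report
# the missing range between consecutive entries sharing a prefix.

BATES_RE = re.compile(r"^([A-Za-z]+)(\d+)$")

def find_gaps(bates: list[str]) -> list[tuple[str, int, int]]:
    parsed = []
    for b in bates:
        m = BATES_RE.match(b)
        if m:
            parsed.append((m.group(1), int(m.group(2))))
    gaps = []
    for (p1, n1), (p2, n2) in zip(parsed, parsed[1:]):
        if p1 == p2 and n2 > n1 + 1:
            gaps.append((p1, n1 + 1, n2 - 1))  # prefix, first and last missing
    return gaps
```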

scan-encoding — Detect mixed encodings

Scans each line independently to find files where UTF-8 and CP1252 are mixed (common when files are assembled from multiple sources).

loadfile scan-encoding production.dat
loadfile scan-encoding data.dat --primary utf-8 --fallback cp1252

Options:

  • --primary — Expected primary encoding (auto-detected if omitted)
  • --fallback — Fallback encoding to try (default: cp1252)
  • --max-samples — Max number of problem lines to show (default: 10)
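The per-line scan can be sketched with the stdlib: lines that fail to decode as UTF-8 but succeed under the fallback are the suspects (illustrative sketch, not loadfile's implementation):

```python
# Illustrative per-line mixed-encoding scan: report 1-based line numbers
# that are not valid UTF-8 but do decode under the fallback encoding.

def suspect_lines(raw: bytes, fallback: str = "cp1252") -> list[int]:
    suspects = []
    for i, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            line.decode(fallback)  # raises only if the line is junk in both
            suspects.append(i)
    return suspects
```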

fix-endings — Normalize line endings

Detects mixed CR/LF/CRLF and normalizes to a single type. Verifies line count integrity after conversion.

loadfile fix-endings production.dat                    # default: CRLF (Windows)
loadfile fix-endings production.dat --target lf         # Unix-style
loadfile fix-endings data.dat --target crlf -o output.dat

Options:

  • --target — Target line ending: crlf (default, for Windows tools), lf
  • -o, --output — Output file path (default: <input>.fixed.<ext>)
  • --force — Force normalization even if multiline fields are detected
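The normalize-then-verify approach can be sketched in a few lines. Note that this naive version ignores multiline fields inside qualifiers, which is exactly the case loadfile guards behind --force (illustrative sketch, not loadfile's implementation):

```python
# Illustrative line-ending normalization with a line-count integrity
# check. bytes.splitlines() splits on \r, \n, and \r\n, so mixed
# endings collapse to one list of lines.

def normalize_endings(raw: bytes, target: bytes = b"\r\n") -> bytes:
    lines = raw.splitlines()
    out = target.join(lines)
    if raw.endswith((b"\r", b"\n")):
        out += target                  # preserve a trailing newline
    assert out.splitlines() == lines   # integrity: same lines before and after
    return out
```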

Python API

All functionality is available programmatically:

from pathlib import Path

from loadfile.parsers.dat import parse, serialize
from loadfile.encodings.detector import detect_encoding, convert_encoding
from loadfile.converter.convert import convert_format
from loadfile.validators.structural import validate_structure
from loadfile.validators.bates import analyze_bates

# Parse any DAT dialect
dat = parse(Path("production.dat").read_text())
print(dat.variant)   # pipe_caret (sep='|', qual='^')
print(dat.headers)   # ['Control Number', 'Custodian', ...]
print(dat.rows[0])   # ['REL-0000001', 'Smith, John', ...]

# Convert format
edrm_xml = convert_format(dat_content, target="edrm_xml")

# Stream large files (memory-efficient)
from loadfile.parsers.streaming import stream_dat
for record in stream_dat(Path("huge.dat")):
    print(record["DOCID"])

# Validate
from loadfile.validators.cross_file import validate_cross_files
report = validate_cross_files(metadata_records=records, opt_file=opt)
for issue in report.issues:
    print(issue)

Architecture

src/loadfile/
├── cli.py                 # 11 CLI commands
├── format_detect.py       # Auto-detect file format from content
├── encodings/
│   ├── detector.py        # Encoding detection (BOM, UTF-8, chardet, CP1252 heuristic)
│   └── mixed.py           # Line-level mixed encoding detection + fallback
├── parsers/
│   ├── dat.py             # Concordance DAT (9 dialects, auto-detection)
│   ├── opt.py             # Opticon OPT
│   ├── lfp.py             # IPRO LFP
│   ├── dii.py             # Summation DII
│   ├── edrm.py            # EDRM XML (multiple schema variants)
│   └── streaming.py       # Memory-efficient streaming for large files
├── converter/
│   ├── models.py          # Intermediate data model
│   ├── adapters.py        # Format ↔ intermediate conversion
│   └── convert.py         # High-level convert API
├── validators/
│   ├── models.py          # Issue, Severity, ValidationReport
│   ├── structural.py      # Field counts, duplicate IDs, empty fields
│   ├── format_specific.py # Per-format validation rules
│   ├── cross_file.py      # DAT ↔ OPT cross-referencing
│   ├── bates.py           # Bates number pattern analysis
│   └── field_length.py    # Import limit pre-validation
└── utils/
    ├── dates.py           # Date format detection + normalization
    ├── paths.py           # Path separator normalization
    └── line_endings.py    # Line ending detection + normalization

Format conversion uses a hub-and-spoke model: each format converts to/from an intermediate representation, so adding a new format requires only one adapter (not N×N converters).
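A minimal sketch of that model (all names here are illustrative assumptions, not loadfile's actual classes, and the DII output is heavily simplified):

```python
from dataclasses import dataclass, field

# Illustrative hub-and-spoke conversion: each format needs only a
# to-model / from-model adapter pair against one intermediate model,
# so N formats need N adapters instead of N*N converters.

@dataclass
class Document:
    doc_id: str
    fields: dict = field(default_factory=dict)

def dat_to_model(headers, rows):
    # Assumes the first column is the document ID (illustrative only).
    return [Document(row[0], dict(zip(headers, row))) for row in rows]

def model_to_dii(docs):
    # DII is token-based; @T opening each document is a simplification.
    return "\n".join(f"@T {d.doc_id}" for d in docs)
```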

Development

git clone https://github.com/frisk55frisk/loadfile-tool.git
cd loadfile-tool
pip install -e ".[dev]"
pytest

Tests: 169 passing (111 pytest + 19 parser + 39 edge-case), with 32 fixture files covering dialect detection, mixed-encoding files, multiline fields, path separator variants, BOM handling, data integrity, and realistic multi-hundred-record roundtrips.

Why Not ReadySuite?

| | loadfile | ReadySuite |
| --- | --- | --- |
| Price | Free (MIT) | $1,599/year |
| Platform | Any (Python CLI) | Windows only |
| Automation | CLI / Python API, scriptable | GUI only |
| Formats | 5 formats, 9 DAT dialects | More formats |
| Data locality | Runs locally; data never leaves your machine | Runs locally |

loadfile is not a full replacement for ReadySuite's GUI and QC workflow. It's a focused tool for the conversion, validation, and normalization layer — the part that happens before you load data into your review platform.

Beta

loadfile is currently in beta. We're looking for feedback from eDiscovery professionals using loadfile in practice. If you find a bug, encounter a file format we can't handle, or have a feature request, please open an issue.

Contributing

Bug reports, feature requests, and feedback are welcome. If you encounter a load file edge case that loadfile doesn't handle, please open an issue, ideally with a (sanitized) sample file.

During the beta period we are not accepting external code contributions. This may change in the future.

License

MIT
