eDiscovery load file conversion, validation, and normalization CLI.
The problem: eDiscovery load files come in multiple formats with inconsistent delimiters, encodings, date formats, and line endings. Converting between formats and validating data integrity before importing into review tools (Relativity, Everlaw, etc.) currently requires either ReadySuite ($1,599/year, Windows-only) or painful manual work.
loadfile is a Python CLI that handles all of this. It auto-detects formats and dialects, converts between them, validates structural integrity, and normalizes the messy edge cases that cause silent import failures.
```
pip install loadfile
```

```
# What format and encoding is this file?
loadfile inspect production.dat

# Convert encoding from CP1252 to UTF-8
loadfile convert production.dat --to utf-8

# Convert all files in a directory
loadfile batch ./case_files --to utf-8 --recursive

# Convert between formats (e.g., DII → DAT)
loadfile transform metadata.dii --to dat

# Validate before importing
loadfile validate production.dat --cross-file images.opt

# Check if fields will be truncated in Relativity
loadfile check-lengths production.dat

# Normalize mixed date formats to ISO
loadfile dates production.dat --normalize iso

# Analyze Bates numbering for gaps
loadfile bates production.dat
```

| Format | Description | Parse | Serialize | Convert |
|---|---|---|---|---|
| Concordance DAT | Metadata load file (9 dialect variants) | ✓ | ✓ | ✓ |
| Opticon OPT | Image-level load file | ✓ | ✓ | ✓ |
| IPRO LFP | Image load file (command-code format) | ✓ | ✓ | ✓ |
| Summation DII | Token-based metadata load file | ✓ | ✓ | ✓ |
| EDRM XML | Industry-standard XML schema | ✓ | ✓ | ✓ |
Most tools produce slightly different DAT files. loadfile auto-detects and handles all of them:
| Dialect | Separator | Qualifier | Used By |
|---|---|---|---|
| concordance | þ | ® | Concordance, IPRO, most vendors |
| pipe_caret | \| | ^ | Relativity default |
| pipe_quote | \| | " | Everlaw, some vendors |
| pipe_bare | \| | (none) | Simple exports |
| tab_quote | \t | " | Spreadsheet exports |
| tab_bare | \t | (none) | TSV-style exports |
| comma_quote | , | " | CSV-style .dat files |
| pipe_thorn | \| | þ | Rare vendor variant |
| ascii_ctrl | \x14 | \xFE | Legacy ASCII control |
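Dialect detection of this kind can be approximated by scoring candidate separators for consistency across lines. The sketch below is a simplified illustration of the idea, not loadfile's actual algorithm; `guess_separator` is a hypothetical name:

```python
# Simplified sketch: a separator is plausible when it appears the same
# nonzero number of times on every sampled line.
CANDIDATE_SEPARATORS = ["\u00fe", "|", "\t", ",", "\x14"]  # þ, pipe, tab, comma, ctrl

def guess_separator(lines):
    """Return the candidate with a consistent, maximal per-line count."""
    best, best_count = None, 0
    for sep in CANDIDATE_SEPARATORS:
        counts = {line.count(sep) for line in lines if line}
        if len(counts) == 1 and (n := counts.pop()) > best_count:
            best, best_count = sep, n
    return best
```

Qualifier detection works similarly once the separator is fixed: split each line and check which wrapper character encloses the fields.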
Auto-detects format, encoding, and dialect. Works with all 5 formats.
```
loadfile inspect production.dat
```

```
File: production.dat
Size: 45,678 bytes
Encoding: cp1252 (73%)
BOM: no
Format: dat
Variant: concordance (sep='þ', qual='®')
Fields: 12
Records: 1,847
Headers:
  1. DOCID
  2. BEGBATES
  3. CUSTODIAN
  ...
```

Options:
- `--max-rows` — Number of records to preview (default: 5)
- `--format` — File format: `auto` (default), `dat`, `opt`, `lfp`, `dii`, `edrm_xml`
```
loadfile detect production.dat
```

Options:
- `--sample-size` — Bytes to sample for detection (default: 65536)
Supports UTF-8, UTF-8 BOM, CP1252, Latin-1, UTF-16 LE/BE. Auto-detects source encoding.
```
loadfile convert production.dat --to utf-8
loadfile convert production.dat --from cp1252 --to utf-8 -o output.dat
loadfile convert production.dat --to utf-8-sig  # with BOM
```

Options:
- `-t, --to` — Target encoding (required)
- `-f, --from` — Source encoding (auto-detected if omitted)
- `-o, --output` — Output file path (default: `<input>.converted`)
- `--add-bom` — Add UTF-8 BOM to output
- `--strip-bom` — Remove BOM from output
- `--on-error` — How to handle unmappable characters: `strict` (default, raises error), `replace`, `ignore`
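Conceptually, each conversion is a decode/re-encode pass, with the `--on-error` policy mapping onto Python's codec error handlers. A minimal standalone sketch of the idea (`reencode` is a hypothetical name, not the library's implementation, which also handles BOMs and auto-detection):

```python
def reencode(data: bytes, src: str, dst: str, on_error: str = "strict") -> bytes:
    """Decode from src, re-encode to dst. on_error mirrors Python's
    codec error handlers: 'strict', 'replace', or 'ignore'."""
    text = data.decode(src)  # strict decode: bad source bytes raise early
    return text.encode(dst, errors=on_error)

# CP1252 smart quotes become multi-byte UTF-8 sequences
utf8 = reencode(b"\x93quoted\x94", "cp1252", "utf-8")
```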
```
loadfile batch ./case_files --to utf-8 --recursive --dry-run  # preview
loadfile batch ./case_files --to utf-8 --recursive            # execute
loadfile batch ./data --to utf-8 --pattern "*.dat" --recursive
```

Skips files already in the target encoding. Preserves subdirectory structure.

Options:
- `-t, --to` — Target encoding (required)
- `-f, --from` — Source encoding (auto-detected per file if omitted)
- `-o, --output-dir` — Output directory (default: `<input_dir>/converted/`)
- `--pattern` — Glob pattern for matching files (default: `*`)
- `-r, --recursive` — Search subdirectories
- `--add-bom` — Add UTF-8 BOM to output
- `--strip-bom` — Remove BOM from output
- `--on-error` — How to handle unmappable characters: `strict` (default, raises error), `replace`, `ignore`
- `--dry-run` — Detect encodings only, don't convert
Metadata formats (DAT ↔ DII ↔ EDRM XML) and image formats (OPT ↔ LFP) convert to each other via an intermediate data model.
```
loadfile transform metadata.dii --to dat
loadfile transform production.dat --to edrm_xml
loadfile transform images.opt --to lfp
```

Options:
- `-t, --to` — Target format: `dat`, `dii`, `edrm_xml`, `opt`, `lfp` (required)
- `-f, --from` — Source format (auto-detected if omitted)
- `-o, --output` — Output file path (default: input file with new extension)
Runs three layers of validation:
```
# Single file — structural + format-specific checks
loadfile validate production.dat

# Cross-file checks (BATES in DAT vs OPT)
loadfile validate production.dat --cross-file images.opt

# With image path verification
loadfile validate production.dat --cross-file images.opt --image-dir ./IMAGES/
```

Structural checks: field count mismatches, duplicate IDs, empty required fields, empty records, trailing whitespace.
Format-specific checks: DAT field count per row, OPT/LFP missing initial DocBreak, duplicate image IDs, empty paths, DII missing DOCID, EDRM empty tags and missing file references.
Cross-file checks: BATES numbers in DAT exist in OPT, OPT document starts match DAT, image file existence on disk.
Options:
- `--format` — File format: `auto` (default), `dat`, `opt`, `lfp`, `dii`, `edrm_xml`
- `--cross-file` — Image load file (OPT/LFP) for cross-file validation
- `--image-dir` — Directory to verify image file existence
- `--bates-field` — Field name for BATES numbers (auto-detected if omitted)
- `--id-field` — Field name for document ID (auto-detected if omitted)
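The heart of the BATES cross-check is a set-membership pass over the two files' number lists. An illustrative sketch (`missing_bates` is a hypothetical helper, not the library's API):

```python
def missing_bates(dat_bates, opt_bates):
    """Return BATES numbers present in the DAT but absent from the OPT,
    preserving DAT order."""
    opt_set = set(opt_bates)  # O(1) membership tests
    return [b for b in dat_bates if b not in opt_set]

missing_bates(["ABC0001", "ABC0002", "ABC0003"], ["ABC0001", "ABC0003"])
# → ["ABC0002"]
```

The reverse direction (OPT document starts that never appear in the DAT) is the same check with the arguments swapped.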
Auto-detects date fields and shows format distribution. Detects 11 date patterns including MM/DD/YYYY, YYYY-MM-DD, DD-MMM-YYYY, M/D/YY, and variants with time. Warns on mixed formats, unparseable values, and 2-digit years. Currently supports DAT format.
```
loadfile dates production.dat                        # analyze
loadfile dates production.dat --normalize iso        # normalize to YYYY-MM-DD
loadfile dates production.dat --normalize relativity # normalize to DD-MMM-YYYY
loadfile dates data.dat --field DATESENT --field DATECREATED --normalize iso
```

Options:
- `--field` — Date field name(s) to check (repeatable, auto-detected if omitted)
- `--normalize` — Normalize dates to a format: `iso` (YYYY-MM-DD), `us` (MM/DD/YYYY), `relativity` (DD-MMM-YYYY), `compact` (YYYYMMDD)
- `-o, --output` — Output file path (default: `<input>.normalized.<ext>`)
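Date-pattern classification like this can be sketched as a series of `strptime` trials. The subset of patterns below is illustrative, not the tool's actual code (and `%b` month-name parsing is locale-dependent):

```python
from datetime import datetime

# Illustrative subset of the 11 detected patterns
PATTERNS = {
    "%m/%d/%Y": "MM/DD/YYYY",
    "%Y-%m-%d": "YYYY-MM-DD",
    "%d-%b-%Y": "DD-MMM-YYYY",
}

def classify(value: str):
    """Return the first matching pattern label, or None if unparseable."""
    for fmt, label in PATTERNS.items():
        try:
            datetime.strptime(value, fmt)
            return label
        except ValueError:
            continue
    return None
```

Running `classify` over every value in a date field yields the format distribution; more than one label in the result means the field has mixed formats.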
Pre-validates field lengths against review tool limits. Knows Relativity defaults (400 chars for fixed-length, 450 for identifiers, 10MB for long text). Currently supports DAT format.
```
loadfile check-lengths production.dat                     # Relativity defaults
loadfile check-lengths production.dat --target summation  # Summation limits
loadfile check-lengths production.dat --limit 200         # custom limit
```

Options:
- `--target` — Target review tool: `relativity` (default), `summation`, `concordance`
- `--limit` — Override default field length limit for all fields
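The check itself amounts to comparing every value's length against a per-tool limit. A minimal sketch (`over_limit` is a hypothetical helper; the limit shown is just the example default, not a full limits table):

```python
def over_limit(headers, rows, limit=400):
    """Yield (field, row_index, length) for each value exceeding the limit."""
    for i, row in enumerate(rows):
        for field, value in zip(headers, row):
            if len(value) > limit:
                yield field, i, len(value)

# A 500-char custodian note would be flagged against a 400-char limit:
# list(over_limit(["DOCID", "NOTE"], rows)) → [("NOTE", 3, 500), ...]
```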
Parses Bates patterns (prefix + zero-padded number), detects gaps, prefix changes mid-production, and padding inconsistencies. Supports DAT, DII, and EDRM XML.
```
loadfile bates production.dat
loadfile bates production.dat --field BEGBATES
loadfile bates metadata.dii --format dii
```

Options:
- `--field` — Field name containing BATES numbers (auto-detected if omitted)
- `--format` — File format: `auto` (default), `dat`, `dii`, `edrm_xml`
- `--max-gaps` — Maximum number of gaps to report (default: 100)
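Gap detection boils down to parsing out the numeric component and scanning for jumps in the sequence. A simplified sketch assuming a single prefix and consistent padding (`find_gaps` is a hypothetical name; the real tool also tracks prefix changes and padding drift):

```python
import re

BATES = re.compile(r"^([A-Z]+[-_]?)(\d+)$")  # prefix + zero-padded number

def find_gaps(bates_numbers):
    """Return (expected, found) pairs where the numeric sequence jumps."""
    nums = [int(m.group(2)) for b in bates_numbers if (m := BATES.match(b))]
    return [(prev + 1, cur) for prev, cur in zip(nums, nums[1:]) if cur != prev + 1]

find_gaps(["ABC0001", "ABC0002", "ABC0005"])  # → [(3, 5)]
```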
Scans each line independently to find files where UTF-8 and CP1252 are mixed (common when files are assembled from multiple sources).
```
loadfile scan-encoding production.dat
loadfile scan-encoding data.dat --primary utf-8 --fallback cp1252
```

Options:
- `--primary` — Expected primary encoding (auto-detected if omitted)
- `--fallback` — Fallback encoding to try (default: `cp1252`)
- `--max-samples` — Max number of problem lines to show (default: 10)
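Per-line scanning can be sketched as attempting a strict decode line by line and falling back when it fails. Illustrative only (`scan_lines` is a hypothetical name, and the real detector is more careful about ambiguous bytes):

```python
def scan_lines(data: bytes, primary="utf-8", fallback="cp1252"):
    """Classify each line by which encoding decodes it cleanly."""
    results = []
    for i, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode(primary)
            results.append((i, primary))
        except UnicodeDecodeError:
            results.append((i, fallback))  # line came from a different source
    return results
```

Any line tagged with the fallback encoding is a candidate problem line in an otherwise-primary file.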
Detects mixed CR/LF/CRLF and normalizes to a single type. Verifies line count integrity after conversion.
```
loadfile fix-endings production.dat              # default: CRLF (Windows)
loadfile fix-endings production.dat --target lf  # Unix-style
loadfile fix-endings data.dat --target crlf -o output.dat
```

Options:
- `--target` — Target line ending: `crlf` (default, for Windows tools), `lf`
- `-o, --output` — Output file path (default: `<input>.fixed.<ext>`)
- `--force` — Force normalization even if multiline fields are detected
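The core of the normalization is a two-step replace: CRLF must be collapsed before lone CR, or each CRLF would be counted twice. A minimal sketch (`normalize_endings` is a hypothetical name; the real command also verifies line counts afterward):

```python
def normalize_endings(data: bytes, target: bytes = b"\r\n") -> bytes:
    """Collapse CRLF/CR/LF to a single target ending; order matters."""
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", target)

normalize_endings(b"a\r\nb\rc\n", b"\n")  # → b"a\nb\nc\n"
```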
All functionality is available programmatically:
```python
from pathlib import Path

from loadfile.parsers.dat import parse, serialize
from loadfile.encodings.detector import detect_encoding, convert_encoding
from loadfile.converter.convert import convert_format
from loadfile.validators.structural import validate_structure
from loadfile.validators.bates import analyze_bates

# Parse any DAT dialect
with open("production.dat") as f:
    dat_content = f.read()
dat = parse(dat_content)
print(dat.variant)   # pipe_caret (sep='|', qual='^')
print(dat.headers)   # ['Control Number', 'Custodian', ...]
print(dat.rows[0])   # ['REL-0000001', 'Smith, John', ...]

# Convert format
edrm_xml = convert_format(dat_content, target="edrm_xml")

# Stream large files (memory-efficient)
from loadfile.parsers.streaming import stream_dat
for record in stream_dat(Path("huge.dat")):
    print(record["DOCID"])

# Validate (records and opt come from parsing steps like the ones above)
from loadfile.validators.cross_file import validate_cross_files
report = validate_cross_files(metadata_records=records, opt_file=opt)
for issue in report.issues:
    print(issue)
```

```
src/loadfile/
├── cli.py                  # 11 CLI commands
├── format_detect.py        # Auto-detect file format from content
├── encodings/
│   ├── detector.py         # Encoding detection (BOM, UTF-8, chardet, CP1252 heuristic)
│   └── mixed.py            # Line-level mixed encoding detection + fallback
├── parsers/
│   ├── dat.py              # Concordance DAT (9 dialects, auto-detection)
│   ├── opt.py              # Opticon OPT
│   ├── lfp.py              # IPRO LFP
│   ├── dii.py              # Summation DII
│   ├── edrm.py             # EDRM XML (multiple schema variants)
│   └── streaming.py        # Memory-efficient streaming for large files
├── converter/
│   ├── models.py           # Intermediate data model
│   ├── adapters.py         # Format ↔ intermediate conversion
│   └── convert.py          # High-level convert API
├── validators/
│   ├── models.py           # Issue, Severity, ValidationReport
│   ├── structural.py       # Field counts, duplicate IDs, empty fields
│   ├── format_specific.py  # Per-format validation rules
│   ├── cross_file.py       # DAT ↔ OPT cross-referencing
│   ├── bates.py            # Bates number pattern analysis
│   └── field_length.py     # Import limit pre-validation
└── utils/
    ├── dates.py            # Date format detection + normalization
    ├── paths.py            # Path separator normalization
    └── line_endings.py     # Line ending detection + normalization
```
Format conversion uses a hub-and-spoke model: each format converts to/from an intermediate representation, so adding a new format requires only one adapter (not N×N converters).
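The hub-and-spoke pattern can be sketched with a small adapter registry. The formats and "model" below are toy stand-ins (comma/tab text with a list-of-rows hub), not loadfile's actual adapters:

```python
# Registry mapping each format to (to_model, from_model) adapters.
ADAPTERS = {}

def register(fmt, to_model, from_model):
    ADAPTERS[fmt] = (to_model, from_model)

def convert(content, src, dst):
    """Convert src → hub model → dst; any pair of formats works."""
    to_model = ADAPTERS[src][0]
    from_model = ADAPTERS[dst][1]
    return from_model(to_model(content))

# Toy adapters: the hub "model" is just a list of rows.
register("csv", lambda s: [r.split(",") for r in s.splitlines()],
                lambda m: "\n".join(",".join(r) for r in m))
register("tsv", lambda s: [r.split("\t") for r in s.splitlines()],
                lambda m: "\n".join("\t".join(r) for r in m))

convert("a,b\nc,d", "csv", "tsv")  # → "a\tb\nc\td"
```

With N formats, the registry needs N adapter pairs rather than N×(N-1) direct converters, which is why adding a format costs only one adapter.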
```
git clone https://github.com/frisk55frisk/loadfile-tool.git
cd loadfile-tool
pip install -e ".[dev]"
pytest
```

Tests: 169 passing (111 pytest + 19 parser + 39 edge-case), 32 fixture files covering dialect detection, mixed-encoding files, multiline fields, path separator variants, BOM handling, data integrity, and realistic multi-hundred-record roundtrips.
| | loadfile | ReadySuite |
|---|---|---|
| Price | Free (MIT) | $1,599/year |
| Platform | Any (Python CLI) | Windows only |
| Automation | CLI/Python API, scriptable | GUI only |
| Formats | 5 formats, 9 DAT dialects | More formats |
| Data locality | Runs locally, data never leaves your machine | Runs locally |
loadfile is not a full replacement for ReadySuite's GUI and QC workflow. It's a focused tool for the conversion, validation, and normalization layer — the part that happens before you load data into your review platform.
loadfile is currently in beta. We're looking for feedback from eDiscovery professionals using it in practice: bug reports, feature requests, and general feedback are all welcome. If you encounter a load file edge case that loadfile doesn't handle, please open an issue, ideally with a (sanitized) sample file.
During the beta period we are not accepting external code contributions. This may change in the future.
MIT