
loadfile

eDiscovery load file conversion, validation, and normalization CLI.

The problem: eDiscovery load files come in multiple formats with inconsistent delimiters, encodings, date formats, and line endings. Converting between formats and validating data integrity before importing into review tools (Relativity, Everlaw, etc.) currently requires either ReadySuite ($1,599/year, Windows-only) or painful manual work.

loadfile is a Python CLI that handles all of this. It auto-detects formats and dialects, converts between them, validates structural integrity, and normalizes the messy edge cases that cause silent import failures.

Quick Start

pip install loadfile
# What format and encoding is this file?
loadfile inspect production.dat

# Convert encoding from CP1252 to UTF-8
loadfile convert production.dat --to utf-8

# Convert all files in a directory
loadfile batch ./case_files --to utf-8 --recursive

# Convert between formats (e.g., DII → DAT)
loadfile transform metadata.dii --to dat

# Validate before importing
loadfile validate production.dat --cross-file images.opt

# Check if fields will be truncated in Relativity
loadfile check-lengths production.dat

# Normalize mixed date formats to ISO
loadfile dates production.dat --normalize iso

# Analyze Bates numbering for gaps
loadfile bates production.dat

Supported Formats

| Format | Description |
| --- | --- |
| Concordance DAT | Metadata load file (9 dialect variants) |
| Opticon OPT | Image-level load file |
| IPRO LFP | Image load file (command-code format) |
| Summation DII | Token-based metadata load file |
| EDRM XML | Industry-standard XML schema |

DAT Dialect Auto-Detection

Most tools produce slightly different DAT files. loadfile auto-detects and handles all of them:

| Dialect | Separator | Qualifier | Used By |
| --- | --- | --- | --- |
| concordance | þ | ® | Concordance, IPRO, most vendors |
| pipe_caret | \| | ^ | Relativity default |
| pipe_quote | \| | " | Everlaw, some vendors |
| pipe_bare | \| | (none) | Simple exports |
| tab_quote | \t | " | Spreadsheet exports |
| tab_bare | \t | (none) | TSV-style exports |
| comma_quote | , | " | CSV-style .dat files |
| pipe_thorn | \| | þ | Rare vendor variant |
| ascii_ctrl | \x14 | \xFE | Legacy ASCII control |
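The detection internals aren't documented here, but the core idea can be sketched in a few lines: score each candidate separator against the header line and use the qualifier to break ties. This is an illustrative stdlib sketch under assumed heuristics, not loadfile's actual algorithm:

```python
# Illustrative sketch of DAT dialect detection (NOT loadfile's real code):
# count how often each candidate separator appears in the header line,
# and give a bonus when the expected qualifier opens the first field.

DIALECTS = {
    "concordance": ("\u00fe", "\u00ae"),   # þ / ®
    "pipe_caret":  ("|", "^"),
    "pipe_quote":  ("|", '"'),
    "tab_quote":   ("\t", '"'),
    "comma_quote": (",", '"'),
}

def guess_dialect(header_line: str) -> str:
    best, best_score = "pipe_bare", 0
    for name, (sep, qual) in DIALECTS.items():
        score = header_line.count(sep)
        if score and header_line.lstrip().startswith(qual):
            score += 1  # qualifier match breaks ties between same-separator dialects
        if score > best_score:
            best, best_score = name, score
    return best
```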

Commands

inspect — View file structure

Auto-detects format, encoding, and dialect. Works with all 5 formats.

loadfile inspect production.dat
File:       production.dat
Size:       45,678 bytes
Encoding:   cp1252 (73%)
BOM:        no
Format:     dat

Variant:    concordance (sep='þ', qual='®')
Fields:     12
Records:    1,847

Headers:
    1. DOCID
    2. BEGBATES
    3. CUSTODIAN
    ...

Options:

  • --max-rows — Number of records to preview (default: 5)
  • --format — File format: auto (default), dat, opt, lfp, dii, edrm_xml

detect — Identify encoding

loadfile detect production.dat

Options:

  • --sample-size — Bytes to sample for detection (default: 65536)

convert — Change encoding

Supports UTF-8, UTF-8 BOM, CP1252, Latin-1, UTF-16 LE/BE. Auto-detects source encoding.

loadfile convert production.dat --to utf-8
loadfile convert production.dat --from cp1252 --to utf-8 -o output.dat
loadfile convert production.dat --to utf-8-sig   # with BOM

Options:

  • -t, --to — Target encoding (required)
  • -f, --from — Source encoding (auto-detected if omitted)
  • -o, --output — Output file path (default: <input>.converted)
  • --add-bom — Add UTF-8 BOM to output
  • --strip-bom — Remove BOM from output
  • --on-error — How to handle unmappable characters: strict (default, raises error), replace, ignore
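The --on-error modes map directly onto Python's codec error handlers; a minimal stdlib sketch of the same behavior (illustrative, not loadfile's implementation):

```python
# Illustrative encoding conversion using only the stdlib. The --on-error
# modes correspond to Python's codec error handlers: "strict" raises on
# unmappable characters, "replace" substitutes a placeholder, "ignore"
# silently drops them.

def convert_bytes(data: bytes, src: str, dst: str, on_error: str = "strict") -> bytes:
    text = data.decode(src, errors=on_error)
    return text.encode(dst, errors=on_error)
```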

batch — Convert a directory

loadfile batch ./case_files --to utf-8 --recursive --dry-run   # preview
loadfile batch ./case_files --to utf-8 --recursive              # execute
loadfile batch ./data --to utf-8 --pattern "*.dat" --recursive

Skips files already in the target encoding. Preserves subdirectory structure.

Options:

  • -t, --to — Target encoding (required)
  • -f, --from — Source encoding (auto-detected per file if omitted)
  • -o, --output-dir — Output directory (default: <input_dir>/converted/)
  • --pattern — Glob pattern for matching files (default: *)
  • -r, --recursive — Search subdirectories
  • --add-bom — Add UTF-8 BOM to output
  • --strip-bom — Remove BOM from output
  • --on-error — How to handle unmappable characters: strict (default, raises error), replace, ignore
  • --dry-run — Detect encodings only, don't convert

transform — Convert between formats

Metadata formats (DAT ↔ DII ↔ EDRM XML) and image formats (OPT ↔ LFP) convert to each other via an intermediate data model.

loadfile transform metadata.dii --to dat
loadfile transform production.dat --to edrm_xml
loadfile transform images.opt --to lfp

Options:

  • -t, --to — Target format: dat, dii, edrm_xml, opt, lfp (required)
  • -f, --from — Source format (auto-detected if omitted)
  • -o, --output — Output file path (default: input file with new extension)

validate — Check for errors

Runs three layers of validation:

# Single file — structural + format-specific checks
loadfile validate production.dat

# Cross-file checks (BATES in DAT vs OPT)
loadfile validate production.dat --cross-file images.opt

# With image path verification
loadfile validate production.dat --cross-file images.opt --image-dir ./IMAGES/

Structural checks: field count mismatches, duplicate IDs, empty required fields, empty records, trailing whitespace.

Format-specific checks: DAT field count per row, OPT/LFP missing initial DocBreak, duplicate image IDs, empty paths, DII missing DOCID, EDRM empty tags and missing file references.

Cross-file checks: BATES numbers in DAT exist in OPT, OPT document starts match DAT, image file existence on disk.

Options:

  • --format — File format: auto (default), dat, opt, lfp, dii, edrm_xml
  • --cross-file — Image load file (OPT/LFP) for cross-file validation
  • --image-dir — Directory to verify image file existence
  • --bates-field — Field name for BATES numbers (auto-detected if omitted)
  • --id-field — Field name for document ID (auto-detected if omitted)
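The cross-file Bates comparison boils down to set differences between the two files; an illustrative sketch (function and key names are assumptions, not loadfile's API):

```python
# Illustrative cross-file check: every BEGBATES value in the metadata
# (DAT) records should have a matching document start in the OPT file,
# and vice versa. Names are assumed for illustration.

def cross_check(dat_bates: list[str], opt_doc_starts: list[str]) -> dict:
    dat_set, opt_set = set(dat_bates), set(opt_doc_starts)
    return {
        "missing_images": sorted(dat_set - opt_set),  # metadata with no image
        "orphan_images":  sorted(opt_set - dat_set),  # image with no metadata
    }
```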

dates — Analyze and normalize dates

Auto-detects date fields and shows format distribution. Detects 11 date patterns including MM/DD/YYYY, YYYY-MM-DD, DD-MMM-YYYY, M/D/YY, and variants with time. Warns on mixed formats, unparseable values, and 2-digit years. Currently supports DAT format.

loadfile dates production.dat                        # analyze
loadfile dates production.dat --normalize iso         # normalize to YYYY-MM-DD
loadfile dates production.dat --normalize relativity   # normalize to DD-MMM-YYYY
loadfile dates data.dat --field DATESENT --field DATECREATED --normalize iso

Options:

  • --field — Date field name(s) to check (repeatable, auto-detected if omitted)
  • --normalize — Normalize dates to a format: iso (YYYY-MM-DD), us (MM/DD/YYYY), relativity (DD-MMM-YYYY), compact (YYYYMMDD)
  • -o, --output — Output file path (default: <input>.normalized.<ext>)
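Normalization amounts to trying known patterns in order and re-emitting in the target format; a stdlib sketch covering a few of the patterns (the pattern list here is abbreviated and illustrative, not loadfile's actual table of 11):

```python
from datetime import datetime

# Illustrative date normalization: try a handful of the patterns the
# README lists and re-emit as ISO (YYYY-MM-DD).

PATTERNS = ["%m/%d/%Y", "%Y-%m-%d", "%d-%b-%Y", "%m/%d/%y", "%Y%m%d"]

def to_iso(value: str):
    for pat in PATTERNS:
        try:
            return datetime.strptime(value.strip(), pat).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # unparseable -- loadfile would flag this as a warning
```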

check-lengths — Prevent truncation on import

Pre-validates field lengths against review tool limits. Knows Relativity defaults (400 chars for fixed-length, 450 for identifiers, 10MB for long text). Currently supports DAT format.

loadfile check-lengths production.dat                      # Relativity defaults
loadfile check-lengths production.dat --target summation    # Summation limits
loadfile check-lengths production.dat --limit 200           # custom limit

Options:

  • --target — Target review tool: relativity (default), summation, concordance
  • --limit — Override default field length limit for all fields
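The check itself is a straightforward comparison of value lengths against per-field caps; an illustrative sketch using the Relativity defaults quoted above (function and parameter names are assumptions):

```python
# Illustrative field-length pre-check. Relativity defaults quoted in the
# README: 400 chars for fixed-length fields, 450 for identifier fields.

def overlong_values(headers, rows, limit=400, id_fields=(), id_limit=450):
    hits = []
    for row in rows:
        for name, value in zip(headers, row):
            cap = id_limit if name in id_fields else limit
            if len(value) > cap:
                hits.append((name, len(value), cap))  # field, actual, limit
    return hits
```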

bates — Analyze Bates numbering

Parses Bates patterns (prefix + zero-padded number), detects gaps, prefix changes mid-production, and padding inconsistencies. Supports DAT, DII, and EDRM XML.

loadfile bates production.dat
loadfile bates production.dat --field BEGBATES
loadfile bates metadata.dii --format dii

Options:

  • --field — Field name containing BATES numbers (auto-detected if omitted)
  • --format — File format: auto (default), dat, dii, edrm_xml
  • --max-gaps — Maximum number of gaps to report (default: 100)
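Gap detection splits each Bates number into a prefix and a zero-padded integer, then looks for non-consecutive values within a prefix; an illustrative sketch (not loadfile's implementation):

```python
import re

# Illustrative Bates gap detection: parse prefix + number, then report
# the missing range between consecutive entries sharing a prefix.

BATES_RE = re.compile(r"^([A-Za-z]+)(\d+)$")

def find_gaps(bates: list[str]) -> list[tuple[str, int, int]]:
    parsed = []
    for b in bates:
        m = BATES_RE.match(b)
        if m:
            parsed.append((m.group(1), int(m.group(2))))
    gaps = []
    for (p1, n1), (p2, n2) in zip(parsed, parsed[1:]):
        if p1 == p2 and n2 > n1 + 1:
            gaps.append((p1, n1 + 1, n2 - 1))  # prefix, first and last missing
    return gaps
```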

scan-encoding — Detect mixed encodings

Scans each line independently to find files where UTF-8 and CP1252 are mixed (common when files are assembled from multiple sources).

loadfile scan-encoding production.dat
loadfile scan-encoding data.dat --primary utf-8 --fallback cp1252

Options:

  • --primary — Expected primary encoding (auto-detected if omitted)
  • --fallback — Fallback encoding to try (default: cp1252)
  • --max-samples — Max number of problem lines to show (default: 10)
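The per-line scan can be sketched with the stdlib: lines that fail to decode as UTF-8 but succeed under the fallback are the suspects (illustrative sketch, not loadfile's implementation):

```python
# Illustrative per-line mixed-encoding scan: report 1-based line numbers
# that are not valid UTF-8 but do decode under the fallback encoding.

def suspect_lines(raw: bytes, fallback: str = "cp1252") -> list[int]:
    suspects = []
    for i, line in enumerate(raw.splitlines(), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError:
            line.decode(fallback)  # raises only if the line is junk in both
            suspects.append(i)
    return suspects
```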

fix-endings — Normalize line endings

Detects mixed CR/LF/CRLF and normalizes to a single type. Verifies line count integrity after conversion.

loadfile fix-endings production.dat                    # default: CRLF (Windows)
loadfile fix-endings production.dat --target lf         # Unix-style
loadfile fix-endings data.dat --target crlf -o output.dat

Options:

  • --target — Target line ending: crlf (default, for Windows tools), lf
  • -o, --output — Output file path (default: <input>.fixed.<ext>)
  • --force — Force normalization even if multiline fields are detected
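The normalize-then-verify approach can be sketched in a few lines. Note that this naive version ignores multiline fields inside qualifiers, which is exactly the case loadfile guards behind --force (illustrative sketch, not loadfile's implementation):

```python
# Illustrative line-ending normalization with a line-count integrity
# check. bytes.splitlines() splits on \r, \n, and \r\n, so mixed
# endings collapse to one list of lines.

def normalize_endings(raw: bytes, target: bytes = b"\r\n") -> bytes:
    lines = raw.splitlines()
    out = target.join(lines)
    if raw.endswith((b"\r", b"\n")):
        out += target                  # preserve a trailing newline
    assert out.splitlines() == lines   # integrity: same lines before and after
    return out
```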

Python API

All functionality is available programmatically:

from pathlib import Path

from loadfile.parsers.dat import parse, serialize
from loadfile.encodings.detector import detect_encoding, convert_encoding
from loadfile.converter.convert import convert_format
from loadfile.validators.structural import validate_structure
from loadfile.validators.bates import analyze_bates

# Parse any DAT dialect
dat = parse(Path("production.dat").read_text())
print(dat.variant)   # pipe_caret (sep='|', qual='^')
print(dat.headers)   # ['Control Number', 'Custodian', ...]
print(dat.rows[0])   # ['REL-0000001', 'Smith, John', ...]

# Convert format
edrm_xml = convert_format(dat_content, target="edrm_xml")

# Stream large files (memory-efficient)
from loadfile.parsers.streaming import stream_dat
for record in stream_dat(Path("huge.dat")):
    print(record["DOCID"])

# Validate
from loadfile.validators.cross_file import validate_cross_files
report = validate_cross_files(metadata_records=records, opt_file=opt)
for issue in report.issues:
    print(issue)

Architecture

src/loadfile/
├── cli.py                 # 11 CLI commands
├── format_detect.py       # Auto-detect file format from content
├── encodings/
│   ├── detector.py        # Encoding detection (BOM, UTF-8, chardet, CP1252 heuristic)
│   └── mixed.py           # Line-level mixed encoding detection + fallback
├── parsers/
│   ├── dat.py             # Concordance DAT (9 dialects, auto-detection)
│   ├── opt.py             # Opticon OPT
│   ├── lfp.py             # IPRO LFP
│   ├── dii.py             # Summation DII
│   ├── edrm.py            # EDRM XML (multiple schema variants)
│   └── streaming.py       # Memory-efficient streaming for large files
├── converter/
│   ├── models.py          # Intermediate data model
│   ├── adapters.py        # Format ↔ intermediate conversion
│   └── convert.py         # High-level convert API
├── validators/
│   ├── models.py          # Issue, Severity, ValidationReport
│   ├── structural.py      # Field counts, duplicate IDs, empty fields
│   ├── format_specific.py # Per-format validation rules
│   ├── cross_file.py      # DAT ↔ OPT cross-referencing
│   ├── bates.py           # Bates number pattern analysis
│   └── field_length.py    # Import limit pre-validation
└── utils/
    ├── dates.py           # Date format detection + normalization
    ├── paths.py           # Path separator normalization
    └── line_endings.py    # Line ending detection + normalization

Format conversion uses a hub-and-spoke model: each format converts to/from an intermediate representation, so adding a new format requires only one adapter (not N×N converters).
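A minimal sketch of that model (all names here are illustrative assumptions, not loadfile's actual classes, and the DII output is heavily simplified):

```python
from dataclasses import dataclass, field

# Illustrative hub-and-spoke conversion: each format needs only a
# to-model / from-model adapter pair against one intermediate model,
# so N formats need N adapters instead of N*N converters.

@dataclass
class Document:
    doc_id: str
    fields: dict = field(default_factory=dict)

def dat_to_model(headers, rows):
    # Assumes the first column is the document ID (illustrative only).
    return [Document(row[0], dict(zip(headers, row))) for row in rows]

def model_to_dii(docs):
    # DII is token-based; @T opening each document is a simplification.
    return "\n".join(f"@T {d.doc_id}" for d in docs)
```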

Development

git clone https://github.com/frisk55frisk/loadfile-tool.git
cd loadfile-tool
pip install -e ".[dev]"
pytest

Tests: 169 passing (111 pytest + 19 parser + 39 edge-case), with 32 fixture files covering dialect detection, mixed-encoding files, multiline fields, path separator variants, BOM handling, data integrity, and realistic multi-hundred-record roundtrips.

Why Not ReadySuite?

| | loadfile | ReadySuite |
| --- | --- | --- |
| Price | Free (MIT) | $1,599/year |
| Platform | Any (Python CLI) | Windows only |
| Automation | CLI / Python API, scriptable | GUI only |
| Formats | 5 formats, 9 DAT dialects | More formats |
| Data locality | Runs locally; data never leaves your machine | Runs locally |

loadfile is not a full replacement for ReadySuite's GUI and QC workflow. It's a focused tool for the conversion, validation, and normalization layer — the part that happens before you load data into your review platform.

Beta

loadfile is currently in beta. We're looking for feedback from eDiscovery professionals using loadfile in practice. If you find a bug, encounter a file format we can't handle, or have a feature request, please open an issue.

Contributing

Bug reports, feature requests, and feedback are welcome. If you encounter a load file edge case that loadfile doesn't handle, please open an issue, ideally with a (sanitized) sample file.

During the beta period we are not accepting external code contributions. This may change in the future.

License

MIT
