FWParser

A simple parser for fixed-width files.

Installation

There are no dependencies required to use this module. Simply run:

pip install --upgrade git+https://github.com/avgra3/fwparser.git

Or

git clone https://github.com/avgra3/fwparser.git
cd ./fwparser
python -m pip install .

Or

uv add git+https://github.com/avgra3/fwparser.git@main

You can change @main to a specific branch. If you leave it out altogether, the main branch is used by default.

Usage

Once you have fwparser installed, you can use it like below:

from fwparser.fwparser import parse_data_file

"""
# Assuming foo.txt contains the below
12345John      Doe       123 Main St         1234567890
"""
FIXED_WIDTH_FILE = "foo.txt"

# Needed in order to parse
DATA_OUTLINE = {
    "customer_id": (1, 5),
    "first_name": (6, 10),
    "last_name": (16, 10),
    "address": (26, 20),
    "phone_number": (46, 10),
}


data = parse_data_file(
        raw_data_file=FIXED_WIDTH_FILE,
        header_config=DATA_OUTLINE,
        trim_white_space=True, # This makes the result cleaner
        offset=1, # Only use if your config does not have an index at zero
)

print(data)
"""
customer_id,first_name,last_name,address,phone_number
12345,John,Doe,123 Main St,1234567890
"""

With the above, you can then either save the data to a file or use another package like pandas or polars to work more with the data.
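
As a minimal sketch of that next step, the snippet below uses only the standard library to save the result and to read it back as rows. It assumes `data` is a comma-separated string like the printed output above; the README does not state the return type, so treat that as an assumption. The output file name `foo.csv` is hypothetical.

```python
import csv
import io
import tempfile
from pathlib import Path

# Assumption: `data` is CSV text shaped like the printed output above.
data = (
    "customer_id,first_name,last_name,address,phone_number\n"
    "12345,John,Doe,123 Main St,1234567890\n"
)

# Save the parsed text to a file (hypothetical path in the temp directory).
out_path = Path(tempfile.gettempdir()) / "foo.csv"
with open(out_path, "w", newline="", encoding="utf-8") as handle:
    handle.write(data)

# Or turn it into a list of dicts without any third-party package.
rows = list(csv.DictReader(io.StringIO(data)))
print(rows[0]["first_name"])
```

The same string can also be handed to `pandas.read_csv(io.StringIO(data))`. Passing `dtype=str` there keeps leading zeros in IDs or phone numbers intact.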

Optional Dependencies

You can optionally install the package with TOML, pandas, or Polars support.

# With Toml support
pip install --upgrade "fwparser[toml] @ git+https://github.com/avgra3/fwparser.git"

# With Pandas support -- includes toml support
pip install --upgrade "fwparser[pandas] @ git+https://github.com/avgra3/fwparser.git"

# With Polars support -- includes toml support
pip install --upgrade "fwparser[polars] @ git+https://github.com/avgra3/fwparser.git"

# With Polars, Pandas, and Toml support
pip install --upgrade "fwparser[all] @ git+https://github.com/avgra3/fwparser.git"

Multiprocessing

If you have a large file, on the order of 1 million lines and/or 100+ fields of data, you should consider using the multiprocessing parser in fwparser.speedy. See the Benchmarking section for how these functions scale.

There are some things to note before running this method:

Do you have a low core count available to you (fewer than 2)? If so, this method may actually take longer than the single-process version because of the overhead of orchestrating the worker processes.
Is your file large enough for this to make sense, in row count and/or field count? As a rough guide, that means around 1 million+ lines and/or around 50+ fields.
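
The two questions above can be folded into a quick check before picking a parser. This is only a rough sketch: `should_use_multiprocessing` is a hypothetical helper, not part of fwparser, and its thresholds are the approximate figures quoted above rather than measured cut-offs.

```python
import os


def should_use_multiprocessing(row_count, field_count, cpu_count=None):
    """Rule of thumb from the notes above (hypothetical helper, not in fwparser)."""
    if cpu_count is None:
        # os.cpu_count() can return None when the count is unknown.
        cpu_count = os.cpu_count() or 1
    if cpu_count < 2:
        # Process orchestration overhead likely outweighs any gain.
        return False
    # Approximate thresholds taken from the text: ~1 million rows or ~50 fields.
    return row_count >= 1_000_000 or field_count >= 50


print(should_use_multiprocessing(row_count=2_000_000, field_count=12, cpu_count=4))
print(should_use_multiprocessing(row_count=10_000, field_count=12, cpu_count=4))
```

The first call prints `True` and the second prints `False`. Leaving `cpu_count` unset makes the helper use whatever `os.cpu_count()` reports on the current machine.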

Here is an example:

from fwparser.speedy import FastFwparser

"""
# Assuming foo.txt contains the below
12345John      Doe       123 Main St         1234567890
"""
FIXED_WIDTH_FILE = "foo.txt"

# Needed in order to parse
DATA_OUTLINE = {
    "customer_id": (1, 5),
    "first_name": (6, 10),
    "last_name": (16, 10),
    "address": (26, 20),
    "phone_number": (46, 10),
}

TRIM_WHITESPACE = True
OFFSET = 1
ENCLOSED_BY = ""
SEP = ","
LINE_ENDING = "\r\n"
MAX_CPUS = 4

fwparser_object = FastFwparser(
        data=FIXED_WIDTH_FILE,
        header_config=DATA_OUTLINE,
        trim_whitespace=TRIM_WHITESPACE,
        offset=OFFSET,
        enclosed_by=ENCLOSED_BY,
        sep=SEP,
        line_ending=LINE_ENDING,
        max_cpu=MAX_CPUS,
)

data = fwparser_object.parse_data_file()

print(data)

"""
customer_id,first_name,last_name,address,phone_number
12345,John,Doe,123 Main St,1234567890
.....
"""

Comparing Single to Multi Processing

The single- and multi-process functions were compared on a fixed-width file with 1 million records. The CPU used was an AMD Ryzen 5 2500U with 4 cores and 8 threads. This is not a particularly powerful CPU, but the benchmark highlights that as more cores are used, performance does not scale linearly (as expected).

The benchmark results show that using the multiprocessing function with a core count below 2 results in worse performance than the single-process option.

Why

You might be wondering why this package exists. Simply, I found that the pandas and polars implementations for parsing file types like these were clunky and not the main focus of those projects.

As the parser currently works, its output can easily be loaded into a pandas or Polars DataFrame.

The base implementation of this project does not use any external dependencies. As long as you have a Python version >=3.8, this project should work for your needs.
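
To illustrate why no external dependencies are needed, fixed-width parsing comes down to plain string slicing. The sketch below is not fwparser's actual implementation. `slice_fixed_width` is a hypothetical function, and it assumes each config entry is a `(start, length)` pair whose start is shifted by `offset` to get a zero-based index, which matches the sample line and config in the Usage section.

```python
def slice_fixed_width(line, header_config, offset=1, trim=True):
    """Cut one fixed-width line into a dict of field name -> value."""
    record = {}
    for name, (start, length) in header_config.items():
        begin = start - offset  # convert the configured start to a zero-based index
        value = line[begin:begin + length]
        record[name] = value.strip() if trim else value
    return record


DATA_OUTLINE = {
    "customer_id": (1, 5),
    "first_name": (6, 10),
    "last_name": (16, 10),
    "address": (26, 20),
    "phone_number": (46, 10),
}

# Build the sample line by padding each field to its configured width.
line = (
    "12345"
    + "John".ljust(10)
    + "Doe".ljust(10)
    + "123 Main St".ljust(20)
    + "1234567890"
)

print(slice_fixed_width(line, DATA_OUTLINE))
```

With `trim=False` the padding spaces are kept, which can be useful when writing the values back out in fixed-width form.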

Issues/Bugs

If you find any issues while using this module, feel free to open an issue, or open a pull request with a fix.

Benchmarking

From the project source directory, run the command make benchmark. The benchmark will run and all results will be written to the Benchmark_Results directory.

Generating the test file takes a while. If you have already run the benchmark, the creation of the file will be skipped.
