feat(fasta): add lightweight FASTA file format support by behroozazarkhalili · Pull Request #7923 · huggingface/datasets

behroozazarkhalili · 2025-12-31T19:33:00Z

Summary

This PR adds support for loading FASTA files directly with load_dataset(), addressing feedback from #7851.

FASTA is a text-based format for representing nucleotide sequences (DNA/RNA) or peptide sequences (proteins), widely used in bioinformatics.

Key Features

Zero external dependencies - Uses a lightweight pure Python parser based on readfq.py by Heng Li
Streaming support - Generator-based parsing for memory efficiency with large genomic files
Compression support - Automatic detection and handling of gzip, bzip2, and xz compressed files via magic bytes
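Large sequence support - Uses the large_string Arrow type (64-bit offsets) for the sequence column, so viral genomes and other long sequences do not overflow the 32-bit offsets of the regular string type (the "UTF-8 overflow" reported in #7851)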
Adaptive batching - max_batch_bytes parameter (default 256MB) prevents Parquet page size errors with very large sequences
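
The magic-byte detection mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: the helper name `open_maybe_compressed` and the `MAGIC_OPENERS` table are assumptions made for the example.

```python
import bz2
import gzip
import lzma

# Leading bytes that identify each compressed container format.
MAGIC_OPENERS = (
    (b"\x1f\x8b", gzip.open),       # gzip
    (b"BZh", bz2.open),             # bzip2
    (b"\xfd7zXZ\x00", lzma.open),   # xz
)


def open_maybe_compressed(path):
    """Open `path` in text mode, decompressing if a known signature matches.

    Hypothetical helper: the file extension is ignored and only the first
    bytes of the file decide which opener is used.
    """
    with open(path, "rb") as raw:
        head = raw.read(6)  # the longest signature above is 6 bytes
    for magic, opener in MAGIC_OPENERS:
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    # No signature matched: treat the file as plain text.
    return open(path, "rt", encoding="utf-8")
```

Checking the content rather than the extension means a gzipped file named `sequences.fa` would still be read correctly under this approach.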

Technical Decisions (Addressing #7851 Feedback)

| Concern | Solution |
| --- | --- |
| Long sequences → UTF-8 overflow (@apcamargo, @UriNeri) | Uses `pa.large_string()` for sequence column |
| BioPython is overkill (@apcamargo) | Pure Python parser based on Heng Li's readfq.py |
| Parquet page size limit i32::MAX (@UriNeri) | Adaptive dual-threshold batching with `max_batch_bytes` |
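
One way to read "dual-threshold batching" is that a batch is flushed when it reaches either a row limit or a byte limit. The sketch below follows that reading and may differ from the PR's actual logic: `iter_batches` and `max_rows` are hypothetical names, and only `max_batch_bytes` with its 256MB default comes from the description above.

```python
def iter_batches(records, max_rows=10_000, max_batch_bytes=256 * 1024 * 1024):
    """Group records into batches bounded by row count and by sequence bytes.

    `records` is any iterable of dicts with a "sequence" key. A record larger
    than `max_batch_bytes` is still emitted, alone in its own batch.
    """
    batch, batch_bytes = [], 0
    for record in records:
        size = len(record["sequence"].encode("utf-8"))
        # Flush before adding if either threshold would be exceeded.
        if batch and (len(batch) >= max_rows or batch_bytes + size > max_batch_bytes):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(record)
        batch_bytes += size
    if batch:
        yield batch
```

Because the function is a generator, it keeps at most one batch in memory, which fits the streaming goal listed under Key Features.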

Columns

| Column | Type | Description |
| --- | --- | --- |
| `id` | string | Sequence identifier (first word after `>`) |
| `description` | string | Full description line (everything after id) |
| `sequence` | large_string | The biological sequence (DNA/RNA/protein) |
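
To make the column definitions concrete, here is a simplified generator that turns FASTA lines into records with these three fields. It is an illustrative sketch rather than the readfq.py-based parser in this PR, and the names `iter_fasta` and `_make_record` are assumptions.

```python
def _make_record(header, chunks):
    """Split a header into id and description, and join the sequence lines."""
    parts = header.split(None, 1)  # split on the first run of whitespace
    return {
        "id": parts[0] if parts else "",
        "description": parts[1].strip() if len(parts) > 1 else "",
        "sequence": "".join(chunks),
    }


def iter_fasta(lines):
    """Yield one record per '>' header from an iterable of text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence data may be wrapped over many lines; collect the pieces.
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)
```

Passing an open file handle as `lines` keeps parsing lazy, so only the current record is held in memory.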

Supported Extensions

.fa, .fasta, .fna, .ffn, .faa, .frn (and compressed variants)

Usage

from datasets import load_dataset

# Load FASTA file
dataset = load_dataset("fasta", data_files="sequences.fasta")

# Load with column filtering
dataset = load_dataset("fasta", data_files="sequences.fa", columns=["id", "sequence"])

# Load gzipped file
dataset = load_dataset("fasta", data_files="sequences.fa.gz")

# Configure batching for very large genomes
dataset = load_dataset("fasta", data_files="genome.fasta", max_batch_bytes=128*1024*1024)

Test Plan

All 22 tests passing.

cc: @georgia-hf

Add native support for loading FASTA biological sequence files with zero external dependencies. This addresses feedback from PR huggingface#7851.

Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to handle UTF-8 overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]

Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn

FastaConfig was missing the __post_init__ method that calls super().__post_init__(). This is required to inherit BuilderConfig's validation for:
- Invalid config name characters (InvalidConfigName)
- data_files type validation (ValueError)

This aligns with the pattern used in ArrowConfig, MmcifFolderConfig, and other packaged module configs. Also includes minor style formatting from ruff.
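
The `__post_init__` chaining pattern named in this commit can be sketched with stand-in dataclasses. Everything here is hypothetical: `BaseConfig` only imitates the kind of checks `BuilderConfig` performs (and raises `ValueError` instead of `InvalidConfigName`), and the `max_batch_bytes` check in `FastaLikeConfig` is added purely to show a subclass hook that still delegates to its parent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BaseConfig:
    """Stand-in for a builder config base class (not the real BuilderConfig)."""

    name: str = "default"
    data_files: Optional[dict] = None

    def __post_init__(self):
        # Simplified versions of the validations described in the commit.
        if any(ch in self.name for ch in '<>:/\\|?*'):
            raise ValueError(f"invalid config name: {self.name!r}")
        if self.data_files is not None and not isinstance(self.data_files, dict):
            raise ValueError("data_files must be a dict")


@dataclass
class FastaLikeConfig(BaseConfig):
    """Hypothetical subclass with its own post-init hook."""

    columns: Optional[list] = None
    max_batch_bytes: int = 256 * 1024 * 1024

    def __post_init__(self):
        # Run the parent's validation first, then any subclass checks.
        super().__post_init__()
        if self.max_batch_bytes <= 0:
            raise ValueError("max_batch_bytes must be positive")
```

With this structure, `FastaLikeConfig(name="bad/name")` is rejected by the parent's check even though the subclass defines its own hook.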

behroozazarkhalili wants to merge 2 commits into huggingface:main from behroozazarkhalili:feat/fasta-support
