Add fasta support #7851
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
lhoestq
left a comment
Looks all good! I just left a few minor comments.
    ".ffn": ("fasta", {}),
    ".frn": ("fasta", {}),
}
_EXTENSION_TO_MODULE.update({ext: ("imagefolder", {}) for ext in imagefolder.ImageFolder.EXTENSIONS})
You can also add the upper-case versions of the extensions here.
Huh, but why do none of the other extensions have upper-case versions?
import datasets
from datasets.features import Value
from datasets.table import table_cast
from datasets.utils.file_utils import xopen
No need for `xopen` here: you can use the regular `open` (it gets replaced by `xopen` anyway when needed, because of some old logic).
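A small sketch of the suggestion, reading a FASTA file with the built-in `open`. `count_fasta_records` is a hypothetical helper for illustration, not code from this PR, and the substitution of `open` by `xopen` described above is not exercised here.

```python
def count_fasta_records(path):
    """Count records in a FASTA file by counting header lines (hypothetical helper)."""
    count = 0
    # Plain built-in open; this sketch needs no datasets-specific import.
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                count += 1
    return count
```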
…gingface#7848) remove mode parameter in docstring of pdf and video feature
A few comments:

What @apcamargo said, plus FWIW in our approach (so might not be relevant here) we use polars (with custom fasta io parser) or polars-bio (that has a Which in polars can be solved with:
Add native support for loading FASTA biological sequence files with zero external dependencies. This addresses feedback from PR huggingface#7851.

Key features:
- Pure Python parser based on Heng Li's readfq.py (no BioPython dependency)
- Uses pa.large_string() for sequences to handle UTF-8 overflow with long genomes
- Adaptive byte-based batching (max_batch_bytes=256MB) prevents Parquet page size errors with very large sequences like complete viral genomes
- Supports gzip, bzip2, and xz compression via magic byte detection
- Column filtering: select subset of [id, description, sequence]

Supported extensions: .fa, .fasta, .fna, .ffn, .faa, .frn
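The features listed in this commit message can be sketched roughly as follows. This is an illustrative outline under stated assumptions, not the code from the commit: the parser is a simplified line-based reader rather than readfq.py itself, all function names are hypothetical, and only the `max_batch_bytes` default of 256 MB is taken from the message.

```python
import bz2
import gzip
import lzma

# Leading magic bytes for the three compression formats named above.
_MAGIC = {
    b"\x1f\x8b": gzip.open,      # gzip
    b"BZh": bz2.open,            # bzip2
    b"\xfd7zXZ\x00": lzma.open,  # xz
}


def open_maybe_compressed(path):
    """Open a text handle, choosing a decompressor from the file's first bytes."""
    with open(path, "rb") as raw:
        head = raw.read(6)
    for magic, opener in _MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def _make_record(header, chunks):
    # The id is the first whitespace-delimited token; the rest is the description.
    parts = header.split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description, "".join(chunks)


def iter_fasta(handle):
    """Yield (id, description, sequence) tuples from an open FASTA text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line.strip())
    if header is not None:
        yield _make_record(header, chunks)


def iter_batches(records, max_batch_bytes=256 * 1024 * 1024):
    """Group records so a batch stays within max_batch_bytes of sequence data.

    A single record larger than the limit is emitted as a batch of its own.
    """
    batch, size = [], 0
    for record in records:
        record_bytes = len(record[2].encode("utf-8"))
        if batch and size + record_bytes > max_batch_bytes:
            yield batch
            batch, size = [], 0
        batch.append(record)
        size += record_bytes
    if batch:
        yield batch
```

Batching on bytes rather than on a fixed row count means a file of short reads yields large batches, while a file of complete genomes yields batches of one or a few records.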
lhoestq
left a comment
Coming back to scientific format datasets :)
Just a little comment, and let's check that the CI is green before we merge.
from typing import List as ListT

import pyarrow as pa
from Bio import SeqIO
Can you move this import into `_iter_fasta_records`, since it's an optional dependency?
```py
>>> from datasets import load_dataset
>>> dataset = load_dataset("fasta", data_files="data.fasta")
```
Are there FASTA datasets on the Hub? That would be cool.
This PR adds support for converting FASTA files to Parquet.