feat: Add mmCIF file support for macromolecular structures #7925
Open
behroozazarkhalili wants to merge 3 commits into
Add support for loading mmCIF (macromolecular Crystallographic Information File) format directly with load_dataset(). mmCIF is the modern standard for 3D macromolecular structures used by PDB since 2014.
Key features:
- Zero external dependencies: Pure Python parser for CIF syntax
- Streaming support: Generator-based parsing for large structure files
- Compression support: Auto-detection of gzip, bzip2, xz compressed files
- ML-ready output: Atomic coordinates suitable for structure-based ML models
Configuration options:
- columns: Select subset of atom_site columns (default: 11 common columns)
- include_hetatm: Option to exclude ligand/water HETATM records
- batch_size: Control atoms per batch (default: 100000)
Supported extensions: .cif, .mmcif (and compressed variants)
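To make the streaming and compression points concrete, here is a minimal sketch of how a dependency-free reader for the atom_site table could look. It is not the code from this PR: the names open_maybe_compressed and iter_atom_site are hypothetical, the magic-byte check is only one plausible way to auto-detect compression, and the tokenizer is deliberately simplified (whitespace splitting only, no quoted or multi-line values).

```python
import bz2
import gzip
import lzma

# Leading magic bytes of the three compression formats named above.
_MAGIC = {
    b"\x1f\x8b": gzip.open,
    b"BZh": bz2.open,
    b"\xfd7zXZ\x00": lzma.open,
}


def open_maybe_compressed(path):
    """Open `path` as text, decompressing gzip/bzip2/xz if detected.

    Hypothetical helper: detection is by magic bytes, not by extension.
    """
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, opener in _MAGIC.items():
        if head.startswith(magic):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def iter_atom_site(lines, include_hetatm=True):
    """Lazily yield one dict per row of the first `_atom_site` loop.

    Simplification: values are split on whitespace, so quoted values
    containing spaces and multi-line text fields are not handled.
    """
    columns = []
    in_loop = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no data
        if line == "loop_":
            if columns:
                return  # a new loop starts, so the atom_site table is over
            in_loop = True
            continue
        if in_loop and line.startswith("_atom_site."):
            columns.append(line.split(".", 1)[1])  # collect column names
            continue
        if columns:
            if line.startswith("_") or line.startswith("data_"):
                return  # next category or data block
            row = dict(zip(columns, line.split()))
            if include_hetatm or row.get("group_PDB") != "HETATM":
                yield row
        else:
            in_loop = False  # a loop or item of some other category
```

Because iter_atom_site is a generator over lines, it can be fed the file object returned by open_maybe_compressed without reading the whole structure into memory; batching rows (for example 100000 at a time) would be layered on top.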
This was referenced Dec 31, 2025
This refactors the mmCIF loader to follow the ImageFolder pattern, where
each row in the dataset contains one complete protein structure file.
This is the recommended ML-friendly approach for working with structural data.
Key changes:
- Add ProteinStructure feature type for handling protein structure files
  - Supports lazy loading (decode=False) or full content (decode=True)
  - Works with both PDB and mmCIF formats
- Rewrite MmcifFolder to extend FolderBasedBuilder
  - Supports folder-based labels (like ImageFolder)
  - Supports metadata.csv files for additional columns
  - Uses ProteinStructure as BASE_FEATURE
- Fix bug in FolderBasedBuilder._generate_examples where drop_metadata
  would fail with IndexError when metadata files were in the files list
  - Root cause: enumerate(files) created gaps in shard_idx when files
    were skipped due to extension filtering
  - Solution: Use separate valid_shard_idx counter that only increments
    when samples are actually yielded
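The index-gap problem described in the last item can be reproduced in isolation. The sketch below is not the FolderBasedBuilder code; the two generator functions and the file list are hypothetical and only illustrate why an index taken from enumerate() drifts once files are skipped, and how a separate counter avoids that.

```python
EXTENSIONS = (".cif", ".mmcif")  # extensions accepted by the loader


def generate_with_enumerate(files, extensions=EXTENSIONS):
    """Buggy pattern: skipped files still consume an index."""
    for shard_idx, path in enumerate(files):
        if not path.endswith(extensions):
            continue  # e.g. metadata.csv is skipped, leaving a gap
        yield shard_idx, path


def generate_with_counter(files, extensions=EXTENSIONS):
    """Fixed pattern: the counter advances only when a sample is yielded."""
    valid_shard_idx = 0
    for path in files:
        if not path.endswith(extensions):
            continue
        yield valid_shard_idx, path
        valid_shard_idx += 1


files = ["metadata.csv", "1abc.cif", "notes.txt", "2xyz.cif"]

print(list(generate_with_enumerate(files)))
# [(1, '1abc.cif'), (3, '2xyz.cif')]
print(list(generate_with_counter(files)))
# [(0, '1abc.cif'), (1, '2xyz.cif')]
```

With two valid files, any per-shard list has length 2, so looking up index 3 from the first variant raises IndexError, while the second variant always produces contiguous indices.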
Usage:
>>> from datasets import load_dataset
>>> dataset = load_dataset("mmcif", data_dir="./structures")
>>> structure_content = dataset[0]["structure"] # Complete mmCIF content
- Fix line length in protein_structure.py error messages
- Sort imports alphabetically in __init__.py
- Format function calls and f-strings in test_mmcif.py
Summary
This PR adds support for loading mmCIF (macromolecular Crystallographic Information File) files with load_dataset(), following the ImageFolder pattern where one row = one structure.
Based on feedback from @lhoestq in #7930, this approach makes datasets more practical for ML workflows.
Architecture
Uses the FolderBasedBuilder pattern (like ImageFolder, AudioFolder):
New ProteinStructure Feature Type
Supported Extensions
.cif, .mmcif
Usage
Test Results
All 24 mmCIF tests + 15 ProteinStructure feature tests pass.
Related PRs
References
cc @lhoestq @georgia-hf