## Summary
I have implemented batch processing features for the SimpleFold CLI and would like to contribute the code. This adds support for recursive directory scanning, checkpoint resume, and flexible cache management.
## Motivation
I am working on a protein synthesis project that requires folding a large number of protein sequences. During my work, I encountered these limitations:
- **Scattered data**: Our protein sequences are organized in subdirectories by category, making it cumbersome to process them one by one.
- **Long-running jobs**: Folding thousands of proteins takes hours or days. Without checkpoint support, any interruption would require restarting from the beginning.
- **Multi-GPU utilization**: To speed up the process, we need to use multiple GPUs in parallel, which requires cache sharing and duplicate detection.
I initially wrote a wrapper script to address these needs, and now I would like to contribute the improvements back to the main repository.
## Implementation Status
✅ **Code is ready** - I have implemented the following features:

- `--recursive`: Recursively scan subdirectories for FASTA files
- `--skip_completed`: Skip proteins that already have predictions
- `--cache_dir`: Specify a custom cache directory for cache sharing
- **Better error handling**: a single file failure doesn't stop the entire batch
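To make the intended behavior concrete, here is a minimal, self-contained sketch of the batch loop these flags describe. This is illustrative only, not the actual PR code: the function names (`find_fasta_files`, `is_completed`, `run_batch`), the `.pdb`-with-same-stem completion check, and the injected `fold` callable are all assumptions made for the example.

```python
# Hypothetical sketch of recursive scanning, skip-completed detection,
# and per-file error isolation. Not the actual SimpleFold implementation.
from pathlib import Path


def find_fasta_files(root, recursive=False):
    """Collect FASTA files under root, optionally recursing into subdirectories."""
    pattern = "**/*.fasta" if recursive else "*.fasta"
    return sorted(Path(root).glob(pattern))


def is_completed(fasta, output_dir):
    """Assumed completion check: a prediction file with the same stem exists."""
    return (Path(output_dir) / f"{fasta.stem}.pdb").exists()


def run_batch(root, output_dir, recursive=False, skip_completed=False, fold=None):
    """Fold each FASTA file; one failure does not abort the whole batch."""
    output_dir = Path(output_dir)
    done, failed, skipped = [], [], []
    for fasta in find_fasta_files(root, recursive):
        if skip_completed and is_completed(fasta, output_dir):
            skipped.append(fasta)
            continue
        try:
            fold(fasta, output_dir)  # placeholder for the real inference call
            done.append(fasta)
        except Exception:
            failed.append(fasta)     # record the failure and keep going
    return done, failed, skipped
```

Because `is_completed` only looks at existing output files, two processes pointed at the same dataset with `skip_completed` naturally avoid redoing each other's finished work, which is what the multi-GPU use case below relies on.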
## Changes Made
**Modified Files:**

- `src/simplefold/cli.py`: Added 3 new command-line arguments
- `src/simplefold/inference.py`: Updated the `predict_structures_from_fastas` function
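For reference, the three flags could be wired into the CLI roughly as follows. This is an illustrative `argparse` sketch, not the actual `cli.py` code; the parser structure and the `--fasta_path`/`--output_dir` options shown here are assumptions based on the usage examples in this issue.

```python
# Illustrative wiring of the three proposed flags (not the actual cli.py).
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="simplefold")
    parser.add_argument("--fasta_path", required=True,
                        help="FASTA file or directory of FASTA files")
    parser.add_argument("--output_dir", default="predictions",
                        help="Where predictions are written")
    # The three new flags proposed in this issue:
    parser.add_argument("--recursive", action="store_true",
                        help="Recursively scan subdirectories for FASTA files")
    parser.add_argument("--skip_completed", action="store_true",
                        help="Skip proteins that already have predictions")
    parser.add_argument("--cache_dir", default=None,
                        help="Custom cache directory (default: output_dir/cache)")
    return parser
```

Defaulting `--cache_dir` to `None` and resolving it to `output_dir/cache` later keeps existing invocations byte-for-byte compatible, since both boolean flags default to off.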
**New CLI Arguments:**
```
--recursive             # Recursively scan subdirectories for FASTA files
--skip_completed        # Skip proteins that already have predictions
--cache_dir DIRECTORY   # Custom cache directory (default: output_dir/cache)
```

## Use Cases

### 1. Recursive Directory Scanning
```bash
# Process all FASTA files in /data/proteins and its subdirectories
simplefold --fasta_path /data/proteins --recursive
```

### 2. Multi-GPU Parallel Processing
```bash
# GPU 0: Process with shared cache
CUDA_VISIBLE_DEVICES=0 simplefold --fasta_path /data/proteins --recursive \
    --cache_dir /shared/cache --skip_completed --output_dir /results/gpu0

# GPU 1: Same cache; skip_completed prevents duplicate work
CUDA_VISIBLE_DEVICES=1 simplefold --fasta_path /data/proteins --recursive \
    --cache_dir /shared/cache --skip_completed --output_dir /results/gpu1
```

### 3. Resume Interrupted Jobs
```bash
# First run (can be interrupted)
simplefold --fasta_path /large_dataset --recursive --skip_completed

# Resume from where it left off (if interrupted)
simplefold --fasta_path /large_dataset --recursive --skip_completed
```

## Backward Compatibility
- ✅ Fully backward compatible - all new parameters have sensible defaults
- ✅ No changes to model architecture or inference logic
- ✅ No new dependencies required
- ✅ Existing CLI usage remains unchanged
## Request
Would you be willing to accept a pull request with these batch processing enhancements? I'm happy to submit the PR if you're interested in merging these changes.