[Feature Request + Implementation] Batch Processing Support for CLI #48

@Nichts-fan

Summary

I have implemented batch processing features for the SimpleFold CLI and would like to contribute the code. The changes add recursive directory scanning, resuming interrupted runs by skipping completed predictions, and a configurable cache directory.

Motivation

I am working on a protein synthesis project that requires folding a large number of protein sequences. During my work, I encountered these limitations:

  1. Scattered data: Our protein sequences are organized into subdirectories by category, which makes processing them one at a time cumbersome.

  2. Long-running jobs: Folding thousands of proteins takes hours or days. Without checkpoint support, any interruption would require restarting from the beginning.

  3. Multi-GPU utilization: To speed up the process, we need to use multiple GPUs in parallel, which requires cache sharing and duplicate detection.

I initially wrote a wrapper script to address these needs, and now I would like to contribute the improvements back to the main repository.

Implementation Status

The code is ready. I have implemented the following features:

  • --recursive: Recursively scan subdirectories for FASTA files
  • --skip_completed: Skip proteins that already have predictions
  • --cache_dir: Specify custom cache directory for cache sharing
  • Better error handling: a single file failure doesn't stop the entire batch
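
To illustrate the error-handling item, here is a minimal sketch of a batch loop that records per-file failures and keeps going. It is not the actual code from inference.py: `run_batch` and `predict_one` are hypothetical names, and `predict_one` stands in for whatever function folds a single FASTA file.

```python
from typing import Callable, Iterable, List, Tuple


def run_batch(
    fasta_files: Iterable[str],
    predict_one: Callable[[str], None],  # hypothetical: folds one FASTA file
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Run predict_one on every file and collect failures instead of aborting."""
    succeeded: List[str] = []
    failed: List[Tuple[str, str]] = []
    for fasta in fasta_files:
        try:
            predict_one(fasta)
        except Exception as exc:
            # Broad catch on purpose: one bad input should not end the batch.
            failed.append((fasta, repr(exc)))
        else:
            succeeded.append(fasta)
    return succeeded, failed
```

Returning the failures lets the caller print a summary at the end of the run, so problem files can be retried or inspected separately.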

Changes Made

Modified Files:

  • src/simplefold/cli.py: Added 3 new command-line arguments
  • src/simplefold/inference.py: Updated predict_structures_from_fastas function

New CLI Arguments:

--recursive          # Recursively scan subdirectories for FASTA files
--skip_completed     # Skip proteins that already have predictions
--cache_dir DIRECTORY # Custom cache directory (default: output_dir/cache)
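
For reference, the three arguments could be declared with argparse roughly as follows. This is a sketch under assumptions, not the real cli.py: the flag names come from this issue, but `build_parser`, `resolve_cache_dir`, the help strings, and the `"outputs"` default for `--output_dir` are placeholders.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="simplefold")
    parser.add_argument("--fasta_path", required=True,
                        help="FASTA file or directory of FASTA files")
    # Assumption: the real default output directory may differ.
    parser.add_argument("--output_dir", default="outputs")
    # The new flags default to off/unset, so existing invocations are unaffected.
    parser.add_argument("--recursive", action="store_true",
                        help="Recursively scan subdirectories for FASTA files")
    parser.add_argument("--skip_completed", action="store_true",
                        help="Skip proteins that already have predictions")
    parser.add_argument("--cache_dir", default=None, metavar="DIRECTORY",
                        help="Custom cache directory (default: output_dir/cache)")
    return parser


def resolve_cache_dir(args: argparse.Namespace) -> Path:
    """Fall back to <output_dir>/cache when --cache_dir is not given."""
    if args.cache_dir:
        return Path(args.cache_dir)
    return Path(args.output_dir) / "cache"
```

Resolving the cache default after parsing, rather than in `add_argument`, keeps it tied to whatever `--output_dir` the user actually passed.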

Use Cases

1. Recursive Directory Scanning

# Process all FASTA files in /data/proteins and its subdirectories
simplefold --fasta_path /data/proteins --recursive
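
A minimal sketch of how the file discovery behind `--recursive` might work, using only pathlib. The function name `find_fasta_files` and the accepted suffixes are assumptions for illustration; the actual implementation may match files differently.

```python
from pathlib import Path
from typing import List, Union

# Assumption: which extensions count as FASTA input.
FASTA_SUFFIXES = {".fasta", ".fa"}


def find_fasta_files(root: Union[str, Path], recursive: bool = False) -> List[Path]:
    """Return FASTA files under root, descending into subdirectories if asked."""
    root = Path(root)
    if root.is_file():
        return [root]
    # rglob walks the whole tree; glob only looks at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    # Sorting gives a stable processing order across runs.
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )
```

A deterministic order is useful for resumed runs, since logs from the first and second run then list files in the same sequence.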

2. Multi-GPU Parallel Processing

# GPU 0: Process with shared cache
CUDA_VISIBLE_DEVICES=0 simplefold --fasta_path /data/proteins --recursive \
    --cache_dir /shared/cache --skip_completed --output_dir /results/gpu0

# GPU 1: Same cache, skip_completed prevents duplication
CUDA_VISIBLE_DEVICES=1 simplefold --fasta_path /data/proteins --recursive \
    --cache_dir /shared/cache --skip_completed --output_dir /results/gpu1

3. Resume Interrupted Jobs

# First run (can be interrupted)
simplefold --fasta_path /large_dataset --recursive --skip_completed

# Resume from where it left off (if interrupted)
simplefold --fasta_path /large_dataset --recursive --skip_completed
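
The resume behaviour can be sketched as a filter over the input list that drops anything with an existing prediction. This assumes one output file per FASTA file, named after the input's stem; the `.cif` suffix and the name `pending_fastas` are hypothetical, and the real output layout may differ.

```python
from pathlib import Path
from typing import Iterable, List, Union


def pending_fastas(
    fasta_files: Iterable[Union[str, Path]],
    output_dir: Union[str, Path],
    suffix: str = ".cif",  # assumption: prediction file extension
) -> List[Path]:
    """Return the inputs that do not yet have a non-empty prediction file."""
    output_dir = Path(output_dir)
    pending: List[Path] = []
    for fasta in fasta_files:
        fasta = Path(fasta)
        target = output_dir / (fasta.stem + suffix)
        # Treat an empty file as incomplete: the job may have died mid-write.
        if target.is_file() and target.stat().st_size > 0:
            continue
        pending.append(fasta)
    return pending
```

A check like this is only as reliable as the way outputs are written. Writing each prediction to a temporary name and renaming it on success would keep a partially written file from being counted as complete.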

Backward Compatibility

  • ✅ Fully backward compatible - all new parameters have sensible defaults
  • ✅ No changes to model architecture or inference logic
  • ✅ No new dependencies required
  • ✅ Existing CLI usage remains unchanged

Request

Would you be willing to accept a pull request with these batch processing enhancements?

I can submit the PR if you're interested in merging these changes.

Labels

enhancement: New feature or request
