[Feature Request + Implementation] Batch Processing Support for CLI

## Summary
I have implemented batch processing features for the SimpleFold CLI and would like to contribute the code. The changes add recursive directory scanning, resumption of interrupted runs by skipping completed predictions, and a configurable cache directory.

## Motivation
I am working on a **protein synthesis project** that requires folding a large number of protein sequences. During my work, I encountered these limitations:

1. **Scattered data**: Our protein sequences are organized into subdirectories by category, so processing them one directory at a time is cumbersome.

2. **Long-running jobs**: Folding thousands of proteins takes hours or days. Without checkpoint support, any interruption requires restarting from the beginning.

3. **Multi-GPU utilization**: To speed up the process, we need to use multiple GPUs in parallel, which requires cache sharing and duplicate detection.

I initially wrote a wrapper script to address these needs, and now I would like to contribute the improvements back to the main repository.

## Implementation Status
✅ **Code is ready** - I have implemented the following features:
- `--recursive`: Recursively scan subdirectories for FASTA files
- `--skip_completed`: Skip proteins that already have predictions
- `--cache_dir`: Specify custom cache directory for cache sharing
- Better error handling: a single file failure doesn't stop the entire batch
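
To illustrate the error-handling behavior in the last bullet, here is a minimal sketch of a batch loop that records failures and keeps going. `run_batch` and the `fold_one` callable are hypothetical names used only for illustration; they are not the actual code in `src/simplefold/inference.py`.

```python
from pathlib import Path
from typing import Callable, Iterable


def run_batch(
    fasta_files: Iterable[Path],
    fold_one: Callable[[Path], None],
) -> tuple[list[Path], list[tuple[Path, str]]]:
    """Run fold_one on every file, recording failures instead of aborting."""
    succeeded: list[Path] = []
    failed: list[tuple[Path, str]] = []
    for fasta in fasta_files:
        try:
            fold_one(fasta)  # hypothetical per-file prediction callable
        except Exception as exc:  # deliberately broad: one bad input must not end the batch
            failed.append((fasta, f"{type(exc).__name__}: {exc}"))
        else:
            succeeded.append(fasta)
    return succeeded, failed
```

Returning the failures alongside the successes lets the caller print a summary at the end of the run and exit with a non-zero status if anything failed.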

## Changes Made

### Modified Files:
- `src/simplefold/cli.py`: Added 3 new command-line arguments
- `src/simplefold/inference.py`: Updated `predict_structures_from_fastas` function

### New CLI Arguments:
```bash
--recursive          # Recursively scan subdirectories for FASTA files
--skip_completed     # Skip proteins that already have predictions
--cache_dir DIRECTORY # Custom cache directory (default: output_dir/cache)
```
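
For reference, the three arguments could be declared roughly as follows. This is a sketch that assumes an `argparse`-based parser; `build_parser` and `resolve_cache_dir` are illustrative names, and the `--output_dir` default shown is an assumption rather than the CLI's actual default.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="simplefold")
    # Options that appear in the examples in this issue; the real parser has more.
    parser.add_argument("--fasta_path", required=True)
    parser.add_argument("--output_dir", default="output")  # assumed default
    # New batch-processing options; both flags default to off.
    parser.add_argument("--recursive", action="store_true",
                        help="Recursively scan subdirectories for FASTA files")
    parser.add_argument("--skip_completed", action="store_true",
                        help="Skip proteins that already have predictions")
    parser.add_argument("--cache_dir", default=None, metavar="DIRECTORY",
                        help="Custom cache directory (default: output_dir/cache)")
    return parser


def resolve_cache_dir(args: argparse.Namespace) -> Path:
    """Fall back to output_dir/cache when --cache_dir is not given."""
    if args.cache_dir is not None:
        return Path(args.cache_dir)
    return Path(args.output_dir) / "cache"
```

Because both flags use `store_true` and `--cache_dir` falls back to `output_dir/cache`, invocations that pass none of the new arguments behave as before.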

## Use Cases

### 1. Recursive Directory Scanning
```bash
# Process all FASTA files in /data/proteins and its subdirectories
simplefold --fasta_path /data/proteins --recursive
```
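
A minimal sketch of the file discovery behind `--recursive`, assuming inputs are recognized by extension. `find_fasta_files` and the `FASTA_SUFFIXES` set are hypothetical and may differ from what the implementation accepts.

```python
from pathlib import Path

FASTA_SUFFIXES = {".fasta", ".fa"}  # assumed extensions, not taken from the implementation


def find_fasta_files(root: Path, recursive: bool = False) -> list[Path]:
    """Return FASTA files under root in a stable, sorted order."""
    if root.is_file():
        return [root]
    # rglob descends into subdirectories; glob only looks at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in FASTA_SUFFIXES
    )
```

Sorting the result keeps the processing order identical between runs, which makes an interrupted batch easier to reason about when it is resumed.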

### 2. Multi-GPU Parallel Processing
```bash
# GPU 0: Process with shared cache
CUDA_VISIBLE_DEVICES=0 simplefold --fasta_path /data/proteins --recursive \
    --cache_dir /shared/cache --skip_completed --output_dir /results/gpu0

# GPU 1: Same cache, skip_completed prevents duplication
CUDA_VISIBLE_DEVICES=1 simplefold --fasta_path /data/proteins --recursive \
    --cache_dir /shared/cache --skip_completed --output_dir /results/gpu1
```
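
This issue does not spell out how two workers that start at the same time, each with its own `--output_dir`, avoid picking the same protein. One possible approach, shown purely as an assumption and not as what `--skip_completed` does, is to claim each protein atomically with a marker file in a shared directory. `try_claim` and the `.claim` suffix are hypothetical.

```python
import os
from pathlib import Path


def try_claim(claim_dir: Path, protein_id: str) -> bool:
    """Return True if this worker claimed protein_id, False if it was already claimed."""
    claim_dir.mkdir(parents=True, exist_ok=True)
    marker = claim_dir / f"{protein_id}.claim"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one worker succeeds.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))  # record the owner for debugging
    return True
```

A worker would call `try_claim` before folding each protein and move on when it returns `False`. Claims left behind by a crashed worker would need to be cleaned up before a rerun.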

### 3. Resume Interrupted Jobs
```bash
# First run (can be interrupted)
simplefold --fasta_path /large_dataset --recursive --skip_completed

# Resume from where it left off (if interrupted)
simplefold --fasta_path /large_dataset --recursive --skip_completed
```
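
Resuming works because the second run filters out inputs that already have a prediction. The sketch below assumes each prediction is written to the output directory under the FASTA file's stem with a `.cif` suffix; that naming scheme and the `pending_inputs` helper are assumptions for illustration, not details confirmed by this issue.

```python
from pathlib import Path
from typing import Iterable


def pending_inputs(
    fasta_files: Iterable[Path],
    output_dir: Path,
    suffix: str = ".cif",  # assumed output extension
) -> list[Path]:
    """Keep only inputs that have no non-empty prediction file yet."""
    pending = []
    for fasta in fasta_files:
        prediction = output_dir / f"{fasta.stem}{suffix}"
        # Treat an empty file as incomplete, since it may be left over from an interrupted write.
        done = prediction.is_file() and prediction.stat().st_size > 0
        if not done:
            pending.append(fasta)
    return pending
```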

## Backward Compatibility
- ✅ Fully backward compatible - all new parameters have sensible defaults
- ✅ No changes to model architecture or inference logic
- ✅ No new dependencies required
- ✅ Existing CLI usage remains unchanged

## Request
Would you be willing to accept a pull request with these batch processing enhancements? The code is ready, and I can submit the PR if you're interested in merging it.

[Feature Request + Implementation] Batch Processing Support for CLI #48
