easypqp-rs


easypqp-rs is a Rust library for in-silico peptide library generation, with Python bindings for integration with the Python EasyPQP library.

Features

  • Fast in-silico library generation using Rust

  • Includes a command-line tool for batch library generation

  • Python bindings for integration within the easypqp Python package

  • Configurable via JSON for fine-tuning predictions, fragmentation settings, and NCE/instrument profiles

Rust Binary CLI Example

easypqp-rs ships an optional standalone command-line interface (CLI) binary, easypqp-insilico, for generating in-silico libraries. It can be used independently of the EasyPQP Python package.

easypqp-insilico ./config.json

Enabling CUDA Support

easypqp-rs and its CLI and Python bindings can optionally use CUDA for GPU acceleration if the underlying redeem-properties dependency is built with the cuda feature. This is controlled via Cargo features and is disabled by default.

To enable CUDA support, build with the cuda feature at the top level. This will propagate the feature through all crates:

Rust CLI

cargo build --release --features cuda

Python (via maturin or pip)

If building the Python package from source, pass the feature to maturin:

maturin build --features cuda
# or for develop mode
maturin develop --features cuda

If using pip, you can pass features with the --config-settings flag:

pip install . --config-settings=--features=cuda

This will enable CUDA support in all relevant dependencies, including redeem-properties.

Docker / Running the CUDA-enabled container

If you built the CUDA-enabled Docker image (the repository Dockerfile builds the binary with the cuda feature), run it on a host with NVIDIA drivers and the NVIDIA Container Toolkit installed. The container must be started with GPU access (for example using Docker's --gpus option).

Example: build the image locally (from repo root):

docker build -t easypqp-insilico:cuda -f Dockerfile .

Run the container and give it access to all GPUs (mount your working directory so easypqp-insilico can read/write data):

docker run --rm --gpus all -v "$(pwd):/data" easypqp-insilico:cuda easypqp-insilico /data/config.json

Notes:

  • The host must have the NVIDIA Container Toolkit (nvidia-docker) configured so the container can access GPUs. On modern Docker releases you can use the built-in --gpus flag.
  • You can restrict GPUs with --gpus 'device=0' or use environment variables to select devices if your application honors them.
  • If you publish the image to a registry, make sure users know that they need NVIDIA drivers and the runtime configured on their host to use the CUDA-enabled image.

Configuration Reference

The tool is configured via a JSON file. Below is a comprehensive guide to all available parameters.

Complete Example Configuration

Full example config.json:
{
  "database": {
    "fasta": "path/to/proteins.fasta",
    "enzyme": {
      "name": "Trypsin/P",
      "cleave_at": "KR",
      "restrict": "P",
      "c_terminal": null,
      "min_len": 7,
      "max_len": 50,
      "missed_cleavages": 2
    },
    "peptide_min_mass": 500.0,
    "peptide_max_mass": 5000.0,
    "generate_decoys": true,
    "decoy_tag": "rev_",
    "static_mods": {
      "C": 57.0215
    },
    "variable_mods": {
      "M": [15.9949],
      "[": [42.0106]
    },
    "max_variable_mods": 2
  },
  "insilico_settings": {
    "precursor_charge": [2, 3, 4],
    "max_fragment_charge": 2,
    "min_transitions": 6,
    "max_transitions": 6,
    "fragmentation_model": "HCD",
    "allowed_fragment_types": ["b", "y"],
    "rt_scale": 100.0
  },
  "dl_feature_generators": {
    "retention_time": {
      "model_path": "path/to/rt_model.safetensors",
      "constants_path": "path/to/rt_model_const.yaml",
      "architecture": "rt_cnn_tf"
    },
    "ion_mobility": {
      "model_path": "path/to/ccs_model.safetensors",
      "constants_path": "path/to/ccs_model_const.yaml",
      "architecture": "ccs_cnn_tf"
    },
    "ms2_intensity": {
      "model_path": "path/to/ms2_model.pth",
      "constants_path": "path/to/ms2_model_const.yaml",
      "architecture": "ms2_bert"
    },
    "device": "cpu",
    "instrument": "timsTOF",
    "nce": 20.0,
    "batch_size": 64,
    "fine_tune_config": {
      "fine_tune": false,
      "train_data_path": "",
      "batch_size": 256,
      "epochs": 3,
      "learning_rate": 0.001,
      "save_model": false
    }
  },
  "peptide_chunking": 0,
  "output_file": "insilico_library.tsv",
  "write_report": true,
  "parquet_output": false
}

Configuration Sections

1. Database Settings (REQUIRED)

database - FASTA file, enzyme, modifications, and decoy generation

| Parameter | Type | Default | Description |
|---|---|---|---|
| fasta | string | REQUIRED | Path to FASTA protein database file |
| generate_decoys | boolean | true | Auto-generate decoy sequences by reversing protein sequences |
| decoy_tag | string | "rev_" | Prefix added to decoy protein names |
| peptide_min_mass | number | 500.0 | Minimum peptide mass in Daltons |
| peptide_max_mass | number | 5000.0 | Maximum peptide mass in Daltons |
| max_variable_mods | integer | 2 | Maximum number of variable modifications per peptide |

Enzyme Configuration:

"enzyme": {
  "name": "Trypsin/P",          // Enzyme name (for reference)
  "cleave_at": "KR",             // Amino acids where enzyme cleaves
  "restrict": "P",               // Amino acid that prevents cleavage if following cleavage site
  "c_terminal": true,            // Cleavage occurs C-terminal to the cleavage site
  "min_len": 7,                  // Minimum peptide length
  "max_len": 50,                 // Maximum peptide length
  "missed_cleavages": 2          // Number of allowed missed cleavages
}
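
To make the enzyme fields concrete, here is a minimal Python sketch of how cleave_at, restrict, missed_cleavages, min_len, and max_len could interact for a C-terminal digest. The function name digest is hypothetical and the logic is an illustration under those assumptions, not the implementation used by easypqp-rs.

```python
def digest(sequence, cleave_at="KR", restrict="P",
           missed_cleavages=2, min_len=7, max_len=50):
    """Illustrative C-terminal digest (not the easypqp-rs implementation)."""
    # A cut falls just after a residue in `cleave_at`,
    # unless the following residue is in `restrict`.
    cuts = [0]
    for i, aa in enumerate(sequence[:-1]):
        if aa in cleave_at and sequence[i + 1] not in restrict:
            cuts.append(i + 1)
    cuts.append(len(sequence))

    peptides = []
    for start in range(len(cuts) - 1):
        # Join up to `missed_cleavages` additional adjacent fragments.
        for extra in range(missed_cleavages + 1):
            end = start + 1 + extra
            if end >= len(cuts):
                break
            peptide = sequence[cuts[start]:cuts[end]]
            if min_len <= len(peptide) <= max_len:
                peptides.append(peptide)
    return peptides


if __name__ == "__main__":
    # The R followed by P is not cleaved because of the restrict rule.
    for peptide in digest("MAAAKGGGRPLLLKVVV", missed_cleavages=1, min_len=1):
        print(peptide)
```

Raising missed_cleavages increases the number of candidate peptides quickly, which is why the length and mass filters matter for library size.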

Static Modifications:

"static_mods": {
  "C": 57.0215    // Carbamidomethylation of Cysteine
}

Variable Modifications:

"variable_mods": {
  "M": [15.9949],    // Oxidation of Methionine
  "[": [42.0106]     // N-terminal Acetylation
}

Common modification masses:

  • Carbamidomethyl (C): 57.0215
  • Oxidation (M): 15.9949
  • Phosphorylation (STY): 79.9663
  • N-terminal Acetylation: 42.0106
  • Deamidation (NQ): 0.9840
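
As a rough illustration of how these modification masses relate to peptide_min_mass and peptide_max_mass, the sketch below computes a monoisotopic peptide mass with static modifications applied and checks it against the mass window. The residue masses are standard monoisotopic values; the helper names are hypothetical, and whether easypqp-rs applies the mass filter before or after modifications is not stated here, so treat that as an assumption.

```python
# Standard monoisotopic residue masses in Daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the peptide termini


def peptide_mass(sequence, static_mods=None):
    """Neutral monoisotopic mass, with static mods added per residue."""
    static_mods = static_mods or {}
    mass = WATER
    for aa in sequence:
        mass += RESIDUE_MASS[aa] + static_mods.get(aa, 0.0)
    return mass


def within_mass_range(mass, peptide_min_mass=500.0, peptide_max_mass=5000.0):
    """Assumed filter: keep peptides inside the configured mass window."""
    return peptide_min_mass <= mass <= peptide_max_mass


if __name__ == "__main__":
    mass = peptide_mass("MGCAAR", static_mods={"C": 57.0215})
    print(f"{mass:.4f}", within_mass_range(mass))
```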

2. In-Silico Library Settings (REQUIRED)

insilico_settings - Precursor/fragment charges, transitions, fragmentation model

| Parameter | Type | Default | Description |
|---|---|---|---|
| precursor_charge | array[int] | [2, 3, 4] | Precursor charge states to generate |
| max_fragment_charge | integer | 2 | Maximum fragment ion charge |
| min_transitions | integer | 6 | Minimum number of transitions per precursor |
| max_transitions | integer | 6 | Maximum number of transitions per precursor |
| fragmentation_model | string | "HCD" | Fragmentation type: "HCD", "CID", or "ETD" |
| allowed_fragment_types | array[string] | ["b", "y"] | Allowed fragment ion types: "b", "y" |
| rt_scale | number | 100.0 | Retention time scaling factor (multiplies predicted RT) |
| unimod_annotation | boolean | true | Reannotate mass-bracket modifications to UniMod accessions (e.g., [+57.0215] becomes (UniMod:4)) |
| max_delta_unimod | number | 0.02 | Maximum delta mass (Da) tolerance for matching to UniMod entries |
| enable_unannotated | boolean | true | Keep original mass bracket when no UniMod match is found; if false, an error is raised |
| unimod_xml_path | string | null | Path to a custom unimod.xml file. If omitted, the embedded UniMod database is used |

Note: The current MS2 intensity prediction models only support "b" and "y" fragment ions.

Example:

"insilico_settings": {
  "precursor_charge": [2, 3],
  "max_fragment_charge": 1,
  "min_transitions": 6,
  "max_transitions": 12,
  "fragmentation_model": "HCD",
  "allowed_fragment_types": ["b", "y"],
  "rt_scale": 1.0,
  "unimod_annotation": true,
  "max_delta_unimod": 0.02,
  "enable_unannotated": true
}

Note (UniMod reannotation): By default, mass-bracket modification annotations (e.g., [+57.0215]) are converted to UniMod accession notation (e.g., (UniMod:4)). This uses an embedded copy of the UniMod database. To use a custom unimod.xml, set unimod_xml_path to the file path. To disable reannotation entirely, set unimod_annotation to false.
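
The interplay of max_delta_unimod and enable_unannotated can be sketched in a few lines of Python. This is an assumption-laden illustration: UNIMOD_SUBSET is a hand-picked handful of UniMod entries rather than the embedded database, the function name reannotate is hypothetical, and real matching may also consider residue specificity, which is ignored here.

```python
import re

# Tiny illustrative subset: UniMod accession -> monoisotopic delta mass (Da).
UNIMOD_SUBSET = {
    1: 42.010565,   # Acetyl
    4: 57.021464,   # Carbamidomethyl
    7: 0.984016,    # Deamidated
    21: 79.966331,  # Phospho
    35: 15.994915,  # Oxidation
}

# Matches mass brackets such as [+57.0215] or [-17.0265].
MASS_BRACKET = re.compile(r"\[([+-]?\d+(?:\.\d+)?)\]")


def reannotate(sequence, max_delta_unimod=0.02, enable_unannotated=True):
    """Replace mass brackets with the closest UniMod accession in tolerance."""
    def replace(match):
        mass = float(match.group(1))
        accession, delta = min(
            ((acc, abs(mass - ref)) for acc, ref in UNIMOD_SUBSET.items()),
            key=lambda pair: pair[1],
        )
        if delta <= max_delta_unimod:
            return f"(UniMod:{accession})"
        if enable_unannotated:
            return match.group(0)  # keep the original mass bracket
        raise ValueError(f"No UniMod match for {match.group(0)}")

    return MASS_BRACKET.sub(replace, sequence)


if __name__ == "__main__":
    print(reannotate("MGC[+57.0215]AAR"))
```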

3. Deep Learning Models (OPTIONAL)

dl_feature_generators - Custom or pretrained RT/IM/MS2 prediction models

Note

Any of retention_time, ion_mobility, or ms2_intensity that is not provided under dl_feature_generators falls back to a pretrained model, which is downloaded automatically. If the section is omitted or empty, all three pretrained models are used. The current default pretrained models are:

  • RT: rt_cnn_tf - A CNN-Transformer model trained on the ProteomicsML repository RT dataset. This model is based on AlphaPeptDeep's CNN-LSTM implementation, with the LSTM replaced by a Transformer encoder.
  • CCS: ccs_cnn_tf - A CNN-Transformer model trained on the ProteomicsML repository CCS dataset. This model is also based on AlphaPeptDeep's CNN-LSTM implementation, with the LSTM replaced by a Transformer encoder.
  • MS2: ms2_bert - A BERT-based model retrieved from AlphaPeptDeep's pretrained models.

Model Configuration:

Each model (RT, IM, MS2) is configured with three fields: a weights file, a constants file, and an architecture identifier:

{
  "model_path": "path/to/model.safetensors",     // Model weights (.pth or .safetensors)
  "constants_path": "path/to/model_const.yaml",  // Model configuration constants
  "architecture": "model_architecture_name"       // Architecture identifier
}

| Parameter | Type | Default | Description |
|---|---|---|---|
| retention_time | object | pretrained | Custom RT prediction model |
| ion_mobility | object | pretrained | Custom IM/CCS prediction model (timsTOF only) |
| ms2_intensity | object | pretrained | Custom MS2 intensity prediction model |
| device | string | "cpu" | Compute device: "cpu", "cuda", or "mps" (Apple Silicon) |
| instrument | string | "timsTOF" | Instrument type: "QE" or "timsTOF" |
| nce | number | 20.0 | Normalized collision energy for fragmentation |
| batch_size | integer | 64 | Batch size for model inference |
| fine_tune_config | object | see below | Optional fine-tuning configuration |

Supported Architectures:

  • RT: "rt_cnn_tf", "rt_cnn_lstm"
  • IM/CCS: "ccs_cnn_tf", "ccs_cnn_lstm"
  • MS2: "ms2_bert"

4. Fine-Tuning (OPTIONAL)

fine_tune_config - Transfer learning on experimental data

Fine-tune pretrained models on your own experimental data for improved accuracy.

| Parameter | Type | Default | Description |
|---|---|---|---|
| fine_tune | boolean | false | Enable fine-tuning |
| train_data_path | string | "" | Path to training data TSV file |
| batch_size | integer | 256 | Training batch size |
| epochs | integer | 3 | Number of training epochs |
| learning_rate | number | 0.001 | Learning rate for optimizer |
| save_model | boolean | false | Save fine-tuned model weights to disk |

Training Data Format (TSV):

Required columns:

  • sequence: Modified sequence with square bracket notation (e.g., MGC[+57.0215]AAR)
  • precursor_charge: Precursor charge state
  • retention_time: Experimental retention time
  • ion_mobility: CCS value (only if using timsTOF)
  • fragment_type: Fragment ion type (b, y, etc.)
  • fragment_series_number: Fragment position
  • product_charge: Fragment charge
  • intensity: Normalized fragment intensity
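
Before starting a fine-tuning run, it can help to check that a training TSV carries the columns listed above. The following is a small sketch using only the Python standard library; the function name missing_columns and the timstof switch are hypothetical conveniences, not part of easypqp-rs.

```python
import csv
import io

# Columns listed as required in the training data format above.
REQUIRED_COLUMNS = [
    "sequence", "precursor_charge", "retention_time",
    "fragment_type", "fragment_series_number",
    "product_charge", "intensity",
]


def missing_columns(tsv_text, timstof=False):
    """Return required column names absent from the TSV header row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader, [])
    # ion_mobility is only expected for timsTOF data.
    required = REQUIRED_COLUMNS + (["ion_mobility"] if timstof else [])
    return [name for name in required if name not in header]


if __name__ == "__main__":
    example = "sequence\tprecursor_charge\tretention_time\n"
    print(missing_columns(example, timstof=True))
```

To check a file on disk, read it first (for example with open(path).read()) and pass the text to missing_columns.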

Example:

"fine_tune_config": {
  "fine_tune": true,
  "train_data_path": "experimental_data.tsv",
  "batch_size": 256,
  "epochs": 5,
  "learning_rate": 0.0001,
  "save_model": true
}

5. Output Settings (OPTIONAL)

Output file format, reporting, and memory management

| Parameter | Type | Default | Description |
|---|---|---|---|
| output_file | string | "insilico_library.tsv" | Path for output library file |
| write_report | boolean | true | Generate HTML quality control report |
| parquet_output | boolean | false | Output in Parquet format instead of TSV |
| peptide_chunking | integer | 0 | Peptides per chunk (0 = auto-calculate based on memory) |

Peptide Chunking:

  • 0 (default): Automatically calculate chunk size based on available memory (recommended)
  • > 0: Manual chunk size for processing large FASTA files with limited RAM
  • Larger chunks = faster processing but more memory usage

Minimal Configuration

The minimum required configuration only needs a FASTA file:

{
  "database": {
    "fasta": "proteins.fasta"
  }
}

All other parameters will use sensible defaults and pretrained models will be auto-downloaded.
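
One way to picture how a minimal config expands is a recursive merge of user values over defaults. The sketch below copies a few defaults from the parameter tables in this document into a hypothetical DEFAULTS dictionary; the merge strategy itself is an assumption for illustration and may differ from how easypqp-rs resolves defaults internally.

```python
import copy
import json

# A subset of documented defaults, copied from the tables in this README.
DEFAULTS = {
    "database": {
        "generate_decoys": True,
        "decoy_tag": "rev_",
        "peptide_min_mass": 500.0,
        "peptide_max_mass": 5000.0,
        "max_variable_mods": 2,
    },
    "insilico_settings": {
        "precursor_charge": [2, 3, 4],
        "max_fragment_charge": 2,
        "min_transitions": 6,
        "max_transitions": 6,
    },
    "peptide_chunking": 0,
    "output_file": "insilico_library.tsv",
    "write_report": True,
    "parquet_output": False,
}


def with_defaults(user, defaults=DEFAULTS):
    """Recursively overlay user-supplied values on top of the defaults."""
    merged = copy.deepcopy(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = with_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    minimal = {"database": {"fasta": "proteins.fasta"}}
    print(json.dumps(with_defaults(minimal), indent=2))
```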

Command-Line Overrides

You can override JSON configuration values via command-line arguments:

easypqp-insilico config.json \
  --fasta my_proteins.fasta \
  --output_file my_library.tsv \
  --no-write-report \
  --parquet

You can also run without a JSON config file by providing only --fasta:

easypqp-insilico --fasta my_proteins.fasta

All other parameters will use sensible defaults.

Available flags:

  • --fasta <PATH>: Override database FASTA file
  • --output_file <PATH>: Override output file path
  • --no-write-report: Disable HTML report generation
  • --parquet: Output in Parquet format instead of TSV

Decoy Handling

When generate_decoys is enabled, reversed decoy peptides are generated automatically. The decoy_tag (default "rev_") is prefixed to each ProteinId, UniprotId, and GeneName for decoy entries, making them easy to distinguish during downstream analysis.
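
The idea of reversed, tagged decoys can be sketched as follows. This is a deliberately simple illustration: it reverses the whole sequence and prefixes the identifier with the "rev_" default from the parameter table. The helper names are hypothetical, and easypqp-rs may handle details such as terminal residues differently.

```python
def make_decoy(protein_id, sequence, decoy_tag="rev_"):
    """Return (decoy_id, decoy_sequence) using plain sequence reversal."""
    return f"{decoy_tag}{protein_id}", sequence[::-1]


def with_decoys(proteins, decoy_tag="rev_"):
    """Return a new mapping holding every target plus its tagged decoy."""
    combined = dict(proteins)
    for protein_id, sequence in proteins.items():
        decoy_id, decoy_sequence = make_decoy(protein_id, sequence, decoy_tag)
        combined[decoy_id] = decoy_sequence
    return combined


def is_decoy(protein_id, decoy_tag="rev_"):
    """Downstream check: decoys are recognised by their prefix."""
    return protein_id.startswith(decoy_tag)


if __name__ == "__main__":
    for pid, seq in with_decoys({"P12345": "MKTAYR"}).items():
        print(pid, seq, is_decoy(pid))
```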
