Open
25 commits
7adece8
Incorporate ml_predictor module into pvactools; Uploaded associated m…
jyao36 Nov 3, 2025
3ef9937
Add imbalanced-learn dependency installation to .github/workflows/tes…
jyao36 Nov 4, 2025
6f44f1a
add add_ml_predictions.py and test script
jyao36 Nov 19, 2025
6a6e843
fixed test_pvacseq_add_ml_predictions.py
jyao36 Nov 21, 2025
de5cc22
Comment out tests that fail nondeterministically
susannasiebert Nov 25, 2025
60c46ea
Update pvactools/lib/ml_predictor.py
jyao36 Nov 26, 2025
0a100bd
Update pvactools/lib/ml_predictor.py
jyao36 Nov 26, 2025
65ec594
Update pvactools/lib/ml_predictor.py
jyao36 Nov 26, 2025
5992296
updated pvacview ui.R and server.R to include footnote for ML predict…
jyao36 Nov 26, 2025
5f37de5
update output file format to preserve original class I aggregated for…
jyao36 Dec 1, 2025
61260a8
update server.R to include footnote at all times when the prediction …
jyao36 Dec 2, 2025
8b35c79
Update python "None" value handling.
jyao36 Dec 2, 2025
c926053
documentation for ML predictor
jyao36 Dec 10, 2025
1ad3651
Merge branch '7.0.0' into ml_predictor3
jyao36 Dec 10, 2025
65b1ef8
pin scikit-learn version
jyao36 Dec 12, 2025
4132b45
update scikit-learn version
jyao36 Dec 12, 2025
4b0ca2d
incorporate comments on documentation files
jyao36 Dec 19, 2025
a9c7eae
add threshold-reject instead of hardcoding the number
jyao36 Jan 8, 2026
380675a
Added test in tes_pvacseeq.py; Updated ml_predictor.py to add missing…
jyao36 Jan 20, 2026
16cfad8
Merge remote-tracking branch 'origin/7.0.0' into ml_predictor3
susannasiebert Jan 23, 2026
358519b
Add missing dependency
susannasiebert Feb 23, 2026
0066b13
updated output file saved location to Class I folder
jyao36 Mar 7, 2026
ae97284
Update docs/pvacseq/output_files.rst
jyao36 Mar 10, 2026
96af94a
updated outdir parameter
jyao36 Mar 10, 2026
6210546
Fix failing test
susannasiebert Mar 10, 2026
Binary file modified .DS_Store
Binary file not shown.
1 change: 1 addition & 0 deletions .github/workflows/tests.yml
@@ -39,6 +39,7 @@ jobs:
pip install polars==0.16.18
pip install pypandoc==1.7.2
pip install "tensorflow<2.16"
pip install imbalanced-learn
pip install -e .
- name: List installed packages
run: |
37 changes: 37 additions & 0 deletions docs/pvacseq/optional_downstream_analysis_tools.rst
@@ -61,6 +61,43 @@ epitopes are well-binding to. Lastly, the report will bin variants into tiers
that offer suggestions as to the suitability of variants for use in vaccines.
For a full definition of these tiers, see the pVACseq :ref:`output file documentation <aggregated>`.

Add Evaluation Predictions Using a Pre-Trained Machine Learning Model
---------------------------------------------------------------------

.. program-output:: pvacseq add_ml_predictions -h

This tool adds machine learning (ML)-based neoantigen prioritization predictions to existing pVACseq output files.
It uses a trained random forest model to predict whether neoantigen candidates should be evaluated as "Accept",
"Reject", or "Pending" based on a comprehensive set of features derived from binding affinity predictions,
expression data, and variant characteristics.

This tool requires that you have already generated both MHC Class I and Class II aggregated reports using
the ``generate_aggregated_report`` command or by running the pVACseq pipeline (``pvacseq run``). It takes as input
the Class I aggregated TSV, Class I all epitopes TSV, and Class II aggregated TSV files from a pVACseq run.
The tool merges these files, performs data cleaning and imputation, and applies the ML model to generate evaluation predictions for each variant.

Note that the built-in ML model was trained with most of the features listed under :doc:`Features <features>`. For the best predictions, it is strongly recommended to use the ``all`` option for the ``prediction_algorithms`` parameter when running the pVACseq pipeline.

The output file is written to the same directory as the Class I aggregated file (the directory you pass as ``output_dir`` when using the standalone command) as ``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv``. The output file contains all columns from the original
Class I aggregated file, with the following changes:


.. list-table::

   * - ``Evaluation``
     - The ML-predicted evaluation status: "Accept", "Reject", or "Pending", based on the prediction probability score.
   * - ``ML Prediction (score)``
     - A formatted output combining the model-predicted evaluation with the prediction probability score (e.g.,
       "Accept (0.72)"). It shows "NA" for variants where the model could not make a prediction, which may be due to a candidate having Class I
       algorithm predictions but not Class II algorithm predictions, causing the Class I and Class II aggregated reports to have different numbers of rows.

The ``--ml-threshold-accept`` parameter controls the probability threshold for Accept predictions (default: 0.55).
Variants with prediction probabilities >= this threshold are evaluated as "Accept". The ``--ml-threshold-reject`` parameter
controls the probability threshold for Reject predictions (default: 0.30). Variants with prediction probabilities <=
this threshold are evaluated as "Reject". Everything in between is set to "Pending" for manual review.
The ``--artifacts-path`` parameter allows you to specify a custom directory containing ML model artifacts. By default
the tool uses the model artifacts included with the pvactools package.
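
The threshold rule described above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the code in ``ml_predictor.py``: the function name ``classify`` and its argument names are hypothetical, and the handling of a missing probability follows the output file description (``Pending`` when no prediction can be made).

```python
def classify(probability, threshold_accept=0.55, threshold_reject=0.30):
    """Map an acceptance probability to an evaluation label.

    Hypothetical helper for illustration; defaults mirror the documented
    defaults of --ml-threshold-accept and --ml-threshold-reject.
    """
    # The reject threshold is expected to sit below the accept threshold.
    if threshold_reject >= threshold_accept:
        raise ValueError("threshold_reject must be less than threshold_accept")
    # No probability available (the model could not make a prediction).
    if probability is None:
        return "Pending"
    if probability >= threshold_accept:
        return "Accept"
    if probability <= threshold_reject:
        return "Reject"
    # Anything strictly between the two thresholds is left for manual review.
    return "Pending"


for p in (0.72, 0.48, 0.15, None):
    print(p, classify(p))
```

Both comparisons are inclusive, so a probability exactly equal to a threshold falls into "Accept" or "Reject" rather than "Pending".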

Calculate Reference Proteome Similarity
---------------------------------------

30 changes: 27 additions & 3 deletions docs/pvacseq/output_files.rst
Original file line number Diff line number Diff line change
Expand Up @@ -14,8 +14,7 @@ which prediction algorithms were chosen:
- ``MHC_Class_II``: for MHC class II prediction algorithms
- ``combined``: If both MHC class I and MHC class II prediction algorithms were run, this folder combines the neoepitope predictions from both

Each folder will contain the same list of output files (listed in the order
created):
Each folder will contain the same list of output files (listed in the order created):

.. list-table::
:header-rows: 1
@@ -55,7 +54,9 @@ created):
* - ``ui.R``, ``app.R``, ``server.R``, ``styling.R``, ``anchor_and_helper_functions.R``
- pVACview R Shiny application files. Not generated when running only with presentation and immunogenicity algorithms.
* - ``www`` (directory)
- Directory containing image files for pVACview. Not generated when running only with presentation and immunogenicity algorithms only.
- Directory containing image files for pVACview. Not generated when running with presentation and immunogenicity algorithms only.
* - ``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv`` (optional)
- A version of the ``<sample_name>.MHC_I.all_epitopes.aggregated.tsv`` with ML-based neoantigen evaluation predictions. Generated when both MHC Class I and Class II predictions are run and the ``--run-ml-predictions`` flag is set. Written only to the ``MHC_Class_I`` folder.


Filters applied to the filtered.tsv file
@@ -387,6 +388,29 @@ included epitopes, selecting the best-scoring epitope, and which values are outp
* - ``Evaluation``
- Column to store the evaluation of each variant when evaluating the run in pVACview. Either ``Accept``, ``Reject``, or ``Review``.

.. _ml_prediction_output:

<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv Report Columns
-------------------------------------------------------------------------

The ``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv`` file is generated when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
tool or when running pVACseq with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled.
This file contains all columns from the Class I aggregated file (``all_epitopes.aggregated.tsv``), with the existing ``Evaluation`` column populated by the ML model and one additional ``ML Prediction (score)`` column added.

The file is written to the same folder as the Class I aggregated file (``MHC_Class_I`` within the output directory).

.. list-table::
   :header-rows: 1

   * - Column Name
     - Description
   * - All columns from ``<sample_name>.MHC_I.all_epitopes.aggregated.tsv``
     - All columns described in the :ref:`aggregated` section above are included in this file.
   * - ``Evaluation``
     - Populated with ML-predicted evaluation status for each candidate. Values: ``Accept`` for variants with prediction probability >= ``ml-threshold-accept`` (default: 0.55), ``Reject`` for variants with prediction probability <= ``ml-threshold-reject`` (default: 0.30), and ``Pending`` for variants with prediction probability between ``ml-threshold-reject`` and ``ml-threshold-accept`` or when the ML model cannot make a prediction due to missing data.
   * - ``ML Prediction (score)``
     - ML-based prediction evaluation with probability score. Format: ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``). Shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).
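
As a quick sanity check on an ``ML_predict.tsv`` file, the two columns described above can be summarized with the Python standard library. The sketch below is a hedged illustration: the sample rows, the ``ID`` column, and the helper names are invented for the example, and it assumes cell values are stored without surrounding quotes. A real report contains many more columns.

```python
import csv
import io
from collections import Counter

# Invented rows for illustration only; not output from a real pVACseq run.
SAMPLE_TSV = (
    "ID\tEvaluation\tML Prediction (score)\n"
    "variant_1\tAccept\tAccept (0.72)\n"
    "variant_2\tReject\tReject (0.15)\n"
    "variant_3\tPending\tReview (0.48)\n"
    "variant_4\tPending\tNA\n"
)


def tally_evaluations(handle):
    """Count how many candidates fall under each Evaluation value."""
    reader = csv.DictReader(handle, delimiter="\t")
    return Counter(row["Evaluation"] for row in reader)


def rows_without_prediction(handle):
    """Return rows where the model could not make a prediction ("NA")."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row["ML Prediction (score)"] == "NA"]


counts = tally_evaluations(io.StringIO(SAMPLE_TSV))
missing = rows_without_prediction(io.StringIO(SAMPLE_TSV))
print(dict(counts))
print([row["ID"] for row in missing])
```

To run this against a real file, replace the ``io.StringIO`` handle with ``open(path, newline="")`` on the report path.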

.. _pvacseq_best_peptide:

Best Peptide Criteria
71 changes: 71 additions & 0 deletions docs/pvacview/pvacseq_module/pvacseq_vignette.rst
Original file line number Diff line number Diff line change
Expand Up @@ -403,6 +403,77 @@ These potentially problematic characteristics are also flagged by the red boxes
Since the candidate peptide has a match in the reference proteome, we will reject this candidate by clicking the
thumbs-down button.

ML-Based Neoantigen Evaluation Predictions
__________________________________________

When pVACseq is run with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled, or when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
tool, an aggregated report file with ML predictions (``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv``) is generated in the same folder as the Class I aggregated file (``MHC_Class_I``).
This file contains ML-based evaluation predictions that can help prioritize neoantigen candidates by presetting the evaluation status for each candidate.
It can be loaded into pVACview in combination with the Class I metrics.json file and the Class II aggregated file from their usual locations.

The ML prediction file includes all columns from the Class I aggregated file, with two columns that differ:

**Evaluation Column**

The ``Evaluation`` column is pre-populated with ML-predicted evaluation status for each candidate:

- ``Accept``: Variants with prediction probability >= ``--ml-threshold-accept`` (default: 0.55). These are predicted to be favorable neoantigen candidates for inclusion in a vaccine.
- ``Reject``: Variants with prediction probability <= ``--ml-threshold-reject`` (default: 0.30). These candidates are predicted to be unfavorable. ``--ml-threshold-reject`` should be set to a value less than ``--ml-threshold-accept``.
- ``Pending``: Variants with prediction probability between ``--ml-threshold-reject`` and ``--ml-threshold-accept``, or when the ML model cannot make a prediction due to missing data. These candidates require manual review.

**ML Prediction (score) Column**

The ``ML Prediction (score)`` column provides additional context by displaying the evaluation status along with the underlying prediction probability score.
The format is ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``).
The "Review" status is retained in this column as a suggestion that users manually set the status in the ``Evaluation`` column to "Review", "Accept", or "Reject".
This column shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when a candidate is found in the Class I aggregated report but not in the Class II aggregated report).

The ``<probability_score>`` represents the model's confidence that a candidate should be accepted for inclusion in a vaccine, with values closer to 1.0 indicating higher confidence in acceptance.
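
For downstream scripting, a value from this column can be split back into its label and score. The following is a minimal sketch based only on the format described above; the function name ``parse_ml_prediction`` is hypothetical, and the pattern assumes the label is one of "Accept", "Reject", or "Review" followed by a decimal score in parentheses.

```python
import re

# Assumed cell format: "<Evaluation> (<probability_score>)", or "NA".
_SCORE_PATTERN = re.compile(r"^(Accept|Reject|Review) \((\d*\.?\d+)\)$")


def parse_ml_prediction(value):
    """Split an ML Prediction (score) cell into (label, score).

    Returns None for "NA", i.e. when no prediction could be made.
    """
    # Tolerate surrounding whitespace or literal double quotes.
    value = value.strip().strip('"')
    if value == "NA":
        return None
    match = _SCORE_PATTERN.match(value)
    if match is None:
        raise ValueError(f"unrecognized ML prediction value: {value!r}")
    return match.group(1), float(match.group(2))


print(parse_ml_prediction("Accept (0.72)"))
print(parse_ml_prediction("NA"))
```

Raising on unrecognized values, rather than silently returning ``None``, keeps a format change in the report from being mistaken for a missing prediction.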


**Important Features Used by the ML Model**

The ML model integrates information from multiple sources to make its predictions. The five most important features are:

- Allele expression
- RNA VAF
- RNA Expression
- NetMHCpan MT IC50 Score
- TSL

The model combines these features (along with many others) using a trained random forest algorithm that has learned patterns from expert-reviewed neoantigen candidates.
The predictions serve as a starting point for evaluation, but should be reviewed in conjunction with the detailed information available in pVACview,
including binding affinity plots, anchor position analysis, and reference proteome matches.

**pVACview ML Predictions Example**

To view predictions in pVACview, load the following files:

1. The ML prediction file (``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv``) in place of the Class I TSV file.
2. The metrics.json file of Class I data.
3. The Class II aggregated.tsv file.
4. A list of genes of interest (optional).

.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example.png
   :width: 1000px
   :align: right
   :alt: pVACview ML Predictions Example
   :figclass: align-left


In the pVACview interface shown above, the ML prediction file is loaded in place of the standard Class I TSV file, with all
other inputs as described. Candidate evaluation statuses are automatically pre-populated based on the ML predictions, as shown in the “Acpt,”
“Rej,” and “Rev” columns, with prediction scores displayed in the “ML Prediction (score)” column. Users may review and override these assignments
as needed.

In this example, MAU2 is classified in the Pass tier by pVACseq and predicted as Accept by the ML model, providing concordant support for its
selection. In contrast, TUBGCP6 is labeled as a PoorBinder by pVACseq but predicted as Accept by the ML model, likely due to favorable features
such as high expression and variant allele frequency (VAF), as well as potential Class II binding indicated in the Additional Data table (shown below). While
this candidate may be provisionally accepted, further evaluation is needed to confirm that all Class II selection criteria are met.

.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example2.png
   :width: 500px
   :align: center
   :alt: pVACview ML Predictions Example TUBGCP6 Class II Additional Data

Export
______