griffithlab · jyao36 · Nov 3, 2025 · Nov 4, 2025 · Nov 19, 2025 · Nov 21, 2025
diff --git a/.DS_Store b/.DS_Store
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -39,6 +39,7 @@ jobs:
           pip install polars==0.16.18
           pip install pypandoc==1.7.2
           pip install "tensorflow<2.16"
+          pip install imbalanced-learn
           pip install -e .
       - name: List installed packages
         run: |

diff --git a/docs/images/screenshots/vignette/pvacview-ml-predictions-example.png b/docs/images/screenshots/vignette/pvacview-ml-predictions-example.png
diff --git a/docs/images/screenshots/vignette/pvacview-ml-predictions-example2.png b/docs/images/screenshots/vignette/pvacview-ml-predictions-example2.png
diff --git a/docs/pvacseq/optional_downstream_analysis_tools.rst b/docs/pvacseq/optional_downstream_analysis_tools.rst
@@ -61,6 +61,43 @@ epitopes are well-binding to. Lastly, the report will bin variants into tiers
 that offer suggestions as to the suitability of variants for use in vaccines.
 For a full definition of these tiers, see the pVACseq :ref:`output file documentation <aggregated>`.
 
+Add Evaluation Predictions Using a Pre-Trained Machine Learning Model
+---------------------------------------------------------------------
+
+.. program-output:: pvacseq add_ml_predictions -h
+
+This tool adds machine learning (ML)-based neoantigen prioritization predictions to existing pVACseq output files. 
+It uses a trained random forest model to predict whether neoantigen candidates should be evaluated as "Accept", 
+"Reject", or "Pending" based on a comprehensive set of features derived from binding affinity predictions, 
+expression data, and variant characteristics.
+
+This tool requires that you have already generated both MHC Class I and Class II aggregated reports using 
+the ``generate_aggregated_report`` command or by running the pVACseq pipeline (``pvacseq run``). It takes as input 
+the Class I aggregated TSV, Class I all epitopes TSV, and Class II aggregated TSV files from a pVACseq run. 
+The tool merges these files, performs data cleaning and imputation, and applies the ML model to generate evaluation predictions for each variant.
+
+Note that the built-in ML model was trained with most of the features listed under :doc:`Features <features>`. For the best predictions, it is strongly recommended to use the ``all`` option for the ``prediction_algorithms`` parameter when running the pVACseq pipeline.
+
+The output file is written to the same directory as the Class I aggregated file (the directory you pass as ``output_dir`` when using the standalone command) as ``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv``. It contains all columns from the original
+Class I aggregated file, with the following changes:
+
+
+.. list-table::
+
+ * - ``Evaluation``
+   - The ML-predicted evaluation status: "Accept", "Reject", or "Pending", based on the prediction probability score.
+ * - ``ML Prediction (score)``
+   - A formatted output combining the model-predicted evaluation with the prediction probability score (e.g.,
+     "Accept (0.72)"). It shows "NA" for variants where the model could not make a prediction. This can happen when a candidate has Class I
+     algorithm predictions but no Class II algorithm predictions, so that the Class I and Class II aggregated reports have different numbers of rows.
+
+The ``--ml-threshold-accept`` parameter controls the probability threshold for Accept predictions (default: 0.55). 
+Variants with prediction probabilities >= this threshold are evaluated as "Accept". The ``--ml-threshold-reject`` parameter 
+controls the probability threshold for Reject predictions (default: 0.30). Variants with prediction probabilities <= 
+this threshold are evaluated as "Reject". Everything in between is set to "Pending" for manual review. 
+The ``--artifacts-path`` parameter allows you to specify a custom directory containing ML model artifacts. By default,
+the tool uses the model artifacts included with the pvactools package.
+
 Calculate Reference Proteome Similarity
 ---------------------------------------
 

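The threshold rule documented above for ``--ml-threshold-accept`` and ``--ml-threshold-reject`` can be sketched in a few lines of Python. This is an illustrative sketch only, not the pVACtools implementation: the function name ``classify_probability`` and its treatment of a missing probability as ``None`` are assumptions based on the documented behavior (defaults of 0.55 and 0.30, and ``Pending`` when no prediction is available).

```python
from typing import Optional


def classify_probability(
    probability: Optional[float],
    accept_threshold: float = 0.55,  # documented default of --ml-threshold-accept
    reject_threshold: float = 0.30,  # documented default of --ml-threshold-reject
) -> str:
    """Map a prediction probability to an Evaluation status (hypothetical helper)."""
    if reject_threshold >= accept_threshold:
        # The docs state the reject threshold should be below the accept threshold.
        raise ValueError("reject threshold must be less than accept threshold")
    if probability is None:
        # Assumption: a missing prediction is represented as None here.
        return "Pending"
    if probability >= accept_threshold:
        return "Accept"
    if probability <= reject_threshold:
        return "Reject"
    # Anything strictly between the two thresholds is left for manual review.
    return "Pending"
```

Under these assumptions, ``classify_probability(0.72)`` gives ``"Accept"``, ``classify_probability(0.15)`` gives ``"Reject"``, and ``classify_probability(0.48)`` gives ``"Pending"``. Both boundaries are inclusive, matching the ``>=`` and ``<=`` wording in the documentation.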
diff --git a/docs/pvacseq/output_files.rst b/docs/pvacseq/output_files.rst
@@ -14,8 +14,7 @@ which prediction algorithms were chosen:
 - ``MHC_Class_II``: for MHC class II prediction algorithms
 - ``combined``: If both MHC class I and MHC class II prediction algorithms were run, this folder combines the neoepitope predictions from both
 
-Each folder will contain the same list of output files (listed in the order
-created):
+Each folder will contain the same list of output files (listed in the order created):
 
 .. list-table::
    :header-rows: 1
@@ -55,7 +54,9 @@ created):
    * - ``ui.R``, ``app.R``, ``server.R``, ``styling.R``, ``anchor_and_helper_functions.R``
      - pVACview R Shiny application files. Not generated when running only with presentation and immunogenicity algorithms.
    * - ``www`` (directory)
-     - Directory containing image files for pVACview. Not generated when running only with presentation and immunogenicity algorithms only.
+     - Directory containing image files for pVACview. Not generated when running with presentation and immunogenicity algorithms only.
+   * - ``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv`` (optional)
+     - A version of the ``<sample_name>.MHC_I.all_epitopes.aggregated.tsv`` file with ML-based neoantigen evaluation predictions. Generated when both MHC Class I and Class II predictions are run and the ``--run-ml-predictions`` flag is set. Written only to the ``MHC_Class_I`` folder.
 
 
 Filters applied to the filtered.tsv file
@@ -387,6 +388,29 @@ included epitopes, selecting the best-scoring epitope, and which values are outp
    * - ``Evaluation``
      - Column to store the evaluation of each variant when evaluating the run in pVACview. Either ``Accept``, ``Reject``, or ``Review``.
 
+.. _ml_prediction_output:
+
+<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv Report Columns
+-------------------------------------------------------------------------
+
+The ``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv`` file is generated when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>` 
+tool or when running pVACseq with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled. 
+This file contains all columns from the Class I aggregated file (``all_epitopes.aggregated.tsv``), with the ``Evaluation`` column populated by the ML model and one additional ``ML Prediction (score)`` column added.
+
+The file is written to the same folder as the Class I aggregated file (``MHC_Class_I`` within the output directory).
+
+.. list-table::
+   :header-rows: 1
+
+   * - Column Name
+     - Description
+   * - All columns from ``<sample_name>.MHC_I.all_epitopes.aggregated.tsv``
+     - All columns described in the :ref:`aggregated` section above are included in this file.
+   * - ``Evaluation``
+     - Populated with the ML-predicted evaluation status for each candidate. Values: ``Accept`` for variants with prediction probability >= ``--ml-threshold-accept`` (default: 0.55), ``Reject`` for variants with prediction probability <= ``--ml-threshold-reject`` (default: 0.30), and ``Pending`` for variants with prediction probability between ``--ml-threshold-reject`` and ``--ml-threshold-accept`` or when the ML model cannot make a prediction due to missing data.
+   * - ``ML Prediction (score)``
+     - ML-based prediction evaluation with probability score. Format: ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``). Shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when Class I and Class II aggregated files have different numbers of rows).
+
 .. _pvacseq_best_peptide:
 
 Best Peptide Criteria

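The ``ML Prediction (score)`` format described in the table above, ``"<Evaluation> (<probability_score>)"`` or ``"NA"``, is straightforward to parse when post-processing the TSV. The following is a minimal sketch, assuming each value matches the documented form exactly; ``parse_ml_prediction`` is a hypothetical helper and not part of pVACtools.

```python
import re
from typing import Optional, Tuple

# Matches values such as "Accept (0.72)": a label, then a number in parentheses.
_PATTERN = re.compile(
    r"^\s*(?P<label>[A-Za-z]+)\s*\((?P<score>[0-9]*\.?[0-9]+)\)\s*$"
)


def parse_ml_prediction(value: str) -> Optional[Tuple[str, float]]:
    """Split an ML Prediction (score) value into (label, probability).

    Returns None for "NA", which the docs use when no prediction could be made.
    """
    if value.strip() == "NA":
        return None
    match = _PATTERN.match(value)
    if match is None:
        raise ValueError(f"unexpected ML Prediction (score) value: {value!r}")
    return match.group("label"), float(match.group("score"))
```

For example, ``parse_ml_prediction("Accept (0.72)")`` returns ``("Accept", 0.72)`` and ``parse_ml_prediction("NA")`` returns ``None``. Raising on anything else is a deliberate choice in this sketch so that unexpected values are noticed early.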
diff --git a/docs/pvacview/pvacseq_module/pvacseq_vignette.rst b/docs/pvacview/pvacseq_module/pvacseq_vignette.rst
@@ -403,6 +403,77 @@ These potentially problematic characteristics are also flagged by the red boxes
 Since the candidate peptide has a match in the reference proteome, we will reject this candidate by clicking the
 thumbs-down button.
 
+ML-Based Neoantigen Evaluation Predictions
+__________________________________________
+
+When pVACseq is run with both MHC Class I and Class II predictions and the ``--run-ml-predictions`` flag enabled, or when using the :ref:`add_ml_predictions <optional_downstream_analysis_tools_label>`
+tool, an aggregated report file with ML predictions (``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv``) is generated in the same folder as the Class I aggregated file (``MHC_Class_I``).
+This file contains ML-based evaluation predictions that can help prioritize neoantigen candidates by presetting the evaluation status for each candidate.
+It can be loaded into pVACview in combination with the Class I metrics.json file and the Class II aggregated file from their usual locations.
+
+The ML prediction file includes all columns from the Class I aggregated file, with two differences:
+
+**Evaluation Column**
+
+The ``Evaluation`` column is pre-populated with ML-predicted evaluation status for each candidate:
+
+- ``Accept``: Variants with prediction probability >= ``--ml-threshold-accept`` (default: 0.55). These candidates are predicted to be favorable neoantigen candidates for inclusion in a vaccine.
+- ``Reject``: Variants with prediction probability <= ``--ml-threshold-reject`` (default: 0.30). These candidates are predicted to be unfavorable. ``--ml-threshold-reject`` should be set to a value less than ``--ml-threshold-accept``.
+- ``Pending``: Variants with prediction probability between ``--ml-threshold-reject`` and ``--ml-threshold-accept``, or when the ML model cannot make a prediction due to missing data. These candidates require manual review.
+
+**ML Prediction (score) Column**
+
+The ``ML Prediction (score)`` column provides additional context by displaying the evaluation status along with the underlying prediction probability score. 
+The format is ``"<Evaluation> (<probability_score>)"`` (e.g., ``"Accept (0.72)"``, ``"Reject (0.15)"``, ``"Review (0.48)"``). 
+The "Review" label is kept in this column to prompt users to manually set the status in the "Evaluation" column to "Review", "Accept", or "Reject".
+This column shows ``"NA"`` when the ML model cannot make a prediction due to missing data (e.g., when a candidate is found in the Class I aggregated report but not in the Class II aggregated report).
+
+The ``<probability_score>`` represents the model's confidence that a candidate should be accepted for inclusion in a vaccine, with values closer to 1.0 indicating higher confidence in acceptance.
+
+
+**Important Features Used by the ML Model**
+
+The ML model integrates information from multiple sources to make its predictions. The following five features are among the most important:
+
+- Allele Expression
+- RNA VAF
+- RNA Expression
+- NetMHCpan MT IC50 Score
+- TSL
+
+The model combines these features (and many others) using a trained random forest algorithm that has learned patterns from expert-reviewed neoantigen candidates.
+The predictions serve as a starting point for evaluation, but should be reviewed in conjunction with the detailed information available in pVACview, 
+including binding affinity plots, anchor position analysis, and reference proteome matches.
+
+**pVACview ML Predictions Example**
+
+To view predictions in pVACview, load the following files:
+
+1. The ML prediction file (``<sample_name>.MHC_I.all_epitopes.aggregated.ML_predict.tsv``) in place of the Class I TSV file.
+2. The Class I ``metrics.json`` file.
+3. The Class II ``aggregated.tsv`` file.
+4. A list of genes of interest (optional).
+
+.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example.png
+    :width: 1000px
+    :align: right
+    :alt: pVACview ML Predictions Example
+    :figclass: align-left
+
+
+In the pVACview interface shown above, the ML prediction file is loaded in place of the standard Class I TSV file, with all 
+other inputs as described. Candidate evaluation statuses are automatically pre-populated based on the ML predictions, as shown in the "Acpt",
+"Rej", and "Rev" columns, with prediction scores displayed in the "ML Prediction (score)" column. Users may review and override these assignments
+as needed.
+
+In this example, MAU2 is classified in the Pass tier by pVACseq and predicted as Accept by the ML model, providing concordant support for its 
+selection. In contrast, TUBGCP6 is labeled as a PoorBinder by pVACseq but predicted as Accept by the ML model, likely due to favorable features 
+such as high expression and variant allele frequency (VAF), as well as potential Class II binding indicated in the Additional Data table (shown below). While 
+this candidate may be provisionally accepted, further evaluation is needed to confirm that all Class II selection criteria are met.
+
+.. figure:: ../../images/screenshots/vignette/pvacview-ml-predictions-example2.png
+    :width: 500px
+    :align: center
+    :alt: pVACview ML Predictions Example TUBGCP6 Class II Additional Data
 
 Export
 ______