Added the ImmuScope class II prediction algorithm #1371
ldhtnp wants to merge 19 commits into griffithlab:7.0.0 from
I added a few small suggested changes to the code itself.
I would like to see the output from this tool as an input file to the all class ii output parser test. To create this input file you will need to run ImmuScope on the 1-200 fasta chunk of HCC1395 data using allele DRB1*04:05 and length 12. I can send you this file.
The all class ii output file created by this test should then be used as the updated input to the all class ii aggregate report creation test. Updating these two tests will ensure that ImmuScope gets parsed correctly and its detailed data included in the metrics file. Edit to add: I just realized that this test is not available in your branch. I added it as part of #1376 so you can disregard this comment.
Have you run HCC1395 on this PR? If not, I can make a test docker container once these updates have been made and start a run. I like to load the results into pVACview to check that everything looks as expected.
tmp_input_file.write("allele\tpeptide\tseq_num\tstart\n")
tmp_input_file = tempfile.NamedTemporaryFile('w', dir=tmp_dir, delete=False, newline='')
writer = csv.writer(tmp_input_file, delimiter='\t', lineterminator='\n')
writer.writerow(["allele", "peptide", "seq_num", "start"])
Does the predictor require seq_num and start in the input file? If not, I think these columns can be removed.
Yes, they are currently required by the predictor/wrapper interface as implemented. If you would rather exclude them, I can update the fork to make them optional.
Gotcha. I think we could generate the file with these columns filled in by creating it in the same block of code above where we read in the fasta file (line 878+). The determine_neoepitopes method returns a hash with the start position as the key and the epitope as the value. The fasta sequence header can be used as the seq_num.
I assume that the output includes these two columns as well, so that would save us from having to map each epitope back to its seq_num and start position (line 934+). This would come at the expense of potentially having duplicate epitopes in that file (for example, from repetitive regions), which could make ImmuScope slower (not sure if they accounted for this).
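To make the trade-off concrete, here is a rough sketch of writing one row per epitope occurrence while parsing the FASTA. It is not the code in this PR: `sliding_epitopes` is a hypothetical stand-in for `determine_neoepitopes` (the real signature is not assumed), the 1-based start positions are an assumption, and the two records are made-up sequences.

```python
import csv
import io


def sliding_epitopes(sequence, length):
    """Hypothetical stand-in for determine_neoepitopes.

    Returns a dict mapping start position (1-based, an assumption)
    to the epitope of the given length beginning there.
    """
    return {
        i + 1: sequence[i:i + length]
        for i in range(len(sequence) - length + 1)
    }


def write_occurrence_rows(fasta_records, allele, length, handle):
    """Write one tab-separated row per epitope occurrence.

    seq_num comes from the FASTA header and start from the epitope dict,
    so no remapping is needed later, but repeated epitopes are written
    (and would be scored) more than once.
    """
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    writer.writerow(["allele", "peptide", "seq_num", "start"])
    row_count = 0
    for seq_num, sequence in fasta_records:
        for start, epitope in sliding_epitopes(sequence, length).items():
            writer.writerow([allele, epitope, seq_num, start])
            row_count += 1
    return row_count


# Made-up records: (FASTA header used as seq_num, sequence).
records = [("1", "ACDEFGHIKLMNPQ"), ("2", "ACDEFGHIKLMNWY")]
buffer = io.StringIO()
n_rows = write_occurrence_rows(records, "DRB1*04:05", 12, buffer)
lines = buffer.getvalue().splitlines()
```

In this toy input the first 12-mer is shared by both sequences, so it appears twice in the file. That duplication is the cost mentioned above.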
I pushed a commit that keeps the deduped peptide set for scoring, but captures seq_num and start during the initial FASTA parsing and then merges them back onto the ImmuScope output. This lets us drop the remapping loop while still preserving those fields cleanly.
ImmuScope's performance would suffer if we passed every epitope occurrence directly to the wrapper with seq_num/start filled in, since it would score duplicates instead of just unique peptides. This approach avoids that by keeping the input deduplicated and only expanding back afterward.
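For anyone following along, the idea can be sketched roughly as below. This is an illustration under assumptions, not the committed code: `score_unique` is a hypothetical placeholder for the ImmuScope wrapper call and returns a dummy value, the function names are made up, and start positions are assumed to be 1-based.

```python
from collections import defaultdict


def index_occurrences(fasta_records, length):
    """Map each unique peptide to every (seq_num, start) where it occurs."""
    occurrences = defaultdict(list)
    for seq_num, sequence in fasta_records:
        for i in range(len(sequence) - length + 1):
            occurrences[sequence[i:i + length]].append((seq_num, i + 1))
    return occurrences


def score_unique(peptides):
    """Hypothetical placeholder for the predictor call.

    Returns a dummy score per peptide; the real wrapper would be
    invoked once on the deduplicated peptide list instead.
    """
    return {p: round(len(set(p)) / len(p), 3) for p in peptides}


def expand_scores(occurrences, scores):
    """Merge seq_num and start back onto the per-peptide scores."""
    rows = []
    for peptide, positions in occurrences.items():
        for seq_num, start in positions:
            rows.append({
                "peptide": peptide,
                "seq_num": seq_num,
                "start": start,
                "score": scores[peptide],
            })
    return rows


# Made-up records: (FASTA header used as seq_num, sequence).
records = [("1", "ACDEFGHIKLMNPQ"), ("2", "ACDEFGHIKLMNWY")]
occurrences = index_occurrences(records, 12)
scores = score_unique(sorted(occurrences))   # each unique peptide scored once
rows = expand_scores(occurrences, scores)    # one row per occurrence again
```

Here five unique peptides are scored but six rows come back, because the shared 12-mer is expanded to both of its positions after scoring.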
I wasn't sure whether ImmuScope was smart enough to deduplicate epitopes on its end.
I created a fork of the original ImmuScope repository to refine it for use in pVACtools. This involved creating a wrapper that lets ImmuScope output an immunogenicity score for a given peptide + HLA pair. I based the implementation on BigMHC_IM and updated the documentation.
This tool was suggested by Malachi in issue #1330.