FEATURE - Detect tandem duplications with cigar #148
Irallia wants to merge 4 commits into seqan:master
Codecov Report
@@ Coverage Diff @@
## master #148 +/- ##
==========================================
- Coverage 98.40% 98.09% -0.32%
==========================================
Files 19 19
Lines 878 944 +66
==========================================
+ Hits 864 926 +62
- Misses 14 18 +4
a0408ac to 7c8e74b
f8e8809 to 5cd7ed5
auto & res = *results.begin();
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
Careful: modulo behaves weirdly with negative values!
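To make the concern concrete: in C++ the `%` operator truncates toward zero, so a negative score yields a negative remainder, and storing that in a `size_t` wraps around to a huge value. The sketch below assumes a scoring scheme of +1 per match and -100 per mismatch, which is a guess based on the `% 100` and not something the PR states; `floor_mod` is a hypothetical helper, not part of the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: remainder in [0, modulus), also for negative values.
constexpr int32_t floor_mod(int32_t const value, int32_t const modulus)
{
    assert(modulus > 0);
    int32_t const remainder = value % modulus; // may be negative: C++ truncates toward zero
    return remainder < 0 ? remainder + modulus : remainder;
}

// Assumed scheme (not confirmed by the PR): 5 matches (+1 each) and 1 mismatch (-100).
constexpr int32_t example_score = 5 * 1 + 1 * (-100); // -95

// The plain remainder keeps the sign of the dividend.
constexpr int32_t truncated_remainder = example_score % 100;

// Converting the negative remainder to size_t wraps around.
inline std::size_t wrapped_matches()
{
    return static_cast<std::size_t>(example_score % 100);
}

// The non-negative remainder recovers the assumed match count.
constexpr int32_t recovered_matches = floor_mod(example_score, 100);
```

Under that assumed scheme, the non-negative remainder gives back the five matches, while the plain `%` result does not.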
 * ref               AAAACCGCGTAGCGGG----------TACGTAACGGTACG
 *                     ||||||||||||||          ||||||||        -> inserted sequence: GCGGGGCGGG
 * read                AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
 *
 * suffix_sequence   AAAACCGCGTAGCGGG        -> free_end_gaps_sequence1_leading{true},
 *                              |||||           free_end_gaps_sequence1_trailing{false}
 * inserted_bases               GCGGGGCGGG   -> free_end_gaps_sequence2_leading{false},
 *                                              free_end_gaps_sequence2_trailing{true}
 * -> tandem_dup_count = 3, duplicated_bases = GCGGG
 *
 * Case 2: The duplication (insertion) comes before the matched sequence.
 * ref               AAAACCGCGTA----------GCGGGTACGTAACGGTACG
 *                     |||||||||          |||||||||||||        -> inserted sequence: GCGGGGCGGG
 * read                AACCGCGTAGCGGGGCGGGGCGGGTACGTAAC
 *
 * prefix_sequence        GCGGGTACGTAACGGTACG   -> free_end_gaps_sequence1_leading{false},
 *                        |||||                    free_end_gaps_sequence1_trailing{true}
 * inserted_bases    GCGGGGCGGG                 -> free_end_gaps_sequence2_leading{true},
 *                                                 free_end_gaps_sequence2_trailing{false}
 * -> tandem_dup_count = 3, duplicated_bases = GCGGG
Other idea:
Build a suffix tree of the inserted sequence, search it for the longest repeated substring without overlap (allowing errors), and then map this repeated substring (without errors?).
Other input: Burrows-Wheeler transform, occurrence table, FM index; regular expression -> build a minimal automaton; ZIP, Huffman coding
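A rough sketch of the first idea, for exact matches only: instead of a suffix tree it uses a simple quadratic dynamic-programming table, which is a swapped-in technique chosen for brevity and not what the comment proposes. The function name `longest_nonoverlapping_repeat` is hypothetical, and plain strings stand in for `seqan3::dna5` ranges.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Returns the longest substring that occurs at least twice without overlapping itself.
// Exact matches only; O(n^2) time and memory, so only suitable for short inserted sequences.
std::string longest_nonoverlapping_repeat(std::string_view const sequence)
{
    std::size_t const n = sequence.size();

    // common[i][j] (i < j): length of the longest common suffix of sequence[0, i) and
    // sequence[0, j), capped at j - i so that the two copies cannot overlap.
    std::vector<std::vector<std::size_t>> common(n + 1, std::vector<std::size_t>(n + 1, 0));

    std::size_t best_length = 0;
    std::size_t best_end = 0; // end position (exclusive) of the first copy

    for (std::size_t i = 1; i <= n; ++i)
    {
        for (std::size_t j = i + 1; j <= n; ++j)
        {
            if (sequence[i - 1] == sequence[j - 1] && common[i - 1][j - 1] < j - i)
            {
                common[i][j] = common[i - 1][j - 1] + 1;
                if (common[i][j] > best_length)
                {
                    best_length = common[i][j];
                    best_end = i;
                }
            }
        }
    }

    return std::string{sequence.substr(best_end - best_length, best_length)};
}
```

For the inserted sequence from the doc comment, `GCGGGGCGGG`, this yields the unit `GCGGG`. Allowing errors, as the comment suggests, would need a different approach.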
 */
std::tuple<size_t, size_t> align_suffix_or_prefix(auto const & config,
                                                  int32_t const min_length,
                                                  std::span<const seqan3::dna5> & sequence,
- std::span<const seqan3::dna5> & sequence,
+ std::span<const seqan3::dna5> const sequence,
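One likely motivation for this suggestion is that a view type is cheap to copy, and a by-value parameter also accepts temporaries such as sub-views, which a non-const lvalue reference does not. The sketch below uses `std::string_view` as a stand-in for `std::span<const seqan3::dna5>` so that it compiles without SeqAn3; both function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Non-const lvalue reference: callers must pass a named, mutable view object.
std::size_t count_g_by_reference(std::string_view & sequence)
{
    return static_cast<std::size_t>(std::count(sequence.begin(), sequence.end(), 'G'));
}

// By value: copies only a pointer and a length, and also binds to temporaries.
std::size_t count_g_by_value(std::string_view const sequence)
{
    return static_cast<std::size_t>(std::count(sequence.begin(), sequence.end(), 'G'));
}

std::size_t count_g_in_suffix(std::string_view const read)
{
    // read.substr(2) is a temporary view. The by-value overload accepts it, whereas
    // count_g_by_reference(read.substr(2)) would not compile.
    return count_g_by_value(read.substr(2));
}
```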
// TODO (irallia 17.8.21): The mismatches should give us the opportunity to allow a given amount of errors in the
// duplication.
size_t matches = res.score() % 100;
size_t mismatches = (res.score() - matches) * (-1);
Isn't this the same as:
mismatches = floor(res.score() / 100) * 100;
?
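Whether the two forms agree depends on how the division rounds. With C++ integer division (truncation toward zero), `score - score % 100` equals `(score / 100) * 100`, but `std::floor` on a real-valued quotient rounds toward negative infinity and differs for negative scores that are not multiples of 100. A small sketch with plain `int32_t` scores (the function names are hypothetical, and the `* (-1)` from the reviewed line is kept in all three variants so that only the rounding differs):

```cpp
#include <cmath>
#include <cstdint>

// The expression under review, written with signed integers only.
int32_t mismatch_part_reviewed(int32_t const score)
{
    int32_t const matches = score % 100;
    return (score - matches) * (-1);
}

// Integer-division form: truncates toward zero, like the % operator.
int32_t mismatch_part_truncated(int32_t const score)
{
    return -(score / 100) * 100;
}

// Floor of the real-valued quotient: rounds toward negative infinity.
int32_t mismatch_part_floored(int32_t const score)
{
    return -static_cast<int32_t>(std::floor(score / 100.0)) * 100;
}
```

For a score of -205, the first two variants give 200 while the floored variant gives 300; for exact multiples such as -300 all three agree.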
std::span<seqan3::dna5 const> & sequence,
std::span<seqan3::dna5 const> & inserted_bases,
5cd7ed5 to abbf672
abbf672 to 5301059
Signed-off-by: Lydia Buntrock <lydia.buntrock@fu-berlin.de>
5301059 to 4a44217
Resolves #166
With this PR we can now detect tandem duplications in the CIGAR string. For now, we only collect tandem duplications without errors. A follow-up PR will allow errors as well, so I left some TODOs in the code.
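As an illustration of what error-free detection means for Case 1 of the doc comment (the duplication follows the matched sequence), here is a minimal sketch. It is not the PR's implementation, which aligns the sequences with SeqAn3; it uses plain string comparison instead, and `tandem_duplication` and `detect_exact_tandem_dup` are hypothetical names.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type for this sketch.
struct tandem_duplication
{
    std::size_t count; // number of copies, including the one already in the reference
    std::string unit;  // the duplicated bases
};

// Checks whether `inserted` consists of whole copies of a unit that also ends `ref_suffix`,
// i.e. the reference sequence directly before the insertion. Exact matches only.
std::optional<tandem_duplication> detect_exact_tandem_dup(std::string_view const ref_suffix,
                                                          std::string_view const inserted)
{
    for (std::size_t length = 1; length <= inserted.size(); ++length)
    {
        if (inserted.size() % length != 0)
            continue; // the unit must tile the insertion completely

        std::string_view const unit = inserted.substr(0, length);

        bool tiles = true;
        for (std::size_t pos = length; pos < inserted.size() && tiles; pos += length)
            tiles = (inserted.substr(pos, length) == unit);

        // The reference must end with one copy of the unit for this to be a tandem duplication.
        if (tiles && ref_suffix.size() >= length && ref_suffix.substr(ref_suffix.size() - length) == unit)
            return tandem_duplication{inserted.size() / length + 1, std::string{unit}};
    }

    return std::nullopt; // no exact tandem duplication found
}
```

With the values from the doc comment, `detect_exact_tandem_dup("AAAACCGCGTAGCGGG", "GCGGGGCGGG")` reports three copies of `GCGGG`. Case 2 would compare against the start of the following reference sequence instead.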