Redundant k-mer in contigs #11

@rob-p

Hi @IlyaMinkin,

We've run into another minor issue that we think is a bug, and I wanted to report the behavior here to get your feedback on it. For a small number of the contigs that TwoPaCo returns, a single contig contains both a k-mer and its reverse complement. Thus, in the compacted dBG, the k-mer itself is repeated, which we believe shouldn't happen. I realize that this is possible in the GFA output when the k-mer occurs at the end of a contig, since the GFA file is written such that the overlaps themselves are of length k and hence these k-mers will occur at least twice. However, these repeated k-mers are internal (and seem to occur, in fact, when the entire contig is its own reverse complement).

This issue was discovered by my student @fataltes, who did the legwork to provide the following example. We're working with this reference sequence. We ran TwoPaCo with -k set to 31, and then used graphdump to obtain a GFA1 file. Most of the contigs / segments in this file are OK, but a few of them contain the same k-mer twice: once in the forward orientation and once in the reverse-complement orientation. Here is the list of such contigs / segments in our output (segment ID followed by sequence):

2232549 ATGTGTGTGTGTGTATATATATATATATATACACACACACACAT
196044 TGTGTATATATATACACATATATACGTATATATGTGTATATATATACACA
557083 TTTCATGTTTATATATATATATATATGTATATATATATACATATATATATATATATAAACATGAAA
659373 GTGTGTGTGTATATATATATATATATATATACACACACAC
2222892 ATTATATATATATAATATATATATATTATATATATATAAT
2307911 ATATATATATATCATATATATGATATATATATAT
2309111 ATATACATATATATATATATATATATATGTATAT
2861563 AAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTT
2237088 TGTGTGTGTATGTATATTATATAATATACATACACACACA
555324 TATATATATATATACCATATATATGGTATATATATATATA
659376 TGTGTGTGTGTGTGTATATATACACACACACACACA
554875 TATATATATATATATATAATATATATATATATATA
555396 TATATATATATAAATATATATATATTTATATATATATA
162527 TTATATATATATTATATATATAATATATATATAA
2307775 ATATATATGTGTGTATATATATACACACATATATAT
554899 ATATATATATATATATGCATATATATATATATAT
214284 TGTATGTGTGTATATATGTGTGTATATATATATACACACATATATACACACATACA
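For anyone who wants to check a segment by hand, here is a minimal Python sketch of the test we have in mind. It is not taken from TwoPaCo or from our indexing code; the helper names (`reverse_complement`, `canonical`, `repeated_canonical_kmers`) are made up for illustration. It treats a k-mer and its reverse complement as the same canonical k-mer and reports every canonical k-mer that starts at more than one position in a segment.

```python
from collections import defaultdict

# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = reverse_complement(kmer)
    return kmer if kmer <= rc else rc


def repeated_canonical_kmers(seq, k=31):
    """Map each canonical k-mer that occurs more than once in seq to its start positions."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[canonical(seq[i:i + k])].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


if __name__ == "__main__":
    # Segment 2861563 from the list above.
    segment = "AAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTT"
    for kmer, pos in repeated_canonical_kmers(segment, k=31).items():
        print(kmer, pos)
```

Running the same function over every `S` line of the GFA1 file should reproduce a list like the one above, assuming segments are compared in canonical form as described.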

As you can see, these segments contain quite a few cases where both a 31-mer and its reverse complement (and even larger k-mers) are present in the same contig. As we are indexing k-mers in the TwoPaCo representation, and expecting each k-mer to occur at most once, this is causing some issues for us. Interestingly, all of these cases seem to occur within substrings of segments that are their own reverse complements. So, I presume that this is either (1) expected behavior, and we are possibly interpreting the compacted dBG differently from TwoPaCo, or (2) a minor corner case in the contig generation code.
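To make the self-reverse-complement observation concrete: if a sequence of length n equals its own reverse complement, then the k-mer starting at position i is the reverse complement of the k-mer starting at position n - k - i. For odd k (such as 31) no k-mer can be its own reverse complement, so every k-mer in such a sequence is paired with a different position. The sketch below illustrates this on a toy sequence; the function names are hypothetical and this is only our reading of the pattern, not a statement about how TwoPaCo builds contigs.

```python
# Translation table for complementing upper-case DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_own_reverse_complement(seq):
    """True if the sequence reads the same as its reverse complement."""
    return seq == reverse_complement(seq)


def mirrored_kmer_pairs(seq, k):
    """List (i, j) start positions, i < j, where the k-mer at j is the
    reverse complement of the k-mer at i and j is the mirror position n - k - i."""
    n = len(seq)
    pairs = []
    for i in range(n - k + 1):
        j = n - k - i
        if i < j and seq[j:j + k] == reverse_complement(seq[i:i + k]):
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    toy = "AACGTT"  # toy sequence that equals its own reverse complement
    print(is_own_reverse_complement(toy))
    print(mirrored_kmer_pairs(toy, k=3))
```

In the toy case with k = 3, the k-mers at positions 0 and 3 (`AAC` and `GTT`) and at positions 1 and 2 (`ACG` and `CGT`) are reverse complements of each other, which is the same pattern we see with k = 31 in the segments listed above.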

Please let me know if you have any questions about this case or any difficulty re-generating this example. Thanks again!

--Rob
