Skip to content
This repository was archived by the owner on Mar 23, 2023. It is now read-only.
This repository was archived by the owner on Mar 23, 2023. It is now read-only.

graph-based concatemer alignment  #53

@callumparr

Description

@callumparr

I was just reading your nice nature biotech paper and it has more detail now about what the pore-c tools do. I was interested in this graph based concatemer alignment step.

I was wondering why has to be so complicated. Once you have the sub-read alignments, filter low map scores, and keep one alignment if multiple overlaps on the portion of reading, why can you not build up a list of contacts from this.

For completeness, I guess you should make sure that at the start and end of each alignment, the position in the genome should be demarcated by a restriction site. Although I guess gDNA could be randomly fragmented in situ and still under go proximity ligation.
albeit less efficent than the RE mediated complementary overhang ligat
I cannot understand what this graphing process does or why it is needed.

Graph-based concatemer alignment
We developed an algorithm for Pore-C read alignment with the goal of identifying the most likely set of monomers comprising a given concatemer. Because each Pore-C read is a chimera of (potentially) distant monomeric sequences, the processing of these reads requires performing a multiway split alignment. The goal of our split alignment algorithm was to identify a set of minimally overlapping alignments that maximally cover the read sequence.

In this algorithm, raw sequencing data are initially aligned with bwa bwasw with parameters –b 5 –q 2 –r 1 –T 15 –z 10, allowing supplementary and secondary alignments for each read. We remove low mapping quality (mapping quality = 0) alignments. In addition, for each set of alignments with identical positions on the read, we take the single highest scoring alignment. The remaining alignments are used to build a weighted directed acyclic graph whose optimal traversal determines the set of final, filtered alignments (that is, monomers) associated with the read (that is, concatemer).

The nodes of this directed acyclic graph G = (V, E) are composed of the alignments as well two additional nodes (START, END) representing the start and end, respectively, of the read. Each node a ∈ V is associated with a start s(a) and end e(a) position on the read. Each alignment a ∈ V is also associated with an alignment score AS(a). Directed edges (a, b) ∈ E connect an alignment pair a, b ∈ V if e(a) < e(b) or otherwise if e(a) = e(b) and s(a) < s(b). Additional directed edges connect START to each alignment node and each alignment node to END. The resulting graph structure has a topological ordering from the START to the END node.

Each edge (a, b) ∈ E is assigned a weight w((a, b)) as a function of the alignment score and start and end properties of the associated nodes. For edges (a, b), where a is the START node, w((a, b)) = s(b) − AS(b). For edges (a, b), where b is the END node, w((a, b)) = s(b) − e(a). Finally, for all remaining edges (a, b), where both a and b are alignments, w((a, b)) = ∣e(a) − s(b)∣ − AS(b). The Bellman–Ford algorithm is then applied to this weighted graph (G, w) to identify the minimal scoring path from the START to END node.

In this formulation, edges between overlapping or distant alignment pairs are penalized. Similarly, edges connecting to low-scoring alignments will receive a higher weight. As a result, the minimally weighted path, as computed by the Bellman–Ford algorithm, will identify a set of minimally overlapping high-quality alignments that maximally cover the read. These are returned as the set of monomers associated with the given read/concatemer.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions