Add random sampling for read depth estimation. by joshuak94 · Pull Request #80 · seqan/iGenVar

joshuak94 · 2021-03-09T15:44:30Z

This pull request adds the misc subfolder in the include directory. For now it only has the header file for calculating the random positions for sampling. I'm open to suggestions on the organization of this!

Other than that, I created a preliminary random position generator which generates random chromosomes and positions. It isn't perfect, because it's based off of the standard library. But for an initial implementation it should suffice.

The random positions are basically pre-calculated and sorted, and then the depth for these positions are calculated on the fly as the reads overlap them. This assumes the BAM file is of course sorted.

joshuak94 · 2021-03-09T15:45:41Z

Eventually I'd like to implement estimations for insert length for paired end reads and read lengths in general.

joshuak94 · 2021-03-09T15:49:20Z

Also, this currently adds any read overlapping a position to the average calculation, regardless of the CIGAR at that exact position. So a read could actually be soft clipped/deletion at this base position and it will still count. I figured this would still be okay because it's only an approximation, but otherwise I would do something along these lines:

if (read_overlaps_pos(cigar, offset, std::get<1>rand_sample.front())
{
     ++cur_depth;
}

And then in read_overlaps_pos I can look specifically at what the CIGAR character is at this position. But I guess this will make things a little less efficient.

Irallia

I have already looked over it once. But I think this is a PR that can have as 1st review someone from the seqan team.

src/detect_breakends.cpp

Irallia · 2021-03-15T11:54:21Z

src/detect_breakends/junction_detection.cpp

+        read_length = get_read_length(cigar);
+        // If there are positions to be sampled, then sample them.
+        if (!rand_sample.empty() && sample_size > 0)
+        {
+            if (ref_id == std::get<0>(rand_sample.front()) &&           // Correct chromosome.
+                pos <= std::get<1>(rand_sample.front()) &&              // Read starts before or at the position.
+                std::get<1>(rand_sample.front()) <= pos + read_length)  // Read ends after or at position.
+            {
+                ++cur_depth;
+            }
+            else if (ref_id > std::get<0>(rand_sample.front()) ||       // Record is at next chromosome
+                     (ref_id == std::get<0>(rand_sample.front()) &&     // Record is at same chromosome BUT
+                      pos > std::get<1>(rand_sample.front())))          // in a later position.
+            {
+                ++num_pos;
+                avg_depth += cur_depth;
+                cur_depth = 0;
+                rand_sample.pop_front();
+            }
+        }


Could you create a calculate_average_read_depth() or something similar?
And would it be possible to build a small api test for this function?

I'm not sure if it's such a good idea to put this into a separate function. I mean for one, it's basically just three if statements. The other thing is that I'm computing the depth at the current position, and then when we move to the next position this gets added to the average and then the current depth is zeroed out. So I'd have to pass all of these variables in and I think it would just get more confusing jumping back and forth between functions.

My general feeling is that decoupling is always better and "just" 3 or 4 or 5 or 20 if branches can change pretty fast.

src/detect_breakends/junction_detection.cpp

marehr

Some remarks, no hard feeling, you can also just ignore it :)

marehr · 2021-03-22T16:14:38Z

include/misc/statistics.hpp

+ */
+std::forward_list<std::pair<int32_t, int32_t>> get_random_positions(uint64_t sample_size, seqan3::sam_file_header<std::deque<std::string>> & header)
+{
+    std::forward_list<std::pair<int32_t, int32_t>> output{};


Is this performance critical? I wouldn't use forward lists otherwise.

Suggested change

std::forward_list<std::pair<int32_t, int32_t>> output{};

std::vector<std::pair<int32_t, int32_t>> random_positions(sample_size); // pre-allocate sample-size many

marehr · 2021-03-22T16:16:20Z

include/misc/statistics.hpp

+    std::forward_list<std::pair<int32_t, int32_t>> output{};
+    if (sample_size == 0)
+    {
+        output.push_front(std::pair<int32_t, int32_t>{-1, -1});


Suggested change

output.push_front(std::pair<int32_t, int32_t>{-1, -1});

output.push_front(std::pair<int32_t, int32_t>{-1, -1});

return output;

this makes it clear that this is a special case.

marehr · 2021-03-22T16:21:33Z

include/misc/statistics.hpp

+    for (int i = 0; i < sample_size; ++i)
+    {
+        rand_chr = std::rand() % header.ref_ids().size();
+        rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
+
+        output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});
+    }


Suggested change

for (int i = 0; i < sample_size; ++i)

{

rand_chr = std::rand() % header.ref_ids().size();

rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);

output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});

}

```suggestion

std::ranges::generate(random_positions, [&header]()

{

int32_t rand_chr = std::rand() % header.ref_ids().size();

int32_t rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);

return std::pair<int32_t, int32_t>{rand_chr, rand_pos});

});

marehr · 2021-03-22T16:24:51Z

include/misc/statistics.hpp

+
+    for (int i = 0; i < sample_size; ++i)
+    {
+        rand_chr = std::rand() % header.ref_ids().size();


You should consider using the "real" random module.

https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution#Example

That would also make clear what your distribution is.

Furthermore, I would require passing in a random generator into this function so that the "randomness" can be controlled from the outside.

marehr · 2021-03-22T16:30:36Z

include/misc/statistics.hpp

+    uint32_t result{0};
+    for (auto e : cigar)
+    {
+        result += get<0>(e);


Is this correct?

The sam specification says:

Op BAM Description Consumes query Consumes reference

M 0 alignment match (can be a sequence match or mismatch) yes yes

I 1 insertion to the reference yes no

D 2 deletion from the reference no yes

N 3 skipped region from the reference no yes

S 4 soft clipping (clipped sequences present in SEQ) yes no

H 5 hard clipping (clipped sequences NOT present in SEQ) no no

P 6 padding (silent deletion from padded reference) no no

= 7 sequence match yes yes

X 8 sequence mismatch yes yes

“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.

So I would say that you over-count the read length.

https://samtools.github.io/hts-specs/SAMv1.pdf#page=8

marehr · 2021-03-22T16:33:44Z

src/variant_detection/variant_detection.cpp

+        // If there are positions to be sampled, then sample them.
+        if (!rand_sample.empty() && sample_size > 0)
+        {
+            if (ref_id == std::get<0>(rand_sample.front()) &&           // Correct chromosome.


Suggested change

if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.

auto && [chr, pod] = rand_sample.front();

if (ref_id == chr) && // Correct chromosome.

give the pair content a name that makes it easier to understand.

marehr · 2021-03-22T16:34:05Z

src/variant_detection/variant_detection.cpp

+        if (!rand_sample.empty() && sample_size > 0)
+        {
+            if (ref_id == std::get<0>(rand_sample.front()) &&           // Correct chromosome.
+                pos <= std::get<1>(rand_sample.front()) &&              // Read starts before or at the position.


again, not all cigar length's are for reads, they can also only affect the sequence.

marehr · 2021-03-22T16:37:10Z

src/variant_detection/variant_detection.cpp

+                ++num_pos;
+                avg_depth += cur_depth;
+                cur_depth = 0;
+                rand_sample.pop_front();


I'm not completely sure what this does and why you consume your "random" sample throughout the loop without "refreshing" it.

What happens if your random sample doesn't get fully consumed, but your next reference starts? Is that possible?

joshuak94 requested a review from Irallia March 9, 2021 15:49

joshuak94 force-pushed the readdepth_estimate branch from 3ebc91e to f90c9a4 Compare March 10, 2021 14:54

Irallia reviewed Mar 15, 2021

View reviewed changes

Irallia requested review from a team and marehr and removed request for a team March 15, 2021 12:08

joshuak94 force-pushed the readdepth_estimate branch from 359bb29 to c9c2c98 Compare March 16, 2021 09:46

Add random sampling for read depth estimation.

27cc62e

joshuak94 force-pushed the readdepth_estimate branch from c9c2c98 to 27cc62e Compare March 16, 2021 09:49

marehr reviewed Mar 22, 2021

View reviewed changes

	std::forward_list<std::pair<int32_t, int32_t>> output{};
	std::vector<std::pair<int32_t, int32_t>> random_positions(sample_size); // pre-allocate sample-size many

	output.push_front(std::pair<int32_t, int32_t>{-1, -1});
	output.push_front(std::pair<int32_t, int32_t>{-1, -1});
	return output;

-    for (int i = 0; i < sample_size; ++i)
-    {
-        rand_chr = std::rand() % header.ref_ids().size();
-        rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
-        output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});
-    }
+```suggestion
+    std::ranges::generate(random_positions, [&header]()
+    {
+        int32_t rand_chr = std::rand() % header.ref_ids().size();
+        int32_t rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
+        return std::pair<int32_t, int32_t>{rand_chr, rand_pos});
+    });

Op	BAM	Description	Consumes query	Consumes reference
M	0	alignment match (can be a sequence match or mismatch)	yes	yes
I	1	insertion to the reference	yes	no
D	2	deletion from the reference	no	yes
N	3	skipped region from the reference	no	yes
S	4	soft clipping (clipped sequences present in SEQ)	yes	no
H	5	hard clipping (clipped sequences NOT present in SEQ)	no	no
P	6	padding (silent deletion from padded reference)	no	no
=	7	sequence match	yes	yes
X	8	sequence mismatch	yes	yes

	if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
	auto && [chr, pod] = rand_sample.front();
	if (ref_id == chr) && // Correct chromosome.

Conversation

joshuak94 commented Mar 9, 2021

Uh oh!

joshuak94 commented Mar 9, 2021

Uh oh!

joshuak94 commented Mar 9, 2021

Uh oh!

Irallia left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

marehr left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants