Add random sampling for read depth estimation.#80
Add random sampling for read depth estimation.#80joshuak94 wants to merge 1 commit intoseqan:masterfrom
Conversation
|
Eventually I'd like to implement estimations for insert length for paired end reads and read lengths in general. |
|
Also, this currently adds any read overlapping a position to the average calculation, regardless of the CIGAR at that exact position. So a read could actually be soft clipped/deletion at this base position and it will still count. I figured this would still be okay because it's only an approximation, but otherwise I would do something along these lines: if (read_overlaps_pos(cigar, offset, std::get<1>rand_sample.front())
{
++cur_depth;
}And then in read_overlaps_pos I can look specifically at what the CIGAR character is at this position. But I guess this will make things a little less efficient. |
3ebc91e to
f90c9a4
Compare
Irallia
left a comment
There was a problem hiding this comment.
I have already looked over it once. But I think this is a PR that can have as 1st review someone from the seqan team.
| read_length = get_read_length(cigar); | ||
| // If there are positions to be sampled, then sample them. | ||
| if (!rand_sample.empty() && sample_size > 0) | ||
| { | ||
| if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome. | ||
| pos <= std::get<1>(rand_sample.front()) && // Read starts before or at the position. | ||
| std::get<1>(rand_sample.front()) <= pos + read_length) // Read ends after or at position. | ||
| { | ||
| ++cur_depth; | ||
| } | ||
| else if (ref_id > std::get<0>(rand_sample.front()) || // Record is at next chromosome | ||
| (ref_id == std::get<0>(rand_sample.front()) && // Record is at same chromosome BUT | ||
| pos > std::get<1>(rand_sample.front()))) // in a later position. | ||
| { | ||
| ++num_pos; | ||
| avg_depth += cur_depth; | ||
| cur_depth = 0; | ||
| rand_sample.pop_front(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Could you create a calculate_average_read_depth() or something similar?
And would it be possible to build a small api test for this function?
There was a problem hiding this comment.
I'm not sure if it's such a good idea to put this into a separate function. I mean for one, it's basically just three if statements. The other thing is that I'm computing the depth at the current position, and then when we move to the next position this gets added to the average and then the current depth is zeroed out. So I'd have to pass all of these variables in and I think it would just get more confusing jumping back and forth between functions.
There was a problem hiding this comment.
My general feeling is that decoupling is always better and "just" 3 or 4 or 5 or 20 if branches can change pretty fast.
359bb29 to
c9c2c98
Compare
c9c2c98 to
27cc62e
Compare
marehr
left a comment
There was a problem hiding this comment.
Some remarks, no hard feeling, you can also just ignore it :)
| */ | ||
| std::forward_list<std::pair<int32_t, int32_t>> get_random_positions(uint64_t sample_size, seqan3::sam_file_header<std::deque<std::string>> & header) | ||
| { | ||
| std::forward_list<std::pair<int32_t, int32_t>> output{}; |
There was a problem hiding this comment.
Is this performance critical? I wouldn't use forward lists otherwise.
There was a problem hiding this comment.
| std::forward_list<std::pair<int32_t, int32_t>> output{}; | |
| std::vector<std::pair<int32_t, int32_t>> random_positions(sample_size); // pre-allocate sample-size many |
| std::forward_list<std::pair<int32_t, int32_t>> output{}; | ||
| if (sample_size == 0) | ||
| { | ||
| output.push_front(std::pair<int32_t, int32_t>{-1, -1}); |
There was a problem hiding this comment.
| output.push_front(std::pair<int32_t, int32_t>{-1, -1}); | |
| output.push_front(std::pair<int32_t, int32_t>{-1, -1}); | |
| return output; |
this makes it clear that this is a special case.
| for (int i = 0; i < sample_size; ++i) | ||
| { | ||
| rand_chr = std::rand() % header.ref_ids().size(); | ||
| rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]); | ||
|
|
||
| output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos}); | ||
| } |
There was a problem hiding this comment.
| for (int i = 0; i < sample_size; ++i) | |
| { | |
| rand_chr = std::rand() % header.ref_ids().size(); | |
| rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]); | |
| output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos}); | |
| } | |
| ```suggestion | |
| std::ranges::generate(random_positions, [&header]() | |
| { | |
| int32_t rand_chr = std::rand() % header.ref_ids().size(); | |
| int32_t rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]); | |
| return std::pair<int32_t, int32_t>{rand_chr, rand_pos}); | |
| }); |
|
|
||
| for (int i = 0; i < sample_size; ++i) | ||
| { | ||
| rand_chr = std::rand() % header.ref_ids().size(); |
There was a problem hiding this comment.
You should consider using the "real" random module.
https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution#Example
That would also make clear what your distribution is.
Furthermore, I would require passing in a random generator into this function so that the "randomness" can be controlled from the outside.
| uint32_t result{0}; | ||
| for (auto e : cigar) | ||
| { | ||
| result += get<0>(e); |
There was a problem hiding this comment.
Is this correct?
The sam specification says:
| Op | BAM | Description | Consumes query | Consumes reference |
|---|---|---|---|---|
| M | 0 | alignment match (can be a sequence match or mismatch) | yes | yes |
| I | 1 | insertion to the reference | yes | no |
| D | 2 | deletion from the reference | no | yes |
| N | 3 | skipped region from the reference | no | yes |
| S | 4 | soft clipping (clipped sequences present in SEQ) | yes | no |
| H | 5 | hard clipping (clipped sequences NOT present in SEQ) | no | no |
| P | 6 | padding (silent deletion from padded reference) | no | no |
| = | 7 | sequence match | yes | yes |
| X | 8 | sequence mismatch | yes | yes |
“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.
So I would say that you over-count the read length.
| // If there are positions to be sampled, then sample them. | ||
| if (!rand_sample.empty() && sample_size > 0) | ||
| { | ||
| if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome. |
There was a problem hiding this comment.
| if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome. | |
| auto && [chr, pod] = rand_sample.front(); | |
| if (ref_id == chr) && // Correct chromosome. |
give the pair content a name that makes it easier to understand.
| if (!rand_sample.empty() && sample_size > 0) | ||
| { | ||
| if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome. | ||
| pos <= std::get<1>(rand_sample.front()) && // Read starts before or at the position. |
There was a problem hiding this comment.
again, not all cigar length's are for reads, they can also only affect the sequence.
| ++num_pos; | ||
| avg_depth += cur_depth; | ||
| cur_depth = 0; | ||
| rand_sample.pop_front(); |
There was a problem hiding this comment.
I'm not completely sure what this does and why you consume your "random" sample throughout the loop without "refreshing" it.
What happens if your random sample doesn't get fully consumed, but your next reference starts? Is that possible?
This pull request adds the misc subfolder in the include directory. For now it only has the header file for calculating the random positions for sampling. I'm open to suggestions on the organization of this!
Other than that, I created a preliminary random position generator which generates random chromosomes and positions. It isn't perfect, because it's based off of the standard library. But for an initial implementation it should suffice.
The random positions are basically pre-calculated and sorted, and then the depth for these positions are calculated on the fly as the reads overlap them. This assumes the BAM file is of course sorted.