Skip to content

Add random sampling for read depth estimation.#80

Open
joshuak94 wants to merge 1 commit intoseqan:masterfrom
joshuak94:readdepth_estimate
Open

Add random sampling for read depth estimation.#80
joshuak94 wants to merge 1 commit intoseqan:masterfrom
joshuak94:readdepth_estimate

Conversation

@joshuak94
Copy link
Copy Markdown
Collaborator

This pull request adds the misc subfolder in the include directory. For now it only has the header file for calculating the random positions for sampling. I'm open to suggestions on the organization of this!

Other than that, I created a preliminary random position generator which generates random chromosomes and positions. It isn't perfect, because it's based off of the standard library. But for an initial implementation it should suffice.

The random positions are basically pre-calculated and sorted, and then the depth for these positions are calculated on the fly as the reads overlap them. This assumes the BAM file is of course sorted.

@joshuak94
Copy link
Copy Markdown
Collaborator Author

Eventually I'd like to implement estimations for insert length for paired end reads and read lengths in general.

@joshuak94
Copy link
Copy Markdown
Collaborator Author

Also, this currently adds any read overlapping a position to the average calculation, regardless of the CIGAR at that exact position. So a read could actually be soft clipped/deletion at this base position and it will still count. I figured this would still be okay because it's only an approximation, but otherwise I would do something along these lines:

if (read_overlaps_pos(cigar, offset, std::get<1>rand_sample.front())
{
     ++cur_depth;
}

And then in read_overlaps_pos I can look specifically at what the CIGAR character is at this position. But I guess this will make things a little less efficient.

@joshuak94 joshuak94 requested a review from Irallia March 9, 2021 15:49
@joshuak94 joshuak94 force-pushed the readdepth_estimate branch from 3ebc91e to f90c9a4 Compare March 10, 2021 14:54
Copy link
Copy Markdown
Collaborator

@Irallia Irallia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already looked over it once. But I think this is a PR that can have as 1st review someone from the seqan team.

Comment on lines +69 to +88
read_length = get_read_length(cigar);
// If there are positions to be sampled, then sample them.
if (!rand_sample.empty() && sample_size > 0)
{
if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
pos <= std::get<1>(rand_sample.front()) && // Read starts before or at the position.
std::get<1>(rand_sample.front()) <= pos + read_length) // Read ends after or at position.
{
++cur_depth;
}
else if (ref_id > std::get<0>(rand_sample.front()) || // Record is at next chromosome
(ref_id == std::get<0>(rand_sample.front()) && // Record is at same chromosome BUT
pos > std::get<1>(rand_sample.front()))) // in a later position.
{
++num_pos;
avg_depth += cur_depth;
cur_depth = 0;
rand_sample.pop_front();
}
}
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you create a calculate_average_read_depth() or something similar?
And would it be possible to build a small api test for this function?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if it's such a good idea to put this into a separate function. I mean for one, it's basically just three if statements. The other thing is that I'm computing the depth at the current position, and then when we move to the next position this gets added to the average and then the current depth is zeroed out. So I'd have to pass all of these variables in and I think it would just get more confusing jumping back and forth between functions.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My general feeling is that decoupling is always better and "just" 3 or 4 or 5 or 20 if branches can change pretty fast.

@Irallia Irallia requested review from a team and marehr and removed request for a team March 15, 2021 12:08
@joshuak94 joshuak94 force-pushed the readdepth_estimate branch from 359bb29 to c9c2c98 Compare March 16, 2021 09:46
@joshuak94 joshuak94 force-pushed the readdepth_estimate branch from c9c2c98 to 27cc62e Compare March 16, 2021 09:49
Copy link
Copy Markdown
Member

@marehr marehr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some remarks, no hard feeling, you can also just ignore it :)

*/
std::forward_list<std::pair<int32_t, int32_t>> get_random_positions(uint64_t sample_size, seqan3::sam_file_header<std::deque<std::string>> & header)
{
std::forward_list<std::pair<int32_t, int32_t>> output{};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this performance critical? I wouldn't use forward lists otherwise.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::forward_list<std::pair<int32_t, int32_t>> output{};
std::vector<std::pair<int32_t, int32_t>> random_positions(sample_size); // pre-allocate sample-size many

std::forward_list<std::pair<int32_t, int32_t>> output{};
if (sample_size == 0)
{
output.push_front(std::pair<int32_t, int32_t>{-1, -1});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output.push_front(std::pair<int32_t, int32_t>{-1, -1});
output.push_front(std::pair<int32_t, int32_t>{-1, -1});
return output;

this makes it clear that this is a special case.

Comment on lines +24 to +30
for (int i = 0; i < sample_size; ++i)
{
rand_chr = std::rand() % header.ref_ids().size();
rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);

output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (int i = 0; i < sample_size; ++i)
{
rand_chr = std::rand() % header.ref_ids().size();
rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});
}
```suggestion
std::ranges::generate(random_positions, [&header]()
{
int32_t rand_chr = std::rand() % header.ref_ids().size();
int32_t rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
return std::pair<int32_t, int32_t>{rand_chr, rand_pos});
});


for (int i = 0; i < sample_size; ++i)
{
rand_chr = std::rand() % header.ref_ids().size();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should consider using the "real" random module.

https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution#Example

That would also make clear what your distribution is.

Furthermore, I would require passing in a random generator into this function so that the "randomness" can be controlled from the outside.

uint32_t result{0};
for (auto e : cigar)
{
result += get<0>(e);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct?

The sam specification says:

Op BAM Description Consumes query Consumes reference
M 0 alignment match (can be a sequence match or mismatch) yes yes
I 1 insertion to the reference yes no
D 2 deletion from the reference no yes
N 3 skipped region from the reference no yes
S 4 soft clipping (clipped sequences present in SEQ) yes no
H 5 hard clipping (clipped sequences NOT present in SEQ) no no
P 6 padding (silent deletion from padded reference) no no
= 7 sequence match yes yes
X 8 sequence mismatch yes yes

“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.

So I would say that you over-count the read length.

https://samtools.github.io/hts-specs/SAMv1.pdf#page=8

// If there are positions to be sampled, then sample them.
if (!rand_sample.empty() && sample_size > 0)
{
if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
auto && [chr, pod] = rand_sample.front();
if (ref_id == chr) && // Correct chromosome.

give the pair content a name that makes it easier to understand.

if (!rand_sample.empty() && sample_size > 0)
{
if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
pos <= std::get<1>(rand_sample.front()) && // Read starts before or at the position.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, not all cigar length's are for reads, they can also only affect the sequence.

++num_pos;
avg_depth += cur_depth;
cur_depth = 0;
rand_sample.pop_front();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure what this does and why you consume your "random" sample throughout the loop without "refreshing" it.

What happens if your random sample doesn't get fully consumed, but your next reference starts? Is that possible?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants