Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 53 additions & 0 deletions include/misc/statistics.hpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
#pragma once

#include <forward_list>

#include <seqan3/io/sam_file/input.hpp>

/*! \brief Returns a list of pairs, where the first integer is a random chromosome and the second is a random position.
* The list is sorted by default.
* \param sample_size - The number of positions to generate.
* \param header - The header from the SAM file input.
*
* \return Returns a sorted forward_list of pairs of reference chromosome IDs and positions.
*/
std::forward_list<std::pair<int32_t, int32_t>> get_random_positions(uint64_t sample_size, seqan3::sam_file_header<std::deque<std::string>> & header)
{
std::forward_list<std::pair<int32_t, int32_t>> output{};
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this performance critical? I wouldn't use forward lists otherwise.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::forward_list<std::pair<int32_t, int32_t>> output{};
std::vector<std::pair<int32_t, int32_t>> random_positions(sample_size); // pre-allocate sample-size many

if (sample_size == 0)
{
output.push_front(std::pair<int32_t, int32_t>{-1, -1});
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
output.push_front(std::pair<int32_t, int32_t>{-1, -1});
output.push_front(std::pair<int32_t, int32_t>{-1, -1});
return output;

this makes it clear that this is a special case.

}
int32_t rand_chr{};
int32_t rand_pos{};

for (int i = 0; i < sample_size; ++i)
{
rand_chr = std::rand() % header.ref_ids().size();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should consider using the "real" random module.

https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution#Example

That would also make clear what your distribution is.

Furthermore, I would require passing in a random generator into this function so that the "randomness" can be controlled from the outside.

rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);

output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});
}
Comment on lines +24 to +30
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for (int i = 0; i < sample_size; ++i)
{
rand_chr = std::rand() % header.ref_ids().size();
rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
output.push_front(std::pair<int32_t, int32_t>{rand_chr, rand_pos});
}
```suggestion
std::ranges::generate(random_positions, [&header]()
{
int32_t rand_chr = std::rand() % header.ref_ids().size();
int32_t rand_pos = std::rand() % std::get<0>(header.ref_id_info[rand_chr]);
return std::pair<int32_t, int32_t>{rand_chr, rand_pos});
});


output.sort();
return output;
}

/*! \brief Get the read length of a read from its cigar string.
* \param cigar - The cigar according to the seqan3 representation (a vector of cigar alphabet).
*
* \return Returns the length of the cigar string.
*/

uint32_t get_read_length(std::vector<seqan3::cigar> & cigar)
{
using seqan3::get;

uint32_t result{0};
for (auto e : cigar)
{
result += get<0>(e);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct?

The sam specification says:

Op BAM Description Consumes query Consumes reference
M 0 alignment match (can be a sequence match or mismatch) yes yes
I 1 insertion to the reference yes no
D 2 deletion from the reference no yes
N 3 skipped region from the reference no yes
S 4 soft clipping (clipped sequences present in SEQ) yes no
H 5 hard clipping (clipped sequences NOT present in SEQ) no no
P 6 padding (silent deletion from padded reference) no no
= 7 sequence match yes yes
X 8 sequence mismatch yes yes

“Consumes query” and “consumes reference” indicate whether the CIGAR operation causes the alignment to step along the query sequence and the reference sequence respectively.

So I would say that you over-count the read length.

https://samtools.github.io/hts-specs/SAMv1.pdf#page=8

}

return result;
}
4 changes: 3 additions & 1 deletion include/variant_detection/variant_detection.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@
* 2: sVirl_refinement_method)
* \param min_var_length - minimum length of variants to detect (default 30 bp)
* \param output_file_path output file - path for the VCF file
* \param sample_size - how many positions to sample for read depth estimation.
*
* \details Detects novel junctions from read alignment records using different detection methods.
* The junctions are clustered using one of several clustering methods.
Expand All @@ -36,4 +37,5 @@ void detect_variants_in_alignment_file(const std::filesystem::path & alignment_f
const clustering_methods & clustering_method,
const refinement_methods & refinement_method,
const uint64_t & min_var_length,
const std::filesystem::path & output_file_path);
const std::filesystem::path & output_file_path,
const uint64_t & sample_size);
9 changes: 8 additions & 1 deletion src/iGenVar.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -64,6 +64,7 @@ struct cmd_arguments
clustering_methods clustering_method{simple_clustering}; // default: simple clustering method
refinement_methods refinement_method{no_refinement}; // default: no refinement
uint64_t min_var_length = 30;
uint64_t sample_size{100000}; // default is to sample 100000
};

void initialize_argument_parser(seqan3::argument_parser & parser, cmd_arguments & args)
Expand Down Expand Up @@ -115,6 +116,11 @@ void initialize_argument_parser(seqan3::argument_parser & parser, cmd_arguments
parser.add_option(args.min_var_length, 'l', "min_var_length",
"Specify what should be the minimum length of your SVs to be detected (default 30 bp).",
seqan3::option_spec::advanced);

// Options - Miscellaneous
parser.add_option(args.sample_size, 's', "sample_size",
"This value is used in sampling positions for average read depth, insert size, and read length. Setting to 0 will skip sampling.",
seqan3::option_spec::advanced);
}

int main(int argc, char ** argv)
Expand Down Expand Up @@ -151,7 +157,8 @@ int main(int argc, char ** argv)
args.clustering_method,
args.refinement_method,
args.min_var_length,
args.output_file_path);
args.output_file_path,
args.sample_size);

return 0;
}
38 changes: 37 additions & 1 deletion src/variant_detection/variant_detection.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@
#include <seqan3/io/sam_file/input.hpp> // SAM/BAM support
#include <seqan3/io/sequence_file/output.hpp> // FASTA support

#include "misc/statistics.hpp" // for calculating average read depth
#include "modules/clustering/simple_clustering_method.hpp" // for the simple clustering method
#include "modules/sv_detection_methods/analyze_cigar_method.hpp" // for the split read method
#include "modules/sv_detection_methods/analyze_sa_tag_method.hpp" // for the cigar string method
Expand All @@ -20,7 +21,8 @@ void detect_variants_in_alignment_file(const std::filesystem::path & alignment_f
const clustering_methods & clustering_method,
const refinement_methods & refinement_method,
const uint64_t & min_var_length,
const std::filesystem::path & output_file_path)
const std::filesystem::path & output_file_path,
const uint64_t & sample_size)
{
// Open input alignment file
using my_fields = seqan3::fields<seqan3::field::id,
Expand All @@ -42,6 +44,12 @@ void detect_variants_in_alignment_file(const std::filesystem::path & alignment_f
std::vector<seqan3::dna5_vector> insertion_alleles{};
uint16_t num_good = 0;

// Obtain list of random chromosomes and positions to get read depth from.
std::forward_list<std::pair<int32_t, int32_t>> rand_sample = get_random_positions(sample_size, alignment_file.header());
uint32_t avg_depth{0};
uint32_t cur_depth{0};
uint32_t num_pos{0};

for (auto & rec : alignment_file)
{
const std::string query_name = seqan3::get<seqan3::field::id>(rec); // 1: QNAME
Expand All @@ -60,6 +68,26 @@ void detect_variants_in_alignment_file(const std::filesystem::path & alignment_f
continue;

const std::string ref_name = ref_ids[ref_id];
uint32_t read_length{get_read_length(cigar)};
// If there are positions to be sampled, then sample them.
if (!rand_sample.empty() && sample_size > 0)
{
if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (ref_id == std::get<0>(rand_sample.front()) && // Correct chromosome.
auto && [chr, pod] = rand_sample.front();
if (ref_id == chr) && // Correct chromosome.

give the pair content a name that makes it easier to understand.

pos <= std::get<1>(rand_sample.front()) && // Read starts before or at the position.
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, not all cigar length's are for reads, they can also only affect the sequence.

std::get<1>(rand_sample.front()) <= pos + read_length) // Read ends after or at position.
{
++cur_depth;
}
else if (ref_id > std::get<0>(rand_sample.front()) || // Record is at next chromosome
(ref_id == std::get<0>(rand_sample.front()) && // Record is at same chromosome BUT
pos > std::get<1>(rand_sample.front()))) // in a later position.
{
++num_pos;
avg_depth += cur_depth;
cur_depth = 0;
rand_sample.pop_front();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not completely sure what this does and why you consume your "random" sample throughout the loop without "refreshing" it.

What happens if your random sample doesn't get fully consumed, but your next reference starts? Is that possible?

}
}
for (uint8_t method : methods) {
switch (method)
{
Expand Down Expand Up @@ -101,6 +129,14 @@ void detect_variants_in_alignment_file(const std::filesystem::path & alignment_f
seqan3::debug_stream << num_good << " good alignments" << std::endl;
}
}
if (sample_size == 0 || num_pos == 0) // User requested no sampling, or all of the random positions generated were not covered.
{
avg_depth = 0;
}
else
{
avg_depth = avg_depth/num_pos;
}
std::sort(junctions.begin(), junctions.end());

seqan3::debug_stream << "Start clustering...\n";
Expand Down
10 changes: 7 additions & 3 deletions test/api/junction_detection_methods_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,7 @@ std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // g
std::filesystem::path empty_path{};
const std::vector<detection_methods> all_methods{cigar_string, split_read, read_pairs, read_depth};
const uint64_t sv_default_length = 30;
const uint64_t sample_size = 10;

// Explanation for the strings:
// Reference\tm2257/8161/CCS\t41972616\tForward\tRead\t0\t2294\tForward\tchr21
Expand Down Expand Up @@ -73,7 +74,8 @@ TEST(junction_detection, single_method_only)
simple_clustering,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

//TODO (eldariont): Currently, this (CLI-like) test compares the stdout of the method with an expected string in VCF format.
//We need to replace this with real API tests that check inidividual parts of the pipeline.
Expand Down Expand Up @@ -121,7 +123,8 @@ TEST(junction_detection, method_pairs)
simple_clustering,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

//TODO (eldariont): Currently, this (CLI-like) test compares the stdout of the method with an expected string in VCF format.
//We need to replace this with real API tests that check inidividual parts of the pipeline.
Expand Down Expand Up @@ -184,7 +187,8 @@ TEST(junction_detection, method_triples)
simple_clustering,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

//TODO (eldariont): Currently, this (CLI-like) test compares the stdout of the method with an expected string in VCF format.
//We need to replace this with real API tests that check inidividual parts of the pipeline.
Expand Down
19 changes: 13 additions & 6 deletions test/api/junction_detection_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ std::filesystem::path tmp_dir = std::filesystem::temp_directory_path(); // g
std::filesystem::path empty_path{};
const std::vector<detection_methods> default_methods{cigar_string, split_read, read_pairs, read_depth};
const uint64_t sv_default_length = 30;
const uint64_t sample_size = 10;

// Explanation for the strings:
// Reference\tchr21\t41972616\tForward\tRead\t0\t2294\tForward\t1
Expand Down Expand Up @@ -56,7 +57,8 @@ TEST(junction_detection, fasta_out_not_empty)
simple_clustering,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

// check_output_and_cleanup(expected_res);
check_output_and_cleanup(empty_res);
Expand All @@ -75,7 +77,8 @@ TEST(junction_detection, clustering_method_hierarchical)
hierarchical_clustering,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

// check_output_and_cleanup("");
check_output_and_cleanup(empty_res);
Expand All @@ -92,7 +95,8 @@ TEST(junction_detection, clustering_method_self_balancing_binary_tree)
self_balancing_binary_tree,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

// check_output_and_cleanup("");
check_output_and_cleanup(empty_res);
Expand All @@ -109,7 +113,8 @@ TEST(junction_detection, clustering_method_candidate_selection_based_on_voting)
candidate_selection_based_on_voting,
no_refinement,
sv_default_length,
empty_path);
empty_path,
sample_size);

// check_output_and_cleanup("");
check_output_and_cleanup(empty_res);
Expand All @@ -128,7 +133,8 @@ TEST(junction_detection, refinement_method_sViper)
simple_clustering,
sViper_refinement_method,
sv_default_length,
empty_path);
empty_path,
sample_size);

// check_output_and_cleanup(expected_res);
check_output_and_cleanup(empty_res);
Expand All @@ -145,7 +151,8 @@ TEST(junction_detection, refinement_method_sVirl)
simple_clustering,
sVirl_refinement_method,
sv_default_length,
empty_path);
empty_path,
sample_size);

// check_output_and_cleanup(expected_res);
check_output_and_cleanup(empty_res);
Expand Down
4 changes: 4 additions & 0 deletions test/cli/iGenVar_cli_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -82,6 +82,10 @@ const std::string help_page_advanced
" -l, --min_var_length (unsigned 64 bit integer)\n"
" Specify what should be the minimum length of your SVs to be detected\n"
" (default 30 bp). Default: 30.\n"
" -s, --sample_size (unsigned 64 bit integer)\n"
" This value is used in sampling positions for average read depth,\n"
" insert size, and read length. Setting to 0 will skip sampling.\n"
" Default: 100000.\n"
};

std::string expected_res
Expand Down
2 changes: 1 addition & 1 deletion test/data/datasources.cmake
Original file line number Diff line number Diff line change
Expand Up @@ -9,7 +9,7 @@ include (cmake/app_datasources.cmake)
# head -20999 simulated.minimap2.hg19.coordsorted.sam | tail -4 > simulated.minimap2.hg19.coordsorted_cutoff.sam
declare_datasource (FILE simulated.minimap2.hg19.coordsorted_cutoff.sam
URL ${CMAKE_SOURCE_DIR}/test/data/simulated.minimap2.hg19.coordsorted_cutoff.sam
URL_HASH SHA256=af188e36dfab9be4d351985bca8d9b103d64cb859f422cb2c7671b220ccc9f7a)
URL_HASH SHA256=2423106a8d7d72508ab9009a07f2dc814fa3b01b3983d5975339004fd4b1f9df)

# copies file to <build>/data/detect_breakends_shorted.vcf
declare_datasource (FILE detect_breakends_shorted.vcf
Expand Down
Loading