ReportFpgOutputForObservationalModel needs to only report the barcode alleles#159
ReportFpgOutputForObservationalModel needs to only report the barcode alleles#159stitova-idm wants to merge 2 commits intoEMOD-Hub:mainfrom
Conversation
…eing reported on in the report. We do not want to track other ones (such as PfEMP1, MSP, HRP2, Drug Resistance) if they are present.
There was a problem hiding this comment.
Pull request overview
Updates ReportFpgOutputForObservationalModel so the numpy allele/root outputs only include neutral “barcode” SNPs rather than the full parasite genome loci.
Changes:
- Set reported genome dimension to the number of barcode SNP indices (instead of total base pairs).
- Filter allele and root arrays to only include values at barcode indices when writing numpy output.
- Add a fast-path for cases where the stored sequence/root arrays already appear to be barcode-only.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| }; | ||
|
|
||
| write_data_func write_func_roots = [ this ]( size_t num_bytes, std::ofstream& file ) | ||
| write_data_func write_func_roots = [ this , &r_barcode_indexes, &snp_data_size ]( size_t num_bytes, std::ofstream& file ) |
There was a problem hiding this comment.
Both lambdas capture snp_data_size by reference even though it’s a local variable. Capturing it by value avoids potential lifetime issues (if the callable is stored/used later) and makes the capture intent clearer. Same applies to capturing r_barcode_indexes by reference if you don’t require it.
| if(r_barcode_indexes.size() == m_GenomeList[0].GetNucleotideSequence().size()) | ||
| { | ||
| file.write( (char *)r_genome.GetNucleotideSequence().data(), genome_size ); | ||
| bytes_written += genome_size; | ||
| // in case we are storing only barcode SNPs as the full nucleotide sequence | ||
| for( const ParasiteGenome& r_genome : m_GenomeList ) | ||
| { | ||
| file.write( (char*)r_genome.GetNucleotideSequence().data(), snp_data_size ); | ||
| bytes_written += snp_data_size ; | ||
| } | ||
| } | ||
| else |
There was a problem hiding this comment.
Using r_barcode_indexes.size() == GetNucleotideSequence().size() as a proxy for 'sequence is already barcode-only' can silently produce incorrect output if a non-barcode sequence happens to have the same length as the barcode list (or the storage format changes). A safer approach is to always select by r_barcode_indexes, or gate the fast-path behind an explicit/authoritative indicator (e.g., a config flag or a genome/parasite-genetics API that declares the stored representation is barcode-only).
| // in case we are storing only barcode SNPs as the full nucleotide sequence | ||
| for( const ParasiteGenome& r_genome : m_GenomeList ) | ||
| { | ||
| file.write( (char*)r_genome.GetNucleotideSequence().data(), snp_data_size ); |
There was a problem hiding this comment.
These file.write calls use C-style casts and pass a size_t length. Prefer reinterpret_cast<const char*>(...) (or std::as_bytes-style patterns where applicable) and cast the byte count to std::streamsize to match the iostream API, improving type-safety and avoiding signed/unsigned narrowing pitfalls.
| release_assert( index < r_genome.GetNucleotideSequence().size() ); | ||
| snps.push_back( r_genome.GetNucleotideSequence()[index] ); | ||
| } | ||
| file.write( (char*)snps.data(), snp_data_size ); |
There was a problem hiding this comment.
These file.write calls use C-style casts and pass a size_t length. Prefer reinterpret_cast<const char*>(...) (or std::as_bytes-style patterns where applicable) and cast the byte count to std::streamsize to match the iostream API, improving type-safety and avoiding signed/unsigned narrowing pitfalls.
| // in case we are storing only barcode SNPs as the full nucleotide sequence | ||
| for(const ParasiteGenome& r_genome : m_GenomeList) | ||
| { | ||
| file.write( (char*)r_genome.GetAlleleRoots().data(), snp_data_size ); |
There was a problem hiding this comment.
These file.write calls use C-style casts and pass a size_t length. Prefer reinterpret_cast<const char*>(...) (or std::as_bytes-style patterns where applicable) and cast the byte count to std::streamsize to match the iostream API, improving type-safety and avoiding signed/unsigned narrowing pitfalls.
| release_assert( index < r_genome.GetAlleleRoots().size() ); | ||
| snps.push_back( r_genome.GetAlleleRoots()[index] ); | ||
| } | ||
| file.write( (char*)snps.data(), snp_data_size ); |
There was a problem hiding this comment.
These file.write calls use C-style casts and pass a size_t length. Prefer reinterpret_cast<const char*>(...) (or std::as_bytes-style patterns where applicable) and cast the byte count to std::streamsize to match the iostream API, improving type-safety and avoiding signed/unsigned narrowing pitfalls.
| // --- file such that they can be read as a 2D array. | ||
| // -------------------------------------------------------------------------------- | ||
| write_data_func write_func_alleles = [ this ]( size_t num_bytes, std::ofstream& file ) | ||
| write_data_func write_func_alleles = [this, &r_barcode_indexes, &snp_data_size]( size_t num_bytes, std::ofstream& file ) |
There was a problem hiding this comment.
The allele and root write lambdas duplicate the same filtering logic (size check, allocate snps, loop over genomes, loop over indices, write bytes, update bytes_written). Consider extracting a shared helper that takes a function/ref returning the underlying std::vector<int32_t>& for a genome (nucleotide sequence vs allele roots). This reduces duplication and lowers the chance of future fixes being applied to only one path.
| }; | ||
|
|
||
| write_data_func write_func_roots = [ this ]( size_t num_bytes, std::ofstream& file ) | ||
| write_data_func write_func_roots = [ this , &r_barcode_indexes, &snp_data_size ]( size_t num_bytes, std::ofstream& file ) |
There was a problem hiding this comment.
The allele and root write lambdas duplicate the same filtering logic (size check, allocate snps, loop over genomes, loop over indices, write bytes, update bytes_written). Consider extracting a shared helper that takes a function/ref returning the underlying std::vector<int32_t>& for a genome (nucleotide sequence vs allele roots). This reduces duplication and lowers the chance of future fixes being applied to only one path.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Added code to make sure that only the "neutral"/barcode alleles are being reported on in the report. We do not want to track other ones (such as PfEMP1, MSP, HRP2, Drug Resistance) if they are present.