@MrTomRod
Contributor

@MrTomRod MrTomRod commented Dec 5, 2025

Description

This PR modifies the genome_size helper command to support LJA as an alternative assembler. Previously, the command hardcoded the use of Raven.

Motivation: LJA is significantly faster than Raven, though it may be usable only with PacBio HiFi data.

Usage Example:

# Default behavior (uses Raven)
autocycler helper genome_size --reads reads.fastq.gz

# Use LJA (faster for HiFi)
autocycler helper genome_size --reads reads.fastq.gz --assembler lja

Performance

Benchmarking on a test dataset shows LJA is approximately 6x faster in wall-clock time and 9x faster in CPU time compared to Raven:

| Assembler | Real time | User time | Sys time |
|-----------|-----------|-----------|----------|
| LJA | 1m39.646s | 9m24.194s | 0m6.220s |
| Raven | 10m39.524s | 82m46.248s | 0m14.592s |
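
The speedup figures can be re-derived from the `time` output above. This is a minimal sketch, not part of the PR; `parse_time` and `speedup` are hypothetical helper names, and the input format is assumed to be the shell's `XmY.YYYs` style shown in the table.

```rust
/// Parse a shell `time` duration such as "1m39.646s" into seconds.
/// Returns None if the string is not in the `<minutes>m<seconds>s` form.
pub fn parse_time(s: &str) -> Option<f64> {
    let (minutes, rest) = s.split_once('m')?;
    let seconds = rest.strip_suffix('s')?;
    let minutes: f64 = minutes.parse().ok()?;
    let seconds: f64 = seconds.parse().ok()?;
    Some(minutes * 60.0 + seconds)
}

/// Ratio of the slower time to the faster time (hypothetical helper).
pub fn speedup(slow: &str, fast: &str) -> Option<f64> {
    Some(parse_time(slow)? / parse_time(fast)?)
}
```

Applied to the table, `speedup("10m39.524s", "1m39.646s")` gives roughly 6.4 for wall-clock time and `speedup("82m46.248s", "9m24.194s")` gives roughly 8.8 for CPU (user) time.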

Note: LJA support is currently limited to PacBio HiFi reads.

Changes

  • CLI: Added --assembler argument to autocycler helper genome_size.
  • Logic: Updated src/helper.rs to handle the new argument.
    • If --assembler lja is passed, it calls genome_size_lja.
    • If --assembler raven (or nothing) is passed, it defaults to genome_size_raven.
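
The dispatch described above could look roughly like the sketch below. The PR's actual code in `src/helper.rs` is not reproduced here, so the `Assembler` enum, `resolve_assembler` and `dispatch_target` are hypothetical names chosen for illustration; only `genome_size_raven` and `genome_size_lja` come from the description.

```rust
use std::str::FromStr;

/// Assemblers selectable via `--assembler` (hypothetical type, for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum Assembler {
    /// Raven stays the default when the flag is omitted.
    #[default]
    Raven,
    Lja,
}

impl FromStr for Assembler {
    type Err = String;

    /// Accepts "raven" or "lja" (case-insensitive); anything else is an error.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "raven" => Ok(Assembler::Raven),
            "lja" => Ok(Assembler::Lja),
            other => Err(format!(
                "unknown assembler '{other}' (expected 'raven' or 'lja')"
            )),
        }
    }
}

/// Resolve the optional flag value, falling back to the default assembler.
pub fn resolve_assembler(flag: Option<&str>) -> Result<Assembler, String> {
    match flag {
        None => Ok(Assembler::default()),
        Some(value) => value.parse(),
    }
}

/// Name of the helper function the dispatcher would call for each assembler.
pub fn dispatch_target(assembler: Assembler) -> &'static str {
    match assembler {
        Assembler::Raven => "genome_size_raven",
        Assembler::Lja => "genome_size_lja",
    }
}
```

Modelling the choice as an enum rather than a raw string means an unsupported value is rejected at argument-parsing time, and adding another assembler later forces every `match` to be updated.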

This adds an optional `--assembler` argument to the `autocycler helper genome_size` command, allowing users to choose between Raven (default) and LJA for genome size estimation.
LJA offers significantly faster performance for PacBio HiFi reads compared to Raven (approx. 6x faster in wall-clock time in testing), making it a valuable alternative for large datasets.
- Modified `src/main.rs` to parse the new `--assembler` flag.
- Updated `src/helper.rs` to dispatch to `genome_size_raven` or `genome_size_lja` based on the argument.
@rrwick
Owner

rrwick commented Dec 16, 2025

Hi Thomas,

Thanks for this! Some thoughts/observations:

  • Even though LJA is advertised as a HiFi assembler, it can be used for ONT reads, though its speed performance suffers (see Figure S5 from the Autocycler paper). So Raven is probably preferable for ONT reads.
  • I tried LJA, Raven and LRGE on an E. coli HiFi read set (same one from this post, first 40k reads):
    • Using 8 threads, LJA took 1:55, Raven took 2:30, LRGE took 0:28.
    • Using 32 threads, LJA took 1:21, Raven took 0:52, LRGE took 0:10.
    • My LJA assembly was very accurate for genome size estimation (only off by 1 bp). But the Raven assembly size was definitely good enough (0.15% error), since Autocycler only needs an approximate genome size. LRGE had the most error (~5%), but this is still good enough for Autocycler.

I'm curious why you got much better speed performance with LJA than I did. What was your read set like? How deep? How long were the reads? How big was the genome?

My inclination is not to merge this PR, since it adds a bit of complexity to the tool and I'm not sure it's needed. I think Raven is a good default choice (often faster than LJA in my tests), and if users really need faster genome size estimation, they can use LRGE. But am I missing something?

Thanks,
Ryan

@MrTomRod
Contributor Author

MrTomRod commented Dec 18, 2025

I didn't realize that LJA is sometimes slower than Raven. I never observed this, but I'm always using PacBio HiFi reads.

| Barcode Quality | HiFi Reads | HiFi Read Length (mean, bp) | HiFi Read Quality (median, QV) | HiFi Yield (bp) | Polymerase Read Length (mean, bp) | FASTQ Size | Genome Size | Coverage |
|---|---|---|---|---|---|---|---|---|
| 98.4 | 134,289 | 7,783 | Q47 | 1,045,225,151 | 152,594 | 430 MB | 2.5 Mbp | ~400x |
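
The coverage figure follows from dividing the HiFi yield by the genome size. This is a minimal sketch of that arithmetic, with `parse_count` and `coverage` as hypothetical helper names and the genome size taken as exactly 2,500,000 bp (the table only states 2.5 Mbp).

```rust
/// Parse a count with thousands separators, e.g. "1,045,225,151".
pub fn parse_count(s: &str) -> Option<u64> {
    s.replace(',', "").parse().ok()
}

/// Mean depth of coverage: total sequenced bases divided by genome size.
pub fn coverage(yield_bp: u64, genome_size_bp: u64) -> f64 {
    yield_bp as f64 / genome_size_bp as f64
}
```

With the values in the table, `coverage(parse_count("1,045,225,151").unwrap(), 2_500_000)` comes out at about 418, consistent with the reported ~400x.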

Feel free to reject the PR.

I also tried LRGE and am using that now, too.

Happy holidays!

@MrTomRod MrTomRod closed this Dec 18, 2025