Processing utilities for CycloneSeq nanopore raw-current signals, implemented in pure Rust for performance and reproducibility.
Author: Runsheng Li runsheng.li@cityu.edu.hk
- Signal IO – `Fast5Reader` streams raw ADC traces and optional basecalls directly from FAST5/HDF5 containers without extracting to intermediate files.
- Event Detection – `EventDetector` applies robust normalisation, changepoint scouting, and adaptive splitting to obtain ~2–3 events/base for Cyclone's 5 kHz sampling.
- Models – `Model::from_tsv` loads k-mer emission tables (mean, sd) and precomputes z-score equivalents; supports either raw-level or z-scored models.
- Alignment – Two paths:
  - Adaptive banded DP tuned to nanopore dwell statistics.
  - Optional HMM aligner (f5c-inspired) for dense event stretches.
- Outputs – Per-read JSON event tables, alignment JSON/JSONL payloads, and PAF summaries with f5c-compatible `ss` move tags.
The Rust crate lives in `curea/` and exposes both a library API and a CLI binary (`curea`) that stitches these stages.
This repository includes a small test dataset in the ./test directory to allow you to quickly run and verify the software.
After following the installation instructions below, you can run the alignment on the test data:
```bash
# Ensure you have built the binary first with `cargo build --release` inside the `curea` directory.
./curea/target/release/curea align \
  --fast5 test/two_reads.fast5 \
  --model models/init.DNA.tsv \
  --reference test/reference.fa \
  --sam test/two_reads_filtered.bam \
  --combined-paf test/output/test_run_output.paf \
  --threads 2
```

This command aligns the two reads in `two_reads.fast5` and produces a `test_run_output.paf` file in `test/output`. You can compare it to the `positive_control/first_2_reads.paf` file to check for correctness.
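A minimal comparison can be scripted with standard tools, assuming only the optional tag fields may legitimately differ between runs (file names taken from the quick-start above):

```shell
# Compare the 12 standard PAF columns of the fresh run against the
# bundled positive control; optional tag columns are ignored.
cut -f1-12 test/output/test_run_output.paf | sort > /tmp/run.cols
cut -f1-12 positive_control/first_2_reads.paf | sort > /tmp/ctrl.cols
diff /tmp/run.cols /tmp/ctrl.cols && echo "PAF alignments match"
```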
- Rust toolchain (Rust 1.74+ recommended) with `cargo` on PATH.
- HDF5 runtime via the project's Conda environment:

  ```bash
  source /Users/li/miniconda3/etc/profile.d/conda.sh
  conda activate nanocurea
  ```

- macOS HDF5 linker fix (needed for binary runtime):

  ```bash
  export HDF5_DIR="$CONDA_PREFIX"
  export HDF5_LIBDIR="$CONDA_PREFIX/lib"
  export DYLD_LIBRARY_PATH="$CONDA_PREFIX/lib:${DYLD_LIBRARY_PATH:-}"
  export DYLD_FALLBACK_LIBRARY_PATH="$CONDA_PREFIX/lib:${DYLD_FALLBACK_LIBRARY_PATH:-}"
  ```

  If `otool`/`install_name_tool` are available, ensure `$CONDA_PREFIX/lib` is on the binary rpath (see CLAUDE.md for the scriptlet).
- Data – Raw Cyclone FAST5/slow5 reads in their original directories (e.g., `fast5_WGA_p/`, `blow5_WGA_p/`), and a reference FASTA (`reference.fa`).
- System packages (Debian/Ubuntu):

  ```bash
  sudo apt-get update
  sudo apt-get install -y build-essential pkg-config libhdf5-dev zlib1g-dev
  # Optional: patchelf for rpath adjustments
  sudo apt-get install -y patchelf
  ```

- System packages (RHEL/CentOS/Rocky):

  ```bash
  sudo yum groupinstall -y "Development Tools"
  sudo yum install -y hdf5-devel zlib-devel pkgconfig
  # Optional: patchelf
  sudo yum install -y patchelf
  ```

- Conda route (cross-platform):

  ```bash
  conda create -n nanocurea -c conda-forge hdf5 rust
  conda activate nanocurea
  export HDF5_DIR="$CONDA_PREFIX"
  export HDF5_LIBDIR="$CONDA_PREFIX/lib"
  export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:${LD_LIBRARY_PATH:-}"
  ```
Build the binary:

```bash
cd curea
cargo build --release
# Binary at: target/release/curea
```

If the runtime cannot find HDF5 on Linux, ensure:

- `LD_LIBRARY_PATH` contains `$CONDA_PREFIX/lib` (for Conda installs), or `/usr/lib/x86_64-linux-gnu` (Debian) / `/usr/lib64` (RHEL) for system installs.
- As a last resort, set an rpath into the binary (requires `patchelf`):

  ```bash
  patchelf --set-rpath "$CONDA_PREFIX/lib" target/release/curea
  ```
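To confirm the built binary actually resolves its HDF5 dependency at runtime, a quick `ldd` check helps on Linux (`otool -L` is the macOS analogue):

```shell
# Show the binary's resolved shared libraries; the libhdf5 line should
# point at $CONDA_PREFIX/lib (Conda) or the system library directory.
ldd curea/target/release/curea | grep -i hdf5
```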
```bash
cd curea
cargo build --release
# Binary: target/release/curea
```

Use `cargo run -- <args>` during development; switch to the compiled binary for batch jobs.
- Raw-level models (`models/init.DNA.tsv`): provide absolute picoamp means/SDs. Use the default alignment mode (`--use-z-score` off).
- Z-score models (`models/wgafull_self.kmer.tsv`): provide normalised means/SDs. Supply `--use-z-score` so emissions compare event z-scores directly.
- Ensure the k-mer size matches model expectations (e.g., 6-mer for Cyclone). The loader validates consistency and rejects malformed TSV rows.
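Before a long run, a model TSV can be scanned for malformed rows with a one-liner. This is only a sketch: it assumes simple tab-separated `<kmer> <mean> <sd>` rows, while the loader's actual schema may include more columns or a header.

```shell
# Flag rows whose first column is not a 6-base k-mer or that have
# fewer than three tab-separated fields (comment lines are skipped).
awk -F'\t' '(length($1) != 6 || NF < 3) && $1 !~ /^#/ {print "suspect row " NR ": " $0}' models/init.DNA.tsv
```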
Run `curea --help` for the full synopsis. Key subcommands are summarised below.
This example uses the provided test data.
```bash
./curea/target/release/curea align \
  --fast5 test/two_reads.fast5 \
  --model models/init.DNA.tsv \
  --reference test/reference.fa \
  --sam test/two_reads_filtered.bam \
  --combined-paf test/output/two_reads.paf \
  --combined-json test/output/two_reads.jsonl \
  --output test/output/aligned_reads \
  --threads 2 \
  --sampling-rate 5000 \
  --min-events-to-rescale 200 \
  --scale-var
```

Optional flags:

- `--read-id` limits processing to one read (required when `--fast5` is a single file with multiple reads and you only want to process one).
- `--sequence-file` supplies basecalled FASTA/FASTQ if no FAST5 basecall exists.
- `--use-fast5-basecall` extracts basecalls directly from FAST5.
- `--use-z-score` switches emissions to z-score mode for z-scored models.
This example uses one of the reads from the test dataset. First, you need to find the read ID.
```bash
./curea/target/release/curea list --input test/two_reads.fast5
# This will output the read IDs, for example: 2d116543-704b-455b-990a-5c8e97a2e04e

# Now use the read ID with the 'single' command
./curea/target/release/curea single \
  --input test/two_reads.fast5 \
  --read-id 2d116543-704b-455b-990a-5c8e97a2e04e \
  --output test/output/one_read.events.json \
  --expected-bases 2000 \
  --sampling-rate 5000
```

Outputs optional JSON containing the event table.
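For a rough sanity check on the emitted JSON — assuming each event segment carries a `start` field, as listed in the Outputs section — segments can be counted without any JSON tooling:

```shell
# Rough event count: one "start" key per event segment.
grep -o '"start"' test/output/one_read.events.json | wc -l
```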
To test batch detection, you would place multiple FAST5 files in a directory. As `test/two_reads.fast5` contains multiple reads, the `align` command is better suited for batch processing from a single file. If you had a directory `test/many_fast5_files/`, the command would be:

```bash
./curea/target/release/curea batch \
  --input-dir test/many_fast5_files \
  --output-dir test/output/events \
  --threads 8 \
  --sampling-rate 5000
```

Generates `{fast5stem}_{read}.events.json` files per read.
```bash
./curea/target/release/curea list --input test/two_reads.fast5
```

- Prepare environment:

  ```bash
  # macOS (Conda)
  source /Users/li/miniconda3/etc/profile.d/conda.sh && conda activate nanocurea
  export HDF5_DIR="$CONDA_PREFIX"; export HDF5_LIBDIR="$CONDA_PREFIX/lib"
  export DYLD_LIBRARY_PATH="$CONDA_PREFIX/lib:${DYLD_LIBRARY_PATH:-}"

  # Linux (Conda)
  conda activate nanocurea
  export HDF5_DIR="$CONDA_PREFIX"; export HDF5_LIBDIR="$CONDA_PREFIX/lib"
  export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:${LD_LIBRARY_PATH:-}"
  ```
- Build:

  ```bash
  cargo build --release
  ```

- Run alignment (WGA example above) to produce:
  - Combined PAF with per-read scaling tags (`sc`, `sh`, `ss`, optionally `zs`).
  - Per-read JSON payloads (`*.nanocurea.json`) under `test/output/aligned_reads/`.
  - Combined JSONL (if requested) for downstream analytics.
Equivalent direct `curea align` (no script):

```bash
./curea/target/release/curea align \
  --fast5 test/two_reads.fast5 \
  --model models/6merZ.tsv \
  --reference test/reference.fa \
  --sam test/two_reads.sam \
  --combined-paf test/output/two_reads_zscore.paf \
  --combined-json test/output/two_reads_zscore.jsonl \
  --threads 2 \
  --sampling-rate 5000 \
  --min-events-to-rescale 200 \
  --scale-var \
  --use-z-score
```

- Event JSON – Array of segments `{start, end, length, mean, stdv, mean_z, stdv_z}` plus metadata (`sample_count`, `trimmed_start`, global stats).
- Alignment JSON – Includes the event table, alignment records, base-to-event map, scaling parameters, and the `used_z_score` flag.
- PAF – Standard fields plus:
  - `sc:f:<scale>` / `sh:f:<shift>` – affine calibration.
  - `ss:Z:<string>` – summarised move string mirroring f5c.
  - `zs:Z:1` (when z-score mode is active).
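As an illustration of consuming these tags downstream, a small `awk` sketch can extract the per-read calibration; it assumes the tags follow the 12 standard PAF columns, so it scans every field from column 13 onward:

```shell
# Print each read name with its sc/sh calibration tags from a PAF file.
awk -F'\t' '{
  sc = "NA"; sh = "NA"
  for (i = 13; i <= NF; i++) {
    if ($i ~ /^sc:f:/) sc = substr($i, 6)
    if ($i ~ /^sh:f:/) sh = substr($i, 6)
  }
  print $1, sc, sh
}' test/output/two_reads.paf
```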
Successful runs print per-read summaries (event counts, events/base) and highlight failures (e.g., missing basecalls, empty alignments).
- Verify the model type vs. `--use-z-score`. Z-scored TSVs require the flag; raw-level models should run without it.
- If HDF5 linkage fails, re-run the environment exports or install HDF5 via Conda (`conda install hdf5`).
- Empty event sets typically indicate truncated signals or an incorrect sampling rate: adjust `--sampling-rate` or confirm the FAST5 metadata.
- For long batches, review the stderr logs for reads that failed detection or alignment; they are summarised at the end of `align` runs.
The ./test directory contains a small, self-contained dataset for testing and demonstration.
- `test/two_reads.fast5`: A FAST5 file containing two reads.
- `test/reference.fa`: The reference sequence for alignment.
- `test/two_reads.sam`: SAM version of the alignments.
- `test/two_reads.fastq`: Basecalled sequences for the reads.
- `models/6merZ.tsv`: Z-score k-mer model. Use with `--use-z-score`.
- `test/output/two_reads.paf`: An example of the expected output PAF file for validation.
With these files, users can run both detection and alignment examples within seconds and verify PAF/JSON outputs.
- Model generation – The current `6merZ.tsv` model was generated from limited E. coli WGA data (~8000 reads). The k-mer model could be further refined later from the basecall move table.
- Alternative inputs – Blow5/BAM readers can reuse the same detection/alignment logic once Rust-side IO is implemented.
- Documentation – Sphinx docs reside in `docs/`; rebuild with `make html` from that directory when new APIs are added. A detailed description of the Rust functions is in `rust_code.md`.
- BAM support – We skipped BAM file support for alignment and run on SAM files instead. This is because we have issues in
MIT License.