Skip to content

lrslab/CurEA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

NanoCurEA Event Alignment Toolkit

Processing utilities for CycloneSeq nanopore raw-current signals, implemented in pure Rust for performance and reproducibility.

Author: Runsheng Li runsheng.li@cityu.edu.hk


Design Overview

  • Signal IOFast5Reader streams raw ADC traces and optional basecalls directly from FAST5/HDF5 containers without extracting to intermediate files.
  • Event DetectionEventDetector applies robust normalisation, changepoint scouting, and adaptive splitting to obtain ~2–3 events/base for Cyclone’s 5 kHz sampling.
  • ModelsModel::from_tsv loads k-mer emission tables (mean, sd) and precomputes z-score equivalents; supports either raw-level or z-scored models.
  • Alignment – Two paths:
    • Adaptive banded DP tuned to nanopore dwell statistics.
    • Optional HMM aligner (f5c-inspired) for dense event stretches.
  • Outputs – Per-read JSON event tables, alignment JSON/JSONL payloads, and PAF summaries with f5c-compatible ss move tags.

The Rust crate lives in curea/ and exposes both a library API and a CLI binary (curea) that stitches these stages.


Quick Start with Test Data

This repository includes a small test dataset in the ./test directory to allow you to quickly run and verify the software.

After following the installation instructions below, you can run the alignment on the test data:

# Ensure you have built the binary first with `cargo build --release` inside the `curea` directory.
./curea/target/release/curea align \
  --fast5 test/two_reads.fast5 \
  --model models/init.DNA.tsv \
  --reference test/reference.fa \
  --sam test/two_reads_filtered.bam \
  --combined-paf test/output/test_run_output.paf \
  --threads 2

This command will align the two reads in two_reads.fast5 and produce a test_run_output.paf file in test/output. You can compare this to the positive_control/first_2_reads.paf file to check for correctness.


Prerequisites

  1. Rust toolchain (Rust 1.74+ recommended) with cargo on PATH.

  2. HDF5 runtime via the project’s Conda environment:

    source /Users/li/miniconda3/etc/profile.d/conda.sh
    conda activate nanocurea
  3. macOS HDF5 linker fix (needed for binary runtime):

    export HDF5_DIR="$CONDA_PREFIX"
    export HDF5_LIBDIR="$CONDA_PREFIX/lib"
    export DYLD_LIBRARY_PATH="$CONDA_PREFIX/lib:${DYLD_LIBRARY_PATH:-}"
    export DYLD_FALLBACK_LIBRARY_PATH="$CONDA_PREFIX/lib:${DYLD_FALLBACK_LIBRARY_PATH:-}"

    If otool/install_name_tool are available, ensure $CONDA_PREFIX/lib is on the binary rpath (see CLAUDE.md for scriptlet).

  4. Data – Raw Cyclone FAST5/slow5 reads in their original directories (e.g., fast5_WGA_p/, blow5_WGA_p/), and reference FASTA (reference.fa).


Installation on Linux

  • System packages (Debian/Ubuntu):
    sudo apt-get update
    sudo apt-get install -y build-essential pkg-config libhdf5-dev zlib1g-dev
    # Optional: patchelf for rpath adjustments
    sudo apt-get install -y patchelf
  • System packages (RHEL/CentOS/Rocky):
    sudo yum groupinstall -y "Development Tools"
    sudo yum install -y hdf5-devel zlib-devel pkgconfig
    # Optional: patchelf
    sudo yum install -y patchelf
  • Conda route (cross‑platform):
    conda create -n nanocurea -c conda-forge hdf5 rust
    conda activate nanocurea
    export HDF5_DIR="$CONDA_PREFIX"
    export HDF5_LIBDIR="$CONDA_PREFIX/lib"
    export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:${LD_LIBRARY_PATH:-}"

Build the binary:

cd curea
cargo build --release
# Binary at: target/release/curea

If the runtime cannot find HDF5 on Linux, ensure:

  • LD_LIBRARY_PATH contains $CONDA_PREFIX/lib (for Conda installs), or /usr/lib/x86_64-linux-gnu (Debian) / /usr/lib64 (RHEL) for system installs.
  • As a last resort, set an rpath into the binary (requires patchelf):
    patchelf --set-rpath "$CONDA_PREFIX/lib" target/release/curea

Building the CLI

cd curea
cargo build --release
# Binary: target/release/curea

Use cargo run -- <args> during development; switch to the compiled binary for batch jobs.


Model Files

  • Raw-level models (models/init.DNA.tsv): provide absolute picoamp means/SDs. Use default alignment mode (--use-z-score off).
  • Z-score models (models/wgafull_self.kmer.tsv): provide normalised means/SDs. Supply --use-z-score so emissions compare event z-scores directly.
  • Ensure k-mer size matches model expectations (e.g., 6-mer for Cyclone). The loader validates consistency and rejects malformed TSV rows.

CLI Usage

Run curea --help for the full synopsis. Key subcommands are summarised below.

Main function fo user: Alignment against a model/reference

This example uses the provided test data.

./curea/target/release/curea align \
  --fast5 test/two_reads.fast5 \
  --model models/init.DNA.tsv \
  --reference test/reference.fa \
  --sam test/two_reads_filtered.bam \
  --combined-paf test/output/two_reads.paf \
  --combined-json test/output/two_reads.jsonl \
  --output test/output/aligned_reads \
  --threads 2 \
  --sampling-rate 5000 \
  --min-events-to-rescale 200 \
  --scale-var

Optional flags:

  • --read-id limits processing to one read (required when --fast5 is a single file with multiple reads and you only want to process one).
  • --sequence-file supplies basecalled FASTA/FASTQ if no FAST5 basecall exists.
  • --use-fast5-basecall extracts basecalls directly from FAST5.
  • --use-z-score switches emissions to z-score mode for z-scored models.

Functions used for debug and development

1. Detect events from a single FAST5

This example uses one of the reads from the test dataset. First, you need to find the read ID.

./curea/target/release/curea list --input test/two_reads.fast5
# This will output the read IDs, for example: 2d116543-704b-455b-990a-5c8e97a2e04e

# Now use the read ID with the 'single' command
./curea/target/release/curea single \
  --input test/two_reads.fast5 \
  --read-id 2d116543-704b-455b-990a-5c8e97a2e04e \
  --output test/output/one_read.events.json \
  --expected-bases 2000 \
  --sampling-rate 5000

Outputs optional JSON containing the event table.

2. Batch detection over a directory

To test batch detection, you would place multiple FAST5 files in a directory. As test/two_reads.fast5 contains multiple reads, the align command is better suited for batch processing from a single file. If you had a directory test/many_fast5_files/, the command would be:

./curea/target/release/curea batch \
  --input-dir test/many_fast5_files \
  --output-dir test/output/events \
  --threads 8 \
  --sampling-rate 5000

Generates {fast5stem}_{read}.events.json files per read.

3. List FAST5 read IDs

./curea/target/release/curea list --input test/two_reads.fast5

Example Workflow

  1. Prepare environment
    # macOS (Conda)
    source /Users/li/miniconda3/etc/profile.d/conda.sh && conda activate nanocurea
    export HDF5_DIR="$CONDA_PREFIX"; export HDF5_LIBDIR="$CONDA_PREFIX/lib"
    export DYLD_LIBRARY_PATH="$CONDA_PREFIX/lib:${DYLD_LIBRARY_PATH:-}"
    
    # Linux (Conda)
    conda activate nanocurea
    export HDF5_DIR="$CONDA_PREFIX"; export HDF5_LIBDIR="$CONDA_PREFIX/lib"
    export LD_LIBRARY_PATH="$CONDA_PREFIX/lib:${LD_LIBRARY_PATH:-}"
  2. Build: cargo build --release
  3. Run alignment (WGA example above) to produce:
    • Combined PAF with per-read scaling tags (sc, sh, ss, optionally zs).
    • Per-read JSON payloads (*.nanocurea.json) under test/output/aligned_reads/.
    • Combined JSONL (if requested) for downstream analytics.

Equivalent direct curea align (no script):

./curea/target/release/curea align \
  --fast5 test/two_reads.fast5 \
  --model models/6merZ.tsv \
  --reference test/reference.fa \
  --sam test/two_reads.sam \
  --combined-paf test/output/two_reads_zscore.paf \
  --combined-json test/output/two_reads_zscore.jsonl \
  --threads 2 \
  --sampling-rate 5000 \
  --min-events-to-rescale 200 \
  --scale-var \
  --use-z-score

Expected Outputs

  • Event JSON – Array of segments {start,end,length,mean,stdv,mean_z,stdv_z} plus metadata (sample_count, trimmed_start, global stats).
  • Alignment JSON – Includes event table, alignment records, base-to-event map, scaling parameters, and used_z_score flag.
  • PAF – Standard fields plus:
    • sc:f:<scale> / sh:f:<shift> – affine calibration.
    • ss:Z:<string> – summarised move-string mirroring f5c.
    • zs:Z:1 (when z-score mode is active).

Successful runs print per-read summaries (event counts, events/base) and highlight failures (e.g., missing basecalls, empty alignments).


Validation & Troubleshooting

  • Verify model type vs. --use-z-score. Z-scored TSVs require the flag; raw-level models should run without it.
  • If HDF5 linkage fails, re-run the environment exports or install HDF5 via Conda (conda install hdf5).
  • Empty event sets typically indicate truncated signals or incorrect sampling rate: adjust --sampling-rate or confirm FAST5 metadata.
  • For long batches, review stderr logs for reads that failed detection or alignment; they are summarised at the end of align runs.

Test Data

The ./test directory contains a small, self-contained dataset for testing and demonstration.

  • test/two_reads.fast5: A FAST5 file containing two reads.
  • test/reference.fa: The reference sequence for alignment.
  • test/two_reads.sam: SAM version of the alignments.
  • test/two_reads.fastq: Basecalled sequences for the reads.
  • models/6merZ.tsv: Z-score k-mer model. Use with --use-z-score.
  • test/output/two_reads.paf: An example of the expected output PAF file for validation.

With these files, users can run both detection and alignment examples within seconds and verify PAF/JSON outputs.


Extending the Pipeline/Todos

  • Model generation – The current 6merZ.tsv model was generated from limited E.coli WGA data (~8000 reads). The Kmer model could be further updated afterwards from the basecall movetable.
  • Alternative inputs – Blow5/BAM readers can reuse the same detection/alignment logic once Rust-side IO is implemented.
  • Documentation – Sphinx docs reside in docs/; rebuild with make html from that directory when new APIs are added. We have detailed description of rust functions in rust_code.md
  • BAM support - We skipped the support for bam file for alignmnet and run on SAM file instead. This is because we have issues in

License

MIT License.

About

CurEA (Current Event Alignment) Toolkit, designed for CycloneSeq data

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages