Home
Chen Shuai edited this page Dec 11, 2025 · 43 revisions
This document provides detailed instructions for using the KMERIA software and guides users through running the complete analysis pipeline.
kmeria_wrapper.pl - A parallel wrapper for KMERIA pipeline (v2.0)
kmeria_wrapper.pl --step <step> [options]
Steps:
count Count k-mers from FASTQ/FASTA files using kmeria count (default) or KMC
kctm Build population-level k-mer matrices
filter Filter raw k-mer matrices
m2b Convert k-mer matrices to BIMBAM dosage format
asso Conduct k-mer association study
all Run the complete KMERIA pipeline
Global Options:
--help|-h Show detailed help message
--threads|-t [INT] Number of threads per job [default: 16]
--step [STR] Pipeline step to run (required)
--scheduler|-s [STR] Job scheduler (local, slurm, sge, pbs) [default: local]
--samples [FILE] File containing list of samples (one per line)
--queue|-q [STR] Queue/Partition for job submission [default: share]
--memory|-m [STR] Memory per job [default: 32G]
--time [STR] Time limit for job [default: 720:00:00]
Command-specific options:
For 'count' (K-mer counting with KMC or kmeria count):
--input|-i [DIR] Directory with FASTQ/FASTA files
--output|-o [DIR] Output directory [default: 01_kmer_counts]
--kmer|-k [INT] K-mer size [default: 31]
--min-abund [INT] Minimum k-mer abundance (KMC only) [default: 4]
--max-abund [INT] Maximum k-mer abundance (KMC only) [default: 1000]
--batch-size|-b [INT] Number of samples per batch script [default: 4]
KMC-specific options:
--use-kmc Use KMC instead of kmeria count
--kmc-memory [INT] Memory allocation for KMC in GB [default: 16]
kmeria count-specific options (default):
-C|--count-separate-strands Count strands separately (no canonical)
-T|--text-output Text output instead of binary
--partition-bits [INT] Partitioning bits [default: 16]
-c|--compress-homopolymers Compress homopolymers (experimental)
For 'kctm' (K-mer matrix construction):
--input|-i [DIR] Directory with sorted k-mer databases
--output|-o [DIR] Output directory [default: 02_kmer_matrices]
--kctm-batch [INT] Batch size for kctm processing [default: 10000]
For 'filter' (Matrix filtering):
--input|-i [DIR] Directory with k-mer matrices
--output|-o [DIR] Output directory [default: 03_filtered_matrices]
--max-abund [INT] Maximum k-mer abundance [default: 1000]
--missing [FLOAT] Missing ratio threshold [default: 0.6]
--ploidy|-p [INT] Genome ploidy [default: 4]
--depth-file|-d [FILE] Sample depth file (REQUIRED)
For 'm2b' (Matrix to BIMBAM conversion):
--input|-i [DIR] Directory with filtered matrices
--output|-o [DIR] Output directory [default: 04_bimbam]
--bgzf-threads [INT] Threads for BGZF compression [default: 16]
--sketch-size [INT] Number of k-mers for sampling [default: 8000000]
--depth-file|-d [FILE] Sample depth file for sample list generation
For 'asso' (Association analysis):
--input|-i [DIR] Directory with BIMBAM files
--output|-o [DIR] Output directory [default: 05_association]
--pheno [FILE] Phenotype file (REQUIRED)
--pheno-col|-n [INT] Phenotype column [default: 1]
--covar [FILE] Covariate file (if not provided, PCA will be calculated)
--kinship [FILE] Kinship matrix file (if not provided, will be calculated)
--use-bimbam-tools Use bimbamAsso mode in kassoc (default: gemma)
--kinship-precision [INT] Precision for kinship matrix [default: 10]
--output-precision [INT] Precision for association output [default: 5]
kmeria_wrapper.pl is a wrapper script for generating job scripts for the KMERIA v2.0 pipeline. It generates bash scripts that can be manually submitted to cluster job schedulers (SLURM, SGE, PBS) or executed locally.
Key changes in v2.0:
- kmeria count outputs binary format by default (optional text with -T);
- Updated kctm parameters (batch mode, no-header);
- New filter command with compressed output format;
- Enhanced m2b with BGZF compression and statistics;
- Updated association analysis using bimbamAsso tool.
The complete KMERIA pipeline consists of 5 steps:
1. count - Count k-mers from FASTQ/FASTA files (kmeria count by default, or KMC with --use-kmc);
2. kctm - Build k-mer count matrices from count results;
3. filter - Filter matrices by abundance, ploidy, and missing rate;
4. m2b - Convert matrices to BIMBAM format and generate VCF/PLINK files;
5. asso - Perform association analysis with kinship correction using bimbamAsso
Generate scripts for complete pipeline
perl kmeria_wrapper.pl --step all \
--input /data/fastq_files \
--output /results/kmeria_analysis \
--samples sample.list \
--depth-file sample_depth.tsv \
--pheno phenotypes.txt \
--threads 32 \
--memory 32G \
--kmer 31 \
--min-abund 4 \
--max-abund 1000 \
--batch-size 4 \
--ploidy 4 \
--missing 0.6 \
--scheduler slurm \
--queue normal
Count k-mers using 'kmeria count'
perl kmeria_wrapper.pl --step count \
--input /data/fastq_files \
--output /results/01_kmer_counts \
--samples sample.list \
--threads 32 \
--kmer 31 \
-C \
--batch-size 4 \
--scheduler slurm \
--queue normal
Count k-mers using KMC (alternative method, recommended)
perl kmeria_wrapper.pl --step count \
--input /data/fastq_files \
--output /results/01_kmer_counts \
--samples sample.list \
--threads 32 \
--kmer 31 \
--min-abund 4 \
--batch-size 4 \
--use-kmc \
--kmc-memory 16 \
--scheduler slurm \
--queue normal
Count k-mers with text output and separate strand counting
perl kmeria_wrapper.pl --step count \
--input /data/fastq_files \
--output /results/01_kmer_counts \
--threads 32 \
--kmer 31 \
-C \
-T \
--batch-size 4
Association analysis with gemma mode
perl kmeria_wrapper.pl --step asso \
--input /results/04_bimbam \
--output /results/05_association \
--pheno phenotypes.txt \
--covar covariates.txt \
--threads 64
Association analysis with bimbamAsso mode
perl kmeria_wrapper.pl --step asso \
--input /results/04_bimbam \
--output /results/05_association \
--pheno phenotypes.txt \
--covar covariates.txt \
--threads 64 \
--use-bimbam-tools \
--kinship-precision 10 \
--output-precision 5
The wrapper supports two k-mer counting methods:
Method 1: kmeria count (Default)
Usage: perl kmeria_wrapper.pl --step count --input /data --output /results --kmer 31
Generated command example: kmeria count -k 31 -t 32 -o sample1_k31.bin input.fq.gz
Options specific to kmeria count:
-C, --count-separate-strands : Count forward and reverse strands separately (disables canonical k-mer mode)
-T, --text-output : Output in text format instead of binary
--partition-bits INT : Partitioning bits for memory optimization [default: 16]
-c, --compress-homopolymers : Compress homopolymer regions (experimental)
Output files:
- Binary format (default): sample_k31.bin
- Text format (with -T): sample_k31.txt
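To illustrate what the -C option changes, here is a minimal Python sketch of canonical versus strand-separate k-mer counting. It assumes the common convention that the canonical form of a k-mer is the lexicographically smaller of the k-mer and its reverse complement; the internals of kmeria count are not documented here, so the function names and the convention itself are illustrative assumptions, not the tool's implementation.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer):
    """Return the reverse complement of a DNA k-mer."""
    return kmer.translate(COMPLEMENT)[::-1]


def count_kmers(sequence, k, canonical=True):
    """Count k-mers in a sequence.

    canonical=True  merges each k-mer with its reverse complement
                    (assumed convention: keep the lexicographically smaller one).
    canonical=False counts both strands separately, as -C is described to do.
    """
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if canonical:
            kmer = min(kmer, reverse_complement(kmer))
        counts[kmer] += 1
    return counts


canonical_counts = count_kmers("ACGTT", 3, canonical=True)
separate_counts = count_kmers("ACGTT", 3, canonical=False)
print(dict(canonical_counts))
print(dict(separate_counts))
```

In this toy sequence, ACG and CGT are reverse complements of each other, so canonical mode merges them into a single entry while strand-separate mode keeps them apart.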
Method 2: KMC (Alternative)
Advantages:
- Well-established tool
- Highly optimized for large datasets
- Sorted output compatible with kctm
Usage: perl kmeria_wrapper.pl --step count --input data/ --output results/ --kmer 31 --use-kmc --kmc-memory 16
Generated command example:
kmc -k31 -t32 -m16 -b -ci4 -cs1000 @sample_list.txt output_k31 temp_dir
kmc_tools transform output_k31 sort output_sort_k31
Options specific to KMC:
--use-kmc : Enable KMC mode
--kmc-memory INT : Memory allocation in GB [default: 16]
--min-abund INT : Minimum k-mer abundance threshold [default: 4]
--max-abund INT : Maximum k-mer abundance threshold [default: 1000]
Output files:
- sample_k31.kmc_pre, sample_k31.kmc_suf (unsorted KMC database)
- sample_sort_k31.kmc_pre, sample_sort_k31.kmc_suf (sorted for kctm)
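Before moving on to the kctm step, it can help to confirm that every sample has both files of its sorted KMC database. The sketch below is a hypothetical helper (not part of KMERIA) that assumes the file naming shown above, i.e. `<sample>_sort_k<K>.kmc_pre` and `<sample>_sort_k<K>.kmc_suf`; the demo builds a temporary directory only to show the behaviour.

```python
import tempfile
from pathlib import Path


def missing_sorted_databases(count_dir, samples, k=31):
    """Return names of sorted KMC database files that are absent from count_dir.

    Assumes the naming pattern <sample>_sort_k<k>.kmc_pre / .kmc_suf.
    """
    missing = []
    for sample in samples:
        for suffix in (".kmc_pre", ".kmc_suf"):
            path = Path(count_dir) / f"{sample}_sort_k{k}{suffix}"
            if not path.is_file():
                missing.append(path.name)
    return missing


# Demo: sample2 lacks its .kmc_suf file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1_sort_k31.kmc_pre",
                 "sample1_sort_k31.kmc_suf",
                 "sample2_sort_k31.kmc_pre"):
        (Path(tmp) / name).touch()
    missing_files = missing_sorted_databases(tmp, ["sample1", "sample2"])

print(missing_files)
```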
Complete pipeline with parallel association
Full workflow ending with parallel bimbamAsso
perl kmeria_wrapper.pl --step all \
-i /data/fastq_files \
-o /results/full_analysis \
--samples sample_list.txt \
-d sample_depth.tsv \
--pheno traits.txt \
-k 31 \
-t 32 \
-p 4 \
--missing 0.6 \
--use-bimbam-tools \
--scheduler slurm \
--queue normal
Same pipeline using KMC for counting
perl kmeria_wrapper.pl --step all \
-i /data/fastq_files \
-o /results/full_analysis \
--samples sample_list.txt \
-d sample_depth.tsv \
--pheno traits.txt \
-k 31 \
-t 32 \
-p 4 \
--missing 0.6 \
--use-kmc \
--kmc-memory 16 \
--min-abund 5 \
--scheduler slurm \
--queue normal
Software dependencies:
- KMERIA v2.0+ (https://github.com/Sh1ne111/KMERIA)
- KMC v3.0+ (optional, use with --use-kmc) (https://github.com/refresh-bio/KMC)
- PLINK v1.9+
- kassoc (custom association tool)
- bimbamKin and bimbamAsso (if using --use-bimbam-tools)
Perl modules:
- Getopt::Long
- File::Basename
- File::Path
- Cwd
- Pod::Usage
Sample list file
Plain text file with one sample name per line (no header):
sample1
sample2
sample3
Depth file (sample_depth.tsv)
Tab-separated file with sample names and depth values:
sample1 45.2
sample2 52.8
sample3 38.9
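Because the filter step requires a depth value for each sample, a quick consistency check between the sample list and the depth file can catch problems early. The following is a minimal sketch, assuming the two formats described above (one name per line; two tab-separated columns). The helper names are hypothetical and not part of KMERIA, and the demo uses in-memory text instead of real files.

```python
import io


def read_sample_list(handle):
    """Return sample names, one per non-empty line (no header expected)."""
    return [line.strip() for line in handle if line.strip()]


def read_depths(handle):
    """Return {sample: depth} from a tab-separated, two-column file."""
    depths = {}
    for line_no, line in enumerate(handle, start=1):
        if not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"line {line_no}: expected 2 tab-separated fields, got {len(fields)}"
            )
        depths[fields[0]] = float(fields[1])
    return depths


def missing_depths(samples, depths):
    """Return samples that have no entry in the depth table."""
    return [sample for sample in samples if sample not in depths]


# Demo: sample3 is listed but has no depth entry.
samples = read_sample_list(io.StringIO("sample1\nsample2\nsample3\n"))
depths = read_depths(io.StringIO("sample1\t45.2\nsample2\t52.8\n"))
print(missing_depths(samples, depths))
```

With real files, the same functions can be called on handles from `open("sample.list")` and `open("sample_depth.tsv")`.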
Phenotype file
Tab or space-separated file with format:
sample1 1.5 0
sample2 2.3 1
sample3 1.8 1
Note: The association tool assumes the sample name is in column 1 of the phenotype file. --pheno-col counts phenotype columns after the sample column, so the default of 1 selects the second column of the file (the first phenotype).
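The column numbering in the note above can be sketched in Python as follows. This is an illustration of the indexing convention only, under the assumption that phenotype column n corresponds to whitespace-separated field n + 1 of each line and that values are numeric; `read_phenotype` is a hypothetical helper, not a KMERIA function.

```python
import io


def read_phenotype(handle, pheno_col=1):
    """Return {sample: value} for the pheno_col-th phenotype column.

    Assumption: field 1 is the sample name, so phenotype column n is
    field n + 1 (index n after splitting on whitespace).
    """
    if pheno_col < 1:
        raise ValueError("pheno_col must be >= 1")
    values = {}
    for line_no, line in enumerate(handle, start=1):
        fields = line.split()
        if not fields:
            continue
        if len(fields) <= pheno_col:
            raise ValueError(
                f"line {line_no}: no phenotype column {pheno_col}"
            )
        values[fields[0]] = float(fields[pheno_col])
    return values


text = "sample1 1.5 0\nsample2 2.3 1\nsample3 1.8 1\n"
first_trait = read_phenotype(io.StringIO(text), pheno_col=1)
second_trait = read_phenotype(io.StringIO(text), pheno_col=2)
print(first_trait)
print(second_trait)
```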
Covariate file
Tab or space-separated file with format:
FAMID INDID COV1 COV2 COV3 ...
sample1 sample1 0.5 1.2 -0.3
sample2 sample2 -0.2 0.8 0.5
Batch Size Selection
The --batch-size parameter controls how many samples are processed per job script:
- Small batches (2-4): Better for job scheduler queue management
- Large batches (10-20): Fewer job scripts, easier to track
- Consider: total samples, wall time limits, checkpoint/restart needs
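To get a feel for how the batch size translates into a number of job scripts, the following sketch splits a sample list into consecutive batches. How the wrapper actually groups samples is not documented here, so consecutive chunking is an assumption made for illustration, and `split_into_batches` is a hypothetical helper.

```python
def split_into_batches(samples, batch_size=4):
    """Split samples into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]


samples = [f"sample{i}" for i in range(1, 11)]  # 10 samples
small = split_into_batches(samples, batch_size=4)
large = split_into_batches(samples, batch_size=20)
print(len(small), [len(batch) for batch in small])
print(len(large))
```

Ten samples with a batch size of 4 give three batches (4, 4 and 2 samples), while a batch size of 20 puts everything in a single script.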
Association Analysis Optimization
The association step uses the kassoc tool, which handles internal parallelism:
- Use --threads to control concurrency level
- For large datasets, use higher --threads (e.g., 64)
- Pre-compute kinship and covariates for faster runs
Common Issues and Solutions
Issue: "No samples found to process"
- Check that FASTQ/FASTA files have correct extensions (.fq, .fastq, .fa, .fasta)
- Verify --samples file contains valid sample names (one per line)
- Ensure input directory path is correct
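The checks above can be partly automated. The sketch below lists which sample names have no matching sequence file in the input directory. It assumes that file names start with the sample name followed by `.` or `_`, and that an optional `.gz` suffix is allowed on top of the extensions listed above; both are assumptions, and the wrapper's own matching rules may differ.

```python
import tempfile
from pathlib import Path

# Extensions named in the checklist; an optional ".gz" suffix is an assumption.
EXTENSIONS = (".fq", ".fastq", ".fa", ".fasta")


def is_sequence_file(name):
    """True if the file name ends in a known extension (optionally gzipped)."""
    base = name[:-3] if name.endswith(".gz") else name
    return base.endswith(EXTENSIONS)


def samples_without_files(input_dir, samples):
    """Return samples for which no sequence file is found in input_dir."""
    names = [
        path.name
        for path in Path(input_dir).iterdir()
        if path.is_file() and is_sequence_file(path.name)
    ]
    return [
        sample
        for sample in samples
        if not any(n.startswith(sample + ".") or n.startswith(sample + "_") for n in names)
    ]


# Demo directory: sample3 has no input file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("sample1_R1.fq.gz", "sample1_R2.fq.gz", "sample2.fastq", "notes.txt"):
        (Path(tmp) / name).touch()
    missing = samples_without_files(tmp, ["sample1", "sample2", "sample3"])

print(missing)
```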
Checking Job Status
After submitting jobs to the cluster:
# SLURM
squeue -u $USER
sacct -j JOBID --format=JobID,JobName,State,ExitCode
# SGE
qstat -u $USER
qacct -j JOBID
# PBS
qstat -u $USER
tracejob JOBID
Log File Locations
Check log files for errors:
output_dir/01_kmer_counts/count_batch_0.log
output_dir/01_kmer_counts/count_batch_0.err
output_dir/02_kmer_matrices/kctm_job.log
output_dir/03_filtered_matrices/filter_job.log
output_dir/04_bimbam/convert_job.log
output_dir/05_association/asso_job.log
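With many batches, scanning the log files by hand gets tedious. As a rough sketch, the helper below walks an output directory and reports lines in `.log` and `.err` files that contain a few suspicious keywords. The keyword list is an assumption chosen for illustration; the actual messages written by KMERIA or the scheduler may differ.

```python
import tempfile
from pathlib import Path


def find_error_lines(output_dir, keywords=("error", "killed", "out of memory")):
    """Return (relative_path, line_number, text) for suspicious log lines."""
    hits = []
    for path in sorted(Path(output_dir).rglob("*")):
        if path.suffix not in (".log", ".err") or not path.is_file():
            continue
        with path.open(errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if any(keyword in line.lower() for keyword in keywords):
                    relative = path.relative_to(output_dir).as_posix()
                    hits.append((relative, line_no, line.strip()))
    return hits


# Demo with a fake output directory.
with tempfile.TemporaryDirectory() as tmp:
    step_dir = Path(tmp) / "01_kmer_counts"
    step_dir.mkdir()
    (step_dir / "count_batch_0.log").write_text("counting finished\n")
    (step_dir / "count_batch_0.err").write_text("slurmstepd: error: job killed\n")
    hits = find_error_lines(tmp)

print(hits)
```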
Should I use kmeria count or KMC?
Use kmeria count (default) for:
- Most standard analyses
- Direct KMERIA pipeline integration
Use KMC (--use-kmc) for:
- Very large datasets (>100GB per sample)
- When you need strict abundance filtering
- Compatibility with other KMC-based workflows
- Faster counting
When choosing a k-mer size (--kmer), consider:
- Shorter k-mers: More sensitive, more false positives, less memory
- Longer k-mers: More specific, fewer false positives, more memory
How do I process paired-end reads?
Both methods automatically detect and process paired-end files:
- Files matching: sample_R1.fq.gz and sample_R2.fq.gz
- Or: sample_1.fq.gz and sample_2.fq.gz
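The two naming conventions above can be expressed as a single regular expression. The sketch below groups file names into mate pairs per sample; it only illustrates the naming patterns and is not the wrapper's actual detection code. It assumes the `.fq.gz` extension shown in the examples.

```python
import re
from collections import defaultdict

# Matches sample_R1.fq.gz / sample_R2.fq.gz and sample_1.fq.gz / sample_2.fq.gz.
PAIR_PATTERN = re.compile(r"^(?P<sample>.+)_R?(?P<mate>[12])\.fq\.gz$")


def group_pairs(filenames):
    """Return ({sample: {mate: filename}}, [files that match no pair pattern])."""
    pairs = defaultdict(dict)
    unpaired = []
    for name in filenames:
        match = PAIR_PATTERN.match(name)
        if match:
            pairs[match.group("sample")][match.group("mate")] = name
        else:
            unpaired.append(name)
    return dict(pairs), unpaired


files = ["A_R1.fq.gz", "A_R2.fq.gz", "B_1.fq.gz", "B_2.fq.gz", "C.fq.gz"]
pairs, unpaired = group_pairs(files)
print(pairs)
print(unpaired)
```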
Can I restart a failed pipeline?
Yes! Since each step generates independent job scripts:
1. Identify which step failed (check log files)
2. Fix the issue (add memory, correct input files, etc.)
3. Re-run only that specific step: --step count|kctm|filter|m2b|asso
4. Continue with subsequent steps
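The restart procedure above amounts to re-running the failed step and everything after it. A small sketch, using the step names from the wrapper's help text, shows which invocations that implies. The helper names and the `extra_args` parameter are hypothetical, and real commands need the step-specific options described earlier.

```python
# Pipeline steps in execution order, as listed in the wrapper's help text.
STEPS = ["count", "kctm", "filter", "m2b", "asso"]


def steps_to_rerun(failed_step):
    """Return the failed step and all steps that follow it."""
    if failed_step not in STEPS:
        raise ValueError(f"unknown step: {failed_step}")
    return STEPS[STEPS.index(failed_step):]


def rerun_commands(failed_step, extra_args=""):
    """Build one wrapper invocation per step that still has to run."""
    return [
        f"perl kmeria_wrapper.pl --step {step} {extra_args}".strip()
        for step in steps_to_rerun(failed_step)
    ]


print(steps_to_rerun("filter"))
for command in rerun_commands("m2b", "--scheduler slurm"):
    print(command)
```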
How do I speed up association analysis?
The association step handles internal parallelism:
- Use --threads to set concurrency (e.g., 64)
- Ensure fast I/O (SSD storage)
- Pre-compute kinship and covariates
Pass --use-bimbam-tools to run kassoc in bimbamAsso mode instead of the default gemma mode.
Updated for KMERIA v2.0 pipeline
Original concept based on KMERIA by Chen Shuai (chensss1209@gmail.com)
v2.0.0 (2025):
- Added support for all kmeria count parameters (-C, -T, -p, -c)
- Updated filter step to use new compressed output format
- Enhanced m2b step with BGZF compression and statistics
- Updated association step to use kassoc tool
KMERIA documentation: https://github.com/Sh1ne111/KMERIA