Chen Shuai edited this page Dec 11, 2025 · 43 revisions

KMERIA Wiki Documentation

This document contains detailed instructions for using the KMERIA software and guides users through running the complete analysis pipeline.

NAME

kmeria_wrapper.pl - A parallel wrapper for the KMERIA pipeline (v2.0)

SYNOPSIS

kmeria_wrapper.pl --step <step> [options]

 Steps:
    count            Count k-mers from FASTQ files using KMC
    kctm             Build population-level k-mer matrices
    filter           Filter raw k-mer matrices
    m2b              Convert k-mer matrices to BIMBAM dosage format
    asso             Conduct k-mer association study
    all              Run the complete KMERIA pipeline

 Global Options:
   --help|-h                    Show detailed help message
   --threads|-t      [INT]      Number of threads per job [default: 16]
   --step            [STR]      Pipeline step to run (required)
   --scheduler|-s    [STR]      Job scheduler (local, slurm, sge, pbs) [default: local]
   --samples         [FILE]     File containing list of samples (one per line)
   --queue|-q        [STR]      Queue/Partition for job submission [default: share]
   --memory|-m       [STR]      Memory per job [default: 32G]
   --time            [STR]      Time limit for job [default: 720:00:00]

 Command-specific options:

 For 'count' (K-mer counting with KMC or kmeria count):
   --input|-i        [DIR]      Directory with FASTQ/FASTA files
   --output|-o       [DIR]      Output directory [default: 01_kmer_counts]
   --kmer|-k         [INT]      K-mer size [default: 31]
   --min-abund       [INT]      Minimum k-mer abundance (KMC only) [default: 4]
   --max-abund       [INT]      Maximum k-mer abundance (KMC only) [default: 1000]
   --batch-size|-b   [INT]      Number of samples per batch script [default: 4]

   KMC-specific options:
   --use-kmc                    Use KMC instead of kmeria count
   --kmc-memory      [INT]      Memory allocation for KMC in GB [default: 16]

   kmeria count-specific options (default):
   -C|--count-separate-strands  Count strands separately (no canonical)
   -T|--text-output             Text output instead of binary
   --partition-bits  [INT]      Partitioning bits [default: 16]
   -c|--compress-homopolymers   Compress homopolymers (experimental)

 For 'kctm' (K-mer matrix construction):
   --input|-i        [DIR]      Directory with sorted k-mer databases
   --output|-o       [DIR]      Output directory [default: 02_kmer_matrices]
   --kctm-batch      [INT]      Batch size for kctm processing [default: 10000]

 For 'filter' (Matrix filtering):
   --input|-i        [DIR]      Directory with k-mer matrices
   --output|-o       [DIR]      Output directory [default: 03_filtered_matrices]
   --max-abund       [INT]      Maximum k-mer abundance [default: 1000]
   --missing         [FLOAT]    Missing ratio threshold [default: 0.6]
   --ploidy|-p       [INT]      Genome ploidy [default: 4]
   --depth-file|-d   [FILE]     Sample depth file (REQUIRED)

 For 'm2b' (Matrix to BIMBAM conversion):
   --input|-i        [DIR]      Directory with filtered matrices
   --output|-o       [DIR]      Output directory [default: 04_bimbam]
   --bgzf-threads    [INT]      Threads for BGZF compression [default: 16]
   --sketch-size     [INT]      Number of k-mers for sampling [default: 8000000]
   --depth-file|-d   [FILE]     Sample depth file for sample list generation

 For 'asso' (Association analysis):
   --input|-i        [DIR]      Directory with BIMBAM files
   --output|-o       [DIR]      Output directory [default: 05_association]
   --pheno           [FILE]     Phenotype file (REQUIRED)
   --pheno-col|-n    [INT]      Phenotype column [default: 1]
   --covar           [FILE]     Covariate file (if not provided, PCA will be calculated)
   --kinship         [FILE]     Kinship matrix file (if not provided, will be calculated)
   --use-bimbam-tools           Use bimbamAsso mode in kassoc (default: gemma)
   --kinship-precision [INT]    Precision for kinship matrix [default: 10]
   --output-precision  [INT]    Precision for association output [default: 5]

DESCRIPTION

kmeria_wrapper.pl is a wrapper script that generates job scripts for the KMERIA v2.0 pipeline. The generated bash scripts can be submitted manually to a cluster job scheduler (SLURM, SGE, PBS) or executed locally.

Key changes in v2.0:
- kmeria count outputs binary format by default (optional text with -T);
- Updated kctm parameters (batch mode, no-header);
- New filter command with compressed output format;
- Enhanced m2b with BGZF compression and statistics;
- Updated association analysis using bimbamAsso tool.

WORKFLOW

The complete KMERIA pipeline consists of 5 steps:

1. count - Count k-mers from FASTQ/FASTA files (kmeria count by default, or KMC with --use-kmc); 
2. kctm - Build k-mer count matrices from count results;
3. filter - Filter matrices by abundance, ploidy, and missing rate;
4. m2b - Convert matrices to BIMBAM format and generate VCF/PLINK files; 
5. asso - Perform association analysis with kinship correction using bimbamAsso
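
The five steps can also be generated one at a time. The sketch below only prints one wrapper invocation per step and does not run anything. It assumes that each step reads the previous step's output directory, using the documented default directory names; `raw_fastq` is a placeholder, and `sample_depth.tsv` and `phenotypes.txt` are the file names used in the examples of this page.

```shell
# Print (do not run) one wrapper command per pipeline step.
# Assumption: each step takes the previous step's --output as its --input.
print_step_commands() {
    local input="$1" pair step output extra
    for pair in count:01_kmer_counts kctm:02_kmer_matrices \
                filter:03_filtered_matrices m2b:04_bimbam asso:05_association; do
        step="${pair%%:*}"      # text before the colon
        output="${pair#*:}"     # text after the colon
        case "$step" in
            filter) extra=" --depth-file sample_depth.tsv" ;;  # required by filter
            asso)   extra=" --pheno phenotypes.txt" ;;         # required by asso
            *)      extra="" ;;
        esac
        echo "perl kmeria_wrapper.pl --step ${step} --input ${input} --output ${output}${extra}"
        input="$output"         # next step reads what this step wrote
    done
}

print_step_commands raw_fastq
```

Printing the commands first makes it easy to review the directory chain before any job scripts are generated.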

EXAMPLES

Generate scripts for complete pipeline

perl kmeria_wrapper.pl --step all \
                       --input /data/fastq_files \
                       --output /results/kmeria_analysis \
                       --samples sample.list \
                       --depth-file sample_depth.tsv \
                       --pheno phenotypes.txt \
                       --threads 32 \
                       --memory 32 \
                       --kmer 31 \
                       --min-abund 4 \
                       --max-abund 1000 \
                       --batch-size 4 \
                       --ploidy 4 \
                       --missing 0.6 \
                       --scheduler slurm \
                       --queue normal

Count k-mers using 'kmeria count'

  perl kmeria_wrapper.pl --step count \
                         --input /data/fastq_files \
                         --output /results/01_kmer_counts \
                         --samples sample.list \
                         --threads 32 \
                         --kmer 31 \
                         -C \
                         --batch-size 4 \
                         --scheduler slurm \
                         --queue normal

Count k-mers using KMC (alternative method, recommended)

  perl kmeria_wrapper.pl --step count \
                         --input /data/fastq_files \
                         --output /results/01_kmer_counts \
                         --samples sample.list \
                         --threads 32 \
                         --kmer 31 \
                         --min-abund 4 \
                         --batch-size 4 \
                         --use-kmc \
                         --kmc-memory 16 \
                         --scheduler slurm \
                         --queue normal

Count k-mers with text output and separate strand counting

  perl kmeria_wrapper.pl --step count \
                         --input /data/fastq_files \
                         --output /results/01_kmer_counts \
                         --threads 32 \
                         --kmer 31 \
                         -C \
                         -T \
                         --batch-size 4

Association analysis with gemma mode

  perl kmeria_wrapper.pl --step asso \
                         --input /results/04_bimbam \
                         --output /results/05_association \
                         --pheno phenotypes.txt \
                         --covar covariates.txt \
                         --threads 64

Association analysis with bimbamAsso mode

   perl kmeria_wrapper.pl --step asso \
                          --input /results/04_bimbam \
                          --output /results/05_association \
                          --pheno phenotypes.txt \
                          --covar covariates.txt \
                          --threads 64 \
                          --use-bimbam-tools \
                          --kinship-precision 10 \
                          --output-precision 5

K-MER COUNTING METHODS

The wrapper supports two k-mer counting methods:

 Method 1: kmeria count (Default)
    Usage: perl kmeria_wrapper.pl --step count --input /data --output /results --kmer 31

    Generated command example: kmeria count -k 31 -t 32 -o sample1_k31.bin input.fq.gz

   Options specific to kmeria count:
     -C, --count-separate-strands : Count forward and reverse strands separately (disables canonical k-mer mode)
     -T, --text-output : Output in text format instead of binary
     --partition-bits INT : Partitioning bits for memory optimization [default: 16]
     -c, --compress-homopolymers : Compress homopolymer regions (experimental)

    Output files: 
      - Binary format (default): sample_k31.bin 
      - Text format (with -T): sample_k31.txt

 Method 2: KMC (Alternative)
    Advantages: 
    - Well-established tool 
    - Highly optimized for large datasets
    - Sorted output compatible with kctm

    Usage: perl kmeria_wrapper.pl --step count --input data/ --output results/ --kmer 31 --use-kmc --kmc-memory 16

    Generated command example:
    kmc -k31 -t32 -m16 -b -ci4 -cs1000 @sample_list.txt output_k31 temp_dir
    kmc_tools transform output_k31 sort output_sort_k31

    Options specific to KMC: 
    --use-kmc : Enable KMC mode 
    --kmc-memory INT : Memory allocation in GB [default: 16] 
    --min-abund INT : Minimum k-mer abundance threshold [default: 4] 
    --max-abund INT : Maximum k-mer abundance threshold [default: 1000]

    Output files: 
    - sample_k31.kmc_pre, sample_k31.kmc_suf (unsorted KMC database) 
    - sample_sort_k31.kmc_pre, sample_sort_k31.kmc_suf (sorted for kctm)

  Complete pipeline with parallel association
    Full workflow ending with parallel bimbamAsso 
    perl kmeria_wrapper.pl --step all \
                           -i /data/fastq_files \
                           -o /results/full_analysis \
                           --samples sample_list.txt \
                           -d sample_depth.tsv \
                           --pheno traits.txt \
                           -k 31 \
                           -t 32 \
                           -p 4 \
                           --missing 0.6 \
                           --use-bimbam-tools \
                           --scheduler slurm \
                           --queue normal

   Same pipeline using KMC for counting
   perl kmeria_wrapper.pl --step all \
                          -i /data/fastq_files \
                          -o /results/full_analysis \
                          --samples sample_list.txt \
                          -d sample_depth.tsv \
                          --pheno traits.txt \
                          -k 31 \
                          -t 32 \
                          -p 4 \
                          --missing 0.6 \
                          --use-kmc \
                          --kmc-memory 16 \
                          --min-abund 5 \
                          --scheduler slurm \
                          --queue normal

    Software dependencies: 
    - KMERIA v2.0+ (https://github.com/Sh1ne111/KMERIA)
    - KMC v3.0+ (optional, use with --use-kmc) (https://github.com/refresh-bio/KMC)
    - PLINK v1.9+
    - kassoc (custom association tool)
    - bimbamKin and bimbamAsso (if using --use-bimbam-tools)

    Perl modules: 
    - Getopt::Long 
    - File::Basename 
    - File::Path 
    - Cwd 
    - Pod::Usage

FILE FORMATS

Sample list file
Plain text file with one sample name per line (no header):
  sample1
  sample2
  sample3
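
A malformed sample list is easy to catch before any jobs are generated. The following is a minimal sketch, not part of KMERIA: the function name `check_sample_list` is hypothetical, and it only checks that every line holds exactly one whitespace-free name and that no name appears twice.

```shell
# Report lines that are not a single name, and names that appear twice.
# Returns non-zero if any problem was found.
check_sample_list() {
    local list="$1"
    awk '
        # A valid line has exactly one whitespace-separated field.
        NF != 1 { printf "line %d: expected exactly one name\n", NR; bad = 1 }
        # seen[name]++ is non-zero from the second occurrence onward.
        NF == 1 && seen[$1]++ { printf "line %d: duplicate sample %s\n", NR, $1; bad = 1 }
        END { exit bad }
    ' "$list"
}

# Example call (sample.list is the file name used elsewhere on this page):
#   check_sample_list sample.list && echo "sample list looks fine"
```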

Depth file (sample_depth.tsv)
Tab-separated file with sample names and depth values:
  sample1    45.2
  sample2    52.8
  sample3    38.9
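
Because the filter step requires the depth file, it can help to cross-check it against the sample list first. The sketch below is illustrative only: `check_depth_file` is a hypothetical helper, and it assumes the depth file has exactly two tab-separated columns with a plain non-negative decimal number in the second one.

```shell
# Verify that every line of the depth file is "name<TAB>number" and that
# every sample in the sample list has a depth entry.
# Returns non-zero if any problem was found.
check_depth_file() {
    local samples="$1" depth="$2"
    awk -F'\t' '
        NR == FNR { want[$1] = 1; next }           # first file: sample list
        NF != 2 || $2 !~ /^[0-9]+(\.[0-9]+)?$/ {   # second file: depth table
            printf "bad depth line %d: %s\n", FNR, $0; bad = 1; next
        }
        { delete want[$1] }                        # this sample has a valid depth
        END {
            for (s in want) { printf "no depth for sample %s\n", s; bad = 1 }
            exit bad
        }
    ' "$samples" "$depth"
}

# Example call:
#   check_depth_file sample.list sample_depth.tsv
```

The sample list is read first so that any name still left in `want` at the end is a sample without a usable depth value.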

Phenotype file
Tab or space-separated file with format:
  sample1  1.5    0
  sample2  2.3    1
  sample3  1.8    1
Note: The association tool expects the sample name in column 1. --pheno-col counts phenotype columns after the sample column, so the default of 1 selects the second column of the file (the first value after the sample name).
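
To preview which values a given --pheno-col setting would pick up, the column can be extracted by hand. This sketch assumes the reading of the note above, namely that phenotype column N is field N+1 of each line; `show_phenotype_column` is a hypothetical helper and is not part of KMERIA.

```shell
# Print "sample<TAB>value" for phenotype column N, where N is 1-based and
# does not count the sample-name column (so N maps to field N+1).
show_phenotype_column() {
    local pheno="$1" col="${2:-1}"
    # Default field splitting handles both tabs and runs of spaces.
    awk -v c="$col" '{ print $1 "\t" $(c + 1) }' "$pheno"
}

# Example calls:
#   show_phenotype_column phenotypes.txt      # first phenotype (default)
#   show_phenotype_column phenotypes.txt 2    # second phenotype
```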

Covariate file
Tab or space-separated file with format:
  FAMID    INDID    COV1  COV2  COV3  ...
  sample1  sample1  0.5   1.2   -0.3
  sample2  sample2  -0.2  0.8   0.5

PERFORMANCE TIPS

Batch Size Selection

The --batch-size parameter controls how many samples are processed per job script:

  • Small batches (2-4): Better for job scheduler queue management
  • Large batches (10-20): Fewer job scripts, easier to track
  • Consider: total samples, wall time limits, checkpoint/restart needs
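
As a rough guide when picking a batch size, the number of count job scripts can be estimated with a ceiling division. This is a sketch under the assumption that the wrapper fills each script with up to --batch-size samples and puts any remainder in one last script; `count_batches` is a hypothetical helper.

```shell
# Number of job scripts needed for n_samples at a given batch size,
# i.e. ceil(n_samples / batch_size) in integer arithmetic.
count_batches() {
    local n_samples="$1" batch_size="$2"
    echo $(( (n_samples + batch_size - 1) / batch_size ))
}

count_batches 200 4     # 200 samples, 4 per script
count_batches 200 20    # 200 samples, 20 per script
```

Under that assumption, 200 samples give 50 scripts at the default batch size of 4 and 10 scripts at a batch size of 20.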

Association Analysis Optimization

The association step uses the kassoc tool, which handles internal parallelism:

  • Use --threads to control concurrency level
  • For large datasets, use higher --threads (e.g., 64)
  • Pre-compute kinship and covariates for faster runs

TROUBLESHOOTING

  Common Issues and Solutions
    Issue: "No samples found to process" 
     - Check that FASTQ/FASTA files have correct extensions (.fq, .fastq, .fa, .fasta) 
     - Verify --samples file contains valid sample names (one per line) 
     - Ensure input directory path is correct


  Checking Job Status
    After submitting jobs to the cluster:

      # SLURM
      squeue -u $USER
      sacct -j JOBID --format=JobID,JobName,State,ExitCode
  
      # SGE
      qstat -u $USER
      qacct -j JOBID
  
      # PBS
      qstat -u $USER
      tracejob JOBID

  Log File Locations
    Check log files for errors:
    output_dir/01_kmer_counts/count_batch_0.log
    output_dir/01_kmer_counts/count_batch_0.err
    output_dir/02_kmer_matrices/kctm_job.log
    output_dir/03_filtered_matrices/filter_job.log
    output_dir/04_bimbam/convert_job.log
    output_dir/05_association/asso_job.log
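
Rather than opening each log by hand, all of them can be scanned in one pass. The sketch below is a generic approach and not a KMERIA feature: `scan_logs` is a hypothetical helper, and the keywords `error`, `fail` and `killed` are assumptions about what a failed job writes to its log.

```shell
# List every line in *.log and *.err files under a directory that mentions
# a likely failure keyword, prefixed with file name and line number.
scan_logs() {
    local outdir="$1"
    # /dev/null is passed as an extra file so that grep always prints the
    # file name, even when find hands it a single log file.
    find "$outdir" -type f \( -name '*.log' -o -name '*.err' \) \
        -exec grep -inE 'error|fail|killed' /dev/null {} +
}

# Example call:
#   scan_logs output_dir
```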

FREQUENTLY ASKED QUESTIONS

Should I use kmeria count or KMC?
    Use kmeria count (default) for: 
           - Most standard analyses 
           - Direct KMERIA pipeline integration
    Use KMC (--use-kmc) for: 
            - Very large datasets (>100GB per sample) 
            - When you need strict abundance filtering 
            - Compatibility with other KMC-based workflows
            - Faster counting
    When choosing the k-mer size (--kmer), consider:
            - Shorter k-mers: More sensitive, more false positives, less memory 
            - Longer k-mers: More specific, fewer false positives, more memory

How do I process paired-end reads?
    Both methods automatically detect and process paired-end files: 
            - Files matching: sample_R1.fq.gz and sample_R2.fq.gz 
            - Or: sample_1.fq.gz and sample_2.fq.gz
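
Since pairing relies on file names, a missing mate file is worth checking for before counting. The following sketch covers only the two naming patterns listed above with the `.fq.gz` extension; `find_unpaired` is a hypothetical helper and not part of the wrapper.

```shell
# Print every forward-read file in a directory whose mate file is missing.
find_unpaired() {
    local dir="$1" r1 r2
    for r1 in "$dir"/*_R1.fq.gz "$dir"/*_1.fq.gz; do
        [ -e "$r1" ] || continue    # the glob matched nothing
        case "$r1" in
            *_R1.fq.gz) r2="${r1%_R1.fq.gz}_R2.fq.gz" ;;
            *)          r2="${r1%_1.fq.gz}_2.fq.gz" ;;
        esac
        [ -e "$r2" ] || echo "missing mate for $r1"
    done
}

# Example call:
#   find_unpaired /data/fastq_files
```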

Can I restart a failed pipeline?
    Yes! Since each step generates independent job scripts: 
    1. Identify which step failed (check log files) 
    2. Fix the issue (add memory, correct input files, etc.) 
    3. Re-run only that specific step: --step count|kctm|filter|m2b|asso
    4. Continue with subsequent steps

How do I speed up association analysis?
    The association step handles internal parallelism: 
     - Use --threads to set concurrency (e.g., 64) 
     - Ensure fast I/O (SSD storage) 
     - Pre-compute kinship and covariates
     - Pass --use-bimbam-tools to run kassoc in bimbamAsso mode instead of the default gemma mode

AUTHOR

Updated for KMERIA v2.0 pipeline.
Original concept based on KMERIA by Chen Shuai (chensss1209@gmail.com)

CHANGELOG

v2.0.0 (2025):
- Added support for all kmeria count parameters (-C, -T, -p, -c)
- Updated filter step to use new compressed output format
- Enhanced m2b step with BGZF compression and statistics
- Updated association step to use kassoc tool

SEE ALSO

KMERIA documentation: https://github.com/Sh1ne111/KMERIA
