Skip to content

NeLLi-team/moducomp

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

moducomp: metabolic module completeness of genomes and metabolic complementarity in microbiomes

moducomp is a bioinformatics pipeline designed to identify and analyze metabolic module completeness and complementarity in microbial communities. It processes genomic data (protein sequences in FAA format) to map KEGG Orthology (KO) terms to KEGG modules, calculates module completeness for individual genomes and combinations of genomes within a microbiome, specifically reports potential complementary relationships where a metabolic module is 100% complete in a combination of N genomes but not in any individual member or smaller subset.

Features

  • Annotation of protein sequences using eggNOG-mapper to obtain KO terms.
  • Mapping of KOs to KEGG metabolic modules based on kegg-pathways-completeness-tool to obtain metabolic module completeness.
  • Parallel processing support for faster KPCT (KEGG Pathways Completeness Tool) analysis with automatic chunking and checkpointing.
  • Reporting of module completeness for individual genomes.
  • Calculation of module completeness for N-member genome combinations.
  • Generation of complementarity reports highlighting modules completed through genome partnerships.
  • Tracks and reports the actual proteins that are responsible for the completion of the module in the combination of N genomes.
  • Automatic resource monitoring with timestamped logs tracking CPU usage, memory consumption, and runtime for reproducibility.

Installation

moducomp uses Pixi for managing dependencies and environment setup. Pixi ensures that you have a consistent and reproducible environment.

  1. Install Pixi: If you don't have Pixi installed, follow the instructions on the official Pixi website.
curl -fsSL https://pixi.sh/install.sh | sh
  1. Clone the Repository (if you haven't already):
git clone https://github.com/NeLLi-team/moducomp.git
cd moducomp
  1. Install Dependencies using Pixi: Navigate to the project's root directory (where pixi.toml is located) and run:
pixi install

This command will read the pixi.toml file, resolve the dependencies (including Python, eggNOG-mapper, kegg-pathways-completeness-tool, pandas, etc.), and set up a local environment in a .pixi subdirectory.

Note on eggNOG-mapper data: download the EggNOG-mapper DB that is required for functional annotation. This can be a large download and may take time. Ensure you have sufficient disk space and an internet connection:

pixi shell
export EGGNOG_DATA_DIR="/path/to/datadir/to/store/eggnog-db"
download_eggnog_data.py

Alternatively, you can configure eggNOG-mapper to use a pre-downloaded database.

export EGGNOG_DATA_DIR="/path/to/datadir/to/store/eggnog-db"

Usage

moducomp provides two main commands: pipeline and analyze-ko-matrix. You can run these commands using Pixi tasks defined in pixi.toml or directly within the Pixi environment.

Performance and parallel processing

moducomp includes parallel processing capabilities for the KPCT (KEGG Pathways Completeness Tool) analysis, which can significantly improve performance for large datasets:

  • Automatic chunking: Input files are automatically split into chunks for parallel processing
  • Checkpointing: Resume capability if processing is interrupted - already processed chunks are automatically detected and skipped
  • Fallback mechanism: If parallel processing fails, the system automatically falls back to sequential processing
  • Configurable CPU usage: Use the --ncpus parameter to control how many CPU cores to use

CPU Configuration:

  • The --ncpus parameter controls the number of CPU cores used for both eggNOG-mapper annotation and KPCT analysis
  • For KPCT parallel processing, the system creates the same number of chunks as CPU cores specified
  • Example: --ncpus 8 will use 8 cores and create 8 chunks for optimal parallel processing

⚠️ Important note 1

Prepare FAA files: Ensure FAA headers are in the form >genomeName|proteinId, or use the --adapt-headers option to format your headers into >fileName_prefix|protein_id_counter.

⚠️ Important note 2

moducomp is specifically designed for large scale analysis of microbiomes with hundreds of members, and works on Linux systems with at least 64GB of RAM. Nevertheless, it can be run on smaller systems with less RAM, using the flag --lowmem when running the pipeline command.

To activate the Pixi environment shell:

pixi shell

Once in the shell, you can run ./moducomp.py --help for a full list of commands and options.

Testing the pipeline (example)

To test the pipeline, you'll need a directory with some sample genome FAA files.

  1. Prepare FAA files: Create a directory (e.g., test_genomes/) and place your sample .faa files in it. Ensure FAA headers are in the form >genomeName|proteinId, or use the --adapt-headers option to format your headers into >fileName_prefix|protein_id_counter.

  2. Run the pipeline command:

If you have enough RAM (>64GB), you can run the pipeline with the following command:

pixi shell
./moducomp.py pipeline ./test_genomes ./output_test_pipeline --ncpus 16 --calculate-complementarity 3
# For verbose output with detailed progress information:
# ./moducomp.py pipeline ./test_genomes ./output_test_pipeline_verbose3 --ncpus 128 --calculate-complementarity 3 --verbose

⚠️ Note: If don't have enough RAM (>64GB), you can use the --lowmem flag to run the pipeline with less memory but it will be slower.

pixi shell
./moducomp.py pipeline ./test_genomes ./output_test_pipeline_lowmem --ncpus 16 --calculate-complementarity 3 --lowmem

Running with your samples

pipeline command

Use the pipeline command to process a directory of genome FAA files from scratch.

pixi shell

./moducomp.py pipeline \
    /path/to/your/faa_files \
    /path/to/your/output_directory \
    --ncpus <number_of_cpus_to_use> \
    --calculate-complementarity <N>  # 0 to disable, 2 for 2-member, 3 for 3-member complementarity.
    # Optional flags:
    # --lowmem                    # Optional: Use this if you have less than 64GB of RAM
    # --adapt-headers             # If your FASTA headers need modification
    # --del-tmp                   # To delete temporary files
    # --verbose                   # Enable verbose output with detailed progress information

analyze-ko-matrix command

Use the analyze-ko-matrix command if you already have a KO matrix file (CSV format, where rows are genomes/combinations, columns are KOs, and values are KO counts).

The KO matrix file should have a taxon_oid column for genome identifiers, and subsequent columns for each KO (e.g., K00001, K00002) with integer counts.

pixi shell

./moducomp.py analyze-ko-matrix \
    /path/to/your/kos_matrix.csv \
    /path/to/your/output_directory \
    --ncpus <number_of_cpus_to_use> \
    --calculate-complementarity <N>  # 0 to disable, 2 for 2-member, 3 for 3-member complementarity.

    # Optional flags:
    # --del-tmp false
    # --verbose                   # Enable verbose output with detailed progress information

Parallel processing features

moducomp includes advanced parallel processing capabilities for improved performance:

KPCT parallel processing

When using the --ncpus parameter with a value greater than 1, moducomp automatically enables parallel processing for the KPCT (KEGG Pathways Completeness Tool) analysis:

  • Automatic chunking: Input files are split into ncpus chunks for optimal load balancing
  • Concurrent processing: Multiple chunks are processed simultaneously using multiprocessing.ProcessPoolExecutor
  • Resume capability: If processing is interrupted, completed chunks are automatically detected and skipped on restart
  • Automatic fallback: If parallel processing fails, the system seamlessly falls back to sequential processing

Performance tips

  • CPU cores: Start with --ncpus 8 for moderate datasets, increase to --ncpus 16 or higher for large datasets
  • Memory considerations: Each parallel worker requires memory; reduce --ncpus if you encounter memory issues
  • Large datasets: For datasets with hundreds of genomes, parallel processing can reduce KPCT analysis time by 50-80%

Example with parallel processing

# For large datasets with sufficient resources
./moducomp.py pipeline ./large_genome_collection ./output_large --ncpus 32 --calculate-complementarity 3

# For moderate datasets with verbose output
./moducomp.py analyze-ko-matrix ./ko_matrix.csv ./output_moderate --ncpus 16 --calculate-complementarity 2 --verbose

# For systems with limited memory
./moducomp.py pipeline ./genomes ./output_lowmem --ncpus 8 --lowmem --calculate-complementarity 2

Output files

moducomp generates several output files in the specified output directory:

  • kos_matrix.csv: Matrix of KO counts for each genome
  • module_completeness.tsv: Module completeness scores for individual genomes and combinations
  • module_completeness_complementarity_Nmember.tsv: Complementarity reports (if requested)
  • resource_usage_YYYYMMDD_HHMMSS.log: Resource monitoring log with CPU, memory, and runtime metrics for reproducibility
  • moducomp_YYYYMMDD_HHMMSS.log: Detailed pipeline execution log

Citation

Villada, JC. & Schulz, F. (2025). Assessment of metabolic module completeness of genomes and metabolic complementarity in microbiomes with moducomp . moducomp (v0.5.1) Zenodo. https://doi.org/10.5281/zenodo.16116092

About

Metabolic module completeness of genomes and metabolic complementarity in microbiomes

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages