moducomp is a bioinformatics pipeline designed to identify and analyze metabolic module completeness and complementarity in microbial communities. It processes genomic data (protein sequences in FAA format) to map KEGG Orthology (KO) terms to KEGG modules, calculates module completeness for individual genomes and combinations of genomes within a microbiome, specifically reports potential complementary relationships where a metabolic module is 100% complete in a combination of N genomes but not in any individual member or smaller subset.
- Annotation of protein sequences using
eggNOG-mapperto obtain KO terms. - Mapping of KOs to KEGG metabolic modules based on
kegg-pathways-completeness-toolto obtain metabolic module completeness. - Parallel processing support for faster KPCT (KEGG Pathways Completeness Tool) analysis with automatic chunking and checkpointing.
- Reporting of module completeness for individual genomes.
- Calculation of module completeness for N-member genome combinations.
- Generation of complementarity reports highlighting modules completed through genome partnerships.
- Tracks and reports the actual proteins that are responsible for the completion of the module in the combination of N genomes.
- Automatic resource monitoring with timestamped logs tracking CPU usage, memory consumption, and runtime for reproducibility.
moducomp uses Pixi for managing dependencies and environment setup. Pixi ensures that you have a consistent and reproducible environment.
- Install Pixi: If you don't have Pixi installed, follow the instructions on the official Pixi website.
curl -fsSL https://pixi.sh/install.sh | sh- Clone the Repository (if you haven't already):
git clone https://github.com/NeLLi-team/moducomp.git
cd moducomp- Install Dependencies using Pixi:
Navigate to the project's root directory (where
pixi.tomlis located) and run:
pixi installThis command will read the pixi.toml file, resolve the dependencies (including Python, eggNOG-mapper, kegg-pathways-completeness-tool, pandas, etc.), and set up a local environment in a .pixi subdirectory.
Note on eggNOG-mapper data: download the EggNOG-mapper DB that is required for functional annotation. This can be a large download and may take time. Ensure you have sufficient disk space and an internet connection:
pixi shell
export EGGNOG_DATA_DIR="/path/to/datadir/to/store/eggnog-db"
download_eggnog_data.pyAlternatively, you can configure eggNOG-mapper to use a pre-downloaded database.
export EGGNOG_DATA_DIR="/path/to/datadir/to/store/eggnog-db"moducomp provides two main commands: pipeline and analyze-ko-matrix. You can run these commands using Pixi tasks defined in pixi.toml or directly within the Pixi environment.
moducomp includes parallel processing capabilities for the KPCT (KEGG Pathways Completeness Tool) analysis, which can significantly improve performance for large datasets:
- Automatic chunking: Input files are automatically split into chunks for parallel processing
- Checkpointing: Resume capability if processing is interrupted - already processed chunks are automatically detected and skipped
- Fallback mechanism: If parallel processing fails, the system automatically falls back to sequential processing
- Configurable CPU usage: Use the
--ncpusparameter to control how many CPU cores to use
CPU Configuration:
- The
--ncpusparameter controls the number of CPU cores used for both eggNOG-mapper annotation and KPCT analysis - For KPCT parallel processing, the system creates the same number of chunks as CPU cores specified
- Example:
--ncpus 8will use 8 cores and create 8 chunks for optimal parallel processing
Prepare FAA files: Ensure FAA headers are in the form >genomeName|proteinId, or use the --adapt-headers option to format your headers into >fileName_prefix|protein_id_counter.
moducomp is specifically designed for large scale analysis of microbiomes with hundreds of members, and works on Linux systems with at least 64GB of RAM. Nevertheless, it can be run on smaller systems with less RAM, using the flag --lowmem when running the pipeline command.
To activate the Pixi environment shell:
pixi shellOnce in the shell, you can run ./moducomp.py --help for a full list of commands and options.
To test the pipeline, you'll need a directory with some sample genome FAA files.
-
Prepare FAA files: Create a directory (e.g.,
test_genomes/) and place your sample.faafiles in it. Ensure FAA headers are in the form>genomeName|proteinId, or use the--adapt-headersoption to format your headers into>fileName_prefix|protein_id_counter. -
Run the pipeline command:
If you have enough RAM (>64GB), you can run the pipeline with the following command:
pixi shell
./moducomp.py pipeline ./test_genomes ./output_test_pipeline --ncpus 16 --calculate-complementarity 3
# For verbose output with detailed progress information:
# ./moducomp.py pipeline ./test_genomes ./output_test_pipeline_verbose3 --ncpus 128 --calculate-complementarity 3 --verbose--lowmem flag to run the pipeline with less memory but it will be slower.
pixi shell
./moducomp.py pipeline ./test_genomes ./output_test_pipeline_lowmem --ncpus 16 --calculate-complementarity 3 --lowmemUse the pipeline command to process a directory of genome FAA files from scratch.
pixi shell
./moducomp.py pipeline \
/path/to/your/faa_files \
/path/to/your/output_directory \
--ncpus <number_of_cpus_to_use> \
--calculate-complementarity <N> # 0 to disable, 2 for 2-member, 3 for 3-member complementarity.
# Optional flags:
# --lowmem # Optional: Use this if you have less than 64GB of RAM
# --adapt-headers # If your FASTA headers need modification
# --del-tmp # To delete temporary files
# --verbose # Enable verbose output with detailed progress informationUse the analyze-ko-matrix command if you already have a KO matrix file (CSV format, where rows are genomes/combinations, columns are KOs, and values are KO counts).
The KO matrix file should have a taxon_oid column for genome identifiers, and subsequent columns for each KO (e.g., K00001, K00002) with integer counts.
pixi shell
./moducomp.py analyze-ko-matrix \
/path/to/your/kos_matrix.csv \
/path/to/your/output_directory \
--ncpus <number_of_cpus_to_use> \
--calculate-complementarity <N> # 0 to disable, 2 for 2-member, 3 for 3-member complementarity.
# Optional flags:
# --del-tmp false
# --verbose # Enable verbose output with detailed progress informationmoducomp includes advanced parallel processing capabilities for improved performance:
When using the --ncpus parameter with a value greater than 1, moducomp automatically enables parallel processing for the KPCT (KEGG Pathways Completeness Tool) analysis:
- Automatic chunking: Input files are split into
ncpuschunks for optimal load balancing - Concurrent processing: Multiple chunks are processed simultaneously using
multiprocessing.ProcessPoolExecutor - Resume capability: If processing is interrupted, completed chunks are automatically detected and skipped on restart
- Automatic fallback: If parallel processing fails, the system seamlessly falls back to sequential processing
- CPU cores: Start with
--ncpus 8for moderate datasets, increase to--ncpus 16or higher for large datasets - Memory considerations: Each parallel worker requires memory; reduce
--ncpusif you encounter memory issues - Large datasets: For datasets with hundreds of genomes, parallel processing can reduce KPCT analysis time by 50-80%
# For large datasets with sufficient resources
./moducomp.py pipeline ./large_genome_collection ./output_large --ncpus 32 --calculate-complementarity 3
# For moderate datasets with verbose output
./moducomp.py analyze-ko-matrix ./ko_matrix.csv ./output_moderate --ncpus 16 --calculate-complementarity 2 --verbose
# For systems with limited memory
./moducomp.py pipeline ./genomes ./output_lowmem --ncpus 8 --lowmem --calculate-complementarity 2moducomp generates several output files in the specified output directory:
kos_matrix.csv: Matrix of KO counts for each genomemodule_completeness.tsv: Module completeness scores for individual genomes and combinationsmodule_completeness_complementarity_Nmember.tsv: Complementarity reports (if requested)resource_usage_YYYYMMDD_HHMMSS.log: Resource monitoring log with CPU, memory, and runtime metrics for reproducibilitymoducomp_YYYYMMDD_HHMMSS.log: Detailed pipeline execution log
Villada, JC. & Schulz, F. (2025). Assessment of metabolic module completeness of genomes and metabolic complementarity in microbiomes with moducomp . moducomp (v0.5.1) Zenodo. https://doi.org/10.5281/zenodo.16116092