From 2b0949b24c2cfdf1e9e728061b11fe85403bdc5e Mon Sep 17 00:00:00 2001
From: "google-labs-jules[bot]" <161369871+google-labs-jules[bot]@users.noreply.github.com>
Date: Sun, 28 Dec 2025 20:16:10 +0000
Subject: [PATCH] Add high-level code overview document dd-20251228.md

This document provides a summary of the current codebase structure,
modules, and data flow, reflecting the simplified architecture in the
`yogimass` package.
---
 dd-20251228.md | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 dd-20251228.md

diff --git a/dd-20251228.md b/dd-20251228.md
new file mode 100644
index 0000000..d11dc80
--- /dev/null
+++ b/dd-20251228.md
@@ -0,0 +1,62 @@
+# Yogimass Code Overview (dd-20251228)
+
+## Introduction
+
+Yogimass is a Python toolkit designed for processing and analyzing LC-MS/MS tandem spectra. It leverages the `matchms` library for spectral cleaning, metadata harmonization, and similarity scoring. The project aims to provide a modular and reproducible workflow for building and searching spectral libraries.
+
+## Architecture Overview
+
+The current codebase is organized as a flat Python package `yogimass` containing four main modules. This structure is simplified compared to earlier or planned versions described in the project's main `README.md`.
+
+**Package Structure:**
+```
+yogimass/
+├── __init__.py     # Exposes io, processing, and similarity modules
+├── cli.py          # Command-line interface entry point
+├── io.py           # Input/Output operations (MGF, MSP, JSON, Pickle)
+├── processing.py   # Spectral cleaning and filtering pipelines
+└── similarity.py   # Similarity scoring wrappers (Cosine, Modified Cosine)
+```
+
+## Core Modules
+
+### 1. `cli.py` (Command Line Interface)
+The `cli.py` module serves as the main entry point for the application. It uses `argparse` to define commands and arguments.
+- **`clean` command**: Orchestrates the library cleaning process. It accepts an input file (`.msp` or `.mgf`), an output directory, and an output format. It delegates the actual processing to `processing.py` and saving to `io.py`.
+- **`search` command**: A placeholder for future similarity search functionality.
+
+### 2. `io.py` (Input/Output)
+This module handles reading and writing spectral data.
+- **Loading**: Wraps `matchms` functions to load spectra from MGF and MSP files.
+- **Listing**: helper functions to list available library files in a directory.
+- **Saving**: specific functions to export processed spectra to MGF, MSP, JSON, and Python Pickle formats.
+- **Spectra Access**: `fetch_mgflib_spectrum` provides access to raw peak data and metadata for specific spectra, likely for visualization or detailed analysis.
+
+### 3. `processing.py` (Data Processing)
+Contains the pipelines for cleaning and filtering spectral data.
+- **`metadata_processing`**: Applies a series of filters to repair and harmonize metadata (e.g., SMILES/InChI repair, charge standardization, ion mode derivation).
+- **`peak_processing`**: Filters peaks based on intensity (absolute and relative) and m/z range, and normalizes intensities.
+- **Library Cleaning**: `clean_mgf_library` and `clean_msp_library` apply these pipelines sequentially to all spectra in a given file.
+
+### 4. `similarity.py` (Similarity Scoring)
+Wraps `matchms` scoring algorithms to provide easy access to similarity calculations.
+- **Cosine Similarity**: `calculate_cosscores` computes Cosine Greedy similarity between reference and query spectra.
+- **Modified Cosine**: `modified_cosine_scores` computes Modified Cosine similarity.
+- **Matching Helpers**: `top10_cosine_matches` and `threshold_matches` provide convenient ways to filter and inspect high-scoring matches.
+
+## Data Flow
+
+1. **Input**: The user invokes the CLI with a path to a raw spectral library (MGF or MSP).
+2. **Ingestion**: `io.py` loads the raw data into `matchms` Spectrum objects.
+3. **Processing**: `processing.py` iterates through the spectra:
+    * Metadata is repaired and standardized.
+    * Noise peaks are removed and intensities are normalized.
+4. **Output**: The cleaned spectra are passed back to `io.py` to be serialized into the requested format (e.g., a cleaned MGF or a Pickle file) in the output directory.
+5. **Analysis (Search)**: While the CLI wrapper is pending, the `similarity.py` module allows programmatic calculation of similarity scores between cleaned spectra and reference libraries.
+
+## Dependencies
+
+The core functionality relies heavily on:
+- **`matchms`**: For the underlying data structures, filtering algorithms, and similarity metrics.
+- **`pandas`**: For data manipulation in `io.py`.
+- **`numpy`**: implied usage via `matchms` and `pandas`.
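
The `peak_processing` step described under `processing.py` (intensity filtering, m/z range selection, intensity normalization) can be illustrated with a small standard-library sketch. This is not the `yogimass` implementation, which the overview says delegates to `matchms` filters. The function name `clean_peaks`, the threshold defaults, and the order of the steps are all assumptions made for illustration.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def clean_peaks(
    peaks: List[Peak],
    mz_min: float = 10.0,            # hypothetical default, not from yogimass
    mz_max: float = 1000.0,          # hypothetical default, not from yogimass
    min_abs_intensity: float = 0.0,  # hypothetical default, not from yogimass
    min_rel_intensity: float = 0.01, # hypothetical default, not from yogimass
) -> List[Peak]:
    """Hypothetical stand-in for a peak-cleaning pipeline.

    Steps (order is an assumption):
      1. keep peaks inside the m/z window and above the absolute floor,
      2. normalize intensities so the base peak equals 1.0,
      3. drop peaks below the relative-intensity threshold.
    """
    # Step 1: m/z window and absolute intensity floor.
    kept = [
        (mz, intensity)
        for mz, intensity in peaks
        if mz_min <= mz <= mz_max and intensity >= min_abs_intensity
    ]
    if not kept:
        return []

    # Step 2: scale so the most intense remaining peak is 1.0.
    base = max(intensity for _, intensity in kept)
    if base <= 0:
        return []
    normalized = [(mz, intensity / base) for mz, intensity in kept]

    # Step 3: remove low-intensity (noise) peaks relative to the base peak.
    return [(mz, rel) for mz, rel in normalized if rel >= min_rel_intensity]


if __name__ == "__main__":
    raw = [(5.0, 100.0), (100.0, 50.0), (200.0, 200.0), (300.0, 1.0), (1500.0, 999.0)]
    print(clean_peaks(raw))  # [(100.0, 0.25), (200.0, 1.0)]
```

In the example, the peaks at m/z 5.0 and 1500.0 fall outside the window, and the peak at m/z 300.0 is removed because its relative intensity (0.005) is below the 1% threshold.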