62 changes: 62 additions & 0 deletions dd-20251228.md
@@ -0,0 +1,62 @@
# Yogimass Code Overview (dd-20251228)

## Introduction

Yogimass is a Python toolkit for processing and analyzing LC-MS/MS (tandem mass) spectra. It uses the `matchms` library for spectral cleaning, metadata harmonization, and similarity scoring. The project aims to provide a modular and reproducible workflow for building and searching spectral libraries.

## Architecture Overview

The current codebase is organized as a flat Python package `yogimass` containing four main modules. This structure is simplified compared to earlier or planned versions described in the project's main `README.md`.

**Package Structure:**
```
yogimass/
├── __init__.py # Exposes io, processing, and similarity modules
├── cli.py # Command-line interface entry point
├── io.py # Input/Output operations (MGF, MSP, JSON, Pickle)
├── processing.py # Spectral cleaning and filtering pipelines
└── similarity.py # Similarity scoring wrappers (Cosine, Modified Cosine)
```

## Core Modules

### 1. `cli.py` (Command Line Interface)
The `cli.py` module serves as the main entry point for the application. It uses `argparse` to define commands and arguments.
- **`clean` command**: Orchestrates the library cleaning process. It accepts an input file (`.msp` or `.mgf`), an output directory, and an output format. It delegates the actual processing to `processing.py` and saving to `io.py`.
- **`search` command**: A placeholder for future similarity search functionality.
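
The command layout described above can be sketched with `argparse` subparsers. This is a minimal illustration, not the module's actual code: the flag names (`--output-dir`, `--format`), the defaults, and the list of format choices are assumptions, and the `clean` branch only prints what it would do instead of calling `processing.py` and `io.py`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with `clean` and `search` subcommands (flag names are assumed)."""
    parser = argparse.ArgumentParser(prog="yogimass")
    subparsers = parser.add_subparsers(dest="command", required=True)

    clean = subparsers.add_parser("clean", help="Clean a spectral library")
    clean.add_argument("input", help="Path to an .msp or .mgf library")
    clean.add_argument("--output-dir", default=".", help="Directory for cleaned output")
    clean.add_argument(
        "--format",
        choices=["mgf", "msp", "json", "pickle"],
        default="mgf",
        help="Output format",
    )

    # Placeholder subcommand, mirroring the description above: no search logic yet.
    subparsers.add_parser("search", help="Similarity search (not implemented)")
    return parser


def main(argv=None) -> int:
    """Parse arguments and dispatch; returns a process exit code."""
    args = build_parser().parse_args(argv)
    if args.command == "clean":
        # A real implementation would hand off to processing.py and io.py here.
        print(f"Would clean {args.input} into {args.output_dir} as {args.format}")
        return 0
    print("search is not implemented yet")
    return 1
```

Calling `main(["clean", "library.mgf", "--format", "json"])` exercises the `clean` path without touching any files, which keeps the dispatch logic easy to test separately from the processing code.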

### 2. `io.py` (Input/Output)
This module handles reading and writing spectral data.
- **Loading**: Wraps `matchms` functions to load spectra from MGF and MSP files.
- **Listing**: Helper functions that list the library files available in a directory.
- **Saving**: Functions that export processed spectra to MGF, MSP, JSON, and Python Pickle formats.
- **Spectra Access**: `fetch_mgflib_spectrum` provides access to raw peak data and metadata for specific spectra, likely for visualization or detailed analysis.
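
As a rough sketch of the loading and listing behavior, the snippet below picks a `matchms` importer from the file extension. The function names `loader_name_for`, `load_library`, and `list_libraries` are hypothetical and are not taken from `io.py`; only `matchms.importing.load_from_mgf` and `matchms.importing.load_from_msp` are real `matchms` functions. The `matchms` import is deferred so that the extension handling can be used and tested without the library installed.

```python
from pathlib import Path

# Map file extensions to the name of the matchms importer that reads them.
_LOADER_NAMES = {".mgf": "load_from_mgf", ".msp": "load_from_msp"}


def loader_name_for(path: str) -> str:
    """Return the name of the matchms importer for a library file."""
    suffix = Path(path).suffix.lower()
    try:
        return _LOADER_NAMES[suffix]
    except KeyError:
        raise ValueError(f"Unsupported library format: {suffix!r}") from None


def load_library(path: str) -> list:
    """Load every spectrum in an MGF or MSP file into a list."""
    import matchms.importing as importing  # deferred import; requires matchms

    loader = getattr(importing, loader_name_for(path))
    return list(loader(path))


def list_libraries(directory: str) -> list[str]:
    """List the MGF and MSP files in a directory, sorted by name."""
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.suffix.lower() in _LOADER_NAMES
    )
```

Dispatching on the extension keeps the supported formats in one table, so adding a format means adding one entry rather than another branch.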

### 3. `processing.py` (Data Processing)
Contains the pipelines for cleaning and filtering spectral data.
- **`metadata_processing`**: Applies a series of filters to repair and harmonize metadata (e.g., SMILES/InChI repair, charge standardization, ion mode derivation).
- **`peak_processing`**: Filters peaks based on intensity (absolute and relative) and m/z range, and normalizes intensities.
- **Library Cleaning**: `clean_mgf_library` and `clean_msp_library` apply these pipelines sequentially to all spectra in a given file.
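
To make the two pipelines concrete, here is a plain-Python stand-in that works on dictionaries and `(mz, intensity)` tuples instead of `matchms` Spectrum objects. It is not the `processing.py` implementation and does not call the `matchms` filters; the function names, the thresholds, and the metadata rules (lowercased keys, charge strings such as `"2-"`, ion mode derived from the charge sign) are all illustrative assumptions.

```python
def harmonize_metadata(metadata: dict) -> dict:
    """Lowercase keys, parse charge strings, and derive ion mode from charge."""
    cleaned = {str(k).lower().strip(): v for k, v in metadata.items()}
    charge = cleaned.get("charge")
    if isinstance(charge, str):
        # Accept forms such as "1+", "2-", "+1", "-1".
        sign = -1 if "-" in charge else 1
        digits = "".join(ch for ch in charge if ch.isdigit())
        cleaned["charge"] = sign * int(digits) if digits else None
    if not cleaned.get("ionmode") and cleaned.get("charge"):
        cleaned["ionmode"] = "positive" if cleaned["charge"] > 0 else "negative"
    return cleaned


def process_peaks(peaks, mz_min=0.0, mz_max=1000.0, min_relative_intensity=0.01):
    """Keep peaks in [mz_min, mz_max], scale the base peak to 1.0, drop weak peaks."""
    in_range = [(mz, i) for mz, i in peaks if mz_min <= mz <= mz_max and i > 0]
    if not in_range:
        return []
    base = max(i for _, i in in_range)
    normalized = [(mz, i / base) for mz, i in in_range]
    return [(mz, i) for mz, i in normalized if i >= min_relative_intensity]


def clean_spectrum(spectrum: dict, **peak_options) -> dict:
    """Apply the metadata step, then the peak step, to one spectrum."""
    return {
        "metadata": harmonize_metadata(spectrum["metadata"]),
        "peaks": process_peaks(spectrum["peaks"], **peak_options),
    }


def clean_library(spectra, **peak_options) -> list:
    """Apply both steps sequentially to every spectrum in a library."""
    return [clean_spectrum(s, **peak_options) for s in spectra]
```

Normalizing before the relative-intensity cut means the threshold is expressed as a fraction of the base peak, so the same value behaves consistently across spectra with very different absolute intensities.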

### 4. `similarity.py` (Similarity Scoring)
Wraps `matchms` scoring algorithms to provide easy access to similarity calculations.
- **Cosine Similarity**: `calculate_cosscores` computes Cosine Greedy similarity between reference and query spectra.
- **Modified Cosine**: `modified_cosine_scores` computes Modified Cosine similarity.
- **Matching Helpers**: `top10_cosine_matches` and `threshold_matches` provide convenient ways to filter and inspect high-scoring matches.
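
The idea behind greedy cosine scoring and the matching helpers can be sketched in plain Python. This is a simplified illustration and not the `matchms` `CosineGreedy` implementation or the code in `similarity.py`: the names `greedy_cosine` and `rank_matches`, the default tolerance of 0.1, and the dictionary of named reference spectra are assumptions. Modified cosine, which also allows matches shifted by the precursor mass difference, is not covered here.

```python
import math


def greedy_cosine(peaks_a, peaks_b, tolerance=0.1):
    """Return (score, n_matches) for two lists of (mz, intensity) peaks."""
    # Collect every pair of peaks whose m/z values agree within the tolerance.
    candidates = [
        (ia * ib, a, b)
        for a, (mza, ia) in enumerate(peaks_a)
        for b, (mzb, ib) in enumerate(peaks_b)
        if abs(mza - mzb) <= tolerance
    ]
    # Greedy assignment: strongest products first, each peak used at most once.
    candidates.sort(reverse=True)
    used_a, used_b = set(), set()
    total, n_matches = 0.0, 0
    for product, a, b in candidates:
        if a in used_a or b in used_b:
            continue
        used_a.add(a)
        used_b.add(b)
        total += product
        n_matches += 1
    norm_a = math.sqrt(sum(i * i for _, i in peaks_a))
    norm_b = math.sqrt(sum(i * i for _, i in peaks_b))
    norm = norm_a * norm_b
    return (total / norm if norm else 0.0), n_matches


def rank_matches(query, references, n=10, min_score=0.0, tolerance=0.1):
    """Rank named reference spectra against a query, best score first.

    `references` maps a name to a peak list. Results below `min_score`
    are dropped, and at most `n` rows of (name, score, n_matches) are returned.
    """
    scored = [
        (name, *greedy_cosine(query, peaks, tolerance))
        for name, peaks in references.items()
    ]
    kept = [row for row in scored if row[1] >= min_score]
    kept.sort(key=lambda row: row[1], reverse=True)
    return kept[:n]
```

Returning the number of matched peaks alongside the score helps when inspecting results, because a high score built on a single shared peak is usually less convincing than the same score built on several.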

## Data Flow

1. **Input**: The user invokes the CLI with a path to a raw spectral library (MGF or MSP).
2. **Ingestion**: `io.py` loads the raw data into `matchms` Spectrum objects.
3. **Processing**: `processing.py` iterates through the spectra:
   * Metadata is repaired and standardized.
   * Noise peaks are removed and intensities are normalized.
4. **Output**: The cleaned spectra are passed back to `io.py` to be serialized into the requested format (e.g., a cleaned MGF or a Pickle file) in the output directory.
5. **Analysis (Search)**: While the CLI wrapper is pending, the `similarity.py` module allows programmatic calculation of similarity scores between cleaned spectra and reference libraries.

## Dependencies

The core functionality relies heavily on:
- **`matchms`**: For the underlying data structures, filtering algorithms, and similarity metrics.
- **`pandas`**: For data manipulation in `io.py`.
- **`numpy`**: Used indirectly through `matchms` and `pandas`.