From 2b0949b24c2cfdf1e9e728061b11fe85403bdc5e Mon Sep 17 00:00:00 2001
From: "google-labs-jules[bot]"
 <161369871+google-labs-jules[bot]@users.noreply.github.com>
Date: Sun, 28 Dec 2025 20:16:10 +0000
Subject: [PATCH] Add high-level code overview document dd-20251228.md

This document provides a summary of the current codebase structure,
modules, and data flow, reflecting the simplified architecture in
the `yogimass` package.
---
 dd-20251228.md | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 dd-20251228.md

diff --git a/dd-20251228.md b/dd-20251228.md
new file mode 100644
index 0000000..d11dc80
--- /dev/null
+++ b/dd-20251228.md
@@ -0,0 +1,62 @@
+# Yogimass Code Overview (dd-20251228)
+
+## Introduction
+
+Yogimass is a Python toolkit for processing and analyzing LC-MS/MS (tandem mass spectrometry) spectra. It uses the `matchms` library for spectral cleaning, metadata harmonization, and similarity scoring. The project aims to provide a modular, reproducible workflow for building and searching spectral libraries.
+
+## Architecture Overview
+
+The current codebase is organized as a flat Python package, `yogimass`, containing four main modules. This structure is simpler than the earlier or planned layouts described in the project's main `README.md`.
+
+**Package Structure:**
+```
+yogimass/
+├── __init__.py     # Exposes io, processing, and similarity modules
+├── cli.py          # Command-line interface entry point
+├── io.py           # Input/Output operations (MGF, MSP, JSON, Pickle)
+├── processing.py   # Spectral cleaning and filtering pipelines
+└── similarity.py   # Similarity scoring wrappers (Cosine, Modified Cosine)
+```
+
+## Core Modules
+
+### 1. `cli.py` (Command Line Interface)
+The `cli.py` module serves as the main entry point for the application. It uses `argparse` to define commands and arguments.
+- **`clean` command**: Orchestrates the library cleaning process. It accepts an input file (`.msp` or `.mgf`), an output directory, and an output format. It delegates the actual processing to `processing.py` and saving to `io.py`.
+- **`search` command**: A placeholder for future similarity search functionality.
+
+### 2. `io.py` (Input/Output)
+This module handles reading and writing spectral data.
+- **Loading**: Wraps `matchms` functions to load spectra from MGF and MSP files.
+- **Listing**: Helper functions that list the library files available in a directory.
+- **Saving**: Functions that export processed spectra to MGF, MSP, JSON, and Python Pickle formats.
+- **Spectra Access**: `fetch_mgflib_spectrum` provides access to raw peak data and metadata for specific spectra, likely for visualization or detailed analysis.
+
+### 3. `processing.py` (Data Processing)
+Contains the pipelines for cleaning and filtering spectral data.
+- **`metadata_processing`**: Applies a series of filters to repair and harmonize metadata (e.g., SMILES/InChI repair, charge standardization, ion mode derivation).
+- **`peak_processing`**: Filters peaks based on intensity (absolute and relative) and m/z range, and normalizes intensities.
+- **Library Cleaning**: `clean_mgf_library` and `clean_msp_library` apply these pipelines sequentially to all spectra in a given file.
+
+### 4. `similarity.py` (Similarity Scoring)
+Wraps `matchms` scoring algorithms to provide easy access to similarity calculations.
+- **Cosine Similarity**: `calculate_cosscores` computes Cosine Greedy similarity between reference and query spectra.
+- **Modified Cosine**: `modified_cosine_scores` computes Modified Cosine similarity.
+- **Matching Helpers**: `top10_cosine_matches` and `threshold_matches` provide convenient ways to filter and inspect high-scoring matches.
+
+## Data Flow
+
+1.  **Input**: The user invokes the CLI with a path to a raw spectral library (MGF or MSP).
+2.  **Ingestion**: `io.py` loads the raw data into `matchms` Spectrum objects.
+3.  **Processing**: `processing.py` iterates through the spectra:
+    *   Metadata is repaired and standardized.
+    *   Noise peaks are removed and intensities are normalized.
+4.  **Output**: The cleaned spectra are passed back to `io.py` to be serialized into the requested format (e.g., a cleaned MGF or a Pickle file) in the output directory.
+5.  **Analysis (Search)**: The `search` CLI command is still a placeholder, but `similarity.py` can already be called programmatically to compute similarity scores between cleaned spectra and reference libraries.
+
+## Dependencies
+
+The core functionality relies heavily on:
+-   **`matchms`**: For the underlying data structures, filtering algorithms, and similarity metrics.
+-   **`pandas`**: For data manipulation in `io.py`.
+-   **`numpy`**: Implied dependency, used indirectly through `matchms` and `pandas`.