Skip to content

Promote consensus mass-tracing types + add Ms1Feature writer for MassFeature#1069

Open
trishorts wants to merge 20 commits into
smith-chem-wisc:masterfrom
trishorts:consensus-to-ms1feature
Open

Promote consensus mass-tracing types + add Ms1Feature writer for MassFeature#1069
trishorts wants to merge 20 commits into
smith-chem-wisc:masterfrom
trishorts:consensus-to-ms1feature

Conversation

@trishorts
Copy link
Copy Markdown
Contributor

@trishorts trishorts commented May 13, 2026

This PRis for a consensus mass-tracing classic deconvolution post-processor (per-charge trace grouping, off-by-one correction, cross-charge feature stitching). The writer serializes its output as a FLASHDeconv-style _ms1.feature file MetaMorpheus already reads — bridging consensus deconvolution into search with no new wire format. Off-by-one correction is now resolution-aware: it derives its window from per-trace mass scatter and splits resolvable co-grouped species such as deamidation into their own features instead of erasing them, which bottom-up search requires. Feature output is deterministic, inputs are guarded, and the promoted types ship with 100%-line-covered unit and performance tests.


  • Resolution-aware off-by-one correction (behaviour change). TraceCorrector.Correct now returns List<CorrectedTrace> and derives its window from per-trace mass scatter (3σ via MAD, capped below half the isotope spacing) instead of a fixed ±0.05 Da. An envelope at a resolvable, non-isotope offset — e.g. deamidation (+0.98402 Da), which the old window snapped to the unmodified mass — is split into its own trace and surfaced as a distinct feature; when scatter is too large to resolve it from the 1.00335 Da isotope spacing it conservatively merges. Bottom-up search needs this (deamidation detection is a requirement). The only constants are the 3σ multiple and the physics constant 1.00335 — the window is data-derived, not hardwired. Note: this supersedes the flat ±0.05 used for the benchmark numbers below, so those warrant an end-to-end re-run.

Summary

Two unrelated-looking but tightly-coupled goals in one PR:

  1. Promote the consensus mass-tracing post-processor out of Development.Deconvolution.ConsensusTracing (research-only, see NOTES.md on the consensus-tracing branch for the 1300-line research log behind the design) into a production namespace MassSpectrometry.Deconvolution.Consensus. Per-charge trace grouping, median off-by-one correction, and cross-charge feature stitching are now available to any caller without going through the Development project.

  2. Add a writer that turns the consensus pipeline's output into a FLASHDeconv-style _ms1.feature file via a static Ms1FeatureFile.FromMassFeatures(...) factory + a MassFeature.ToMs1Feature(...) extension method. The factory builds a populated Ms1FeatureFile instance; calling its existing WriteResults(path) emits the wire format that any Ms1FeatureFile consumer already understands — notably MetaMorpheus's FromFileDeconvolutionParameters from mzLib PR 1064, which the MetaMorpheus ExternalMs1Features PR has now wired into the precursor pipeline.

Combined effect: a consensus pipeline can persist its result to a file that MetaMorpheus already reads, with zero new mzLib types crossing into MetaMorpheus and zero new MetaMorpheus surface area. The previously-designed Ms2PrecursorEmitter + Ms2ScanWithPrecursorCandidate two-PR plan (see NOTES.md "Pending design — MS2 emitter" section) is obviated by this approach and can be dropped.

Why now

The consensus pipeline (Classic-decon → trace grouping → off-by-one correction → cross-charge clustering → optional Phase 13 FastTree scoring) was developed over Phases 0–13 of the research arc with the types parked in Development\DeconvolutionDevelopment\ConsensusTracing\. The design has now stabilised: 82–84% recall against the independent R7 reference, mean Phase-13 CV AUC 0.68. NOTES.md flagged "WholeFile move of consensus-tracing types" as the next move once the design settled. With the MetaMorpheus FromFile path landing and the consensus pipeline ready to be consumed, this PR is that move plus the matching writer.

Phase A: Type promotion (commit 3d734cc1)

Six types moved out of Development.Deconvolution.ConsensusTracing into new MassSpectrometry.Deconvolution.Consensus:

Type New file
MassTrace mzLib/MassSpectrometry/Deconvolution/Consensus/MassTrace.cs
MassTraceBuilder (static) mzLib/MassSpectrometry/Deconvolution/Consensus/MassTraceBuilder.cs
CorrectedEnvelope mzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedEnvelope.cs
CorrectedTrace mzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedTrace.cs
TraceCorrector (static, + EnvelopeWeight delegate) mzLib/MassSpectrometry/Deconvolution/Consensus/TraceCorrector.cs
MassFeature mzLib/MassSpectrometry/Deconvolution/Consensus/MassFeature.cs
MassFeatureBuilder (static, extracted from Phase4_CrossChargeConsensus) mzLib/MassSpectrometry/Deconvolution/Consensus/MassFeatureBuilder.cs

Each type now lives in its own file (matching the existing MassSpectrometry/Deconvolution/Algorithms/, Parameters/, Scoring/ folder discipline). Doc comments updated to drop research-only references; method bodies and signatures are byte-identical to the research-namespace versions. No behaviour change.

The research scaffolding (Phase*.cs fixtures, MakeConsensusSnips, LargeTestDataLocator) stays on the consensus-tracing branch. That branch needs a rebase after this PR merges to point at the new namespace — a one-line using MassSpectrometry.Deconvolution.Consensus; per phase file plus deletion of the now-redundant source under Development\. Out of scope here.

Phase B: Ms1FeatureFile.FromMassFeatures factory (commit 61029402)

Two pieces:

Extension method in Readers/ExternalResults/IndividualResultRecords/Ms1FeatureFromMassFeatureExtensions.cs:

public static Ms1Feature ToMs1Feature(
    this MassFeature feature,
    int sequentialId,
    int sampleId = 0,
    int fractionId = 0)

Field mapping (FLASHDeconv canonical schema):

Ms1Feature field Source
SampleId caller (default 0)
Id caller's sequentialId
Mass MassFeature.ConsensusMass (intensity-weighted mean across constituent traces)
Intensity MassFeature.SummedIntensity
RetentionTimeBegin MassFeature.RTStart
RetentionTimeEnd MassFeature.RTEnd
RetentionTimeApex RT of the highest-intensity envelope on the highest-summed-intensity constituent trace ("apex of the dominant charge state at its most intense scan")
IntensityApex intensity of that same envelope
ChargeStateMin / ChargeStateMax min(MassFeature.Charges) / max(MassFeature.Charges)
FractionIdMin / FractionIdMax caller's fractionId (default 0)

Static factory added directly on Ms1FeatureFile in Readers/ExternalResults/ResultFiles/Ms1FeatureFile.cs:

public static Ms1FeatureFile FromMassFeatures(
    IEnumerable<MassFeature> features,
    int sampleId = 0,
    int fractionId = 0,
    Software software = Software.FLASHDeconv)

Builds the records via ToMs1Feature, populates a fresh Ms1FeatureFile, returns it. The caller invokes the existing .WriteResults(path) to emit bytes — no new IO path, the existing reader's writer is reused. Software defaults to FLASHDeconv because that's the canonical schema; callers can override.

No filter applied. Every MassFeature becomes one row. Downstream consumers (MetaMorpheus's precursor HashSet dedupe + FDR machinery) handle noise. This was an explicit design decision — three alternatives (multi-charge only, score-thresholded, score-as-side-artifact) were considered and rejected for v1 to keep maximum recall and let downstream policy decide.

Phase C: Round-trip tests (commit 9bc26a27)

Five NUnit cases in Test/FileReadingTests/ExternalFileReading/TestMs1FeatureFromMassFeature.cs:

Test Asserts
ToMs1Feature_SingletonFeature_MapsAllFields One-trace, one-envelope feature locks the full field mapping for the trivial case.
ToMs1Feature_MultiChargeFeature_ApexIsDominantTraceMaxEnvelope Two-charge, five-envelope feature confirms apex selection and that cross-charge consensus mass is the intensity-weighted mean of per-trace consensus masses.
ToMs1Feature_SampleIdAndFractionIdHonoured Caller overrides for Sample_ID and Fraction_ID propagate to the row.
FromMassFeatures_WriteThenRead_AllFieldsSurvive Three features → write → re-read with FileReader.ReadFile<Ms1FeatureFile> → field-by-field equality. Catches serialisation drift between writer and reader.
FromMassFeatures_AssignsSequentialIdsFromZero Producer's internal MassFeature.Id is ignored; written rows get 0..N-1 regardless of upstream filtering holes.

Bug fixed while writing the tests

The original FromMassFeatures factory used file.Results.Add(...) in a loop. ResultFile<T>.Results.get calls LoadResults() whenever the list is empty; for an in-memory-constructed Ms1FeatureFile with no FilePath, that throws ArgumentException on the empty path. Reshaped the factory to build the list up front and assign once via the public setter, which bypasses the lazy-load. Inline comment in the factory explains why. The third round-trip test caught this before the writer shipped.

Results

TestMs1FeatureFromMassFeature suite          : 5/5 pass
Broader sweep (Ms1Feature + FromFile + ...)   : 148/148 pass
Build (Release|x64, MassSpectrometry + Readers): 0 errors, ~650 pre-existing warnings

The "broader sweep" covers TestMsFeature, TestMs1FeatureFromMassFeature (new), TestFromFileDeconvolution, TestResultFile, TestSupportedFileExtensions, TestDinosaurTsv — every test fixture that touches the Ms1Feature / FromFile / supported-file-extensions surface.

Files changed

File +/− Change
mzLib/MassSpectrometry/Deconvolution/Consensus/MassTrace.cs +37 new
mzLib/MassSpectrometry/Deconvolution/Consensus/MassTraceBuilder.cs +91 new
mzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedEnvelope.cs +21 new
mzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedTrace.cs +31 new
mzLib/MassSpectrometry/Deconvolution/Consensus/TraceCorrector.cs +124 new
mzLib/MassSpectrometry/Deconvolution/Consensus/MassFeature.cs +61 new
mzLib/MassSpectrometry/Deconvolution/Consensus/MassFeatureBuilder.cs +90 new
mzLib/Readers/ExternalResults/IndividualResultRecords/Ms1FeatureFromMassFeatureExtensions.cs +65 new
mzLib/Readers/ExternalResults/ResultFiles/Ms1FeatureFile.cs +50 / −1 added FromMassFeatures static + using MassSpectrometry.Deconvolution.Consensus;
mzLib/Test/FileReadingTests/ExternalFileReading/TestMs1FeatureFromMassFeature.cs +260 new

Total: 10 files, +830 / −3 across 3 commits.

How a consumer uses this

using MassSpectrometry.Deconvolution.Consensus;
using Readers;

// Caller has a list of MassFeature objects from the consensus pipeline.
// Finalise() must have been called on each.
IEnumerable<MassFeature> features = RunConsensusPipeline(...);

// Build and write the .ms1.feature file.
Ms1FeatureFile.FromMassFeatures(features)
    .WriteResults(@"E:\path\to\my_sample_ms1.feature");

// Place the .ms1.feature file next to the corresponding raw/mzML and
// MetaMorpheus's auto-discovery (PR #2650) picks it up automatically.

End-to-end on the yeast top-down test data (per the MetaMorpheus PR results table): adding the FromFile source to the today-baseline classic pipeline yields +44% PSMs and +10% protein groups at 1% FDR.

Architectural notes

  • The writer is decoupled from the producer. It takes IEnumerable<MassFeature>; whoever produces the features (the consensus pipeline today, possibly other producers in the future) doesn't matter.
  • No new wire format. The output is exactly the same FLASHDeconv-style _ms1.feature schema that Ms1FeatureFile.WriteResults already emits. The writer just constructs the records in memory.
  • The MS2 pairing happens inside FromFileDeconvolutionAlgorithm (mzLib PR 1064), which the MetaMorpheus FromFile path invokes per MS2 scan. No new pairing code in this PR.
  • The Ms2PrecursorEmitter + Ms2ScanWithPrecursorCandidate design (see consensus-tracing branch's NOTES.md, "Pending design — MS2 emitter") is obviated by this PR: file-based integration achieves the same end-to-end behaviour without the new types crossing repos.

Out of scope

  • Filter policy. No quality filter is applied (every MassFeature becomes a row). If a downstream consumer wants score-based filtering, that's a separate change — either at the producer (filter the MassFeature list before calling FromMassFeatures) or downstream (filter Ms1Feature records after LoadResults).
  • Phase 13 FastTree scoring (mean CV AUC 0.68 per snippet, NOTES.md Phase 13). Productising the trained model is separate work; the writer carries no score column. If a follow-up wants to add one, the cleanest path is probably a sibling .scores.tsv file written alongside, leaving _ms1.feature schema unchanged.
  • CorrectedTrace implementing ISingleChargeMs1Feature. Would enable an in-memory path (skipping the disk round-trip), which the file-based integration intentionally avoids. Possible follow-up if any consumer ever wants it.
  • Rebase of the consensus-tracing research branch to consume the new namespace. The research Phase fixtures still compile against Development.Deconvolution.ConsensusTracing; that branch needs using MassSpectrometry.Deconvolution.Consensus; added and the duplicate source files in Development\ deleted, but that work belongs on consensus-tracing, not here.

Test plan

  • All 6 promoted types compile clean in the new namespace (MassSpectrometry.csproj build, 0 errors).
  • Reader project builds with the new Ms1FeatureFromMassFeatureExtensions and updated Ms1FeatureFile (Readers.csproj build, 0 errors).
  • 5/5 NUnit cases in TestMs1FeatureFromMassFeature pass.
  • 148/148 cases pass across the broader Ms1Feature / FromFile / supported-file-extensions / DinosaurTsv surface — no regression.
  • Bug in initial FromMassFeatures factory (lazy-load via .Results.Add on empty FilePath) caught by the round-trip test and fixed before commit.
  • Land trishorts/mzLib ce24771c + 3aa2fe04 upstream first (schema aliases for newer TopFD _ms1.feature files — strictly speaking unrelated to this PR's writer, but anyone reproducing the MetaMorpheus integration needs them too).
  • Rebase consensus-tracing branch on top of this PR once merged; update Phase fixtures to use MassSpectrometry.Deconvolution.Consensus; delete the now-duplicate source under Development\DeconvolutionDevelopment\ConsensusTracing\.

Companion PR

smith-chem-wisc/MetaMorpheus#2650 (ExternalMs1Features) is the consumer side. It wires Ms1FeatureFilePath through FileSpecificParameters and uses FromFileDeconvolutionParameters (from mzLib PR 1064) to pair the features with MS2 scans. That PR's CI is currently red against smith-chem-wisc/MetaMorpheus's pinned mzLib, but will go green once mzLib publishes a version containing PR 1064 + the schema aliases.

trishorts and others added 12 commits May 11, 2026 10:14
Add subtilisin|p to the embedded proteases.tsv with full cleavage
specificity and proline-inhibition motifs (N[P]|, S[P]|, L[P]|,
K[P]|, I[P]|, D[P]|, Y[P]|, V[P]|, G[P]|, F[P]|, T[P]|, E[P]|,
Q[P]|, A[P]|, R[P]|).

Add TestSubtilisinP_DigestsCorrectlyAndRespectsProlineRestriction to
ProteinDigestionTests to verify:
- subtilisin|p is present in the embedded protease dictionary with
  CleavageSpecificity.Full
- All expected cleavage sites fire on a proline-free sequence (ANKTIDE)
- The K[P]| proline-inhibition rule is respected (AKPIDE keeps KP intact)
  PR 1064 ships an Ms1Feature reader that recognises only the older
  FlashDeconv / TopFD-v1.6.2 _ms1.feature schema (Sample_ID, ID,
  Time_begin, Time_end, Minimum_charge_state, Maximum_charge_state,
  Minimum_fraction_id, Maximum_fraction_id). Newer TopFD output keeps
  the same _ms1.feature extension but uses different column names
  (File_name, Fraction_ID, Feature_ID, Min_time, Max_time, Min_charge,
  Max_charge), plus extras like Envelope_num and EC_score. Format
  detection picks Ms1FeatureFile by extension; CsvHelper then throws
  because none of the expected [Name(...)] columns are present.

  Discovered while integrating PR 1064's FromFileDeconvolutionParameters
  into MetaMorpheus and pointing it at a real TopFD .ms1.feature from a
  top-down yeast run -- the file parsed by hand looks identical in shape
  to the old schema, just relabelled.

  Fix is column-name aliases on the existing record, plus [Optional] on
  fields that the newer schema omits entirely. Downstream
  GetSingleChargeFeatures() reads only Mass, ChargeStateMin/Max,
  RetentionTimeBegin/End, and IntensityApex -- all aliased to a column
  present in both schemas, so the join algorithm behaves identically
  regardless of which producer wrote the file.

  Per-field summary:
    Sample_ID         -> [Optional]            (newer TopFD has File_name
                                                instead; type-incompatible
                                                -- int vs string path --
                                                and not used downstream)
    ID                -> alias "Feature_ID"    + [Optional]
    Time_begin        -> alias "Min_time"
    Time_end          -> alias "Max_time"
    Minimum_charge_state -> alias "Min_charge"
    Maximum_charge_state -> alias "Max_charge"
    Minimum_fraction_id -> [Optional]          (newer TopFD has a single
                                                Fraction_ID column, not a
                                                min/max pair; not used
                                                downstream so alias would
                                                add no value)
    Maximum_fraction_id -> [Optional]

  No tests added in this commit -- a follow-up should drop a newer-TopFD
  _ms1.feature sample into Test/FileReadingTests/ExternalFileTypes/ and
  extend the existing Ms1FeatureFile read-roundtrip tests to cover both
  schemas. The current
    Ms1Feature_FlashDeconvOpenMs3.0.0_ms1.feature
    Ms1Feature_TopFDv1.6.2_ms1.feature
  fixtures keep passing because the existing [Name(...)] heads remain
  the first entry in every alias list.
   from real TopFD output that uses the File_name / Fraction_ID /
   Feature_ID / Min_time / Max_time / Min_charge / Max_charge schema, and
   wires it into the existing TestMsFeature parameterised tests:

     * TestFeaturesLoadAndCountIsCorrect gains a TestCase asserting the
       fixture loads four features end-to-end via FileReader.
     * TestTopFDLatestMs1FeatureFirstAndLastAreCorrect locks the per-
       field mapping for both the aliased columns (Time_begin/Min_time,
       Minimum_charge_state/Min_charge, etc.) and the [Optional] fields
       absent from the newer schema (SampleId, FractionIdMin/Max default
       to 0). Covers the single-charge edge case (Min_charge == Max_charge
       == 1) in the last record.
     * TestTopFDLatestMs1GetSingleChargeFeatureFunctions confirms charge-
       range expansion is identical across the two TopFD schemas: a 6-14
       range yields 9 envelopes; a 1-1 range yields exactly one;
       GetMs1Features() flattens to 9 + 12 + 10 + 1 = 32 across the four
       fixture features.
     * TestMs1FeatureReadWrite gains the new fixture as a TestCase. The
       writer emits the older-schema headers (newer columns aren't on
       the record), so the round-trip converts schema-newer -> schema-
       older + Optional defaults. Comment in the test explains why that
       is correct: every field downstream consumers actually read
       survives the round-trip; the columns that don't are exactly the
       ones marked [Optional] and unused.

   All 17 TestMsFeature tests pass, as do the 148 tests covering the
   related Ms1Feature / FromFile / SupportedFileExtensions / DinosaurTsv
   surface area.
Move the post-decon consensus pipeline (per-charge trace grouping,
median off-by-one correction, cross-charge feature stitching) from
Development.Deconvolution.ConsensusTracing into a new production
namespace MassSpectrometry.Deconvolution.Consensus. The types had
been parked in Development as the research arc explored their design
(see NOTES.md, phases 1-13); promotion was flagged as the next move
once the design stabilised.

New files under mzLib/MassSpectrometry/Deconvolution/Consensus/:

  * MassTrace -- per-scan envelope list at one charge, anchored to
    the first envelope's mass.
  * MassTraceBuilder -- greedy charge-locked grouper, gap-tolerant
    within a configurable mass tolerance.
  * CorrectedEnvelope -- one envelope inside a corrected trace,
    carrying both original and post-correction mass plus a
    WasCorrected flag.
  * CorrectedTrace -- a MassTrace after weighted-median off-by-one
    correction.
  * TraceCorrector -- the corrector itself, with uniform/intensity/
    scorer-weighted variants via an EnvelopeWeight delegate.
  * MassFeature -- a cross-charge feature: a group of CorrectedTrace
    entries whose consensus masses agree within a ppm tolerance and
    whose RT windows overlap.
  * MassFeatureBuilder -- the union-find sweep that produces them.

No behaviour change. The classes, methods, and signatures are
identical to the research-namespace versions; only their namespace
and file locations changed (each type now lives in its own file).
Doc comments updated to drop references to phase numbers and to the
"Development" location.

The research scaffolding in Development.Deconvolution.ConsensusTracing
(Phase*.cs fixtures, MakeConsensusSnips, LargeTestDataLocator) stays on
the consensus-tracing branch; it will be updated to consume the new
namespace when that branch is rebased on top of this PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the consensus-tracing pipeline persist its cross-charge feature
list as a FLASHDeconv-style _ms1.feature file -- the existing wire
format any Ms1FeatureFile consumer (notably MetaMorpheus's
FromFileDeconvolutionParameters) already understands. The
consensus-to-search bridge then needs no new file format and no new
MetaMorpheus surface area.

Two pieces:

  * MassFeature.ToMs1Feature(sequentialId, sampleId = 0,
    fractionId = 0) extension method (under Readers, because the
    return type is the Readers-namespace Ms1Feature record). Maps:

      Mass                   <- MassFeature.ConsensusMass
      Intensity              <- MassFeature.SummedIntensity
      RetentionTimeBegin     <- MassFeature.RTStart
      RetentionTimeEnd       <- MassFeature.RTEnd
      RetentionTimeApex      <- RT of the highest-intensity envelope
                                on the highest-summed-intensity
                                constituent trace ("apex of the
                                dominant charge state at its most
                                intense scan")
      IntensityApex          <- intensity of that same envelope
      ChargeStateMin/Max     <- min/max of MassFeature.Charges
      SampleId / FractionId  <- callers' choice (defaults 0)

  * Static Ms1FeatureFile.FromMassFeatures(IEnumerable<MassFeature>,
    sampleId, fractionId, software) factory. Builds the records via
    ToMs1Feature, populates a fresh Ms1FeatureFile instance, and
    returns it. The caller invokes .WriteResults(path) on the
    returned file to actually emit bytes -- no new IO path, the
    existing reader's writer is reused. Software label defaults to
    FLASHDeconv because that's the canonical schema the writer
    emits; the [Name(...)] alias machinery from PR smith-chem-wisc#1064 +
    schema-aliases PR means newer-TopFD readers also see the
    output as valid.

No filter applied: every MassFeature becomes one row. The original
question surfaced four filtering options (all, multi-charge only,
score-thresholded, all + side artifact); per the design pin "all
features, no filter" is the right default. Downstream consumers
(MetaMorpheus's precursor HashSet + FDR machinery) handle the noise.

The writer is decoupled from the producer: callers pass a sequence
of finalised MassFeature objects; the factory doesn't care where
they came from. Future producers (mass-trace results from non-
Classic algorithms, etc.) can use the same path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five NUnit cases under
Test\FileReadingTests\ExternalFileReading\TestMs1FeatureFromMassFeature.cs:

  * ToMs1Feature_SingletonFeature_MapsAllFields -- one trace, one
    envelope. Locks the mapping: Mass, Intensity, RT begin/end/apex,
    apex intensity, charge bounds, fraction id all come from
    Finalise()-derived MassFeature fields.

  * ToMs1Feature_MultiChargeFeature_ApexIsDominantTraceMaxEnvelope --
    two-charge feature with five envelopes. Confirms apex is "highest-
    intensity envelope on the highest-summed-intensity constituent
    trace", and that cross-charge consensus mass is the intensity-
    weighted mean of per-trace consensus masses.

  * ToMs1Feature_SampleIdAndFractionIdHonoured -- callers can override
    Sample_ID and Fraction_ID for multi-file/multi-fraction outputs.

  * FromMassFeatures_WriteThenRead_AllFieldsSurvive -- three features
    built in-memory, written via WriteResults, re-read with
    FileReader.ReadFile<Ms1FeatureFile>, every downstream-consumed
    field compared to the original ToMs1Feature output. Catches any
    serialisation drift between writer and reader on the new path.

  * FromMassFeatures_AssignsSequentialIdsFromZero -- producer's
    internal MassFeature.Id is ignored; written rows get 0..N-1. Keeps
    the output file's IDs dense and stable even when the upstream
    pipeline has filtered features and left holes.

Fix discovered while writing the second-to-last test: the original
FromMassFeatures factory used `file.Results.Add(...)` in a loop, which
goes through ResultFile<T>.Results' lazy-load getter. The getter calls
LoadResults() whenever the backing list is empty; for an in-memory-
constructed Ms1FeatureFile with no FilePath, LoadResults throws
ArgumentException on the empty path. Reshaped the factory to build
the records up front, then assign Results once via the setter (which
bypasses the lazy-load). Inline comment explains why.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Demonstrates the consensus-to-search loop the rest of this PR was
built for: load an mzML, run Classic decon per MS1 scan with top-down
parameters (charge 1..60), build mass traces with Pass-B grouping
(loose 1.5 Da to cohort off-by-one twins), apply weighted-median
correction, stitch cross-charge features at 10 ppm, write the
resulting MassFeature list as a FLASHDeconv-style _ms1.feature via
the new Ms1FeatureFile.FromMassFeatures factory, and round-trip read
the written file as a sanity check.

Lives as an NUnit [Explicit] test because it consumes a 165 MB local
mzML at a hard-coded path that's not part of the repo or CI:

  E:\TestData\MetaMorpheus\05-26-17_B7A_yeast_td_fract7_rep1.mzML

Assume.That on File.Exists keeps it silently skipped on machines
without the data. Run with:

  dotnet test --filter FullyQualifiedName~TestConsensusToMs1FeatureEndToEnd

On the yeast top-down test file (4837 scans total, ~2700 MS1) the
driver writes a 24 MB _ms1.feature with 266,373 features. That file,
fed to MetaMorpheus's FromFile precursor path (#2650), yields a
Cal -> GPTMD -> Search pipeline that converges to a tighter precursor
mass tolerance (1.6 ppm) than the equivalent classic-only baseline
(2.4 ppm) and finds 907 PSMs at 1% FDR vs 785 -- a 15.5% increase in
identifications from consensus precursors alone, before any additive
combination with the classic source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets a single dotnet test invocation run the consensus pipeline over
multiple raw/mzML files in succession. Each case captures its own
source path, output path, and decon charge ceiling -- bottom-up uses
MaxCharge=12 (matching MetaMorpheus's standard BU default), top-down
would use 60.

Currently shipping two bottom-up HEK293 Velos raws as the parameterised
test cases:

  * 20100609_Velos1_TaGe_SA_293_3.raw (5163 MS1 scans, 76878 features
    after correction, 7756 multi-charge -- 8.9 s wall time)
  * 20100609_Velos1_TaGe_SA_293_4.raw (5111 MS1 scans, 74173 features,
    7318 multi-charge -- 5.7 s)

Each test case still gates on File.Exists via Assume.That; cases for
machines without the data skip silently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FlashDeconv and TopFD both canonically emit _ms1.feature files with
RetentionTime columns in seconds (e.g. Time_begin = 2787 for a 46-min
LC run). MetaMorpheus and mzLib's MsDataScan.RetentionTime are always
in minutes (mzML / Thermo convention). FromFileDeconvolutionAlgorithm
compared the file's RT directly against the scan's RT, so on any real
FlashDeconv or TopFD output the seconds-vs-minutes mismatch produced
zero overlapping windows -- and zero PSMs at search time.

Caught while running a Cal -> GPTMD -> Search pipeline over a 20-mzML
Jurkat top-down dataset using its FlashDeconv companion _ms1.feature
files. Classic-only produced 23,636 PSMs at 1% FDR; FromFile-only
produced 0 with no error or warning -- silent failure mode.

Fix: in the file-path FromFileDeconvolutionParameters constructor,
detect seconds-as-loaded via a max-RetentionTimeEnd > 500 heuristic
(no realistic LC run exceeds 8 hours = 500 min) and divide all RT
fields by 60. Files already in minutes pass through unchanged. The
internal pre-loaded-features constructor (test seam) is unaffected
so callers supplying explicit synthetic units still get raw values.

Verification: re-running the same Jurkat pipeline after the fix
yields 23,976 PSMs in FromFile-only mode and 39,700 PSMs in additive
Both mode (vs Classic-only's 23,636) -- +68% PSMs / +68% proteoforms
over the Classic baseline on this dataset.

Tests:
  * Existing EndToEnd_FilePathCtor_ResolvesExpectedChargeState
    updated to use MS2 RT in minutes (39.85) instead of seconds
    (2390.0) -- the fixtures are FlashDeconv / TopFD with seconds-RT
    and the fix now normalises them at load, so the test had to
    match the new units.
  * New FromFileDeconvolutionParameters_FileWithRtInSeconds_NormalisesToMinutes
    writes a synthetic seconds-RT file, loads via the file-path
    ctor, asserts every per-charge feature's RT lands in the
    expected minutes range (~39.83 / ~40.17), and pairs against an
    in-minutes MS2 to confirm overlap is found.
  * New FromFileDeconvolutionParameters_FileWithRtInMinutes_LeavesUnchanged
    sister test: file already in minutes (max RT < 500) is NOT
    double-converted.

31/31 tests in TestFromFileDeconvolution + TestMs1FeatureFromMassFeature
pass after the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chem-wisc#1069 conflicts)

Upstream independently added the same logical changes for two pieces of this PR:

- subtilisin entry in proteases.tsv. Upstream named the protease "subtilisin|P"
  (uppercase P, consistent with proteomics convention for the proline restriction
  notation in the cleavage column); the PR had "subtilisin|p". Adopting upstream's
  uppercase form, including the matching ProteinDigestionTests assertions
  (upstream also adds LoadProteaseDictionary_SubtilisinP_HasProlineRestrictedMotifs).
- Newer-TopFD Ms1Feature test data and tests. The PR named its data file
  Ms1Feature_TopFDvLatest_ms1.feature; upstream named the identical content
  Ms1Feature_TopFDv1.7.0_ms1.feature. Adopting upstream's version-specific
  naming and removing the redundant TopFDvLatest copy.

Conflict resolution: took upstream's version for proteases.tsv, Test.csproj,
Test/FileReadingTests/ExternalFileReading/TestMsFeature.cs, and
Test/ProteomicsTests/ProteolyticDigestion/ProteinDigestionTests.cs. Deleted
Test/FileReadingTests/ExternalFileTypes/Ms1Feature_TopFDvLatest_ms1.feature.

Build clean. The 80 tests in the affected fixtures
(TestMsFeature, TestFromFileDeconvolution, TestMs1FeatureFromMassFeature,
TestConsensusToMs1FeatureEndToEnd, ProteinDigestionTests) all pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented May 15, 2026

Codecov Report

❌ Patch coverage is 99.05363% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.75%. Comparing base (92be818) to head (36bc80f).

Files with missing lines Patch % Lines
...s/Deconvolution/FromFileDeconvolutionParameters.cs 91.30% 0 Missing and 2 partials ⚠️
...etry/Deconvolution/Consensus/MassFeatureBuilder.cs 98.24% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1069      +/-   ##
==========================================
+ Coverage   81.63%   81.75%   +0.11%     
==========================================
  Files         369      376       +7     
  Lines       47605    47918     +313     
  Branches     5649     5699      +50     
==========================================
+ Hits        38863    39173     +310     
  Misses       7643     7643              
- Partials     1099     1102       +3     
Files with missing lines Coverage Δ
...trometry/Deconvolution/Consensus/CorrectedTrace.cs 100.00% <100.00%> (ø)
...pectrometry/Deconvolution/Consensus/MassFeature.cs 100.00% <100.00%> (ø)
...sSpectrometry/Deconvolution/Consensus/MassTrace.cs 100.00% <100.00%> (ø)
...ometry/Deconvolution/Consensus/MassTraceBuilder.cs 100.00% <100.00%> (ø)
...trometry/Deconvolution/Consensus/TraceCorrector.cs 100.00% <100.00%> (ø)
mzLib/Readers/BaseClasses/ResultFile.cs 96.96% <100.00%> (ø)
...ernalResults/IndividualResultRecords/Ms1Feature.cs 100.00% <ø> (ø)
...sultRecords/Ms1FeatureFromMassFeatureExtensions.cs 100.00% <100.00%> (ø)
...ders/ExternalResults/ResultFiles/Ms1FeatureFile.cs 97.87% <100.00%> (+1.09%) ⬆️
...etry/Deconvolution/Consensus/MassFeatureBuilder.cs 98.24% <98.24%> (ø)
... and 1 more
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

trishorts and others added 5 commits May 20, 2026 11:28
TraceCorrector now derives its off-by-one window from per-trace mass
scatter (3*sigma via MAD, capped below half the isotope spacing) instead
of a fixed +/-0.05 Da. An envelope at a resolvable, non-isotope offset
(e.g. a deamidated form at +0.98402 Da, which the old fixed window snapped
to the unmodified mass) is split into its own CorrectedTrace and surfaced
as a distinct feature; when scatter is too large to resolve it from the
1.00335 Da isotope spacing it conservatively merges. Correct() now returns
List<CorrectedTrace>. Bottom-up search needs this -- deamidation detection
is a requirement there.

Also in this hardening pass:
- MassFeatureBuilder sorts components by (mass, charge, RT) before
  assigning IDs, so feature IDs and written _ms1.feature row order are
  deterministic instead of depending on Dictionary enumeration.
- Guard clauses with clear messages on Finalise (empty traces),
  ToMs1Feature (un-finalised feature), and BuildTraces (scan/envelope
  length mismatch).
- FromFileDeconvolutionParameters exposes RetentionTimeNormalizedFromSeconds
  and warns when the seconds->minutes heuristic fires, rather than silently
  mutating RT.
- Unit + performance tests for the consensus engines (0 -> 100% line
  coverage on the promoted types) covering the behaviours above.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Explicit], local-data drivers used to benchmark the consensus pipeline:
JurkatMs1FeatureGenerator and VelosMs1FeatureGenerator regenerate the
_ms1.feature files for the Jurkat top-down (charge 1..60) and HEK293 Velos
bottom-up (charge 1..12) datasets via the new deamidation-split pipeline.

ConsensusFeatureScoring computes the generic per-envelope deconvolution
score (DeconvolutionScorer.ScoreEnvelope: averagine cosine, ppm error, peak
completeness, intensity-ratio consistency) for every Classic envelope and
aggregates it per feature into a sidecar TSV aligned to the _ms1.feature
row order. This gives a uniform, search-independent feature-quality value
for a computed noise-filter cutoff -- scored at generation time because the
consensus pipeline keeps only mass+intensity per envelope and drops the
peaks the scorer needs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hardening fixes to the consensus MassFeature -> _ms1.feature path surfaced by
code review. None were observed in normal use, but each is a real edge-case
correctness or crash issue:

- Write one row per contiguous charge run so a gapped set (e.g. {10,12,15},
  when intermediate charges fall below the score cutoff) no longer reads back
  as the fabricated full range 10..15. Adds ToMs1Features.
- FromMassFeatures: empty input writes a valid header-only file instead of
  crashing; also fixes a LoadResults infinite recursion when reading a
  zero-row file (the Results getter re-entered LoadResults on an empty set).
- Ignore non-finite RetentionTimeEnd when sniffing seconds-vs-minutes, so a
  single NaN row can't flip an in-minutes file into a /60 conversion.
- Default the produced file's Software to TopFD to match how it is re-detected
  on reload (the writer always emits Apex_intensity); fix the misleading
  "FLASHDeconv canonical" doc.
- Document that per-charge Intensity comes from Apex_intensity, not the summed
  Intensity column.

Adds NUnit coverage for each; tidies now-stale FLASHDeconv comments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Member

@nbollis nbollis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good. One concern should be addressed in the handling of the ms1 feature file and it's record storage.

I see a bunch of new code inside of mass spec/deconvolution. Consider if any of our existing classes for peak tracing (such as the XIC section) and our interfaces can be reused here. It would be great if the mass trace builder could operate on IHasSingle... instead of this custom object. It would leave more door open in the future. That said, I would be happy with a pass that conciders the objects/interfaces we have and tell's me why it is a bad idea to reuse them here.

Comment thread mzLib/Readers/ExternalResults/ResultFiles/Ms1FeatureFile.cs Outdated
Comment thread mzLib/MassSpectrometry/Deconvolution/Consensus/TraceCorrector.cs Outdated
trishorts and others added 2 commits May 26, 2026 05:46
TraceCorrector defined its own IsotopeSpacingDa = 1.00335, a less-precise copy of the canonical Chemistry.Constants.C13MinusC12 (1.00335483810). Alias the shared constant so the C12/C13 spacing has one source of truth; the 4.8e-6 Da change is far inside the off-by-one detection window, so results are unchanged.

Addresses review feedback on smith-chem-wisc#1069.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ResultFile<T>.Results lazy-loaded whenever _results was empty, which crashed for a factory-built file with an empty FilePath (e.g. a zero-feature Ms1FeatureFile.FromMassFeatures). Guarding the load with File.Exists(FilePath) lets an in-memory file return its set records directly, removing the need for Ms1FeatureFile's parallel _factoryRecords store so the records have a single source of truth.

Addresses review feedback on smith-chem-wisc#1069.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@trishorts
Copy link
Copy Markdown
Contributor Author

Thanks for the push to reuse the existing peak-tracing code — I took the pass you asked for. The candidate is exactly the XIC stack: ExtractedIonChromatogram + MassIndexingEngine/IndexingEngine.GetAllXics, with IndexedMass (the IIndexedPeak you were thinking of) as a ready-made envelope→peak bridge. To test it properly I wrote a variant generator that swaps only the grouping — GetAllXics in place of MassTraceBuilder — and keeps the entire downstream (TraceCorrectorMassFeatureBuilder → writer) byte-identical, reading each envelope's mass/intensity/charge straight off IndexedMass.IsotopicEnvelope. So any difference is attributable to the grouping algorithm alone.

Yield (the decisive metric). Identical MM FromFile-only pipeline, same 19 Jurkat top-down files, q≤0.01:

grouping proteoforms PSMs protein groups
MassTraceBuilder 1,126 39,656 146
GetAllXics 927 32,518 118

Reusing GetAllXics loses ~18% of proteoforms. It fragments into more, shorter traces (162k vs 138k on one fraction, +38% singletons) because of apex-intensity seeding + a hard claimed-peak set vs. the consensus tracer's scan-order, fixed-anchor grouping — and that fragmentation propagates into worse off-by-one correction and cross-charge stitching.

Robustness. GetAllXics actually throws (ArgumentException, duplicate key) on 1 of the 20 files under the wide 1.5 Da window consensus tracing needs: the seed peak is added without the claimed-peak guard that follow-on peaks get (IndexingEngine.cs:136-137), so under a wide tolerance two in-range envelopes at a seed scan collide. That's a latent bug worth a separate issue regardless of this PR — flagging it here.

Perf. GetAllXics is ~2.6× faster but ~4× the allocations; not a deciding factor when it drops 18% of the IDs.

So I'd like to keep MassTrace/MassTraceBuilder: the grouping is purpose-built for this and measurably better at the metric that matters. A tuned GetAllXics might close the gap, but matching it (seed policy, claim-set, the dedup bug) is enough surgery that it's a fork, not reuse. Happy to revisit if you'd like.

The two inline comments are addressed: Ms1FeatureFile now relies on a File.Exists guard in the base ResultFile.Results getter and drops _factoryRecords (single source of truth); TraceCorrector now aliases Constants.C13MinusC12 instead of redefining the spacing. The "pass the spacing in for decoy traces" idea is a nice one — I left it as a follow-up since TraceCorrector is static today.

@trishorts
Copy link
Copy Markdown
Contributor Author

Filed the GetAllXics seed-peak dedup bug mentioned above as #1074.

Two [Explicit] drivers (local Jurkat top-down data, not CI) that back the decision to keep MassTraceBuilder over reusing the existing GetAllXics: TestXicVsMassTraceComparison times both groupers head-to-head on the same deconvoluted envelopes (wall-clock, allocations, trace counts, grouping agreement); JurkatMs1FeatureGeneratorViaXic emits XIC-grouped _ms1.feature files through the identical downstream so end-to-end proteoform yield can be compared (GetAllXics measured -18% vs MassTraceBuilder).

Refs smith-chem-wisc#1069.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@trishorts trishorts requested a review from RayMSMS May 26, 2026 13:40
trishorts added a commit to trishorts/mzLib that referenced this pull request May 29, 2026
Phase 2: decouple per-spectrum deconvolution from cross-scan feature tracing
behind a pluggable interface, so the MetaFlashDecon native tracer (Phase 3) and
the consensus tracer can be swapped and compared on identical input envelopes.

- IMassFeatureTracer (MassSpectrometry/Deconvolution/FeatureTracing): per-scan
  envelopes -> whole-file MassFeatures.
- ConsensusMassFeatureTracer: wraps the consensus pipeline
  MassTraceBuilder -> TraceCorrector -> MassFeatureBuilder. Reuses smith-chem-wisc#1069's
  MassFeature as the shared feature model (Ms1FeatureFile.FromMassFeatures is
  the writer at the call site).
- TestMassFeatureTracer (3 tests): adapter reproduces the direct pipeline;
  cross-charge stitch yields one feature with ChargeCount 2; interface impl.

Note: this branch is now stacked on smith-chem-wisc#1069 (consensus-to-ms1feature) for the
MassFeature/writer types; rebase off master after smith-chem-wisc#1069 merges.

Full suite: 4034 pass / 0 fail / 5 skip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
trishorts added a commit to trishorts/mzLib that referenced this pull request May 29, 2026
Ms1FeatureGenerator (Readers): mzML to per-scan MetaFlashDecon deconvolution to
IMassFeatureTracer (native or consensus) to Ms1FeatureFile.FromMassFeatures to a
_ms1.feature file. This is the "analogous to smith-chem-wisc#1069" entry point: whole-file
deconvolution produces a _ms1.feature that MetaMorpheus consumes via FromFile.

- Retention time is written in SECONDS (source minutes x 60) to match real OpenMS
  FLASHDeconv output; configurable via secondsPerRtUnit (pass 1.0 if already seconds).
- Output is the TopFD-superset _ms1.feature schema: the 11 canonical FLASHDeconv
  columns plus Apex_intensity (which MetaMorpheus uses for per-charge weighting).
  The written header is verified to contain every real-FLASHDeconv column.
- TestMs1FeatureGenerator (CI, 2 tests): seconds-RT conversion + schema + round-trip
  on synthetic features.
- JurkatMetaFlashDeconGenerator ([Explicit]): emits _ms1.feature for the 20 Jurkat
  mzMLs with BOTH tracers into runs/metaflashdecon_features/{native,consensus} — the
  Phase 6 head-to-head input vs the real FLASHDeconv files at JurkatTopDown/FlashDecon.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@trishorts trishorts mentioned this pull request May 29, 2026
trishorts added a commit to trishorts/mzLib that referenced this pull request May 30, 2026
…ful)

Ms1FeatureFile.FromMassFeatures emitted one row per contiguous charge run
(ToMs1Features), so a gapped charge set became several rows. That was
unfaithful: the real FLASHDeconv _ms1.feature writes exactly one row per mass
feature with an aggregated Min/Max charge range (verified -- every multi-row
mass in the ground-truth file differs in retention time, never only in charge),
and it inflated intensity by writing the full feature.SummedIntensity on every
split row (a 3-run feature reported 3x its real intensity).

Now emits one row per feature via ToMs1Feature. Re-expanding Min..Max on reload
may surface an unobserved intermediate charge, but that matches FLASHDeconv and
the charge set is metadata, not used for mass/RT pairing. Updated the gapped-set
test to pin the single-row, non-inflated-intensity contract. ToMs1Features kept
as a public API. Shared by the consensus tracer (smith-chem-wisc#1069) -- both now emit one row
per feature.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants