Promote consensus mass-tracing types + add Ms1Feature writer for MassFeature#1069
Promote consensus mass-tracing types + add Ms1Feature writer for MassFeature#1069trishorts wants to merge 20 commits into
Conversation
Add subtilisin|p to the embedded proteases.tsv with full cleavage specificity and proline-inhibition motifs (N[P]|, S[P]|, L[P]|, K[P]|, I[P]|, D[P]|, Y[P]|, V[P]|, G[P]|, F[P]|, T[P]|, E[P]|, Q[P]|, A[P]|, R[P]|). Add TestSubtilisinP_DigestsCorrectlyAndRespectsProlineRestriction to ProteinDigestionTests to verify: - subtilisin|p is present in the embedded protease dictionary with CleavageSpecificity.Full - All expected cleavage sites fire on a proline-free sequence (ANKTIDE) - The K[P]| proline-inhibition rule is respected (AKPIDE keeps KP intact)
PR 1064 ships an Ms1Feature reader that recognises only the older
FlashDeconv / TopFD-v1.6.2 _ms1.feature schema (Sample_ID, ID,
Time_begin, Time_end, Minimum_charge_state, Maximum_charge_state,
Minimum_fraction_id, Maximum_fraction_id). Newer TopFD output keeps
the same _ms1.feature extension but uses different column names
(File_name, Fraction_ID, Feature_ID, Min_time, Max_time, Min_charge,
Max_charge), plus extras like Envelope_num and EC_score. Format
detection picks Ms1FeatureFile by extension; CsvHelper then throws
because none of the expected [Name(...)] columns are present.
Discovered while integrating PR 1064's FromFileDeconvolutionParameters
into MetaMorpheus and pointing it at a real TopFD .ms1.feature from a
top-down yeast run -- the file parsed by hand looks identical in shape
to the old schema, just relabelled.
Fix is column-name aliases on the existing record, plus [Optional] on
fields that the newer schema omits entirely. Downstream
GetSingleChargeFeatures() reads only Mass, ChargeStateMin/Max,
RetentionTimeBegin/End, and IntensityApex -- all aliased to a column
present in both schemas, so the join algorithm behaves identically
regardless of which producer wrote the file.
Per-field summary:
Sample_ID -> [Optional] (newer TopFD has File_name
instead; type-incompatible
-- int vs string path --
and not used downstream)
ID -> alias "Feature_ID" + [Optional]
Time_begin -> alias "Min_time"
Time_end -> alias "Max_time"
Minimum_charge_state -> alias "Min_charge"
Maximum_charge_state -> alias "Max_charge"
Minimum_fraction_id -> [Optional] (newer TopFD has a single
Fraction_ID column, not a
min/max pair; not used
downstream so alias would
add no value)
Maximum_fraction_id -> [Optional]
No tests added in this commit -- a follow-up should drop a newer-TopFD
_ms1.feature sample into Test/FileReadingTests/ExternalFileTypes/ and
extend the existing Ms1FeatureFile read-roundtrip tests to cover both
schemas. The current
Ms1Feature_FlashDeconvOpenMs3.0.0_ms1.feature
Ms1Feature_TopFDv1.6.2_ms1.feature
fixtures keep passing because the existing [Name(...)] heads remain
the first entry in every alias list.
from real TopFD output that uses the File_name / Fraction_ID /
Feature_ID / Min_time / Max_time / Min_charge / Max_charge schema, and
wires it into the existing TestMsFeature parameterised tests:
* TestFeaturesLoadAndCountIsCorrect gains a TestCase asserting the
fixture loads four features end-to-end via FileReader.
* TestTopFDLatestMs1FeatureFirstAndLastAreCorrect locks the per-
field mapping for both the aliased columns (Time_begin/Min_time,
Minimum_charge_state/Min_charge, etc.) and the [Optional] fields
absent from the newer schema (SampleId, FractionIdMin/Max default
to 0). Covers the single-charge edge case (Min_charge == Max_charge
== 1) in the last record.
* TestTopFDLatestMs1GetSingleChargeFeatureFunctions confirms charge-
range expansion is identical across the two TopFD schemas: a 6-14
range yields 9 envelopes; a 1-1 range yields exactly one;
GetMs1Features() flattens to 9 + 12 + 10 + 1 = 32 across the four
fixture features.
* TestMs1FeatureReadWrite gains the new fixture as a TestCase. The
writer emits the older-schema headers (newer columns aren't on
the record), so the round-trip converts schema-newer -> schema-
older + Optional defaults. Comment in the test explains why that
is correct: every field downstream consumers actually read
survives the round-trip; the columns that don't are exactly the
ones marked [Optional] and unused.
All 17 TestMsFeature tests pass, as do the 148 tests covering the
related Ms1Feature / FromFile / SupportedFileExtensions / DinosaurTsv
surface area.
Move the post-decon consensus pipeline (per-charge trace grouping,
median off-by-one correction, cross-charge feature stitching) from
Development.Deconvolution.ConsensusTracing into a new production
namespace MassSpectrometry.Deconvolution.Consensus. The types had
been parked in Development as the research arc explored their design
(see NOTES.md, phases 1-13); promotion was flagged as the next move
once the design stabilised.
New files under mzLib/MassSpectrometry/Deconvolution/Consensus/:
* MassTrace -- per-scan envelope list at one charge, anchored to
the first envelope's mass.
* MassTraceBuilder -- greedy charge-locked grouper, gap-tolerant
within a configurable mass tolerance.
* CorrectedEnvelope -- one envelope inside a corrected trace,
carrying both original and post-correction mass plus a
WasCorrected flag.
* CorrectedTrace -- a MassTrace after weighted-median off-by-one
correction.
* TraceCorrector -- the corrector itself, with uniform/intensity/
scorer-weighted variants via an EnvelopeWeight delegate.
* MassFeature -- a cross-charge feature: a group of CorrectedTrace
entries whose consensus masses agree within a ppm tolerance and
whose RT windows overlap.
* MassFeatureBuilder -- the union-find sweep that produces them.
No behaviour change. The classes, methods, and signatures are
identical to the research-namespace versions; only their namespace
and file locations changed (each type now lives in its own file).
Doc comments updated to drop references to phase numbers and to the
"Development" location.
The research scaffolding in Development.Deconvolution.ConsensusTracing
(Phase*.cs fixtures, MakeConsensusSnips, LargeTestDataLocator) stays on
the consensus-tracing branch; it will be updated to consume the new
namespace when that branch is rebased on top of this PR.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets the consensus-tracing pipeline persist its cross-charge feature
list as a FLASHDeconv-style _ms1.feature file -- the existing wire
format any Ms1FeatureFile consumer (notably MetaMorpheus's
FromFileDeconvolutionParameters) already understands. The
consensus-to-search bridge then needs no new file format and no new
MetaMorpheus surface area.
Two pieces:
* MassFeature.ToMs1Feature(sequentialId, sampleId = 0,
fractionId = 0) extension method (under Readers, because the
return type is the Readers-namespace Ms1Feature record). Maps:
Mass <- MassFeature.ConsensusMass
Intensity <- MassFeature.SummedIntensity
RetentionTimeBegin <- MassFeature.RTStart
RetentionTimeEnd <- MassFeature.RTEnd
RetentionTimeApex <- RT of the highest-intensity envelope
on the highest-summed-intensity
constituent trace ("apex of the
dominant charge state at its most
intense scan")
IntensityApex <- intensity of that same envelope
ChargeStateMin/Max <- min/max of MassFeature.Charges
SampleId / FractionId <- callers' choice (defaults 0)
* Static Ms1FeatureFile.FromMassFeatures(IEnumerable<MassFeature>,
sampleId, fractionId, software) factory. Builds the records via
ToMs1Feature, populates a fresh Ms1FeatureFile instance, and
returns it. The caller invokes .WriteResults(path) on the
returned file to actually emit bytes -- no new IO path, the
existing reader's writer is reused. Software label defaults to
FLASHDeconv because that's the canonical schema the writer
emits; the [Name(...)] alias machinery from PR smith-chem-wisc#1064 +
schema-aliases PR means newer-TopFD readers also see the
output as valid.
No filter applied: every MassFeature becomes one row. The original
question surfaced four filtering options (all, multi-charge only,
score-thresholded, all + side artifact); per the design pin "all
features, no filter" is the right default. Downstream consumers
(MetaMorpheus's precursor HashSet + FDR machinery) handle the noise.
The writer is decoupled from the producer: callers pass a sequence
of finalised MassFeature objects; the factory doesn't care where
they came from. Future producers (mass-trace results from non-
Classic algorithms, etc.) can use the same path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five NUnit cases under
Test\FileReadingTests\ExternalFileReading\TestMs1FeatureFromMassFeature.cs:
* ToMs1Feature_SingletonFeature_MapsAllFields -- one trace, one
envelope. Locks the mapping: Mass, Intensity, RT begin/end/apex,
apex intensity, charge bounds, fraction id all come from
Finalise()-derived MassFeature fields.
* ToMs1Feature_MultiChargeFeature_ApexIsDominantTraceMaxEnvelope --
two-charge feature with five envelopes. Confirms apex is "highest-
intensity envelope on the highest-summed-intensity constituent
trace", and that cross-charge consensus mass is the intensity-
weighted mean of per-trace consensus masses.
* ToMs1Feature_SampleIdAndFractionIdHonoured -- callers can override
Sample_ID and Fraction_ID for multi-file/multi-fraction outputs.
* FromMassFeatures_WriteThenRead_AllFieldsSurvive -- three features
built in-memory, written via WriteResults, re-read with
FileReader.ReadFile<Ms1FeatureFile>, every downstream-consumed
field compared to the original ToMs1Feature output. Catches any
serialisation drift between writer and reader on the new path.
* FromMassFeatures_AssignsSequentialIdsFromZero -- producer's
internal MassFeature.Id is ignored; written rows get 0..N-1. Keeps
the output file's IDs dense and stable even when the upstream
pipeline has filtered features and left holes.
Fix discovered while writing the second-to-last test: the original
FromMassFeatures factory used `file.Results.Add(...)` in a loop, which
goes through ResultFile<T>.Results' lazy-load getter. The getter calls
LoadResults() whenever the backing list is empty; for an in-memory-
constructed Ms1FeatureFile with no FilePath, LoadResults throws
ArgumentException on the empty path. Reshaped the factory to build
the records up front, then assign Results once via the setter (which
bypasses the lazy-load). Inline comment explains why.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Demonstrates the consensus-to-search loop the rest of this PR was built for: load an mzML, run Classic decon per MS1 scan with top-down parameters (charge 1..60), build mass traces with Pass-B grouping (loose 1.5 Da to cohort off-by-one twins), apply weighted-median correction, stitch cross-charge features at 10 ppm, write the resulting MassFeature list as a FLASHDeconv-style _ms1.feature via the new Ms1FeatureFile.FromMassFeatures factory, and round-trip read the written file as a sanity check. Lives as an NUnit [Explicit] test because it consumes a 165 MB local mzML at a hard-coded path that's not part of the repo or CI: E:\TestData\MetaMorpheus\05-26-17_B7A_yeast_td_fract7_rep1.mzML Assume.That on File.Exists keeps it silently skipped on machines without the data. Run with: dotnet test --filter FullyQualifiedName~TestConsensusToMs1FeatureEndToEnd On the yeast top-down test file (4837 scans total, ~2700 MS1) the driver writes a 24 MB _ms1.feature with 266,373 features. That file, fed to MetaMorpheus's FromFile precursor path (#2650), yields a Cal -> GPTMD -> Search pipeline that converges to a tighter precursor mass tolerance (1.6 ppm) than the equivalent classic-only baseline (2.4 ppm) and finds 907 PSMs at 1% FDR vs 785 -- a 15.5% increase in identifications from consensus precursors alone, before any additive combination with the classic source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets a single dotnet test invocation run the consensus pipeline over
multiple raw/mzML files in succession. Each case captures its own
source path, output path, and decon charge ceiling -- bottom-up uses
MaxCharge=12 (matching MetaMorpheus's standard BU default), top-down
would use 60.
Currently shipping two bottom-up HEK293 Velos raws as the parameterised
test cases:
* 20100609_Velos1_TaGe_SA_293_3.raw (5163 MS1 scans, 76878 features
after correction, 7756 multi-charge -- 8.9 s wall time)
* 20100609_Velos1_TaGe_SA_293_4.raw (5111 MS1 scans, 74173 features,
7318 multi-charge -- 5.7 s)
Each test case still gates on File.Exists via Assume.That; cases for
machines without the data skip silently.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
FlashDeconv and TopFD both canonically emit _ms1.feature files with
RetentionTime columns in seconds (e.g. Time_begin = 2787 for a 46-min
LC run). MetaMorpheus and mzLib's MsDataScan.RetentionTime are always
in minutes (mzML / Thermo convention). FromFileDeconvolutionAlgorithm
compared the file's RT directly against the scan's RT, so on any real
FlashDeconv or TopFD output the seconds-vs-minutes mismatch produced
zero overlapping windows -- and zero PSMs at search time.
Caught while running a Cal -> GPTMD -> Search pipeline over a 20-mzML
Jurkat top-down dataset using its FlashDeconv companion _ms1.feature
files. Classic-only produced 23,636 PSMs at 1% FDR; FromFile-only
produced 0 with no error or warning -- silent failure mode.
Fix: in the file-path FromFileDeconvolutionParameters constructor,
detect seconds-as-loaded via a max-RetentionTimeEnd > 500 heuristic
(no realistic LC run exceeds 8 hours = 500 min) and divide all RT
fields by 60. Files already in minutes pass through unchanged. The
internal pre-loaded-features constructor (test seam) is unaffected
so callers supplying explicit synthetic units still get raw values.
Verification: re-running the same Jurkat pipeline after the fix
yields 23,976 PSMs in FromFile-only mode and 39,700 PSMs in additive
Both mode (vs Classic-only's 23,636) -- +68% PSMs / +68% proteoforms
over the Classic baseline on this dataset.
Tests:
* Existing EndToEnd_FilePathCtor_ResolvesExpectedChargeState
updated to use MS2 RT in minutes (39.85) instead of seconds
(2390.0) -- the fixtures are FlashDeconv / TopFD with seconds-RT
and the fix now normalises them at load, so the test had to
match the new units.
* New FromFileDeconvolutionParameters_FileWithRtInSeconds_NormalisesToMinutes
writes a synthetic seconds-RT file, loads via the file-path
ctor, asserts every per-charge feature's RT lands in the
expected minutes range (~39.83 / ~40.17), and pairs against an
in-minutes MS2 to confirm overlap is found.
* New FromFileDeconvolutionParameters_FileWithRtInMinutes_LeavesUnchanged
sister test: file already in minutes (max RT < 500) is NOT
double-converted.
31/31 tests in TestFromFileDeconvolution + TestMs1FeatureFromMassFeature
pass after the fix.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…chem-wisc#1069 conflicts) Upstream independently added the same logical changes for two pieces of this PR: - subtilisin entry in proteases.tsv. Upstream named the protease "subtilisin|P" (uppercase P, consistent with proteomics convention for the proline restriction notation in the cleavage column); the PR had "subtilisin|p". Adopting upstream's uppercase form, including the matching ProteinDigestionTests assertions (upstream also adds LoadProteaseDictionary_SubtilisinP_HasProlineRestrictedMotifs). - Newer-TopFD Ms1Feature test data and tests. The PR named its data file Ms1Feature_TopFDvLatest_ms1.feature; upstream named the identical content Ms1Feature_TopFDv1.7.0_ms1.feature. Adopting upstream's version-specific naming and removing the redundant TopFDvLatest copy. Conflict resolution: took upstream's version for proteases.tsv, Test.csproj, Test/FileReadingTests/ExternalFileReading/TestMsFeature.cs, and Test/ProteomicsTests/ProteolyticDigestion/ProteinDigestionTests.cs. Deleted Test/FileReadingTests/ExternalFileTypes/Ms1Feature_TopFDvLatest_ms1.feature. Build clean. The 80 tests in the affected fixtures (TestMsFeature, TestFromFileDeconvolution, TestMs1FeatureFromMassFeature, TestConsensusToMs1FeatureEndToEnd, ProteinDigestionTests) all pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1069 +/- ##
==========================================
+ Coverage 81.63% 81.75% +0.11%
==========================================
Files 369 376 +7
Lines 47605 47918 +313
Branches 5649 5699 +50
==========================================
+ Hits 38863 39173 +310
Misses 7643 7643
- Partials 1099 1102 +3
🚀 New features to boost your workflow:
|
TraceCorrector now derives its off-by-one window from per-trace mass scatter (3*sigma via MAD, capped below half the isotope spacing) instead of a fixed +/-0.05 Da. An envelope at a resolvable, non-isotope offset (e.g. a deamidated form at +0.98402 Da, which the old fixed window snapped to the unmodified mass) is split into its own CorrectedTrace and surfaced as a distinct feature; when scatter is too large to resolve it from the 1.00335 Da isotope spacing it conservatively merges. Correct() now returns List<CorrectedTrace>. Bottom-up search needs this -- deamidation detection is a requirement there. Also in this hardening pass: - MassFeatureBuilder sorts components by (mass, charge, RT) before assigning IDs, so feature IDs and written _ms1.feature row order are deterministic instead of depending on Dictionary enumeration. - Guard clauses with clear messages on Finalise (empty traces), ToMs1Feature (un-finalised feature), and BuildTraces (scan/envelope length mismatch). - FromFileDeconvolutionParameters exposes RetentionTimeNormalizedFromSeconds and warns when the seconds->minutes heuristic fires, rather than silently mutating RT. - Unit + performance tests for the consensus engines (0 -> 100% line coverage on the promoted types) covering the behaviours above. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
[Explicit], local-data drivers used to benchmark the consensus pipeline: JurkatMs1FeatureGenerator and VelosMs1FeatureGenerator regenerate the _ms1.feature files for the Jurkat top-down (charge 1..60) and HEK293 Velos bottom-up (charge 1..12) datasets via the new deamidation-split pipeline. ConsensusFeatureScoring computes the generic per-envelope deconvolution score (DeconvolutionScorer.ScoreEnvelope: averagine cosine, ppm error, peak completeness, intensity-ratio consistency) for every Classic envelope and aggregates it per feature into a sidecar TSV aligned to the _ms1.feature row order. This gives a uniform, search-independent feature-quality value for a computed noise-filter cutoff -- scored at generation time because the consensus pipeline keeps only mass+intensity per envelope and drops the peaks the scorer needs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Hardening fixes to the consensus MassFeature -> _ms1.feature path surfaced by
code review. None were observed in normal use, but each is a real edge-case
correctness or crash issue:
- Write one row per contiguous charge run so a gapped set (e.g. {10,12,15},
when intermediate charges fall below the score cutoff) no longer reads back
as the fabricated full range 10..15. Adds ToMs1Features.
- FromMassFeatures: empty input writes a valid header-only file instead of
crashing; also fixes a LoadResults infinite recursion when reading a
zero-row file (the Results getter re-entered LoadResults on an empty set).
- Ignore non-finite RetentionTimeEnd when sniffing seconds-vs-minutes, so a
single NaN row can't flip an in-minutes file into a /60 conversion.
- Default the produced file's Software to TopFD to match how it is re-detected
on reload (the writer always emits Apex_intensity); fix the misleading
"FLASHDeconv canonical" doc.
- Document that per-charge Intensity comes from Apex_intensity, not the summed
Intensity column.
Adds NUnit coverage for each; tidies now-stale FLASHDeconv comments.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
nbollis
left a comment
There was a problem hiding this comment.
Looks pretty good. One concern should be addressed in the handling of the ms1 feature file and it's record storage.
I see a bunch of new code inside of mass spec/deconvolution. Consider if any of our existing classes for peak tracing (such as the XIC section) and our interfaces can be reused here. It would be great if the mass trace builder could operate on IHasSingle... instead of this custom object. It would leave more door open in the future. That said, I would be happy with a pass that conciders the objects/interfaces we have and tell's me why it is a bad idea to reuse them here.
TraceCorrector defined its own IsotopeSpacingDa = 1.00335, a less-precise copy of the canonical Chemistry.Constants.C13MinusC12 (1.00335483810). Alias the shared constant so the C12/C13 spacing has one source of truth; the 4.8e-6 Da change is far inside the off-by-one detection window, so results are unchanged. Addresses review feedback on smith-chem-wisc#1069. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ResultFile<T>.Results lazy-loaded whenever _results was empty, which crashed for a factory-built file with an empty FilePath (e.g. a zero-feature Ms1FeatureFile.FromMassFeatures). Guarding the load with File.Exists(FilePath) lets an in-memory file return its set records directly, removing the need for Ms1FeatureFile's parallel _factoryRecords store so the records have a single source of truth. Addresses review feedback on smith-chem-wisc#1069. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Thanks for the push to reuse the existing peak-tracing code — I took the pass you asked for. The candidate is exactly the XIC stack: Yield (the decisive metric). Identical MM FromFile-only pipeline, same 19 Jurkat top-down files, q≤0.01:
Reusing Robustness. Perf. So I'd like to keep The two inline comments are addressed: |
|
Filed the |
Two [Explicit] drivers (local Jurkat top-down data, not CI) that back the decision to keep MassTraceBuilder over reusing the existing GetAllXics: TestXicVsMassTraceComparison times both groupers head-to-head on the same deconvoluted envelopes (wall-clock, allocations, trace counts, grouping agreement); JurkatMs1FeatureGeneratorViaXic emits XIC-grouped _ms1.feature files through the identical downstream so end-to-end proteoform yield can be compared (GetAllXics measured -18% vs MassTraceBuilder). Refs smith-chem-wisc#1069. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Phase 2: decouple per-spectrum deconvolution from cross-scan feature tracing behind a pluggable interface, so the MetaFlashDecon native tracer (Phase 3) and the consensus tracer can be swapped and compared on identical input envelopes. - IMassFeatureTracer (MassSpectrometry/Deconvolution/FeatureTracing): per-scan envelopes -> whole-file MassFeatures. - ConsensusMassFeatureTracer: wraps the consensus pipeline MassTraceBuilder -> TraceCorrector -> MassFeatureBuilder. Reuses smith-chem-wisc#1069's MassFeature as the shared feature model (Ms1FeatureFile.FromMassFeatures is the writer at the call site). - TestMassFeatureTracer (3 tests): adapter reproduces the direct pipeline; cross-charge stitch yields one feature with ChargeCount 2; interface impl. Note: this branch is now stacked on smith-chem-wisc#1069 (consensus-to-ms1feature) for the MassFeature/writer types; rebase off master after smith-chem-wisc#1069 merges. Full suite: 4034 pass / 0 fail / 5 skip. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ms1FeatureGenerator (Readers): mzML to per-scan MetaFlashDecon deconvolution to IMassFeatureTracer (native or consensus) to Ms1FeatureFile.FromMassFeatures to a _ms1.feature file. This is the "analogous to smith-chem-wisc#1069" entry point: whole-file deconvolution produces a _ms1.feature that MetaMorpheus consumes via FromFile. - Retention time is written in SECONDS (source minutes x 60) to match real OpenMS FLASHDeconv output; configurable via secondsPerRtUnit (pass 1.0 if already seconds). - Output is the TopFD-superset _ms1.feature schema: the 11 canonical FLASHDeconv columns plus Apex_intensity (which MetaMorpheus uses for per-charge weighting). The written header is verified to contain every real-FLASHDeconv column. - TestMs1FeatureGenerator (CI, 2 tests): seconds-RT conversion + schema + round-trip on synthetic features. - JurkatMetaFlashDeconGenerator ([Explicit]): emits _ms1.feature for the 20 Jurkat mzMLs with BOTH tracers into runs/metaflashdecon_features/{native,consensus} — the Phase 6 head-to-head input vs the real FLASHDeconv files at JurkatTopDown/FlashDecon. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ful) Ms1FeatureFile.FromMassFeatures emitted one row per contiguous charge run (ToMs1Features), so a gapped charge set became several rows. That was unfaithful: the real FLASHDeconv _ms1.feature writes exactly one row per mass feature with an aggregated Min/Max charge range (verified -- every multi-row mass in the ground-truth file differs in retention time, never only in charge), and it inflated intensity by writing the full feature.SummedIntensity on every split row (a 3-run feature reported 3x its real intensity). Now emits one row per feature via ToMs1Feature. Re-expanding Min..Max on reload may surface an unobserved intermediate charge, but that matches FLASHDeconv and the charge set is metadata, not used for mass/RT pairing. Updated the gapped-set test to pin the single-row, non-inflated-intensity contract. ToMs1Features kept as a public API. Shared by the consensus tracer (smith-chem-wisc#1069) -- both now emit one row per feature. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This PRis for a consensus mass-tracing classic deconvolution post-processor (per-charge trace grouping, off-by-one correction, cross-charge feature stitching). The writer serializes its output as a FLASHDeconv-style
_ms1.featurefile MetaMorpheus already reads — bridging consensus deconvolution into search with no new wire format. Off-by-one correction is now resolution-aware: it derives its window from per-trace mass scatter and splits resolvable co-grouped species such as deamidation into their own features instead of erasing them, which bottom-up search requires. Feature output is deterministic, inputs are guarded, and the promoted types ship with 100%-line-covered unit and performance tests.TraceCorrector.Correctnow returnsList<CorrectedTrace>and derives its window from per-trace mass scatter (3σ via MAD, capped below half the isotope spacing) instead of a fixed ±0.05 Da. An envelope at a resolvable, non-isotope offset — e.g. deamidation (+0.98402 Da), which the old window snapped to the unmodified mass — is split into its own trace and surfaced as a distinct feature; when scatter is too large to resolve it from the 1.00335 Da isotope spacing it conservatively merges. Bottom-up search needs this (deamidation detection is a requirement). The only constants are the 3σ multiple and the physics constant 1.00335 — the window is data-derived, not hardwired. Note: this supersedes the flat ±0.05 used for the benchmark numbers below, so those warrant an end-to-end re-run.Summary
Two unrelated-looking but tightly-coupled goals in one PR:
Promote the consensus mass-tracing post-processor out of
Development.Deconvolution.ConsensusTracing(research-only, see NOTES.md on the consensus-tracing branch for the 1300-line research log behind the design) into a production namespaceMassSpectrometry.Deconvolution.Consensus. Per-charge trace grouping, median off-by-one correction, and cross-charge feature stitching are now available to any caller without going through theDevelopmentproject.Add a writer that turns the consensus pipeline's output into a FLASHDeconv-style
_ms1.featurefile via a staticMs1FeatureFile.FromMassFeatures(...)factory + aMassFeature.ToMs1Feature(...)extension method. The factory builds a populatedMs1FeatureFileinstance; calling its existingWriteResults(path)emits the wire format that anyMs1FeatureFileconsumer already understands — notably MetaMorpheus'sFromFileDeconvolutionParametersfrom mzLib PR 1064, which the MetaMorpheus ExternalMs1Features PR has now wired into the precursor pipeline.Combined effect: a consensus pipeline can persist its result to a file that MetaMorpheus already reads, with zero new mzLib types crossing into MetaMorpheus and zero new MetaMorpheus surface area. The previously-designed
Ms2PrecursorEmitter+Ms2ScanWithPrecursorCandidatetwo-PR plan (see NOTES.md "Pending design — MS2 emitter" section) is obviated by this approach and can be dropped.Why now
The consensus pipeline (Classic-decon → trace grouping → off-by-one correction → cross-charge clustering → optional Phase 13 FastTree scoring) was developed over Phases 0–13 of the research arc with the types parked in
Development\DeconvolutionDevelopment\ConsensusTracing\. The design has now stabilised: 82–84% recall against the independent R7 reference, mean Phase-13 CV AUC 0.68. NOTES.md flagged "WholeFile move of consensus-tracing types" as the next move once the design settled. With the MetaMorpheus FromFile path landing and the consensus pipeline ready to be consumed, this PR is that move plus the matching writer.Phase A: Type promotion (commit
3d734cc1)Six types moved out of
Development.Deconvolution.ConsensusTracinginto newMassSpectrometry.Deconvolution.Consensus:MassTracemzLib/MassSpectrometry/Deconvolution/Consensus/MassTrace.csMassTraceBuilder(static)mzLib/MassSpectrometry/Deconvolution/Consensus/MassTraceBuilder.csCorrectedEnvelopemzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedEnvelope.csCorrectedTracemzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedTrace.csTraceCorrector(static, +EnvelopeWeightdelegate)mzLib/MassSpectrometry/Deconvolution/Consensus/TraceCorrector.csMassFeaturemzLib/MassSpectrometry/Deconvolution/Consensus/MassFeature.csMassFeatureBuilder(static, extracted fromPhase4_CrossChargeConsensus)mzLib/MassSpectrometry/Deconvolution/Consensus/MassFeatureBuilder.csEach type now lives in its own file (matching the existing
MassSpectrometry/Deconvolution/Algorithms/,Parameters/,Scoring/folder discipline). Doc comments updated to drop research-only references; method bodies and signatures are byte-identical to the research-namespace versions. No behaviour change.The research scaffolding (
Phase*.csfixtures,MakeConsensusSnips,LargeTestDataLocator) stays on theconsensus-tracingbranch. That branch needs a rebase after this PR merges to point at the new namespace — a one-lineusing MassSpectrometry.Deconvolution.Consensus;per phase file plus deletion of the now-redundant source underDevelopment\. Out of scope here.Phase B:
Ms1FeatureFile.FromMassFeaturesfactory (commit61029402)Two pieces:
Extension method in
Readers/ExternalResults/IndividualResultRecords/Ms1FeatureFromMassFeatureExtensions.cs:Field mapping (FLASHDeconv canonical schema):
Ms1FeaturefieldSampleIdIdsequentialIdMassMassFeature.ConsensusMass(intensity-weighted mean across constituent traces)IntensityMassFeature.SummedIntensityRetentionTimeBeginMassFeature.RTStartRetentionTimeEndMassFeature.RTEndRetentionTimeApexIntensityApexChargeStateMin/ChargeStateMaxmin(MassFeature.Charges)/max(MassFeature.Charges)FractionIdMin/FractionIdMaxfractionId(default 0)Static factory added directly on
Ms1FeatureFileinReaders/ExternalResults/ResultFiles/Ms1FeatureFile.cs:Builds the records via
ToMs1Feature, populates a freshMs1FeatureFile, returns it. The caller invokes the existing.WriteResults(path)to emit bytes — no new IO path, the existing reader's writer is reused.Softwaredefaults toFLASHDeconvbecause that's the canonical schema; callers can override.No filter applied. Every
MassFeaturebecomes one row. Downstream consumers (MetaMorpheus's precursorHashSetdedupe + FDR machinery) handle noise. This was an explicit design decision — three alternatives (multi-charge only, score-thresholded, score-as-side-artifact) were considered and rejected for v1 to keep maximum recall and let downstream policy decide.Phase C: Round-trip tests (commit
9bc26a27)Five NUnit cases in
Test/FileReadingTests/ExternalFileReading/TestMs1FeatureFromMassFeature.cs:ToMs1Feature_SingletonFeature_MapsAllFieldsToMs1Feature_MultiChargeFeature_ApexIsDominantTraceMaxEnvelopeToMs1Feature_SampleIdAndFractionIdHonouredSample_IDandFraction_IDpropagate to the row.FromMassFeatures_WriteThenRead_AllFieldsSurviveFileReader.ReadFile<Ms1FeatureFile>→ field-by-field equality. Catches serialisation drift between writer and reader.FromMassFeatures_AssignsSequentialIdsFromZeroMassFeature.Idis ignored; written rows get0..N-1regardless of upstream filtering holes.Bug fixed while writing the tests
The original
FromMassFeaturesfactory usedfile.Results.Add(...)in a loop.ResultFile<T>.Results.getcallsLoadResults()whenever the list is empty; for an in-memory-constructedMs1FeatureFilewith noFilePath, that throwsArgumentExceptionon the empty path. Reshaped the factory to build the list up front and assign once via the public setter, which bypasses the lazy-load. Inline comment in the factory explains why. The third round-trip test caught this before the writer shipped.Results
The "broader sweep" covers
TestMsFeature,TestMs1FeatureFromMassFeature(new),TestFromFileDeconvolution,TestResultFile,TestSupportedFileExtensions,TestDinosaurTsv— every test fixture that touches theMs1Feature/FromFile/ supported-file-extensions surface.Files changed
mzLib/MassSpectrometry/Deconvolution/Consensus/MassTrace.csmzLib/MassSpectrometry/Deconvolution/Consensus/MassTraceBuilder.csmzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedEnvelope.csmzLib/MassSpectrometry/Deconvolution/Consensus/CorrectedTrace.csmzLib/MassSpectrometry/Deconvolution/Consensus/TraceCorrector.csmzLib/MassSpectrometry/Deconvolution/Consensus/MassFeature.csmzLib/MassSpectrometry/Deconvolution/Consensus/MassFeatureBuilder.csmzLib/Readers/ExternalResults/IndividualResultRecords/Ms1FeatureFromMassFeatureExtensions.csmzLib/Readers/ExternalResults/ResultFiles/Ms1FeatureFile.csFromMassFeaturesstatic +using MassSpectrometry.Deconvolution.Consensus;mzLib/Test/FileReadingTests/ExternalFileReading/TestMs1FeatureFromMassFeature.csTotal: 10 files, +830 / −3 across 3 commits.
How a consumer uses this
End-to-end on the yeast top-down test data (per the MetaMorpheus PR results table): adding the FromFile source to the today-baseline classic pipeline yields +44% PSMs and +10% protein groups at 1% FDR.
Architectural notes
IEnumerable<MassFeature>; whoever produces the features (the consensus pipeline today, possibly other producers in the future) doesn't matter._ms1.featureschema thatMs1FeatureFile.WriteResultsalready emits. The writer just constructs the records in memory.FromFileDeconvolutionAlgorithm(mzLib PR 1064), which the MetaMorpheus FromFile path invokes per MS2 scan. No new pairing code in this PR.Ms2PrecursorEmitter+Ms2ScanWithPrecursorCandidatedesign (see consensus-tracing branch's NOTES.md, "Pending design — MS2 emitter") is obviated by this PR: file-based integration achieves the same end-to-end behaviour without the new types crossing repos.Out of scope
MassFeaturebecomes a row). If a downstream consumer wants score-based filtering, that's a separate change — either at the producer (filter theMassFeaturelist before callingFromMassFeatures) or downstream (filterMs1Featurerecords afterLoadResults)..scores.tsvfile written alongside, leaving_ms1.featureschema unchanged.CorrectedTraceimplementingISingleChargeMs1Feature. Would enable an in-memory path (skipping the disk round-trip), which the file-based integration intentionally avoids. Possible follow-up if any consumer ever wants it.consensus-tracingresearch branch to consume the new namespace. The research Phase fixtures still compile againstDevelopment.Deconvolution.ConsensusTracing; that branch needsusing MassSpectrometry.Deconvolution.Consensus;added and the duplicate source files inDevelopment\deleted, but that work belongs onconsensus-tracing, not here.Test plan
MassSpectrometry.csprojbuild, 0 errors).Ms1FeatureFromMassFeatureExtensionsand updatedMs1FeatureFile(Readers.csprojbuild, 0 errors).TestMs1FeatureFromMassFeaturepass.FromMassFeaturesfactory (lazy-load via.Results.Addon empty FilePath) caught by the round-trip test and fixed before commit.ce24771c+3aa2fe04upstream first (schema aliases for newer TopFD_ms1.featurefiles — strictly speaking unrelated to this PR's writer, but anyone reproducing the MetaMorpheus integration needs them too).consensus-tracingbranch on top of this PR once merged; update Phase fixtures to useMassSpectrometry.Deconvolution.Consensus; delete the now-duplicate source underDevelopment\DeconvolutionDevelopment\ConsensusTracing\.Companion PR
smith-chem-wisc/MetaMorpheus#2650 (
ExternalMs1Features) is the consumer side. It wiresMs1FeatureFilePaththroughFileSpecificParametersand usesFromFileDeconvolutionParameters(from mzLib PR 1064) to pair the features with MS2 scans. That PR's CI is currently red againstsmith-chem-wisc/MetaMorpheus's pinned mzLib, but will go green once mzLib publishes a version containing PR 1064 + the schema aliases.