fix: replace manual anomalies with a Hampel filter #1997
matthewp wants to merge 3 commits into npmx-dev:main
Conversation
📝 Walkthrough
This PR replaces blocklist-based anomaly correction with a Hampel-filter implementation (adding `applyHampelCorrection`) and removes the `DOWNLOAD_ANOMALIES` dataset and related blocklist functions. TrendsChart and WeeklyDownloadStats now call the Hampel correction. Per-package anomaly UI, detailed anomaly tooltips, and related state were removed or simplified. i18n keys for known anomaly ranges were deleted across locales and the schema. Tests were updated to exercise the Hampel-based correction.
Actionable comments posted: 3
📒 Files selected for processing (5)
- app/components/Package/TrendsChart.vue
- app/components/Package/WeeklyDownloadStats.vue
- app/utils/download-anomalies.data.ts
- app/utils/download-anomalies.ts
- test/unit/app/utils/download-anomalies.spec.ts
💤 Files with no reviewable changes (1)
- app/utils/download-anomalies.data.ts
```ts
// halfWindow controls how many neighbors on each side to consider.
// A window of 3 means we look at 7 points total (3 left + current + 3 right).
const halfWindow = opts?.halfWindow ?? DEFAULT_HALF_WINDOW
```
Do not score boundary samples with truncated windows.
Line 33 says a halfWindow of 3 uses 3 neighbours on each side, but Lines 50-52 clamp the first and last samples to shorter windows after Line 42 only validates the overall series length. That makes edge points easy false positives: 100,100,100,100,100,100,200 gets its last point flattened back to 100 because there is no right-hand context and windowMad falls to 0. Skip indices that cannot form a full symmetric window, or handle boundaries explicitly.
Suggested fix

```diff
-  for (let i = 0; i < values.length; i++) {
-    // Build a sliding window around the current point, clamped to array bounds.
-    const start = Math.max(0, i - halfWindow)
-    const end = Math.min(values.length - 1, i + halfWindow)
+  for (let i = halfWindow; i < values.length - halfWindow; i++) {
+    // Only evaluate points that have a full symmetric window.
+    const start = i - halfWindow
+    const end = i + halfWindow
     const window = values.slice(start, end + 1)
```

Also applies to: 41-42, 48-52
```ts
// MAD of 0 means most values in the window are identical.
// If this point differs from the median at all, it's an outlier.
if (windowMad === 0) {
  if (deviation > 0) {
    result[i]!.value = Math.round(windowMedian)
    result[i]!.hasAnomaly = true
  }
  continue
```
The zero-MAD branch will erase real low-volume traffic.
When windowMad === 0, Line 68 treats any deviation from the local median as an anomaly. For sparse packages, a legitimate series like 0,0,0,1,0,0,0 is rewritten to all zeros on Line 69, which drops the only real activity in that period. Please gate this path behind an absolute/relative floor, or leave zero-MAD windows untouched.
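One shape such a guard could take (a sketch only; `correctZeroMad` and its `minDeviation` floor are hypothetical names, not part of this PR):

```typescript
// Sketch: gate the zero-MAD branch behind an absolute deviation floor.
// `minDeviation` is a hypothetical option, not part of the PR's code.
function correctZeroMad(
  value: number,
  windowMedian: number,
  minDeviation = 10,
): { value: number; hasAnomaly: boolean } {
  const deviation = Math.abs(value - windowMedian)
  // In a flat window (MAD === 0), only flag points whose deviation
  // exceeds an absolute floor, so sparse series like 0,0,0,1,0,0,0
  // keep their single week of real activity.
  if (deviation > minDeviation) {
    return { value: Math.round(windowMedian), hasAnomaly: true }
  }
  return { value, hasAnomaly: false }
}

console.log(correctZeroMad(1, 0))      // small blip in a flat window: kept
console.log(correctZeroMad(5000, 100)) // large spike over a flat window: flattened
```

A relative floor (e.g. a multiple of the window median) would work the same way; the point is that "any deviation at all" is too aggressive when the MAD collapses to zero.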
```diff
 describe('applyHampelCorrection', () => {
   it('flags and corrects a spike in the middle of steady data', () => {
     const data: WeeklyDataPoint[] = [
       makeWeeklyPoint('2022-11-07', 100),
       makeWeeklyPoint('2022-11-14', 100),
       makeWeeklyPoint('2022-11-21', 100),
       makeWeeklyPoint('2022-11-28', 1000), // spike
       makeWeeklyPoint('2022-12-05', 100),
       makeWeeklyPoint('2022-12-12', 100),
       makeWeeklyPoint('2022-12-19', 100),
     ]

     const result = applyHampelCorrection(data) as WeeklyDataPoint[]

     // The spike should be corrected
     expect(result[3]!.hasAnomaly).toBe(true)
     expect(result[3]!.value).toBe(100) // replaced with median

     // Non-spike points should be unchanged
     expect(result[0]!.value).toBe(100)
     expect(result[0]!.hasAnomaly).toBeUndefined()
     expect(result[1]!.value).toBe(100)
     expect(result[6]!.value).toBe(100)
   })

   it('does not flag gradual growth as anomalies', () => {
     const data: WeeklyDataPoint[] = [
       makeWeeklyPoint('2022-11-07', 100),
       makeWeeklyPoint('2022-11-14', 110),
       makeWeeklyPoint('2022-11-21', 120),
       makeWeeklyPoint('2022-11-28', 130),
       makeWeeklyPoint('2022-12-05', 140),
       makeWeeklyPoint('2022-12-12', 150),
       makeWeeklyPoint('2022-12-19', 160),
     ]

     const result = applyHampelCorrection(data) as WeeklyDataPoint[]

     for (const point of result) {
       expect(point.hasAnomaly).toBeUndefined()
     }
   })

   it('returns data unchanged when too few points for the window', () => {
     const data: WeeklyDataPoint[] = [
       makeWeeklyPoint('2022-11-07', 100),
       makeWeeklyPoint('2022-11-14', 1000),
       makeWeeklyPoint('2022-11-21', 100),
     ]

     const result = applyHampelCorrection(data) as WeeklyDataPoint[]
     expect(result[1]!.value).toBe(1000) // not enough data to detect
   })

   it('does not mutate the original data', () => {
     const data: WeeklyDataPoint[] = [
-      { value: 100, weekKey: '2022-11-07_2022-11-13', weekStart: '2022-11-07', weekEnd: '2022-11-13', timestampStart: 0, timestampEnd: 0 },
-      { value: 999, weekKey: '2022-11-14_2022-11-20', weekStart: '2022-11-14', weekEnd: '2022-11-20', timestampStart: 0, timestampEnd: 0 },
-      { value: 999, weekKey: '2022-11-21_2022-11-27', weekStart: '2022-11-21', weekEnd: '2022-11-27', timestampStart: 0, timestampEnd: 0 },
-      { value: 999, weekKey: '2022-11-28_2022-12-04', weekStart: '2022-11-28', weekEnd: '2022-12-04', timestampStart: 0, timestampEnd: 0 },
-      { value: 200, weekKey: '2022-12-05_2022-12-11', weekStart: '2022-12-05', weekEnd: '2022-12-11', timestampStart: 0, timestampEnd: 0 },
+      makeWeeklyPoint('2022-11-07', 100),
+      makeWeeklyPoint('2022-11-14', 100),
+      makeWeeklyPoint('2022-11-21', 100),
+      makeWeeklyPoint('2022-11-28', 1000),
+      makeWeeklyPoint('2022-12-05', 100),
+      makeWeeklyPoint('2022-12-12', 100),
+      makeWeeklyPoint('2022-12-19', 100),
     ]

-    expect(
-      applyBlocklistCorrection({
-        data,
-        packageName: 'svelte',
-        granularity: 'weekly',
-      }),
-    ).toEqual([
-      data[0],
-      { ...data[1], value: 125, hasAnomaly: true },
-      { ...data[2], value: 150, hasAnomaly: true },
-      { ...data[3], value: 175, hasAnomaly: true },
-      data[4],
-    ])
+    applyHampelCorrection(data)
+    expect(data[3]!.value).toBe(1000) // original unchanged
   })
```
Please add regression tests for the boundary and sparse-series cases.
The suite covers a centred large spike and gradual growth, but not the two easiest false-positive cases in this implementation: the first/last halfWindow indices, and flat low-volume series such as 0,0,0,1,0,0,0. Please pin both behaviours down here so the filter does not silently flatten real data at the edges or on sparse packages.
As per coding guidelines "**/*.{test,spec}.{ts,tsx}: Write unit tests for core functionality using vitest".
🧹 Nitpick comments (1)
app/components/Package/TrendsChart.vue (1)
1831-1838: Consider using `v-model` for simpler checkbox binding.

The explicit `:checked` + `@change` pattern is functionally correct, but `v-model` provides equivalent behaviour with less boilerplate.

♻️ Optional simplification

```diff
 <input
-  :checked="settings.chartFilter.anomaliesFixed"
-  @change="
-    settings.chartFilter.anomaliesFixed = ($event.target as HTMLInputElement).checked
-  "
+  v-model="settings.chartFilter.anomaliesFixed"
   type="checkbox"
   class="accent-[var(--accent-color,var(--fg-subtle))]"
 />
```
📒 Files selected for processing (17)
- app/components/Package/TrendsChart.vue
- i18n/locales/bg-BG.json
- i18n/locales/cs-CZ.json
- i18n/locales/de-DE.json
- i18n/locales/en.json
- i18n/locales/es.json
- i18n/locales/fr-FR.json
- i18n/locales/hu-HU.json
- i18n/locales/id-ID.json
- i18n/locales/ja-JP.json
- i18n/locales/pl-PL.json
- i18n/locales/ru-RU.json
- i18n/locales/tr-TR.json
- i18n/locales/uk-UA.json
- i18n/locales/zh-CN.json
- i18n/locales/zh-TW.json
- i18n/schema.json
💤 Files with no reviewable changes (16)
- i18n/locales/en.json
- i18n/locales/bg-BG.json
- i18n/locales/es.json
- i18n/locales/de-DE.json
- i18n/locales/zh-TW.json
- i18n/locales/fr-FR.json
- i18n/locales/tr-TR.json
- i18n/locales/cs-CZ.json
- i18n/locales/hu-HU.json
- i18n/schema.json
- i18n/locales/zh-CN.json
- i18n/locales/uk-UA.json
- i18n/locales/ru-RU.json
- i18n/locales/ja-JP.json
- i18n/locales/pl-PL.json
- i18n/locales/id-ID.json
this looks very promising! tagging @jycouet who may have some thoughts on this 🙏
Great to have a second look into this. You probably checked my initial PR, #1636: I started with a Hampel implementation 👍

If I remember correctly, the sweet spot was 2.5 or 3 for vite. But this setting would hide the "great start" of a lib (e.g. 0 0 0 0 0 0 20000).

Another note: when the start or end of the chart falls inside an anomaly period, no algorithm can fix it. A good example: starting the chart in the middle of the vite spike.

Maybe we should add some tests around all this? (to keep the intent)

Another note 2: I would love to have more than just vite there today! 😅
Thanks for this! A few thoughts after pulling the branch:

Reproduction URL that matters:

"Great start" test cases: Packages like

PR suggestion

Could we expose the Hampel filter as its own (separate from manual correction), with tweakable halfWindow / threshold sliders? That way we can compare manual vs Hampel side by side and find the right balance before committing to one approach?

// Happy coding
I don't follow, as its own what?
Yes I can do that.

🔗 Linked issue
Previous issue: #1707
🧭 Context
The current implementation of anomaly removal is biased because the anomaly list is curated manually.
This PR replaces it with an implementation that uses a Hampel filter to remove deviations automatically.
📚 Description
The important part here is `applyHampelCorrection`. It goes over each data point and builds a sliding window of neighbouring points, by default 3 on each side.
It takes the median of those points, since a spike can't pull the median off centre.
It then measures how spread out the neighbours are by computing the MAD (median absolute deviation; see here).
Finally, each point gets a score, and if the score is above a threshold, the point is replaced with the median of its window and marked as an anomaly.
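Those steps can be sketched on raw numbers (a minimal illustration of the approach described above, not the PR's actual code, which operates on `WeeklyDataPoint` objects; the `halfWindow` and `threshold` names mirror the description):

```typescript
// Median of a numeric array (copy-and-sort so the input is not mutated).
function median(xs: number[]): number {
  const s = [...xs].sort((a, b) => a - b)
  const mid = Math.floor(s.length / 2)
  return s.length % 2 ? s[mid]! : (s[mid - 1]! + s[mid]!) / 2
}

// Minimal Hampel filter: slide a window over the series, score each point
// by its deviation from the window median in MAD units, and replace
// outliers with that median.
function hampel(values: number[], halfWindow = 3, threshold = 3): number[] {
  const K = 1.4826 // scales MAD to a standard-deviation estimate for normal data
  return values.map((v, i) => {
    // Window clamped to the array bounds, as in the PR.
    const start = Math.max(0, i - halfWindow)
    const end = Math.min(values.length - 1, i + halfWindow)
    const window = values.slice(start, end + 1)
    const m = median(window)
    const mad = median(window.map(x => Math.abs(x - m)))
    if (mad === 0) return v // flat window: left untouched in this sketch
    return Math.abs(v - m) / (K * mad) > threshold ? m : v
  })
}

console.log(hampel([100, 102, 98, 1000, 101, 99, 100]))
// → [100, 102, 98, 100, 101, 99, 100]  (the spike becomes the window median)
```

Note the slight jitter in the example input: in a perfectly flat window the MAD collapses to 0, which is exactly the case the PR handles with its separate zero-MAD branch.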
Here are two articles about how Hampel filters work:
Here's a screenshot of Vite which still shows its spike removed after this change.