Skip to content

Fix/statistic bounded maps#1191

Merged
tobychui merged 3 commits into
tobychui:v3.3.4from
ElmoViggiani:fix/statistic-bounded-maps
Jun 6, 2026
Merged

Fix/statistic bounded maps#1191
tobychui merged 3 commits into
tobychui:v3.3.4from
ElmoViggiani:fix/statistic-bounded-maps

Conversation

@ElmoViggiani

Copy link
Copy Markdown
Contributor

Hello,
I fixed an unbounded growth bug in the statistics module with help from Claude.

"per-day statistics maps grow without bound under high-cardinality request inputs, leading to OOM-kill"

Memory usage will grow throughout the day, eventually leading to an out of memory process kill.

I'm getting millions of requests per day from many hundreds of thousands unique ip's.

The attached screenshot from the Resource Monitor plugin shows memory growing steadily before the patch, and a flat line after the patch. I'm running with -fastgeoip otherwise the cpu would be pegged.
Screenshot 2026-05-23 at 08 31 20

Before the patch, sys.db would grow 800mb+ per 24 hours and I was having to delete it at midnight and start over.
After the patch, sys.db grows much slower. Only about 160mb after 4-5 days, because there are fewer items saved to the db. I've tested -stats_max_entries with 20000, 50000, and 100000, and the memory usage has consistently stayed flat. Sys.db growth scaled with the -stats_max_entries cap, as expected.

By Claude:
In mod/statistic, the per-day DailySummary keeps a separate map for each request dimension it tracks — client IP, User-Agent, Referer, request URL, downstream and upstream hostnames, and forwarding type. Each map grows by one entry per distinct value seen, and the maps reset at midnight. For the four dimensions whose values come straight from the request stream (client IP, User-Agent, Referer, URL), the set of possible values is effectively unbounded — every distinct value lands as a new permanent map entry until the next reset. On a busy proxy this can drive memory consumption well into the gigabytes over a day of uptime, amplified by the autosave path which re-serializes the whole summary to disk every ten minutes. A handful of past issues have reported Zoraxy gradually consuming all available memory across multi-day cycles (most notably #1007 and parts of the discussion on #816); the existing -fastgeoip workaround addresses a different startup-time allocation and doesn't help with this runtime growth.

This patch adds a per-map size cap that's opt-in via a new CLI flag, -stats_max_entries=N. When unset (or =0), behavior is byte-for-byte identical to the current code — same Load → check → Store pattern, no overhead, no risk to existing deployments. When set to a positive value, each of the eight maps stays at or under N entries. Once a map fills up, the next new insert triggers a quick trim: it scans the map, sorts entries by how many requests they've received, and drops the lowest-count 10% to make room. Existing entries continue to count up at the same speed they always did; the trim only runs on the relatively rare event of a new entry arriving at a full map, and trims that happen at the same time are serialized so request handling never blocks on stat collection.

The choice to drop the lowest-count entries (rather than oldest, newest, or random) means the data the dashboards actually display is preserved. The top-IPs, top-User-Agents, top-URLs panels all surface the most-frequently-seen items in each dimension — and those are exactly the entries the cap keeps. What gets dropped is the long tail: a single visit from an unusual IP, a malformed Referer, a one-time URL request. None of those would have shown up on the dashboard anyway. The top-line totals (TotalRequest / ValidRequest / ErrorRequest) are completely untouched, so the "X requests today" headline remains exact regardless of the cap. Recommended starting value is -stats_max_entries=20000, which gives plenty of headroom for legitimate diversity while bounding the stats structure at roughly 32 MB worst case across all eight maps.


The 8 *sync.Map fields in DailySummary (the target of RecordRequest)
accept inputs whose cardinality is unbounded in principle — every
unique RequestClientIp, Referer, UserAgent, and RequestURL value
becomes a permanent map entry until the midnight reset. On a busy
host this drove the heap-growth pattern observed in profiling:
~138MB sync.Map.Store + ~71MB bbolt write + ~70MB bytes.growSlice
as top heap retainers, amplified by SaveSummaryOfDay's autosave
path (full JSON marshal every 600s, sized to the live cardinality).

This patch wraps each map with a soft cap enforced via lowest-count
eviction: on a new-key insert past the cap, snapshot the map, sort
by request count ascending, and drop the bottom 10%. Existing-key
increments stay O(1); only new-key over-cap inserts trigger the
trim. Concurrent trims are serialized via TryLock so RecordRequest
never blocks on stat collection.

Dropping the lowest-count entries first was chosen over least-
recently-used because the access pattern is increment-only —
least-recently-used would tend to evict the persistent high-count
entries (the ones that actually matter for the dashboards) and
preserve transient one-off entries. Keeping the highest-count
entries matches what the dashboards surface (top items per
dimension).

The cap is hardcoded at 20000 in this commit (sufficient headroom
for most deployments: 20000 * 8 maps * ~200B/entry ≈ 32MB worst
case). A follow-up commit gates the whole feature behind a CLI
flag so the new behavior is opt-in.

Tests cover existing-key increment (size stays at 1), low-count
eviction (high-count keys survive after 1000 inserts at cap=100),
and concurrent stress (16 writers x 2000 unique inserts, race
detector clean, size stays bounded near cap).
The bounded behavior added in the previous commit is now an opt-in
feature. The default (and the behavior with no flag) is the upstream
Load → check → Store path, byte-for-byte unchanged from the
maintainer's original code.

- New CLI flag -stats_max_entries=N (default 0). 0 disables the cap
  entirely; positive values cap each of the eight per-dimension
  sync.Maps in DailySummary at N entries. When the cap is exceeded,
  the entries with the lowest request counts are dropped first, which
  keeps the most-frequently-seen items intact and discards transient
  one-off entries.
- New CollectorOption.MaxEntriesPerStatMap field threads the flag
  through start.go to NewStatisticCollector.
- Dispatch happens once at startup via newIncrFn, which returns either
  unboundedIncr (a verbatim reproduction of the upstream pattern) or
  a closure around boundedIncr. The eight call sites in RecordRequest
  now invoke c.incr(...) — one indirect call per dimension, no
  per-request branching.
- The previously hardcoded const maxEntriesPerStatMap is removed; the
  recommended value when enabling (20000) now lives in the flag's
  help text and a comment in bounded.go.

Two new tests:
- TestNewIncrFnDispatch — verifies maxEntries=0 produces unbounded
  behavior (50 entries kept, sidecar counter untouched) and >0 produces
  bounded behavior (capped at the configured limit).
- TestUnboundedIncrMatchesUpstreamPattern — verifies unboundedIncr
  reproduces the upstream first-store-1, then-increment behavior.

For deployments wanting the bounding behavior:
  -stats_max_entries=20000
For deployments wanting upstream-identical behavior:
  pass nothing (or -stats_max_entries=0)
@ElmoViggiani ElmoViggiani requested a review from tobychui as a code owner May 26, 2026 01:37
@tobychui

Copy link
Copy Markdown
Owner

This is a lot of changes. I think I might need some time to review this properly.

Rebasing to v3.3.4 for now.

@tobychui tobychui changed the base branch from v3.3.3 to v3.3.4 May 28, 2026 11:07
@tobychui tobychui requested a review from Copilot June 4, 2026 13:17

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an optional per-dimension entry cap to the statistics module to prevent unbounded growth of the per-day high-cardinality maps (IP, UA, Referer, URL, etc.), addressing long-running memory growth/OOM scenarios on busy proxies.

Changes:

  • Introduces -stats_max_entries to enable a bounded mode for the per-dimension stats maps.
  • Refactors request recording to use a single increment dispatch (c.incr) that selects bounded vs. unbounded behavior at startup.
  • Adds bounded-map trimming logic (evict lowest-frequency entries) plus internal tests for the trimming behavior.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/def.go Adds the -stats_max_entries CLI flag for bounding per-dimension stats.
src/start.go Wires the new stats cap option into the statistic collector initialization.
src/mod/statistic/statistic.go Adds bounded-sidecar state to DailySummary, introduces increment dispatch, and routes per-request increments through it.
src/mod/statistic/bounded.go Implements bounded increment + eviction (trim) logic for sync.Map-backed counters.
src/mod/statistic/bounded_internal_test.go Adds unit tests for bounded increment/trim behavior and concurrent insertion behavior.
src/mod/statistic/structconv.go Ensures loaded daily summaries initialize the bounded sidecars consistently with persisted map sizes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/mod/statistic/bounded.go
Comment thread src/def.go Outdated
Comment thread src/mod/statistic/statistic.go Outdated
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
@ElmoViggiani

Copy link
Copy Markdown
Contributor Author

I updated my branch with the requested Copilot changes. It compiles ok and passes the tests. I'm running it now in prod. My previous version with the fix was running for 10 days without any memory creep. It has been stable at 720 MiB on a busy site.

Sys.db is still 161MB as when I started, indicating the db entries are getting cleared as scheduled and boltdb is reusing the cleared space as expected.

Image 6-4-26 at 15 41

@tobychui

tobychui commented Jun 4, 2026

Copy link
Copy Markdown
Owner

Thanks for the followup! Will be merging this and deploy to my dev branch this weekend 👍🏻

@tobychui tobychui left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tobychui tobychui merged commit d1e423a into tobychui:v3.3.4 Jun 6, 2026
0 of 2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants