carrier: restore endpoint-count worker scaling + per-account idle semaphore #146

Merged: Kianmhz merged 1 commit into main from spike/unified-pool-per-account-sem on May 21, 2026

@Kianmhz Kianmhz commented May 21, 2026

Summary

v1.6 changed the worker formula from v1.5's workersPerEndpoint × len(endpoints) to (workersPerEndpoint + idleSlots − 1) × bucketCount (where bucketCount is the number of distinct Google account labels). For configs without account labels — the most common legacy pattern, since most users never edited the deprecated v1.4 example config to add account labels — every endpoint shares one empty-string bucket, so worker count collapses:

| Config | v1.5 | v1.6 (incl #143) | this PR |
|---|---|---|---|
| 5 unlabeled endpoints | 15 | 3 | 15 |
| 5 deployments across 5 labeled accounts | 15 | 15 | 15 |
| 9 deployments across 5 accounts (#113) | 27 | 15 | 27 |
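The table can be reproduced from the two formulas. The following is a minimal sketch, not the carrier's actual code: the `Endpoint` type, `makeEndpoints`, and the function names are hypothetical, `workersPerEndpoint = 3` is the post-#143 value, and `idleSlots = 1` is assumed to be the default in effect when the regression was measured.

```go
package main

import "fmt"

// Endpoint is a hypothetical stand-in for one configured deployment.
type Endpoint struct {
	URL     string
	Account string // empty for legacy unlabeled configs
}

const (
	workersPerEndpoint = 3 // value after #143
	idleSlots          = 1 // assumed default at the time of the regression
)

// workersV15 scales with the number of endpoints (v1.5 and this PR).
func workersV15(eps []Endpoint) int {
	return workersPerEndpoint * len(eps)
}

// workersV16 scales with the number of distinct account labels.
// Unlabeled endpoints all share the empty-string bucket.
func workersV16(eps []Endpoint) int {
	buckets := make(map[string]struct{})
	for _, ep := range eps {
		buckets[ep.Account] = struct{}{}
	}
	return (workersPerEndpoint + idleSlots - 1) * len(buckets)
}

// makeEndpoints builds n endpoints spread round-robin over the given
// number of accounts; accounts == 0 leaves every endpoint unlabeled.
func makeEndpoints(n, accounts int) []Endpoint {
	eps := make([]Endpoint, 0, n)
	for i := 0; i < n; i++ {
		label := ""
		if accounts > 0 {
			label = fmt.Sprintf("acct-%d", i%accounts)
		}
		eps = append(eps, Endpoint{
			URL:     fmt.Sprintf("https://example.invalid/%d", i),
			Account: label,
		})
	}
	return eps
}

func main() {
	cases := []struct {
		name     string
		n        int
		accounts int
	}{
		{"5 unlabeled endpoints", 5, 0},
		{"5 deployments across 5 labeled accounts", 5, 5},
		{"9 deployments across 5 accounts", 9, 5},
	}
	for _, c := range cases {
		eps := makeEndpoints(c.n, c.accounts)
		fmt.Printf("%-42s v1.5/this PR=%d v1.6=%d\n", c.name, workersV15(eps), workersV16(eps))
	}
}
```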

#143 already shipped the right per-worker behavior: it reverted workersPerEndpoint from 4 to 3, which fixed upload throughput on the bench's 1-endpoint config. But it does nothing about the worker-count regression for multi-endpoint configs: those users still get a worker count scaled by bucketCount rather than by endpoint count.

This PR restores v1.5's endpoint-count scaling and replaces the global idle-slot counter with a per-account semaphore inside the picker. The semaphore keeps the anti-abuse cap from #56 (a single Google account running multiple deployments could not sustain the concurrency implied by v1.5's worker count) while letting workers rotate freely across accounts.

What changes

Two pickers now:

  • pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying TX, which terminate quickly with the drained batch and don't camp an account's concurrency budget — matches v1.5 behavior.
  • pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the candidate's bucket. Atomically reserves; pollOnce releases on return.
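The split between the two pickers can be sketched roughly as follows. This is an illustration under assumptions, not the code in the diff: the `picker` and `endpoint` types, the round-robin scan, and returning a release closure are all hypothetical choices, and only the behavior described in the two bullets above is taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint and picker are hypothetical types for illustration only.
type endpoint struct {
	url         string
	bucket      string // account label, or an implicit per-URL key
	blacklisted bool
}

type picker struct {
	mu        sync.Mutex
	endpoints []endpoint
	idleInUse map[string]int // idle slots currently held, per bucket
	idleCap   int            // idle slots allowed per bucket
	next      int            // round-robin cursor
}

func newPicker(eps []endpoint, idleCap int) *picker {
	return &picker{endpoints: eps, idleInUse: make(map[string]int), idleCap: idleCap}
}

// scan returns the next non-blacklisted endpoint accepted by the filter.
// The caller must hold p.mu.
func (p *picker) scan(accept func(endpoint) bool) (endpoint, bool) {
	for i := 0; i < len(p.endpoints); i++ {
		ep := p.endpoints[(p.next+i)%len(p.endpoints)]
		if !ep.blacklisted && accept(ep) {
			p.next = (p.next + i + 1) % len(p.endpoints)
			return ep, true
		}
	}
	return endpoint{}, false
}

// pickRelay is blacklist-aware and applies no cap.
func (p *picker) pickRelay() (endpoint, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.scan(func(endpoint) bool { return true })
}

// pickIdle also requires a free idle slot in the candidate's bucket.
// The slot is reserved under the same lock as the check, so check and
// reserve are atomic. The caller releases the slot when the poll returns.
func (p *picker) pickIdle() (endpoint, func(), bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	ep, ok := p.scan(func(e endpoint) bool { return p.idleInUse[e.bucket] < p.idleCap })
	if !ok {
		return endpoint{}, nil, false
	}
	p.idleInUse[ep.bucket]++
	var once sync.Once
	release := func() {
		once.Do(func() { // guard against double release
			p.mu.Lock()
			p.idleInUse[ep.bucket]--
			p.mu.Unlock()
		})
	}
	return ep, release, true
}

func main() {
	p := newPicker([]endpoint{
		{url: "https://example.invalid/a1", bucket: "acct-a"},
		{url: "https://example.invalid/a2", bucket: "acct-a"},
		{url: "https://example.invalid/b1", bucket: "acct-b"},
	}, 1)

	first, release, _ := p.pickIdle()
	second, _, _ := p.pickIdle()
	_, _, third := p.pickIdle()
	fmt.Println(first.bucket, second.bucket, third)

	_, relayOK := p.pickRelay() // relay picks ignore the idle cap
	fmt.Println(relayOK)

	release()
	_, _, again := p.pickIdle()
	fmt.Println(again)
}
```

Doing the check and the reservation under one mutex is what makes the reserve atomic in this sketch; a buffered channel per bucket would be another way to get the same property.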

Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so legacy multi-endpoint configs no longer share one bucket with one idle slot — they get one slot per endpoint, like v1.5.
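A minimal sketch of the bucket-key rule, assuming that a labeled endpoint's key is simply its account label (the PR only states the `"url:"+url` form for unlabeled endpoints). The `endpoint` type and both function names are hypothetical.

```go
package main

import "fmt"

// endpoint is a hypothetical config entry for illustration.
type endpoint struct {
	url     string
	account string // empty for legacy unlabeled configs
}

// bucketKey returns the semaphore bucket for an endpoint.
// Assumption: labeled endpoints use the label itself as the key.
// Unlabeled endpoints get an implicit per-URL bucket, as in the PR.
func bucketKey(ep endpoint) string {
	if ep.account != "" {
		return ep.account
	}
	return "url:" + ep.url
}

// totalIdleSlots counts distinct buckets and multiplies by the per-bucket cap.
func totalIdleSlots(eps []endpoint, slotsPerBucket int) int {
	seen := make(map[string]struct{})
	for _, ep := range eps {
		seen[bucketKey(ep)] = struct{}{}
	}
	return len(seen) * slotsPerBucket
}

func main() {
	legacy := []endpoint{
		{url: "https://example.invalid/1"},
		{url: "https://example.invalid/2"},
		{url: "https://example.invalid/3"},
	}
	shared := []endpoint{
		{url: "https://example.invalid/1", account: "acct-a"},
		{url: "https://example.invalid/2", account: "acct-a"},
	}
	fmt.Println(totalIdleSlots(legacy, 1)) // one slot per unlabeled endpoint
	fmt.Println(totalIdleSlots(shared, 1)) // one slot for the shared account
}
```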

What the bench shows (and doesn't)

The harness only configures one endpoint, so it can't directly measure the multi-endpoint win. The bench confirms no regression on the 1-endpoint config:

| Metric | main (#143) | this PR |
|---|---|---|
| throughput_up_8MB_1session | 22 MB/s | 22.26 MB/s |
| throughput_up_8MB_4sessions | ~87 MB/s | 87.17 MB/s |
| sessions_per_sec | 4.5/s | 4.39/s |

Three runs each, all stable.

The real benefit lives in the worker-count table above. It is verifiable from the math in the diff (the line numWorkers = workersPerEndpoint * len(endpoints)), not from this bench. The most-affected user segment (legacy unlabeled multi-endpoint configs) goes from 3 → 15 workers; #113's specific setup goes from 15 → 27.

Follow-up

A multi-endpoint bench scenario should be added so this kind of regression can be caught automatically next time. Out of scope for this PR — would need a harness change to spawn N fake /tunnel endpoints.
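One possible shape for that harness change, sketched with the standard library's `net/http/httptest`. Everything here is hypothetical (the `fakeTunnel` type, `spawnFakeTunnels`, `probe`), and the handler only counts requests; it does not speak the real tunnel protocol, which a usable fake would have to emulate.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// fakeTunnel is a hypothetical stand-in for one deployment's /tunnel endpoint.
type fakeTunnel struct {
	URL    string
	hits   atomic.Int64
	server *httptest.Server
}

// Hits reports how many requests this fake endpoint has served.
func (ft *fakeTunnel) Hits() int64 { return ft.hits.Load() }

// Close shuts down the underlying test server.
func (ft *fakeTunnel) Close() { ft.server.Close() }

// spawnFakeTunnels starts n local HTTP servers, each serving /tunnel
// and counting the requests it receives.
func spawnFakeTunnels(n int) []*fakeTunnel {
	tunnels := make([]*fakeTunnel, 0, n)
	for i := 0; i < n; i++ {
		ft := &fakeTunnel{}
		mux := http.NewServeMux()
		mux.HandleFunc("/tunnel", func(w http.ResponseWriter, r *http.Request) {
			ft.hits.Add(1)
			w.WriteHeader(http.StatusOK)
		})
		ft.server = httptest.NewServer(mux)
		ft.URL = ft.server.URL + "/tunnel"
		tunnels = append(tunnels, ft)
	}
	return tunnels
}

// probe issues one GET and returns the HTTP status code.
func probe(url string) (int, error) {
	resp, err := http.Get(url)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return resp.StatusCode, nil
}

func main() {
	tunnels := spawnFakeTunnels(3)
	for _, ft := range tunnels {
		status, err := probe(ft.URL)
		fmt.Println(ft.URL, status, err, ft.Hits())
		ft.Close()
	}
}
```

Per-endpoint hit counters would let a bench assert how traffic spreads across endpoints, which is the property the 1-endpoint harness cannot observe.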

Verification

  • go test -count=1 -timeout 90s ./... — all green
  • go vet ./... — clean
  • 3-run bench vs v1.6.0 baseline — no regression on 1-endpoint scenarios
  • All existing TestCarrier_PureDownloadIdleCap / TestCarrier_IdleSlotsPerBucket tests pass without modification because the per-bucket cap math happens to give the same result on the configs they exercise.

🤖 Generated with Claude Code

…aphore

v1.6 changed the worker formula from v1.5's `workersPerEndpoint × len(endpoints)`
to `(workersPerEndpoint + idleSlots - 1) × bucketCount`, where bucketCount is
the number of distinct Google account labels. For configs without account
labels — the most common legacy pattern, since most users never edited the
deprecated v1.4 example config — every endpoint shares one empty-string bucket,
so worker count collapses:

  config                                  v1.5  v1.6 (incl #143)  this PR
  ----------------------------------------------------------------------
  5 unlabeled endpoints                     15                 3       15
  5 deployments across 5 labeled accounts   15                15       15
  9 deployments across 5 accounts (#113)    27                15       27

#143 already shipped the right per-worker behavior (revert workersPerEndpoint
to 3, fixes upload throughput at the bench's 1-endpoint config). But it does
nothing about the worker count regression for multi-endpoint configs — those
users still run with bucketCount-scaled workers.

This PR restores v1.5's endpoint-count scaling AND replaces the global idle-slot
counter with a per-account semaphore inside the picker. The semaphore keeps the
anti-abuse cap (issue #56 — multiple deployments under one account couldn't
sustain the v1.5 worker count's concurrency on the same Google account) while
letting workers freely rotate across accounts.

Two pickers now:
- pickRelayEndpoint: blacklist-aware, no cap. Used for active polls carrying
  TX, which terminate quickly with the drained batch and don't camp an
  account's concurrency budget — matches v1.5 behavior.
- pickIdleEndpoint: blacklist-aware AND requires a free idle slot in the
  candidate's bucket. Atomically reserves; pollOnce releases on return.

Each unlabeled endpoint gets its own implicit bucket (key = "url:"+url) so
legacy multi-endpoint configs no longer share one bucket with 1 idle slot —
they get one slot per endpoint, like v1.5.

Bench on the 1-endpoint config (the only one the harness can drive):
  throughput_up_8MB_1session    22.26 MB/s (unchanged from main)
  throughput_up_8MB_4sessions   87.17 MB/s (unchanged)
  sessions_per_sec              4.39 /s    (unchanged, within noise)

No regression on the configs the bench covers. The actual benefit lives in
the worker-count table above — verifiable from the math, not the bench. A
multi-endpoint bench scenario should be a follow-up so this kind of
regression can be caught automatically next time.
@Kianmhz Kianmhz force-pushed the spike/unified-pool-per-account-sem branch from a01ad0e to afc02b1 on May 21, 2026 05:12
@Kianmhz Kianmhz merged commit ea76832 into main May 21, 2026
6 checks passed
@Kianmhz Kianmhz deleted the spike/unified-pool-per-account-sem branch May 21, 2026 18:35
Kianmhz added a commit that referenced this pull request May 21, 2026
The default-1 baseline was set conservatively against issue #56 when
the worker pool was still per-bucket-scaled. With #146's per-account
semaphore enforcing the safety cap, two idle slots per bucket is the
better default — it raises download responsiveness without putting
more concurrent UrlFetchApp executions on any one account than the
recommended multi-deployment setup can sustain. Lower to 1 only if
you run a single deployment per account; raise to 3 if you run three or more.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>