feat: optional kernel SYN rate-limiter + per-endpoint DC connect timeout #363
Merged
Adds an OPTIONAL, default-OFF kernel-level per-source-IP inbound SYN
rate-limiter on the proxy port, via iptables `-m hashlimit`. It drops
abusive first-SYN bursts in the kernel BEFORE accept(), complementing the
in-proxy guards (handshake flood guard / per-/24 subnet limiter) which run
AFTER accept() and themselves default OFF — so every attack SYN currently
still costs a socket + an accept() syscall.
Idea borrowed from MTproxy-reanimation (which does this with nftables for
Telemt/MTProxyMax). We use iptables hashlimit to match our existing
iptables stack (TCPMSS, nfqws) — no new nftables dependency — and run it as
a SEPARATE systemd oneshot unit (mtproto-syn-limit.service) so CAP_NET_ADMIN
never has to be granted to mtproto-proxy itself.
- `mtbuddy setup syn-limit [--preset soft|medium|hard] [--rate N/second]
[--burst N] [--remove] [--status]`; interactive menu item too.
- Presets mirror the source: soft 2/s burst 5 (default when enabling,
CGNAT-safer), medium 1/s burst 3, hard 1/s burst 1.
- Default OFF behind a loud CGNAT/VPN false-positive warning (the same
shared-egress-IP problem that keeps the in-proxy guards off); warns when
accept_proxy_protocol is set (kernel sees the LB IP, not real clients).
- Drop counter surfaced in `mtbuddy status`; verifies the rule actually
landed (xt_hashlimit present) instead of a silent no-op.
- Idempotent apply (remove-before-add) and full uninstall cleanup
(replay-delete INPUT jumps → flush → delete chain, both v4/v6).
- The generated script avoids `{`/`}` so std.fmt can render it; rate/burst/
port are validated before being baked in. Pure renderer is unit-tested.
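
As a rough illustration of the "validate, then bake in" step above, here is a minimal Python sketch of a pure renderer. The preset values come from this description; the function names, the chain name `MTPROTO_SYN`, the hashlimit table name `mtp_syn`, the validation bounds, and the exact rule layout are assumptions made for the example, not the script mtbuddy actually generates.

```python
import re

# Preset values are taken from the description above.
PRESETS = {
    "soft": ("2/second", 5),
    "medium": ("1/second", 3),
    "hard": ("1/second", 1),
}

# hashlimit accepts rates of the form N/second, N/minute, N/hour, N/day.
_RATE_RE = re.compile(r"^[1-9][0-9]{0,5}/(second|minute|hour|day)$")


def validate(rate, burst, port):
    """Reject anything that is not a plain rate, burst, or port value.

    The numeric bounds here are arbitrary sketch limits (assumption).
    """
    if not isinstance(rate, str) or not _RATE_RE.match(rate):
        raise ValueError(f"bad rate: {rate!r}")
    if not isinstance(burst, int) or not 1 <= burst <= 10000:
        raise ValueError(f"bad burst: {burst!r}")
    if not isinstance(port, int) or not 1 <= port <= 65535:
        raise ValueError(f"bad port: {port!r}")


def render_rule(port, rate, burst, chain="MTPROTO_SYN"):
    """Render one per-source-IP SYN drop rule (hypothetical layout)."""
    validate(rate, burst, port)
    return (
        f"iptables -A {chain} -p tcp --syn --dport {port} "
        f"-m hashlimit --hashlimit-name mtp_syn --hashlimit-mode srcip "
        f"--hashlimit-above {rate} --hashlimit-burst {burst} -j DROP"
    )


if __name__ == "__main__":
    rate, burst = PRESETS["soft"]
    print(render_rule(443, rate, burst))
```

Because validation runs before any string is assembled, a value such as `"2/second; reboot"` is rejected rather than ending up inside a root-executed script.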
Adds `dc_connect_timeout_sec` (default 10): a per-endpoint deadline for
completing the TCP connect to a Telegram DC endpoint.
A filtered/black-holed endpoint sends no RST, so the kernel keeps the
connect in SYN_SENT for ~2 min. handshake_timeout_sec (15s, measured from
the client's first byte) already caps the whole client handshake, so we are
NOT exposed to an unbounded hang — but that cap is GLOBAL, so a slow first
endpoint starves the failover budget for the remaining endpoints. This
fires per endpoint: if connect() hasn't completed within the deadline, the
endpoint is abandoned and failover advances to the next one (within the
overall handshake_timeout_sec ceiling).
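
The budget argument above can be checked with a small simulation. This is a simplified sketch, not the proxy's logic: the function name is hypothetical, endpoints are tried strictly in order, and the time the client spends before the upstream connect starts is ignored.

```python
def first_reachable_endpoint(connect_times, handshake_timeout, per_endpoint_timeout=None):
    """Return the index of the endpoint that connects, or None if the budget runs out.

    connect_times: seconds each endpoint needs to complete connect();
    float("inf") models a black-holed endpoint that never answers.
    """
    elapsed = 0.0
    for index, needed in enumerate(connect_times):
        remaining = handshake_timeout - elapsed
        if remaining <= 0:
            return None  # global handshake ceiling reached
        # Without a per-endpoint deadline, one attempt may eat the whole budget.
        deadline = remaining if per_endpoint_timeout is None else min(remaining, per_endpoint_timeout)
        if needed <= deadline:
            return index
        elapsed += deadline  # attempt abandoned, move to the next endpoint
    return None


if __name__ == "__main__":
    dead_then_healthy = [float("inf"), 0.2]
    print(first_reachable_endpoint(dead_then_healthy, 15))      # global cap only
    print(first_reachable_endpoint(dead_then_healthy, 15, 10))  # with per-endpoint deadline
```

With only the 15s global cap, the dead first endpoint consumes the entire budget and the healthy second endpoint is never tried. With a 10s per-endpoint deadline, 5s remain for the second endpoint, which is ample for a connect that completes in well under a second.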
Implementation mirrors onUpstreamConnectComplete's failure path exactly
(cleanupFailedUpstreamConnect → tryNextDcEndpoint for .dc kinds, else
closeSlot), driven from the timer tick. `upstream_connect_started_ms` is
stamped per attempt in startConnectUpstream (reset on every endpoint via the
pool's slot.* = .{}), and the check is gated on phase == .connecting_upstream
so it can never touch an established relay. A healthy connect completes in
well under a second, so this never affects working endpoints.
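
The phase gate described above can be sketched as a pure predicate evaluated on each timer tick. The real code is Zig; this Python model only borrows the field and phase names mentioned in the text, and the `Slot` shape, the other phase names, and the `>=` comparison are assumptions.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    HANDSHAKE = auto()            # hypothetical name
    CONNECTING_UPSTREAM = auto()  # corresponds to .connecting_upstream in the text
    RELAYING = auto()             # hypothetical name


@dataclass
class Slot:
    phase: Phase = Phase.HANDSHAKE
    upstream_connect_started_ms: int = 0  # stamped per connect attempt


def connect_deadline_expired(slot, now_ms, dc_connect_timeout_sec):
    """True only for a slot still connecting upstream past its deadline."""
    if slot.phase is not Phase.CONNECTING_UPSTREAM:
        return False  # established relays and other phases are never touched
    return now_ms - slot.upstream_connect_started_ms >= dc_connect_timeout_sec * 1000


if __name__ == "__main__":
    connecting = Slot(Phase.CONNECTING_UPSTREAM, upstream_connect_started_ms=1_000)
    relaying = Slot(Phase.RELAYING, upstream_connect_started_ms=1_000)
    print(connect_deadline_expired(connecting, 11_000, 10))
    print(connect_deadline_expired(relaying, 999_000, 10))
```

Because the start timestamp is re-stamped on every attempt, the deadline restarts for each endpoint instead of accumulating across failovers.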
Inspired by the tg_connect knob in the MTproxy-reanimation analysis;
deliberately does NOT raise handshake_timeout_sec (that would widen the
active-probe / slow-loris window the DD-decision and flood guards exist to
shrink). idle_timeout_sec=120 already equals their client_keepalive=120.
Mirror the two new features across all five READMEs (en/ru/zh/fa/vi) and
THREAT_MODEL.md:
- `mtbuddy setup syn-limit` command (presets, status, remove), an entry in
the abuse-guards note framing it as the optional kernel-level layer that
drops SYN bursts before accept() (separate unit, no CAP_NET_ADMIN on the
proxy), and the config-table / example mention.
- `dc_connect_timeout_sec` config example line + table row.
Translations keep technical tokens (commands, flags, config keys, hashlimit,
CAP_NET_ADMIN, accept(), SYN, SYN_SENT, RST, CGNAT) verbatim and match each
file's existing style.
Two improvements drawn from analysing MTproxy-reanimation (a tuning wrapper for Telemt/MTProxyMax). That project is not a proxy fork — its value is OS/network-level techniques applied around a proxy. Of its five techniques, only two are worth porting; the other three we already do better or they conflict with our invariants (see "Not ported" below). Each kept technique was adversarially verified against our actual code.
## 1. `mtbuddy setup syn-limit`: optional kernel SYN rate-limiter (T1)

A default-OFF, per-source-IP inbound SYN rate-limiter on the proxy port via iptables `-m hashlimit`, run as a separate systemd oneshot (`mtproto-syn-limit.service`).

Why it's not redundant with our in-proxy guards: the handshake flood guard and per-/24 subnet limiter both run after `accept()` and both default OFF (NAT/VPN false-positives). So today every abusive SYN still costs a kernel socket + an `accept()` syscall. A kernel `hashlimit` drops excess SYNs pre-accept: a different layer, complementary rather than duplicate.

### Design choices

- Separate unit, so `CAP_NET_ADMIN` never has to be granted to `mtproto-proxy`.
- Warns when `accept_proxy_protocol` is set (kernel sees the LB IP, not clients).
- Presets: soft `2/s burst 5` (default when enabling, CGNAT-safer), medium `1/s burst 3`, hard `1/s burst 1`.
- Drop counter surfaced in `mtbuddy status`; verifies the rule actually landed (`xt_hashlimit` present) instead of a silent no-op.
- The generated script is brace-free (so `std.fmt` renders it) and rate/burst/port are validated before being baked in.

## 2. `dc_connect_timeout_sec`: per-endpoint DC connect timeout (T2)

Default 10s. A filtered/black-holed DC endpoint sends no RST, so the kernel sits in `SYN_SENT` for ~2 min. `handshake_timeout_sec` (15s from the client's first byte) already caps the whole handshake, so we are not exposed to an unbounded hang. That cap is global, though, so a slow first endpoint starves the failover budget for the rest. This fails one dead endpoint fast and lets failover advance to the next, within the overall handshake ceiling.

Mirrors `onUpstreamConnectComplete`'s failure path exactly (`cleanupFailedUpstreamConnect` → `tryNextDcEndpoint` for `.dc`, else `closeSlot`), driven from the timer tick. The deadline base is stamped per attempt and gated on `phase == .connecting_upstream`, so it can never touch an established relay; a healthy connect finishes in <1s, so working endpoints are unaffected. Deliberately does not raise `handshake_timeout_sec` (that would widen the active-probe/slow-loris window our DD-decision + flood guards exist to shrink).

## Not ported (verified against our code)

- [...] `SO_KEEPALIVE 60/10/3` + `TCP_USER_TIMEOUT=30s` (stricter), and a host-wide sysctl can't touch the pre-handshake sockets that lack `SO_KEEPALIVE` (already reaped by our idle/handshake timeouts).
- [...] `tg://` link (our links are immutable once distributed, so a hard no); the MSS half overlaps/conflicts with our existing `TCPMSS=88`; and it's a mechanism-free folk remedy for the iOS resume hang we already root-caused to the client MtProtoKit `bad_server_salt` bug (shipped `client_silence_close_sec`, filed Telegram-iOS#2197).

## Testing

- `zig build -Doptimize=ReleaseFast -Dtarget=x86_64-linux -Dcpu=x86_64_v3+aes` ✅
- `zig build test` ✅ (new unit tests: SYN-limit preset/rate/number validation + brace-free script render; `dc_connect_timeout_sec` parse/default).