feat: optional kernel SYN rate-limiter + per-endpoint DC connect timeout #363

Merged: sleep3r merged 3 commits into main from feat/synlimit-and-dc-connect-timeout on Jun 15, 2026

@sleep3r sleep3r commented Jun 15, 2026

Owner

Two improvements drawn from analysing MTproxy-reanimation (a tuning wrapper for Telemt/MTProxyMax). That project is not a proxy fork — its value is OS/network-level techniques applied around a proxy. Of its five techniques, only two are worth porting; the other three we already do better or they conflict with our invariants (see "Not ported" below). Each kept technique was adversarially verified against our actual code.


1. mtbuddy setup syn-limit — optional kernel SYN rate-limiter (T1)

A default-OFF, per-source-IP inbound SYN rate-limiter on the proxy port via iptables -m hashlimit, run as a separate systemd oneshot (mtproto-syn-limit.service).

Why it's not redundant with our in-proxy guards: the handshake flood guard and per-/24 subnet limiter both run after accept() and both default OFF (NAT/VPN false-positives). So today every abusive SYN still costs a kernel socket + an accept() syscall. A kernel hashlimit drops excess SYNs pre-accept — a different layer, complementary not duplicate.

Design choices

  • iptables hashlimit, not nftables — matches our existing stack (TCPMSS, nfqws); no new dependency.
  • Separate oneshot unit so CAP_NET_ADMIN never has to be granted to mtproto-proxy.
  • Default OFF behind a loud CGNAT/VPN warning (same shared-egress-IP problem that keeps the in-proxy guards off). Warns when accept_proxy_protocol is set (kernel sees the LB IP, not clients).
  • Presets mirror the source: soft 2/s burst 5 (default when enabling, CGNAT-safer), medium 1/s burst 3, hard 1/s burst 1.
  • Drop counter in mtbuddy status; verifies the rule actually landed (xt_hashlimit present) instead of a silent no-op.
  • Idempotent apply + full uninstall cleanup (replay-delete INPUT jumps → flush → delete chain, v4 & v6). Generated script is brace-free (so std.fmt renders it) and rate/burst/port are validated before being baked in.
mtbuddy setup syn-limit --preset soft        # enable (CGNAT-safer default)
mtbuddy setup syn-limit --rate 1/second --burst 3
mtbuddy setup syn-limit --remove
mtbuddy setup syn-limit --status
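
The "validate, then render" step described above can be sketched roughly as follows. This is an illustration in Python, not the actual Zig renderer from this PR: the chain name `MTPROTO_SYNLIMIT`, the hashlimit table name `mtsyn`, the accepted rate units, and the burst upper bound are all assumptions made for the example. Only the preset values and the iptables hashlimit options come from the description and the iptables-extensions documentation.

```python
import re

# Presets as listed in the PR description: name -> (rate, burst).
PRESETS = {
    "soft": ("2/second", 5),
    "medium": ("1/second", 3),
    "hard": ("1/second", 1),
}

# Hypothetical names: the PR does not state the real chain or table names.
CHAIN = "MTPROTO_SYNLIMIT"
TABLE = "mtsyn"

# Assumed rate grammar: a small positive integer plus a hashlimit time unit.
_RATE_RE = re.compile(r"[1-9][0-9]{0,4}/(second|minute|hour|day)")


def _is_int_in(value, low, high):
    """True for a real int (not bool) inside [low, high]."""
    return isinstance(value, int) and not isinstance(value, bool) and low <= value <= high


def validate(port, rate, burst):
    """Reject unexpected input before it is baked into a firewall command."""
    if not _is_int_in(port, 1, 65535):
        raise ValueError(f"bad port: {port!r}")
    if not isinstance(rate, str) or not _RATE_RE.fullmatch(rate):
        raise ValueError(f"bad rate: {rate!r}")
    if not _is_int_in(burst, 1, 10000):  # upper bound is an assumption
        raise ValueError(f"bad burst: {burst!r}")


def render_apply(port, rate, burst):
    """Return argv lists that create the chain, add the drop rule, and hook INPUT."""
    validate(port, rate, burst)
    rules = []
    for tool in ("iptables", "ip6tables"):
        rules.append([tool, "-N", CHAIN])
        # Drop packets from any source IP that exceeds rate/burst.
        rules.append([
            tool, "-A", CHAIN, "-m", "hashlimit",
            "--hashlimit-name", TABLE, "--hashlimit-mode", "srcip",
            "--hashlimit-above", rate, "--hashlimit-burst", str(burst),
            "-j", "DROP",
        ])
        # Only new inbound SYNs on the proxy port are sent through the chain.
        rules.append([tool, "-I", "INPUT", "-p", "tcp",
                      "--dport", str(port), "--syn", "-j", CHAIN])
    return rules


def render_remove(port):
    """Return argv lists for teardown: delete INPUT jump, flush, delete chain."""
    validate(port, "1/second", 1)  # only the port matters here
    rules = []
    for tool in ("iptables", "ip6tables"):
        rules.append([tool, "-D", "INPUT", "-p", "tcp",
                      "--dport", str(port), "--syn", "-j", CHAIN])
        rules.append([tool, "-F", CHAIN])
        rules.append([tool, "-X", CHAIN])
    return rules
```

Calling `render_apply(443, *PRESETS["soft"])` yields six argv lists (three per address family) that could be joined with spaces and written into a brace-free script. A real teardown would presumably repeat the `-D INPUT` step until it fails, so that duplicate jumps left by earlier runs are also removed.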

2. dc_connect_timeout_sec — per-endpoint DC connect timeout (T2)

Default 10s. A filtered/black-holed DC endpoint sends no RST, so the kernel sits in SYN_SENT ~2 min. handshake_timeout_sec (15s from the client's first byte) already caps the whole handshake — so we are not exposed to an unbounded hang — but that cap is global, so a slow first endpoint starves the failover budget for the rest. This fails one dead endpoint fast and lets failover advance to the next, within the overall handshake ceiling.

Mirrors onUpstreamConnectComplete's failure path exactly (cleanupFailedUpstreamConnect → tryNextDcEndpoint for .dc, else closeSlot), driven from the timer tick. The deadline base is stamped per attempt, and the check is gated on phase == .connecting_upstream, so it can never touch an established relay; a healthy connect finishes in <1s, so working endpoints are unaffected. Deliberately does not raise handshake_timeout_sec (that would widen the active-probe/slow-loris window our DD-decision + flood guards exist to shrink).
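
A minimal sketch of the per-attempt deadline check, written in Python for illustration rather than taken from the proxy's Zig code. The `Slot` layout, the `Phase` enum, and the function names are simplified stand-ins. Only `upstream_connect_started_ms`, the `connecting_upstream` phase gate, and the 10-second default come from the description above. The overall `handshake_timeout_sec` ceiling and socket cleanup are not modelled.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Phase(Enum):
    CONNECTING_UPSTREAM = auto()
    RELAYING = auto()
    CLOSED = auto()


@dataclass
class Slot:
    endpoints: list                      # candidate DC endpoints, in failover order
    endpoint_index: int = 0
    phase: Phase = Phase.CLOSED
    upstream_connect_started_ms: int = 0


def start_connect_upstream(slot, now_ms):
    """Begin a connect attempt and stamp the deadline base for this attempt."""
    slot.phase = Phase.CONNECTING_UPSTREAM
    slot.upstream_connect_started_ms = now_ms


def on_timer_tick(slot, now_ms, dc_connect_timeout_sec=10):
    """Abandon a connect that has exceeded its per-endpoint deadline."""
    # Gate on the phase so an established relay is never touched.
    if slot.phase is not Phase.CONNECTING_UPSTREAM:
        return
    elapsed_ms = now_ms - slot.upstream_connect_started_ms
    if elapsed_ms < dc_connect_timeout_sec * 1000:
        return
    if slot.endpoint_index + 1 < len(slot.endpoints):
        # Fail over: the next attempt gets a fresh deadline base.
        slot.endpoint_index += 1
        start_connect_upstream(slot, now_ms)
    else:
        # No endpoints left: close the slot.
        slot.phase = Phase.CLOSED
```

With the defaults quoted above (10 s per endpoint, 15 s for the whole handshake), a black-holed first endpoint is abandoned at 10 s and leaves about 5 s for the next one. Without the per-endpoint deadline, the first endpoint would consume the entire 15 s budget.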


Not ported (verified against our code)

  • iOS keepalive sysctl (60/15/3): no-op for us — relay sockets already set SO_KEEPALIVE 60/10/3 + TCP_USER_TIMEOUT=30s (stricter), and a host-wide sysctl can't touch the pre-handshake sockets that lack SO_KEEPALIVE (already reaped by our idle/handshake timeouts).
  • iOS MSS=92 + separate port DNAT: a second port = a second tg:// link (our links are immutable once distributed — hard no); the MSS half overlaps/conflicts with our existing TCPMSS=88; and it's a mechanism-free folk remedy for the iOS resume hang we already root-caused to the client MtProtoKit bad_server_salt bug (shipped client_silence_close_sec, filed Telegram-iOS#2197).
  • Deployment autodetection + systemd persistence: already implemented for our own stack.

Testing

  • zig build -Doptimize=ReleaseFast -Dtarget=x86_64-linux -Dcpu=x86_64_v3+aes
  • zig build test ✅ (new unit tests: SYN-limit preset/rate/number validation + brace-free script render; dc_connect_timeout_sec parse/default).
  • Proxy core change is byte-transparent and off the happy path; the firewall feature is opt-in and isolated in its own unit.

sleep3r added 3 commits June 15, 2026 16:11
Adds an OPTIONAL, default-OFF kernel-level per-source-IP inbound SYN
rate-limiter on the proxy port, via iptables `-m hashlimit`. It drops
abusive first-SYN bursts in the kernel BEFORE accept(), complementing the
in-proxy guards (handshake flood guard / per-/24 subnet limiter) which run
AFTER accept() and themselves default OFF — so every attack SYN currently
still costs a socket + an accept() syscall.

Idea borrowed from MTproxy-reanimation (which does this with nftables for
Telemt/MTProxyMax). We use iptables hashlimit to match our existing
iptables stack (TCPMSS, nfqws) — no new nftables dependency — and run it as
a SEPARATE systemd oneshot unit (mtproto-syn-limit.service) so CAP_NET_ADMIN
never has to be granted to mtproto-proxy itself.

- `mtbuddy setup syn-limit [--preset soft|medium|hard] [--rate N/second]
  [--burst N] [--remove] [--status]`; interactive menu item too.
- Presets mirror the source: soft 2/s burst 5 (default when enabling,
  CGNAT-safer), medium 1/s burst 3, hard 1/s burst 1.
- Default OFF behind a loud CGNAT/VPN false-positive warning (the same
  shared-egress-IP problem that keeps the in-proxy guards off); warns when
  accept_proxy_protocol is set (kernel sees the LB IP, not real clients).
- Drop counter surfaced in `mtbuddy status`; verifies the rule actually
  landed (xt_hashlimit present) instead of a silent no-op.
- Idempotent apply (remove-before-add) and full uninstall cleanup
  (replay-delete INPUT jumps → flush → delete chain, both v4/v6).
- The generated script avoids `{`/`}` so std.fmt can render it; rate/burst/
  port are validated before being baked in. Pure renderer is unit-tested.

Adds `dc_connect_timeout_sec` (default 10): a per-endpoint deadline for
completing the TCP connect to a Telegram DC endpoint.

A filtered/black-holed endpoint sends no RST, so the kernel keeps the
connect in SYN_SENT for ~2 min. handshake_timeout_sec (15s, measured from
the client's first byte) already caps the whole client handshake, so we are
NOT exposed to an unbounded hang — but that cap is GLOBAL, so a slow first
endpoint starves the failover budget for the remaining endpoints. This
fires per endpoint: if connect() hasn't completed within the deadline, the
endpoint is abandoned and failover advances to the next one (within the
overall handshake_timeout_sec ceiling).

Implementation mirrors onUpstreamConnectComplete's failure path exactly
(cleanupFailedUpstreamConnect → tryNextDcEndpoint for .dc kinds, else
closeSlot), driven from the timer tick. `upstream_connect_started_ms` is
stamped per attempt in startConnectUpstream (reset on every endpoint via the
pool's slot.* = .{}), and the check is gated on phase == .connecting_upstream
so it can never touch an established relay. A healthy connect completes in
well under a second, so this never affects working endpoints.

Inspired by the tg_connect knob in the MTproxy-reanimation analysis;
deliberately does NOT raise handshake_timeout_sec (that would widen the
active-probe / slow-loris window the DD-decision and flood guards exist to
shrink). idle_timeout_sec=120 already equals their client_keepalive=120.

Mirror the two new features across all five READMEs (en/ru/zh/fa/vi) and
THREAT_MODEL.md:
- `mtbuddy setup syn-limit` command (presets, status, remove), an entry in
  the abuse-guards note framing it as the optional kernel-level layer that
  drops SYN bursts before accept() (separate unit, no CAP_NET_ADMIN on the
  proxy), and the config-table / example mention.
- `dc_connect_timeout_sec` config example line + table row.

Translations keep technical tokens (commands, flags, config keys,
hashlimit, CAP_NET_ADMIN, accept(), SYN, SYN_SENT, RST, CGNAT) verbatim and
match each file's existing style.
@sleep3r sleep3r merged commit b346b75 into main Jun 15, 2026
8 checks passed
@sleep3r sleep3r deleted the feat/synlimit-and-dc-connect-timeout branch June 15, 2026 13:27