Add /load-shareurl proxy for hosts blocked on some school networks #623
Conversation
Joe says:
We really want to make sure that shareurl works from raw.githubusercontent.com. Notably this proxies *only* that domain for now, specifically to make sure we support this common use case. In general I expect that less constrained users (e.g. ugrad) just fetch code happily with CORS from the browser in the long run.
This also comes with some improvements to proxying, extracted into a helper:
- Real size limits
- Streaming, not blocking waiting for the whole response
- CSP headers so if someone tries to load something executable (e.g. in SVG) it really can't run
- Some filtering on redirects to avoid weird cases of redirecting to 127.0.0.1 inside the server
Since we have existing proxying for images it's nice to improve this and have a good helper.
Claude says (lightly edited by Joe):
Some classrooms will likely load curriculum via #shareurl from raw.githubusercontent.com on networks that block GitHub. The new /load-shareurl endpoint fetches an allowlisted URL on the browser's behalf and streams the response back, so those classrooms get a working path.
A small window.fetch shim in beforePyret.js routes any fetch to an allowlisted host through the proxy automatically. That covers every browser fetch path in one place: drive.js's makeUrlFile, the url/url-file import prefetches in cpo-main.js, and the Pyret runtime's F.fetch via cross-fetch.
/downloadImg moves to share the same streaming-fetch helper, picking up nosniff + sandbox response headers, an IP-filtering HTTP agent (request-filtering-agent), a response size cap, and a request timeout.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
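The shim idea can be sketched roughly as follows. This is a minimal illustration, not the real beforePyret.js code: `PROXIED_HOSTS`, `toProxyUrl`, and the `url` query-parameter name are assumptions.

```javascript
// Hypothetical sketch of a window.fetch shim that routes allowlisted
// hosts through a /load-shareurl proxy. Names are illustrative.
const PROXIED_HOSTS = new Set(["raw.githubusercontent.com"]);

// Rewrite an allowlisted absolute URL to go through the proxy;
// other hosts and relative paths pass through untouched.
function toProxyUrl(url) {
  try {
    const u = new URL(url);
    if (PROXIED_HOSTS.has(u.hostname)) {
      return "/load-shareurl?url=" + encodeURIComponent(u.href);
    }
  } catch (e) {
    // Relative or malformed URLs fall through unchanged.
  }
  return url;
}

// Installed over window.fetch, one shim covers every browser fetch path.
if (typeof window !== "undefined") {
  const realFetch = window.fetch.bind(window);
  window.fetch = (input, init) =>
    realFetch(typeof input === "string" ? toProxyUrl(input) : input, init);
}
```

Because the rewrite happens below every caller, drive.js, cpo-main.js, and the runtime's cross-fetch all pick it up without individual changes.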
Ben says:
Seems reasonably straightforward. Do we want to proxy raw.githubusercontent automatically, or only if it fails first? Do you intend to have different timeouts in the two proxy usages? Or should those be made constants instead of magic numbers?
Joe says:
I'll do some magic-number fixup, that's good feedback. Re: eager vs. fallback proxying: the problem with fallback is that in the failure case it can take a long time for the first request to fail on a school content filter, so the experience is lousy. If we end up proxying a lot, we can do some heuristic thing where we try for 2s or something, and give up and proxy if that doesn't work.
Ben says:
Maybe we start proxying automatically, and then in the background, we try to request a specific file that we own, without a proxy: if it goes through, we can disable the proxy?
Joe says:
Good point. We can even just do proxy/non-proxy in parallel on the first request, and then disable the proxy if the regular request succeeds. If we add more domains, that becomes a dictionary of "proxy-for" different ones. But that's not the worst thing ever, and it would adapt to firewall rules at schools...
Claude says (Joe felt little need to edit):
Replaces the four magic numbers in the /downloadImg and /load-shareurl proxy call sites with named module-level constants. Bumps the image proxy to 20 MB / 30 s so that fetches of phone-sized images and slow Drive ?export= responses don't get cut off; /load-shareurl stays at 1 MB / 10 s since it's plaintext from a fast host.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joe says:
Constraints:
- Don't want to proxy everything (be kind to our servers)
- Don't want to wait a long time for the direct request to fail (flaky network, weird school blocker)
So we kick off both the proxy and the direct request the first time we have a proxy-able candidate.
- If the direct request succeeds, cancel the proxy request to CPO and do direct requests for that domain from this page
- If the direct request fails, use only proxy requests for that domain for this page
A direct request counts as success if it returns 200 OK and text/plain (making sure we don't weirdly accept 200 OK with, say, an HTML login page because of a school SSO portal).
I think this is a good balance of server load and letting working clients do their thing.
Claude says:
Responding to Ben's PR feedback to do better than always-proxy. The first fetch to an allowlisted host now runs direct + /load-shareurl in parallel; whichever returns first that verifies (direct = 2xx text/plain) is served to the caller, and the outcome of the direct request locks per-host shouldProxy state for the rest of the page-load. So unrestricted networks pay the proxy hop only once and rapidly switch to direct; blocked networks see direct fail and switch to proxy.
Header-only verification (rather than body comparison) lets us abort the in-flight proxy fetch as soon as direct is trusted, keeping server load near zero on the healthy path. proxyStreamFetch now also tears down the upstream connection when the client disconnects, so the abort actually saves bandwidth on the server side too.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
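The "direct counts as success" rule above boils down to a small predicate. A hedged sketch, assuming the hypothetical name `directLooksTrustworthy`; it only relies on `response.ok` and `headers.get`, which any fetch Response provides:

```javascript
// A direct response is trusted only if it is 2xx AND text/plain, so a
// school SSO portal's 200 OK HTML login page does not get mistaken for
// curriculum code. Name and factoring are illustrative, not CPO's code.
function directLooksTrustworthy(response) {
  const contentType = response.headers.get("content-type") || "";
  return response.ok && contentType.split(";")[0].trim() === "text/plain";
}
```

Checking only headers (not the body) is what allows the in-flight proxy request to be aborted early on healthy networks.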
Joe says:
I think I constrained Claude so much in this session that it had to say things
I like below :-)
This is post-convo with Ben where we were still unhappy with (a) variable names
and (b) closed over mutable variables instead of declarative promise APIs.
Claude says:
Following a second pass of code review: the previous version used a
new Promise((resolve, reject) => { let gotSomeResponse = false; ... })
block with a flag tracking which side had won. That conflated two
concerns — "first success wins the response" and "abort proxy iff direct
beat it" — into one imperative state machine.
This pass separates them. The race is expressed as three composed
Promise primitives:
- directP: direct fetch, gated on content-type verification (rejects
if direct returned the wrong shape, so it drops out of the response
race entirely).
- shouldProxyPromise (Promise.race): false iff direct verified before
the timeout; locks the per-host policy for the page-load.
- directFinishedSuccessfullyAndFirstP (Promise.race): does direct
settle before proxy, AND was that settlement a verified success?
Only then do we proxyCtrl.abort(). This keeps the abort safe — we
never abort once proxy has already returned headers, which would
error the caller's body stream mid-read.
- responsePromise (Promise.any): whichever of directP or proxyP
fulfills first.
Tested both healthy and DevTools-blocked flows. Observed the timing
edge case where both proxy and direct return 200 within a few ms of
each other: shouldProxy still locks 'direct' as expected, since the
verdict is decoupled from whether the abort caught proxy in time.
Also adds a comment explaining the deliberate non-handling of caller-
provided init.signal: the signal overwrite means a caller aborting
won't cancel the proxy fetch, but the alternative (event-listener
forwarding or AbortSignal.any) is more complexity than the
not-quite-fully-aborted case justifies right now.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
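The composed-promise structure described above can be sketched as follows. This is a minimal illustration under stated assumptions: `racingFetch`, its parameters, and the stand-in `verify` callback are hypothetical, and real fetch calls are abstracted into injected functions.

```javascript
// Sketch of the three composed Promise primitives: directP gated on
// verification, shouldProxyPromise locking per-host policy, and a
// Promise.any response race. Not the real CPO implementation.
async function racingFetch(directFetch, proxyFetch, verify, timeoutMs) {
  const proxyCtrl = new AbortController();
  const proxyP = proxyFetch(proxyCtrl.signal);

  // directP rejects if the direct response has the wrong shape, so it
  // drops out of the response race entirely (e.g. an SSO login page).
  const directP = directFetch().then((resp) => {
    if (!verify(resp)) throw new Error("direct response failed verification");
    return resp;
  });

  // false iff direct verified before the timeout; this verdict locks
  // the per-host shouldProxy policy for the rest of the page-load.
  const shouldProxyPromise = Promise.race([
    directP.then(() => false, () => true),
    new Promise((res) => setTimeout(() => res(true), timeoutMs)),
  ]);

  // Did direct settle before proxy, AND was it a verified success?
  // Only then abort the proxy; never after proxy has already won.
  const directFirstP = Promise.race([
    directP.then(() => true, () => false),
    proxyP.then(() => false, () => false),
  ]);
  directFirstP.then((directWon) => { if (directWon) proxyCtrl.abort(); });

  // Whichever of directP or proxyP fulfills first serves the caller.
  const responsePromise = Promise.any([directP, proxyP]);
  return { responsePromise, shouldProxyPromise };
}
```

Note how the policy verdict (`shouldProxyPromise`) is decoupled from the abort decision (`directFirstP`), matching the observed timing edge case where both sides return within a few milliseconds.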