fix(runtime): offload async inference work by rylinjames · Pull Request #250 · FastCrest/tether

rylinjames · 2026-06-11T09:24:34Z

Summary

run non-batched predict_async() calls through a dedicated bounded inference executor instead of blocking the event loop or growing the default executor queue
run _predict_batch_sync() from the legacy batch worker through the same bounded executor and return inference_executor_full error payloads when saturated
expose bounded executor metrics: running work, queued work, configured capacity, and rejected submissions
add --inference-executor-workers and --inference-executor-queue serve flags, with prod/a/b policy-slot labeling for metrics
add a generic ORT I/O Binding path for the denoise loop so constant inputs are bound once per chunk and per-step dynamic inputs use run_with_iobinding() when enabled
add regression tests for async offload, executor saturation/backpressure, executor metrics, and fake-session I/O Binding
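The bounded-executor behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/tether/runtime/inference_executor.py: the names `BoundedInferenceExecutor`, `ExecutorFull`, `predict_async`, and `metrics()` are hypothetical, the shape of the error payload is assumed, and the `workers` and `queue_size` parameters are only presumed to correspond to --inference-executor-workers and --inference-executor-queue.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class ExecutorFull(RuntimeError):
    """Raised when every worker is busy and the wait queue is full (hypothetical name)."""


class BoundedInferenceExecutor:
    """Thread pool that rejects new work instead of queueing without limit."""

    def __init__(self, workers: int = 1, queue_size: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._workers = workers
        self._capacity = workers + queue_size  # running + queued upper bound
        self._lock = threading.Lock()
        self._in_flight = 0  # submitted but not yet finished
        self._rejected = 0

    def submit(self, fn, *args):
        # Reserve a slot first, so saturation is detected before queueing.
        with self._lock:
            if self._in_flight >= self._capacity:
                self._rejected += 1
                raise ExecutorFull("inference_executor_full")
            self._in_flight += 1
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            with self._lock:
                self._in_flight -= 1
            raise
        future.add_done_callback(self._release)
        return future

    def _release(self, _future) -> None:
        with self._lock:
            self._in_flight -= 1

    def metrics(self) -> dict:
        # "running" is an estimate: a slot counts as running once a worker
        # is available for it, even if the thread has not picked it up yet.
        with self._lock:
            running = min(self._in_flight, self._workers)
            return {
                "running": running,
                "queued": self._in_flight - running,
                "capacity": self._capacity,
                "rejected": self._rejected,
            }


async def predict_async(executor, predict_sync, payload):
    """Offload a blocking predict call; return an error payload when saturated."""
    try:
        future = executor.submit(predict_sync, payload)
    except ExecutorFull:
        # Assumed payload shape; the PR only names the error code.
        return {"error": "inference_executor_full"}
    # wrap_future lets the event loop await the thread-pool result without blocking.
    return await asyncio.wrap_future(future)
```

Rejecting at submit time, rather than letting the pool's internal queue grow, is what gives callers a backpressure signal they can act on, and the same counters feed the four metrics listed above.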

/Users/romirjain/Desktop/building\ projects/fastcrest/tether/.venv/bin/ruff check src/tether/runtime/server.py src/tether/runtime/inference_executor.py src/tether/observability/prometheus.py src/tether/observability/__init__.py tests/test_inference_executor.py tests/test_server.py tests/test_observability_prometheus.py
PYTHONPATH=$PWD/src /Users/romirjain/Desktop/building\ projects/fastcrest/tether/.venv/bin/python -m py_compile src/tether/runtime/server.py src/tether/runtime/inference_executor.py src/tether/observability/prometheus.py src/tether/observability/__init__.py src/tether/cli.py tests/test_inference_executor.py tests/test_server.py tests/test_observability_prometheus.py
PYTHONPATH=$PWD/src /Users/romirjain/Desktop/building\ projects/fastcrest/tether/.venv/bin/python -m pytest tests/test_inference_executor.py tests/test_observability_prometheus.py tests/test_server.py::TestTetherServerWithMockORT::test_predict_async_offloads_non_batched_predict tests/test_server.py::TestTetherServerWithMockORT::test_predict_async_rejects_when_executor_is_full tests/test_server.py::TestTetherServerWithMockORT::test_batch_worker_offloads_sync_batch_predict tests/test_server.py::TestTetherServerWithMockORT::test_denoise_uses_iobinding_when_enabled tests/test_serve_e2e.py -p no:cacheprovider

Note: tests/test_chunk_budget_integration.py could not run in this local venv because onnxruntime is not installed.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(runtime): offload async inference work

8650bc5


rylinjames force-pushed the fix/runtime-async-predict-offload branch from 9513555 to 8650bc5 June 11, 2026 09:30

fix(runtime): bound async inference executor

f9d70b4


rylinjames merged commit 799dc83 into main Jun 12, 2026
6 checks passed

rylinjames deleted the fix/runtime-async-predict-offload branch June 12, 2026 17:23