
fix(runtime): offload async inference work #250

Merged: rylinjames merged 2 commits into main from fix/runtime-async-predict-offload on Jun 12, 2026
@rylinjames rylinjames commented Jun 11, 2026


Summary

  • run non-batched predict_async() calls through a dedicated bounded inference executor instead of blocking the event loop or growing the default executor queue
  • run _predict_batch_sync() from the legacy batch worker through the same bounded executor and return inference_executor_full error payloads when saturated
  • expose bounded executor metrics: running work, queued work, configured capacity, and rejected submissions
  • add --inference-executor-workers and --inference-executor-queue serve flags, with prod/a/b policy-slot labeling for metrics
  • add a generic ORT I/O Binding path for the denoise loop so constant inputs are bound once per chunk and per-step dynamic inputs use run_with_iobinding() when enabled
  • add regression tests for async offload, executor saturation/backpressure, executor metrics, and fake-session I/O Binding
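The offload and backpressure behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in `src/tether/runtime/inference_executor.py`: the names `BoundedInferenceExecutor`, `ExecutorFull`, and the standalone `predict_async()` helper are hypothetical, and the only details taken from this PR are the bounded capacity, the four metrics, and the `inference_executor_full` error payload. The running/queued split is an approximation derived from a single in-flight counter.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class ExecutorFull(RuntimeError):
    """Raised when running + queued work has reached the configured capacity."""


class BoundedInferenceExecutor:
    """Hypothetical sketch: a thread pool that rejects work instead of queueing without bound."""

    def __init__(self, workers: int = 1, queue_size: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._workers = workers
        self._capacity = workers + queue_size  # total running + queued slots
        self._lock = threading.Lock()
        self._in_flight = 0  # submitted but not yet finished
        self._rejected = 0

    def submit(self, fn, *args):
        # Reserve a slot first so the pool's internal queue can never grow past capacity.
        with self._lock:
            if self._in_flight >= self._capacity:
                self._rejected += 1
                raise ExecutorFull("inference_executor_full")
            self._in_flight += 1
        try:
            future = self._pool.submit(fn, *args)
        except BaseException:
            self._release()
            raise
        # Free the slot when the work finishes, fails, or is cancelled.
        future.add_done_callback(lambda _f: self._release())
        return future

    def _release(self) -> None:
        with self._lock:
            self._in_flight -= 1

    def metrics(self) -> dict:
        # Approximation: assume the first `workers` in-flight items are the running ones.
        with self._lock:
            running = min(self._in_flight, self._workers)
            return {
                "running": running,
                "queued": self._in_flight - running,
                "capacity": self._capacity,
                "rejected": self._rejected,
            }

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


async def predict_async(executor, predict_sync, payload):
    """Run blocking inference off the event loop; return an error payload when saturated."""
    try:
        future = executor.submit(predict_sync, payload)
    except ExecutorFull:
        return {"error": "inference_executor_full"}
    # wrap_future lets the event loop keep serving other requests while the worker runs.
    return await asyncio.wrap_future(future)
```

Rejecting at submit time, rather than letting the pool's queue grow, is what turns saturation into an immediate error payload the caller can act on instead of unbounded latency. In this sketch the `workers` and `queue_size` arguments stand in for the values passed via `--inference-executor-workers` and `--inference-executor-queue`; how the real flags map onto capacity is not shown in this PR description.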

Tests

  • /Users/romirjain/Desktop/building\ projects/fastcrest/tether/.venv/bin/ruff check src/tether/runtime/server.py src/tether/runtime/inference_executor.py src/tether/observability/prometheus.py src/tether/observability/__init__.py tests/test_inference_executor.py tests/test_server.py tests/test_observability_prometheus.py
  • PYTHONPATH=$PWD/src /Users/romirjain/Desktop/building\ projects/fastcrest/tether/.venv/bin/python -m py_compile src/tether/runtime/server.py src/tether/runtime/inference_executor.py src/tether/observability/prometheus.py src/tether/observability/__init__.py src/tether/cli.py tests/test_inference_executor.py tests/test_server.py tests/test_observability_prometheus.py
  • PYTHONPATH=$PWD/src /Users/romirjain/Desktop/building\ projects/fastcrest/tether/.venv/bin/python -m pytest tests/test_inference_executor.py tests/test_observability_prometheus.py tests/test_server.py::TestTetherServerWithMockORT::test_predict_async_offloads_non_batched_predict tests/test_server.py::TestTetherServerWithMockORT::test_predict_async_rejects_when_executor_is_full tests/test_server.py::TestTetherServerWithMockORT::test_batch_worker_offloads_sync_batch_predict tests/test_server.py::TestTetherServerWithMockORT::test_denoise_uses_iobinding_when_enabled tests/test_serve_e2e.py -p no:cacheprovider

Note: tests/test_chunk_budget_integration.py could not run in this local venv because onnxruntime is not installed.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rylinjames rylinjames force-pushed the fix/runtime-async-predict-offload branch from 9513555 to 8650bc5 Compare June 11, 2026 09:30
@rylinjames rylinjames merged commit 799dc83 into main Jun 12, 2026
6 checks passed
@rylinjames rylinjames deleted the fix/runtime-async-predict-offload branch June 12, 2026 17:23