Security hardening, retry unification, model allowlists, image-text-to-text, and Kubernetes deployment #14

Merged: eduardoworrel merged 9 commits into main from fix/security-hardening-and-retry-unification on Feb 25, 2026

eduardoworrel (Contributor) commented Feb 19, 2026

Summary

  • Security hardening (S-series): mitigate SSRF, path traversal, unbounded file uploads, WebSocket abuse, missing rate-limiting on WS, and other vulnerabilities across the HTTP/WebSocket/queue pipeline
  • Retry unification (A11): unify the two independent retry counters (PostProcessingQueue Redis-based + SessionTrackQueue in-memory) into a single PrivateArgs["retry_count"] stored in Redis, with a shared ceiling of 3 retries; register SessionTrackQueue in DI (was dead code), fix message deserialization, fix timer replacement for retried tasks, publish task_completion on success, and clean up Redis keys
  • Error detection fix (E1): replace fragile response.Contains("\"Status\":\"Error\"") string matching with proper JsonDocument.Parse — prevents false HTTP 500 on valid responses that happen to contain the word "error" in their content
  • Model/dtype allowlists (E2): add AllowedModels and AllowedDtypes to FieldsConfig, validated in IsValidFields() at the API boundary — rejects arbitrary model strings before they reach worker nodes
  • New task type: image-text-to-text — adds multimodal vision AI endpoint (/api/v1/image-text-to-text) with full pipeline support (preprocessing, distribution, post-processing) for SmolVLM and similar models
  • Kubernetes deployment: add k8s manifests for all services (core-api, core-websocket, core-background, client, redis) with namespace, PVC, cert-manager issuer, and ingress routing for subdomain architecture:
    • woolball.xyz → landing page
    • open.woolball.xyz → client app + API/WebSocket/Swagger
    • api.woolball.xyz → core-api (dedicated)
    • ws.woolball.xyz → core-websocket (dedicated)
  • Refactoring: extract the 484-line TaskRequest god file into separate handler classes (SpeechToTextTaskHandler, TextToSpeechTaskHandler, TranslationTaskHandler, TextGenerationTaskHandler, ImageTextToTextTaskHandler), plus FieldsConfig, ITaskHandler, TaskHandlerFactory, StreamBuffer<T>, TaskRequestFactory, SpeechToTextFormProcessor
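
The error-detection change (E1) can be illustrated with a small sketch. It is written in Python as a stand-in for the C# `JsonDocument.Parse` approach named above and is not the project's code. The payload shape (a top-level `Status` field, with `Completed` and `Items` as example names) is an assumption made for illustration.

```python
import json


def is_error_response_naive(raw: str) -> bool:
    # Old approach: substring match anywhere in the serialized payload.
    return '"Status":"Error"' in raw


def is_error_response(raw: str) -> bool:
    # Structural approach: only a top-level Status field equal to "Error" counts.
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        # Assumption: an unparseable payload is not classified as a worker error here.
        return False
    return isinstance(doc, dict) and doc.get("Status") == "Error"
```

With a payload such as `{"Status":"Completed","Items":[{"Status":"Error"}]}`, the substring check reports an error while the structural check does not, which is the kind of false positive the change is meant to remove.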

Bugs fixed

| ID | Bug | Fix |
|---|---|---|
| S8-S14 | Multiple security vulnerabilities (SSRF, path traversal, unbounded uploads, WS abuse) | Input sanitization, file size limits, origin validation |
| A11-0 | SessionTrackQueue never registered in DI (dead code) | Added AddHostedService<SessionTrackQueue>() |
| A11-0a | DistributeQueue publishes GUID string, SessionTrackQueue tries Deserialize<TaskRequest> | Changed to Guid.TryParse |
| A11-1 | task_completion never published, so timers were never cancelled on success | PostProcessingQueue now publishes task_completion and deletes the Redis key |
| A11-2 | _taskAttempts.GetOrAdd(taskId, 1) never increments, causing infinite retries | Removed in-memory counter, uses Redis retry_count |
| A11-3 | TryAdd ignores re-dispatched tasks, so no new timer is started | Changed to AddOrUpdate that cancels previous timer |
| A11-4 | No coordination between two retry counters | Both systems now read/write same PrivateArgs["retry_count"] in Redis |
| A11-5 | task:{id} never deleted after success | PostProcessingQueue deletes key after successful processing |
| E1 | String matching error detection returns HTTP 500 for valid responses | Structural JSON parsing via JsonDocument |
| E2 | No validation for model/dtype fields | Per-handler allowlists from README checked in IsValidFields() |
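
The timer fixes in A11-1 and A11-3 follow a common pattern: starting a timer for a task that already has one must cancel and replace the old timer, and completion must cancel whatever timer is current. A minimal Python sketch of that pattern follows; `TaskTimers` and its method names are hypothetical, and `threading.Timer` stands in for whatever timer type SessionTrackQueue uses.

```python
import threading


class TaskTimers:
    """Keeps at most one live timeout timer per task id."""

    def __init__(self):
        self._timers = {}
        self._lock = threading.Lock()

    def start(self, task_id, timeout_s, on_timeout):
        timer = threading.Timer(timeout_s, on_timeout, args=(task_id,))
        timer.daemon = True
        with self._lock:
            # Add-or-update: a re-dispatched task replaces its previous timer
            # instead of being ignored, which is what an add-only insert would do.
            previous = self._timers.get(task_id)
            if previous is not None:
                previous.cancel()
            self._timers[task_id] = timer
        timer.start()
        return timer

    def complete(self, task_id):
        # Called when a completion notification arrives for the task.
        with self._lock:
            timer = self._timers.pop(task_id, None)
        if timer is None:
            return False
        timer.cancel()
        return True
```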

Test plan

  • dotnet build passes with 0 errors
  • Verify grep -rn "_taskAttempts" src/ returns empty (in-memory counter fully removed)
  • Verify grep -rn "task_completion" src/ returns 2+ results (publisher + subscriber)
  • Verify grep -rn "AddHostedService<SessionTrackQueue>" src/ returns 1 result
  • Send a translation request with valid content containing "error" — should return HTTP 200
  • Send a request with an invalid model string — should return HTTP 400 with descriptive message
  • End-to-end task flow: dispatch → worker → PostProcessingQueue publishes completion → SessionTrackQueue cancels timer → Redis key deleted
  • kubectl apply -f k8s/ deploys all services successfully
  • Image-text-to-text endpoint accepts image + text and returns generated description
  • Ingress routes woolball.xyz, open.woolball.xyz, api.woolball.xyz, ws.woolball.xyz correctly

🤖 Generated with Claude Code

eduardoworrel and others added 8 commits February 19, 2026 10:35
Add InputSanitizer with URL validation (blocks private/reserved IPs),
filename sanitization (Path.GetFileName), and file size limits (100MB).
Apply to SpeechToTextTaskHandler and configure Kestrel/FormOptions limits.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
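
A rough Python sketch of the checks this commit describes (URL validation that blocks private and reserved IPs, filename sanitization, and a 100 MB cap). It is not the project's InputSanitizer; the function names are hypothetical, and hostnames are only checked against `localhost` here, whereas a real SSRF defence would also need to validate the addresses a hostname resolves to.

```python
import ipaddress
import posixpath
from urllib.parse import urlparse

MAX_FILE_BYTES = 100 * 1024 * 1024  # 100 MB limit from the commit message


def is_url_allowed(url: str) -> bool:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    try:
        ip = ipaddress.ip_address(parsed.hostname)
    except ValueError:
        # Not an IP literal. Assumption: DNS resolution checks happen elsewhere.
        return parsed.hostname.lower() != "localhost"
    blocked = (
        ip.is_private or ip.is_loopback or ip.is_link_local
        or ip.is_reserved or ip.is_multicast or ip.is_unspecified
    )
    return not blocked


def sanitize_filename(name: str) -> str:
    # Keep only the final path component, similar in spirit to Path.GetFileName.
    return posixpath.basename(name.replace("\\", "/"))


def within_size_limit(size_bytes: int) -> bool:
    return 0 <= size_bytes <= MAX_FILE_BYTES
```
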
S8: Validate base64 data-URL MIME type and audio magic bytes before
writing decoded content to disk (InputSanitizer).

S9: Replace permissive CORS (AllowCredentials + any origin) with
AllowAnyOrigin without credentials to prevent CSRF.

S11: Remove full payload logging from TaskBusinessLogic (result messages),
sanitize exception logs across TasksEndPoints, TaskSockets, and
AudioValidation to avoid leaking sensitive data.

S12: Add UseHttpsRedirection to WebSocket/Program.cs.

S13: Add rate limiter (50 req/min) to WebSocket service and apply
RequireRateLimiting to the ws/{id} route.

S14: Replace unbounded string concatenation with StringBuilder and
enforce 10 MB max message size per WebSocket frame accumulation,
closing with MessageTooBig on violation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
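
The S14 change can be pictured as a bounded accumulator: frames are collected until the end-of-message flag, and the connection is closed once the running total passes the cap. The Python sketch below is an illustration only; `FrameAccumulator` is a hypothetical name, and 1009 is the WebSocket close code for "message too big" defined in RFC 6455.

```python
MAX_MESSAGE_BYTES = 10 * 1024 * 1024  # 10 MB cap from the commit message
CLOSE_MESSAGE_TOO_BIG = 1009          # RFC 6455 close code


class MessageTooBig(Exception):
    close_code = CLOSE_MESSAGE_TOO_BIG


class FrameAccumulator:
    def __init__(self, max_bytes: int = MAX_MESSAGE_BYTES):
        self._max = max_bytes
        self._chunks = []
        self._size = 0

    def add(self, frame: bytes, end_of_message: bool):
        # Reject before buffering so memory use never exceeds the cap.
        if self._size + len(frame) > self._max:
            self._reset()
            raise MessageTooBig(f"message exceeds {self._max} bytes")
        self._chunks.append(frame)
        self._size += len(frame)
        if not end_of_message:
            return None
        # Join once at the end instead of concatenating on every frame.
        message = b"".join(self._chunks).decode("utf-8")
        self._reset()
        return message

    def _reset(self):
        self._chunks = []
        self._size = 0
```
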
…kQueue

Both systems now share retry_count via PrivateArgs in Redis instead of
independent in-memory counters, with a combined ceiling of 3 retries.

Fixes: SessionTrackQueue registered in DI (was dead code), message
deserialization uses Guid.TryParse matching DistributeQueue format,
AddOrUpdate replaces TryAdd for retried task timers, task_completion
published on success, Redis key deleted after completion.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
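
The shared retry ceiling described in this commit can be sketched as follows. A plain dict stands in for Redis, the `task:{id}` key and `PrivateArgs["retry_count"]` field come from the PR description, and the function names are hypothetical. The read-then-write below is not atomic, so a real Redis-backed version would need an atomic update to stay correct under concurrency.

```python
MAX_RETRIES = 3  # shared ceiling for both queues


def try_register_retry(store: dict, task_id: str) -> bool:
    """Increment the shared counter; return False once the ceiling is reached."""
    task = store[f"task:{task_id}"]
    private_args = task.setdefault("PrivateArgs", {})
    count = int(private_args.get("retry_count", 0))
    if count >= MAX_RETRIES:
        return False
    private_args["retry_count"] = count + 1
    return True


def complete_task(store: dict, task_id: str) -> None:
    # On success the task key is deleted so no stale state is left behind.
    store.pop(f"task:{task_id}", None)
```

Because both callers go through the same counter, retries triggered by a timeout and retries triggered by a post-processing failure count against one budget of 3 rather than two independent ones.
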
…r detection

- Extract TaskRequest god file into separate handler classes per task type
  (SpeechToTextTaskHandler, TextToSpeechTaskHandler, TranslationTaskHandler,
  TextGenerationTaskHandler) with FieldsConfig, ITaskHandler, TaskHandlerFactory
- Extract StreamBuffer, SpeechToTextFormProcessor, TaskRequestFactory
- Add model/dtype allowlist validation in FieldsConfig and IsValidFields(),
  rejecting arbitrary model strings at the API boundary (E2)
- Replace fragile string-matching error detection
  (response.Contains("\"Status\":\"Error\"")) with JSON parsing via
  JsonDocument, preventing false 500s on valid responses containing
  "error" in their content (E1)
- Add CLAUDE.md, rules, agents, and skills for project documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
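
A minimal Python sketch of allowlist validation at the API boundary (E2). The class name mirrors FieldsConfig from the commit, but the field layout, the error messages, and the assumption that `dtype` is optional are illustrative and not taken from the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldsConfig:
    allowed_models: frozenset
    allowed_dtypes: frozenset


def validation_error(config: FieldsConfig, fields: dict) -> Optional[str]:
    """Return None when the fields are valid, else a descriptive message."""
    model = fields.get("model")
    if model not in config.allowed_models:
        return f"model '{model}' is not allowed"
    dtype = fields.get("dtype")
    # Assumption: dtype is optional and only checked when supplied.
    if dtype is not None and dtype not in config.allowed_dtypes:
        return f"dtype '{dtype}' is not allowed"
    return None


def to_http_status(error: Optional[str]) -> int:
    # Invalid input is rejected with 400 before it reaches worker nodes.
    return 200 if error is None else 400
```
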
…chunk buffers

Replace fire-and-forget Redis Pub/Sub with Redis Streams (XADD/XREADGROUP/XACK)
for all 7 pipeline channels, providing message durability and crash recovery.
Replace static ConcurrentDictionary chunk buffers in STT/TTS logic with
Redis Stream-based RedisChunkBuffer<T>, enabling horizontal scaling and
preventing memory leaks from incomplete tasks.

result_queue_{taskId} channels remain on Pub/Sub (ephemeral per-request).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…sult timeout

D5: Replace broken ClaimPendingAsync (StreamPendingMessagesAsync with null
consumer) with StreamAutoClaimAsync for correct crash recovery of orphaned
messages in Redis Stream consumer groups.

D6: Fix SessionTrackQueue dual-consumer restart — use linked CancellationToken
with Task.WhenAny so that if either consumer crashes, both restart together
instead of leaving one stream unprocessed indefinitely.

D7: Add 3-minute safety-net timeout to AwaitTaskResultAsync via linked
CancellationToken. Propagate HTTP cancellationToken from endpoint caller.
Return HTTP 504 on TimeoutException instead of hanging indefinitely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
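
The D7 safety-net timeout can be sketched with `asyncio.wait_for` as a Python stand-in for a linked CancellationToken. The function name and the response tuple are hypothetical; the 3-minute default and the 504 status come from the commit message.

```python
import asyncio

RESULT_TIMEOUT_S = 180  # 3-minute safety net


async def await_task_result(result, timeout_s: float = RESULT_TIMEOUT_S):
    """Wait for a task result, mapping a timeout to HTTP 504."""
    try:
        body = await asyncio.wait_for(result, timeout=timeout_s)
    except asyncio.TimeoutError:
        return 504, "Task result timed out"
    return 200, body
```

If the caller itself is cancelled, for example because the HTTP client disconnected, the cancellation propagates through `wait_for` rather than being reported as a timeout.
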
…s (P9-P20)

Remove 7 unused NuGet packages (FluentValidation, MediatR, NRedisStack,
EF Core SqlServer/Design, JwtBearer, Serilog), delete dead files
(TemplateWorker, Entity, TextGenerationRequestContract, TextGenerationSchemaFilter),
remove unused model classes from AIModels.cs, extract duplicate retry_count
parsing into PrivateArgsHelper, add 30s timeout to DistributeQueue node
acquisition, update GitHub Actions to v4, add Docker resource limits,
fix log message inconsistency, and remove unnecessary Task.Delay(1).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add k8s deployment manifests for all services (core-api, core-websocket,
core-background, client, redis) with ingress routing:
- woolball.xyz → landing page
- open.woolball.xyz → client app + /api, /swagger, /ws
- api.woolball.xyz → core-api
- ws.woolball.xyz → core-websocket

TLS via cert-manager for all domains.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eduardoworrel changed the title from "Security hardening, retry unification, and model allowlists" to "Security hardening, retry unification, model allowlists, image-text-to-text, and Kubernetes deployment" on Feb 24, 2026
The generic gemma3-1b-it-int4.task model file causes a "func is not a function" error in MediaPipe in the browser. Switch to gemma3-1b-it-int4-web.task, the web-converted format required by tasks-genai.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
eduardoworrel merged commit 2d6202a into main on Feb 25, 2026
4 of 5 checks passed