[Feature - NOKIA] Controller multiprocessing, time chunking, cache safety, tracking improvements#1421
[Feature - NOKIA] Controller multiprocessing, time chunking, cache safety, tracking improvements#1421saratpoluri wants to merge 18 commits into
Conversation
…fety, tracking improvements (#1317) Nokia enhancements to Intel SceneScape 2025.2 — controller and multiprocessing changes only. No Triton/GPU dependencies. Split from #1306 as requested — Triton changes will follow separately. For a detailed technical walkthrough of all changes with code references, see [docs/controller-enhancements-technical-reference.md] Changes SceneController multiprocessing (scene_controller.py): ProcessPoolExecutor per scene (spawn context), overwrite buffer, semaphore admission control (default 20), async MQTT publish on dedicated thread, automatic crash recovery on BrokenProcessPool, publish watchdog every 30s Scene-aware time chunking (time_chunking.py): Replaced flat TimeChunkBuffer with SceneAwareCategoryBuffer — groups cameras by scene, hybrid dispatch (event-driven when all cameras arrive + 200ms timer fallback), fixed-rate scheduling via time.monotonic() to prevent drift Thread-safe CacheManager (cache_manager.py): _fast dict-only lookup methods safe for MQTT callback thread; HTTP refresh moved to background thread (60s interval); lock held only during in-memory updates, never during HTTP I/O O(1) object association (ilabs_tracking.py): Pre-built hash maps replace O(n) linear scans; UUID stability fix includes unreliable + suspended tracks in pruning C++ tracker bindings (robot_vision): Expose getSuspendedTracks() and getUnreliableTracks() from C++ tracker for UUID stability fix Schema extensions (metadata.schema.json): Add reid, facemask, color, age, hat, gender, subtype fields Tracker retuning (tracker-config.json): Tuned for 10 FPS operation REST client fixes (rest_client.py): Fix token not assigned, url=None crash, rootcert ignored for TLS verification Dockerfile/Makefile: Default to public Ubuntu image; Nokia mirror overridable via RUNTIME_OS_IMAGE build arg/env var UI (sscape.js, style.css): SVG coordinate fix, scale controls (Fit/Native/75/50/33%), Show IDs toggle Signed-off-by: Mohammed Sufiyan Saqib mohammed.sufiyan_saqib@nokia.com
|
Address also human comments from this PR: #1317 |
|
Merged this to a feature branch that used release-2025.2 as base, so we can make changes as we see fit. |
There was a problem hiding this comment.
Pull request overview
This PR delivers Nokia’s controller-focused enhancements for SceneScape 2025.2, primarily aimed at improving throughput, responsiveness, and stability under multi-camera load by moving heavy work off the MQTT callback thread and strengthening tracking/time-chunking/cache behavior.
Changes:
- Introduces a multi-process, per-scene worker architecture in
SceneControllerwith crash recovery, admission control, and async MQTT publishing. - Replaces flat time-chunk buffering with scene-aware buffering and hybrid (event + fixed-rate timer) dispatch; adds unit tests for the new time-chunking behavior.
- Improves tracking/UUID stability and extends schemas/UI/Helm runtime configuration to support the new controller behaviors.
Reviewed changes
Copilot reviewed 37 out of 37 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| scene_common/src/scene_common/rest_client.py | Adds REST timeouts/verify handling and changes RESTClient constructor/API surface. |
| scene_common/src/scene_common/options.py | Extends CV subsystem choices (adds GPU_NVIDIA). |
| manager/src/templates/sscape/sceneDetail.html | Adds UI controls (display scale, Show IDs) and hides map <img> used for SVG setup. |
| manager/src/static/js/sscape.js | Implements SVG viewBox scaling, drag coordinate fixes, and UI toggles for scale/IDs. |
| manager/src/static/js/marks.js | Adds optional ID label rendering and toggle behavior for marks. |
| manager/src/static/css/style.css | Makes SVG stroke/text sizes scale-compensated; styles ID labels and scale control. |
| kubernetes/scenescape-chart/templates/scene-controller/deployment.yaml | Adds env knobs (max workers, OMP threads), optional resources, shm + reid config mounts. |
| kubernetes/scenescape-chart/templates/scene-controller/configmap.yaml | Adds reid-config.json ConfigMap (hooked pre-install/upgrade). |
| docs/controller-enhancements-technical-reference.md | New long-form technical reference for controller enhancements. |
| controller/src/schema/metadata.schema.json | Adds semantic metadata definitions and detection metadata field. |
| controller/src/robot_vision/src/rv/tracking/TrackManager.cpp | Hooks suspended-track cleanup into predict path. |
| controller/src/robot_vision/setup.py | Adds optional compile flag for Hungarian profiling (but currently introduces a syntax issue). |
| controller/src/robot_vision/requirements.txt | Updates copyright header/comments. |
| controller/src/robot_vision/python/src/robot_vision/extensions/tracking.cpp | Exposes suspended/unreliable tracks and config fields to Python. |
| controller/src/robot_vision/include/rv/tracking/TrackManager.hpp | Updates config stringification (minor formatting change). |
| controller/src/robot_vision/include/rv/tracking/MultipleObjectTracker.hpp | Exposes suspended/unreliable tracks accessors. |
| controller/src/robot_vision/include/rv/apollo/gated_hungarian_bigraph_matcher.hpp | Adds optional profiling output for Hungarian matching. |
| controller/src/controller/vdms_adapter.py | Extends ReID DB API (addEntry metadata, rename similarity method). |
| controller/src/controller/uuid_manager.py | Bounds ReID threadpool, adds shutdown, switches to new DB API, preserves active UUID mappings. |
| controller/src/controller/tracking.py | Adds daemon ownership model, thread assertions, better exception handling, heartbeat logging. |
| controller/src/controller/time_chunking.py | Implements scene-aware buffering + hybrid dispatch; rewires time-chunked tracking flow. |
| controller/src/controller/test_time_chunking.py | Adds unit tests for scene-aware buffering and dispatch behavior. |
| controller/src/controller/scene.py | Accumulates events correctly across categories; adds analytics-only flow and profiling logs. |
| controller/src/controller/scene_controller.py | Major refactor: multiprocessing per scene, overwrite buffer, async publish, cache refresh threading, analytics-only subscriptions. |
| controller/src/controller/reid.py | Updates ReIDDatabase abstract API signatures. |
| controller/src/controller/observability/metrics.py | Initializes counters with an initial add(0) during metric creation. |
| controller/src/controller/moving_object.py | Makes RankWarning handling robust; refactors ReID parsing/storage; improves persist attribute handling. |
| controller/src/controller/ilabs_tracking.py | Adds O(1) association path, UUID stability across track states, profiling instrumentation. |
| controller/src/controller/detections_builder.py | Moves ReID payload into metadata.reid structure for outgoing messages. |
| controller/src/controller/controller_mode.py | Adds global “analytics-only” mode toggle. |
| controller/src/controller/child_scene_controller.py | Clarifies broad exception handling comment during child scene connect. |
| controller/src/controller/cache_manager.py | Makes cache refresh thread-safe and moves HTTP off hot paths; adds “fast” lookup methods and periodic refresh. |
| controller/src/controller-cmd | Adds health endpoint, profiling options, analytics-only flag (but does not wire ReID config). |
| controller/requirements-runtime.txt | Updates orjson pin; adds requests pin. |
| controller/Makefile | Switches default runtime image to public Ubuntu noble; updates build dirs. |
| controller/Dockerfile | Uses runtime image arg; installs procps/htop for operational debugging. |
| controller/config/tracker-config.json | Retunes tracker defaults for 10 FPS and sets time chunk interval to 200ms. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Agent-Logs-Url: https://github.com/open-edge-platform/scenescape/sessions/8948d2f1-d336-4138-887a-19dfb08ad358 Co-authored-by: saratpoluri <1325325+saratpoluri@users.noreply.github.com>
Agent-Logs-Url: https://github.com/open-edge-platform/scenescape/sessions/8948d2f1-d336-4138-887a-19dfb08ad358 Co-authored-by: saratpoluri <1325325+saratpoluri@users.noreply.github.com>
Agent-Logs-Url: https://github.com/open-edge-platform/scenescape/sessions/8948d2f1-d336-4138-887a-19dfb08ad358 Co-authored-by: saratpoluri <1325325+saratpoluri@users.noreply.github.com>
Addressed the PR #1317 review items on this branch in commits 63941e6 and 61e0b10 (setup.py syntax fix, RESTClient compatibility/timeout parsing, ReID config wiring + in-place update safety, mutable default fix, and time_chunking lock doc correction). Done — I addressed the PR #1317 code review feedback in commits 63941e6 and 61e0b10 on this branch. |
|
@copilot : address code review comments. |
8349e83 to
1c07341
Compare
1c07341 to
236515a
Compare
4a20876 to
90cdf5a
Compare
db02caa to
9d6cb78
Compare
aab2e36 to
1714689
Compare
3098e7b to
c60f7bd
Compare
c60f7bd to
edb2d9f
Compare
ee27e0d to
f26aad0
Compare
f26aad0 to
79e540b
Compare
tdorauintc
left a comment
There was a problem hiding this comment.
Current time chunking implementation seems to be over-complicated. I suggest to simplify it.
I will review the rest in the next week.
| # Global cache_manager instance for scene_id lookup (set by scene_controller at startup) | ||
| _cache_manager = None | ||
|
|
||
| def set_cache_manager(cache_manager): | ||
| """Set the global cache_manager instance for scene_id derivation.""" | ||
| global _cache_manager | ||
| _cache_manager = cache_manager |
There was a problem hiding this comment.
time_chunking.py stores the CacheManager instance in a module-level global (_cache_manager) set via set_cache_manager(). This works today because there is exactly one SceneController (and thus one CacheManager) per process, but it introduces implicit coupling and fragility that can impact testability (tests must manage module-level state; parallel test runs can interfere) and is not future-proof - if a second SceneController were ever created in-process, the global would silently point to whichever called set_cache_manager() last.
Suggestion: The dependency-injection pattern, already used by SceneAwareCategoryBuffer (constructor parameter get_scene_camera_count), could be extended to TimeChunkedIntelLabsTracking — pass the cache manager (or a lookup callable) through the constructor instead of reading a module global. This would also eliminate the global _cache_manager statement inside trackObjects().
Not a blocker — the single-controller-per-process invariant holds — but worth cleaning up to match the DI pattern used elsewhere in this same file and improve code quality.
| return count | ||
| return None | ||
|
|
||
| class SceneAwareCategoryBuffer: |
There was a problem hiding this comment.
The "scene-aware" design is over-engineered for the current call graph and SceneController design - it groups by scene_id internally, but is never shared across scenes, because worker thread is created per scene in SceneController (see "Architectural invariant: Each scene's tracker state must be owned by a single worker process" comment doc in the scene_controller.py).
I am sceptical about it because it introduces unnecessary complexity and adds maintenance burden (we should rather simplify the code where possible and avoid dead paths that are never exercised). I don't see much benefit from generalizing the design locally, since the code is tightly coupled with scene controller.
| return sum(len(cameras) for cameras in self._data.values()) | ||
|
|
||
|
|
||
| class TimeChunkProcessor(threading.Thread): |
There was a problem hiding this comment.
The same comment as above about over-engineered "scene-aware" design applies to TimeChunkProcessor
| """ | ||
|
|
||
| def __init__(self, tracker_manager, interval_ms: int = DEFAULT_CHUNKING_INTERVAL_MS, | ||
| partial_scene_timeout_sec: float = DEFAULT_PARTIAL_SCENE_TIMEOUT_SEC): |
There was a problem hiding this comment.
Maybe I am missing sth but I don't understand why additional variable partial_scene_timeout_sec and constant DEFAULT_PARTIAL_SCENE_TIMEOUT_SEC is introduced. Only time chunking interval (which is configurable by user) should be used to verify whether data should be processed or not. This way we can guarantee that user is able to control the tracker processing rate.
| Design principles: | ||
| 1. Fixed-rate scheduling via time.monotonic() prevents timer drift under load (M1) | ||
| 2. Event-driven early dispatch when all cameras arrive for a scene (H1) | ||
| 3. Scheduled dispatch handles partial scenes via stale timeout (fairness) | ||
| 4. If tracker busy, skip interval - buffer continues accumulating fresher data |
There was a problem hiding this comment.
I did some analysis how this design works in different scenarios.
Scenario 1: multiple cameras at the same Camera FPS > chunking frequency > 5 Hz =1/(partial_scene_timeout_sec): effective rate is driven by camera FPS, not timer. The early-wake path dominates. The timer is mostly a no-op backup. Consequence: If tracker processing takes longer than the inter-frame interval, the tracker self-throttles. Works differently than original time-chunking (clock-driven) in such case but it's OK
Scenario 2: multiple cameras at the same Camera FPS < chunking frequency: It depends. If Camera FPS > 5 Hz then the early wake fires once per camera cycle when the last camera arrives. Timer ticks wait for complete data and effective rate is again driven by camera FPS. If Camera FPS < 5 Hz, timer tick processes stale partial data and but at 5 Hz rate. It is rather not what we need
📝 Description
Nokia enhancements to Intel SceneScape 2025.2 — controller and multiprocessing
changes only. No Triton/GPU dependencies.
Split from #1306 as requested — Triton changes will follow separately.
For a detailed technical walkthrough of all changes with code references,
see [docs/controller-enhancements-technical-reference.md]
Changes
SceneController multiprocessing (scene_controller.py): ProcessPoolExecutor
per scene (spawn context), overwrite buffer, semaphore admission control (default 20),
async MQTT publish on dedicated thread, automatic crash recovery on BrokenProcessPool,
publish watchdog every 30s
Scene-aware time chunking (time_chunking.py): Replaced flat TimeChunkBuffer
with SceneAwareCategoryBuffer — groups cameras by scene, hybrid dispatch
(event-driven when all cameras arrive + 200ms timer fallback), fixed-rate scheduling
via time.monotonic() to prevent drift
Thread-safe CacheManager (cache_manager.py): _fast dict-only lookup methods
safe for MQTT callback thread; HTTP refresh moved to background thread (60s interval);
lock held only during in-memory updates, never during HTTP I/O
O(1) object association (ilabs_tracking.py): Pre-built hash maps replace O(n)
linear scans; UUID stability fix includes unreliable + suspended tracks in pruning
C++ tracker bindings (robot_vision): Expose getSuspendedTracks() and
getUnreliableTracks() from C++ tracker for UUID stability fix
Schema extensions (metadata.schema.json): Add reid, facemask, color,
age, hat, gender, subtype fields
Tracker retuning (tracker-config.json): Tuned for 10 FPS operation
REST client fixes (rest_client.py): Fix token not assigned, url=None crash,
rootcert ignored for TLS verification
Dockerfile/Makefile: Default to public Ubuntu image; Nokia mirror overridable
via RUNTIME_OS_IMAGE build arg/env var
UI (sscape.js, style.css): SVG coordinate fix, scale controls
(Fit/Native/75/50/33%), Show IDs toggle
Signed-off-by: Mohammed Sufiyan Saqib mohammed.sufiyan_saqib@nokia.com
✨ Type of Change
Select the type of change your PR introduces:
🧪 Testing Scenarios
Describe how the changes were tested and how reviewers can test them too:
✅ Checklist
Before submitting the PR, ensure the following: