Skip to content

[Feature - NOKIA] Controller multiprocessing, time chunking, cache safety, tracking improvements#1421

Draft
saratpoluri wants to merge 18 commits into
release-2025.2from
Nokia-optimization-2025.2
Draft

[Feature - NOKIA] Controller multiprocessing, time chunking, cache safety, tracking improvements#1421
saratpoluri wants to merge 18 commits into
release-2025.2from
Nokia-optimization-2025.2

Conversation

@saratpoluri
Copy link
Copy Markdown
Contributor

📝 Description

Nokia enhancements to Intel SceneScape 2025.2 — controller and multiprocessing
changes only. No Triton/GPU dependencies.

Split from #1306 as requested — Triton changes will follow separately.

For a detailed technical walkthrough of all changes with code references,
see [docs/controller-enhancements-technical-reference.md]

Changes
SceneController multiprocessing (scene_controller.py): ProcessPoolExecutor
per scene (spawn context), overwrite buffer, semaphore admission control (default 20),
async MQTT publish on dedicated thread, automatic crash recovery on BrokenProcessPool,
publish watchdog every 30s
Scene-aware time chunking (time_chunking.py): Replaced flat TimeChunkBuffer
with SceneAwareCategoryBuffer — groups cameras by scene, hybrid dispatch
(event-driven when all cameras arrive + 200ms timer fallback), fixed-rate scheduling
via time.monotonic() to prevent drift
Thread-safe CacheManager (cache_manager.py): _fast dict-only lookup methods
safe for MQTT callback thread; HTTP refresh moved to background thread (60s interval);
lock held only during in-memory updates, never during HTTP I/O
O(1) object association (ilabs_tracking.py): Pre-built hash maps replace O(n)
linear scans; UUID stability fix includes unreliable + suspended tracks in pruning
C++ tracker bindings (robot_vision): Expose getSuspendedTracks() and
getUnreliableTracks() from C++ tracker for UUID stability fix
Schema extensions (metadata.schema.json): Add reid, facemask, color,
age, hat, gender, subtype fields
Tracker retuning (tracker-config.json): Tuned for 10 FPS operation
REST client fixes (rest_client.py): Fix token not assigned, url=None crash,
rootcert ignored for TLS verification
Dockerfile/Makefile: Default to public Ubuntu image; Nokia mirror overridable
via RUNTIME_OS_IMAGE build arg/env var
UI (sscape.js, style.css): SVG coordinate fix, scale controls
(Fit/Native/75/50/33%), Show IDs toggle
Signed-off-by: Mohammed Sufiyan Saqib mohammed.sufiyan_saqib@nokia.com

✨ Type of Change

Select the type of change your PR introduces:

  • 🐞 Bug fix – Non-breaking change which fixes an issue
  • 🚀 New feature – Non-breaking change which adds functionality
  • 🔨 Refactor – Non-breaking change which refactors the code base
  • 💥 Breaking change – Changes that break existing functionality
  • 📚 Documentation update
  • 🔒 Security update
  • 🧪 Tests
  • 🚂 CI

🧪 Testing Scenarios

Describe how the changes were tested and how reviewers can test them too:

  • ✅ Tested manually
  • 🤖 Ran automated end-to-end tests

✅ Checklist

Before submitting the PR, ensure the following:

  • 🔍 PR title is clear and descriptive
  • 📝 For internal contributors: If applicable, include the JIRA ticket number (e.g., ITEP-123456) in the PR title. Do not include full URLs
  • 💬 I have commented my code, especially in hard-to-understand areas
  • 📄 I have made corresponding changes to the documentation
  • ✅ I have added tests that prove my fix is effective or my feature works

…fety, tracking improvements (#1317)

Nokia enhancements to Intel SceneScape 2025.2 — controller and multiprocessing
changes only. No Triton/GPU dependencies.

Split from #1306 as requested — Triton changes will follow separately.

For a detailed technical walkthrough of all changes with code references,
see [docs/controller-enhancements-technical-reference.md]

Changes
SceneController multiprocessing (scene_controller.py): ProcessPoolExecutor
per scene (spawn context), overwrite buffer, semaphore admission control (default 20),
async MQTT publish on dedicated thread, automatic crash recovery on BrokenProcessPool,
publish watchdog every 30s
Scene-aware time chunking (time_chunking.py): Replaced flat TimeChunkBuffer
with SceneAwareCategoryBuffer — groups cameras by scene, hybrid dispatch
(event-driven when all cameras arrive + 200ms timer fallback), fixed-rate scheduling
via time.monotonic() to prevent drift
Thread-safe CacheManager (cache_manager.py): _fast dict-only lookup methods
safe for MQTT callback thread; HTTP refresh moved to background thread (60s interval);
lock held only during in-memory updates, never during HTTP I/O
O(1) object association (ilabs_tracking.py): Pre-built hash maps replace O(n)
linear scans; UUID stability fix includes unreliable + suspended tracks in pruning
C++ tracker bindings (robot_vision): Expose getSuspendedTracks() and
getUnreliableTracks() from C++ tracker for UUID stability fix
Schema extensions (metadata.schema.json): Add reid, facemask, color,
age, hat, gender, subtype fields
Tracker retuning (tracker-config.json): Tuned for 10 FPS operation
REST client fixes (rest_client.py): Fix token not assigned, url=None crash,
rootcert ignored for TLS verification
Dockerfile/Makefile: Default to public Ubuntu image; Nokia mirror overridable
via RUNTIME_OS_IMAGE build arg/env var
UI (sscape.js, style.css): SVG coordinate fix, scale controls
(Fit/Native/75/50/33%), Show IDs toggle
Signed-off-by: Mohammed Sufiyan Saqib mohammed.sufiyan_saqib@nokia.com
@saratpoluri saratpoluri changed the base branch from main to release-2025.2 May 11, 2026 23:34
@saratpoluri saratpoluri requested a review from Copilot May 11, 2026 23:35
@saratpoluri
Copy link
Copy Markdown
Contributor Author

Address also human comments from this PR: #1317

@saratpoluri
Copy link
Copy Markdown
Contributor Author

Merged this to a feature branch that used release-2025.2 as base, so we can make changes as we see fit.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR delivers Nokia’s controller-focused enhancements for SceneScape 2025.2, primarily aimed at improving throughput, responsiveness, and stability under multi-camera load by moving heavy work off the MQTT callback thread and strengthening tracking/time-chunking/cache behavior.

Changes:

  • Introduces a multi-process, per-scene worker architecture in SceneController with crash recovery, admission control, and async MQTT publishing.
  • Replaces flat time-chunk buffering with scene-aware buffering and hybrid (event + fixed-rate timer) dispatch; adds unit tests for the new time-chunking behavior.
  • Improves tracking/UUID stability and extends schemas/UI/Helm runtime configuration to support the new controller behaviors.

Reviewed changes

Copilot reviewed 37 out of 37 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
scene_common/src/scene_common/rest_client.py Adds REST timeouts/verify handling and changes RESTClient constructor/API surface.
scene_common/src/scene_common/options.py Extends CV subsystem choices (adds GPU_NVIDIA).
manager/src/templates/sscape/sceneDetail.html Adds UI controls (display scale, Show IDs) and hides map <img> used for SVG setup.
manager/src/static/js/sscape.js Implements SVG viewBox scaling, drag coordinate fixes, and UI toggles for scale/IDs.
manager/src/static/js/marks.js Adds optional ID label rendering and toggle behavior for marks.
manager/src/static/css/style.css Makes SVG stroke/text sizes scale-compensated; styles ID labels and scale control.
kubernetes/scenescape-chart/templates/scene-controller/deployment.yaml Adds env knobs (max workers, OMP threads), optional resources, shm + reid config mounts.
kubernetes/scenescape-chart/templates/scene-controller/configmap.yaml Adds reid-config.json ConfigMap (hooked pre-install/upgrade).
docs/controller-enhancements-technical-reference.md New long-form technical reference for controller enhancements.
controller/src/schema/metadata.schema.json Adds semantic metadata definitions and detection metadata field.
controller/src/robot_vision/src/rv/tracking/TrackManager.cpp Hooks suspended-track cleanup into predict path.
controller/src/robot_vision/setup.py Adds optional compile flag for Hungarian profiling (but currently introduces a syntax issue).
controller/src/robot_vision/requirements.txt Updates copyright header/comments.
controller/src/robot_vision/python/src/robot_vision/extensions/tracking.cpp Exposes suspended/unreliable tracks and config fields to Python.
controller/src/robot_vision/include/rv/tracking/TrackManager.hpp Updates config stringification (minor formatting change).
controller/src/robot_vision/include/rv/tracking/MultipleObjectTracker.hpp Exposes suspended/unreliable tracks accessors.
controller/src/robot_vision/include/rv/apollo/gated_hungarian_bigraph_matcher.hpp Adds optional profiling output for Hungarian matching.
controller/src/controller/vdms_adapter.py Extends ReID DB API (addEntry metadata, rename similarity method).
controller/src/controller/uuid_manager.py Bounds ReID threadpool, adds shutdown, switches to new DB API, preserves active UUID mappings.
controller/src/controller/tracking.py Adds daemon ownership model, thread assertions, better exception handling, heartbeat logging.
controller/src/controller/time_chunking.py Implements scene-aware buffering + hybrid dispatch; rewires time-chunked tracking flow.
controller/src/controller/test_time_chunking.py Adds unit tests for scene-aware buffering and dispatch behavior.
controller/src/controller/scene.py Accumulates events correctly across categories; adds analytics-only flow and profiling logs.
controller/src/controller/scene_controller.py Major refactor: multiprocessing per scene, overwrite buffer, async publish, cache refresh threading, analytics-only subscriptions.
controller/src/controller/reid.py Updates ReIDDatabase abstract API signatures.
controller/src/controller/observability/metrics.py Initializes counters with an initial add(0) during metric creation.
controller/src/controller/moving_object.py Makes RankWarning handling robust; refactors ReID parsing/storage; improves persist attribute handling.
controller/src/controller/ilabs_tracking.py Adds O(1) association path, UUID stability across track states, profiling instrumentation.
controller/src/controller/detections_builder.py Moves ReID payload into metadata.reid structure for outgoing messages.
controller/src/controller/controller_mode.py Adds global “analytics-only” mode toggle.
controller/src/controller/child_scene_controller.py Clarifies broad exception handling comment during child scene connect.
controller/src/controller/cache_manager.py Makes cache refresh thread-safe and moves HTTP off hot paths; adds “fast” lookup methods and periodic refresh.
controller/src/controller-cmd Adds health endpoint, profiling options, analytics-only flag (but does not wire ReID config).
controller/requirements-runtime.txt Updates orjson pin; adds requests pin.
controller/Makefile Switches default runtime image to public Ubuntu noble; updates build dirs.
controller/Dockerfile Uses runtime image arg; installs procps/htop for operational debugging.
controller/config/tracker-config.json Retunes tracker defaults for 10 FPS and sets time chunk interval to 200ms.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread controller/src/robot_vision/setup.py Outdated
Comment thread scene_common/src/scene_common/rest_client.py Outdated
Comment thread controller/src/controller-cmd Outdated
Comment thread controller/src/controller/scene_controller.py Outdated
Comment thread controller/src/controller/cache_manager.py Outdated
Comment thread scene_common/src/scene_common/rest_client.py Outdated
Comment thread controller/src/controller/time_chunking.py Outdated
@saratpoluri
Copy link
Copy Markdown
Contributor Author

@copilot Address code review comments from this PR: #1317

Copy link
Copy Markdown
Contributor

Copilot AI commented May 12, 2026

Address also human comments from this PR: #1317

Addressed the PR #1317 review items on this branch in commits 63941e6 and 61e0b10 (setup.py syntax fix, RESTClient compatibility/timeout parsing, ReID config wiring + in-place update safety, mutable default fix, and time_chunking lock doc correction).

@copilot Address code review comments from this PR: #1317

Done — I addressed the PR #1317 code review feedback in commits 63941e6 and 61e0b10 on this branch.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 37 out of 37 changed files in this pull request and generated 6 comments.

Comment thread controller/src/controller/ilabs_tracking.py
Comment thread controller/src/robot_vision/setup.py Outdated
Comment thread controller/src/robot_vision/include/rv/apollo/gated_hungarian_bigraph_matcher.hpp Outdated
Comment thread controller/src/controller/cache_manager.py Outdated
Comment thread controller/src/controller/vdms_adapter.py
Comment thread controller/src/schema/metadata.schema.json Outdated
@saratpoluri
Copy link
Copy Markdown
Contributor Author

@copilot : address code review comments.

@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch 3 times, most recently from 8349e83 to 1c07341 Compare May 12, 2026 23:43
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from 1c07341 to 236515a Compare May 12, 2026 23:45
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from 4a20876 to 90cdf5a Compare May 13, 2026 00:15
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from db02caa to 9d6cb78 Compare May 13, 2026 05:07
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from aab2e36 to 1714689 Compare May 13, 2026 05:34
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch 7 times, most recently from 3098e7b to c60f7bd Compare May 13, 2026 22:57
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from c60f7bd to edb2d9f Compare May 13, 2026 23:01
Comment thread controller/src/controller/scene_controller.py
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from ee27e0d to f26aad0 Compare May 13, 2026 23:51
@saratpoluri saratpoluri marked this pull request as draft May 14, 2026 00:38
@saratpoluri saratpoluri force-pushed the Nokia-optimization-2025.2 branch from f26aad0 to 79e540b Compare May 14, 2026 00:46
Copy link
Copy Markdown
Contributor

@tdorauintc tdorauintc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Current time chunking implementation seems to be over-complicated. I suggest to simplify it.

I will review the rest in the next week.

Comment on lines +60 to +66
# Global cache_manager instance for scene_id lookup (set by scene_controller at startup)
_cache_manager = None

def set_cache_manager(cache_manager):
"""Set the global cache_manager instance for scene_id derivation."""
global _cache_manager
_cache_manager = cache_manager
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

time_chunking.py stores the CacheManager instance in a module-level global (_cache_manager) set via set_cache_manager(). This works today because there is exactly one SceneController (and thus one CacheManager) per process, but it introduces implicit coupling and fragility that can impact testability (tests must manage module-level state; parallel test runs can interfere) and is not future-proof - if a second SceneController were ever created in-process, the global would silently point to whichever called set_cache_manager() last.

Suggestion: The dependency-injection pattern, already used by SceneAwareCategoryBuffer (constructor parameter get_scene_camera_count), could be extended to TimeChunkedIntelLabsTracking — pass the cache manager (or a lookup callable) through the constructor instead of reading a module global. This would also eliminate the global _cache_manager statement inside trackObjects().

Not a blocker — the single-controller-per-process invariant holds — but worth cleaning up to match the DI pattern used elsewhere in this same file and improve code quality.

return count
return None

class SceneAwareCategoryBuffer:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The "scene-aware" design is over-engineered for the current call graph and SceneController design - it groups by scene_id internally, but is never shared across scenes, because worker thread is created per scene in SceneController (see "Architectural invariant: Each scene's tracker state must be owned by a single worker process" comment doc in the scene_controller.py).

I am sceptical about it because it introduces unnecessary complexity and adds maintenance burden (we should rather simplify the code where possible and avoid dead paths that are never exercised). I don't see much benefit from generalizing the design locally, since the code is tightly coupled with scene controller.

return sum(len(cameras) for cameras in self._data.values())


class TimeChunkProcessor(threading.Thread):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment as above about over-engineered "scene-aware" design applies to TimeChunkProcessor

"""

def __init__(self, tracker_manager, interval_ms: int = DEFAULT_CHUNKING_INTERVAL_MS,
partial_scene_timeout_sec: float = DEFAULT_PARTIAL_SCENE_TIMEOUT_SEC):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I am missing sth but I don't understand why additional variable partial_scene_timeout_sec and constant DEFAULT_PARTIAL_SCENE_TIMEOUT_SEC is introduced. Only time chunking interval (which is configurable by user) should be used to verify whether data should be processed or not. This way we can guarantee that user is able to control the tracker processing rate.

Comment on lines +196 to +200
Design principles:
1. Fixed-rate scheduling via time.monotonic() prevents timer drift under load (M1)
2. Event-driven early dispatch when all cameras arrive for a scene (H1)
3. Scheduled dispatch handles partial scenes via stale timeout (fairness)
4. If tracker busy, skip interval - buffer continues accumulating fresher data
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some analysis how this design works in different scenarios.

Scenario 1: multiple cameras at the same Camera FPS > chunking frequency > 5 Hz =1/(partial_scene_timeout_sec): effective rate is driven by camera FPS, not timer. The early-wake path dominates. The timer is mostly a no-op backup. Consequence: If tracker processing takes longer than the inter-frame interval, the tracker self-throttles. Works differently than original time-chunking (clock-driven) in such case but it's OK

Scenario 2: multiple cameras at the same Camera FPS < chunking frequency: It depends. If Camera FPS > 5 Hz then the early wake fires once per camera cycle when the last camera arrives. Timer ticks wait for complete data and effective rate is again driven by camera FPS. If Camera FPS < 5 Hz, timer tick processes stale partial data and but at 5 Hz rate. It is rather not what we need

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants