Skip to content

feat: add platform skeleton — Engine, Module base, and channel primitives#34

Open
lterrac wants to merge 14 commits into
mainfrom
feat/platform
Open

feat: add platform skeleton — Engine, Module base, and channel primitives#34
lterrac wants to merge 14 commits into
mainfrom
feat/platform

Conversation

@lterrac

@lterrac lterrac commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Scoped-down version of the initial platform PR, per review feedback on #32 (first milestone scope too broad).

What this includes

  • platform/include/system/engine.hpp — the control-plane kernel: manages module lifecycle (initialize / run / await / terminate / finalize), start/stop RPCs, deployer/worker split. No model semantics.
  • platform/include/modules/module.hpp + subscription.hpp — pure lifecycle interface that all future platform modules implement.
  • platform/include/system/channels/ — host-side channel primitives: Input, Output, Message, MessageTypeRegistry, Base. Pure communication transport.
  • platform/meson.build + meson_options.txt — build scaffolding with HiCR subproject.
  • platform/extern/HiCR — HiCR submodule (used by engine and channels).
  • platform/docs/ — top-level README, system/engine, system/channels docs.

What is explicitly deferred

Deferred Reason
include/modules/configuration/ Mixes model fields (model_range, parallelism) with platform fields. Deferred until DeploymentSpec ownership is resolved (see #32).
broadcastDeployment, channelController, service modules Depend on configuration types; follow-up PR.
extern/TaskR Only needed by the service module.
examples/ Follow-up PR with the concrete modules.

Relation to #32 design discussion

This PR takes a position only on the undisputed cornerstone: Platform is a control plane that manages module lifecycle and channel primitives. It does not take a position on DeploymentSpec ownership or request-dispatch vs resource-routing — those remain open per the #32 discussion.

Tracked by #32 and #13.

@lterrac lterrac marked this pull request as draft June 12, 2026 09:43
@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 57483356-7c0e-46f9-88b8-eadca10e41f6

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Introduces the platform/ subtree for PyPTO Serving: Meson build configuration with HiCR and TaskR submodules, C++ headers for the serving::modules::Module/Subscription abstractions, SPSC channel primitives (Base, Input, Output, Message, MessageTypeRegistry), the serving::system::Engine lifecycle manager, a runnable MPI engine example, and reference documentation.

Changes

Platform Foundation

Layer / File(s) Summary
Build setup and external dependencies
.gitignore, .gitmodules, platform/.clang-format, platform/meson.build, platform/meson_options.txt, platform/extern/HiCR, platform/extern/TaskR, .github/workflows/ci.yml
Meson project, build options (engines, buildTests, compileWarningsAsErrors), LLVM clang-format rules, HiCR/TaskR submodule registrations and commit pins, and new platform-build CI job.
Module lifecycle framework
platform/include/modules/module.hpp, platform/include/modules/subscription.hpp
Abstract Module base with taskr::Service construction and pure virtual lifecycle hooks; Subscription value type holding message type, input edge, and handler callback.
Channel primitives and message contracts
platform/include/system/channels/message.hpp, platform/include/system/channels/messageTypeRegistry.hpp, platform/include/system/channels/base.hpp, platform/include/system/channels/input.hpp, platform/include/system/channels/output.hpp
Message with packed metadata_t bit-encoding; FNV-1a MessageTypeRegistry; Base channel with config/slot-key types, coordination buffer lifecycle, and initialize(tag) flow; Input/Output SPSC consumer and producer endpoints.
Engine orchestration
platform/include/system/engine.hpp
serving::system::Engine registers modules, wires RPC start/stop targets, role-splits run() between deployer and workers, and drives the full module initialize/run/await/terminate/finalize sequence.
Example program and test wiring
platform/examples/meson.build, platform/examples/system/engine/engine.cpp, platform/examples/system/engine/meson.build
Engine example that sets up HWloc topology, constructs MPI/pthread/Boost managers, creates and runs an RPCEngine-backed serving::system::Engine, and a Meson target running it under MPI with 5 processes.
Reference documentation
platform/docs/README.md, platform/docs/system/README.md, platform/docs/system/channels.md, platform/docs/system/engine.md
Platform architecture overview, system layer component ownership, channel layer design and limitations, and engine lifecycle/module contract docs.

Sequence Diagram(s)

sequenceDiagram
  participant Deployer as Engine (deployer)
  participant RPCEngine
  participant Workers as Engine (workers N-1)

  rect rgba(70, 130, 180, 0.5)
    note over Deployer,Workers: initialize
    Deployer->>Deployer: initialize() — all modules
    Workers->>Workers: initialize() — all modules
  end

  rect rgba(60, 179, 113, 0.5)
    note over Deployer,Workers: run
    Deployer->>RPCEngine: broadcast "Start Instances" RPC
    RPCEngine->>Workers: invoke start()
    Workers->>Workers: run modules, enter listen()
    Deployer->>Deployer: start() — run modules
  end

  rect rgba(210, 105, 30, 0.5)
    note over Deployer,Workers: terminate
    Deployer->>RPCEngine: broadcast "Stop Instances" RPC
    RPCEngine->>Workers: invoke stop() — _isRunning=false
    Deployer->>Deployer: stop() locally
    Deployer->>Deployer: terminate/await/finalize modules
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 A platform was born from a bunny's delight,
With channels and engines humming all night.
MPI hops across instances with glee,
Modules initialize, one, two, three!
The RPC whispers "start" to each node,
And PyPTO serving sets off down the road. 🚀

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main platform skeleton additions: Engine, Module base, and channel primitives.
Description check ✅ Passed The description is directly related to the PR and accurately outlines the included platform scaffolding and deferred items.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a C++ platform-management layer for PyPTO Serving, establishing core infrastructure for distributed system bootstrap, deployment graph configuration, and SPSC channel-based communication. The review feedback highlights several critical issues, including undefined behavior from passing std::string directly to variadic functions with %s format specifiers, potential queue starvation in the telephone game example due to unpopped messages, risk of crashes from dereferencing empty containers, high CPU usage from busy-waiting in the engine's await loop, and standard format specifier mismatches for size_t values.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread platform/include/modules/configuration/deployment.hpp Outdated
Comment thread platform/include/modules/configuration/deployment.hpp Outdated
Comment thread platform/include/system/channels/base.hpp
Comment thread platform/examples/modules/channelController/telephoneGame.hpp Outdated
Comment thread platform/examples/include/runtime/helpers.hpp Outdated
Comment thread platform/include/system/engine.hpp
Comment thread platform/examples/include/deployment/helpers.hpp Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (22)
platform/include/system/engine.hpp-91-104 (1)

91-104: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Deployer terminates without confirming worker shutdown.

The terminate() method broadcasts stop RPCs to worker instances (lines 96-100) and immediately stops itself (line 102), but does not wait for acknowledgment that workers have processed the stop RPC or completed their shutdown sequence. If the deployer instance exits before workers receive or process the stop RPC, the distributed system may be left in an inconsistent state. Consider adding a barrier or acknowledgment mechanism to ensure all workers have shut down before the deployer finalizes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/engine.hpp` around lines 91 - 104, The terminate()
method currently broadcasts stop RPCs via _rpcEngine->requestRPC(...) to
instances from _instanceManager->getInstances() and immediately calls stop() on
the deployer; change this to send stop requests and then wait for confirmation
from each worker (either by using RPC responses/ack callbacks, a barrier RPC, or
polling instance state via _instanceManager) before calling stop() so the
deployer does not exit until all workers have acknowledged shutdown; use
__SERVING_SYSTEM_STOP_RPC_NAME as the RPC and add a configurable timeout and
logging for any non-responsive instances to avoid blocking forever.
platform/include/system/channels/output.hpp-57-89 (1)

57-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Const-correctness issue in isFull() and pushMessage().

Both isFull() (line 57) and pushMessage() (line 79) are marked const, yet they modify channel state by calling updateDepth() and push() on the underlying HiCR channels. If the channel member pointers are const, this compiles only because the methods modify the pointed-to object, not the pointer itself. However, this violates const semantics: const methods should not modify observable object state. Remove const from these methods or document that the constness applies only to the Output object's direct members, not the channel state.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/output.hpp` around lines 57 - 89, The
methods isFull and pushMessage are incorrectly declared const because they call
non-const operations on the underlying channels (updateDepth, push) and
therefore modify observable state; remove the const qualifier from the
signatures of isFull(const size_t) and pushMessage(const Message) (and adjust
any related declarations/usages such as pushMessageLocking calling pushMessage)
so these methods are non-const, or alternatively mark the channel pointers
mutable if you intentionally want logical constness—prefer removing const for
clarity; update function declarations in Output (isFull, pushMessage) and any
forward declarations to match.
platform/include/system/channels/messageTypeRegistry.hpp-55-66 (1)

55-66: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

High hash collision risk with 10-bit truncation.

The FNV-1a hash is truncated to 10 bits (line 65), providing only 1,024 possible type IDs. Due to the birthday paradox, collision probability reaches ~50% with approximately 38 registered types. While the code detects and throws on collisions (line 32-33), this may limit scalability. If the system expects more than ~30 message types, consider increasing TYPE_BITS in message.hpp or documenting the expected maximum number of types.

Would you like me to generate a script to verify the number of expected message types in the codebase?

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/messageTypeRegistry.hpp` around lines 55 -
66, The fnv1a() function currently truncates the 64-bit FNV-1a hash to
Message::MAX_TYPE (10 bits), causing high collision risk when >~30 types are
registered; fix by increasing TYPE_BITS in message.hpp (and thus
Message::MAX_TYPE) to a larger value (e.g., 16) to provide more unique IDs or
alternatively stop truncating in fnv1a() and return a wider hash type for the
registry to map; update references to messageType_t, Message::MAX_TYPE and fnv1a
accordingly and run the collision detection tests to confirm the reduced
collision rate.
platform/include/system/channels/base.hpp-80-93 (1)

80-93: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Resource leak if constructor allocation fails.

If allocateLocalMemorySlot throws during the second or third allocation (lines 88-89), the previously allocated slots will not be freed because the destructor won't run for a partially constructed object. Consider using RAII wrappers (e.g., storing slots in unique_ptr with custom deleters) or moving allocations to an initialize() method with proper cleanup on failure.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/base.hpp` around lines 80 - 93, The Base
constructor performs multiple calls to
_config.coordinationMemoryManager->allocateLocalMemorySlot and if any allocation
throws (e.g., during assignments to
_dataChannelLocalCoordinationBufferForPayloads or
_metadataChannelLocalCoordinationBuffer) previously allocated slots
(_dataChannelLocalCoordinationBufferForSizes, etc.) leak because the destructor
won't run for a partially constructed object; fix by making the allocations
exception-safe: either wrap each slot field in an RAII wrapper (e.g.,
std::unique_ptr or a small SlotHolder with a custom deleter that calls
coordinationMemoryManager->freeLocalMemorySlot) or move the allocation logic out
of the constructor into an initialize() method that performs allocations and
frees already-allocated slots on failure (or uses try/catch to release prior
slots), ensuring _dataChannelLocalCoordinationBufferForSizes,
_dataChannelLocalCoordinationBufferForPayloads and
_metadataChannelLocalCoordinationBuffer are cleaned up on exceptions.
platform/include/system/channels/output.hpp-79-89 (1)

79-89: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Exception safety: memory slots may leak if push throws.

If _dataChannel->push() (line 84) or _metadataChannel->push() (line 86) throws an exception, the registered local memory slots on lines 83 and 85 will not be deregistered, potentially causing resource leaks. Wrap the registration/deregistration in RAII guards or use a try-catch block to ensure cleanup. Additionally, registering and deregistering memory slots for every message may incur performance overhead; consider caching and reusing registered slots if the pattern allows.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/output.hpp` around lines 79 - 89,
pushMessage currently registers payloadSlot and metadataSlot via
payloadMemoryManager::registerLocalMemorySlot and
coordinationMemoryManager::registerLocalMemorySlot but calls _dataChannel->push
and _metadataChannel->push afterwards; if either push throws the corresponding
deregisterLocalMemorySlot calls are skipped and slots leak. Fix by ensuring
deregistration always runs: wrap each registered slot in an RAII guard type
(e.g., LocalMemorySlotGuard) that calls deregisterLocalMemorySlot in its
destructor, or at minimum use a try/finally pattern (try { push... } catch(...)
{ deregister both; throw; } ) around the push calls; update pushMessage to use
these guards or try/catch to guarantee payloadSlot and metadataSlot are
deregistered even when _dataChannel->push or _metadataChannel->push throws, and
consider exposing a reusable registration cache if performance becomes an issue.
platform/include/system/channels/input.hpp-80-103 (1)

80-103: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Thread safety concern in message access methods.

hasMessage(), getMessage(), and popMessage() are not internally synchronized, and the check-then-act pattern (e.g., hasMessage() followed by getMessage()) creates a race window. If multiple threads access the same Input channel, one thread could pop a message between another thread's hasMessage() check and its getMessage() call, causing a runtime exception on line 82. Consider documenting that Input instances are single-consumer only, or add internal locking to these methods.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/input.hpp` around lines 80 - 103, The
methods hasMessage(), getMessage(), and popMessage() on Input are vulnerable to
a check-then-act race; add an internal mutex (e.g., std::mutex _mutex) to the
Input class and acquire a lock (std::lock_guard<std::mutex>) inside
hasMessage(), getMessage(), and popMessage() so the availability check and
subsequent access/pop occur atomically; specifically, move the availability
check into the locked region in getMessage() and popMessage() instead of calling
hasMessage() outside the lock, and protect all accesses to _dataChannel and
_metadataChannel under that same lock to prevent another thread from popping
between check and use.
platform/include/system/engine.hpp-84-86 (1)

84-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing exception handling during module shutdown.

Lines 84-86 iterate over all modules calling terminate(), await(), and finalize(), but if any of these methods throws an exception, the iteration stops and remaining modules are left in an inconsistent state. Add per-module try-catch blocks to ensure all modules are properly cleaned up even if some fail, logging errors but continuing the shutdown sequence.

🛡️ Recommended fix
-    for (const auto &[name, module] : _modules) { module->terminate(); }
-    for (const auto &[name, module] : _modules) { module->await(); }
-    for (const auto &[name, module] : _modules) { module->finalize(); }
+    for (const auto &[name, module] : _modules) {
+      try { module->terminate(); }
+      catch (const std::exception &e) { printf("[Instance %lu] Module '%s' terminate failed: %s\n", _instanceId, name.c_str(), e.what()); }
+    }
+    for (const auto &[name, module] : _modules) {
+      try { module->await(); }
+      catch (const std::exception &e) { printf("[Instance %lu] Module '%s' await failed: %s\n", _instanceId, name.c_str(), e.what()); }
+    }
+    for (const auto &[name, module] : _modules) {
+      try { module->finalize(); }
+      catch (const std::exception &e) { printf("[Instance %lu] Module '%s' finalize failed: %s\n", _instanceId, name.c_str(), e.what()); }
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/engine.hpp` around lines 84 - 86, Wrap each
per-module call in the three shutdown loops so exceptions from one module don't
abort the sequence: in the loops iterating over _modules (the blocks calling
module->terminate(), module->await(), and module->finalize()) add a try {
module->terminate(); } catch (const std::exception& e) { /* log error with
module name and e.what() */ } catch (...) { /* log unknown error */ } and do the
same pattern around module->await() and module->finalize(); ensure logging
references the module name (from the loop key) so failures are recorded but the
loop continues to attempt shutdown for all modules.
platform/include/modules/subscription.hpp-12-12 (1)

12-12: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove using namespace directive from header file.

Line 12 declares using namespace serving::system; at file scope in a header, which forces all translation units that include this header to import the entire serving::system namespace. This is an anti-pattern that can cause name collisions and ambiguity in client code. Remove the directive and use fully qualified names (e.g., serving::system::channels::Input) or a namespace alias within the implementation.

♻️ Proposed fix
-using namespace serving::system;
-
 namespace serving::modules
 {
 
-using messageHandler_t = std::function<void(const std::shared_ptr<channels::Input>, const channels::Message &)>;
+using messageHandler_t = std::function<void(const std::shared_ptr<system::channels::Input>, const system::channels::Message &)>;

And update line 25:

-  Subscription(const channels::Message::messageType_t type, const std::shared_ptr<channels::Input> edge, const messageHandler_t handler)
+  Subscription(const system::channels::Message::messageType_t type, const std::shared_ptr<system::channels::Input> edge, const messageHandler_t handler)

And similarly for lines 33-34.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/subscription.hpp` at line 12, Remove the file-scope
"using namespace serving::system;" from the header and update all symbol
references in this header to use fully-qualified names (for example replace
channels::Input with serving::system::channels::Input and channels::Output with
serving::system::channels::Output) or introduce a local namespace alias (e.g.,
namespace ss = serving::system;) inside the header scope and use
ss::channels::Input/ss::channels::Output where those types are referenced
(ensure any declarations like Subscription, channel types, and related
template/typedefs are updated accordingly).
platform/include/modules/channelController/module.hpp-78-79 (1)

78-79: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Race condition: hasProducer and hasConsumer must lock before reading.

Lines 78–79 read _actualProducers and _actualConsumers without holding the corresponding mutexes. Concurrent calls to reconcile() (line 169–171) or finalize() (line 113–117) can modify these maps, causing data races and potentially crashes or incorrect results.

🔒 Proposed fix
-  [[nodiscard]] __INLINE__ bool hasProducer(const edgeName_t edgeName) const { return _actualProducers.contains(edgeName); }
-  [[nodiscard]] __INLINE__ bool hasConsumer(const edgeName_t edgeName) const { return _actualConsumers.contains(edgeName); }
+  [[nodiscard]] __INLINE__ bool hasProducer(const edgeName_t edgeName) const
+  {
+    std::lock_guard lock(_producerMutex);
+    return _actualProducers.contains(edgeName);
+  }
+  [[nodiscard]] __INLINE__ bool hasConsumer(const edgeName_t edgeName) const
+  {
+    std::lock_guard lock(_consumerMutex);
+    return _actualConsumers.contains(edgeName);
+  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/channelController/module.hpp` around lines 78 - 79,
hasProducer and hasConsumer read _actualProducers/_actualConsumers without
synchronization; wrap each read in the appropriate mutex lock (the mutex that
protects _actualProducers and the mutex that protects _actualConsumers) to
prevent races with reconcile() and finalize(); update hasProducer(edgeName_t)
and hasConsumer(edgeName_t) to acquire the corresponding lock (e.g., via
std::scoped_lock or the existing lock type used elsewhere) for the duration of
the contains/read operation and then return the result.
platform/examples/include/deployment/helpers.hpp-202-206 (1)

202-206: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add error handling for file I/O and JSON parsing failures.

Lines 202-204 open and parse the deployment configuration file without any error handling. If the file does not exist, cannot be opened, or contains invalid JSON, the program will crash with an unhandled exception rather than providing a clear diagnostic message.

🛡️ Proposed fix
   const std::string servingConfigFilePath = std::string(argv[1]);
   std::ifstream     servingConfigFs(servingConfigFilePath);
+  if (!servingConfigFs.is_open()) {
+    fprintf(stderr, "Error: Cannot open config file '%s'\n", servingConfigFilePath.c_str());
+    instanceManager->abort(-1);
+  }
-  auto              servingConfigJs = nlohmann::json::parse(servingConfigFs);
+  nlohmann::json servingConfigJs;
+  try {
+    servingConfigJs = nlohmann::json::parse(servingConfigFs);
+  } catch (const nlohmann::json::exception &e) {
+    fprintf(stderr, "Error: Failed to parse config file '%s': %s\n", servingConfigFilePath.c_str(), e.what());
+    instanceManager->abort(-1);
+  }
 
   deployment.deserialize(servingConfigJs);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/include/deployment/helpers.hpp` around lines 202 - 206,
Check that argv[1] is provided and validate opening the config file by testing
servingConfigFs.is_open() after constructing servingConfigFs for
servingConfigFilePath and log/exit with a clear message if it fails; then wrap
the JSON parsing and deployment.deserialize(servingConfigJs) in a try-catch
(catching nlohmann::json::parse_error and std::exception) to log the parse error
or deserialize error and exit cleanly, ensuring any thrown exceptions are
converted into user-friendly diagnostics rather than allowing the program to
crash.
platform/examples/include/deployment/helpers.hpp-37-57 (1)

37-57: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Validate that instance IDs and channel IDs fit in the 20-bit key layout.

The defaultChannelKeyBuilder packs sourceInstanceId, targetInstanceId, and channelId into 20-bit fields each (masks (1ull << 20) - 1 at lines 42-44). If any of these values exceed 1048575 (2^20 - 1), the upper bits will be silently truncated, causing different IDs to collide and produce identical keys. This will corrupt channel routing.

🛡️ Proposed fix to add validation
 __INLINE__ serving::system::channels::slotKeys_t defaultChannelKeyBuilder(const HiCR::Instance::instanceId_t        sourceInstanceId,
                                                                        const HiCR::Instance::instanceId_t        targetInstanceId,
                                                                        const serving::system::channels::channelId_t channelId)
 {
+  constexpr key_t MAX_20BIT = (1ull << 20) - 1;
+  if (sourceInstanceId > MAX_20BIT || targetInstanceId > MAX_20BIT || channelId > MAX_20BIT) {
+    HICR_THROW_LOGIC("Instance or channel ID exceeds 20-bit limit (max %lu): src=%ld, dst=%ld, ch=%ld", 
+                     MAX_20BIT, sourceInstanceId, targetInstanceId, channelId);
+  }
   using key_t         = HiCR::GlobalMemorySlot::globalKey_t;
   const key_t src     = (key_t(sourceInstanceId) & ((1ull << 20) - 1)) << 44;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/include/deployment/helpers.hpp` around lines 37 - 57, In
defaultChannelKeyBuilder, validate that sourceInstanceId, targetInstanceId and
channelId each fit within 20 bits before masking/shifting (the current masks
(1ull << 20) - 1 silently truncate higher bits); add explicit checks using the
same symbol names (sourceInstanceId, targetInstanceId, channelId) and return or
assert/fail fast with a clear error if any value > 0xFFFFF, then proceed to
compute src/dst/ch and makeKey as before so no silent collisions occur (refer to
HiCR::GlobalMemorySlot::globalKey_t and the makeKey lambda to locate where to
add the checks).
platform/examples/include/runtime/helpers.hpp-48-48 (1)

48-48: ⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoff

Remove the computeResourceCount parameter or fix the hard-coded validation at line 77.

The function accepts a computeResourceCount parameter (defaulting to 2) that suggests flexibility, but line 77 enforces exactly 2 compute resources with if (runtime.computeResources.size() != 2). This inconsistency will cause runtime exceptions whenever a caller passes any value other than 2, making the parameter misleading.

Either remove the parameter and hard-code collection of 2 resources, or fix the validation to accept the requested count.

Proposed fix option 1: Remove the parameter
-__INLINE__ Runtime makeRuntime(int *argc, char ***argv, const size_t computeResourceCount = 2)
+__INLINE__ Runtime makeRuntime(int *argc, char ***argv)
 {
   Runtime runtime;
 
   hwloc_topology_init(&runtime.hwlocTopologyObject);
   HiCR::backend::hwloc::TopologyManager hwlocTopologyManager(&runtime.hwlocTopologyObject);
 
   const auto topology       = hwlocTopologyManager.queryTopology();
   auto       device         = *topology.getDevices().begin();
   auto       memorySpaces   = device->getMemorySpaceList();
   runtime.bufferMemorySpace = *memorySpaces.begin();
   auto computeResourcesIt   = device->getComputeResourceList().begin();
-  for (size_t i = 0; i < computeResourceCount; i++)
+  for (size_t i = 0; i < 2; i++)
   {
     runtime.computeResources.push_back(*computeResourcesIt);
     ++computeResourcesIt;
   }
Proposed fix option 2: Accept variable counts
   if (runtime.computeResources.size() != 2) { HICR_THROW_RUNTIME("Too many CR"); }
+  if (runtime.computeResources.size() != computeResourceCount) { HICR_THROW_RUNTIME("Compute resource count mismatch"); }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/include/runtime/helpers.hpp` at line 48, The makeRuntime
function signature currently exposes computeResourceCount but the code validates
runtime.computeResources.size() == 2, causing a mismatch; fix by either removing
the computeResourceCount parameter from makeRuntime(...) and keep logic that
builds exactly two resources, or (preferred) make the validation respect the
passed parameter by replacing the hard-coded check with one that compares
runtime.computeResources.size() to computeResourceCount (and ensure resource
collection logic uses computeResourceCount when populating
runtime.computeResources); refer to the makeRuntime function and the
runtime.computeResources.size() validation to locate and update the code
accordingly.
platform/examples/system/engine/engine.cpp-44-52 (1)

44-52: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Compute resource bounds checking needed in engine.cpp and service.cpp.

Both engine.cpp and service.cpp iterate to collect 2 compute resources without verifying the device provides at least that many. The shared root cause is the unchecked iterator advancement pattern. Add bounds validation in both locations to prevent undefined behavior when running on devices with fewer cores.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/system/engine/engine.cpp` around lines 44 - 52, The loop
that unconditionally dereferences and increments computeResourcesIt must
validate iterator bounds: before pushing *computeResourcesIt into
computeResources (and before advancing computeResourcesIt) check that
computeResourcesIt != computeResourcesEnd (or that remaining elements >= 2 via
std::distance) and break or handle the error if insufficient devices; likewise
ensure computeResources is non-empty before using auto computeResource =
*computeResources.begin(). Apply the same guarded-iteration fix to the analogous
loop in service.cpp (the unchecked computeResourcesIt advancement there) so both
places safely handle cases with fewer than two compute resources.
platform/examples/include/channels/helpers.hpp-64-78 (1)

64-78: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add null checks after .lock() calls.

Both createDesiredInput and createDesiredOutput call .lock() on a weak pointer without checking for nullptr. If the weak pointer has expired, dereferencing the result will crash.

🛡️ Proposed fix to add null checks
 __INLINE__ std::shared_ptr<serving::system::channels::Input> createDesiredInput(const std::shared_ptr<serving::modules::channelController::Module> &channelController,
                                                                              const localInput_t                                              &inputInfo,
                                                                              const serving::system::channels::keyBuilderFc_t                    &keyBuilder)
 {
   auto input = channelController->addDesiredConsumer(inputInfo.sourceInstanceId, inputInfo.channelId, *inputInfo.edge, keyBuilder).lock();
+  if (!input) HICR_THROW_LOGIC("Failed to lock desired input for edge '%s'.", inputInfo.edge->getName().c_str());
   return input;
 }
 
 __INLINE__ std::shared_ptr<serving::system::channels::Output> createDesiredOutput(const std::shared_ptr<serving::modules::channelController::Module> &channelController,
                                                                                const localOutput_t                                             &outputInfo,
                                                                                const serving::system::channels::keyBuilderFc_t                    &keyBuilder)
 {
   auto output = channelController->addDesiredProducer(outputInfo.targetInstanceId, outputInfo.channelId, *outputInfo.edge, keyBuilder).lock();
+  if (!output) HICR_THROW_LOGIC("Failed to lock desired output for edge '%s'.", outputInfo.edge->getName().c_str());
   return output;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/include/channels/helpers.hpp` around lines 64 - 78,
createDesiredInput and createDesiredOutput call .lock() on weak_ptr results
without checking for expiration; after calling
channelController->addDesiredConsumer(...).lock() and
addDesiredProducer(...).lock() respectively, check whether the returned
std::shared_ptr is non-null and handle the expired case (e.g., return nullptr or
propagate an error/log) instead of returning/using a null pointer; ensure you
reference the weak-to-shared conversions originating from
channelController->addDesiredConsumer and channelController->addDesiredProducer
in the functions createDesiredInput and createDesiredOutput.
platform/examples/include/channels/helpers.hpp-116-119 (1)

116-119: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add timeout to waitUntilReady to prevent indefinite hangs.

The busy-wait loop has no timeout or escape condition. If the channel never becomes ready (due to a misconfiguration, deadlock, or network partition), this will hang indefinitely and block the caller thread.

⏱️ Proposed fix to add a timeout parameter
-__INLINE__ void waitUntilReady(const std::shared_ptr<serving::system::channels::Base> &channel, const size_t intervalMs = 500)
+__INLINE__ void waitUntilReady(const std::shared_ptr<serving::system::channels::Base> &channel, const size_t intervalMs = 500, const size_t timeoutMs = 30000)
 {
-  while (!channel->isReady()) { std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs)); }
+  const auto start = std::chrono::steady_clock::now();
+  while (!channel->isReady()) {
+    std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs));
+    const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count();
+    if (elapsed > timeoutMs) HICR_THROW_LOGIC("Channel failed to become ready within %lu ms.", timeoutMs);
+  }
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/include/channels/helpers.hpp` around lines 116 - 119, The
waitUntilReady busy-wait needs a timeout: change the function
waitUntilReady(const std::shared_ptr<serving::system::channels::Base>&, const
size_t intervalMs = 500) into a timed version that accepts a timeout parameter
(e.g., size_t timeoutMs, default to a reasonable value like 30000) and make it
return bool (true if channel->isReady() became true within the timeout, false
otherwise); implement using std::chrono::steady_clock to record start time, loop
sleeping intervalMs while checking channel->isReady() and break/return false
when elapsed >= timeoutMs, and update all call sites to handle the boolean
result (or throw a runtime_error at callers that need exceptions).
platform/examples/modules/channelController/telephoneGame.hpp-29-30 (1)

29-30: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add bounded wait/timeout for message polling loops.

Lines 29–30 and Lines 36–37 wait forever if a message never arrives, which can deadlock this example and test runs.

Proposed fix
+  auto waitForMessage = [](serving::system::channels::Input& ch, std::chrono::seconds timeout) {
+    const auto deadline = std::chrono::steady_clock::now() + timeout;
+    while (!ch.hasMessage())
+    {
+      if (std::chrono::steady_clock::now() >= deadline) return false;
+      std::this_thread::sleep_for(std::chrono::milliseconds(50));
+    }
+    return true;
+  };
+
   if (isRoot)
   {
@@
-    while (inputChannel.hasMessage() == false) { std::this_thread::sleep_for(std::chrono::milliseconds(500)); }
+    if (!waitForMessage(inputChannel, std::chrono::seconds(10))) return;
     auto output = inputChannel.getMessage();
@@
   else
   {
-    while (inputChannel.hasMessage() == false) { std::this_thread::sleep_for(std::chrono::milliseconds(500)); }
+    if (!waitForMessage(inputChannel, std::chrono::seconds(10))) return;
     auto input = inputChannel.getMessage();

Also applies to: 36-37

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/modules/channelController/telephoneGame.hpp` around lines
29 - 30, The polling loops that call inputChannel.hasMessage() then sleep
(surrounding inputChannel.getMessage()) must be bounded to avoid deadlock:
replace the infinite-wait pattern with a timed-wait using a deadline
(std::chrono::steady_clock) and a configurable timeout constant (e.g.,
kMessageWaitTimeout), and if the timeout elapses take a deterministic action
(throw an exception or return an error/optional) instead of looping forever;
apply the same change to both occurrences that surround
inputChannel.getMessage() so both loops time out predictably and propagate an
error instead of deadlocking.
platform/examples/modules/channelController/channelController.cpp-36-40 (1)

36-40: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Return immediately after abort to prevent unsafe fallthrough.

On Line 39, control flow continues in source after runtime.instanceManager->abort(-1). If that call ever returns (or is stubbed), Line 42+ executes with invalid CLI state.

Proposed fix
   if (argc != 2)
   {
     fprintf(stderr, "Error: Must provide the config file path.\n");
     runtime.instanceManager->abort(-1);
+    return -1;
   }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/modules/channelController/channelController.cpp` around
lines 36 - 40, The code calls runtime.instanceManager->abort(-1) when argc != 2
but does not stop execution afterwards; update the caller (the containing
function, e.g., main in channelController.cpp) to immediately return (or call
exit) right after runtime.instanceManager->abort(-1) to avoid fallthrough and
any use of invalid CLI state; modify the branch that checks argc and the fprintf
so that after invoking runtime.instanceManager->abort(-1) execution does not
continue (add a return/exit statement directly after the abort call).
platform/include/modules/configuration/replica.hpp-17-17 (1)

17-17: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing reference in JSON constructor parameters in replica.hpp and partition.hpp.

Both Replica (in replica.hpp) and Partition (in partition.hpp) declare constructors that accept const nlohmann::json js by value rather than by const reference. This shared mistake forces a deep copy of the JSON object on every construction. The correct pattern (demonstrated in task.hpp line 15) uses const nlohmann::json &js.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/configuration/replica.hpp` at line 17, The Replica
and Partition constructors take nlohmann::json by value causing an unnecessary
deep copy; update the signatures for Replica(const nlohmann::json js) and
Partition(const nlohmann::json js) to accept a const reference (const
nlohmann::json& js) and adjust any corresponding declarations/definitions to
match (mirror the pattern used in task.hpp's constructor) so the JSON is passed
by const reference instead of copied.
platform/meson.build-73-81 (1)

73-81: ⚠️ Potential issue | 🟠 Major

Wire buildTests into the top-level examples inclusion

  • platform/examples/*/meson.build already uses if get_option('buildTests') to create test(...) targets.
  • However, platform/meson.build only includes subdir('examples') when get_option('buildExamples') is true, so -DbuildTests=true alone won’t reach those files and no tests will be added.

Update the platform/meson.build condition to include examples/ when either buildExamples or buildTests is enabled.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/meson.build` around lines 73 - 81, The current top-level guard only
calls subdir('examples') when get_option('buildExamples') is true, so tests
gated by get_option('buildTests') inside platform/examples/* are never included
when only -DbuildTests=true is passed; update the condition in
platform/meson.build (the block guarded by meson.is_subproject() == false) to
call subdir('examples') when either get_option('buildExamples') or
get_option('buildTests') is true (use a logical OR between
get_option('buildExamples') and get_option('buildTests')) so test targets
defined under platform/examples are added when tests are requested.
platform/extern/HiCR-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

HiCR submodule bump: fix unreachable URL + document the 6eb59ea… change

  • .gitmodules sets platform/extern/HiCR to https://github.com/huawei-csl/HiCR.git.
  • git ls-remote/git fetch against that URL fails with remote: Repository not found, so the updated submodule commit 6eb59ea9e63461c847ec06a2dc7819ba41f509fc can’t be retrieved (and its compatibility/security/breaking impact can’t be reviewed) and CI/build may fail.
  • Update the submodule URL (mirror/internal repo/credentials) so this exact commit is fetchable, then document in the PR what changed in HiCR for 6eb59ea… and why (including any breaking/behavioral and security-relevant impact).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/extern/HiCR` at line 1, .gitmodules points platform/extern/HiCR at
an unreachable URL (https://github.com/huawei-csl/HiCR.git) causing git
ls-remote/git fetch to fail and preventing retrieval of commit
6eb59ea9e63461c847ec06a2dc7819ba41f509fc; update the submodule URL for
platform/extern/HiCR to a reachable mirror/internal repo or add required
credentials so git fetch/ls-remote succeeds, verify the commit is fetchable, and
then update the PR description to document exactly what changed in HiCR at
commit 6eb59ea… including any breaking API/behavioral changes or
security-relevant impacts so reviewers and CI can validate the change.
platform/include/modules/configuration/edge.hpp-23-27 (1)

23-27: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Missing capacity invariant check allows invalid buffer configurations.

_bufferSize can be set larger than _bufferCapacity (constructor, deserialize, and setters). Enforce this invariant at write points to block invalid deployment state early.

Suggested fix
   Edge(const std::string &name, const size_t bufferCapacity, const size_t bufferSize = 0)
     : _name(name),
       _bufferCapacity(bufferCapacity),
       _bufferSize(bufferSize)
-  {}
+  {
+    if (_bufferSize > _bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity.");
+  }
@@
   __INLINE__ void deserialize(const nlohmann::json &js)
   {
@@
     if (js.contains("Buffer Capacity")) _bufferCapacity = hicr::json::getNumber<size_t>(js, "Buffer Capacity");
     _bufferSize = hicr::json::getNumber<size_t>(js, "Buffer Size");
+    if (_bufferSize > _bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity.");
   }
@@
-  __INLINE__ void setBufferCapacity(const size_t bufferCapacity) { _bufferCapacity = bufferCapacity; }
-  __INLINE__ void setBufferSize(const size_t bufferSize) { _bufferSize = bufferSize; }
+  __INLINE__ void setBufferCapacity(const size_t bufferCapacity)
+  {
+    if (_bufferSize > bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity.");
+    _bufferCapacity = bufferCapacity;
+  }
+  __INLINE__ void setBufferSize(const size_t bufferSize)
+  {
+    if (bufferSize > _bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity.");
+    _bufferSize = bufferSize;
+  }

Also applies to: 55-57, 71-72

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/configuration/edge.hpp` around lines 23 - 27, The
Edge class allows creating or updating an instance where _bufferSize >
_bufferCapacity; add validation in the Edge(const std::string&, const size_t,
const size_t) constructor and in all mutators/deserialization paths (the
deserialize method and any setters like setBufferSize and setBufferCapacity) to
enforce _bufferSize <= _bufferCapacity: if the invariant fails, reject the
update by throwing a std::invalid_argument (or returning an error) with a clear
message; when setBufferCapacity reduces capacity, also clamp or reject if
current _bufferSize would exceed the new capacity to maintain the invariant.
platform/include/modules/module.hpp-3-5 (1)

3-5: ⚠️ Potential issue | 🟠 Major

Add explicit <memory> include for std::unique_ptr / std::make_unique

platform/include/modules/module.hpp uses std::make_unique and std::unique_ptr but has no #include <memory>, so it relies on transitive includes (brittle for header self-sufficiency).

Suggested fix
 `#pragma` once
 
 `#include` <hicr/core/definitions.hpp>
+#include <memory>
 `#include` <taskr/service.hpp>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/module.hpp` around lines 3 - 5, The header is using
std::unique_ptr and std::make_unique but doesn't include <memory>, relying on
transitive includes; add an explicit `#include` <memory> to
platform/include/modules/module.hpp (alongside the existing `#include`
<hicr/core/definitions.hpp> and `#include` <taskr/service.hpp>) so symbols like
std::unique_ptr and std::make_unique resolve without depending on other headers.
🟡 Minor comments (1)
platform/examples/modules/channelController/telephoneGame.hpp-24-24 (1)

24-24: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use a portable format for instanceId logging.

printf uses %lu on Lines 24/32/39/42, but HiCR::Instance::instanceId_t is an alias and may not always match unsigned long.

Proposed fix
-    printf("[Instance %lu][TelephoneGame] Sending message: %s\n", instanceId, text.c_str());
+    printf("[Instance %llu][TelephoneGame] Sending message: %s\n",
+           static_cast<unsigned long long>(instanceId), text.c_str());
@@
-    printf("[Instance %lu][TelephoneGame] Received message: %s\n", instanceId, std::string(reinterpret_cast<const char *>(output.getData()), output.getSize()).c_str());
+    printf("[Instance %llu][TelephoneGame] Received message: %s\n",
+           static_cast<unsigned long long>(instanceId),
+           std::string(reinterpret_cast<const char *>(output.getData()), output.getSize()).c_str());

Also applies to: 32-32, 39-39, 42-42

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/modules/channelController/telephoneGame.hpp` at line 24,
The printf calls logging instanceId use "%lu" which is not portable for
HiCR::Instance::instanceId_t; update each printf (the one at Sending message and
the other occurrences) to use "%llu" and cast instanceId to (unsigned long long)
(e.g., printf("[Instance %llu][TelephoneGame] ...", (unsigned long
long)instanceId, ...)) so the format matches a fixed-width wide integer
regardless of the underlying alias; apply the same change for the occurrences
referenced (lines with instanceId in TelephoneGame's prints).
🧹 Nitpick comments (8)
platform/include/system/channels/message.hpp (1)

57-61: ⚡ Quick win

Document pointer ownership and lifetime requirements.

The Message class stores a non-owning const uint8_t* pointer to payload data. There is no documentation indicating that the caller must ensure the data outlives the Message object. Consider adding a comment in the constructor or class documentation clarifying that Message does not take ownership of data and that the caller must manage the data's lifetime. Alternatively, explicitly delete copy/move constructors if copying is not intended, or document the shallow-copy semantics.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/message.hpp` around lines 57 - 61, Document
that Message stores a non-owning pointer and requires the caller to manage the
payload lifetime: add a clear comment on the Message class or the Message(const
uint8_t *const data, const size_t size, const metadata_t metadata) constructor
stating that _data is non-owning, Message does not free or extend the buffer,
and the caller must ensure the buffer outlives the Message instance;
alternatively, if shallow-copying is undesirable, explicitly delete or define
copy/move constructors/operators for Message (referencing class Message, members
_data/_size/_metadata) or switch to an owning type (e.g.,
std::vector/std::shared_ptr) if ownership should be transferred.
platform/include/system/channels/base.hpp (1)

123-124: ⚡ Quick win

Prefer RAII lock guards over raw lock/unlock.

Exposing raw lock() and unlock() methods is error-prone; if an exception is thrown between lock and unlock, the mutex remains locked, causing a deadlock. Consider removing these methods and requiring callers to use std::lock_guard<std::mutex> or std::unique_lock<std::mutex> with a public accessor to the mutex, or provide a scoped lock helper method.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/system/channels/base.hpp` around lines 123 - 124, The public
lock()/unlock() methods on the class are unsafe—replace them with an
RAII-friendly API: remove __INLINE__ void lock() and unlock(), expose the
underlying mutex via a safe accessor (e.g., std::mutex& mutex() or const
std::mutex& native_mutex()) and/or add a scoped helper (e.g., a template
with_lock(Func&&) that acquires a std::lock_guard<std::mutex> and invokes the
callback). Update callers to use std::lock_guard<std::mutex> or the new
with_lock helper instead of calling lock()/unlock() directly (refer to the lock,
unlock symbols and the mutex member in this header to locate changes).
platform/include/modules/channelController/module.hpp (1)

153-160: 💤 Low value

Clarify the error message for unconfigured communication managers.

Line 158 throws a logic error if entry.communicationManager is not in exchangeMap. Because exchangeMap is pre-populated from _communicationManagersInOrder (line 154), this error means a channel was configured with a manager not in the ordered list. The current message "is not present in configured manager order" is correct but could be more explicit.

📝 Suggested improvement
-        if (exchangeMap.contains(entry.communicationManager) == false) HICR_THROW_LOGIC("Memory slot exchange manager is not present in configured manager order.");
+        if (exchangeMap.contains(entry.communicationManager) == false)
+          HICR_THROW_LOGIC("Channel references communication manager %p which is not in the configured manager order provided to the channel controller.", entry.communicationManager);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/channelController/module.hpp` around lines 153 -
160, The HICR_THROW_LOGIC call when checking
exchangeMap.contains(entry.communicationManager) should use a clearer error
message: update the message emitted from the check in the loop that processes
memorySlotsToExchange to state that the memory slot references a communication
manager that was not included in _communicationManagersInOrder (include the
manager pointer or identifier from entry.communicationManager and the affected
global key or memorySlot for context). Modify the HICR_THROW_LOGIC invocation so
it mentions exchangeMap, entry.communicationManager and the related entry (e.g.,
entry.globalKey) to make the failure explicit.
platform/include/modules/broadcastDeployment/module.hpp (1)

120-120: ⚡ Quick win

Replace C-style cast with static_cast or reinterpret_cast.

Line 120 uses (char *) which is a C-style cast. Modern C++ prefers explicit casts for clarity and type safety.

♻️ Proposed fix
-    std::string deploymentString = (char *)returnValue->getPointer();
+    std::string deploymentString = static_cast<char *>(returnValue->getPointer());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/broadcastDeployment/module.hpp` at line 120, Replace
the C-style cast "(char *)" when building deploymentString from
returnValue->getPointer(); use an explicit C++ cast: if getPointer() returns
void* use reinterpret_cast<char*>(returnValue->getPointer()), otherwise if it
returns a compatible pointer type use
static_cast<char*>(returnValue->getPointer()); update the expression that
constructs std::string deploymentString accordingly (referencing
deploymentString and returnValue->getPointer()) to eliminate the C-style cast.
platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp (3)

79-79: 💤 Low value

Consider using a fixed maximum sleep duration to prevent excessive delays.

The sleep duration is instanceId seconds, which works for small, contiguous instance IDs (0, 1, 2), but if instance IDs are large or non-contiguous (e.g., 100, 200, 300), this will cause very long delays. For example code, consider capping the sleep or using modulo to bound the delay.

♻️ Proposed fix to cap sleep duration
-  std::this_thread::sleep_for(std::chrono::seconds(instanceId)); // Sleep a bit to ensure the output is not mixed
+  std::this_thread::sleep_for(std::chrono::seconds(instanceId % 10)); // Cap at 10 seconds to prevent excessive delays
   printf("[Instance %lu] Received deployment configuration:\n%s\n", instanceId, receivedDeployment.serialize().dump(2).c_str());
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp` at
line 79, The current sleep uses
std::this_thread::sleep_for(std::chrono::seconds(instanceId)) which can block
excessively for large or non-contiguous instanceId values; change it to bound
the delay (e.g., compute a delay = std::min(instanceId, MAX_DELAY_SECONDS) or
use instanceId % MAX_DELAY_SECONDS) and sleep for that bounded number of seconds
so outputs remain staggered but never delayed beyond a fixed maximum; update the
call site where std::this_thread::sleep_for is invoked with instanceId to use
the bounded delay.

51-60: 💤 Low value

Use the already-determined isRoot flag instead of re-checking instance ID equality.

Lines 30-32 determine isRoot, instanceId, and deployerInstanceId, but line 51 re-checks instanceId == deployerInstanceId instead of reusing the isRoot flag. This creates an unnecessary second comparison and potential inconsistency if isRootInstance() logic differs from ID equality.

♻️ Proposed simplification
   std::shared_ptr<serving::modules::broadcastDeployment::Module> broadcastDeploymentModule;
-  if (instanceId == deployerInstanceId)
+  if (isRoot)
   {
     broadcastDeploymentModule = std::make_shared<serving::modules::broadcastDeployment::Module>(
       runtime.instanceManager, runtime.taskComputeManager, runtime.rpcEngine, deployerInstanceId, instanceId, deployment);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp` around
lines 51 - 60, Replace the explicit comparison instanceId == deployerInstanceId
with the already-computed isRoot flag when choosing which Module constructor to
call: use isRoot to decide whether to construct
serving::modules::broadcastDeployment::Module with the deployment argument or
without it, updating the conditional around the broadcastDeploymentModule
creation so it references isRoot (and still passes runtime.instanceManager,
runtime.taskComputeManager, runtime.rpcEngine, deployerInstanceId, instanceId)
instead of re-checking IDs.

62-62: Clarify getDeployment() ordering for non-deployer instances (broadcastDeployment.cpp): Module’s non-deployer constructor default-constructs _deployment, and _deployment is only populated in initialize() via retrieveDeployment(); calling serialize() on an uninitialized deployment would be unsafe because Deployment::serialize() dereferences _requestManager. In this example, receivedDeployment is only serialized after serving.initialize()/serving.await(), and retrieveDeployment() runs inside initialize(), so this ordering is currently safe—but moving the receivedDeployment line below serving.initialize() would prevent future regressions and better documents the dependency.

Optional refactor
   }
 
-  const auto &receivedDeployment = broadcastDeploymentModule->getDeployment();
   // Adding broadcast deployment module to serving
   serving.addModule("BroadcastDeployment", broadcastDeploymentModule);
 
   // Initializing serving
   serving.initialize();
+
+  const auto &receivedDeployment = broadcastDeploymentModule->getDeployment();
 
   // Running serving
   serving.run();
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp` at
line 62, The code calls broadcastDeploymentModule->getDeployment() (stored as
receivedDeployment) before the module's initialize() has run, which can leave
Deployment uninitialized and make Deployment::serialize() unsafe because it
dereferences _requestManager; move the line that obtains receivedDeployment (and
any subsequent serialize/use) to after serving.initialize() and serving.await()
so retrieveDeployment() runs (via Module::initialize()) and _deployment is
populated before use, ensuring safe calls to Deployment::serialize().
platform/include/modules/configuration/deployment.hpp (1)

86-88: 💤 Low value

Redundant getter method.

getControlBufferConst() (line 88) returns the same const reference as getControlBuffer() const (line 86). Consider removing the redundant method to simplify the API surface.

♻️ Simplify by removing redundant method
   [[nodiscard]] __INLINE__ const auto &getControlBuffer() const { return _settings.controlBuffer; }
   [[nodiscard]] __INLINE__ auto       &getControlBuffer() { return _settings.controlBuffer; }
-  [[nodiscard]] __INLINE__ auto       &getControlBufferConst() const { return _settings.controlBuffer; }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/configuration/deployment.hpp` around lines 86 - 88,
Remove the redundant getter getControlBufferConst() which returns the same const
reference as getControlBuffer() const; delete the getControlBufferConst()
declaration/definition and replace any call-sites using getControlBufferConst()
with getControlBuffer() (the const overload), ensuring references to
_settings.controlBuffer continue to use the const-qualified getter when needed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2e3d807d-551e-47a6-b194-c3b5186df012

📥 Commits

Reviewing files that changed from the base of the PR and between 36fbdb8 and 1fb79ea.

📒 Files selected for processing (48)
  • .gitignore
  • .gitmodules
  • platform/.clang-format
  • platform/docs/README.md
  • platform/docs/modules/README.md
  • platform/docs/modules/broadcast-deployment.md
  • platform/docs/modules/channel-controller.md
  • platform/docs/modules/configuration.md
  • platform/docs/modules/service.md
  • platform/docs/system/README.md
  • platform/docs/system/channels.md
  • platform/docs/system/engine.md
  • platform/examples/include/channels/helpers.hpp
  • platform/examples/include/deployment/helpers.hpp
  • platform/examples/include/runtime/helpers.hpp
  • platform/examples/meson.build
  • platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp
  • platform/examples/modules/broadcastDeployment/meson.build
  • platform/examples/modules/broadcastDeployment/policy.json
  • platform/examples/modules/channelController/channelController.cpp
  • platform/examples/modules/channelController/meson.build
  • platform/examples/modules/channelController/policy.json
  • platform/examples/modules/channelController/telephoneGame.hpp
  • platform/examples/modules/service/meson.build
  • platform/examples/modules/service/service.cpp
  • platform/examples/system/engine/engine.cpp
  • platform/examples/system/engine/meson.build
  • platform/extern/HiCR
  • platform/extern/TaskR
  • platform/include/modules/broadcastDeployment/module.hpp
  • platform/include/modules/channelController/module.hpp
  • platform/include/modules/configuration/deployment.hpp
  • platform/include/modules/configuration/edge.hpp
  • platform/include/modules/configuration/partition.hpp
  • platform/include/modules/configuration/replica.hpp
  • platform/include/modules/configuration/requestManager.hpp
  • platform/include/modules/configuration/task.hpp
  • platform/include/modules/module.hpp
  • platform/include/modules/service/module.hpp
  • platform/include/modules/subscription.hpp
  • platform/include/system/channels/base.hpp
  • platform/include/system/channels/input.hpp
  • platform/include/system/channels/message.hpp
  • platform/include/system/channels/messageTypeRegistry.hpp
  • platform/include/system/channels/output.hpp
  • platform/include/system/engine.hpp
  • platform/meson.build
  • platform/meson_options.txt

Comment thread .gitmodules Outdated
Comment thread platform/examples/include/runtime/helpers.hpp Outdated
Comment thread platform/examples/include/runtime/helpers.hpp Outdated
Comment thread platform/include/modules/configuration/deployment.hpp Outdated
Comment thread platform/include/modules/configuration/deployment.hpp Outdated
Comment thread platform/include/modules/configuration/deployment.hpp Outdated
Comment thread platform/include/modules/configuration/requestManager.hpp Outdated
Comment thread platform/include/system/engine.hpp
Comment thread platform/meson.build Outdated
@lterrac lterrac changed the title WIP: feat/platform feat: add platform skeleton — Engine, Module base, and channel primitives Jun 26, 2026
lterrac added a commit that referenced this pull request Jun 26, 2026
- meson.build: flatten compile_args array (['-Wfatal-errors'] +
  warningAsErrorFlags) to avoid nested array in declare_dependency
- meson.build: remove dangling endif and orphaned comment left over
  from examples removal
- .gitmodules: remove branch = master from HiCR entry to prevent
  accidental drift on git submodule update --remote
- engine.hpp: document RPC lambda lifetime contract (HiCR has no
  removeRPCTarget API; callers must terminate/await before destruction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@lterrac

lterrac commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator Author

Only thing missing is wiring the code into the CI, going to work on it right now

@lterrac lterrac marked this pull request as ready for review June 26, 2026 10:00
@lterrac

lterrac commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@gemini-code-assist review

Triggering fresh reviews — the branch was substantially rewritten since the original reviews. The PR is now scoped to the platform skeleton only (Engine, Module base, channel primitives, system engine example, build scaffolding). Several files that attracted comments (configuration types, module examples, telephoneGame) are no longer in this PR.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

@lterrac Thanks for the heads-up — I’ll do a fresh full review of #34 using the updated scope and ignore earlier comments tied to files that are no longer part of this PR.


✏️ Learnings added
Learnt from: lterrac
URL: https://github.com/hw-native-sys/pypto-serving/pull/34

Timestamp: 2026-06-30T13:39:40.986Z
Learning: In repository `hw-native-sys/pypto-serving`, Pull Request `#34` was substantially rewritten and is currently scoped to the platform skeleton only under `platform/`, specifically `serving::system::Engine`, the `serving::modules::Module` base, channel primitives, the system engine example, and build scaffolding; earlier-reviewed areas such as configuration types, module examples, and `telephoneGame` are no longer in scope for this PR.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

✅ Action performed

Full review finished.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the C++ platform-management layer for PyPTO Serving, implementing a distributed engine, module lifecycle management, and SPSC-oriented communication channels that separate data payloads from metadata. The review feedback identifies several critical issues: a high risk of hash collisions in the 10-bit MessageTypeRegistry, fragility in the asymmetric engine lifecycle, an unmitigated busy-wait loop in pushMessageLocking when channels are full, a race condition where background service threads can invoke virtual methods on destroyed modules, potential ID corruption from unmasked bit-width overflows in Message::getId(), and missing null-pointer checks on memory allocations in the channel base class.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread platform/include/system/channels/messageTypeRegistry.hpp
Comment thread platform/include/system/engine.hpp
Comment thread platform/include/system/channels/output.hpp
Comment thread platform/include/modules/module.hpp
Comment thread platform/include/system/channels/message.hpp
Comment thread platform/include/system/channels/base.hpp
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

@lterrac Thanks — I’ll do a fresh full review of #34 using the updated scope only: the platform skeleton under platform/ (serving::system::Engine, serving::modules::Module, channel primitives, the system engine example, and build scaffolding), and I’ll ignore earlier comments tied to files that are no longer part of this PR.

(full review mode) 🐇

✅ Action performed

Full review finished.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

♻️ Duplicate comments (1)
.gitmodules (1)

4-9: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Fix broken submodule URLs — repositories have moved.

The .gitmodules URLs still point to non-existent huawei-csl organization repositories. Verification confirmed these repos live under Algebraic-Programming:

  • https://github.com/Algebraic-Programming/HiCR.git
  • https://github.com/Algebraic-Programming/TaskR.git

Update both URLs to prevent clone failures.

🔧 Proposed fix
 [submodule "platform/extern/HiCR"]
 	path = platform/extern/HiCR
-	url = https://github.com/huawei-csl/HiCR.git
+	url = https://github.com/Algebraic-Programming/HiCR.git
 [submodule "platform/extern/TaskR"]
 	path = platform/extern/TaskR
-	url = https://github.com/huawei-csl/TaskR.git
+	url = https://github.com/Algebraic-Programming/TaskR.git
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.gitmodules around lines 4 - 9, Update the submodule URLs in .gitmodules for
the HiCR and TaskR entries because the current huawei-csl locations are broken.
Change the URL values for the platform/extern/HiCR and platform/extern/TaskR
submodules to the corresponding Algebraic-Programming GitHub repositories so
cloning and submodule initialization work again.
🧹 Nitpick comments (3)
platform/docs/README.md (1)

15-24: 📐 Maintainability & Code Quality | 🔵 Trivial

Source layout includes deferred directories.

The include/modules/configuration/ directory listed at line 18 is explicitly deferred per PR objectives. Consider removing it from the source layout or marking it as planned but not yet present to avoid confusion.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/docs/README.md` around lines 15 - 24, The source layout in README
still presents include/modules/configuration/ as part of the current tree even
though it is deferred. Update the documentation in the platform README to either
remove that directory from the listed layout or clearly mark it as planned/not
yet present, keeping the wording consistent with the other layout entries in the
README section.
platform/meson_options.txt (1)

1-6: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Remove trailing whitespace on line 3.

choices : [ 'mpi' ], 

Trailing space after the comma.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/meson_options.txt` around lines 1 - 6, Remove the trailing
whitespace in the meson option definition for option('engines'); the issue is
the extra space after the comma in the choices array. Update the option
declaration so the choices line is clean and consistent with the surrounding
Meson formatting, keeping the option('engines') block otherwise unchanged.
platform/include/modules/module.hpp (1)

3-5: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Make this public header self-contained.

Module uses size_t, std::unique_ptr, and std::make_unique, but those headers are only available transitively here. That makes the header brittle to include-order changes in HiCR/TaskR.

Suggested change
 `#pragma` once
 
+#include <cstddef>
+#include <memory>
+
 `#include` <hicr/core/definitions.hpp>
 `#include` <taskr/service.hpp>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@platform/include/modules/module.hpp` around lines 3 - 5, The public header
for Module is not self-contained because it relies on transitive includes for
size_t, std::unique_ptr, and std::make_unique. Update the Module header itself
to include the standard headers that provide those symbols so its declarations
and inline usage compile regardless of include order, and keep the existing
HiCR/TaskR includes for the module types it references.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/ci.yml:
- Around line 84-87: The recursive checkout in the CI workflow leaves Git
credentials available to PR-controlled code, so update the actions/checkout@v4
step to disable credential persistence. In the checkout configuration, add
persist-credentials: false alongside fetch-depth and submodules, and only keep
credentials enabled if a later step in this job genuinely requires authenticated
git access.

In `@platform/docs/README.md`:
- Around line 28-32: The platform README is overstating what this PR implements
and links to deferred module docs. Update the overview text and the module list
in README so they only describe the pieces actually included in this change, and
remove or rewrite references to the deferred configuration, broadcast
deployment, channel controller, service, TaskR, and examples scope. Keep the
wording aligned with the current PR scope so readers are not told that those
modules or deployment graph pieces already exist.

In `@platform/docs/system/channels.md`:
- Around line 16-22: Clarify the concurrency model in channels.md by making the
Base description, pushMessageLocking(message), and the “Channels are
SPSC-oriented” claim consistent. If the implementation is truly SPSC, revise the
mutex wording to explain its limited role outside the data path; if locking is
part of normal send/receive behavior, update the documentation to describe the
channels as MPMC or otherwise non-SPSC. Use the Base and
pushMessageLocking(message) symbols to align the wording across the affected
sections.

In `@platform/examples/system/engine/engine.cpp`:
- Around line 32-52: The device-selection block in engine.cpp blindly
dereferences topology-derived iterators and even advances through two compute
resources even though only one is used. In the code around
topology.getDevices(), getMemorySpaceList(), and getComputeResourceList(), add
guards for empty device, memory-space, and compute-resource lists before
dereferencing, and stop assuming at least two cores; select the first available
compute resource safely (or bail out/log when unavailable) so the example does
not crash on filtered or single-core topologies.

In `@platform/include/modules/subscription.hpp`:
- Around line 12-17: Remove the namespace-wide import from this public header:
`using namespace serving::system;` in `subscription.hpp` is leaking all of
`serving::system` into every includer. Update the `messageHandler_t` declaration
and any other uses in `serving::modules` to reference
`serving::system::channels::Input` and `serving::system::channels::Message`
explicitly, or replace the broad import with a narrow alias scoped inside
`serving::modules`.

In `@platform/include/system/channels/base.hpp`:
- Around line 104-130: The Base::initialize flow is the only place that
populates the shared memory slots before createChannels(), so the public channel
operations in Input::{hasMessage,getMessage,popMessage} and
Output::{isFull,pushMessage} must not run before readiness is established. Add a
protected readiness guard in Base using isReady()/_isReady, and have those
methods check it before touching any channel handles or dereferencing pointers;
if not ready, fail fast in a consistent way. Use the existing Base, Input, and
Output symbols to wire the check into the shared channel access path.

In `@platform/include/system/channels/message.hpp`:
- Around line 48-50: The getId() packing logic in message::getId() directly
shifts raw type, groupId, and sequenceId values without validating bounds, so
out-of-range metadata can bleed into adjacent fields. Update getId() to either
validate each field before packing or mask them to their expected bit widths,
and use the existing messageId_t packing expression in message.hpp as the place
to enforce this behavior.

In `@platform/include/system/channels/output.hpp`:
- Around line 66-76: Make pushMessageLocking() exception-safe by replacing the
manual lock/unlock flow with RAII so the mutex is always released even if
pushMessage(message) throws. Update the pushMessageLocking method in output.hpp
to use std::unique_lock (or an equivalent scoped lock) around the full/isFull
loop and the pushMessage call, and remove the raw lock(), unlock() handling to
avoid leaving the channel permanently locked.

In `@platform/include/system/engine.hpp`:
- Around line 46-50: The addModule registration path currently allows a null
shared_ptr to be stored, which later gets dereferenced in initialize(), start(),
and await(). Update addModule(const std::string &name,
std::shared_ptr<modules::Module> module) to validate module before inserting it
into _modules and throw a logic error when it is null, alongside the existing
duplicate-name check.

In `@platform/meson_options.txt`:
- Around line 8-10: The buildExamples meson option is currently defined but not
used by the build logic, so either remove it from meson_options.txt or wire it
into the example build path. Update platform/meson.build so the examples subdir
is gated by get_option('buildExamples') instead of buildTests, using the
buildExamples option name consistently.

---

Duplicate comments:
In @.gitmodules:
- Around line 4-9: Update the submodule URLs in .gitmodules for the HiCR and
TaskR entries because the current huawei-csl locations are broken. Change the
URL values for the platform/extern/HiCR and platform/extern/TaskR submodules to
the corresponding Algebraic-Programming GitHub repositories so cloning and
submodule initialization work again.

---

Nitpick comments:
In `@platform/docs/README.md`:
- Around line 15-24: The source layout in README still presents
include/modules/configuration/ as part of the current tree even though it is
deferred. Update the documentation in the platform README to either remove that
directory from the listed layout or clearly mark it as planned/not yet present,
keeping the wording consistent with the other layout entries in the README
section.

In `@platform/include/modules/module.hpp`:
- Around line 3-5: The public header for Module is not self-contained because it
relies on transitive includes for size_t, std::unique_ptr, and std::make_unique.
Update the Module header itself to include the standard headers that provide
those symbols so its declarations and inline usage compile regardless of include
order, and keep the existing HiCR/TaskR includes for the module types it
references.

In `@platform/meson_options.txt`:
- Around line 1-6: Remove the trailing whitespace in the meson option definition
for option('engines'); the issue is the extra space after the comma in the
choices array. Update the option declaration so the choices line is clean and
consistent with the surrounding Meson formatting, keeping the option('engines')
block otherwise unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 6edbbc1b-f82d-47a9-b466-466e3b5ef00e

📥 Commits

Reviewing files that changed from the base of the PR and between 0b0d8a0 and 831b346.

📒 Files selected for processing (23)
  • .github/workflows/ci.yml
  • .gitignore
  • .gitmodules
  • platform/.clang-format
  • platform/docs/README.md
  • platform/docs/system/README.md
  • platform/docs/system/channels.md
  • platform/docs/system/engine.md
  • platform/examples/meson.build
  • platform/examples/system/engine/engine.cpp
  • platform/examples/system/engine/meson.build
  • platform/extern/HiCR
  • platform/extern/TaskR
  • platform/include/modules/module.hpp
  • platform/include/modules/subscription.hpp
  • platform/include/system/channels/base.hpp
  • platform/include/system/channels/input.hpp
  • platform/include/system/channels/message.hpp
  • platform/include/system/channels/messageTypeRegistry.hpp
  • platform/include/system/channels/output.hpp
  • platform/include/system/engine.hpp
  • platform/meson.build
  • platform/meson_options.txt

Comment thread .github/workflows/ci.yml
Comment thread platform/docs/README.md Outdated
Comment thread platform/docs/system/channels.md Outdated
Comment thread platform/examples/system/engine/engine.cpp
Comment thread platform/include/modules/subscription.hpp Outdated
Comment thread platform/include/system/channels/base.hpp
Comment thread platform/include/system/channels/message.hpp Outdated
Comment thread platform/include/system/channels/output.hpp Outdated
Comment thread platform/include/system/engine.hpp
Comment thread platform/meson_options.txt
@lterrac

lterrac commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator Author

@coderabbitai review

@lterrac

lterrac commented Jun 30, 2026

Copy link
Copy Markdown
Collaborator Author

@gemini-code-assist review

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown
✅ Action performed

Review finished.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces the C++ platform-management layer for PyPTO Serving, which includes module lifecycle management, SPSC channel primitives, and cross-instance RPC control. The review feedback highlights several critical issues: potential race conditions from premature memory slot deregistration in Output::pushMessage, high CPU utilization caused by std::this_thread::yield() in the engine's polling loop, and blocking calls in worker instances that could prevent proper termination. Additionally, there are concerns regarding thread safety on the SPSC channel due to un-locked public push methods, potential silent ID corruption in release builds when assertions are compiled out, and missing null pointer checks after memory allocations in the channel base class.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread platform/include/system/channels/output.hpp
Comment thread platform/include/system/engine.hpp
Comment thread platform/include/system/engine.hpp
Comment thread platform/include/system/channels/output.hpp
Comment thread platform/include/system/channels/message.hpp
Comment thread platform/include/system/channels/base.hpp
@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Copy link
Copy Markdown

Caution

Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted.

Error details
{}

lterrac added a commit that referenced this pull request Jul 1, 2026
- meson.build: flatten compile_args array (['-Wfatal-errors'] +
  warningAsErrorFlags) to avoid nested array in declare_dependency
- meson.build: remove dangling endif and orphaned comment left over
  from examples removal
- .gitmodules: remove branch = master from HiCR entry to prevent
  accidental drift on git submodule update --remote
- engine.hpp: document RPC lambda lifetime contract (HiCR has no
  removeRPCTarget API; callers must terminate/await before destruction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lterrac added a commit that referenced this pull request Jul 1, 2026
- meson.build: flatten compile_args array (['-Wfatal-errors'] +
  warningAsErrorFlags) to avoid nested array in declare_dependency
- meson.build: remove dangling endif and orphaned comment left over
  from examples removal
- .gitmodules: remove branch = master from HiCR entry to prevent
  accidental drift on git submodule update --remote
- engine.hpp: document RPC lambda lifetime contract (HiCR has no
  removeRPCTarget API; callers must terminate/await before destruction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lterrac and others added 4 commits July 1, 2026 12:17
…ives

Introduces the foundational control-plane layer for the platform module
(ref #32, #13). Scoped to static infrastructure only per the agreed first
milestone: module lifecycle interface, the system Engine, host-side channel
primitives (Input/Output/Message/MessageTypeRegistry), and HiCR submodule.

Configuration types, concrete modules (broadcastDeployment, channelController,
service), and examples are deferred to a follow-up PR pending resolution of
DeploymentSpec ownership (see #32 discussion).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- meson.build: flatten compile_args array (['-Wfatal-errors'] +
  warningAsErrorFlags) to avoid nested array in declare_dependency
- meson.build: remove dangling endif and orphaned comment left over
  from examples removal
- .gitmodules: remove branch = master from HiCR entry to prevent
  accidental drift on git submodule update --remote
- engine.hpp: document RPC lambda lifetime contract (HiCR has no
  removeRPCTarget API; callers must terminate/await before destruction)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the engine example (examples/system/engine/) which exercises the
full Engine + module lifecycle over MPI — the minimal runnable
demonstration of the platform skeleton.

Wires the example into the build under buildTests=true and registers it
as a meson test suite ('examples'). The CI platform-build job:
  - shallow-clones with submodules (fetch-depth: 1, submodules: recursive)
  - installs meson, ninja, libhwloc-dev, libboost-context-dev, libopenmpi-dev
  - configures with -Dengines=mpi -DbuildTests=true
  - builds with ninja
  - runs meson test --suite examples

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
module.hpp uses taskr::Service as the base Module's optional
periodic-service mechanism, so TaskR is a core dependency of the
platform skeleton, not just the concrete service module.

Restores the TaskR subproject in meson.build and the submodule entry
in .gitmodules (without a branch= line, consistent with HiCR).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
lterrac and others added 10 commits July 1, 2026 12:18
.gitmodules had the TaskR entry but the submodule gitlink was missing
from the git index (lost during the earlier branch reset), so CI could
not clone it with submodules: recursive. Re-registers the pinned commit.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
huawei-csl/TaskR does not exist publicly; the correct repo is
https://github.com/Algebraic-Programming/TaskR.git

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- output: add std::this_thread::yield() in pushMessageLocking busy-wait
  to avoid spinning the CPU while the channel is full
- message: expand TYPE_BITS 10→16, GROUP_BITS 20→14 (total stays 64-bit)
  to reduce FNV-1a collision probability from ~8% to ~0.003% for 20 types;
  add assert(isValid()) in getId() to catch field-width violations early
- messageTypeRegistry: keep FNV-1a for cross-node determinism (sequential
  IDs would break nodes with different module sets); remove stale comment

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yield()/sched_yield()/pthread_yield() are identical on Linux and all
yield the OS thread, not the TaskR coroutine. When called from within a
TaskR task (boost::context coroutine), this blocks the whole worker
thread from picking up other tasks. A 1µs sleep works correctly in all
execution contexts and actually gives the SPSC consumer time to drain
the channel rather than immediately retrying.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No step in platform-build uses authenticated git access after checkout,
and all submodules are public. persist-credentials: false prevents the
GitHub token from being written into the local git config where
PR-controlled code could read it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/README.md: remove links to untracked module docs (configuration,
service, channel-controller, broadcast-deployment); rewrite the
'implemented pieces' block to only describe what is actually in this PR
(engine lifecycle, module base, channel primitives)

docs/system/channels.md: clarify mutex role — it is used exclusively by
pushMessageLocking to serialize competing callers, not the SPSC data
path itself; update bit widths (type: 16-bit, group: 14-bit) to match
current message.hpp; expand FNV-1a description to explain cross-node
determinism

examples/system/engine/engine.cpp: guard all three topology-derived
iterator dereferences (devices, memory spaces, compute resources) with
empty-list checks; remove the dead 2-core loop since only the first
compute resource was ever used

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
channels/base.hpp: add protected checkReady() that throws before any
channel op if initialize() was not called; move _lock to protected so
subclasses can use std::unique_lock directly

channels/input.hpp: call checkReady() at the top of hasMessage() —
getMessage() and popMessage() both call hasMessage() first so they are
covered transitively

channels/output.hpp: call checkReady() at the top of isFull(); rewrite
pushMessageLocking() using std::unique_lock<std::mutex> so the lock is
always released even if pushMessage() throws

modules/subscription.hpp: replace file-scope `using namespace
serving::system` (which leaked into every including TU) with a
namespace alias scoped inside serving::modules

docs/README.md: remove include/modules/configuration/ from the source
layout (deferred, not present in this PR)

meson_options.txt: strip trailing whitespace

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… encapsulation

engine.hpp: replace std::this_thread::yield() with sleep_for(1ms) in
the await() RPC polling loop; yield() re-schedules immediately and
pegs the core at 100% when idle

output.hpp: move pushMessage to private; it was only ever called from
pushMessageLocking (which holds _lock). A concurrent caller mixing
pushMessage and pushMessageLocking would race on the SPSC channel —
making it private prevents that class of misuse entirely

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
assert(isValid()) is compiled out under NDEBUG, leaving overflow
unchecked in release. Explicitly mask each field to its MAX before
shifting so out-of-range bits can never bleed into adjacent fields.
assert is kept for debug builds to catch invalid values early.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant