feat: add platform skeleton — Engine, Module base, and channel primitives#34
feat: add platform skeleton — Engine, Module base, and channel primitives#34lterrac wants to merge 14 commits into
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughIntroduces the ChangesPlatform Foundation
Sequence Diagram(s)sequenceDiagram
participant Deployer as Engine (deployer)
participant RPCEngine
participant Workers as Engine (workers N-1)
rect rgba(70, 130, 180, 0.5)
note over Deployer,Workers: initialize
Deployer->>Deployer: initialize() — all modules
Workers->>Workers: initialize() — all modules
end
rect rgba(60, 179, 113, 0.5)
note over Deployer,Workers: run
Deployer->>RPCEngine: broadcast "Start Instances" RPC
RPCEngine->>Workers: invoke start()
Workers->>Workers: run modules, enter listen()
Deployer->>Deployer: start() — run modules
end
rect rgba(210, 105, 30, 0.5)
note over Deployer,Workers: terminate
Deployer->>RPCEngine: broadcast "Stop Instances" RPC
RPCEngine->>Workers: invoke stop() — _isRunning=false
Deployer->>Deployer: stop() locally
Deployer->>Deployer: terminate/await/finalize modules
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a C++ platform-management layer for PyPTO Serving, establishing core infrastructure for distributed system bootstrap, deployment graph configuration, and SPSC channel-based communication. The review feedback highlights several critical issues, including undefined behavior from passing std::string directly to variadic functions with %s format specifiers, potential queue starvation in the telephone game example due to unpopped messages, risk of crashes from dereferencing empty containers, high CPU usage from busy-waiting in the engine's await loop, and standard format specifier mismatches for size_t values.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Actionable comments posted: 9
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (22)
platform/include/system/engine.hpp-91-104 (1)
91-104:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftDeployer terminates without confirming worker shutdown.
The
terminate()method broadcasts stop RPCs to worker instances (lines 96-100) and immediately stops itself (line 102), but does not wait for acknowledgment that workers have processed the stop RPC or completed their shutdown sequence. If the deployer instance exits before workers receive or process the stop RPC, the distributed system may be left in an inconsistent state. Consider adding a barrier or acknowledgment mechanism to ensure all workers have shut down before the deployer finalizes.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/engine.hpp` around lines 91 - 104, The terminate() method currently broadcasts stop RPCs via _rpcEngine->requestRPC(...) to instances from _instanceManager->getInstances() and immediately calls stop() on the deployer; change this to send stop requests and then wait for confirmation from each worker (either by using RPC responses/ack callbacks, a barrier RPC, or polling instance state via _instanceManager) before calling stop() so the deployer does not exit until all workers have acknowledged shutdown; use __SERVING_SYSTEM_STOP_RPC_NAME as the RPC and add a configurable timeout and logging for any non-responsive instances to avoid blocking forever.platform/include/system/channels/output.hpp-57-89 (1)
57-89:⚠️ Potential issue | 🟠 Major | ⚡ Quick winConst-correctness issue in
isFull()andpushMessage().Both
isFull()(line 57) andpushMessage()(line 79) are markedconst, yet they modify channel state by callingupdateDepth()andpush()on the underlying HiCR channels. If the channel member pointers areconst, this compiles only because the methods modify the pointed-to object, not the pointer itself. However, this violates const semantics: const methods should not modify observable object state. Removeconstfrom these methods or document that the constness applies only to theOutputobject's direct members, not the channel state.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/output.hpp` around lines 57 - 89, The methods isFull and pushMessage are incorrectly declared const because they call non-const operations on the underlying channels (updateDepth, push) and therefore modify observable state; remove the const qualifier from the signatures of isFull(const size_t) and pushMessage(const Message) (and adjust any related declarations/usages such as pushMessageLocking calling pushMessage) so these methods are non-const, or alternatively mark the channel pointers mutable if you intentionally want logical constness—prefer removing const for clarity; update function declarations in Output (isFull, pushMessage) and any forward declarations to match.platform/include/system/channels/messageTypeRegistry.hpp-55-66 (1)
55-66:⚠️ Potential issue | 🟠 Major | ⚡ Quick winHigh hash collision risk with 10-bit truncation.
The FNV-1a hash is truncated to 10 bits (line 65), providing only 1,024 possible type IDs. Due to the birthday paradox, collision probability reaches ~50% with approximately 38 registered types. While the code detects and throws on collisions (line 32-33), this may limit scalability. If the system expects more than ~30 message types, consider increasing
TYPE_BITSinmessage.hppor documenting the expected maximum number of types.Would you like me to generate a script to verify the number of expected message types in the codebase?
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/messageTypeRegistry.hpp` around lines 55 - 66, The fnv1a() function currently truncates the 64-bit FNV-1a hash to Message::MAX_TYPE (10 bits), causing high collision risk when >~30 types are registered; fix by increasing TYPE_BITS in message.hpp (and thus Message::MAX_TYPE) to a larger value (e.g., 16) to provide more unique IDs or alternatively stop truncating in fnv1a() and return a wider hash type for the registry to map; update references to messageType_t, Message::MAX_TYPE and fnv1a accordingly and run the collision detection tests to confirm the reduced collision rate.platform/include/system/channels/base.hpp-80-93 (1)
80-93:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftResource leak if constructor allocation fails.
If
allocateLocalMemorySlotthrows during the second or third allocation (lines 88-89), the previously allocated slots will not be freed because the destructor won't run for a partially constructed object. Consider using RAII wrappers (e.g., storing slots inunique_ptrwith custom deleters) or moving allocations to aninitialize()method with proper cleanup on failure.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/base.hpp` around lines 80 - 93, The Base constructor performs multiple calls to _config.coordinationMemoryManager->allocateLocalMemorySlot and if any allocation throws (e.g., during assignments to _dataChannelLocalCoordinationBufferForPayloads or _metadataChannelLocalCoordinationBuffer) previously allocated slots (_dataChannelLocalCoordinationBufferForSizes, etc.) leak because the destructor won't run for a partially constructed object; fix by making the allocations exception-safe: either wrap each slot field in an RAII wrapper (e.g., std::unique_ptr or a small SlotHolder with a custom deleter that calls coordinationMemoryManager->freeLocalMemorySlot) or move the allocation logic out of the constructor into an initialize() method that performs allocations and frees already-allocated slots on failure (or uses try/catch to release prior slots), ensuring _dataChannelLocalCoordinationBufferForSizes, _dataChannelLocalCoordinationBufferForPayloads and _metadataChannelLocalCoordinationBuffer are cleaned up on exceptions.platform/include/system/channels/output.hpp-79-89 (1)
79-89:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftException safety: memory slots may leak if push throws.
If
_dataChannel->push()(line 84) or_metadataChannel->push()(line 86) throws an exception, the registered local memory slots on lines 83 and 85 will not be deregistered, potentially causing resource leaks. Wrap the registration/deregistration in RAII guards or use a try-catch block to ensure cleanup. Additionally, registering and deregistering memory slots for every message may incur performance overhead; consider caching and reusing registered slots if the pattern allows.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/output.hpp` around lines 79 - 89, pushMessage currently registers payloadSlot and metadataSlot via payloadMemoryManager::registerLocalMemorySlot and coordinationMemoryManager::registerLocalMemorySlot but calls _dataChannel->push and _metadataChannel->push afterwards; if either push throws the corresponding deregisterLocalMemorySlot calls are skipped and slots leak. Fix by ensuring deregistration always runs: wrap each registered slot in an RAII guard type (e.g., LocalMemorySlotGuard) that calls deregisterLocalMemorySlot in its destructor, or at minimum use a try/finally pattern (try { push... } catch(...) { deregister both; throw; } ) around the push calls; update pushMessage to use these guards or try/catch to guarantee payloadSlot and metadataSlot are deregistered even when _dataChannel->push or _metadataChannel->push throws, and consider exposing a reusable registration cache if performance becomes an issue.platform/include/system/channels/input.hpp-80-103 (1)
80-103:⚠️ Potential issue | 🟠 Major | ⚡ Quick winThread safety concern in message access methods.
hasMessage(),getMessage(), andpopMessage()are not internally synchronized, and the check-then-act pattern (e.g.,hasMessage()followed bygetMessage()) creates a race window. If multiple threads access the sameInputchannel, one thread could pop a message between another thread'shasMessage()check and itsgetMessage()call, causing a runtime exception on line 82. Consider documenting thatInputinstances are single-consumer only, or add internal locking to these methods.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/input.hpp` around lines 80 - 103, The methods hasMessage(), getMessage(), and popMessage() on Input are vulnerable to a check-then-act race; add an internal mutex (e.g., std::mutex _mutex) to the Input class and acquire a lock (std::lock_guard<std::mutex>) inside hasMessage(), getMessage(), and popMessage() so the availability check and subsequent access/pop occur atomically; specifically, move the availability check into the locked region in getMessage() and popMessage() instead of calling hasMessage() outside the lock, and protect all accesses to _dataChannel and _metadataChannel under that same lock to prevent another thread from popping between check and use.platform/include/system/engine.hpp-84-86 (1)
84-86:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing exception handling during module shutdown.
Lines 84-86 iterate over all modules calling
terminate(),await(), andfinalize(), but if any of these methods throws an exception, the iteration stops and remaining modules are left in an inconsistent state. Add per-module try-catch blocks to ensure all modules are properly cleaned up even if some fail, logging errors but continuing the shutdown sequence.🛡️ Recommended fix
- for (const auto &[name, module] : _modules) { module->terminate(); } - for (const auto &[name, module] : _modules) { module->await(); } - for (const auto &[name, module] : _modules) { module->finalize(); } + for (const auto &[name, module] : _modules) { + try { module->terminate(); } + catch (const std::exception &e) { printf("[Instance %lu] Module '%s' terminate failed: %s\n", _instanceId, name.c_str(), e.what()); } + } + for (const auto &[name, module] : _modules) { + try { module->await(); } + catch (const std::exception &e) { printf("[Instance %lu] Module '%s' await failed: %s\n", _instanceId, name.c_str(), e.what()); } + } + for (const auto &[name, module] : _modules) { + try { module->finalize(); } + catch (const std::exception &e) { printf("[Instance %lu] Module '%s' finalize failed: %s\n", _instanceId, name.c_str(), e.what()); } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/engine.hpp` around lines 84 - 86, Wrap each per-module call in the three shutdown loops so exceptions from one module don't abort the sequence: in the loops iterating over _modules (the blocks calling module->terminate(), module->await(), and module->finalize()) add a try { module->terminate(); } catch (const std::exception& e) { /* log error with module name and e.what() */ } catch (...) { /* log unknown error */ } and do the same pattern around module->await() and module->finalize(); ensure logging references the module name (from the loop key) so failures are recorded but the loop continues to attempt shutdown for all modules.platform/include/modules/subscription.hpp-12-12 (1)
12-12:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRemove
using namespacedirective from header file.Line 12 declares
using namespace serving::system;at file scope in a header, which forces all translation units that include this header to import the entireserving::systemnamespace. This is an anti-pattern that can cause name collisions and ambiguity in client code. Remove the directive and use fully qualified names (e.g.,serving::system::channels::Input) or a namespace alias within the implementation.♻️ Proposed fix
-using namespace serving::system; - namespace serving::modules { -using messageHandler_t = std::function<void(const std::shared_ptr<channels::Input>, const channels::Message &)>; +using messageHandler_t = std::function<void(const std::shared_ptr<system::channels::Input>, const system::channels::Message &)>;And update line 25:
- Subscription(const channels::Message::messageType_t type, const std::shared_ptr<channels::Input> edge, const messageHandler_t handler) + Subscription(const system::channels::Message::messageType_t type, const std::shared_ptr<system::channels::Input> edge, const messageHandler_t handler)And similarly for lines 33-34.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/subscription.hpp` at line 12, Remove the file-scope "using namespace serving::system;" from the header and update all symbol references in this header to use fully-qualified names (for example replace channels::Input with serving::system::channels::Input and channels::Output with serving::system::channels::Output) or introduce a local namespace alias (e.g., namespace ss = serving::system;) inside the header scope and use ss::channels::Input/ss::channels::Output where those types are referenced (ensure any declarations like Subscription, channel types, and related template/typedefs are updated accordingly).platform/include/modules/channelController/module.hpp-78-79 (1)
78-79:⚠️ Potential issue | 🟠 Major | ⚡ Quick winRace condition:
hasProducerandhasConsumermust lock before reading.Lines 78–79 read
_actualProducersand_actualConsumerswithout holding the corresponding mutexes. Concurrent calls toreconcile()(line 169–171) orfinalize()(line 113–117) can modify these maps, causing data races and potentially crashes or incorrect results.🔒 Proposed fix
- [[nodiscard]] __INLINE__ bool hasProducer(const edgeName_t edgeName) const { return _actualProducers.contains(edgeName); } - [[nodiscard]] __INLINE__ bool hasConsumer(const edgeName_t edgeName) const { return _actualConsumers.contains(edgeName); } + [[nodiscard]] __INLINE__ bool hasProducer(const edgeName_t edgeName) const + { + std::lock_guard lock(_producerMutex); + return _actualProducers.contains(edgeName); + } + [[nodiscard]] __INLINE__ bool hasConsumer(const edgeName_t edgeName) const + { + std::lock_guard lock(_consumerMutex); + return _actualConsumers.contains(edgeName); + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/channelController/module.hpp` around lines 78 - 79, hasProducer and hasConsumer read _actualProducers/_actualConsumers without synchronization; wrap each read in the appropriate mutex lock (the mutex that protects _actualProducers and the mutex that protects _actualConsumers) to prevent races with reconcile() and finalize(); update hasProducer(edgeName_t) and hasConsumer(edgeName_t) to acquire the corresponding lock (e.g., via std::scoped_lock or the existing lock type used elsewhere) for the duration of the contains/read operation and then return the result.platform/examples/include/deployment/helpers.hpp-202-206 (1)
202-206:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd error handling for file I/O and JSON parsing failures.
Lines 202-204 open and parse the deployment configuration file without any error handling. If the file does not exist, cannot be opened, or contains invalid JSON, the program will crash with an unhandled exception rather than providing a clear diagnostic message.
🛡️ Proposed fix
const std::string servingConfigFilePath = std::string(argv[1]); std::ifstream servingConfigFs(servingConfigFilePath); + if (!servingConfigFs.is_open()) { + fprintf(stderr, "Error: Cannot open config file '%s'\n", servingConfigFilePath.c_str()); + instanceManager->abort(-1); + } - auto servingConfigJs = nlohmann::json::parse(servingConfigFs); + nlohmann::json servingConfigJs; + try { + servingConfigJs = nlohmann::json::parse(servingConfigFs); + } catch (const nlohmann::json::exception &e) { + fprintf(stderr, "Error: Failed to parse config file '%s': %s\n", servingConfigFilePath.c_str(), e.what()); + instanceManager->abort(-1); + } deployment.deserialize(servingConfigJs);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/include/deployment/helpers.hpp` around lines 202 - 206, Check that argv[1] is provided and validate opening the config file by testing servingConfigFs.is_open() after constructing servingConfigFs for servingConfigFilePath and log/exit with a clear message if it fails; then wrap the JSON parsing and deployment.deserialize(servingConfigJs) in a try-catch (catching nlohmann::json::parse_error and std::exception) to log the parse error or deserialize error and exit cleanly, ensuring any thrown exceptions are converted into user-friendly diagnostics rather than allowing the program to crash.platform/examples/include/deployment/helpers.hpp-37-57 (1)
37-57:⚠️ Potential issue | 🟠 Major | ⚡ Quick winValidate that instance IDs and channel IDs fit in the 20-bit key layout.
The
defaultChannelKeyBuilderpackssourceInstanceId,targetInstanceId, andchannelIdinto 20-bit fields each (masks(1ull << 20) - 1at lines 42-44). If any of these values exceed 1048575 (2^20 - 1), the upper bits will be silently truncated, causing different IDs to collide and produce identical keys. This will corrupt channel routing.🛡️ Proposed fix to add validation
__INLINE__ serving::system::channels::slotKeys_t defaultChannelKeyBuilder(const HiCR::Instance::instanceId_t sourceInstanceId, const HiCR::Instance::instanceId_t targetInstanceId, const serving::system::channels::channelId_t channelId) { + constexpr key_t MAX_20BIT = (1ull << 20) - 1; + if (sourceInstanceId > MAX_20BIT || targetInstanceId > MAX_20BIT || channelId > MAX_20BIT) { + HICR_THROW_LOGIC("Instance or channel ID exceeds 20-bit limit (max %lu): src=%ld, dst=%ld, ch=%ld", + MAX_20BIT, sourceInstanceId, targetInstanceId, channelId); + } using key_t = HiCR::GlobalMemorySlot::globalKey_t; const key_t src = (key_t(sourceInstanceId) & ((1ull << 20) - 1)) << 44;🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/include/deployment/helpers.hpp` around lines 37 - 57, In defaultChannelKeyBuilder, validate that sourceInstanceId, targetInstanceId and channelId each fit within 20 bits before masking/shifting (the current masks (1ull << 20) - 1 silently truncate higher bits); add explicit checks using the same symbol names (sourceInstanceId, targetInstanceId, channelId) and return or assert/fail fast with a clear error if any value > 0xFFFFF, then proceed to compute src/dst/ch and makeKey as before so no silent collisions occur (refer to HiCR::GlobalMemorySlot::globalKey_t and the makeKey lambda to locate where to add the checks).platform/examples/include/runtime/helpers.hpp-48-48 (1)
48-48:⚠️ Potential issue | 🟠 Major | ⚖️ Poor tradeoffRemove the
computeResourceCountparameter or fix the hard-coded validation at line 77.The function accepts a
computeResourceCountparameter (defaulting to 2) that suggests flexibility, but line 77 enforces exactly 2 compute resources withif (runtime.computeResources.size() != 2). This inconsistency will cause runtime exceptions whenever a caller passes any value other than 2, making the parameter misleading.Either remove the parameter and hard-code collection of 2 resources, or fix the validation to accept the requested count.
Proposed fix option 1: Remove the parameter
-__INLINE__ Runtime makeRuntime(int *argc, char ***argv, const size_t computeResourceCount = 2) +__INLINE__ Runtime makeRuntime(int *argc, char ***argv) { Runtime runtime; hwloc_topology_init(&runtime.hwlocTopologyObject); HiCR::backend::hwloc::TopologyManager hwlocTopologyManager(&runtime.hwlocTopologyObject); const auto topology = hwlocTopologyManager.queryTopology(); auto device = *topology.getDevices().begin(); auto memorySpaces = device->getMemorySpaceList(); runtime.bufferMemorySpace = *memorySpaces.begin(); auto computeResourcesIt = device->getComputeResourceList().begin(); - for (size_t i = 0; i < computeResourceCount; i++) + for (size_t i = 0; i < 2; i++) { runtime.computeResources.push_back(*computeResourcesIt); ++computeResourcesIt; }Proposed fix option 2: Accept variable counts
if (runtime.computeResources.size() != 2) { HICR_THROW_RUNTIME("Too many CR"); } + if (runtime.computeResources.size() != computeResourceCount) { HICR_THROW_RUNTIME("Compute resource count mismatch"); }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/include/runtime/helpers.hpp` at line 48, The makeRuntime function signature currently exposes computeResourceCount but the code validates runtime.computeResources.size() == 2, causing a mismatch; fix by either removing the computeResourceCount parameter from makeRuntime(...) and keep logic that builds exactly two resources, or (preferred) make the validation respect the passed parameter by replacing the hard-coded check with one that compares runtime.computeResources.size() to computeResourceCount (and ensure resource collection logic uses computeResourceCount when populating runtime.computeResources); refer to the makeRuntime function and the runtime.computeResources.size() validation to locate and update the code accordingly.platform/examples/system/engine/engine.cpp-44-52 (1)
44-52:⚠️ Potential issue | 🟠 Major | ⚡ Quick winCompute resource bounds checking needed in
engine.cppandservice.cpp.Both
engine.cppandservice.cppiterate to collect 2 compute resources without verifying the device provides at least that many. The shared root cause is the unchecked iterator advancement pattern. Add bounds validation in both locations to prevent undefined behavior when running on devices with fewer cores.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/system/engine/engine.cpp` around lines 44 - 52, The loop that unconditionally dereferences and increments computeResourcesIt must validate iterator bounds: before pushing *computeResourcesIt into computeResources (and before advancing computeResourcesIt) check that computeResourcesIt != computeResourcesEnd (or that remaining elements >= 2 via std::distance) and break or handle the error if insufficient devices; likewise ensure computeResources is non-empty before using auto computeResource = *computeResources.begin(). Apply the same guarded-iteration fix to the analogous loop in service.cpp (the unchecked computeResourcesIt advancement there) so both places safely handle cases with fewer than two compute resources.platform/examples/include/channels/helpers.hpp-64-78 (1)
64-78:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd null checks after
.lock()calls.Both
createDesiredInputandcreateDesiredOutputcall.lock()on a weak pointer without checking fornullptr. If the weak pointer has expired, dereferencing the result will crash.🛡️ Proposed fix to add null checks
__INLINE__ std::shared_ptr<serving::system::channels::Input> createDesiredInput(const std::shared_ptr<serving::modules::channelController::Module> &channelController, const localInput_t &inputInfo, const serving::system::channels::keyBuilderFc_t &keyBuilder) { auto input = channelController->addDesiredConsumer(inputInfo.sourceInstanceId, inputInfo.channelId, *inputInfo.edge, keyBuilder).lock(); + if (!input) HICR_THROW_LOGIC("Failed to lock desired input for edge '%s'.", inputInfo.edge->getName().c_str()); return input; } __INLINE__ std::shared_ptr<serving::system::channels::Output> createDesiredOutput(const std::shared_ptr<serving::modules::channelController::Module> &channelController, const localOutput_t &outputInfo, const serving::system::channels::keyBuilderFc_t &keyBuilder) { auto output = channelController->addDesiredProducer(outputInfo.targetInstanceId, outputInfo.channelId, *outputInfo.edge, keyBuilder).lock(); + if (!output) HICR_THROW_LOGIC("Failed to lock desired output for edge '%s'.", outputInfo.edge->getName().c_str()); return output; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/include/channels/helpers.hpp` around lines 64 - 78, createDesiredInput and createDesiredOutput call .lock() on weak_ptr results without checking for expiration; after calling channelController->addDesiredConsumer(...).lock() and addDesiredProducer(...).lock() respectively, check whether the returned std::shared_ptr is non-null and handle the expired case (e.g., return nullptr or propagate an error/log) instead of returning/using a null pointer; ensure you reference the weak-to-shared conversions originating from channelController->addDesiredConsumer and channelController->addDesiredProducer in the functions createDesiredInput and createDesiredOutput.platform/examples/include/channels/helpers.hpp-116-119 (1)
116-119: 🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick winAdd timeout to
waitUntilReadyto prevent indefinite hangs.The busy-wait loop has no timeout or escape condition. If the channel never becomes ready (due to a misconfiguration, deadlock, or network partition), this will hang indefinitely and block the caller thread.
⏱️ Proposed fix to add a timeout parameter
-__INLINE__ void waitUntilReady(const std::shared_ptr<serving::system::channels::Base> &channel, const size_t intervalMs = 500) +__INLINE__ void waitUntilReady(const std::shared_ptr<serving::system::channels::Base> &channel, const size_t intervalMs = 500, const size_t timeoutMs = 30000) { - while (!channel->isReady()) { std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs)); } + const auto start = std::chrono::steady_clock::now(); + while (!channel->isReady()) { + std::this_thread::sleep_for(std::chrono::milliseconds(intervalMs)); + const auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count(); + if (elapsed > timeoutMs) HICR_THROW_LOGIC("Channel failed to become ready within %lu ms.", timeoutMs); + } }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/include/channels/helpers.hpp` around lines 116 - 119, The waitUntilReady busy-wait needs a timeout: change the function waitUntilReady(const std::shared_ptr<serving::system::channels::Base>&, const size_t intervalMs = 500) into a timed version that accepts a timeout parameter (e.g., size_t timeoutMs, default to a reasonable value like 30000) and make it return bool (true if channel->isReady() became true within the timeout, false otherwise); implement using std::chrono::steady_clock to record start time, loop sleeping intervalMs while checking channel->isReady() and break/return false when elapsed >= timeoutMs, and update all call sites to handle the boolean result (or throw a runtime_error at callers that need exceptions).platform/examples/modules/channelController/telephoneGame.hpp-29-30 (1)
29-30:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd bounded wait/timeout for message polling loops.
Lines 29–30 and Lines 36–37 wait forever if a message never arrives, which can deadlock this example and test runs.
Proposed fix
+ auto waitForMessage = [](serving::system::channels::Input& ch, std::chrono::seconds timeout) { + const auto deadline = std::chrono::steady_clock::now() + timeout; + while (!ch.hasMessage()) + { + if (std::chrono::steady_clock::now() >= deadline) return false; + std::this_thread::sleep_for(std::chrono::milliseconds(50)); + } + return true; + }; + if (isRoot) { @@ - while (inputChannel.hasMessage() == false) { std::this_thread::sleep_for(std::chrono::milliseconds(500)); } + if (!waitForMessage(inputChannel, std::chrono::seconds(10))) return; auto output = inputChannel.getMessage(); @@ else { - while (inputChannel.hasMessage() == false) { std::this_thread::sleep_for(std::chrono::milliseconds(500)); } + if (!waitForMessage(inputChannel, std::chrono::seconds(10))) return; auto input = inputChannel.getMessage();Also applies to: 36-37
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/modules/channelController/telephoneGame.hpp` around lines 29 - 30, The polling loops that call inputChannel.hasMessage() then sleep (surrounding inputChannel.getMessage()) must be bounded to avoid deadlock: replace the infinite-wait pattern with a timed-wait using a deadline (std::chrono::steady_clock) and a configurable timeout constant (e.g., kMessageWaitTimeout), and if the timeout elapses take a deterministic action (throw an exception or return an error/optional) instead of looping forever; apply the same change to both occurrences that surround inputChannel.getMessage() so both loops time out predictably and propagate an error instead of deadlocking.platform/examples/modules/channelController/channelController.cpp-36-40 (1)
36-40:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReturn immediately after
abortto prevent unsafe fallthrough.On Line 39, control flow continues in source after
runtime.instanceManager->abort(-1). If that call ever returns (or is stubbed), Line 42+ executes with invalid CLI state.Proposed fix
if (argc != 2) { fprintf(stderr, "Error: Must provide the config file path.\n"); runtime.instanceManager->abort(-1); + return -1; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/modules/channelController/channelController.cpp` around lines 36 - 40, The code calls runtime.instanceManager->abort(-1) when argc != 2 but does not stop execution afterwards; update the caller (the containing function, e.g., main in channelController.cpp) to immediately return (or call exit) right after runtime.instanceManager->abort(-1) to avoid fallthrough and any use of invalid CLI state; modify the branch that checks argc and the fprintf so that after invoking runtime.instanceManager->abort(-1) execution does not continue (add a return/exit statement directly after the abort call).platform/include/modules/configuration/replica.hpp-17-17 (1)
17-17:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing reference in JSON constructor parameters in
replica.hppandpartition.hpp.Both
Replica(inreplica.hpp) andPartition(inpartition.hpp) declare constructors that acceptconst nlohmann::json jsby value rather than by const reference. This shared mistake forces a deep copy of the JSON object on every construction. The correct pattern (demonstrated intask.hppline 15) usesconst nlohmann::json &js.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/configuration/replica.hpp` at line 17, The Replica and Partition constructors take nlohmann::json by value causing an unnecessary deep copy; update the signatures for Replica(const nlohmann::json js) and Partition(const nlohmann::json js) to accept a const reference (const nlohmann::json& js) and adjust any corresponding declarations/definitions to match (mirror the pattern used in task.hpp's constructor) so the JSON is passed by const reference instead of copied.platform/meson.build-73-81 (1)
73-81:⚠️ Potential issue | 🟠 MajorWire
buildTestsinto the top-level examples inclusion
platform/examples/*/meson.buildalready usesif get_option('buildTests')to createtest(...)targets.- However,
platform/meson.buildonly includessubdir('examples')whenget_option('buildExamples')is true, so-DbuildTests=truealone won’t reach those files and no tests will be added.Update the
platform/meson.buildcondition to includeexamples/when eitherbuildExamplesorbuildTestsis enabled.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/meson.build` around lines 73 - 81, The current top-level guard only calls subdir('examples') when get_option('buildExamples') is true, so tests gated by get_option('buildTests') inside platform/examples/* are never included when only -DbuildTests=true is passed; update the condition in platform/meson.build (the block guarded by meson.is_subproject() == false) to call subdir('examples') when either get_option('buildExamples') or get_option('buildTests') is true (use a logical OR between get_option('buildExamples') and get_option('buildTests')) so test targets defined under platform/examples are added when tests are requested.platform/extern/HiCR-1-1 (1)
1-1:⚠️ Potential issue | 🟠 MajorHiCR submodule bump: fix unreachable URL + document the
6eb59ea…change
.gitmodulessetsplatform/extern/HiCRtohttps://github.com/huawei-csl/HiCR.git.git ls-remote/git fetchagainst that URL fails withremote: Repository not found, so the updated submodule commit6eb59ea9e63461c847ec06a2dc7819ba41f509fccan’t be retrieved (and its compatibility/security/breaking impact can’t be reviewed) and CI/build may fail.- Update the submodule URL (mirror/internal repo/credentials) so this exact commit is fetchable, then document in the PR what changed in HiCR for
6eb59ea…and why (including any breaking/behavioral and security-relevant impact).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/extern/HiCR` at line 1, .gitmodules points platform/extern/HiCR at an unreachable URL (https://github.com/huawei-csl/HiCR.git) causing git ls-remote/git fetch to fail and preventing retrieval of commit 6eb59ea9e63461c847ec06a2dc7819ba41f509fc; update the submodule URL for platform/extern/HiCR to a reachable mirror/internal repo or add required credentials so git fetch/ls-remote succeeds, verify the commit is fetchable, and then update the PR description to document exactly what changed in HiCR at commit 6eb59ea… including any breaking API/behavioral changes or security-relevant impacts so reviewers and CI can validate the change.platform/include/modules/configuration/edge.hpp-23-27 (1)
23-27:⚠️ Potential issue | 🟠 Major | ⚡ Quick winMissing capacity invariant check allows invalid buffer configurations.
_bufferSizecan be set larger than_bufferCapacity(constructor, deserialize, and setters). Enforce this invariant at write points to block invalid deployment state early.Suggested fix
Edge(const std::string &name, const size_t bufferCapacity, const size_t bufferSize = 0) : _name(name), _bufferCapacity(bufferCapacity), _bufferSize(bufferSize) - {} + { + if (_bufferSize > _bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity."); + } @@ __INLINE__ void deserialize(const nlohmann::json &js) { @@ if (js.contains("Buffer Capacity")) _bufferCapacity = hicr::json::getNumber<size_t>(js, "Buffer Capacity"); _bufferSize = hicr::json::getNumber<size_t>(js, "Buffer Size"); + if (_bufferSize > _bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity."); } @@ - __INLINE__ void setBufferCapacity(const size_t bufferCapacity) { _bufferCapacity = bufferCapacity; } - __INLINE__ void setBufferSize(const size_t bufferSize) { _bufferSize = bufferSize; } + __INLINE__ void setBufferCapacity(const size_t bufferCapacity) + { + if (_bufferSize > bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity."); + _bufferCapacity = bufferCapacity; + } + __INLINE__ void setBufferSize(const size_t bufferSize) + { + if (bufferSize > _bufferCapacity) HICR_THROW_LOGIC("[Edge] buffer size cannot exceed buffer capacity."); + _bufferSize = bufferSize; + }Also applies to: 55-57, 71-72
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/configuration/edge.hpp` around lines 23 - 27, The Edge class allows creating or updating an instance where _bufferSize > _bufferCapacity; add validation in the Edge(const std::string&, const size_t, const size_t) constructor and in all mutators/deserialization paths (the deserialize method and any setters like setBufferSize and setBufferCapacity) to enforce _bufferSize <= _bufferCapacity: if the invariant fails, reject the update by throwing a std::invalid_argument (or returning an error) with a clear message; when setBufferCapacity reduces capacity, also clamp or reject if current _bufferSize would exceed the new capacity to maintain the invariant.platform/include/modules/module.hpp-3-5 (1)
3-5:⚠️ Potential issue | 🟠 MajorAdd explicit
<memory>include forstd::unique_ptr/std::make_unique
platform/include/modules/module.hppusesstd::make_uniqueandstd::unique_ptrbut has no#include <memory>, so it relies on transitive includes (brittle for header self-sufficiency).Suggested fix
`#pragma` once `#include` <hicr/core/definitions.hpp> +#include <memory> `#include` <taskr/service.hpp>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/module.hpp` around lines 3 - 5, The header is using std::unique_ptr and std::make_unique but doesn't include <memory>, relying on transitive includes; add an explicit `#include` <memory> to platform/include/modules/module.hpp (alongside the existing `#include` <hicr/core/definitions.hpp> and `#include` <taskr/service.hpp>) so symbols like std::unique_ptr and std::make_unique resolve without depending on other headers.
🟡 Minor comments (1)
platform/examples/modules/channelController/telephoneGame.hpp-24-24 (1)
24-24:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse a portable format for
instanceIdlogging.
printfuses%luon Lines 24/32/39/42, butHiCR::Instance::instanceId_tis an alias and may not always matchunsigned long.Proposed fix
- printf("[Instance %lu][TelephoneGame] Sending message: %s\n", instanceId, text.c_str()); + printf("[Instance %llu][TelephoneGame] Sending message: %s\n", + static_cast<unsigned long long>(instanceId), text.c_str()); @@ - printf("[Instance %lu][TelephoneGame] Received message: %s\n", instanceId, std::string(reinterpret_cast<const char *>(output.getData()), output.getSize()).c_str()); + printf("[Instance %llu][TelephoneGame] Received message: %s\n", + static_cast<unsigned long long>(instanceId), + std::string(reinterpret_cast<const char *>(output.getData()), output.getSize()).c_str());Also applies to: 32-32, 39-39, 42-42
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/modules/channelController/telephoneGame.hpp` at line 24, The printf calls logging instanceId use "%lu" which is not portable for HiCR::Instance::instanceId_t; update each printf (the one at Sending message and the other occurrences) to use "%llu" and cast instanceId to (unsigned long long) (e.g., printf("[Instance %llu][TelephoneGame] ...", (unsigned long long)instanceId, ...)) so the format matches a fixed-width wide integer regardless of the underlying alias; apply the same change for the occurrences referenced (lines with instanceId in TelephoneGame's prints).
🧹 Nitpick comments (8)
platform/include/system/channels/message.hpp (1)
57-61: ⚡ Quick winDocument pointer ownership and lifetime requirements.
The
Messageclass stores a non-owningconst uint8_t*pointer to payload data. There is no documentation indicating that the caller must ensure the data outlives theMessageobject. Consider adding a comment in the constructor or class documentation clarifying thatMessagedoes not take ownership ofdataand that the caller must manage the data's lifetime. Alternatively, explicitly delete copy/move constructors if copying is not intended, or document the shallow-copy semantics.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/message.hpp` around lines 57 - 61, Document that Message stores a non-owning pointer and requires the caller to manage the payload lifetime: add a clear comment on the Message class or the Message(const uint8_t *const data, const size_t size, const metadata_t metadata) constructor stating that _data is non-owning, Message does not free or extend the buffer, and the caller must ensure the buffer outlives the Message instance; alternatively, if shallow-copying is undesirable, explicitly delete or define copy/move constructors/operators for Message (referencing class Message, members _data/_size/_metadata) or switch to an owning type (e.g., std::vector/std::shared_ptr) if ownership should be transferred.platform/include/system/channels/base.hpp (1)
123-124: ⚡ Quick winPrefer RAII lock guards over raw lock/unlock.
Exposing raw
lock()andunlock()methods is error-prone; if an exception is thrown between lock and unlock, the mutex remains locked, causing a deadlock. Consider removing these methods and requiring callers to usestd::lock_guard<std::mutex>orstd::unique_lock<std::mutex>with a public accessor to the mutex, or provide a scoped lock helper method.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/system/channels/base.hpp` around lines 123 - 124, The public lock()/unlock() methods on the class are unsafe—replace them with an RAII-friendly API: remove __INLINE__ void lock() and unlock(), expose the underlying mutex via a safe accessor (e.g., std::mutex& mutex() or const std::mutex& native_mutex()) and/or add a scoped helper (e.g., a template with_lock(Func&&) that acquires a std::lock_guard<std::mutex> and invokes the callback). Update callers to use std::lock_guard<std::mutex> or the new with_lock helper instead of calling lock()/unlock() directly (refer to the lock, unlock symbols and the mutex member in this header to locate changes).platform/include/modules/channelController/module.hpp (1)
153-160: 💤 Low valueClarify the error message for unconfigured communication managers.
Line 158 throws a logic error if
entry.communicationManageris not inexchangeMap. BecauseexchangeMapis pre-populated from_communicationManagersInOrder(line 154), this error means a channel was configured with a manager not in the ordered list. The current message "is not present in configured manager order" is correct but could be more explicit.📝 Suggested improvement
- if (exchangeMap.contains(entry.communicationManager) == false) HICR_THROW_LOGIC("Memory slot exchange manager is not present in configured manager order."); + if (exchangeMap.contains(entry.communicationManager) == false) + HICR_THROW_LOGIC("Channel references communication manager %p which is not in the configured manager order provided to the channel controller.", entry.communicationManager);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/channelController/module.hpp` around lines 153 - 160, The HICR_THROW_LOGIC call when checking exchangeMap.contains(entry.communicationManager) should use a clearer error message: update the message emitted from the check in the loop that processes memorySlotsToExchange to state that the memory slot references a communication manager that was not included in _communicationManagersInOrder (include the manager pointer or identifier from entry.communicationManager and the affected global key or memorySlot for context). Modify the HICR_THROW_LOGIC invocation so it mentions exchangeMap, entry.communicationManager and the related entry (e.g., entry.globalKey) to make the failure explicit.platform/include/modules/broadcastDeployment/module.hpp (1)
120-120: ⚡ Quick winReplace C-style cast with
static_castorreinterpret_cast.Line 120 uses
(char *)which is a C-style cast. Modern C++ prefers explicit casts for clarity and type safety.♻️ Proposed fix
- std::string deploymentString = (char *)returnValue->getPointer(); + std::string deploymentString = static_cast<char *>(returnValue->getPointer());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/broadcastDeployment/module.hpp` at line 120, Replace the C-style cast "(char *)" when building deploymentString from returnValue->getPointer(); use an explicit C++ cast: if getPointer() returns void* use reinterpret_cast<char*>(returnValue->getPointer()), otherwise if it returns a compatible pointer type use static_cast<char*>(returnValue->getPointer()); update the expression that constructs std::string deploymentString accordingly (referencing deploymentString and returnValue->getPointer()) to eliminate the C-style cast.platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp (3)
79-79: 💤 Low valueConsider using a fixed maximum sleep duration to prevent excessive delays.
The sleep duration is
instanceIdseconds, which works for small, contiguous instance IDs (0, 1, 2), but if instance IDs are large or non-contiguous (e.g., 100, 200, 300), this will cause very long delays. For example code, consider capping the sleep or using modulo to bound the delay.♻️ Proposed fix to cap sleep duration
- std::this_thread::sleep_for(std::chrono::seconds(instanceId)); // Sleep a bit to ensure the output is not mixed + std::this_thread::sleep_for(std::chrono::seconds(instanceId % 10)); // Cap at 10 seconds to prevent excessive delays printf("[Instance %lu] Received deployment configuration:\n%s\n", instanceId, receivedDeployment.serialize().dump(2).c_str());🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp` at line 79, The current sleep uses std::this_thread::sleep_for(std::chrono::seconds(instanceId)) which can block excessively for large or non-contiguous instanceId values; change it to bound the delay (e.g., compute a delay = std::min(instanceId, MAX_DELAY_SECONDS) or use instanceId % MAX_DELAY_SECONDS) and sleep for that bounded number of seconds so outputs remain staggered but never delayed beyond a fixed maximum; update the call site where std::this_thread::sleep_for is invoked with instanceId to use the bounded delay.
51-60: 💤 Low valueUse the already-determined
isRootflag instead of re-checking instance ID equality.Lines 30-32 determine
isRoot,instanceId, anddeployerInstanceId, but line 51 re-checksinstanceId == deployerInstanceIdinstead of reusing theisRootflag. This creates an unnecessary second comparison and potential inconsistency ifisRootInstance()logic differs from ID equality.♻️ Proposed simplification
std::shared_ptr<serving::modules::broadcastDeployment::Module> broadcastDeploymentModule; - if (instanceId == deployerInstanceId) + if (isRoot) { broadcastDeploymentModule = std::make_shared<serving::modules::broadcastDeployment::Module>( runtime.instanceManager, runtime.taskComputeManager, runtime.rpcEngine, deployerInstanceId, instanceId, deployment);🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp` around lines 51 - 60, Replace the explicit comparison instanceId == deployerInstanceId with the already-computed isRoot flag when choosing which Module constructor to call: use isRoot to decide whether to construct serving::modules::broadcastDeployment::Module with the deployment argument or without it, updating the conditional around the broadcastDeploymentModule creation so it references isRoot (and still passes runtime.instanceManager, runtime.taskComputeManager, runtime.rpcEngine, deployerInstanceId, instanceId) instead of re-checking IDs.
62-62: ClarifygetDeployment()ordering for non-deployer instances (broadcastDeployment.cpp):Module’s non-deployer constructor default-constructs_deployment, and_deploymentis only populated ininitialize()viaretrieveDeployment(); callingserialize()on an uninitialized deployment would be unsafe becauseDeployment::serialize()dereferences_requestManager. In this example,receivedDeploymentis only serialized afterserving.initialize()/serving.await(), andretrieveDeployment()runs insideinitialize(), so this ordering is currently safe—but moving thereceivedDeploymentline belowserving.initialize()would prevent future regressions and better documents the dependency.Optional refactor
} - const auto &receivedDeployment = broadcastDeploymentModule->getDeployment(); // Adding broadcast deployment module to serving serving.addModule("BroadcastDeployment", broadcastDeploymentModule); // Initializing serving serving.initialize(); + + const auto &receivedDeployment = broadcastDeploymentModule->getDeployment(); // Running serving serving.run();🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/examples/modules/broadcastDeployment/broadcastDeployment.cpp` at line 62, The code calls broadcastDeploymentModule->getDeployment() (stored as receivedDeployment) before the module's initialize() has run, which can leave Deployment uninitialized and make Deployment::serialize() unsafe because it dereferences _requestManager; move the line that obtains receivedDeployment (and any subsequent serialize/use) to after serving.initialize() and serving.await() so retrieveDeployment() runs (via Module::initialize()) and _deployment is populated before use, ensuring safe calls to Deployment::serialize().platform/include/modules/configuration/deployment.hpp (1)
86-88: 💤 Low valueRedundant getter method.
getControlBufferConst()(line 88) returns the same const reference asgetControlBuffer() const(line 86). Consider removing the redundant method to simplify the API surface.♻️ Simplify by removing redundant method
[[nodiscard]] __INLINE__ const auto &getControlBuffer() const { return _settings.controlBuffer; } [[nodiscard]] __INLINE__ auto &getControlBuffer() { return _settings.controlBuffer; } - [[nodiscard]] __INLINE__ auto &getControlBufferConst() const { return _settings.controlBuffer; }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/configuration/deployment.hpp` around lines 86 - 88, Remove the redundant getter getControlBufferConst() which returns the same const reference as getControlBuffer() const; delete the getControlBufferConst() declaration/definition and replace any call-sites using getControlBufferConst() with getControlBuffer() (the const overload), ensuring references to _settings.controlBuffer continue to use the const-qualified getter when needed.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 2e3d807d-551e-47a6-b194-c3b5186df012
📒 Files selected for processing (48)
.gitignore.gitmodulesplatform/.clang-formatplatform/docs/README.mdplatform/docs/modules/README.mdplatform/docs/modules/broadcast-deployment.mdplatform/docs/modules/channel-controller.mdplatform/docs/modules/configuration.mdplatform/docs/modules/service.mdplatform/docs/system/README.mdplatform/docs/system/channels.mdplatform/docs/system/engine.mdplatform/examples/include/channels/helpers.hppplatform/examples/include/deployment/helpers.hppplatform/examples/include/runtime/helpers.hppplatform/examples/meson.buildplatform/examples/modules/broadcastDeployment/broadcastDeployment.cppplatform/examples/modules/broadcastDeployment/meson.buildplatform/examples/modules/broadcastDeployment/policy.jsonplatform/examples/modules/channelController/channelController.cppplatform/examples/modules/channelController/meson.buildplatform/examples/modules/channelController/policy.jsonplatform/examples/modules/channelController/telephoneGame.hppplatform/examples/modules/service/meson.buildplatform/examples/modules/service/service.cppplatform/examples/system/engine/engine.cppplatform/examples/system/engine/meson.buildplatform/extern/HiCRplatform/extern/TaskRplatform/include/modules/broadcastDeployment/module.hppplatform/include/modules/channelController/module.hppplatform/include/modules/configuration/deployment.hppplatform/include/modules/configuration/edge.hppplatform/include/modules/configuration/partition.hppplatform/include/modules/configuration/replica.hppplatform/include/modules/configuration/requestManager.hppplatform/include/modules/configuration/task.hppplatform/include/modules/module.hppplatform/include/modules/service/module.hppplatform/include/modules/subscription.hppplatform/include/system/channels/base.hppplatform/include/system/channels/input.hppplatform/include/system/channels/message.hppplatform/include/system/channels/messageTypeRegistry.hppplatform/include/system/channels/output.hppplatform/include/system/engine.hppplatform/meson.buildplatform/meson_options.txt
- meson.build: flatten compile_args array (['-Wfatal-errors'] + warningAsErrorFlags) to avoid nested array in declare_dependency - meson.build: remove dangling endif and orphaned comment left over from examples removal - .gitmodules: remove branch = master from HiCR entry to prevent accidental drift on git submodule update --remote - engine.hpp: document RPC lambda lifetime contract (HiCR has no removeRPCTarget API; callers must terminate/await before destruction) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Only thing missing is wiring the code into the CI, going to work on it right now |
|
@coderabbitai review @gemini-code-assist review Triggering fresh reviews — the branch was substantially rewritten since the original reviews. The PR is now scoped to the platform skeleton only (Engine, Module base, channel primitives, system engine example, build scaffolding). Several files that attracted comments (configuration types, module examples, telephoneGame) are no longer in this PR. |
|
✏️ Learnings added
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Code Review
This pull request introduces the C++ platform-management layer for PyPTO Serving, implementing a distributed engine, module lifecycle management, and SPSC-oriented communication channels that separate data payloads from metadata. The review feedback identifies several critical issues: a high risk of hash collisions in the 10-bit MessageTypeRegistry, fragility in the asymmetric engine lifecycle, an unmitigated busy-wait loop in pushMessageLocking when channels are full, a race condition where background service threads can invoke virtual methods on destroyed modules, potential ID corruption from unmasked bit-width overflows in Message::getId(), and missing null-pointer checks on memory allocations in the channel base class.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
(full review mode) 🐇 ✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 10
♻️ Duplicate comments (1)
.gitmodules (1)
4-9: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick winFix broken submodule URLs — repositories have moved.
The
.gitmodulesURLs still point to non-existenthuawei-cslorganization repositories. Verification confirmed these repos live underAlgebraic-Programming:
https://github.com/Algebraic-Programming/HiCR.githttps://github.com/Algebraic-Programming/TaskR.gitUpdate both URLs to prevent clone failures.
🔧 Proposed fix
[submodule "platform/extern/HiCR"] path = platform/extern/HiCR - url = https://github.com/huawei-csl/HiCR.git + url = https://github.com/Algebraic-Programming/HiCR.git [submodule "platform/extern/TaskR"] path = platform/extern/TaskR - url = https://github.com/huawei-csl/TaskR.git + url = https://github.com/Algebraic-Programming/TaskR.git🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.gitmodules around lines 4 - 9, Update the submodule URLs in .gitmodules for the HiCR and TaskR entries because the current huawei-csl locations are broken. Change the URL values for the platform/extern/HiCR and platform/extern/TaskR submodules to the corresponding Algebraic-Programming GitHub repositories so cloning and submodule initialization work again.
🧹 Nitpick comments (3)
platform/docs/README.md (1)
15-24: 📐 Maintainability & Code Quality | 🔵 TrivialSource layout includes deferred directories.
The
include/modules/configuration/directory listed at line 18 is explicitly deferred per PR objectives. Consider removing it from the source layout or marking it as planned but not yet present to avoid confusion.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/docs/README.md` around lines 15 - 24, The source layout in README still presents include/modules/configuration/ as part of the current tree even though it is deferred. Update the documentation in the platform README to either remove that directory from the listed layout or clearly mark it as planned/not yet present, keeping the wording consistent with the other layout entries in the README section.platform/meson_options.txt (1)
1-6: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueRemove trailing whitespace on line 3.
choices : [ 'mpi' ],Trailing space after the comma.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/meson_options.txt` around lines 1 - 6, Remove the trailing whitespace in the meson option definition for option('engines'); the issue is the extra space after the comma in the choices array. Update the option declaration so the choices line is clean and consistent with the surrounding Meson formatting, keeping the option('engines') block otherwise unchanged.platform/include/modules/module.hpp (1)
3-5: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winMake this public header self-contained.
Moduleusessize_t,std::unique_ptr, andstd::make_unique, but those headers are only available transitively here. That makes the header brittle to include-order changes in HiCR/TaskR.Suggested change
`#pragma` once +#include <cstddef> +#include <memory> + `#include` <hicr/core/definitions.hpp> `#include` <taskr/service.hpp>🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@platform/include/modules/module.hpp` around lines 3 - 5, The public header for Module is not self-contained because it relies on transitive includes for size_t, std::unique_ptr, and std::make_unique. Update the Module header itself to include the standard headers that provide those symbols so its declarations and inline usage compile regardless of include order, and keep the existing HiCR/TaskR includes for the module types it references.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/ci.yml:
- Around line 84-87: The recursive checkout in the CI workflow leaves Git
credentials available to PR-controlled code, so update the actions/checkout@v4
step to disable credential persistence. In the checkout configuration, add
persist-credentials: false alongside fetch-depth and submodules, and only keep
credentials enabled if a later step in this job genuinely requires authenticated
git access.
In `@platform/docs/README.md`:
- Around line 28-32: The platform README is overstating what this PR implements
and links to deferred module docs. Update the overview text and the module list
in README so they only describe the pieces actually included in this change, and
remove or rewrite references to the deferred configuration, broadcast
deployment, channel controller, service, TaskR, and examples scope. Keep the
wording aligned with the current PR scope so readers are not told that those
modules or deployment graph pieces already exist.
In `@platform/docs/system/channels.md`:
- Around line 16-22: Clarify the concurrency model in channels.md by making the
Base description, pushMessageLocking(message), and the “Channels are
SPSC-oriented” claim consistent. If the implementation is truly SPSC, revise the
mutex wording to explain its limited role outside the data path; if locking is
part of normal send/receive behavior, update the documentation to describe the
channels as MPMC or otherwise non-SPSC. Use the Base and
pushMessageLocking(message) symbols to align the wording across the affected
sections.
In `@platform/examples/system/engine/engine.cpp`:
- Around line 32-52: The device-selection block in engine.cpp blindly
dereferences topology-derived iterators and even advances through two compute
resources even though only one is used. In the code around
topology.getDevices(), getMemorySpaceList(), and getComputeResourceList(), add
guards for empty device, memory-space, and compute-resource lists before
dereferencing, and stop assuming at least two cores; select the first available
compute resource safely (or bail out/log when unavailable) so the example does
not crash on filtered or single-core topologies.
In `@platform/include/modules/subscription.hpp`:
- Around line 12-17: Remove the namespace-wide import from this public header:
`using namespace serving::system;` in `subscription.hpp` is leaking all of
`serving::system` into every includer. Update the `messageHandler_t` declaration
and any other uses in `serving::modules` to reference
`serving::system::channels::Input` and `serving::system::channels::Message`
explicitly, or replace the broad import with a narrow alias scoped inside
`serving::modules`.
In `@platform/include/system/channels/base.hpp`:
- Around line 104-130: The Base::initialize flow is the only place that
populates the shared memory slots before createChannels(), so the public channel
operations in Input::{hasMessage,getMessage,popMessage} and
Output::{isFull,pushMessage} must not run before readiness is established. Add a
protected readiness guard in Base using isReady()/_isReady, and have those
methods check it before touching any channel handles or dereferencing pointers;
if not ready, fail fast in a consistent way. Use the existing Base, Input, and
Output symbols to wire the check into the shared channel access path.
In `@platform/include/system/channels/message.hpp`:
- Around line 48-50: The getId() packing logic in message::getId() directly
shifts raw type, groupId, and sequenceId values without validating bounds, so
out-of-range metadata can bleed into adjacent fields. Update getId() to either
validate each field before packing or mask them to their expected bit widths,
and use the existing messageId_t packing expression in message.hpp as the place
to enforce this behavior.
In `@platform/include/system/channels/output.hpp`:
- Around line 66-76: Make pushMessageLocking() exception-safe by replacing the
manual lock/unlock flow with RAII so the mutex is always released even if
pushMessage(message) throws. Update the pushMessageLocking method in output.hpp
to use std::unique_lock (or an equivalent scoped lock) around the full/isFull
loop and the pushMessage call, and remove the raw lock(), unlock() handling to
avoid leaving the channel permanently locked.
In `@platform/include/system/engine.hpp`:
- Around line 46-50: The addModule registration path currently allows a null
shared_ptr to be stored, which later gets dereferenced in initialize(), start(),
and await(). Update addModule(const std::string &name,
std::shared_ptr<modules::Module> module) to validate module before inserting it
into _modules and throw a logic error when it is null, alongside the existing
duplicate-name check.
In `@platform/meson_options.txt`:
- Around line 8-10: The buildExamples meson option is currently defined but not
used by the build logic, so either remove it from meson_options.txt or wire it
into the example build path. Update platform/meson.build so the examples subdir
is gated by get_option('buildExamples') instead of buildTests, using the
buildExamples option name consistently.
---
Duplicate comments:
In @.gitmodules:
- Around line 4-9: Update the submodule URLs in .gitmodules for the HiCR and
TaskR entries because the current huawei-csl locations are broken. Change the
URL values for the platform/extern/HiCR and platform/extern/TaskR submodules to
the corresponding Algebraic-Programming GitHub repositories so cloning and
submodule initialization work again.
---
Nitpick comments:
In `@platform/docs/README.md`:
- Around line 15-24: The source layout in README still presents
include/modules/configuration/ as part of the current tree even though it is
deferred. Update the documentation in the platform README to either remove that
directory from the listed layout or clearly mark it as planned/not yet present,
keeping the wording consistent with the other layout entries in the README
section.
In `@platform/include/modules/module.hpp`:
- Around line 3-5: The public header for Module is not self-contained because it
relies on transitive includes for size_t, std::unique_ptr, and std::make_unique.
Update the Module header itself to include the standard headers that provide
those symbols so its declarations and inline usage compile regardless of include
order, and keep the existing HiCR/TaskR includes for the module types it
references.
In `@platform/meson_options.txt`:
- Around line 1-6: Remove the trailing whitespace in the meson option definition
for option('engines'); the issue is the extra space after the comma in the
choices array. Update the option declaration so the choices line is clean and
consistent with the surrounding Meson formatting, keeping the option('engines')
block otherwise unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 6edbbc1b-f82d-47a9-b466-466e3b5ef00e
📒 Files selected for processing (23)
.github/workflows/ci.yml.gitignore.gitmodulesplatform/.clang-formatplatform/docs/README.mdplatform/docs/system/README.mdplatform/docs/system/channels.mdplatform/docs/system/engine.mdplatform/examples/meson.buildplatform/examples/system/engine/engine.cppplatform/examples/system/engine/meson.buildplatform/extern/HiCRplatform/extern/TaskRplatform/include/modules/module.hppplatform/include/modules/subscription.hppplatform/include/system/channels/base.hppplatform/include/system/channels/input.hppplatform/include/system/channels/message.hppplatform/include/system/channels/messageTypeRegistry.hppplatform/include/system/channels/output.hppplatform/include/system/engine.hppplatform/meson.buildplatform/meson_options.txt
|
@coderabbitai review |
|
@gemini-code-assist review |
✅ Action performedReview finished.
|
There was a problem hiding this comment.
Code Review
This pull request introduces the C++ platform-management layer for PyPTO Serving, which includes module lifecycle management, SPSC channel primitives, and cross-instance RPC control. The review feedback highlights several critical issues: potential race conditions from premature memory slot deregistration in Output::pushMessage, high CPU utilization caused by std::this_thread::yield() in the engine's polling loop, and blocking calls in worker instances that could prevent proper termination. Additionally, there are concerns regarding thread safety on the SPSC channel due to un-locked public push methods, potential silent ID corruption in release builds when assertions are compiled out, and missing null pointer checks after memory allocations in the channel base class.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
|
Caution Failed to replace (edit) comment. This is likely due to insufficient permissions or the comment being deleted. Error details |
- meson.build: flatten compile_args array (['-Wfatal-errors'] + warningAsErrorFlags) to avoid nested array in declare_dependency - meson.build: remove dangling endif and orphaned comment left over from examples removal - .gitmodules: remove branch = master from HiCR entry to prevent accidental drift on git submodule update --remote - engine.hpp: document RPC lambda lifetime contract (HiCR has no removeRPCTarget API; callers must terminate/await before destruction) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- meson.build: flatten compile_args array (['-Wfatal-errors'] + warningAsErrorFlags) to avoid nested array in declare_dependency - meson.build: remove dangling endif and orphaned comment left over from examples removal - .gitmodules: remove branch = master from HiCR entry to prevent accidental drift on git submodule update --remote - engine.hpp: document RPC lambda lifetime contract (HiCR has no removeRPCTarget API; callers must terminate/await before destruction) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ives Introduces the foundational control-plane layer for the platform module (ref #32, #13). Scoped to static infrastructure only per the agreed first milestone: module lifecycle interface, the system Engine, host-side channel primitives (Input/Output/Message/MessageTypeRegistry), and HiCR submodule. Configuration types, concrete modules (broadcastDeployment, channelController, service), and examples are deferred to a follow-up PR pending resolution of DeploymentSpec ownership (see #32 discussion). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- meson.build: flatten compile_args array (['-Wfatal-errors'] + warningAsErrorFlags) to avoid nested array in declare_dependency - meson.build: remove dangling endif and orphaned comment left over from examples removal - .gitmodules: remove branch = master from HiCR entry to prevent accidental drift on git submodule update --remote - engine.hpp: document RPC lambda lifetime contract (HiCR has no removeRPCTarget API; callers must terminate/await before destruction) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds the engine example (examples/system/engine/) which exercises the
full Engine + module lifecycle over MPI — the minimal runnable
demonstration of the platform skeleton.
Wires the example into the build under buildTests=true and registers it
as a meson test suite ('examples'). The CI platform-build job:
- shallow-clones with submodules (fetch-depth: 1, submodules: recursive)
- installs meson, ninja, libhwloc-dev, libboost-context-dev, libopenmpi-dev
- configures with -Dengines=mpi -DbuildTests=true
- builds with ninja
- runs meson test --suite examples
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
module.hpp uses taskr::Service as the base Module's optional periodic-service mechanism, so TaskR is a core dependency of the platform skeleton, not just the concrete service module. Restores the TaskR subproject in meson.build and the submodule entry in .gitmodules (without a branch= line, consistent with HiCR). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
.gitmodules had the TaskR entry but the submodule gitlink was missing from the git index (lost during the earlier branch reset), so CI could not clone it with submodules: recursive. Re-registers the pinned commit. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
huawei-csl/TaskR does not exist publicly; the correct repo is https://github.com/Algebraic-Programming/TaskR.git Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- output: add std::this_thread::yield() in pushMessageLocking busy-wait to avoid spinning the CPU while the channel is full - message: expand TYPE_BITS 10→16, GROUP_BITS 20→14 (total stays 64-bit) to reduce FNV-1a collision probability from ~8% to ~0.003% for 20 types; add assert(isValid()) in getId() to catch field-width violations early - messageTypeRegistry: keep FNV-1a for cross-node determinism (sequential IDs would break nodes with different module sets); remove stale comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
yield()/sched_yield()/pthread_yield() are identical on Linux and all yield the OS thread, not the TaskR coroutine. When called from within a TaskR task (boost::context coroutine), this blocks the whole worker thread from picking up other tasks. A 1µs sleep works correctly in all execution contexts and actually gives the SPSC consumer time to drain the channel rather than immediately retrying. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
No step in platform-build uses authenticated git access after checkout, and all submodules are public. persist-credentials: false prevents the GitHub token from being written into the local git config where PR-controlled code could read it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docs/README.md: remove links to untracked module docs (configuration, service, channel-controller, broadcast-deployment); rewrite the 'implemented pieces' block to only describe what is actually in this PR (engine lifecycle, module base, channel primitives) docs/system/channels.md: clarify mutex role — it is used exclusively by pushMessageLocking to serialize competing callers, not the SPSC data path itself; update bit widths (type: 16-bit, group: 14-bit) to match current message.hpp; expand FNV-1a description to explain cross-node determinism examples/system/engine/engine.cpp: guard all three topology-derived iterator dereferences (devices, memory spaces, compute resources) with empty-list checks; remove the dead 2-core loop since only the first compute resource was ever used Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
channels/base.hpp: add protected checkReady() that throws before any channel op if initialize() was not called; move _lock to protected so subclasses can use std::unique_lock directly channels/input.hpp: call checkReady() at the top of hasMessage() — getMessage() and popMessage() both call hasMessage() first so they are covered transitively channels/output.hpp: call checkReady() at the top of isFull(); rewrite pushMessageLocking() using std::unique_lock<std::mutex> so the lock is always released even if pushMessage() throws modules/subscription.hpp: replace file-scope `using namespace serving::system` (which leaked into every including TU) with a namespace alias scoped inside serving::modules docs/README.md: remove include/modules/configuration/ from the source layout (deferred, not present in this PR) meson_options.txt: strip trailing whitespace Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… encapsulation engine.hpp: replace std::this_thread::yield() with sleep_for(1ms) in the await() RPC polling loop; yield() re-schedules immediately and pegs the core at 100% when idle output.hpp: move pushMessage to private; it was only ever called from pushMessageLocking (which holds _lock). A concurrent caller mixing pushMessage and pushMessageLocking would race on the SPSC channel — making it private prevents that class of misuse entirely Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
assert(isValid()) is compiled out under NDEBUG, leaving overflow unchecked in release. Explicitly mask each field to its MAX before shifting so out-of-range bits can never bleed into adjacent fields. assert is kept for debug builds to catch invalid values early. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Scoped-down version of the initial platform PR, per review feedback on #32 (first milestone scope too broad).
What this includes
platform/include/system/engine.hpp— the control-plane kernel: manages module lifecycle (initialize / run / await / terminate / finalize), start/stop RPCs, deployer/worker split. No model semantics.platform/include/modules/module.hpp+subscription.hpp— pure lifecycle interface that all future platform modules implement.platform/include/system/channels/— host-side channel primitives: Input, Output, Message, MessageTypeRegistry, Base. Pure communication transport.platform/meson.build+meson_options.txt— build scaffolding with HiCR subproject.platform/extern/HiCR— HiCR submodule (used by engine and channels).platform/docs/— top-level README, system/engine, system/channels docs.What is explicitly deferred
include/modules/configuration/extern/TaskRexamples/Relation to #32 design discussion
This PR takes a position only on the undisputed cornerstone: Platform is a control plane that manages module lifecycle and channel primitives. It does not take a position on DeploymentSpec ownership or request-dispatch vs resource-routing — those remain open per the #32 discussion.
Tracked by #32 and #13.