PPL Alerting: PPL Monitor Execution by toepkerd · Pull Request #2130 · opensearch-project/alerting

toepkerd · 2026-05-08T23:17:03Z

Description

This PR primarily contains logic for running PPL Monitors. It also adds RBAC checks for manual Monitor executions, a gap revealed by a security review.

To support RBAC filtering for manual monitor execution, a new field manual is added to ExecuteMonitorRequest/Response. Many of the file changes are simply to accommodate this new field and can be ignored in review.

The only files that need deep review are:

TriggerService.kt
InputService.kt
AlertService.kt
QueryLevelMonitorRunner.kt
TransportExecuteMonitorAction.kt
QueryLevelTriggerExecutionContext.kt

New `manual` Field

The manual field in ExecuteMonitorRequest distinguishes between user-initiated executions (via the REST API) and system-initiated executions (via the job scheduler or MonitorJobPoller). When a user calls the Execute Monitor API, the REST handler sets manual = true. When the scheduler or poller triggers execution internally, it sets manual = false. In TransportExecuteMonitorAction, this field gates the RBAC check: when manual is true and the monitor was fetched by ID, checkUserPermissionsWithResource is called to verify the calling user's backend roles overlap with the monitor owner's backend roles (enforcing filter_by_backend_roles). When manual is false, this check is skipped entirely — scheduled executions run as the system and must not be blocked by user-level RBAC. This ensures that monitors continue to fire on schedule regardless of which user created them, while still preventing unauthorized users from manually triggering another user's monitor via the API.

Check List

New functionality includes testing.
New functionality has been documented.
API changes companion pull request created.
Commits are signed per the DCO using --signoff.
Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

…ring V2 APIs Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…Alerts Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…roject#2099) * Block unsupported actions when multi-tenancy is enabled Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> * Fix ktlint Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> * Fix CIs Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> * fix(test): Fix MultiTenancyBlock integration tests Rename MultiTenancyBlockIT to MultiTenancyBlockSingleNodeTests so it runs under the 'test' Gradle task (which matches *Tests.class) instead of 'integTest' (which matches *IT.class). The class extends OpenSearchSingleNodeTestCase which requires an embedded node, but the integTest task starts a full REST cluster causing node lifecycle conflicts. Also fix IndexWorkflowRequest in the test to include a non-empty delegates list so request validation passes before the multi-tenancy check in doExecute runs. Remove the query-level monitor success test since it requires SDK tenant context propagation that is not available in the embedded single-node test environment. Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> * fix(test): Fix TransportMultiTenancyBlockTests unit tests Replace Mockito.mock() calls for final Kotlin classes (ScheduledJobIndices, LockService, DocLevelMonitorQueries, MonitorRunnerService) with real instances constructed from already-mocked dependencies, since Mockito cannot mock final classes without the inline mock maker which is incompatible with the OpenSearch test framework. Fix invokeDoExecute reflection helper to find the correct typed doExecute method, preferring the typed override over the base class ActionRequest-based method when both exist. Fix assertMethodNotAllowed to check the AlertingException status directly instead of trying to unwrap to OpenSearchStatusException, since AlertingException.wrap() wraps the cause in a plain Exception. Register all required AlertingSettings and DestinationSettings in the test ClusterSettings to prevent SettingsException during transport action construction. Replace mocked IndexWorkflowRequest with a real instance using randomWorkflow helper since IndexWorkflowRequest is also a final class. Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> --------- Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> Co-authored-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…earch-project#2096) * Add EventBridge Scheduler CRUD for external monitor scheduling Adds infrastructure for managing EventBridge Scheduler schedules as part of the Oasis external job scheduling feature (OSSA-606/607). New files: - ExternalSchedulerService: CRUD operations for EB schedules with TODO stubs for SDK wiring (OSSA-608) - ScheduleTranslator: Converts OpenSearch cron/interval schedules to EB rate() and cron() expressions - MonitorPayloadBuilder: Builds gzip+base64 SQS target input from monitor Modified files: - TransportIndexMonitorAction: Create/update EB schedule hooks with rollback on create failure, non-fatal on update failure - TransportDeleteMonitorAction: Delete EB schedule before monitor record, write cell info to ThreadContext for OASIS post-filter Signed-off-by: Riya Saxena <riysaxen@amazon.com> * Sync with common-utils PR opensearch-project#939: consolidate scheduler utilities Removes local copies of MonitorPayloadBuilder, ScheduleTranslator, and its tests — these now live in common-utils under the commons.alerting.util package. - MonitorPayloadBuilder → SchedulePayloadBuilder (renamed upstream) - buildTargetInput signature simplified (Option A): only monitor and jobStartTimePlaceholder. Routing fields (appId, workspaceId, ebCellAccountId) are now carried on Monitor.metadata, populated upstream by the OASIS ActionFilter, not passed as builder params. - SQS payload contains only job_start_time and monitorConfig. Consumer parses monitor.id and metadata from inside monitorConfig (single source of truth, no top-level duplication). - Fix import paths for ScheduleTranslator in ExternalSchedulerService and its tests. * Clean up external scheduler: remove internal codenames and redundant region param - Rename EB_CELL_* constants and ebCell* params to scheduler* for clarity - Drop schedulerRegion: EB schedule is always created in the cluster's own region, so plugin can resolve it locally instead of threading it through ThreadContext and service calls - Remove internal codenames (OASIS, OSSA-606/608, Neo DS, LLD) from comments, KDoc, and log messages - Update tests to match new signatures * Address PR review: drop ScheduleRequest, fail update on schedule error, always log - Remove ScheduleRequest data class — it was internal dead weight until SDK is wired, and when wired we'll construct CreateScheduleRequest directly from the AWS SDK. Inline its fields into log statements instead. - Fail monitor update if EB schedule update throws (was swallowed previously, which hid failures and left monitor and schedule out of sync). - Log scheduler account from ThreadContext unconditionally for debuggability, even when null. * Gate external scheduler hooks behind plugins.alerting.external_scheduler.enabled Adds a new dynamic boolean cluster setting to enable/disable managing monitor schedules on EventBridge. Defaults to false so OSS users see no behavioral change; deployments that use the external scheduler can flip it on via the cluster settings API. The check lives at the caller site (both transport actions) rather than inside ExternalSchedulerService so callers know explicitly whether a schedule was created/updated/deleted. This addresses Chase's review: either gate at the caller, or throw if gating inside the service. * Read scheduler routing from plugin settings; keep accountId ThreadContext override Replaces the all-ThreadContext design with plugin settings for the three routing values the transport actions need: - plugins.alerting.external_scheduler.account_id (default) - plugins.alerting.external_scheduler.queue_arn - plugins.alerting.external_scheduler.role_arn account_id remains overridable per-request via the existing transient ThreadContext key (scheduler.account_id) to support multi-tenant routing where upstream filters select a different account than the cluster default. queue_arn and role_arn come from plugin settings only — these are cluster- level EB deployment details that do not vary per-request. All three settings are dynamic so they can be flipped at runtime via the cluster settings API without a restart. Addresses Chase's review: - 'These should come from plugin settings and do not need to include cell in the name' - 'The account id should additionally check if there is an entry in the threadcontext and override the plugin setting value if so' * Extract SchedulerRoutingResolver with unit tests Moves the ThreadContext-override + plugin-setting fallback logic out of TransportIndexMonitorAction and TransportDeleteMonitorAction into a pure object, SchedulerRoutingResolver, in the service package. - Eliminates duplication between the two transport actions (both had an identical resolveSchedulerAccountId helper and similar blank-checking). - Exposes two entry points: resolve() for create/update (accountId, queueArn, roleArn) and resolveForDelete() (accountId, roleArn only). - Added SchedulerRoutingResolverTests covering all branches: setting-only, ThreadContext override wins, blank override ignored, any required field missing returns null. Addresses Chase's review on test coverage: 'add unit tests for the other methods as well' — the new logic introduced by the plugin-settings switch is now fully covered by pure unit tests that don't require transport-action or cluster wiring. * Fix missing @volatile fields that broke compile Restores the three @volatile declarations for externalSchedulerAccountId, externalSchedulerQueueArn, and externalSchedulerRoleArn in TransportIndexMonitorAction. These were referenced by the setting update consumers and the create/update helpers but the field declarations themselves were dropped during a prior edit, breaking compilation. * Trigger CI after common-utils opensearch-project#939 publish Signed-off-by: Riya Saxena <riysaxen@amazon.com> * Add transport action unit tests for external scheduler settings Register EXTERNAL_SCHEDULER_* settings in TransportDeleteMonitorActionTests and create TransportIndexMonitorActionTests covering: - Action construction with scheduler enabled/disabled - Dynamic settings registration verification - ThreadContext scheduler account id override - Setting default values Addresses review comment: 'Let's add unit tests for the other methods as well. We should also update the existing tests for index/delete monitor action to cover the new logic.' Signed-off-by: Riya Saxena <riysaxen@amazon.com> * Temporarily pin common-utils to 3.7.0.0-SNAPSHOT common-utils main was bumped to 3.7.0-SNAPSHOT before PR opensearch-project#939 (ScheduleTranslator/SchedulePayloadBuilder) merged, so the snapshot artifact is published under 3.7.0.0-SNAPSHOT. Alerting main is still on 3.6.0-SNAPSHOT. Pin until alerting's own version bump lands. TODO: revert to opensearch_build after alerting version bump to 3.7.0 Signed-off-by: Riya Saxena <riysaxen@amazon.com> * Fix sample-remote-monitor-plugin Monitor constructor for new params Add null metadata and null target params to all 3 Monitor constructor calls in SampleRemoteMonitorRestHandler. These fields were added to the Monitor class in common-utils 3.7.0.0-SNAPSHOT (PR opensearch-project#939 metadata field and Target object). Signed-off-by: Riya Saxena <riysaxen@amazon.com> * Address Chase's review: guard delete, unify Routing, pass objects 1. [BLOCKING] Add externalSchedulerEnabled guard to delete flow — deleteExternalSchedule now only called when flag is true, matching create/update behavior. 2. Merge DeleteRouting into Routing — single data class with queueArn set to empty string for delete path. Simplifies the model. 3. Move logSchedulerContext from TransportDeleteMonitorAction into ExternalSchedulerService — service class has concrete details about where the schedule is being created/deleted. 4. Extract EB_SCHEDULED_TIME_PLACEHOLDER constant into ExternalSchedulerService instead of inline string literal. 5. Pass full Routing object to ExternalSchedulerService methods instead of breaking apart into individual fields at call sites. Signed-off-by: Riya Saxena <riysaxen@amazon.com> * Fix ExternalSchedulerServiceTests for Routing object signatures Update test calls to use SchedulerRoutingResolver.Routing objects instead of individual string params, matching the refactored ExternalSchedulerService method signatures. Signed-off-by: Riya Saxena <riysaxen@amazon.com> --------- Signed-off-by: Riya Saxena <riysaxen@amazon.com> Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…pensearch-project#2104) Migrate from SchedulePayloadBuilder.buildTargetInput() to building ScheduleJobPayload-compatible JSON for EB schedule target input. The new payload includes monitorId at top level, matching the ScheduleJobPayload POJO added in common-utils PR opensearch-project#943. - Replace SchedulePayloadBuilder import with ScheduleJobPayload - Add buildScheduleJobPayloadJson() helper using ScheduleJobPayload field constants with EB placeholder for job_start_time - Update ExternalSchedulerServiceTests to use real payload JSON - Add test verifying payload contains all ScheduleJobPayload fields Signed-off-by: Riya Saxena <riysaxen@amazon.com> Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…xternal monitor scheduling (opensearch-project#2103) * feat(poller): Add MonitorJobPoller for external monitor scheduling - JobQueueUrlsProvider: interface for queue URL discovery - MonitorJobPoller: AbstractLifecycleComponent with coroutine-based poll workers, round-robin queue selection. Workers only launched when enabled (MULTI_TENANCY_ENABLED). Idles until jobQueueUrlsProvider and sqsClient are set. - Queue integration: receiveMessage (max=1, waitTime=0, visibility=90), deleteMessage on success, exception resilience - Message parsing: uses SchedulePayloadBuilder.parseTargetInput() from common-utils for deserialization (monitor id preserved via monitorId field) - Execution: uses suspendUntil with ExecuteMonitorAction, inline Monitor (skips metadata store fetch when monitor object is provided) - TransportExecuteMonitorAction: prioritize inline monitor over monitorId fetch when both are present - Plugin integration: MonitorJobPoller created in createComponents, returned as lifecycle component - Cache SchedulerClient per account using AssumeRoleCredentialsCache with StsAssumeRoleCredentialsProvider for automatic credential refresh - ExternalSchedulerService.initialize(settings) reads region and messageGroupKeyName from settings directly instead of external mutation - Rename SQS_SEND_MESSAGE_ARN to EB_SQS_UNIVERSAL_TARGET_ARN - Remove manual STS AssumeRole + StaticCredentialsProvider per call Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

@Suppress

* feat(bucket-trigger): Add bucket-level trigger evaluation using standard bucket_selector Replace the custom BucketSelectorExt aggregation with standard bucket_selector for bucket-level trigger evaluation when multi_tenant_trigger_eval_enabled is true. This enables trigger evaluation on user clusters that do not have the alerting plugin installed. **When the flag is on, bucket-level monitors are limited to 1 trigger.** The standard bucket_selector is injected directly into the monitor's search query so a single search call performs both data collection and trigger evaluation. Include/exclude filtering from BucketSelectorExtFilter is replicated post-response by BucketKeyFilter. The existing BucketSelectorExt path is completely unchanged when the flag is off (the default). New files: - BucketSelectorQueryBuilder: injects standard bucket_selector sub-agg - BucketKeyFilter: post-response include/exclude bucket filtering - TriggerService.runBucketLevelTriggerFromFilteredResponse(): reads remaining buckets from filtered response Modified files: - BucketLevelMonitorRunner: flag-gated single-query path - InputService: useStandardBucketSelector parameter - TransportIndexMonitorAction: 1-trigger validation for bucket-level monitors when flag is on - AggregationQueryRewriter: skipBucketSelectorInjection parameter Integration tests: - RemoteBucketLevelTriggerIT: 13 tests covering composite/terms agg, alert lifecycle, dry run, nested parent path, include/exclude filter, validation of 1-trigger limit, Painless script in query - RemoteBucketLevelTriggerRegressionIT: 5 tests verifying flag=false preserves existing BucketSelectorExt behavior Signed-off-by: Surya Sashank Nistala <snistala@amazon.com> * fix: Remove dead code and clean up PR - Remove collectInputResultsForTrigger() from InputService (dead code from per-trigger query approach) - Remove skipAllBucketSelectorInjection parameter from collectInputResults and getSearchRequest (no longer used) - Collapse identical triggerCtx if/else branches in BucketLevelMonitorRunner - Restore .gitignore to original state Signed-off-by: Surya Sashank Nistala <snistala@amazon.com> * fix: Address PR review comments — remove reflection, add runtime validation - BucketSelectorQueryBuilder: use public bucketsPathsMap getter from common-utils instead of reflection (opensearch-project/common-utils#949) - BucketKeyFilter: use IncludeExclude.convertToStringFilter() instead of reflection on private fields - TriggerService: replace blanket @Suppress(UNCHECKED_CAST) with runtime type checks using require() - BucketLevelMonitorRunner: add 1-trigger validation at execution time to cover the _execute API path (in addition to create/update validation) Signed-off-by: Surya Sashank Nistala <snistala@amazon.com> --------- Signed-off-by: Surya Sashank Nistala <snistala@amazon.com> Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

Replace all org.json JSONObject/JSONArray usage in PPLUtils.kt, InputService.kt, and TriggerService.kt with Jackson equivalents (ObjectMapper, JsonNode, ArrayNode). The only remaining org.json usage is the TransportPPLQueryRequest constructor call which requires org.json.JSONObject (SQL plugin API). The org.json dependency remains as implementation since the SQL plugin needs it at runtime. Added explicit jackson-databind dependency since it was previously pulled transitively by org.json. No functionality changes - all JSON is read, written, and modified in the exact same way as before. Signed-off-by: Dennis Toepker <toepkerd@amazon.com> � Conflicts: � alerting/src/main/kotlin/org/opensearch/alerting/PPLUtils.kt

* Add tenant to alerts service requests Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> * Add dependencies Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> --------- Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> Co-authored-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com> Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…d allocation to thread dispatch Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…d of suspend Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…onitor config req body Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

lezzago · 2026-05-11T15:13:25Z

    override val results: List<Map<String, Any>>,
+    val pplQueryResults: List<Map<String, Any?>>, // each list element is a result row


Can you explain the difference between the two and how the value differs for ppl monitors?

Despite both being List<Map<String, Any>>, the shape/structure of the data inside is different between the two, so the code can't assume it can parse/read that field the same way in either a PPL or DSL Monitor run.

Example results value (used by Query-Level Monitors and other DSL-based Monitors):

[ { "_shards": {"total": 1, "failed": 0, "successful": 1, "skipped": 0}, "hits": { "hits": [{"_id": "abc", "_source": {"status": 500, "endpoint": "/api"}}], "total": {"value": 3, "relation": "eq"}, "max_score": 1.0 }, "took": 45, "timed_out": false } ]

Example pplQueryResults value (used only by PPL Monitors):

[ {"endpoint": "/api/orders", "error_count": 3}, {"endpoint": "/api/checkout", "error_count": 1} ]

Will add comment explaining the difference between these fields.

lezzago · 2026-05-11T15:14:41Z

            Setting.Property.NodeScope, Setting.Property.Dynamic
        )

-        val PPL_MONITOR_EXECUTION_MAX_DURATION = Setting.positiveTimeSetting(


Why are we removing this?

it's replaced with a per-trigger max execution duration setting as opposed to a per-monitor max execution duration. Each custom trigger can run its own query, so the timeboxing is done at a per trigger level instead of a per monitor level.

lezzago · 2026-05-11T15:26:22Z

+                            if (execMonitorRequest.manual && !checkUserPermissionsWithResource(
+                                    user, monitor.user, actionListener,
+                                    "monitor", execMonitorRequest.monitorId
+                                )


Instead should we call the get monitor api with the user's context, so it would automatically validate correctly if the user has permissions to even fetch the monitor as the get monitor api would do the RBAC check.

I'm not sure about this approach. Get Monitor calls checkUserPermissionsWithResource just like ExecuteMonitor is doing here (reference), adding an extra hop from ExecuteMonitor to GetMonitor to call checkUserPermissionsWithResource doesn't seem very clean when ExecuteMonitor can do that itself.

Exploring running checkUserPermissionsWithResource() directly, without checking manual field first.

lezzago · 2026-05-11T15:29:55Z

+        val query = if (ctx.monitor.isPPLMonitor()) (ctx.monitor.inputs[0] as PPLInput).query else null
+        val queryResults = if (ctx.monitor.isPPLMonitor()) ctx.pplQueryResults else emptyList()


Rename this to pplQuery and pplQueryResults if this is only meant for ppl monitors.

This naming is used so that the query and query results in Alerts specifically can be included for more than just PPL Monitors, in response to this feedback from common-utils. This generalization (of creating a field that either PPL Monitors or existing Monitor types can use) only exists for Alerts currently. In internal Monitor execution logic, a distinction is made between DSL results and PPL results because their shapes are different despite both being List<Map<String, Any>>. See: #2130 (comment).

Need to clarify comment that these fields are for PPL Monitors for now, but can be extended for other Monitor types to use.

lezzago · 2026-05-11T19:38:28Z

+
+        var queryResponseJson: JsonNode? = null
+
+        withTimeout(monitorExecutionDuration.millis) {


if it times out, do we want any logging and catching the exception?

That is already done. QueryLevelMonitorRunner calls collectInputResultsForPPLMonitor(), which wraps runPPLBaseQuery() in a try catch. collectInputResultsForPPLMonitor()'s try catch catches the timeout exception (or any other exception), logs an error log, then returns an InputRunResults with a non-empty error. This causes the generation of an ERROR alert.

eirsep · 2026-05-11T21:48:54Z

            }
        }

+        ExternalSchedulerService.initialize(settings)


why are these changes appearing as an addition from this CR?

This must have been an oversight on my part when fixing merge conflicts. ExternalSchedulerService.initialize(settings) is already called on line 393 (and it's not showing up as an addition in the PR). Will remove this.

eirsep · 2026-05-11T22:03:05Z

+            // these query results are for number_of_results PPL triggers,
+            // only the number of results returned by base query matters
+            // for those triggers, not the contents themselves
+            val basePplQueryResults = runPPLBaseQuery(


if there are no numResults triggers skip runnin base query

eirsep · 2026-05-11T22:13:31Z

        }
    }

+    suspend fun collectInputResultsForPPLMonitor(


i would bat for PPL monitor to have only 1 custom PPL trigger and if it has any custom trigger then it shouldn't have a num results trigger.. but it can have multiple num_results triggers in the same monitor.

running one monitor sequentially would be a nightmare because partial failures would be tough to ahndle and if the job crahes after processing 2 out 5 custom triggers, then when it retries there would be duplicate

just an open discussion..

Alerts are collected while PPL Triggers are run (just like Query-Level Triggers), then bulk saved all at once after all Triggers are executed. If the job crashes in the middle of processing its Triggers, no Alerts would have been written. This means no duplicate Alerts when the PPL Monitor retries.

eirsep · 2026-05-11T22:27:40Z

+
+        var queryResponseJson: JsonNode? = null
+
+        withTimeout(monitorExecutionDuration.millis) {


What is the timeout value.. also what is max number of PPL triggers allowed?
is x less than 5 minutes.

Am asking because the monitor cannot run longer than 5 minutes in total since the job scheduler lock expires and will fire a duplicate job

Will reduce max ppl query duration setting default value from 1m to 30s, and add a new max ppl triggers setting whose max value is 10.

… execution timeout Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

lezzago · 2026-05-12T00:02:20Z

+                logger.debug("query results for trigger ${pplTrigger.id}: $queryResponseJson")
+                logger.debug("time taken to execute query against sql/ppl plugin: $timeTaken")
+
+                // val numPplResults = basePplQueryResults.getLong("total")


Did you mean to delete the commented code?

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

…s setting with max value of 10 Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

toepkerd-zz and others added 30 commits May 8, 2026 16:04

PPL Alerting: refactoring PPL Alerting to be behind V1 APIs, and reti…

3dc2980

…ring V2 APIs Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

PPL Alerting: Removing Stateless Alerts and Supporting Only Stateful …

67c37ed

…Alerts Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

undoing build.gradle commenting out

2455a04

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

fix linter

76ff91b

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

bump to 3.7 and fix compiler errors

c7137ae

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

fix integ test compiler errors

6ae3bab

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

undoing accidental change to return Workflow? instead of Workflow

bfc6f18

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

removing scheduled jobs index version increment

4e660c8

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

removing mention of stateless alerts from RestGetAlertsAction

3c6827f

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

misc cleanup

2457792

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

moving monitor rest tests

1227e67

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

misc cleanup

8a37080

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

removing various vestiges of older implementations, simplifying threa…

0cf9f76

…d allocation to thread dispatch Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

refactoring index ppl monitor query validation to use callback instea…

1152af1

…d of suspend Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

adding comments for the different execute ppl query helpers

d00e39e

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

reverting scheduled job stats changes

1d217e7

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

removing outdated MonitorDataSourcesIT logic

575ce45

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

cleanup

aa9beea

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

cleanup

1a77b18

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

restoring release notes

76f0947

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

cleanup

63768cf

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

cleanup

1c07da8

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

more cleanup

3740b37

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

toepkerd-zz added 2 commits May 8, 2026 16:32

adding security ITs for manual execute monitor honoring RBAC

06a525c

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

fixing security ITs and removing rbac check in execute monitor with m…

065c629

…onitor config req body Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

toepkerd marked this pull request as ready for review May 9, 2026 00:03

toepkerd requested review from AWSHurneyt, KashKondaka, amsiglan, bowenlan-amzn, eirsep, engechas, getsaurabh02, goyamegh, lezzago, manaswini1920, nishtham-amazon, rishabhmaurya, riysaxen-amzn and vikhy-aws as code owners May 9, 2026 00:03

toepkerd changed the title ~~Ppl v1~~ PPL Alerting: PPL Monitor Execution May 9, 2026

lezzago reviewed May 11, 2026

View reviewed changes

eirsep reviewed May 11, 2026

View reviewed changes

eirsep requested changes May 11, 2026

View reviewed changes

eirsep reviewed May 11, 2026

View reviewed changes

addressing various comments and renaming timeout setting to ppl query…

cebd930

… execution timeout Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

lezzago reviewed May 12, 2026

View reviewed changes

toepkerd-zz added 3 commits May 11, 2026 22:51

skip running base query if PPL Monitor only contains custom triggers

39299ac

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

removing dead commented code

50cf51e

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

refactor: Remove manual field from ExecuteMonitorRequest

cf9216b

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

toepkerd force-pushed the ppl-v1 branch from 9f54cff to cf9216b Compare May 12, 2026 16:56

reducing max ppl query duration to 30s and adding new max ppl trigger…

275e23f

…s setting with max value of 10 Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

		override val results: List<Map<String, Any>>,
		val pplQueryResults: List<Map<String, Any?>>, // each list element is a result row

		val query = if (ctx.monitor.isPPLMonitor()) (ctx.monitor.inputs[0] as PPLInput).query else null
		val queryResults = if (ctx.monitor.isPPLMonitor()) ctx.pplQueryResults else emptyList()


		var queryResponseJson: JsonNode? = null

		withTimeout(monitorExecutionDuration.millis) {

Conversation

toepkerd commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

The only files that need deep review are:

New manual Field

Check List

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

toepkerd May 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

toepkerd commented May 8, 2026 •

edited

Loading

New `manual` Field

toepkerd May 11, 2026 •

edited

Loading