Skip to content

PPL Alerting: PPL Monitor Execution#2130

Open
toepkerd wants to merge 39 commits into
opensearch-project:mainfrom
toepkerd:ppl-v1
Open

PPL Alerting: PPL Monitor Execution#2130
toepkerd wants to merge 39 commits into
opensearch-project:mainfrom
toepkerd:ppl-v1

Conversation

@toepkerd
Copy link
Copy Markdown
Collaborator

@toepkerd toepkerd commented May 8, 2026

Description

This PR primarily contains logic for running PPL Monitors. It also adds RBAC checks for manual Monitor executions, a gap revealed by a security review.

To support RBAC filtering for manual monitor execution, a new field manual is added to ExecuteMonitorRequest/Response. Many of the file changes are simply to accommodate this new field and can be ignored in review.

The only files that need deep review are:

TriggerService.kt
InputService.kt
AlertService.kt
QueryLevelMonitorRunner.kt
TransportExecuteMonitorAction.kt
QueryLevelTriggerExecutionContext.kt

New manual Field

The manual field in ExecuteMonitorRequest distinguishes between user-initiated executions (via the REST API) and system-initiated executions (via the job scheduler or MonitorJobPoller). When a user calls the Execute Monitor API, the REST handler sets manual = true. When the scheduler or poller triggers execution internally, it sets manual = false. In TransportExecuteMonitorAction, this field gates the RBAC check: when manual is true and the monitor was fetched by ID, checkUserPermissionsWithResource is called to verify the calling user's backend roles overlap with the monitor owner's backend roles (enforcing filter_by_backend_roles). When manual is false, this check is skipped entirely — scheduled executions run as the system and must not be blocked by user-level RBAC. This ensures that monitors continue to fire on schedule regardless of which user created them, while still preventing unauthorized users from manually triggering another user's monitor via the API.

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

toepkerd-zz and others added 30 commits May 8, 2026 16:04
…ring V2 APIs

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…Alerts

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…roject#2099)

* Block unsupported actions when multi-tenancy is enabled

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

* Fix ktlint

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

* Fix CIs

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

* fix(test): Fix MultiTenancyBlock integration tests

Rename MultiTenancyBlockIT to MultiTenancyBlockSingleNodeTests so it
runs under the 'test' Gradle task (which matches *Tests.class) instead
of 'integTest' (which matches *IT.class). The class extends
OpenSearchSingleNodeTestCase which requires an embedded node, but the
integTest task starts a full REST cluster causing node lifecycle
conflicts.

Also fix IndexWorkflowRequest in the test to include a non-empty
delegates list so request validation passes before the multi-tenancy
check in doExecute runs.

Remove the query-level monitor success test since it requires SDK
tenant context propagation that is not available in the embedded
single-node test environment.

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

* fix(test): Fix TransportMultiTenancyBlockTests unit tests

Replace Mockito.mock() calls for final Kotlin classes
(ScheduledJobIndices, LockService, DocLevelMonitorQueries,
MonitorRunnerService) with real instances constructed from
already-mocked dependencies, since Mockito cannot mock final classes
without the inline mock maker which is incompatible with the
OpenSearch test framework.

Fix invokeDoExecute reflection helper to find the correct typed
doExecute method, preferring the typed override over the base class
ActionRequest-based method when both exist.

Fix assertMethodNotAllowed to check the AlertingException status
directly instead of trying to unwrap to OpenSearchStatusException,
since AlertingException.wrap() wraps the cause in a plain Exception.

Register all required AlertingSettings and DestinationSettings in the
test ClusterSettings to prevent SettingsException during transport
action construction.

Replace mocked IndexWorkflowRequest with a real instance using
randomWorkflow helper since IndexWorkflowRequest is also a final
class.

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

---------

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>
Co-authored-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…earch-project#2096)

* Add EventBridge Scheduler CRUD for external monitor scheduling

Adds infrastructure for managing EventBridge Scheduler schedules as part of
the Oasis external job scheduling feature (OSSA-606/607).

New files:
- ExternalSchedulerService: CRUD operations for EB schedules with TODO stubs
  for SDK wiring (OSSA-608)
- ScheduleTranslator: Converts OpenSearch cron/interval schedules to EB
  rate() and cron() expressions
- MonitorPayloadBuilder: Builds gzip+base64 SQS target input from monitor

Modified files:
- TransportIndexMonitorAction: Create/update EB schedule hooks with rollback
  on create failure, non-fatal on update failure
- TransportDeleteMonitorAction: Delete EB schedule before monitor record,
  write cell info to ThreadContext for OASIS post-filter

Signed-off-by: Riya Saxena <riysaxen@amazon.com>

* Sync with common-utils PR opensearch-project#939: consolidate scheduler utilities

Removes local copies of MonitorPayloadBuilder, ScheduleTranslator, and
its tests — these now live in common-utils under the commons.alerting.util
package.

- MonitorPayloadBuilder → SchedulePayloadBuilder (renamed upstream)
- buildTargetInput signature simplified (Option A): only monitor and
  jobStartTimePlaceholder. Routing fields (appId, workspaceId,
  ebCellAccountId) are now carried on Monitor.metadata, populated
  upstream by the OASIS ActionFilter, not passed as builder params.
- SQS payload contains only job_start_time and monitorConfig. Consumer
  parses monitor.id and metadata from inside monitorConfig (single
  source of truth, no top-level duplication).
- Fix import paths for ScheduleTranslator in ExternalSchedulerService
  and its tests.

* Clean up external scheduler: remove internal codenames and redundant region param

- Rename EB_CELL_* constants and ebCell* params to scheduler* for clarity
- Drop schedulerRegion: EB schedule is always created in the cluster's own
  region, so plugin can resolve it locally instead of threading it through
  ThreadContext and service calls
- Remove internal codenames (OASIS, OSSA-606/608, Neo DS, LLD) from comments,
  KDoc, and log messages
- Update tests to match new signatures

* Address PR review: drop ScheduleRequest, fail update on schedule error, always log

- Remove ScheduleRequest data class — it was internal dead weight until SDK
  is wired, and when wired we'll construct CreateScheduleRequest directly
  from the AWS SDK. Inline its fields into log statements instead.
- Fail monitor update if EB schedule update throws (was swallowed previously,
  which hid failures and left monitor and schedule out of sync).
- Log scheduler account from ThreadContext unconditionally for debuggability,
  even when null.

* Gate external scheduler hooks behind plugins.alerting.external_scheduler.enabled

Adds a new dynamic boolean cluster setting to enable/disable managing monitor
schedules on EventBridge. Defaults to false so OSS users see no behavioral
change; deployments that use the external scheduler can flip it on via the
cluster settings API.

The check lives at the caller site (both transport actions) rather than
inside ExternalSchedulerService so callers know explicitly whether a
schedule was created/updated/deleted. This addresses Chase's review:
either gate at the caller, or throw if gating inside the service.

* Read scheduler routing from plugin settings; keep accountId ThreadContext override

Replaces the all-ThreadContext design with plugin settings for the three
routing values the transport actions need:

- plugins.alerting.external_scheduler.account_id (default)
- plugins.alerting.external_scheduler.queue_arn
- plugins.alerting.external_scheduler.role_arn

account_id remains overridable per-request via the existing transient
ThreadContext key (scheduler.account_id) to support multi-tenant routing
where upstream filters select a different account than the cluster default.
queue_arn and role_arn come from plugin settings only — these are cluster-
level EB deployment details that do not vary per-request.

All three settings are dynamic so they can be flipped at runtime via the
cluster settings API without a restart.

Addresses Chase's review:
- 'These should come from plugin settings and do not need to include cell
  in the name'
- 'The account id should additionally check if there is an entry in the
  threadcontext and override the plugin setting value if so'

* Extract SchedulerRoutingResolver with unit tests

Moves the ThreadContext-override + plugin-setting fallback logic out of
TransportIndexMonitorAction and TransportDeleteMonitorAction into a pure
object, SchedulerRoutingResolver, in the service package.

- Eliminates duplication between the two transport actions (both had an
  identical resolveSchedulerAccountId helper and similar blank-checking).
- Exposes two entry points: resolve() for create/update (accountId,
  queueArn, roleArn) and resolveForDelete() (accountId, roleArn only).
- Added SchedulerRoutingResolverTests covering all branches: setting-only,
  ThreadContext override wins, blank override ignored, any required field
  missing returns null.

Addresses Chase's review on test coverage: 'add unit tests for the other
methods as well' — the new logic introduced by the plugin-settings switch
is now fully covered by pure unit tests that don't require transport-action
or cluster wiring.

* Fix missing @volatile fields that broke compile

Restores the three @volatile declarations for externalSchedulerAccountId,
externalSchedulerQueueArn, and externalSchedulerRoleArn in
TransportIndexMonitorAction. These were referenced by the setting update
consumers and the create/update helpers but the field declarations
themselves were dropped during a prior edit, breaking compilation.

* Trigger CI after common-utils opensearch-project#939 publish

Signed-off-by: Riya Saxena <riysaxen@amazon.com>

* Add transport action unit tests for external scheduler settings

Register EXTERNAL_SCHEDULER_* settings in TransportDeleteMonitorActionTests
and create TransportIndexMonitorActionTests covering:
- Action construction with scheduler enabled/disabled
- Dynamic settings registration verification
- ThreadContext scheduler account id override
- Setting default values

Addresses review comment: 'Let's add unit tests for the other methods
as well. We should also update the existing tests for index/delete
monitor action to cover the new logic.'

Signed-off-by: Riya Saxena <riysaxen@amazon.com>

* Temporarily pin common-utils to 3.7.0.0-SNAPSHOT

common-utils main was bumped to 3.7.0-SNAPSHOT before PR opensearch-project#939
(ScheduleTranslator/SchedulePayloadBuilder) merged, so the snapshot
artifact is published under 3.7.0.0-SNAPSHOT. Alerting main is still
on 3.6.0-SNAPSHOT. Pin until alerting's own version bump lands.

TODO: revert to opensearch_build after alerting version bump to 3.7.0
Signed-off-by: Riya Saxena <riysaxen@amazon.com>

* Fix sample-remote-monitor-plugin Monitor constructor for new params

Add null metadata and null target params to all 3 Monitor constructor
calls in SampleRemoteMonitorRestHandler. These fields were added to
the Monitor class in common-utils 3.7.0.0-SNAPSHOT (PR opensearch-project#939 metadata
field and Target object).

Signed-off-by: Riya Saxena <riysaxen@amazon.com>

* Address Chase's review: guard delete, unify Routing, pass objects

1. [BLOCKING] Add externalSchedulerEnabled guard to delete flow —
   deleteExternalSchedule now only called when flag is true, matching
   create/update behavior.

2. Merge DeleteRouting into Routing — single data class with queueArn
   set to empty string for delete path. Simplifies the model.

3. Move logSchedulerContext from TransportDeleteMonitorAction into
   ExternalSchedulerService — service class has concrete details about
   where the schedule is being created/deleted.

4. Extract EB_SCHEDULED_TIME_PLACEHOLDER constant into
   ExternalSchedulerService instead of inline string literal.

5. Pass full Routing object to ExternalSchedulerService methods instead
   of breaking apart into individual fields at call sites.

Signed-off-by: Riya Saxena <riysaxen@amazon.com>

* Fix ExternalSchedulerServiceTests for Routing object signatures

Update test calls to use SchedulerRoutingResolver.Routing objects
instead of individual string params, matching the refactored
ExternalSchedulerService method signatures.

Signed-off-by: Riya Saxena <riysaxen@amazon.com>

---------

Signed-off-by: Riya Saxena <riysaxen@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…pensearch-project#2104)

Migrate from SchedulePayloadBuilder.buildTargetInput() to building
ScheduleJobPayload-compatible JSON for EB schedule target input.
The new payload includes monitorId at top level, matching the
ScheduleJobPayload POJO added in common-utils PR opensearch-project#943.

- Replace SchedulePayloadBuilder import with ScheduleJobPayload
- Add buildScheduleJobPayloadJson() helper using ScheduleJobPayload
  field constants with EB placeholder for job_start_time
- Update ExternalSchedulerServiceTests to use real payload JSON
- Add test verifying payload contains all ScheduleJobPayload fields

Signed-off-by: Riya Saxena <riysaxen@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…xternal monitor scheduling (opensearch-project#2103)

* feat(poller): Add MonitorJobPoller for external monitor scheduling

- JobQueueUrlsProvider: interface for queue URL discovery
- MonitorJobPoller: AbstractLifecycleComponent with coroutine-based
  poll workers, round-robin queue selection. Workers only launched
  when enabled (MULTI_TENANCY_ENABLED). Idles until jobQueueUrlsProvider
  and sqsClient are set.
- Queue integration: receiveMessage (max=1, waitTime=0, visibility=90),
  deleteMessage on success, exception resilience
- Message parsing: uses SchedulePayloadBuilder.parseTargetInput() from
  common-utils for deserialization (monitor id preserved via monitorId field)
- Execution: uses suspendUntil with ExecuteMonitorAction, inline Monitor
  (skips metadata store fetch when monitor object is provided)
- TransportExecuteMonitorAction: prioritize inline monitor over
  monitorId fetch when both are present
- Plugin integration: MonitorJobPoller created in createComponents,
  returned as lifecycle component

- Cache SchedulerClient per account using AssumeRoleCredentialsCache
  with StsAssumeRoleCredentialsProvider for automatic credential refresh
- ExternalSchedulerService.initialize(settings) reads region and
  messageGroupKeyName from settings directly instead of external mutation
- Rename SQS_SEND_MESSAGE_ARN to EB_SQS_UNIVERSAL_TARGET_ARN
- Remove manual STS AssumeRole + StaticCredentialsProvider per call

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
* feat(bucket-trigger): Add bucket-level trigger evaluation using standard bucket_selector

Replace the custom BucketSelectorExt aggregation with standard
bucket_selector for bucket-level trigger evaluation when
multi_tenant_trigger_eval_enabled is true. This enables trigger
evaluation on user clusters that do not have the alerting plugin
installed.

**When the flag is on, bucket-level monitors are limited to 1 trigger.**
The standard bucket_selector is injected directly into the monitor's
search query so a single search call performs both data collection and
trigger evaluation. Include/exclude filtering from BucketSelectorExtFilter
is replicated post-response by BucketKeyFilter.

The existing BucketSelectorExt path is completely unchanged when the
flag is off (the default).

New files:
- BucketSelectorQueryBuilder: injects standard bucket_selector sub-agg
- BucketKeyFilter: post-response include/exclude bucket filtering
- TriggerService.runBucketLevelTriggerFromFilteredResponse(): reads
  remaining buckets from filtered response

Modified files:
- BucketLevelMonitorRunner: flag-gated single-query path
- InputService: useStandardBucketSelector parameter
- TransportIndexMonitorAction: 1-trigger validation for bucket-level
  monitors when flag is on
- AggregationQueryRewriter: skipBucketSelectorInjection parameter

Integration tests:
- RemoteBucketLevelTriggerIT: 13 tests covering composite/terms agg,
  alert lifecycle, dry run, nested parent path, include/exclude filter,
  validation of 1-trigger limit, Painless script in query
- RemoteBucketLevelTriggerRegressionIT: 5 tests verifying flag=false
  preserves existing BucketSelectorExt behavior

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* fix: Remove dead code and clean up PR

- Remove collectInputResultsForTrigger() from InputService (dead code
  from per-trigger query approach)
- Remove skipAllBucketSelectorInjection parameter from collectInputResults
  and getSearchRequest (no longer used)
- Collapse identical triggerCtx if/else branches in BucketLevelMonitorRunner
- Restore .gitignore to original state

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

* fix: Address PR review comments — remove reflection, add runtime validation

- BucketSelectorQueryBuilder: use public bucketsPathsMap getter from
  common-utils instead of reflection (opensearch-project/common-utils#949)
- BucketKeyFilter: use IncludeExclude.convertToStringFilter() instead
  of reflection on private fields
- TriggerService: replace blanket @Suppress(UNCHECKED_CAST) with
  runtime type checks using require()
- BucketLevelMonitorRunner: add 1-trigger validation at execution time
  to cover the _execute API path (in addition to create/update validation)

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>

---------

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Replace all org.json JSONObject/JSONArray usage in PPLUtils.kt,
InputService.kt, and TriggerService.kt with Jackson equivalents
(ObjectMapper, JsonNode, ArrayNode).

The only remaining org.json usage is the TransportPPLQueryRequest
constructor call which requires org.json.JSONObject (SQL plugin API).
The org.json dependency remains as implementation since the SQL
plugin needs it at runtime. Added explicit jackson-databind
dependency since it was previously pulled transitively by org.json.

No functionality changes - all JSON is read, written, and modified
in the exact same way as before.

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>

� Conflicts:
�	alerting/src/main/kotlin/org/opensearch/alerting/PPLUtils.kt
* Add tenant to alerts service requests

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

* Add dependencies

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>

---------

Signed-off-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>
Co-authored-by: Chase Engelbrecht <engechas@dev-dsk-engechas-2a-28f138b0.us-west-2.amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…d allocation to thread dispatch

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…d of suspend

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…onitor config req body

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
@toepkerd toepkerd marked this pull request as ready for review May 9, 2026 00:03
@toepkerd toepkerd changed the title Ppl v1 PPL Alerting: PPL Monitor Execution May 9, 2026
Comment on lines 21 to +22
override val results: List<Map<String, Any>>,
val pplQueryResults: List<Map<String, Any?>>, // each list element is a result row
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain the difference between the two and how the value differs for ppl monitors?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Despite both being List<Map<String, Any>>, the shape/structure of the data inside is different between the two, so the code can't assume it can parse/read that field the same way in either a PPL or DSL Monitor run.

Example results value (used by Query-Level Monitors and other DSL-based Monitors):

[
    {
      "_shards": {"total": 1, "failed": 0, "successful": 1, "skipped": 0},
      "hits": {
        "hits": [{"_id": "abc", "_source": {"status": 500, "endpoint": "/api"}}],
        "total": {"value": 3, "relation": "eq"},
        "max_score": 1.0
      },
      "took": 45,
      "timed_out": false
    }
  ]

Example pplQueryResults value (used only by PPL Monitors):

[
    {"endpoint": "/api/orders", "error_count": 3},
    {"endpoint": "/api/checkout", "error_count": 1}
]

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add comment explaining the difference between these fields.

Setting.Property.NodeScope, Setting.Property.Dynamic
)

val PPL_MONITOR_EXECUTION_MAX_DURATION = Setting.positiveTimeSetting(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we removing this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's replaced with a per-trigger max execution duration setting as opposed to a per-monitor max execution duration. Each custom trigger can run its own query, so the timeboxing is done at a per trigger level instead of a per monitor level.

Comment on lines +179 to +182
if (execMonitorRequest.manual && !checkUserPermissionsWithResource(
user, monitor.user, actionListener,
"monitor", execMonitorRequest.monitorId
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead should we call the get monitor api with the user's context, so it would automatically validate correctly if the user has permissions to even fetch the monitor as the get monitor api would do the RBAC check.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this approach. Get Monitor calls checkUserPermissionsWithResource just like ExecuteMonitor is doing here (reference), adding an extra hop from ExecuteMonitor to GetMonitor to call checkUserPermissionsWithResource doesn't seem very clean when ExecuteMonitor can do that itself.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exploring running checkUserPermissionsWithResource() directly, without checking manual field first.

Comment on lines +209 to +210
val query = if (ctx.monitor.isPPLMonitor()) (ctx.monitor.inputs[0] as PPLInput).query else null
val queryResults = if (ctx.monitor.isPPLMonitor()) ctx.pplQueryResults else emptyList()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename this to pplQuery and pplQueryResults if this is only meant for ppl monitors.

Copy link
Copy Markdown
Collaborator Author

@toepkerd toepkerd May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This naming is used so that the query and query results in Alerts specifically can be included for more than just PPL Monitors, in response to this feedback from common-utils. This generalization (of creating a field that either PPL Monitors or existing Monitor types can use) only exists for Alerts currently. In internal Monitor execution logic, a distinction is made between DSL results and PPL results because their shapes are different despite both being List<Map<String, Any>>. See: #2130 (comment).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to clarify comment that these fields are for PPL Monitors for now, but can be extended for other Monitor types to use.


var queryResponseJson: JsonNode? = null

withTimeout(monitorExecutionDuration.millis) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if it times out, do we want any logging and catching the exception?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is already done. QueryLevelMonitorRunner calls collectInputResultsForPPLMonitor(), which wraps runPPLBaseQuery() in a try catch. collectInputResultsForPPLMonitor()'s try catch catches the timeout exception (or any other exception), logs an error log, then returns an InputRunResults with a non-empty error. This causes the generation of an ERROR alert.

}
}

ExternalSchedulerService.initialize(settings)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these changes appearing as an addition from this CR?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This must have been an oversight on my part when fixing merge conflicts. ExternalSchedulerService.initialize(settings) is already called on line 393 (and it's not showing up as an addition in the PR). Will remove this.

// these query results are for number_of_results PPL triggers,
// only the number of results returned by base query matters
// for those triggers, not the contents themselves
val basePplQueryResults = runPPLBaseQuery(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if there are no numResults triggers skip runnin base query

}
}

suspend fun collectInputResultsForPPLMonitor(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would bat for PPL monitor to have only 1 custom PPL trigger and if it has any custom trigger then it shouldn't have a num results trigger.. but it can have multiple num_results triggers in the same monitor.

running one monitor sequentially would be a nightmare because partial failures would be tough to ahndle and if the job crahes after processing 2 out 5 custom triggers, then when it retries there would be duplicate

just an open discussion..

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alerts are collected while PPL Triggers are run (just like Query-Level Triggers), then bulk saved all at once after all Triggers are executed. If the job crashes in the middle of processing its Triggers, no Alerts would have been written. This means no duplicate Alerts when the PPL Monitor retries.


var queryResponseJson: JsonNode? = null

withTimeout(monitorExecutionDuration.millis) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the timeout value.. also what is max number of PPL triggers allowed?
is x less than 5 minutes.

Am asking because the monitor cannot run longer than 5 minutes in total since the job scheduler lock expires and will fire a duplicate job

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will reduce max ppl query duration setting default value from 1m to 30s, and add a new max ppl triggers setting whose max value is 10.

… execution timeout

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
logger.debug("query results for trigger ${pplTrigger.id}: $queryResponseJson")
logger.debug("time taken to execute query against sql/ppl plugin: $timeTaken")

// val numPplResults = basePplQueryResults.getLong("total")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you mean to delete the commented code?

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
…s setting with max value of 10

Signed-off-by: Dennis Toepker <toepkerd@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants