automatic troubleshooting endpoint context docs #728
Conversation
ferullo
left a comment
There's a lot to review here so I'm going to submit a review for each file. The first one is done. It's reasonable/expected to sit on my recommendations until an Endpoint developer reviews too (so they can see my comments and contradict me, etc).
> ## Summary
>
> Elastic Defend uses a kernel-mode driver (`elastic_endpoint_driver.sys`) for file system filtering, network monitoring, and process/object callbacks. Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures. Most BSOD issues traced to the endpoint driver fall into a few categories: regressions introduced in specific driver versions, conflicts with other kernel-mode drivers (third-party security products), or running on unsupported OS versions.
> Because this driver runs at the kernel level, bugs or incompatibilities can cause system-wide crashes (BSODs) rather than isolated process failures.

Does an LLM need to be told that?
> Memory dump analysis with WinDbg (`!analyze -v`) is essential for root-cause determination. The bugcheck code alone is not sufficient — the faulting call stack identifies which code path triggered the crash.
I don't expect customers to do this. Perhaps text that describes how to collect a memory dump and a note that says to share it with us?
> ### ODX-enabled volume crash (8.19.8, 9.1.8, 9.2.2)
>
> A regression introduced in versions 8.19.8, 9.1.8, and 9.2.2 causes BSODs on systems with ODX (Offloaded Data Transfer) enabled volumes, particularly affecting Hyper-V clusters and Windows Server 2016 Datacenter. The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`.
> The crash occurs in the file system filter driver's post-FsControl handler (`bkPostFsControl`) when processing offloaded write completions. The faulting call stack typically shows `elastic_endpoint_driver!bkPostFsControl` followed by `FLTMGR!FltGetStreamContext`.
I doubt this is useful info for a Kibana user.
> A regression in the network driver introduced in Elastic Defend versions 8.17.8, 8.18.3, and 9.0.3 can cause kernel pool corruption on systems with a large number of long-lived network connections that remain inactive for 30+ minutes. The corruption manifests as BSODs with various bugcheck codes including `IRQL_NOT_LESS_OR_EQUAL`, `SYSTEM_SERVICE_EXCEPTION`, `KERNEL_MODE_HEAP_CORRUPTION`, or `PAGE_FAULT_IN_NONPAGED_AREA`.
> This is the most frequently reported BSOD pattern and affects Windows Server environments with persistent connections (e.g. database servers, backup servers running Veeam with PostgreSQL). The kernel pool may already be corrupted when the driver attempts a routine memory allocation, causing the crash to appear in unrelated code paths.
I doubt this paragraph is useful info for a Kibana user.
> **Fixed version**: 9.2.4.
> **Mitigation**: Downgrade to the prior agent version (e.g. 9.2.1) until the upgrade to 9.2.4+ can be performed.
Agent doesn't support downgrade (or am I mistaken?). This should say "Upgrade to a version with the fix", I think.
> Other security products running kernel-mode drivers can interfere with Elastic Defend's driver initialization or runtime operation. The most commonly reported conflicts include:
> - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. When another thread within Defend tasks the kernel driver before initialization completes, the system crashes. The call stack typically shows `elastic_endpoint_driver!HandleIrpDeviceControl` calling `bkRegisterCallbacks` with `KeAcquireGuardedMutex` on an uninitialized mutex. This interaction was introduced by a WFP driver refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).
Suggested change:

> - **Trellix Access Control**: Trellix's kernel driver can intercept the Windows Base Filtering Engine (BFE) service, causing Defend's WFP (Windows Filtering Platform) driver initialization to hang or take an extremely long time. This interaction was introduced by an Elastic Defend refactor in 8.16.0. Fixed in 8.17.6, 8.18.1, and 9.0.1. Workaround: disable Trellix Access Protection or add a Trellix exclusion for the BFE service (`svchost.exe`).
Question for Defend developers: is "add a Trellix exclusion for the BFE service (svchost.exe)." an acceptable recommendation?
I don't think that's a good recommendation to give a customer after the fixed versions were released. At this point the recommendation should be to upgrade.
> - **CrowdStrike, Kaspersky, Windows Defender coexistence**: Running multiple endpoint security products increases the probability of kernel-level interactions. Each additional kernel-mode filter driver introduces another point of contention for file system, registry, and network callbacks. When BSODs occur on systems with multiple security products, simplify by removing redundant products.
> - **High third-party driver count**: Systems with an unusually high number of third-party kernel drivers (e.g. 168+ drivers) amplify the risk of pool corruption being attributed to or triggered by any one driver. Enable Driver Verifier on suspect third-party drivers to isolate the true source.
IDK, is this actionable by users?
> ### Unsupported OS version
> Upgrading Elastic Defend to a version that dropped support for the host's Windows version causes immediate BSODs or boot loops. The most common case is upgrading to 8.13+ on Windows Server 2012 R2, which lost support in that release. The system crashes during driver load because the driver uses kernel APIs unavailable on the older OS.
Something is missing here, we added support for Windows Server 2012 R2 back in 8.16.0.
> In some cases the endpoint driver causes a system deadlock rather than a classic BSOD. The system becomes completely unresponsive — applications freeze, Task Manager hangs, and the Elastic service cannot be stopped. This typically requires a hard reboot. A kernel memory dump captured during the lockup (via keyboard-initiated crash: right Ctrl + Scroll Lock twice) is required for diagnosis.
> This pattern has been observed when the driver's file system filter processing enters a long-running or blocking state while monitoring specific applications. If the lockup is reproducible with a specific application, adding that application as a Trusted Application may resolve the conflict.
This paragraph doesn't seem useful to a user.
> ## Investigation priorities
> 1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Run WinDbg `!analyze -v` to identify the exact bugcheck code, faulting module, and call stack. Without the dump, root cause cannot be determined.
Suggested change:

> 1) Collect the full kernel memory dump (`C:\Windows\MEMORY.DMP` or minidumps from `C:\Windows\Minidump\`). Share the dump with Elastic.
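If the doc ends up telling customers only to collect and share dumps, a small helper like this could illustrate the "find the dump to share" step. This is a sketch only: the two paths are the ones quoted in the doc, and the function name is hypothetical, not part of any Elastic tooling.

```python
from pathlib import Path

def newest_dump(dump_locations):
    """Return the most recently modified .dmp file among the given
    locations (each a file path or a directory to scan), or None."""
    candidates = []
    for loc in map(Path, dump_locations):
        if loc.is_file() and loc.suffix.lower() == ".dmp":
            candidates.append(loc)
        elif loc.is_dir():
            # Directories such as C:\Windows\Minidump hold one file per crash.
            candidates.extend(p for p in loc.glob("*.dmp") if p.is_file())
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)

# Paths quoted in the doc: the full dump first, then the minidump folder.
# dump = newest_dump([r"C:\Windows\MEMORY.DMP", r"C:\Windows\Minidump"])
```

Picking the newest file matters because `C:\Windows\Minidump\` accumulates one file per crash and support usually wants the dump matching the most recent incident.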
> ## Symptom
> A custom notification message has been configured in the Elastic Defend Device Control policy to display when a USB device is blocked, but the Windows system tray popup does not appear. Instead, the user sees only a generic Windows Explorer error stating the device is not accessible. Alternatively, device-specific allow/block rules based on `device.serial_number` do not match the intended device because the serial number field contains `0` or a seemingly random value.
These two situations are completely different. Should a single MD doc cover them both or is it better to break this doc up? I'm holding off on reading this file until this is answered.
I have no personal preference, I'm just bringing this up in case it helps with context windows.
> ## Summary
> Elastic Defend on Linux uses eBPF and fanotify to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.
Suggested change:

> Elastic Defend on Linux uses eBPF or tracefs to monitor process, file, network, and DNS activity in real time. Each event is enriched, hashed, evaluated against behavioral rules, and forwarded to the configured output. The most common drivers of high CPU on Linux are monitoring scripts that spawn many short-lived child processes (each generating process events that trigger behavioral rules), output server disconnections causing retry storms, the Events plugin hashing large binaries during policy application with an empty cache, and memory scanning of large processes.
> CPU returns to normal within approximately 40 seconds after connectivity is restored. The command `sudo /opt/Elastic/Endpoint/elastic-endpoint test output` can be used to verify output connectivity — on affected versions this command itself will spike CPU when the output is unreachable.
> Upgrade to 8.13.4+ where the retry loop includes proper backoff. Check `logs-elastic_agent.endpoint_security-*` for the error patterns above.
8.13.4 is really old, do we need to call this out?

> The endpoint will report status as CONFIGURING during this time:
> - `Endpoint is setting status to CONFIGURING, reason: Policy Application Status`
> On first run with an empty cache, the CONFIGURING phase can take 5–30 minutes depending on the number and size of running processes. This is expected behavior. Subsequent restarts are fast because the cache persists.
5-30 minutes? is that supposed to be seconds?

I suppose not, because 30 seconds wouldn't be a problem.
> ## Symptom
> An Endpoint Alert Exception has been created in Elastic Defend, but the endpoint continues to generate alerts or block processes that should be excluded. Alert documents matching the exception criteria still appear in `logs-endpoint.alerts-*`.
We should add another doc for when Alert documents do not appear in `logs-endpoint.alerts-*`:
- A shipping error from Endpoint
- A configuration error in Kafka or Logstash
- Actually Endpoint isn't actively blocking activity, it's caused by some sort of interaction with another AV or process -- targeted Trusted Apps is the most common solution
> ### Exception scoped to wrong integration policy
> Endpoint Alert Exceptions can be scoped to specific integration policies ("per policy") or applied globally. If the exception is assigned to a policy that does not cover the affected endpoint, it has no effect on that endpoint.
This is not true. We plan to ship per-policy exceptions in 9.4 and that'll be opt-in.
> ### Exception on wrong cluster in cross-cluster search (CCS) setups
> In architectures where Fleet-managed agents send data to a "data cluster" and analysts query via CCS from a separate "analyst cluster", Endpoint Alert Exceptions must be created on the data cluster (the one running Fleet Server). Exceptions created on the analyst cluster are not distributed to endpoints because CCS does not support Fleet Actions.
Suggested change:

> In some architectures Fleet-managed agents are managed via an "analyst cluster" but write bulk data to a "data cluster". In these architectures analysts query data via CCS from the data cluster, and Endpoint Alert Exceptions must be created on the analyst cluster (the one running Fleet Server). Exceptions created on the data cluster are not distributed to endpoints because CCS does not support Fleet Actions.
> If exceptions exist only on the analyst cluster, alert documents will continue to be generated by endpoints and indexed on the data cluster. The analyst cluster will then alert on those documents via CCS.
Suggested change:

> If exceptions exist only on the data cluster, alert documents will continue to be generated by endpoints and indexed on the data cluster. The analyst cluster will then alert on those documents via CCS.

> ### Exception not suppressing Kibana detection rule alerts
> Endpoint Alert Exceptions suppress alerts generated by the endpoint itself — they prevent the endpoint from creating alert documents in `logs-endpoint.alerts-*`. However, if a Kibana detection rule (e.g. the prebuilt "Malware Detection Alert" rule) is configured to fire on endpoint alert documents, that Kibana rule generates its own separate alert documents.
> If the endpoint exception is working correctly (no new documents in `logs-endpoint.alerts-*`), but alerts still appear in the Kibana Alerts page, the source is the Kibana detection rule, not the endpoint. In this case, add a Detection Rule Exception on the Kibana rule itself, or disable the redundant detection rule.
> To distinguish: check the `kibana.alert.rule.rule_type_id` field. Endpoint-generated alerts have `event.module: endpoint` and `event.dataset: endpoint.alerts`. Kibana detection rule alerts have fields like `kibana.alert.rule.category: "Custom Query Rule"` and `kibana.alert.rule.producer: "siem"`.
This whole section makes no sense. It's not possible for the SIEM rule to create an alert if there is no underlying Endpoint alert in `logs-endpoint.alerts-*`.

Maybe we mean to say alerts created by most SIEM rules need SIEM exceptions, not Endpoint exceptions?
> To suppress on Elastic Defend 9.2+: add the process as a **Trusted Application** (this disables behavioral detections for the process entirely) OR add an **Endpoint Alert Exception** targeting the specific rule ID and process if you want to suppress only that one rule.
> To suppress on pre-9.2: Trusted Applications do not suppress behavioral detections on these older versions. Use an **Endpoint Alert Exception** targeting `rule.id` and the relevant process fields.
Suggested change: remove this paragraph.
> - A third-party driver or application causes memory corruption that makes non-executable memory pages appear executable, leading to unexpected scanning of heap regions.
> - Faulty RAM causes bit flips in page table entries, changing page protection bits and exposing non-code memory to the scanner.
Suggested change: remove these two bullets.

> ### Alert field data corrupted by product bug
> In rare cases, the endpoint may populate alert fields with corrupted data (e.g. truncated `dll.path` or `dll.name` values due to kernel-level memory corruption from a third-party driver). When the corrupted field values happen to match a behavioral detection rule's criteria — for example, a DLL path truncated to appear as if it has an unusual extension — the rule fires on data that does not reflect the actual system state.
> These are product bugs, not fixable via exceptions in the general case. However, a targeted Endpoint Alert Exception can suppress the specific false-positive pattern as a workaround. For example, suppressing a rule when the DLL's code signature status indicates it was never actually loaded (`dll.code_signature.status: "errorCode_endpoint: Initital state, no attempt to load signature was made"`).
> If you encounter alerts where field values appear truncated or nonsensical, collect agent diagnostics and the full alert document for Elastic Support. Check for third-party kernel drivers that may be causing memory corruption — the Windows Driver Verifier can help identify misbehaving drivers on non-production systems.
This was a very edge case, I think we should remove it. It won't generalize well.

Suggested change: remove the whole "Alert field data corrupted by product bug" section.
> The Elastic Defend blocklist prevents known-malicious binaries from executing. However, if a process matches the **Global Exception List** before the blocklist is evaluated, the blocklist entry is skipped. The Global Exception List contains entries for major trusted software publishers (e.g. Microsoft). This means binaries signed by these vendors cannot be blocked via the user blocklist — the global exception takes priority.
> This is a known product limitation. If the goal is to prevent specific trusted-signed binaries from running, the blocklist is not the correct mechanism. Consider using operating system-level application control policies (e.g. Windows AppLocker or WDAC) instead.
Suggested change: remove this paragraph.
> ## Investigation priorities
> 1) Determine the detection engine that generated the alert by checking `event.code`: `malicious_file` (malware engine), `behavior` (behavioral protection), `memory_signature` (memory scanning). This determines which artifact type to use for suppression.
Suggested change:

> 1) Determine the detection engine that generated the alert by checking `event.code`: `malicious_file` (malware protection), `behavior` (malicious behavior protection), `memory_signature` (memory threat protection), `ransomware` (ransomware protection). This determines which artifact type to use for suppression.
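Since that triage step is a straight lookup, it could be expressed as a table. A minimal sketch, assuming the corrected engine names from the suggestion above; the dictionary and function names here are illustrative, not product API:

```python
# Illustrative mapping from an alert document's event.code to the
# Elastic Defend protection engine that produced it.
ENGINE_BY_EVENT_CODE = {
    "malicious_file": "malware protection",
    "behavior": "malicious behavior protection",
    "memory_signature": "memory threat protection",
    "ransomware": "ransomware protection",
}

def detection_engine(alert: dict) -> str:
    """Return the engine name for an alert document, or 'unknown'
    when event.code is missing or unrecognized."""
    code = alert.get("event", {}).get("code", "")
    return ENGINE_BY_EVENT_CODE.get(code, "unknown")
```

The "unknown" fallback is deliberate: an unrecognized `event.code` should prompt a closer look at the alert rather than a guess at the suppression artifact.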
> Monitoring and automation scripts that run on a schedule (cron, systemd timers) and spawn many child processes are the most common cause of high CPU on Linux. A single monitoring script invoking `curl`, `mysql`, `ssh`, `grep`, `sed`, `awk`, and `bash` in rapid succession generates a burst of process creation events, each of which Elastic Defend must enrich and evaluate against behavioral rules.
> A typical pattern is hourly CPU spikes lasting 5–10 minutes, aligning with cron schedules (e.g. xx:31–xx:41 every hour). In one case, a script at `/var/cache/system-monitoring/helper/compare-inventory.sh` that used `curl` to collect data triggered the behavioral rule "Suspicious Download and Redirect by Web Server" 86,340 times in a single diagnostics window, driving endpoint service CPU above 200% (out of 800% on 8 cores).
Suggested change:

> A typical pattern is hourly CPU spikes lasting 5–10 minutes, aligning with cron schedules (e.g. xx:31–xx:41 every hour).
> Adding the parent script as a Trusted Application stops monitoring of its process tree but does not prevent behavioral rules from firing if the rule matches on child process characteristics. On versions prior to 9.2, behavioral detections still fire for trusted processes. On 9.2+, behavioral detections are disabled for Trusted Applications.
@nicholasberlin Trusted Application Descendants or Event Filter Descendants will help here, right?
> - Create an **Endpoint Alert Exception** targeting the specific rule ID and parent process:
>   - `rule.id IS <rule-id>` AND `process.parent.executable IS /path/to/script`
> - Upgrade to 8.19.11+ or 9.2+ for improved handling of trusted process behavioral rules.
> - Review `linux.advanced.events` settings to disable unnecessary event types (e.g. `linux.advanced.events.dns` if DNS monitoring is not needed).
Suggested change:

> - Review `linux.advanced.events` settings to disable unnecessary event types (e.g. `linux.advanced.events.dns` if malicious behavior rules based on DNS monitoring are not needed).
> - Use **Event Filters** to reduce event volume from known-noisy directories without creating a monitoring blind spot.
Suggested change:

> - Use **Event Filters** to reduce event volume from known-noisy directories without creating an active protection blind spot.
> ### Events plugin hung hashing large binaries during policy application
> During policy application, Elastic Defend hashes all running processes and their executables. When the file cache (`/opt/Elastic/Endpoint/state/cache.db`) is empty — first install, cache deleted, or after upgrade — the endpoint must hash every binary from scratch. Large binaries (e.g. Oracle at `/data/app/oracle/product/*/bin/oracle`) can cause the Events plugin to hang in `ConfigurationCallback` while performing SHA1 hashing, driving CPU to 100% for extended periods.
Suggested change:

> During policy application, Elastic Defend hashes all running processes and their executables. When the file cache (`/opt/Elastic/Endpoint/state/cache.db`) is empty — first install, cache deleted, or after upgrade — the endpoint must hash every binary from scratch. Large binaries (e.g. Oracle at `/data/app/oracle/product/*/bin/oracle`) can drive CPU to 100% for extended periods.
tomsonpl
left a comment
Looks alright, left a few questions just for clarity sake :)
> ---
> type: automatic_troubleshooting
> sub_type: trusted_apps
> link: https://www.elastic.co/docs/solutions/security/manage-elastic-defend/optimize-elastic-defend
I see that only 3 files contain the link field, just checking if this is intentional?

yeah, this is an optional field
> ---
> type: automatic_troubleshooting
> sub_type: incompatible_software
> date: '2026-03-11'
> ## Summary
> Elastic Defend performs real-time file hashing, digital signature verification, memory scanning, behavioral rule evaluation, and event enrichment. Each protection layer adds processing overhead, and the cumulative effect depends on the workload profile of the host. The most common drivers of high CPU on Windows are third-party security product conflicts creating mutual scanning loops, high-volume security events (logon/logoff) overwhelming the rules engine, file-intensive operations triggering repeated hashing of large binaries, and VDI/Citrix environments where the file metadata cache is empty on every session.
Suggested change:

> Elastic Defend performs real-time file hashing, digital signature verification, memory scanning, behavioral rule evaluation, and event enrichment. Each protection layer adds processing overhead, and the cumulative effect depends on the workload profile of the host. The most common drivers of high CPU on Windows are third-party security product conflicts creating mutual scanning loops, high-volume security events (logon/logoff) overwhelming the Malicious Behavior engine, file-intensive operations triggering repeated hashing of large binaries, and VDI/Citrix environments where the file metadata cache is empty on every session.
> Elastic Defend hashes and verifies digital signatures of executables and DLLs when they are loaded. Large binaries like `msedge.dll` (195 MB) can take 10–15 seconds per hash operation. On hosts running browsers, Office applications, or developer tools that load many large DLLs, this creates sustained CPU spikes.
|
|
||
| The endpoint maintains a file metadata cache to avoid re-hashing known files. On versions prior to 8.0.1, the cache size (`FILE_OBJECT_CACHE_SIZE`) was limited to 500 entries, leading to frequent cache evictions and redundant hashing. Upgrading to 8.0.1+ significantly improves caching behavior. |
pre 8.0.1 is so old it is no longer supported.
| The endpoint maintains a file metadata cache to avoid re-hashing known files. On versions prior to 8.0.1, the cache size (`FILE_OBJECT_CACHE_SIZE`) was limited to 500 entries, leading to frequent cache evictions and redundant hashing. Upgrading to 8.0.1+ significantly improves caching behavior. |
|
|
||
| The endpoint maintains a file metadata cache to avoid re-hashing known files. On versions prior to 8.0.1, the cache size (`FILE_OBJECT_CACHE_SIZE`) was limited to 500 entries, leading to frequent cache evictions and redundant hashing. Upgrading to 8.0.1+ significantly improves caching behavior. | ||
|
|
||
| For immediate relief on older versions, set `windows.advanced.kernel.asyncimageload` and `windows.advanced.kernel.syncimageload` to `false` in advanced policy settings. This reduces CPU by approximately 85% for DLL load processing at the cost of reduced visibility into library load events. |
| For immediate relief on older versions, set `windows.advanced.kernel.asyncimageload` and `windows.advanced.kernel.syncimageload` to `false` in advanced policy settings. This reduces CPU by approximately 85% for DLL load processing at the cost of reduced visibility into library load events. |
intxgo
left a comment
Flushing some comments/suggestions
|
|
||
| ## Summary | ||
|
|
||
| A Trusted Application tells Elastic Defend to stop monitoring a process entirely — it creates an intentional blind spot. This is fundamentally different from an Endpoint Alert Exception, which only suppresses alerts while continuing to monitor. Many "trusted app not working" issues stem from using the wrong artifact type, misconfiguring the entry's condition fields, or running a version affected by known event-processing bugs. |
| A Trusted Application tells Elastic Defend to stop monitoring a process entirely — it creates an intentional blind spot. This is fundamentally different from an Endpoint Alert Exception, which only suppresses alerts while continuing to monitor. Many "trusted app not working" issues stem from using the wrong artifact type, misconfiguring the entry's condition fields, or running a version affected by known event-processing bugs. | |
| A Trusted Application tells Elastic Defend to stop monitoring a process entirely — it creates an intentional blind spot - implicitly meaning it only works on executables. This is fundamentally different from an Endpoint Alert Exception, which only suppresses alerts while continuing to monitor. Many "trusted app not working" issues stem from using the wrong artifact type, misconfiguring the entry's condition fields, or running a version affected by known event-processing bugs. |
- implicitly meaning it only works on executables
This is true for standard trusted apps but Advanced Trusted Apps lets users trust non-process activity. The original version seems more correct to me.
@ferullo I have to verify that, but I seriously doubt it, last time I walked the code, TA was evaluated only on execution callbacks, so anything like file access/modification/creation won't work here, but that's what people assume this can do
| ## Common issues | ||
|
|
||
| ### Wrong condition field or value |
| ## Common issues | |
| ### Wrong condition field or value | |
| ## Common issues | |
| ### Logical misuse | |
| Trusted application makes an application process trusted, allowing it to do whatever it does, modify files, start other applications, etc. When you're creating a Trusted Application entry you're adding a filter telling {{elastic-defend}} to fully trust a process matching the filter and all its actions. However it can't be used the other way around, to whitelist a particular action for some process(es), for example to trust any app to modify a particular csv file whilst doing full monitoring for other actions made by the process(es). | |
| ### Wrong condition field or value |
or ### Conceptual misuse ?
I can see from SDHs that people really misuse TA, most likely because we don't have a commonly used feature "Suppress protections (trust) everything matching the path: xxx".
The problem is that they hardly ever clean up unsuccessful TA entries, either they don't know or don't care, but the more entries the more CPU is spent on every process start evaluating the TA list.
We've seen TA artifacts containing paths like C:\somethin\*.sys, C:\*\*.txt
Replace {{elastic-defend}} with Elastic Defend.
we don't have a commonly used feature "Suppress protections (trust) everything matching the path: xxx"
It's not obvious but the solution is Endpoint Alert Exceptions for that (either based on file.path MATCHES c:\whatever\* or process.executable MATCHES c:\whatever\*)
agree, we need to walk the users by hand here: explain that TA is not the right tool for what they think it does -> point them towards exceptions... and here is another rabbit hole. I've recently tried asking a public LLM how to make an exception in Elastic Defend for a particular alert, and it got it all wrong! LLMs (like me, looking at the public documents) are really confused about how to set Endpoint exceptions. The whole documentation is written for BEATS, loosely adding Elastic Defend to the picture as "Endpoint exception", but IMO this is really confusing when you're having actions/processes blocked on your machine and going to Kibana to "suppress them". The verbiage in the UI and docs could be much improved. I know I'm biased as an Endpoint developer, but the UI/docs kind of hide/obscure the mechanics of Endpoint alerts and exception lists.
| ## Symptom | ||
|
|
||
| Elastic Agent reports the Endpoint component as `FAILED` or `DEGRADED` with the message "Failed: endpoint service missed 3 check-ins". The Endpoint may appear unhealthy in Fleet, and endpoint events, alerts, and metadata stop being ingested. In some cases thousands of endpoints go unhealthy simultaneously. |
| ## Symptom | |
| Elastic Agent reports the Endpoint component as `FAILED` or `DEGRADED` with the message "Failed: endpoint service missed 3 check-ins". The Endpoint may appear unhealthy in Fleet, and endpoint events, alerts, and metadata stop being ingested. In some cases thousands of endpoints go unhealthy simultaneously. | |
| ## Symptom | |
| The Agent appears `ORPHANED` in Fleet, Endpoint policy changes do not apply, and response actions cannot execute. In some cases thousands of endpoints go unhealthy simultaneously. | |
| ## Summary | |
| Elastic Endpoint has lost communication with Elastic Agent, but is still protecting the system according to last known policy and sends all events and alerts to the stack. | |
| ## Common issues | |
| ### Elastic Agent not running | |
| Elastic Agent crashed, or has been explicitly disabled or removed. | |
| ### Communication between Agent and Endpoint services not working | |
| Elastic Agent reports the Endpoint component as `FAILED` or `DEGRADED` with the message "Failed: endpoint service missed 3 check-ins". For troubleshooting see next symptom below. | |
| ## Symptom | |
| Elastic Agent appears `UNHEALTHY` in Fleet and the Elastic Defend integration indicates errors. Elastic Agent reports the Endpoint component as `FAILED` or `DEGRADED` with the message "Failed: endpoint service missed 3 check-ins". Endpoint events, alerts, and metadata stop being ingested. In some cases thousands of endpoints go unhealthy simultaneously. |
| os: [Linux] | ||
| date: '2026-03-09' | ||
| --- | ||
|
|
Similar to the Windows case, please also explain the Elastic Agent crash case.
|
|
||
| ## Symptom | ||
|
|
||
| A third-party application malfunctions, crashes, or causes system instability when Elastic Defend (or the Endgame Sensor) is installed. Common manifestations include: application crashes (e.g. browsers closing unexpectedly), network policies or VPN functionality breaking, CPU spikes rendering the system unresponsive, or virtual machines hanging during policy changes. The issue resolves when the endpoint security product is uninstalled or specific protection features are disabled. |
| A third-party application malfunctions, crashes, or causes system instability when Elastic Defend (or the Endgame Sensor) is installed. Common manifestations include: application crashes (e.g. browsers closing unexpectedly), network policies or VPN functionality breaking, CPU spikes rendering the system unresponsive, or virtual machines hanging during policy changes. The issue resolves when the endpoint security product is uninstalled or specific protection features are disabled. | |
| A third-party application malfunctions, crashes, or causes system instability when Elastic Defend is installed. Common manifestations include: application crashes (e.g. browsers closing unexpectedly), network policies or VPN functionality breaking, CPU spikes rendering the system unresponsive, or virtual machines hanging during policy changes. The issue resolves when the endpoint security product is uninstalled or specific protection features are disabled. |
why did that discontinued product appear here? Furthermore it kind of suggests that Endpoint and the Sensor are recommended to run together.
|
|
||
| ## Summary | ||
|
|
||
| Elastic Defend operates at multiple system levels — kernel-mode drivers for file system and network filtering, user-mode DLL injection for exploit protection (Endgame Sensor), and eBPF probes on Linux for host isolation and event collection. These deep integrations can conflict with other software that operates at similarly privileged levels: other security products with kernel drivers, eBPF-based networking tools, VPN clients that intercept DNS or network traffic, and applications that generate extreme volumes of system activity. |
| Elastic Defend operates at multiple system levels — kernel-mode drivers for file system and network filtering, user-mode DLL injection for exploit protection (Endgame Sensor), and eBPF probes on Linux for host isolation and event collection. These deep integrations can conflict with other software that operates at similarly privileged levels: other security products with kernel drivers, eBPF-based networking tools, VPN clients that intercept DNS or network traffic, and applications that generate extreme volumes of system activity. | |
| Elastic Defend operates at multiple system levels — kernel-mode drivers for file system and network filtering, user-mode DLL injection for exploit protection, and eBPF probes on Linux for host isolation and event collection. These deep integrations can conflict with other software that operates at similarly privileged levels: other security products with kernel drivers, eBPF-based networking tools, VPN clients that intercept DNS or network traffic, and applications that generate extreme volumes of system activity. |
|
|
||
| Silverfort is a particularly acute case on Domain Controllers because its `SilverfortServer.exe` generates over 10,000 TCP connections per minute via WinDivert, each producing a network event that Elastic Defend must process. Combined with Malicious Behavior Protection requiring network event enrichment for the rules engine, this can saturate CPU and eventually cause blue screens. | ||
|
|
||
| To confirm, check `elastic-endpoint top` for the third-party product's processes. If they dominate `overall.week_ms` in the metrics, add them as **Trusted Applications** in Elastic Defend. Also add Elastic Defend's paths to the third-party product's exclusion list. |
| To confirm, check `elastic-endpoint top` for the third-party product's processes. If they dominate `overall.week_ms` in the metrics, add them as **Trusted Applications** in Elastic Defend. Also add Elastic Defend's paths to the third-party product's exclusion list. | |
| Add all 3rd party security applications as **Trusted Applications** in Elastic Defend to break feedback loops. Also add Elastic Defend's paths to the third-party product's exclusion list. | |
| To confirm the problem on the Endpoint side, check `elastic-endpoint top` for the third-party product's processes and see whether they dominate `overall.week_ms` in the metrics. |
|
|
||
| ## Symptom | ||
|
|
||
| A Windows system running Elastic Defend experiences a Blue Screen of Death (BSOD) or kernel crash. The memory dump analysis references `elastic_endpoint_driver.sys` or `elastic-endpoint-driver.sys`. Common bugcheck codes include `IRQL_NOT_LESS_OR_EQUAL`, `KERNEL_MODE_HEAP_CORRUPTION` (0x13A), `SYSTEM_SERVICE_EXCEPTION` (0x3B), or `PAGE_FAULT_IN_NONPAGED_AREA`. The crash may occur shortly after an agent upgrade, during heavy I/O workloads, or after the system has been running for some time with many network connections. In severe cases the system enters a boot loop. |
| A Windows system running Elastic Defend experiences a Blue Screen of Death (BSOD) or kernel crash. The memory dump analysis references `elastic_endpoint_driver.sys` or `elastic-endpoint-driver.sys`. Common bugcheck codes include `IRQL_NOT_LESS_OR_EQUAL`, `KERNEL_MODE_HEAP_CORRUPTION` (0x13A), `SYSTEM_SERVICE_EXCEPTION` (0x3B), or `PAGE_FAULT_IN_NONPAGED_AREA`. The crash may occur shortly after an agent upgrade, during heavy I/O workloads, or after the system has been running for some time with many network connections. In severe cases the system enters a boot loop. | |
| A Windows system running Elastic Defend experiences a Blue Screen of Death (BSOD) or kernel crash. The memory dump analysis references `elastic_endpoint_driver.sys` or `elastic-endpoint-driver.sys`. The crash may occur shortly after an agent upgrade, a 3rd party security product installation or its configuration change, during heavy I/O workloads, or after the system has been running for some time with many network connections. In severe cases the system enters a boot loop. |
I don't think we're commonly causing crashes, so this sentence makes absolutely no sense to me.
|
|
||
| ## Summary | ||
|
|
||
| Elastic Defend operates at multiple system levels — kernel-mode drivers for file system and network filtering, user-mode DLL injection for exploit protection (Endgame Sensor), and eBPF probes on Linux for host isolation and event collection. These deep integrations can conflict with other software that operates at similarly privileged levels: other security products with kernel drivers, eBPF-based networking tools, VPN clients that intercept DNS or network traffic, and applications that generate extreme volumes of system activity. |
| Elastic Defend operates at multiple system levels — kernel-mode drivers for file system and network filtering, user-mode DLL injection for exploit protection (Endgame Sensor), and eBPF probes on Linux for host isolation and event collection. These deep integrations can conflict with other software that operates at similarly privileged levels: other security products with kernel drivers, eBPF-based networking tools, VPN clients that intercept DNS or network traffic, and applications that generate extreme volumes of system activity. | |
| Elastic Defend operates at multiple system levels — kernel-mode drivers for file system and network filtering on Windows, system extension and network content filtering on macOS, and eBPF probes on Linux. These deep integrations can conflict with other software that operates at similarly privileged levels: other security products with kernel drivers, eBPF-based networking tools, VPN clients that intercept DNS or network traffic, and applications that generate extreme volumes of system activity. |
| ### Browser crashes from exploit protection DLL after Windows updates (Endgame Sensor) | ||
|
|
||
| On systems running the Endgame Sensor, Chrome, Edge, and potentially other Chromium-based browsers crash within seconds of opening. The crash dump shows a heap corruption or access violation in `esensordbi.dll` (the Endgame exploit protection injection DLL), specifically in `HookManager_InstallExternalHook`. This issue is triggered by specific Windows security updates (e.g. KB5074109 on Windows 11 25H2) that change the set of DLLs loaded into browser processes. | ||
|
|
||
| The root cause is a bug in the Endgame Sensor's hook manager linked list. When the maximum number of hooks (4) for a particular function in a module is reached, the error handling path frees a `HOOK_ENTRY` that is still part of the global hook list. The next traversal of the list encounters corrupted `flink`/`blink` pointers and crashes. Windows updates that add new DLLs to browser processes increase the number of hooks installed, making it more likely to hit the limit and trigger the bug. | ||
|
|
||
| **Fixed in**: Endgame Sensor 3.65.3. | ||
|
|
||
| **Workaround** (for older sensor versions): In the Endgame policy under Threats > Exploit, uncheck Chrome and Edge to disable exploit protection injection into those browsers. Additionally, under Settings > Event Collection > Windows Event Collection, disable beta event sources. This prevents `esensordbi.dll` from being injected into the affected processes. | ||
|
|
||
| Disabling exploit protection for browsers reduces defense against browser exploitation. Keep browsers and the OS fully patched to minimize the window of exposure. |
| ### Browser crashes from exploit protection DLL after Windows updates (Endgame Sensor) | |
| On systems running the Endgame Sensor, Chrome, Edge, and potentially other Chromium-based browsers crash within seconds of opening. The crash dump shows a heap corruption or access violation in `esensordbi.dll` (the Endgame exploit protection injection DLL), specifically in `HookManager_InstallExternalHook`. This issue is triggered by specific Windows security updates (e.g. KB5074109 on Windows 11 25H2) that change the set of DLLs loaded into browser processes. | |
| The root cause is a bug in the Endgame Sensor's hook manager linked list. When the maximum number of hooks (4) for a particular function in a module is reached, the error handling path frees a `HOOK_ENTRY` that is still part of the global hook list. The next traversal of the list encounters corrupted `flink`/`blink` pointers and crashes. Windows updates that add new DLLs to browser processes increase the number of hooks installed, making it more likely to hit the limit and trigger the bug. | |
| **Fixed in**: Endgame Sensor 3.65.3. | |
| **Workaround** (for older sensor versions): In the Endgame policy under Threats > Exploit, uncheck Chrome and Edge to disable exploit protection injection into those browsers. Additionally, under Settings > Event Collection > Windows Event Collection, disable beta event sources. This prevents `esensordbi.dll` from being injected into the affected processes. | |
| Disabling exploit protection for browsers reduces defense against browser exploitation. Keep browsers and the OS fully patched to minimize the window of exposure. |
|
|
||
| To diagnose: compare DNS resolution behavior with Elastic Defend enabled vs disabled while the VPN is connected. Use `nslookup` or `dig` to test resolution of both internal (VPN-routed) and external domains. | ||
|
|
||
| To resolve: if DNS events are not essential, disable DNS event collection in the Elastic Defend policy. If DNS events are needed, add the VPN client process as a Trusted Application to stop Endpoint from monitoring its network activity. If the conflict persists, set `mac.advanced.capture.env.dns: false` in advanced policy settings to disable DNS event capture at a lower level. |
mac.advanced.capture.env.dns does not exist. I don't see any advanced option that can do this, @ricardo-estc is there one?
|
|
||
| ### Microsoft DFSR replication issues with file monitoring | ||
|
|
||
| On Windows Server systems running Distributed File System Replication (DFSR), Elastic Defend's file system filter driver can interfere with replication operations. Symptoms include replication backlogs, unexpected staging folder growth, or DFSR service errors in the event log. The issue is exacerbated when rollback self-healing (`windows.advanced.artifacts.global.rollback.self_healing.enabled`) is active, as Elastic Defend may detect and attempt to roll back legitimate DFSR staging operations that involve suspicious file patterns. |
| On Windows Server systems running Distributed File System Replication (DFSR), Elastic Defend's file system filter driver can interfere with replication operations. Symptoms include replication backlogs, unexpected staging folder growth, or DFSR service errors in the event log. The issue is exacerbated when rollback self-healing (`windows.advanced.artifacts.global.rollback.self_healing.enabled`) is active, as Elastic Defend may detect and attempt to roll back legitimate DFSR staging operations that involve suspicious file patterns. | |
| On Windows Server systems running Distributed File System Replication (DFSR), Elastic Defend's file system filter driver can interfere with replication operations. Symptoms include replication backlogs, unexpected staging folder growth, or DFSR service errors in the event log. The issue is exacerbated when rollback self-healing (`windows.advanced.alerts.rollback.self_healing.enabled`) is active, as Elastic Defend may detect and attempt to roll back legitimate DFSR staging operations that involve suspicious file patterns. |
| 1) Identify the conflicting third-party software by correlating the onset of symptoms with software installation, updates, or configuration changes. Query `logs-endpoint.events.process-*` for recently started processes that match known conflicting software. | ||
| 2) Determine which Elastic Defend subsystem is involved — kernel network driver (`windows.advanced.kernel.network`), eBPF probes (`linux.advanced.host_isolation.allowed`), exploit protection DLL injection (Endgame policy), or file system filter driver. Each has different mitigation controls. | ||
| 3) Check `metrics-endpoint.metrics-*` for CPU usage patterns and correlate with `logs-endpoint.events.process-*` for processes generating high event volumes. Use `elastic-endpoint top` on the affected host to identify which processes and event types drive resource consumption. | ||
| 4) For browser crashes on Windows, collect crash dumps from `%HOMEPATH%\AppData\Local\Google\Chrome\User Data\Crashpad\reports` or `%LOCALAPPDATA%\CrashDumps` and check for `esensordbi.dll` in the faulting module. Verify the Endgame Sensor version and check if a fix is available. |
| 4) For browser crashes on Windows, collect crash dumps from `%HOMEPATH%\AppData\Local\Google\Chrome\User Data\Crashpad\reports` or `%LOCALAPPDATA%\CrashDumps` and check for `esensordbi.dll` in the faulting module. Verify the Endgame Sensor version and check if a fix is available. | |
| 4) For browser crashes on Windows, collect crash dumps from `%HOMEPATH%\AppData\Local\Google\Chrome\User Data\Crashpad\reports` or `%LOCALAPPDATA%\CrashDumps`. |
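The process-index query from step 1 above can be sketched in ES|QL. This is an illustrative assumption, not text from the doc: the index pattern comes from the steps above, but the 24-hour window, the aggregation fields, and the idea of eyeballing the top executables (rather than matching a concrete list of known conflicting software) are mine:

```esql
FROM logs-endpoint.events.process-*
| WHERE @timestamp > NOW() - 24 hours
| STATS starts = COUNT(*) BY process.executable, host.name
| SORT starts DESC
| LIMIT 20
```

The top of this list is where recently installed security agents, VPN clients, and similar candidates tend to show up; correlate their first appearance with the onset of symptoms.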
|
|
||
| This sets the correct SELinux file context so systemd is permitted to execute the Endpoint binary. The fix persists across Endpoint upgrades as long as the `semanage` rule remains. | ||
|
|
||
| ### "No valid comms client available" in Endpoint logs |
This log message is wrong. @intxgo do you know what log is correct here?
I'll park it for tomorrow, thanks
Indeed, that message is for output only.
Endpoint -> Agent connection issue can be breaking in multiple places, I think we can point users at the 2 major logs
- ### "No valid comms client available" in Endpoint logs
+ ### "Unable to retrieve connection info from Agent" or "Agent GRPC connection failed to establish within deadline" in Endpoint logs|
|
||
| ### "No valid comms client available" in Endpoint logs | ||
|
|
||
| This message in the Endpoint state logs (`endpoint-*.log`) indicates that the Endpoint process started but cannot establish a communication channel with Elastic Agent. Common causes: |
| This message in the Endpoint state logs (`endpoint-*.log`) indicates that the Endpoint process started but cannot establish a communication channel with Elastic Agent. Common causes: | |
| This message in the Endpoint state logs (`logs-elastic_agent.endpoint_security-*`) indicates that the Endpoint process started but cannot establish a communication channel with Elastic Agent. Common causes: |
|
|
||
| - The Agent process restarted or is not yet ready to accept connections when Endpoint initializes. | ||
| - A firewall rule or security policy is blocking localhost connections on ports 6788/6789. | ||
| - The Endpoint process is aborting shortly after startup due to another issue (look for `Aborting due to signal` immediately after the comms error). |
| - The Endpoint process is aborting shortly after startup due to another issue (look for `Aborting due to signal` immediately after the comms error). | |
| - The Endpoint process is aborting shortly after startup due to another issue (look for `Aborting due to signal` immediately after the comms error). | |
| - Another system process is listening on these ports. |
|
|
||
| Elastic Agent and Endpoint communicate over TCP ports 6788 and 6789 on localhost. If another process is already bound to one of these ports, Endpoint cannot establish its communication channel with Agent and will miss check-ins. | ||
|
|
||
| To check: run `ss -tlnp | grep -E '6788|6789'` on the affected host. If a non-Elastic process is listening on either port, it must be reconfigured to use a different port, or the Endpoint communication port can be changed via advanced policy settings. |
or the Endpoint communication port can be changed via advanced policy settings.
Is that true?
old Endpoints hardcoded 6788, newer endpoints hardcode named pipe name
6789 was always given by Agent in the bootstrap message; if it can be configured on Agent, then Endpoint will use what's given in the bootstrap
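The port-conflict check from the doc (`ss -tlnp | grep -E '6788|6789'`) can be wrapped in a small helper that also filters out the Elastic processes themselves. This is a hypothetical sketch, not part of the product; the `elastic-agent`/`elastic-endpoint` process names in the filter are assumptions about what `ss` shows for healthy hosts:

```shell
#!/bin/sh
# check_endpoint_ports: report non-Elastic listeners on the Agent<->Endpoint
# localhost ports. Pass a saved `ss -tlnp` capture file for offline review,
# or pipe live output into it.
check_endpoint_ports() {
  # Keep lines bound to 6788/6789, then drop the Elastic processes themselves.
  grep -E ':(6788|6789)\b' "${1:-/dev/stdin}" \
    | grep -vE 'elastic-(agent|endpoint)' \
    && echo "conflict: a non-Elastic process holds an Endpoint port" \
    || echo "no conflicting listeners on 6788/6789"
}
```

Usage: `ss -tlnp | check_endpoint_ports`, or `check_endpoint_ports saved_ss_output.txt` when working from a diagnostics bundle.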
| 4) Check `.fleet-agents*` for the agent's `last_checkin` timestamp and current status to confirm the endpoint is actually missing check-ins vs. the entire agent being offline | ||
| 5) Run `ss -tlnp | grep -E '6788|6789'` on the host to check for port conflicts | ||
| 6) Check `dmesg` and `journalctl -k` for OOM killer events targeting the Endpoint process | ||
| 7) Check `metrics-endpoint.policy-*` for FAILED policy responses indicating the endpoint tried to start but could not apply its policy |
| 7) Check `metrics-endpoint.policy-*` for FAILED policy responses indicating the endpoint tried to start but could not apply its policy |
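The OOM check in step 6 can be scripted along these lines. A hypothetical helper, not product tooling; the grep patterns are assumptions based on typical kernel OOM-killer messages and may need tuning per distro:

```shell
#!/bin/sh
# find_endpoint_oom: scan kernel log text (output of `dmesg` or
# `journalctl -k`) for OOM-killer events that touched the Endpoint process.
# Pass a saved capture file, or pipe live output into it.
find_endpoint_oom() {
  # First narrow to OOM-killer lines, then to those naming elastic-endpoint.
  grep -iE 'out of memory|oom-kill|killed process' "${1:-/dev/stdin}" \
    | grep -i 'elastic-endpoint' \
    || echo "no OOM-killer events mentioning elastic-endpoint"
}
```

Usage: `journalctl -k | find_endpoint_oom`, or point it at a saved `dmesg` capture from the affected host.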
|
|
||
| On Windows systems using non-English locales (confirmed on Chinese GB2312 and Korean editions), a unicode conversion bug causes the Endpoint process to crash in a loop immediately after startup. The crash occurs during kernel communication initialization and produces hundreds of restarts per log file. Agent reports the Endpoint as FAILED because it never completes startup. | ||
|
|
||
| The bug is tracked in elastic/security#1973 and was fixed in Endpoint 8.13.3. Check the Endpoint logs for crash stack traces mentioning `CompletionPort.cpp` or `BigChief.cpp` during the `Connecting to minifilter` phase. |
| The bug is tracked in elastic/security#1973 and was fixed in Endpoint 8.13.3. Check the Endpoint logs for crash stack traces mentioning `CompletionPort.cpp` or `BigChief.cpp` during the `Connecting to minifilter` phase. |
| The Kafka broker rejects messages that exceed its configured `max.message.bytes` limit. The endpoint logs the error: | ||
|
|
||
| ``` | ||
| KafkaClient.cpp:561 KafkaClient failed to deliver record with unrecoverable error: Broker: Message size too large [10] | non-retriable |
Drop filename/line number since that could easily change.
| KafkaClient.cpp:561 KafkaClient failed to deliver record with unrecoverable error: Broker: Message size too large [10] | non-retriable | |
| KafkaClient failed to deliver record with unrecoverable error: Broker: Message size too large [10] | non-retriable |
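The broker-side fix can be sketched as follows. The property names are Kafka's standard limits (broker-wide `message.max.bytes`, topic-level `max.message.bytes`); the 10 MB value and the `logs-endpoint` topic name are illustrative assumptions, not values from the doc:

```properties
# server.properties (broker-wide) - raise the maximum accepted record batch
# size from the ~1 MB default. Value here (10 MB) is an illustrative choice.
message.max.bytes=10485760

# Or override per topic instead of broker-wide, e.g.:
# bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter \
#   --entity-type topics --entity-name logs-endpoint \
#   --add-config max.message.bytes=10485760
```

Raising the broker/topic limit is usually preferable to shrinking endpoint batches, since the oversized records are otherwise dropped as non-retriable.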
| Characteristic log pattern during the outage: | ||
|
|
||
| ``` | ||
| AgentContext.cpp:565 Endpoint is setting status to DEGRADED, reason: Unable to connect to output server |
| AgentContext.cpp:565 Endpoint is setting status to DEGRADED, reason: Unable to connect to output server | |
| Endpoint is setting status to DEGRADED, reason: Unable to connect to output server |
|
|
||
| ``` | ||
| AgentContext.cpp:565 Endpoint is setting status to DEGRADED, reason: Unable to connect to output server | ||
| LogstashClient.cpp:662 SSL handshake with Logstash server at [host]:[port] encountered an error |
| LogstashClient.cpp:662 SSL handshake with Logstash server at [host]:[port] encountered an error | |
| SSL handshake with Logstash server at [host]:[port] encountered an error |
| ``` | ||
| AgentContext.cpp:565 Endpoint is setting status to DEGRADED, reason: Unable to connect to output server | ||
| LogstashClient.cpp:662 SSL handshake with Logstash server at [host]:[port] encountered an error | ||
| BulkQueueConsumer.cpp:192 Logstash connection is down |
| BulkQueueConsumer.cpp:192 Logstash connection is down | |
| Logstash connection is down |
|
|
||
| After a prolonged Logstash disconnection, the endpoint accumulates unsent events in its internal queue. When connectivity is restored, the endpoint resumes sending but starts with the oldest queued events. The delivery delay equals the duration of the outage — if the output was down for one hour, events will be approximately one hour behind real time until the backlog is drained. | ||
|
|
||
| The backlog does not self-resolve faster than real-time ingestion because the endpoint sends queued events at the same rate as live events. On deployments with high event volume (e.g. 45,000+ agents behind a Logstash cluster), this delay can persist for the same duration as the outage. |
| The endpoint implements a backoff algorithm for reconnection attempts. If the backoff interval has grown large due to repeated failures, the connection may not be reattempted immediately even after connectivity is restored. Running `elastic-endpoint test output` on the affected host manually cancels the backoff and triggers an immediate reconnection attempt. The corresponding log entry is: | ||
|
|
||
| ``` | ||
| ServiceCommsFunctions.cpp:234 Canceling output backoff due to 'test output' command |
| ServiceCommsFunctions.cpp:234 Canceling output backoff due to 'test output' command | |
| Canceling output backoff due to 'test output' command |
## Summary

A Trusted Application tells Elastic Defend to stop monitoring a process entirely; it creates an intentional blind spot. This is fundamentally different from an Endpoint Alert Exception, which only suppresses alerts while continuing to monitor. Many "trusted app not working" issues stem from using the wrong artifact type, misconfiguring the entry's condition fields, or running a version affected by known event-processing bugs.
> implicitly meaning it only works on executables

This is true for standard trusted apps, but Advanced Trusted Apps let users trust non-process activity. The original version seems more correct to me.
## Common issues

### Wrong condition field or value
Replace `{{elastic-defend}}` with Elastic Defend.

We don't have a commonly used feature "Suppress protections (trust) everything matching the path: xxx". It's not obvious, but the solution for that is Endpoint Alert Exceptions (based on either `file.path MATCHES c:\whatever\*` or `process.executable MATCHES c:\whatever\*`).
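As a rough illustration of the wildcard semantics involved, here is a hypothetical helper approximating a MATCHES-style comparison (assumptions: matching is case-insensitive and `*` matches any run of characters, including path separators; this is not an Elastic API).

```python
import fnmatch

def matches(value: str, pattern: str) -> bool:
    """Approximate a case-insensitive wildcard match where `*` also
    crosses path separators. Purely illustrative, not Elastic's code."""
    return fnmatch.fnmatchcase(value.lower(), pattern.lower())

# Trusting everything under a directory via an exception condition:
print(matches(r"C:\whatever\tool.exe", r"c:\whatever\*"))       # True
print(matches(r"C:\whatever\sub\tool.exe", r"c:\whatever\*"))   # True
print(matches(r"C:\other\tool.exe", r"c:\whatever\*"))          # False
```

Under these assumptions, a single `c:\whatever\*` pattern covers every file and subdirectory beneath the path, which is why one exception entry can stand in for a "trust this whole path" feature.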
A Trusted Application entry is scoped to specific integration policies. If the entry is not assigned to the policy covering the affected endpoint, it has no effect. Similarly, a Trusted Application configured for the wrong OS will not apply.
In the Kibana Trusted Applications UI, verify the "Assignment" column shows the correct integration policy. If using "Global" assignment, confirm the affected endpoint's policy is not explicitly excluded.
> If using "Global" assignment, confirm the affected endpoint's policy is not explicitly excluded.

Is that a feature I'm not aware of?
A general question: is this going to be stack versioned? Is the current PR for the latest 9.3 stack, or what? I agree with other comments here that it doesn't make sense to train AI on symptoms of already-fixed bugs. Also, quoting "Endgame" or "sensor" really makes no sense here; that's an unrelated product, and not even supported anymore.
Change Summary
Adds Automatic Troubleshooting context docs useful for troubleshooting common endpoint issues.