
Add plaintext log detection via timestamp and log-level pattern density analysis #158

Open
hemantkumar15438 wants to merge 1 commit into openrelik:main from hemantkumar15438:feature/log-discovery-release


@hemantkumar15438


Overview

This Pull Request introduces an analysis task to openrelik-worker-analyzer-logs that detects unparsed, rotated, or extensionless plaintext log files by measuring the density of timestamps and log-level markers within the file content.

Rather than relying on file extensions or paths, the engine evaluates the raw text stream to calculate a pattern-to-text ratio. To scan raw storage inputs, the worker automatically handles system-level block device mounting to inspect inner filesystems recursively.

Technical Implementation & Mechanics

  1. Timestamp and Log-Level Ratio Math: The engine samples up to the first 500 lines of a target file and tracks lines matching specific structural logging signatures:

    • Timestamps: ISO 8601, RFC 3164/5424 Syslog, and standard date/time string variants.
    • Log Levels: Standard severity markers (e.g., [INFO], ERROR:, WARN, DEBUG).

    The evaluation metric is calculated using a strict ratio:
    $$\text{Density} = \frac{\text{Lines Matching Patterns}}{\text{Total Lines Evaluated}}$$
    Files meeting or exceeding the user-defined threshold (default: 0.15, i.e. 15% of the evaluated lines) are flagged in the output report.

  2. Block Device Partition Traversal: When processing raw disk images (.dd, .raw, .e01), the task routes execution through OpenRelik’s BlockDevice infrastructure. The worker handles system-level loop device attachment via losetup, maps the partition tables, mounts the underlying filesystems dynamically, and passes the inner file paths directly to the density analysis loop.
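To make the density metric from item 1 concrete, here is a minimal sketch of how it could be computed. The regular expressions, the function names (`pattern_density`, `looks_like_log`), and the choice to count a line when it matches either a timestamp or a log level are assumptions for illustration; the PR's actual patterns and matching rule may differ.

```python
import re
from itertools import islice
from typing import Iterable

# Hypothetical patterns for illustration only; not the PR's actual regexes.
TIMESTAMP_RE = re.compile(
    r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}"       # ISO 8601-style date and time
    r"|\b[A-Z][a-z]{2} [ \d]\d \d{2}:\d{2}:\d{2}"   # RFC 3164 syslog, e.g. "Jan  5 10:00:00"
)
LEVEL_RE = re.compile(r"\b(?:DEBUG|INFO|WARN(?:ING)?|ERROR|CRITICAL|FATAL)\b")

MAX_LINES = 500           # sampling cap described in the PR
DEFAULT_THRESHOLD = 0.15  # default threshold described in the PR


def pattern_density(lines: Iterable[str], max_lines: int = MAX_LINES) -> float:
    """Return matching lines / evaluated lines over at most max_lines lines."""
    total = 0
    matched = 0
    for line in islice(lines, max_lines):
        total += 1
        # Assumption: a line counts once if it has a timestamp OR a log level.
        if TIMESTAMP_RE.search(line) or LEVEL_RE.search(line):
            matched += 1
    # An empty input has no evidence of being a log, so report zero density.
    return matched / total if total else 0.0


def looks_like_log(lines: Iterable[str], threshold: float = DEFAULT_THRESHOLD) -> bool:
    """Flag the input when its density meets or exceeds the threshold."""
    return pattern_density(lines) >= threshold
```

As a worked example under these assumptions, a 20-line sample with 3 matching lines has a density of 3/20 = 0.15, which meets the default threshold and would be flagged; 2 matching lines out of 20 (0.10) would not.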

Architectural Constraints & Safeguards

  • Memory Boundary Management: Line sampling is hard-capped at 500 lines per file, so memory use stays bounded even on very large files.
  • Binary Stream Checks: Implements an early header check for null-bytes (\x00). If detected within the initial block read, the stream is immediately classified as a binary object (compiled executable, media archive, database) and skipped to prevent unnecessary regex processing.
  • Nested Mount Prevention: The file tree traversal explicitly skips files with storage image extensions (.dd, .img, etc.) found inside an active filesystem mount, which avoids recursive loop device allocations and kernel lockups.
  • Guaranteed Unmount: The block-device mapping layer is wrapped entirely in a try...finally block. BlockDevice.umount() therefore runs even when processing raises an exception, which prevents unreleased loop devices and leaked mount points on the host.
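The safeguards above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the extension set, the 8 KiB header size, and the helper names are hypothetical, and `device` is a duck-typed stand-in that is only assumed to expose `mount()` (returning a list of mount points) and `umount()`.

```python
import os
from typing import Callable, Iterator

# Assumed values; the PR only says ".dd, .img, etc." and "initial block read".
IMAGE_EXTENSIONS = {".dd", ".img", ".raw", ".e01"}
HEADER_BYTES = 8192


def is_probably_binary(path: str, block_size: int = HEADER_BYTES) -> bool:
    """Treat a file as binary if its first block contains a null byte."""
    try:
        with open(path, "rb") as fh:
            return b"\x00" in fh.read(block_size)
    except OSError:
        # Unreadable files are skipped in this sketch rather than analyzed.
        return True


def candidate_files(root: str) -> Iterator[str]:
    """Yield text-like regular files under root, skipping nested disk images."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Nested mount prevention: never descend into images found in a mount.
            if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not is_probably_binary(path):
                yield path


def scan_mounted_image(device, analyze: Callable[[str], None]) -> None:
    """Mount, analyze every candidate file, and always unmount.

    `device` is a hypothetical object with mount() -> list of mount points
    and umount(); it stands in for the block-device layer named in the PR.
    """
    try:
        for mountpoint in device.mount():
            for path in candidate_files(mountpoint):
                analyze(path)
    finally:
        # Runs on success and on any exception, so mounts are not leaked.
        device.umount()
```

Placing `device.mount()` inside the `try` is a design choice in this sketch: if mounting fails partway through, the cleanup call still runs and can release whatever was already attached.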

Verification & Testing

  • Unit Validation: Verified regex parsing accuracy and threshold triggers directly against standard text streams and extensionless UNIX log samples.
  • Integration Validation: Ran end-to-end integration tests successfully in a local containerized deployment against raw .dd targets, verifying host kernel module utilization (nbd), loop device mapping, filesystem tree traversal, and final artifact generation.

@hacktobeer

Contributor

@hemantkumar15438 Is this PR ready for review?
