24 changes: 24 additions & 0 deletions tasks/manifest.yaml
@@ -110,6 +110,30 @@ categories:
- task_log_apache_critical
- task_log_apache_timeline
- task_log_syslog_boot
- task_log_nginx_status_codes
- task_log_nginx_traffic
- task_log_nginx_slow_requests
- task_log_nginx_user_agents
- task_log_nginx_errors
- task_log_ssh_failed_logins
- task_log_ssh_brute_force
- task_log_ssh_successful
- task_log_ssh_user_activity
- task_log_ssh_unusual_times
- task_log_hdfs_failures
- task_log_hdfs_connections
- task_log_hdfs_slow_ops
- task_log_hdfs_block_ops
- task_log_hdfs_storage
- task_log_mapreduce_jobs
- task_log_mapreduce_failures
- task_log_mapreduce_slow_tasks
- task_log_mapreduce_resources
- task_log_mapreduce_timeline
- task_log_syslog_anomalies
- task_log_syslog_services
- task_log_syslog_cron
- task_log_syslog_auth_failures

meeting_analysis:
- task_meeting_council_votes
146 changes: 146 additions & 0 deletions tasks/task_log_hdfs_block_ops.md
@@ -0,0 +1,146 @@
---
id: task_log_hdfs_block_ops
name: HDFS DataNode Log - Block Operations Summary
category: log_analysis
grading_type: hybrid
timeout_seconds: 180
workspace_files:
- dest: "hdfs_datanode.log"
source: "logs/hdfs_datanode.log"
---

# HDFS DataNode Log - Block Operations Summary

## Prompt

Analyze the HDFS DataNode log at `hdfs_datanode.log` and produce a comprehensive summary of all block operations. The log comes from an HDFS cluster and tracks block lifecycle events.

Your report should include:

1. **Block Inventory**: Total unique block IDs in the log, with a full list
2. **Operation Types**: For each operation type (allocateBlock, Receiving, Received, addStoredBlock, replicate, PacketResponder), count total occurrences
3. **Block Lifecycle Tracking**: For each block that has a complete lifecycle (allocate → receive → stored), document the full chain
4. **Replication Chain**: For blocks with replication events, trace the replication path across nodes
5. **Associated Jobs**: Identify the MapReduce jobs that triggered these block operations (visible in file paths)
6. **Per-Block Detail Table**: Create a table with columns: Block ID, Size (if known), Allocated Path, Nodes Involved, Replication Count

Write the report to `hdfs_block_ops_report.md` as a well-structured markdown document.

---

## Expected Behavior

The agent should parse 2000 log entries and produce:

**Block Inventory:**
- ~390 unique block IDs

**Operation Counts:**
- Receiving block: ~1149
- allocateBlock: ~385
- Received block: ~19
- addStoredBlock: ~19
- PacketResponder: ~12
- Replicate: 4

**Complete Block Lifecycles (blocks with full data):**
- blk_-1608999687919862906: 91178 bytes, allocated for job_200811092030_0001/job.jar
- blk_7503483334202473044: 233217 bytes, allocated for job_200811092030_0001/job.split
- blk_-3544583377289625738: 11971 bytes
- blk_-9073992586687739851: 11977 bytes

**Replication Chain:**
- blk_-1608999687919862906 was replicated 4 times across the cluster:
10.250.14.224 → 10.251.215.16 → 10.251.74.79 → 10.251.31.5 → 10.251.90.64

**Associated Job:**
- job_200811092030_0001 — MapReduce job, files: job.jar, job.split

Acceptable variations:
- Block ID lists may be truncated
- Not all 390 blocks need full detail — just those with complete lifecycle data
- Table format may vary
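
The operation counts and unique-block total above can be reproduced with a single pass over the file. A minimal sketch in Python, assuming the Loghub-style HDFS DataNode format (`081109 203518 35 INFO dfs.FSNamesystem: ...`); the substring patterns are illustrative, and a line may match more than one of them:

```python
# Illustrative only -- not part of the prompt or the grader.
import re
from collections import Counter

BLOCK_ID = re.compile(r"blk_-?\d+")
OP_NEEDLES = {
    "allocateBlock": "allocateBlock",
    "Receiving block": "Receiving block",
    "Received block": "Received block",
    "addStoredBlock": "addStoredBlock",
    "PacketResponder": "PacketResponder",
    "replicate": "to replicate",  # assumed wording of replication requests
}

blocks = set()
op_counts = Counter()
with open("hdfs_datanode.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        blocks.update(BLOCK_ID.findall(line))
        for op, needle in OP_NEEDLES.items():
            if needle in line:
                op_counts[op] += 1

print(f"unique blocks: {len(blocks)}")
for op, n in op_counts.most_common():
    print(f"{op}: {n}")
```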

---

## Grading Criteria

- [ ] `hdfs_block_ops_report.md` is created in the workspace
- [ ] Unique block count is provided (~390)
- [ ] Operation types are counted (receiving, allocate, replicate, etc.)
- [ ] At least one block lifecycle is fully traced (allocate → receive → stored)
- [ ] The associated MapReduce job is identified (job_200811092030_0001)

---

## Automated Checks

```python
def grade(transcript: list, workspace_path: str) -> dict:
"""Grade the HDFS block operations summary task."""
from pathlib import Path

scores = {}
workspace = Path(workspace_path)
report_file = workspace / "hdfs_block_ops_report.md"

if not report_file.exists():
return {
"output_created": 0.0,
"block_count": 0.0,
"operations_counted": 0.0,
"lifecycle_traced": 0.0,
"job_identified": 0.0,
}

scores["output_created"] = 1.0
content = report_file.read_text(encoding="utf-8").lower()

# Check 1: Block count
has_count = any(n in content for n in ["390", "~390", "385", "~385", "380", "~400"])
scores["block_count"] = (
1.0 if has_count else
0.5 if any(kw in content for kw in ["hundred", "unique block"]) else 0.0
)

# Check 2: Operations counted
op_keywords = ["receiving", "allocate", "replicate", "addstored",
"packetresponder", "received"]
ops_found = sum(1 for kw in op_keywords if kw in content)
scores["operations_counted"] = (
1.0 if ops_found >= 4 else
0.5 if ops_found >= 2 else 0.0
)

# Check 3: Block lifecycle traced
lifecycle_keywords = ["91178", "233217", "blk_-1608999687919862906",
"blk_7503483334202473044", "lifecycle", "job.jar", "job.split"]
scores["lifecycle_traced"] = (
1.0 if sum(1 for kw in lifecycle_keywords if kw in content) >= 2 else
0.5 if sum(1 for kw in lifecycle_keywords if kw in content) >= 1 else 0.0
)

# Check 4: MapReduce job identified
has_job = "job_200811092030_0001" in content or "200811092030" in content
has_mapreduce = "mapreduce" in content or "mapred" in content or "map reduce" in content
scores["job_identified"] = (
1.0 if has_job else
0.5 if has_mapreduce else 0.0
)

return scores
```
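
The harness that calls `grade` is not part of this diff; a hypothetical invocation, purely for illustration, might look like:

```python
# Hypothetical harness call -- the real transcript format is not shown here,
# so an empty list stands in for it.
scores = grade(transcript=[], workspace_path="/tmp/workspace")
final_score = sum(scores.values()) / len(scores)  # equal weights, per the notes below
print(scores, final_score)
```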

---

## Additional Notes

**Key facts from the log:**

- Most blocks only have "Receiving" and "allocateBlock" entries — the cluster was mid-operation
- Only ~19 blocks have complete lifecycle data with confirmed sizes
- The 390 block IDs represent a MapReduce job's data being distributed across the cluster
- Replication is only logged for blk_-1608999687919862906, which is replicated 4 times
- File paths show this is related to a MapReduce job: `/mnt/hadoop/mapred/system/job_200811092030_0001/`
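
The blocks with complete lifecycle data can be located by grouping entries per block ID and recording which stages each block reached. A rough sketch under the same format assumptions as the counting example above (illustrative, not the grader's method):

```python
# Illustrative sketch: which blocks show a full allocate -> receive -> stored chain?
import re
from collections import defaultdict

BLOCK_ID = re.compile(r"blk_-?\d+")
STAGES = {
    "allocate": "allocateBlock",
    "receiving": "Receiving block",
    "received": "Received block",
    "stored": "addStoredBlock",
}

lifecycle = defaultdict(set)
with open("hdfs_datanode.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        for stage, needle in STAGES.items():
            if needle in line:
                for blk in BLOCK_ID.findall(line):
                    lifecycle[blk].add(stage)

complete = [blk for blk, seen in lifecycle.items()
            if {"allocate", "receiving", "stored"} <= seen]
print(f"blocks with a full lifecycle: {len(complete)}")
for blk in sorted(complete):
    print(blk)
```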

**Grading weights (equal):** Each of the five criteria contributes 0.2 to the final score.
142 changes: 142 additions & 0 deletions tasks/task_log_hdfs_connections.md
@@ -0,0 +1,142 @@
---
id: task_log_hdfs_connections
name: HDFS DataNode Log - Connection Pattern Analysis
category: log_analysis
grading_type: hybrid
timeout_seconds: 180
workspace_files:
- dest: "hdfs_datanode.log"
source: "logs/hdfs_datanode.log"
---

# HDFS DataNode Log - Connection Pattern Analysis

## Prompt

Analyze the HDFS DataNode log at `hdfs_datanode.log` and produce a report on connection and communication patterns between nodes. The log contains entries from DataNode, FSNamesystem, and PacketResponder components.

Your report should include:

1. **Network Topology**: List all unique IP addresses that appear in the log, categorized by their role (source, destination, or both)
2. **Subnet Analysis**: Group IPs by subnet (e.g., 10.250.x.x vs 10.251.x.x). How many nodes are in each subnet?
3. **Most Active Nodes**: Top 10 IPs by frequency of appearance (as source or destination)
4. **Communication Patterns**: Which pairs of nodes communicate most frequently?
5. **DataNode vs NameSystem**: Separate the activity — what comes from DataNode operations vs FSNamesystem operations?
6. **Cluster Size Estimate**: Based on the IPs observed, estimate the cluster size

Write the report to `hdfs_connections_report.md` as a well-structured markdown document.

---

## Expected Behavior

The agent should parse 2000 log entries and produce:

**Network Topology:**
- 202 unique IP addresses observed
- IPs fall in the 10.250.x.x and 10.251.x.x ranges (private network)
- All nodes use port 50010 (HDFS DataNode data transfer port)

**Subnet Analysis:**
- 10.250.x.x subnet — contains some of the most active nodes
- 10.251.x.x subnet — contains additional DataNode cluster members
- The split suggests a multi-rack HDFS deployment

**Most Active Nodes:**
- 10.250.19.102 — extremely active (appears as source in many block transfers)
- 10.250.10.6, 10.251.215.16, 10.250.14.224 — also very active

**Component Activity:**
- DataNode$DataXceiver: Block receive operations (~1149 entries)
- FSNamesystem: Block allocation and storage tracking (400+ entries)
- DataNode$PacketResponder: Block receive confirmations with sizes

Acceptable variations:
- Exact IP counts and rankings may vary by parsing approach
- Subnet grouping granularity may differ
- Cluster size estimates will be approximate
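
The topology figures above can be approximated with a quick pass that pulls every dotted-quad out of the log. A minimal sketch, assuming addresses appear as plain `10.x.x.x[:port]` strings in the message text:

```python
# Illustrative only -- counts unique IPs, groups them by /16 subnet, and
# ranks the most frequently appearing nodes.
import re
from collections import Counter

IP_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")

ip_counts = Counter()
with open("hdfs_datanode.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        ip_counts.update(IP_RE.findall(line))

subnets = Counter(".".join(ip.split(".")[:2]) + ".x.x" for ip in ip_counts)
print(f"unique IPs: {len(ip_counts)}")
print("nodes per subnet:", dict(subnets))
print("top 10 nodes:", ip_counts.most_common(10))
```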

---

## Grading Criteria

- [ ] `hdfs_connections_report.md` is created in the workspace
- [ ] Unique IPs are listed or counted (~202)
- [ ] IPs are grouped by subnet (10.250.x.x vs 10.251.x.x)
- [ ] Most active nodes are identified
- [ ] DataNode vs FSNamesystem activity is distinguished

---

## Automated Checks

```python
def grade(transcript: list, workspace_path: str) -> dict:
"""Grade the HDFS connection pattern analysis task."""
from pathlib import Path

scores = {}
workspace = Path(workspace_path)
report_file = workspace / "hdfs_connections_report.md"

if not report_file.exists():
return {
"output_created": 0.0,
"ips_listed": 0.0,
"subnets_grouped": 0.0,
"active_nodes": 0.0,
"components_separated": 0.0,
}

scores["output_created"] = 1.0
content = report_file.read_text(encoding="utf-8").lower()

# Check 1: IPs listed/counted
has_count = any(n in content for n in ["202", "200", "~200", "over 200"])
has_ips = "10.250" in content and "10.251" in content
scores["ips_listed"] = (
1.0 if has_count and has_ips else
0.5 if has_ips else 0.0
)

# Check 2: Subnets grouped
subnet_keywords = ["subnet", "10.250", "10.251", "rack", "network segment",
"address range", "ip range"]
scores["subnets_grouped"] = (
1.0 if "10.250" in content and "10.251" in content and
sum(1 for kw in subnet_keywords if kw in content) >= 2 else
0.5 if "10.250" in content and "10.251" in content else 0.0
)

# Check 3: Active nodes identified
active_ips = ["10.250.19.102", "10.251.215.16", "10.250.14.224", "10.250.10.6"]
ips_found = sum(1 for ip in active_ips if ip in content)
scores["active_nodes"] = (
1.0 if ips_found >= 2 else
0.5 if ips_found >= 1 else 0.0
)

# Check 4: Components separated
component_keywords = ["dataxceiver", "dataxeceiver", "fsnamesystem",
"packetresponder", "namenode", "datanode"]
scores["components_separated"] = (
1.0 if sum(1 for kw in component_keywords if kw in content) >= 2 else
0.5 if sum(1 for kw in component_keywords if kw in content) >= 1 else 0.0
)

return scores
```

---

## Additional Notes

**Key facts from the log:**

- 202 unique IPs — this is a large HDFS cluster
- Two main subnets: 10.250.x.x and 10.251.x.x
- Port 50010 is used throughout — standard HDFS DataNode port
- 10.250.19.102 appears as source in a disproportionate number of entries
- The log captures a burst of activity related to job_200811092030_0001
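
The DataNode vs FSNamesystem activity split called out above can be approximated by reading the logger name out of each line. A minimal sketch, assuming lines follow the common `081109 203518 35 INFO dfs.FSNamesystem: <message>` layout (the field order is an assumption):

```python
# Illustrative sketch: tally log lines per logging component.
import re
from collections import Counter

COMPONENT_RE = re.compile(r"\b(?:INFO|WARN|ERROR)\s+(\S+?):")

components = Counter()
with open("hdfs_datanode.log", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        m = COMPONENT_RE.search(line)
        if m:
            components[m.group(1)] += 1

for name, count in components.most_common():
    print(f"{name}: {count}")
```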

**Grading weights (equal):** Each of the five criteria contributes 0.2 to the final score.