Agent Intelligence: Reflexion & Episodic Memory #51

@samugit83

Description

Implement a cross-session learning system in which the agent generates structured reflections after failures and stores complete engagement episodes. In future sessions, relevant reflections and past experiences are retrieved and injected into the agent's context, so it can avoid repeating mistakes it has already made.

Why this matters

Today, every session starts from zero. The agent has no memory of:

  • "Last time I tried CVE-2021-41773 on Apache 2.4.49, it failed because mod_cgi was disabled — I should check mod_cgi first"
  • "Nuclei template xyz always false-positives on Cloudflare targets — skip it"
  • "SSH brute force with admin:admin worked on this target last week — try that first"
  • "sqlmap with --tamper=space2comment bypassed the WAF on similar PHP apps"

A human pentester builds this institutional knowledge over years. The agent discards it after every session. Reflexion + episodic memory closes this gap.

Two components

1. Reflexion (failure-triggered)

After a tool execution fails, the agent generates a structured reflection:

  • Context: what was the situation
  • Action taken: what the agent did
  • Root cause: why it failed (not just "error occurred" but the actual reason)
  • Lesson: what to do differently next time
  • Tags: keywords for retrieval (e.g., ["apache", "CVE-2021-41773", "mod_cgi"])
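
As a rough illustration of the failure-triggered step, the sketch below builds a reflection prompt from a failed tool call and validates the model's JSON reply against the five fields listed above. The function names (`build_reflection_prompt`, `parse_reflection`), the prompt wording, and the choice of a JSON reply format are all assumptions made for this example; none of them come from the existing codebase.

```python
import json
from typing import Any, Dict

# The five fields listed in the issue.
REFLECTION_FIELDS = ("context", "action", "root_cause", "lesson", "tags")


def build_reflection_prompt(tool_name: str, tool_input: str, error_output: str) -> str:
    """Hypothetical prompt asking the LLM for a structured reflection."""
    return (
        "A tool call failed during the engagement.\n"
        f"Tool: {tool_name}\n"
        f"Input: {tool_input}\n"
        f"Output: {error_output}\n\n"
        "Reply with a single JSON object containing exactly these keys: "
        + ", ".join(REFLECTION_FIELDS)
        + ".\n"
        "root_cause must name the actual reason for the failure, "
        "not restate the error message. "
        "tags must be a list of short retrieval keywords."
    )


def parse_reflection(raw: str) -> Dict[str, Any]:
    """Parse the LLM reply and reject replies that lack required fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reflection must be a JSON object")
    missing = [key for key in REFLECTION_FIELDS if key not in data]
    if missing:
        raise ValueError(f"reflection is missing fields: {missing}")
    if not isinstance(data["tags"], list):
        raise ValueError("tags must be a list")
    # Drop any extra keys the model may have added.
    return {key: data[key] for key in REFLECTION_FIELDS}
```

Rejecting malformed replies at parse time keeps vague or partial reflections out of the store, where they would otherwise be retrieved and injected into later prompts.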

When the agent encounters a similar situation, reflections are retrieved by tag overlap and injected into the system prompt under a "LESSONS FROM PAST EXPERIENCE" heading.
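
One possible reading of tag-overlap retrieval is sketched below: score each stored reflection by the number of tags it shares with the current situation, keep the best matches, and format them for the prompt. A stdlib `dataclass` stands in for the proposed Pydantic `ReflectionEntry` so the example runs without dependencies. The helper names, the case-insensitive matching, and the `top_k` default are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class ReflectionEntry:
    """Stand-in for the proposed Pydantic model; same five fields."""
    context: str
    action: str
    root_cause: str
    lesson: str
    tags: List[str] = field(default_factory=list)


def tag_overlap(entry: ReflectionEntry, query_tags: Iterable[str]) -> int:
    """Count tags shared by a stored reflection and the current situation."""
    stored = {tag.lower() for tag in entry.tags}
    wanted = {tag.lower() for tag in query_tags}
    return len(stored & wanted)


def retrieve_reflections(
    entries: Iterable[ReflectionEntry],
    query_tags: Iterable[str],
    top_k: int = 3,
) -> List[ReflectionEntry]:
    """Return up to top_k reflections sharing at least one tag, best first."""
    query = list(query_tags)
    scored = [(tag_overlap(entry, query), entry) for entry in entries]
    matches = [pair for pair in scored if pair[0] > 0]
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in matches[:top_k]]


def format_lessons(entries: List[ReflectionEntry]) -> str:
    """Render retrieved reflections as a block for the system prompt."""
    if not entries:
        return ""
    lines = ["LESSONS FROM PAST EXPERIENCE"]
    lines += [f"- {e.lesson} (root cause: {e.root_cause})" for e in entries]
    return "\n".join(lines)
```

Raw overlap counts favor reflections with many tags. If that turns out to skew results, the count could be normalized, for example by dividing by the size of the tag union.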

2. Episodic Memory (session-level)

After each engagement, store a complete episode:

  • Target technology stack
  • Strategy used (attack path type)
  • Tool sequence (tools in the order they were run)
  • Outcome (success/partial/failure)
  • Key findings
  • Duration (iterations)

Episodes are retrieved by technology stack similarity — "I've seen Apache 2.4 + MySQL + WordPress before, here's what worked."
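
The issue does not say how technology stack similarity should be measured. As one simple option, the sketch below uses Jaccard similarity over normalized technology names. The `Episode` dataclass mirrors the fields listed above, and the function names, `top_k`, and the `min_score` threshold are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set, Tuple


@dataclass
class Episode:
    """One completed engagement; fields follow the list in the issue."""
    tech_stack: Set[str]
    strategy: str
    tools: List[str]
    outcome: str  # "success", "partial", or "failure"
    key_findings: List[str] = field(default_factory=list)
    iterations: int = 0


def normalize(stack: Iterable[str]) -> Set[str]:
    """Lowercase and trim names so 'MySQL' and 'mysql ' compare equal."""
    return {name.strip().lower() for name in stack}


def stack_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity: shared technologies divided by all technologies."""
    left, right = normalize(a), normalize(b)
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def retrieve_episodes(
    episodes: Iterable[Episode],
    current_stack: Iterable[str],
    top_k: int = 3,
    min_score: float = 0.3,
) -> List[Tuple[float, Episode]]:
    """Return (score, episode) pairs above min_score, most similar first."""
    current = list(current_stack)
    scored = [(stack_similarity(ep.tech_stack, current), ep) for ep in episodes]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:top_k]
```

Exact string matching treats "Apache 2.4" and "Apache 2.4.49" as different technologies, so a real implementation would probably split product name from version before comparing.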

What already exists

  • Neo4j graph database (can store episodes as nodes with technology relationships)
  • AgentState.execution_trace captures every tool call and result
  • AgentState.target_info accumulates technology/service data
  • Basic failure detection (3 consecutive failures)
  • Tavily web search for external knowledge

What needs to be built

  • ReflectionEntry Pydantic model (context, action, root_cause, lesson, tags)
  • ReflectionMemory class with generate/store/retrieve methods
  • Reflection generation prompt (triggered after tool failure)
  • Tag-based retrieval with injection into think node system prompt
  • Episode model for complete engagement records
  • EpisodicMemory class with Neo4j storage (leverages existing graph infra)
  • Technology-based similarity retrieval for past episodes
  • Integration into think node: retrieve relevant reflections + episodes before each LLM call
  • Storage backend: Neo4j for episodes, JSON or PostgreSQL for reflections
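
To make the storage split concrete, here is a minimal sketch assuming the JSON option for reflections and a parameterized Cypher statement for episodes. The file layout, the `Episode` and `Technology` node labels, the `TARGETED` relationship, and all function names are assumptions for illustration; the existing graph schema may differ. The Cypher text is only built here, not executed.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_reflections(path: Path) -> List[Dict[str, Any]]:
    """Read all stored reflections; an absent file means no reflections yet."""
    if not path.exists():
        return []
    return json.loads(path.read_text(encoding="utf-8"))


def append_reflection(path: Path, reflection: Dict[str, Any]) -> None:
    """Append one reflection and rewrite the JSON file."""
    entries = load_reflections(path)
    entries.append(reflection)
    path.write_text(json.dumps(entries, indent=2), encoding="utf-8")


# Hypothetical schema: one Episode node linked to one Technology node per
# item in the target's stack. MERGE keeps repeated writes idempotent.
EPISODE_UPSERT_CYPHER = """
MERGE (e:Episode {id: $id})
SET e.strategy = $strategy,
    e.outcome = $outcome,
    e.iterations = $iterations
WITH e
UNWIND $technologies AS tech
MERGE (t:Technology {name: tech})
MERGE (e)-[:TARGETED]->(t)
"""


def episode_params(
    episode_id: str,
    strategy: str,
    outcome: str,
    iterations: int,
    technologies: List[str],
) -> Dict[str, Any]:
    """Build the parameter map for EPISODE_UPSERT_CYPHER."""
    return {
        "id": episode_id,
        "strategy": strategy,
        "outcome": outcome,
        "iterations": iterations,
        "technologies": [name.strip().lower() for name in technologies],
    }
```

With the official Neo4j Python driver, the statement and its parameters would typically be passed to `session.run(EPISODE_UPSERT_CYPHER, params)`. Rewriting the whole JSON file on each append is adequate for small stores, which is one reason the issue also lists PostgreSQL as an alternative.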

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), help wanted (Extra attention is needed)

    Projects

    Status

    Up for grabs
