⚡ Optimize memory chain links persistence utilizing Graph DB UNWIND in batches#128
⚡ Optimize memory chain links persistence utilizing Graph DB UNWIND in batches#128wjohns989 wants to merge 1 commit into
Conversation
This addresses the N+1 queries issue inside the `_upsert_memory_chain_links` method in `muninn/core/memory.py` where calling `self._graph.add_chain_link` sequentially within a loop slowed down execution. Introduced a new method `add_chain_links_batch` inside `muninn/store/graph_store.py` to segregate data by their relation type and perform parameterized graph insertions utilizing Kuzu DB's `UNWIND` feature. Updated the loop to call the batch insertion directly, resulting in ~15.6x speedup (1.22s sequentially -> 0.078s batched for 500 records). Co-authored-by: wjohns989 <56205870+wjohns989@users.noreply.github.com>
|
👋 Jules, reporting for duty! I'm here to lend a hand with this pull request. When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down. I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job! For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with New to Jules? Learn more at jules.google/docs. For security, I will only act on instructions from the user who triggered this task. |
There was a problem hiding this comment.
Code Review
This pull request optimizes the persistence of memory-chain links by introducing a batch processing method, add_chain_links_batch, in the graph store. This change replaces individual relationship creation calls in muninn/core/memory.py with a more efficient Cypher UNWIND operation, and the test suite has been updated accordingly. The PR also includes extensive import reorganization and code formatting for better consistency. Feedback was provided regarding the accuracy of the link persistence count in the new batch method, suggesting the use of RETURN count(*) to ensure the count reflects only successfully created edges where both memory nodes were found.
| conn.execute( | ||
| f"UNWIND $batch AS link " | ||
| f"MATCH (a:Memory {{id: link.pred}}), (b:Memory {{id: link.succ}}) " | ||
| f"CREATE (a)-[:{rel} {{confidence: link.conf, reason: link.reason, " | ||
| f"shared_entities_json: link.shared, hours_apart: link.hours, created_at: link.now}}]->(b)", | ||
| {"batch": batch}, | ||
| ) | ||
| persisted += len(batch) |
There was a problem hiding this comment.
The persisted count currently reflects the total number of links in the batch regardless of whether the MATCH clause actually found the corresponding Memory nodes. To ensure the returned count accurately represents the number of edges successfully created in the graph, consider using RETURN count(*) at the end of the Cypher query and retrieving the actual count from the result set. This aligns with the SOTA+ philosophy of precision in core logic.
| conn.execute( | |
| f"UNWIND $batch AS link " | |
| f"MATCH (a:Memory {{id: link.pred}}), (b:Memory {{id: link.succ}}) " | |
| f"CREATE (a)-[:{rel} {{confidence: link.conf, reason: link.reason, " | |
| f"shared_entities_json: link.shared, hours_apart: link.hours, created_at: link.now}}]->(b)", | |
| {"batch": batch}, | |
| ) | |
| persisted += len(batch) | |
| res = conn.execute( | |
| f"UNWIND $batch AS link " | |
| f"MATCH (a:Memory {{id: link.pred}}), (b:Memory {{id: link.succ}}) " | |
| f"CREATE (a)-[:{rel} {{confidence: link.conf, reason: link.reason, " | |
| f"shared_entities_json: link.shared, hours_apart: link.hours, created_at: link.now}}]->(b) " | |
| f"RETURN count(*)", | |
| {"batch": batch}, | |
| ) | |
| if res.has_next(): | |
| persisted += res.get_next()[0] |
References
- Every contribution must meet SOTA+ (State of the Art Plus) standards of precision, quality, and logic. (link)
💡 What:
Replaced the sequential loop structure calling
self._graph.add_chain_linkinside_upsert_memory_chain_linkswithinmuninn/core/memory.pyby implementing a newadd_chain_links_batchmethod inmuninn/store/graph_store.py. This groups the edges based on relation types (PRECEDESorCAUSES) and invokes a single parameterizedUNWINDquery against Kuzu DB for batch edge insertion.🎯 Why:
Previously, there were N queries made to insert links. Moving the processing directly onto the graph's
UNWINDcapabilities remedies the N+1 queries issue, providing better usage of queries, connections and lowering DB execution overhead.📊 Measured Improvement:
Using an in-memory testing set (100 nodes, 500 links sequential sequential inserts), execution took 1.2253s.
When running via the new
add_chain_links_batchutilizing theUNWINDlogic, the time was reduced down to 0.0781s (~15.6x performance uplift).PR created automatically by Jules for task 757581501580723881 started by @wjohns989