
feat: insertBatch — sorted bulk insertion with single-pass leaf merge and multi-leaf overflow #11

Draft
Copilot wants to merge 13 commits into main from copilot/vscode-mmjtxf5f-shsn

Conversation


Copilot AI commented Mar 9, 2026

Calling insert N times is O(N · m) in leaf shifts and O(N log N) in tree traversals. For batches of ~100 entries, especially when several land in the same leaf, this is wasteful. This PR adds insertBatch that sorts the batch once, groups entries by target leaf, and processes each group in a single O(m+n) merge.

New API

// Base module
public func insertBatch<K, V>(btree, btree_utils, entries : [(K, V)]) : [?V]

// MemoryBTree class
btree.insertBatch(entries) : [?V]

Returns previous values in input order (null = key was new). For duplicate keys within the batch, the last occurrence wins.
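
To illustrate the return semantics, a minimal usage sketch against the class API (the key/value types and the surrounding setup are assumptions for illustration, not taken from this PR):

```motoko
// Assuming `btree` is an existing MemoryBTree with Nat keys and Text
// values (construction elided). One result slot per input entry, in
// input order; `null` marks a key that did not exist before the call.
let prev = btree.insertBatch([(10, "a"), (20, "b"), (10, "c")]);
// For the duplicate key 10, the last occurrence ("c") is the value
// that ends up in the tree.
```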

Algorithm

  • Sort + group: entries are sorted by key blob (O(n log n)); get_right_separator_of_leaf locates each leaf boundary by walking the parent chain once per leaf transition instead of re-traversing from the root.
  • Single-pass merge per leaf: two-pointer merge of existing leaf entries and the sorted batch slice — one bulk write replaces N individual shift operations.
  • Multi-leaf overflow: when the merged count exceeds node_capacity, all required leaves (⌈merged/capacity⌉) are allocated, filled, linked into the doubly-linked leaf chain, and propagated to parent branches in one bottom-up loop — no cascading single-split side effects.
  • Subtree size correctness: after all structural changes, recompute_subtree_sizes_on_path walks every affected leaf's path to the root and recomputes each branch's subtree_size as the sum of its immediate children. This handles branch splits that redistribute leaves into different subtrees without double-counting replacements.
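
The single-pass merge in the second step can be sketched as a standard two-pointer merge over key blobs. This is an illustrative sketch, not the PR's implementation; in the real code the leaf entries live in stable memory and the merge writes back in bulk rather than building an array:

```motoko
import Blob "mo:base/Blob";
import Buffer "mo:base/Buffer";

// Sketch: merge a leaf's sorted entries with a sorted batch slice in
// one O(m + n) pass. On equal keys the batch entry replaces the leaf
// entry, matching the "last occurrence wins" rule.
func merge_sorted(leaf : [(Blob, Blob)], batch : [(Blob, Blob)]) : [(Blob, Blob)] {
    let merged = Buffer.Buffer<(Blob, Blob)>(leaf.size() + batch.size());
    var i = 0;
    var j = 0;
    while (i < leaf.size() and j < batch.size()) {
        switch (Blob.compare(leaf[i].0, batch[j].0)) {
            case (#less) { merged.add(leaf[i]); i += 1 };
            case (#greater) { merged.add(batch[j]); j += 1 };
            case (#equal) { merged.add(batch[j]); i += 1; j += 1 }; // replacement
        };
    };
    while (i < leaf.size()) { merged.add(leaf[i]); i += 1 };
    while (j < batch.size()) { merged.add(batch[j]); j += 1 };
    Buffer.toArray(merged);
};
```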

Refactoring

The split-propagation loop inside insert (~40 lines) is extracted into propagate_node_split, now shared by both insert and insertBatch.

New private helpers

| Helper | Purpose |
| --- | --- |
| `propagate_node_split` | Propagates a leaf/branch split up the parent chain; creates a new root if needed |
| `get_right_separator_of_leaf` | Returns the branch key that upper-bounds a leaf (walks the parent chain) |
| `recompute_subtree_sizes_on_path` | Recomputes subtree sizes bottom-up from a leaf to the root |
| `bulk_insert_into_leaf` | Core merge + overflow logic for a single leaf group |
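
To make the overflow arithmetic behind `bulk_insert_into_leaf` concrete, here is a hedged sketch of the leaf-count computation and one plausible even distribution of entries; the PR's actual fill strategy may differ:

```motoko
import Array "mo:base/Array";

// Sketch: when a merged group of `merged` entries exceeds `capacity`,
// ⌈merged / capacity⌉ leaves are required. Spreading entries as evenly
// as possible keeps every new leaf at or under capacity.
func distribute(merged : Nat, capacity : Nat) : [Nat] {
    let n = (merged + capacity - 1) / capacity; // ⌈merged / capacity⌉
    Array.tabulate<Nat>(n, func(i : Nat) : Nat {
        let base = merged / n;
        if (i < merged % n) { base + 1 } else { base };
    });
};
```

For example, 10 merged entries with capacity 4 need 3 leaves, filled as [4, 3, 3] rather than [4, 4, 2], which avoids leaving a nearly empty trailing leaf.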

Tests

7 new test cases run under all existing node-capacity / merge-threshold / tail-compression configurations:

  • empty batch, all-new keys, all-replacements, mixed new+replacement
  • leaf overflow triggering multi-leaf creation
  • batch-internal duplicate keys
  • reference comparison: results match sequential insert output
  • large 500-entry batch spanning many leaves

Warning

Firewall rules blocked me from connecting to one or more addresses

I tried to connect to the following addresses, but was blocked by firewall rules:

  • icp-api.io
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node node /home/REDACTED/work/_temp/ghcca-node/node/bin/mops install (dns block)
    • Triggering command: /home/REDACTED/work/_temp/ghcca-node/node/bin/node node /home/REDACTED/work/_temp/ghcca-node/node/bin/mops test (dns block)


Original prompt

looking at the base code, can we implement something like batch inserts where, for about 100 entries, we sort them and insert them into the tree, caching the branch keys we've visited along the way so we don't have to read them from stable memory, and inserting them in sorted order such that if multiple entries belong in a single leaf we insert them at once, needing only one shift. In some cases these might require creating multiple leaves from the bottom up instead of a single leaf split, which could cause cascading effects upwards. As we do this we also have to carefully update the subtree size, keeping track of entries that already exist and are replacements, in order to not increment the subtree size for those entries.
Create a plan. Map out edge cases and testing strategies, and see if it would improve performance.

Created from VS Code.



…servative and balance

- Balanced - check if the node's items are below the threshold and if there are any neighbors that have enough space to merge

- Conservative - checks if the current and neighboring nodes items are both below the merge threshold, otherwise no merge occurs
… instead of referencing the key stored in the leaf
…eshold

- upgrade memory btree stat to v1_0_0 as the new version is no longer compatible with the previous state as branches now store their keys in independent memory blocks instead of referencing the keys stored in the leaf nodes
…provements.

 - Fix edge case in the remove() fn causing a double-free memory-block error on the last separator key in a branch when index = - and size <= 2
 - Update the clear() method to retain the header of each memory-region used by calling the new library's function: clearAndRetainHeader()
Copilot AI and others added 2 commits March 10, 2026 00:17
Co-authored-by: tomijaga <50385769+tomijaga@users.noreply.github.com>
…inal index

Co-authored-by: tomijaga <50385769+tomijaga@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement batch inserts with sorted order and subtree size management" to "feat: insertBatch — sorted bulk insertion with single-pass leaf merge and multi-leaf overflow" Mar 10, 2026
@tomijaga (Member) commented:

@copilot why does it say 30 files were changed?

Copilot AI (Author) commented Mar 10, 2026

> @copilot why does it say 30 files were changed?

The 30 changed files are not from my commits. My two commits (4317cd9 and 9bab59c) only modified 3 files:

  • src/MemoryBTree/Base.mo
  • src/MemoryBTree/lib.mo
  • tests/MemoryBTree/MemoryBTree.Test.mo

The remaining ~27 files were already modified in the 11 pre-existing commits that were on this branch before my work started (commits b539592 through 00a9666 — dev-branch changes like tail compression, branch key memory-block refactoring, MemoryBuffer/MemoryQueue updates, etc.). The high file count in the PR is because those pre-existing commits are all included in the PR diff relative to main.
