
Limit memory usage when take is finite #4

Open

jonahkagan wants to merge 7 commits into ron-rivest:master from votingworks:cap-heap-size

Conversation


@jonahkagan jonahkagan commented Feb 2, 2021

In our real-world usage of consistent_sampler for election audits with Arlo, we found that the sampler's memory usage grew proportionally to the number of ballots in the election, at a rate of about 300 bytes per ballot. For elections with millions of ballots, memory usage topped out at a couple of GB. That's not a big deal for a personal computer, but it is stressful for small cloud-based web servers, which are typically provisioned with only modest amounts of memory.

@benadida realized that you don't actually need to keep all of the ballots in memory as you build the min-heap of tickets; you only need to keep as many tickets as you want to sample, i.e. the set of tickets with the smallest numbers seen so far. heapq.nsmallest implements this algorithm: it takes an iterable and returns a list of the n smallest items without loading the entire iterable into memory at once.
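For illustration, heapq.nsmallest can consume a generator of (ticket, ballot) pairs while buffering only a bounded number of items at a time; the ticket values below are made up for the sketch:

```python
import heapq
import random

def tickets(n_ballots, seed=42):
    """Yield made-up (ticket_number, ballot_id) pairs one at a time,
    so the full list never has to exist in memory at once."""
    rng = random.Random(seed)
    for ballot_id in range(n_ballots):
        yield (rng.random(), ballot_id)

# Keep only the 5 smallest tickets out of a million; nsmallest
# maintains an internal buffer of bounded size while it consumes
# the iterator, rather than materializing all million pairs.
sample = heapq.nsmallest(5, tickets(1_000_000))
```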

This PR uses heapq.nsmallest to build the initial ticket heap in a memory-efficient way when the desired sample size (take) is finite. Previously, memory usage was O(len(id_list)); with this PR it's O(take).
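A minimal sketch of the idea (the function name build_ticket_heap is hypothetical, not the PR's actual code):

```python
import heapq

def build_ticket_heap(tickets, take=None):
    """Hypothetical sketch of the optimization: when take is finite,
    keep only the take smallest tickets; otherwise heapify everything."""
    if take is not None:
        # nsmallest returns a list sorted in ascending order, which
        # already satisfies the min-heap invariant.
        return heapq.nsmallest(take, tickets)
    heap = list(tickets)
    heapq.heapify(heap)
    return heap
```

Since an ascending sorted list is a valid min-heap, the result of nsmallest can be handed directly to heappush/heappop without a separate heapify step.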

@jonahkagan jonahkagan changed the title Add capped_heap_push to limit memory usage Limit memory usage when take is finite Feb 2, 2021
@jonahkagan jonahkagan marked this pull request as ready for review February 2, 2021 20:58
Owner

@ron-rivest ron-rivest left a comment


Hi Jonah -- This is interesting, but the resulting data structure is no longer a heap, and it can give unexpected (wrong) results. Suppose the heap is limited to size 2 and is initialized to [5, 7]. An attempt to insert 8 will be ignored, as the heap would be too big. But we can extract 5 and then insert 9. The 8 is lost forever; the next two extracts give [7, 9] when they should give [7, 8]. What are the conditions for this "heap-like" data structure to behave "correctly"?
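The failure mode described above can be reproduced with a naive capped push (capped_heappush is a hypothetical helper written for this demonstration, not code from the PR):

```python
import heapq

def capped_heappush(heap, item, cap):
    # Naive cap: silently ignore the item once the heap is full.
    # This is the behavior the comment above warns about.
    if len(heap) < cap:
        heapq.heappush(heap, item)

heap = [5, 7]                      # heap capped at size 2
capped_heappush(heap, 8, cap=2)    # 8 is dropped: the heap is full
first = heapq.heappop(heap)        # extract 5; the heap shrinks to [7]
capped_heappush(heap, 9, cap=2)    # 9 is inserted; 8 is lost forever
# The next two extractions now give 7 and 9 instead of 7 and 8.
```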

Owner

@ron-rivest ron-rivest left a comment


I'm confused by the comment that "drop+2*take" insertions are enough when sampling with replacement. Why? A single element can be extracted "take" times from the queue if it always has the smallest key.

@jonahkagan
Author

Hey @ron-rivest, thanks for taking the time to review and work through this with us.

I think this method safely creates a heap because it only caps the heap size during the initial building of the heap, during which there are no extractions, only insertions. Once the heap is built, it behaves just like a normal min-heap and its size is no longer capped (which is good enough performance-wise for our use case).

Thinking through the question about why you need drop+2*take insertions into the initial heap, I think I got that wrong. I arrived at that figure when I thought we were going to be doing capped insertions after the initial building of the heap, which I later changed to normal heap insertions, as outlined above.

When doing capped initial building followed by normal extractions/re-insertions, I think you only need drop+take insertions into the initial heap.

To explain why I believe drop+take insertions are the safe amount for sampling with replacement, consider two different cases:

  1. First, consider the case you gave, where a single element min has the smallest value for every extraction and reinsertion. First we do drop extractions. Each extraction yields min and then reinserts min into the heap. Then we do take extractions, with the same result. In this case, we actually only needed an initial heap of size 1, since min would have been on top of the heap every time we reinserted it, no matter what other elements were in the initial heap.
  2. Next, consider the "opposite" case, where every element we extract ends up at the bottom of the heap when reinserted. Initialize the heap, and let max be the bottom element of the heap (the one with the largest key). Perform drop extractions. Each extraction yields some element with a key smaller than max and reinserts that element with a key greater than max. Now perform take-1 extractions. Again, each extraction yields some element with a key smaller than max and reinserts it with a key greater than max. max should now be at the top of the heap, and the final extraction should yield max. For max not to have been cut from the initial heap, the initial heap must have accommodated the drop+take-1 elements with keys less than max, plus max itself -- drop+take elements in total.
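One way to stress this argument is a small simulation comparing a size-capped initial build against an uncapped one. The functions below are illustrative stand-ins, not Arlo or consistent_sampler code, and next_ticket is a deterministic placeholder for drawing a fresh, larger ticket number:

```python
import heapq
import random

def next_ticket(t):
    # Deterministic stand-in for drawing a new ticket number
    # strictly greater than the old one (keys stay in (0, 1)).
    return t + (1.0 - t) * 0.5

def sample_with_replacement(tickets, drop, take, cap=None):
    # An ascending sorted list is a valid min-heap, so both branches
    # produce something heappop/heappush can operate on directly.
    heap = heapq.nsmallest(cap, tickets) if cap else sorted(tickets)
    out = []
    for i in range(drop + take):
        key, ballot = heapq.heappop(heap)
        if i >= drop:
            out.append(ballot)
        # Sampling with replacement: reinsert with a larger key.
        heapq.heappush(heap, (next_ticket(key), ballot))
    return out

rng = random.Random(1)
tickets = [(rng.random(), i) for i in range(1000)]
drop, take = 3, 5
full = sample_with_replacement(tickets, drop, take)
capped = sample_with_replacement(tickets, drop, take, cap=drop + take)
# If the reasoning above is right, the two runs agree.
```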

Would love to hear your thoughts on this reasoning and if it checks out. I haven't gotten to do this kind of data structure proof-writing since college a decade ago, so I'm probably quite rusty and fully expect there to be some holes in my logic.

@jonahkagan
Author

@ron-rivest I've updated this PR with some test cases to help show that the behavior of the sampler is not compromised by this optimization. Are there any further tests you would like to see?

@ron-rivest
Owner

ron-rivest commented Feb 17, 2021 via email
