Limit memory usage when take is finite #4
jonahkagan wants to merge 7 commits into ron-rivest:master
Conversation
ron-rivest
left a comment
Hi Jonah -- This is interesting, but the resulting data structure is no longer a heap, and can give unexpected (wrong) results. Suppose the heap is limited to size 2, and is initialized to [5,7]. An attempt to insert 8 will be ignored, as the heap would be too big. But we can extract 5, and then insert 9. The 8 is lost forever; the next two extracts give [7,9] when they should give [7,8]. What are the conditions for this "heap-like" data structure to behave "correctly"?
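The counterexample above can be reproduced in a few lines of Python (the `capped_push` helper is illustrative, not code from this PR -- it models a heap that silently drops inserts once it reaches capacity):

```python
import heapq

def capped_push(heap, item, cap):
    """Push onto the heap only if there is room; otherwise drop the item."""
    if len(heap) < cap:
        heapq.heappush(heap, item)

cap = 2
heap = [5, 7]
heapq.heapify(heap)

capped_push(heap, 8, cap)    # ignored: heap is already at capacity
first = heapq.heappop(heap)  # extract 5; a slot frees up
capped_push(heap, 9, cap)    # 9 is admitted, but 8 is lost forever
rest = [heapq.heappop(heap) for _ in range(len(heap))]
print(first, rest)           # 5 [7, 9] -- a true heap would yield [7, 8]
```

The cap makes insertion order matter, which a real priority queue never does; that is exactly the failure mode being flagged.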
ron-rivest
left a comment
I'm confused by the comment that "drop+2*take" insertions are enough, when sampling with replacement. Why? A single element can be extracted "take" times from the queue, if it always has the smallest key.
Hey @ron-rivest, thanks for taking the time to review and work through this with us. I think this method safely creates a heap because it only caps the heap size during the initial building of the heap, during which there are no extractions, only insertions. Once the heap is built, it behaves just like a normal min-heap and its size is no longer capped (which is good enough performance-wise for our use case). Thinking through the question about why you need … When doing capped initial building followed by normal extractions/re-insertions, I think you only need … To explain why, I believe …
Would love to hear your thoughts on this reasoning and whether it checks out. I haven't gotten to do this kind of data-structure proof-writing since college a decade ago, so I'm probably quite rusty and fully expect there to be some holes in my logic.
@ron-rivest I've updated this PR with some test cases to help show that the behavior of the sampler is not compromised by this optimization. Are there any further tests you would like to see?
Hi Jonah --
Thanks. I'll take a look at these. (May take a couple of days...)
One concern is that the routine should work right in a mode where you take a first sample of size n1, but don't get statistically significant results from that sample, so you take a second sample of size n2 (where drop is n1 now), and so on...
I'll be in touch soon...
Take care,
Ron
In our real-world usage of consistent_sampler for election audits with Arlo, we found that the sampler's memory usage grew in proportion to the number of ballots in the election, at a rate of about 300 bytes per ballot. For elections with millions of ballots, memory usage topped out at a couple of GB. That's not a big deal for a personal computer, but it's stressful for small cloud-based web servers, which are typically provisioned with little memory.
@benadida realized that you don't actually need to keep all of the ballots in memory as you build the min-heap of tickets; you only need to keep as many tickets as you want to sample, i.e. the set of tickets with the smallest numbers seen so far.
`heapq.nsmallest` implements this algorithm: it takes in an iterator and returns a list of the smallest `n` items without loading the entire iterator into memory at once.

This PR uses `heapq.nsmallest` to build the initial ticket heap in a memory-efficient way (when the desired sample size, `take`, is finite). Previously, memory usage was O(len(id_list)); with this PR it's O(take).
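A minimal sketch of the idea (the `build_ticket_heap` name and `tickets` iterator are illustrative, not the PR's actual code): when `take` is finite, feed the ticket stream through `heapq.nsmallest` so only `take` tickets are ever held in memory, then heapify the result for normal extraction.

```python
import heapq

def build_ticket_heap(tickets, take=None):
    """Build the initial min-heap of tickets.

    With a finite `take`, keep only the `take` smallest tickets (O(take)
    memory); otherwise fall back to materializing the whole stream.
    """
    if take is not None:
        smallest = heapq.nsmallest(take, tickets)  # consumes the iterator lazily
        heapq.heapify(smallest)  # nsmallest returns a sorted list; make it a heap
        return smallest
    heap = list(tickets)  # infinite take: O(len(id_list)) memory, as before
    heapq.heapify(heap)
    return heap

# Usage: a million-ticket stream, but only 3 tickets resident at the end.
heap = build_ticket_heap(iter(range(1_000_000, 0, -1)), take=3)
print([heapq.heappop(heap) for _ in range(len(heap))])  # [1, 2, 3]
```

Internally `nsmallest` keeps a bounded max-heap of candidates while scanning, which is what bounds the memory to the sample size rather than the election size.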