Limit memory usage when take is finite #4
jonahkagan wants to merge 7 commits into ron-rivest:master
Conversation
ron-rivest
left a comment
Hi Jonah -- This is interesting, but the resulting data structure is no longer a heap, and can give unexpected (wrong) results. Suppose the heap is limited to size 2, and is initialized to [5,7]. An attempt to insert 8 will be ignored, as the heap would be too big. But we can extract 5, and then insert 9. The 8 is lost forever; the next two extracts give [7,9] when they should give [7,8]. What are the conditions for this "heap-like" data structure to behave "correctly"?
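The counterexample above can be reproduced in a few lines of Python (the `capped_push` helper is illustrative, not code from this PR -- it models a heap that silently drops inserts once it reaches capacity):

```python
import heapq

def capped_push(heap, item, cap):
    """Push onto the heap only if there is room; otherwise drop the item."""
    if len(heap) < cap:
        heapq.heappush(heap, item)

cap = 2
heap = [5, 7]
heapq.heapify(heap)

capped_push(heap, 8, cap)    # ignored: heap is already at capacity
first = heapq.heappop(heap)  # extract 5; a slot frees up
capped_push(heap, 9, cap)    # 9 is admitted, but 8 is lost forever
rest = [heapq.heappop(heap) for _ in range(len(heap))]
print(first, rest)           # 5 [7, 9] -- a true heap would yield [7, 8]
```

The cap makes insertion order matter, which a real priority queue never does; that is exactly the failure mode being flagged.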
ron-rivest
left a comment
I'm confused by the comment that "drop+2*take" insertions are enough, when sampling with replacement. Why? A single element can be extracted "take" times from the queue, if it always has the smallest key.
Hey @ron-rivest, thanks for taking the time to review and work through this with us. I think this method safely creates a heap because it only caps the heap size during the initial building of the heap, during which there are no extractions, only insertions. Once the heap is built, it behaves just like a normal min-heap and its size is no longer capped (which is good enough performance-wise for our use case). Thinking through the question about why you need … When doing capped initial building followed by normal extractions/re-insertions, I think you only need … To explain why, I believe …
Would love to hear your thoughts on this reasoning and whether it checks out. I haven't gotten to do this kind of data-structure proof-writing since college a decade ago, so I'm probably quite rusty and fully expect there to be some holes in my logic.
@ron-rivest I've updated this PR with some test cases to help show that the behavior of the sampler is not compromised by this optimization. Are there any further tests you would like to see?
Hi Jonah --
Thanks. I'll take a look at these. (May take a couple of days...)
One concern is that the routine should work right in a mode where you take a first sample of size n1, but don't get statistically significant results from that sample, so you take a second sample of size n2 (where drop is n1 now), and so on...
I'll be in touch soon...
Take care,
Ron
In our real-world usage of consistent_sampler for election audits with Arlo, we found that the sampler's memory usage grew in proportion to the number of ballots in the election, at a rate of about 300 bytes per ballot. For elections with millions of ballots, memory usage topped out at a couple of GB. That's not a big deal for a personal computer, but it's stressful for small cloud-based web servers, which are typically provisioned with little memory.
@benadida realized that you don't actually need to keep all of the ballots in memory as you build the min-heap of tickets; you only need to keep as many tickets as you want to sample, i.e. the set of tickets with the smallest numbers seen so far.
`heapq.nsmallest` implements this algorithm: it takes in an iterator and returns a list of the smallest `n` items without loading the entire iterator into memory at once.

This PR uses `heapq.nsmallest` to build the initial ticket heap in a memory-efficient way (when the desired sample size, `take`, is finite). Previously, memory usage was O(len(id_list)); with this PR it's O(take).
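A minimal sketch of the idea (the `build_ticket_heap` name and `tickets` iterator are illustrative, not the PR's actual code): when `take` is finite, feed the ticket stream through `heapq.nsmallest` so only `take` tickets are ever held in memory, then heapify the result for normal extraction.

```python
import heapq

def build_ticket_heap(tickets, take=None):
    """Build the initial min-heap of tickets.

    With a finite `take`, keep only the `take` smallest tickets (O(take)
    memory); otherwise fall back to materializing the whole stream.
    """
    if take is not None:
        smallest = heapq.nsmallest(take, tickets)  # consumes the iterator lazily
        heapq.heapify(smallest)  # nsmallest returns a sorted list; make it a heap
        return smallest
    heap = list(tickets)  # infinite take: O(len(id_list)) memory, as before
    heapq.heapify(heap)
    return heap

# Usage: a million-ticket stream, but only 3 tickets resident at the end.
heap = build_ticket_heap(iter(range(1_000_000, 0, -1)), take=3)
print([heapq.heappop(heap) for _ in range(len(heap))])  # [1, 2, 3]
```

Internally `nsmallest` keeps a bounded max-heap of candidates while scanning, which is what bounds the memory to the sample size rather than the election size.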