Limit memory usage when take is finite #1
Conversation
@@ -373,35 +373,42 @@ def next_ticket(ticket):
                       ticket.generation+1)
One thing I'm not sure about is whether I need to copy this file to pkg as well in order to import it in Arlo.
        )
        self.assertNotEqual(
            list(sampler(ids(n), 12345, take=n, with_replacement=True)),
            list(sampler(ids(n), 12346, take=n, with_replacement=True)),
Should we also be testing these things without replacement?
Yeah, those are the assertions just above this. I think I tested with and without replacement for every test case.
        for i in range(1, 10):
            for j in range(1, i):
                self.assertEqual(
                    list(sampler(ids(10), 12345))[:j],
Is there a reason we aren't passing ids(10) as a fixture? Not that it's a huge performance hit with a small number, but...
Uhh, I just didn't know how to do fixtures with unittest. Plus, it shuffles the list of ids every time it generates them, so I think that's a reason to create it on the fly each time.
        for i in range(1, n):
            K = random.sample(ids(n), random.randint(1, i))
            J = random.sample(K, random.randint(1, len(K)))
            self.assertEqual(
Just for readability, it might be nice to have a docstring explaining what we're doing here (also in the other tests).
Yes, good point - these make little sense without the explanatory math.
            J = random.sample(K, random.randint(1, len(K)))
            self.assertEqual(
                list(sampler(J, 12345, output="id")),
                [k for k in list(sampler(K, 12345, output="id")) if k in J],
Is the list cast needed? Isn't the output of sampler an iterator?
Not needed, you're right.
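The test above exercises the "consistent" property that gives the sampler its name: sampling a subset J of K yields the ids of J in the same relative order as sampling K and discarding ids outside J. A minimal sketch of why such a property can hold, using a hypothetical hash-based ticket assignment (not the library's actual ticket arithmetic):

```python
import hashlib

def sample_order(id_list, seed):
    # Hypothetical sketch of consistent sampling without replacement: each
    # id gets a pseudorandom ticket derived only from (seed, id), so an
    # id's position never depends on which other ids are present.
    def ticket(id):
        return hashlib.sha256(f"{seed},{id}".encode()).hexdigest()
    return sorted(id_list, key=ticket)

K = ["a", "b", "c", "d", "e"]
J = ["b", "d", "e"]
# Sampling J directly matches sampling K and keeping only J's ids.
assert sample_order(J, 12345) == [k for k in sample_order(K, 12345) if k in J]
```

Because each ticket depends only on the seed and the id, filtering the sorted order of K down to J's members is the same as sorting J directly.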
In our real-world usage of consistent_sampler for election audits with Arlo, we found that the sampler's memory usage grew in proportion to the number of ballots in the election, at a rate of about 300 bytes per ballot. For elections with millions of ballots, memory usage topped out at a couple of GB - not a big deal for a personal computer, but stressful for small cloud-based web servers, which are typically provisioned with only modest amounts of memory.
@benadida realized that you don't actually need to keep all of the ballots in memory as you build the min-heap of tickets; you only need to keep as many tickets as you want to sample - the set of tickets with the smallest numbers seen so far. heapq.nsmallest implements exactly this algorithm: it takes an iterator and returns a list of the n smallest items without ever loading the entire iterator into memory at once.
This PR uses heapq.nsmallest to build the initial ticket heap in a memory-efficient way when the desired sample size (take) is finite. Previously, memory usage was O(len(id_list)); with this PR it's O(take).
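A rough sketch of the idea, with a hypothetical hash-based ticket generator standing in for the sampler's real ticket numbers:

```python
import hashlib
import heapq

def tickets(id_list, seed):
    # Lazily yield (ticket_number, id) pairs one at a time. The hash-based
    # ticket here is a hypothetical stand-in for the sampler's real ticket
    # arithmetic; what matters is that this is a generator, not a list.
    for id in id_list:
        digest = hashlib.sha256(f"{seed},{id}".encode()).hexdigest()
        yield (int(digest, 16) / 2**256, id)

def initial_heap(id_list, seed, take):
    # heapq.nsmallest consumes the generator item by item, keeping only the
    # `take` smallest tickets seen so far, so peak memory is O(take) rather
    # than O(len(id_list)).
    return heapq.nsmallest(take, tickets(id_list, seed))

# Only about `take` tickets are ever held at once, even for 100,000 ids.
smallest = initial_heap((str(i) for i in range(100_000)), 12345, take=10)
```

heapq.nsmallest returns its result in ascending ticket order, which is exactly the order a min-heap of all tickets would pop them in, so the rest of the sampler can proceed unchanged.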
See ron-rivest#4