
Limit memory usage when take is finite#1

Merged
jonahkagan merged 6 commits into master from cap-heap-size
Feb 17, 2021

Conversation

@jonahkagan
Collaborator

In our real-world usage of consistent_sampler for election audits with Arlo, we found that the sampler's memory usage grew proportionally to the number of ballots in the election, at a rate of about 300 bytes per ballot. For elections with millions of ballots, memory usage topped out at a couple of GB - not a big deal for a personal computer, but stressful for small cloud-based web servers, which are typically provisioned with little memory.

@benadida realized that you don't actually need to keep all of the ballots in memory as you build the min-heap of tickets - you only need to keep as many tickets as you want to sample: the set of tickets with the smallest numbers seen so far. heapq.nsmallest implements this algorithm: it takes an iterator and returns a list of the smallest n items without loading the entire iterator into memory at once.

This PR uses heapq.nsmallest to build the initial ticket heap in a memory-efficient way (when the desired sample size - take - is finite). Previously, memory usage was O(len(id_list)); with this PR it's O(take).

See ron-rivest#4
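The technique can be sketched roughly like this. This is a minimal illustration, not the actual consistent_sampler code: the ticket numbers here are plain random floats standing in for the sampler's real hash-derived pseudorandom ticket numbers, and `ticket_stream`/`initial_sample` are hypothetical names.

```python
import heapq
import random

def ticket_stream(id_list, seed):
    """Yield (ticket_number, id) pairs lazily, one per id.

    Stand-in for the real ticket generation, which derives each
    ticket number from a hash of the seed and the id.
    """
    rng = random.Random(seed)
    for id_ in id_list:
        yield (rng.random(), id_)

def initial_sample(id_list, seed, take):
    # heapq.nsmallest consumes the iterator one item at a time and
    # keeps only the `take` smallest items seen so far, so peak
    # memory is O(take) rather than O(len(id_list)).
    return heapq.nsmallest(take, ticket_stream(id_list, seed))
```

Because the generator is consumed lazily, only `take` (ticket, id) pairs are ever resident at once, which is exactly the O(take) bound described above.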

@@ -373,35 +373,42 @@ def next_ticket(ticket):
ticket.generation+1)
Collaborator Author

@jonahkagan Feb 17, 2021


One thing I'm not sure about is whether I need to copy this file to pkg as well in order to import it in Arlo.

)
self.assertNotEqual(
list(sampler(ids(n), 12345, take=n, with_replacement=True)),
list(sampler(ids(n), 12346, take=n, with_replacement=True)),

Should we also be testing these things without replacement?

Collaborator Author


Yeah, those are the assertions just above this. I think I tested with and without replacement for every test case.

for i in range(1, 10):
for j in range(1, i):
self.assertEqual(
list(sampler(ids(10), 12345))[:j],

Is there a reason we aren't passing ids(10) as a fixture? Not that it's a huge performance hit with a small number, but...

Collaborator Author


Uhh, I just didn't know how to do fixtures with unittest. Plus, it shuffles the list of ids every time it generates them, so I think that's a reason to create it on the fly every time.
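For reference, a "fixture" with plain unittest would just be a setUp method. This is a hypothetical sketch - `ids` and `TestSampler` are illustrative names, not the real test file - and, as the comment above notes, setUp still regenerates the shuffled list before every test, so the shuffling behavior is preserved either way.

```python
import random
import unittest

def ids(n):
    # Illustrative stand-in for the test helper: n shuffled string ids.
    id_list = [str(i) for i in range(n)]
    random.shuffle(id_list)
    return id_list

class TestSampler(unittest.TestCase):
    def setUp(self):
        # setUp runs before every test method, so each test still gets
        # a freshly shuffled list - same effect as building it inline.
        self.ids_10 = ids(10)

    def test_ids_fixture(self):
        # The shuffled fixture contains exactly the ids "0".."9".
        self.assertEqual(sorted(self.ids_10, key=int),
                         [str(i) for i in range(10)])
```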

for i in range(1, n):
K = random.sample(ids(n), random.randint(1, i))
J = random.sample(K, random.randint(1, len(K)))
self.assertEqual(

Just for readability, it might be nice to have a docstring explaining what we're doing here (also in the other tests).

Collaborator Author


Yes, good point - these make little sense without the explanatory math.

Comment thread code/test_consistent_sampler.py Outdated
J = random.sample(K, random.randint(1, len(K)))
self.assertEqual(
list(sampler(J, 12345, output="id")),
[k for k in list(sampler(K, 12345, output="id")) if k in J],

Is the list cast needed? Isn't the output of sampler an iterator?

Collaborator Author


Not needed - you're right.
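A small generic illustration of why that inner list() cast is redundant (plain Python, not the actual sampler): a comprehension can iterate a generator directly, so materializing it into a list first changes nothing.

```python
def gen():
    # Any generator will do for the illustration.
    yield from (1, 2, 3)

# Iterating the generator directly and iterating list(gen()) produce
# the same comprehension result; the list() wrapper is redundant.
direct = [x for x in gen()]
via_list = [x for x in list(gen())]
assert direct == via_list == [1, 2, 3]
```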

@jonahkagan jonahkagan merged commit 9e83c26 into master Feb 17, 2021
jonahkagan added a commit that referenced this pull request Feb 17, 2021