Limit memory usage when take is finite #1
Conversation
@@ -373,35 +373,42 @@ def next_ticket(ticket):
                       ticket.generation+1)
One thing I'm not sure about is whether I need to copy this file to pkg as well in order to import it in Arlo.
        )
        self.assertNotEqual(
            list(sampler(ids(n), 12345, take=n, with_replacement=True)),
            list(sampler(ids(n), 12346, take=n, with_replacement=True)),
Should we also be testing these things without replacement?
Yeah, those are the assertions just above this. I think I tested with and without replacement for every test case.
        for i in range(1, 10):
            for j in range(1, i):
                self.assertEqual(
                    list(sampler(ids(10), 12345))[:j],
Is there a reason we aren't passing ids(10) as a fixture? Not that it's a huge performance hit with a small number, but...
Uhh, I just didn't know how to do fixtures with unittest. Plus, it shuffles the list of ids every time it generates them, so I think that's a reason to create it on the fly each time.
        for i in range(1, n):
            K = random.sample(ids(n), random.randint(1, i))
            J = random.sample(K, random.randint(1, len(K)))
            self.assertEqual(
Just for readability, it might be nice to have a docstring explaining what we're doing here (also in the other tests).
Yes, good point - these make little sense without the explanatory math.
            J = random.sample(K, random.randint(1, len(K)))
            self.assertEqual(
                list(sampler(J, 12345, output="id")),
                [k for k in list(sampler(K, 12345, output="id")) if k in J],
Is the list cast needed? Isn't the output of sampler an iterator?
Not needed, you're right.
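The test above exercises the "consistent" property that gives the sampler its name: sampling a subset J of K yields the ids of J in the same relative order as sampling K and discarding ids outside J. A minimal sketch of why such a property can hold, using a hypothetical hash-based ticket assignment (not the library's actual ticket arithmetic):

```python
import hashlib

def sample_order(id_list, seed):
    # Hypothetical sketch of consistent sampling without replacement: each
    # id gets a pseudorandom ticket derived only from (seed, id), so an
    # id's position never depends on which other ids are present.
    def ticket(id):
        return hashlib.sha256(f"{seed},{id}".encode()).hexdigest()
    return sorted(id_list, key=ticket)

K = ["a", "b", "c", "d", "e"]
J = ["b", "d", "e"]
# Sampling J directly matches sampling K and keeping only J's ids.
assert sample_order(J, 12345) == [k for k in sample_order(K, 12345) if k in J]
```

Because each ticket depends only on the seed and the id, filtering the sorted order of K down to J's members is the same as sorting J directly.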
In our real-world usage of consistent_sampler for election audits with Arlo, we found that the sampler's memory usage grew in proportion to the number of ballots in the election, at a rate of about 300 bytes per ballot. For elections with millions of ballots, memory usage topped out at a couple of GB - not a big deal for a personal computer, but stressful for small cloud-based web servers, which are typically provisioned with only modest amounts of memory.
@benadida realized that you don't actually need to keep all of the ballots in memory as you build the min-heap of tickets; you only need to keep as many tickets as you want to sample - the set of tickets with the smallest numbers seen so far. heapq.nsmallest implements exactly this algorithm: it takes an iterator and returns a list of the n smallest items without ever loading the entire iterator into memory at once.
This PR uses heapq.nsmallest to build the initial ticket heap in a memory-efficient way when the desired sample size (take) is finite. Previously, memory usage was O(len(id_list)); with this PR it's O(take).
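A rough sketch of the idea, with a hypothetical hash-based ticket generator standing in for the sampler's real ticket numbers:

```python
import hashlib
import heapq

def tickets(id_list, seed):
    # Lazily yield (ticket_number, id) pairs one at a time. The hash-based
    # ticket here is a hypothetical stand-in for the sampler's real ticket
    # arithmetic; what matters is that this is a generator, not a list.
    for id in id_list:
        digest = hashlib.sha256(f"{seed},{id}".encode()).hexdigest()
        yield (int(digest, 16) / 2**256, id)

def initial_heap(id_list, seed, take):
    # heapq.nsmallest consumes the generator item by item, keeping only the
    # `take` smallest tickets seen so far, so peak memory is O(take) rather
    # than O(len(id_list)).
    return heapq.nsmallest(take, tickets(id_list, seed))

# Only about `take` tickets are ever held at once, even for 100,000 ids.
smallest = initial_heap((str(i) for i in range(100_000)), 12345, take=10)
```

heapq.nsmallest returns its result in ascending ticket order, which is exactly the order a min-heap of all tickets would pop them in, so the rest of the sampler can proceed unchanged.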
See ron-rivest#4