016 - Spark Migration #430
eveleighoj started this conversation in Open design proposal
Introduction
When creating dataset packages as SQLite files, we have been hitting the limits of what can be achieved on a single machine, particularly for large datasets such as title boundaries. To address this, we are moving to Spark, run via EMR Serverless.
Status
Draft
Summary And Content
Spark lets us process much larger datasets far more quickly by using distributed compute. However, it has some major impacts on how we currently do things. Initially we will focus only on the collection workflow, though we may be able to optimise other workflows later.
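As a rough illustration of what "Spark via EMR Serverless" could mean for the collection workflow, here is a minimal sketch of submitting a job through the boto3 `emr-serverless` client's `start_job_run` call. Everything specific in it is an assumption made for illustration and not part of this proposal: the application ID, role ARN, S3 entry point, the `--dataset` argument, and the `collection-<dataset>` job name are all hypothetical placeholders.

```python
from typing import Any, Dict


def build_job_run_request(
    application_id: str,
    execution_role_arn: str,
    entry_point: str,
    dataset: str,
) -> Dict[str, Any]:
    """Build keyword arguments for the EMR Serverless StartJobRun call.

    The naming scheme and the --dataset argument are hypothetical; the
    real job script would define its own arguments.
    """
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "name": f"collection-{dataset}",
        "jobDriver": {
            "sparkSubmit": {
                # Location of the PySpark script to run (placeholder).
                "entryPoint": entry_point,
                "entryPointArguments": ["--dataset", dataset],
            }
        },
    }


def submit(request: Dict[str, Any]) -> str:
    """Submit the job and return its run ID. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr-serverless")
    response = client.start_job_run(**request)
    return response["jobRunId"]
```

Keeping the request builder separate from the submission call means the job parameters can be checked locally without AWS access, for example `build_job_run_request("app-id", "role-arn", "s3://example-bucket/job.py", "title-boundary")`.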
The current workflow looks like:
The new workflow alters things to:
Key differences:
Upsides:
Downsides: