016 - Spark Migration #430

eveleighoj · 2026-05-13T14:52:21Z

eveleighoj
May 13, 2026
Maintainer

Introduction

When creating dataset packages as SQLite files, we have been hitting the limits of what can be achieved on a single machine, specifically for large datasets such as title boundaries. To improve this, we are moving to Spark via EMR Serverless.

Status

Draft

Summary And Content

Spark allows us to process much larger datasets far more quickly by using distributed compute. However, it has some major impacts on how we currently do things. Initially we'll focus only on the collection workflow, but we may be able to optimise other workflows later.

The current workflow looks like:

The new workflow alters things to:

Key differences:

Instead of our silver layer being made up primarily of SQLite databases viewed via Datasette, we replace it with a Delta Lake layer. Delta Lake is a Parquet-based way of storing files in S3 that allows easier access and is very powerful for analytical queries.
We still create dataset packages and allow access via Datasette, but we consider them part of the gold layer, with the hope of reducing what's in them and removing dependencies on Datasette in the future. Instead, with Spark and an upgraded RDS database, we can surface more data through the PostGIS database.
The old assemble code can be retired, with the packaging step instead focused on building packages from Parquet datasets in S3.
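
As a rough illustration of what a slimmed-down packaging step could look like, the sketch below writes already-loaded rows into a SQLite dataset package using only the Python standard library. It is an assumption-laden sketch, not the actual packaging code: the function name write_package, the table and column names, and the idea that rows arrive as dictionaries (for example after being read from a Parquet dataset by some other tool) are all hypothetical.

```python
import sqlite3
from typing import Iterable, Mapping


def write_package(
    db_path: str,
    table: str,
    rows: Iterable[Mapping[str, object]],
    columns: list[str],
) -> int:
    """Write rows into a fresh SQLite table and return the number of rows stored.

    Hypothetical helper: reading the rows from Parquet/S3 is assumed to
    happen elsewhere and is deliberately left out of this sketch.
    """
    quoted_columns = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)

    conn = sqlite3.connect(db_path)
    try:
        # The connection context manager commits on success, rolls back on error.
        with conn:
            # Rebuild the table from scratch so the package mirrors its source.
            conn.execute(f'DROP TABLE IF EXISTS "{table}"')
            conn.execute(f'CREATE TABLE "{table}" ({quoted_columns})')
            conn.executemany(
                f'INSERT INTO "{table}" VALUES ({placeholders})',
                # Missing keys become NULL rather than raising.
                ([row.get(c) for c in columns] for row in rows),
            )
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
```

Keeping the packaging step this thin would leave the heavy calculations in the Spark layer; the package then only copies finished rows into a file that Datasette can serve.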

Upsides:

For the assembly step we will be using Spark. This means that entity calculations will be possible for any dataset. It is also generally easier code to write, especially as calculations get more complex.

Downsides:

We will need to consider how to share history outside of PostGIS if someone wants to download it, as not all datasets will have a SQLite package. That said, the volume of data was becoming too large for this anyway.
