Feat: add big data workflow capabilities #396

Currently, the library relies entirely on GeoPandas `GeoDataFrame` objects and requires the whole dataset to fit in memory on a single machine from start to finish. This isn't ideal, since working with bigger areas requires more RAM. In this issue, we should decide which framework/library to use in the final pipeline.

Any insights or tips from people who have used the tools listed below would be very helpful 😄

Currently available options:

- [`dask-geopandas`](https://github.com/geopandas/dask-geopandas) - GeoPandas extension for Dask
- [`Apache Sedona`](https://github.com/apache/sedona) - dedicated wrapper over Apache Spark and Flink for spatial operations
- [`duckdb-spatial`](https://github.com/duckdb/duckdb_spatial) - spatial extension for DuckDB, a fast in-process analytical database
- [`geoarrow-python`](https://github.com/geoarrow/geoarrow-python) - Python implementation of GeoArrow, a specification (still under development) for storing spatial objects in Apache Arrow
- [`GeoPolars`](https://github.com/geopolars/geopolars) - geospatial extension for Polars, written in Rust

We should also decide whether the library will depend on a single framework or stay open for extension and support multiple backends, similar to the [`ibis`](https://github.com/ibis-project/ibis) project. Since our code is written against an abstract API, implementing multiple backends should be feasible, but it comes with two requirements. First, results must be consistent across backends, which calls for high-quality tests. Second, each backend natively returns a different output type (Dask DataFrame, DuckDB relation, Sedona object, GeoDataFrame, GeoParquet/GeoFeather file path), so we either accept different outputs per backend or write an abstraction around each object to make the output consistent and backend-agnostic.

We could also finish each operation by writing its result to a GeoParquet/Arrow/Feather file and work on files instead of loading them into memory.
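The file-based approach could look roughly like the sketch below, where every step reads from one path, streams records, and writes to another path, so only a file path is passed between steps. This is an illustration under stated assumptions: newline-delimited JSON stands in for GeoParquet/Feather so that the example needs only the standard library, and all function names and record fields (`kind`, `area_m2`) are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Iterable, Iterator, List

# A step reads from a source path and writes its result to a destination path.
Step = Callable[[Path, Path], Path]


def write_records(path: Path, records: Iterable[dict]) -> None:
    """Write records one per line (stand-in for writing a GeoParquet file)."""
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_records(path: Path) -> Iterator[dict]:
    """Stream records one at a time instead of loading the whole file."""
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)


def keep_buildings(src: Path, dst: Path) -> Path:
    """Hypothetical step: keep only records tagged as buildings."""
    write_records(dst, (r for r in read_records(src) if r["kind"] == "building"))
    return dst


def add_area_km2(src: Path, dst: Path) -> Path:
    """Hypothetical step: derive a new column from an existing one."""
    write_records(
        dst,
        ({**r, "area_km2": r["area_m2"] / 1_000_000} for r in read_records(src)),
    )
    return dst


def run_pipeline(src: Path, steps: List[Step], workdir: Path) -> Path:
    """Chain steps through intermediate files and return the final file path."""
    current = src
    for i, step in enumerate(steps):
        current = step(current, workdir / f"step_{i}.jsonl")
    return current


def demo() -> List[dict]:
    """Run a two-step pipeline in a temporary directory and return the result."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        src = workdir / "input.jsonl"
        write_records(src, [
            {"kind": "building", "area_m2": 2_000_000},
            {"kind": "road", "area_m2": 500},
        ])
        out = run_pipeline(src, [keep_buildings, add_area_km2], workdir)
        return list(read_records(out))
```

Because each step only holds one record at a time, peak memory stays roughly constant regardless of input size. The trade-off is extra disk I/O between steps.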


Additional tools worth mentioning:
- https://github.com/rapidsai/cuspatial - CUDA-accelerated spatial operations
