Currently, the library works exclusively with GeoPandas GeoDataFrames and requires the whole dataset to fit in memory from start to finish. This isn't ideal, since working with bigger areas requires more RAM. In this issue, we should decide which framework/library to use in the final pipeline.
Any insights or tips from people who have used the tools listed below would be very helpful 😄
Currently available options:
dask-geopandas - GeoPandas extension for Dask
Apache Sedona - dedicated wrapper over Apache Spark and Flink for spatial operations
duckdb-spatial - spatial extension for DuckDB, a fast in-process database
geoarrow-python - Python implementation of GeoArrow, a specification still under development for storing spatial objects in Apache Arrow
GeoPolars - geospatial extension for Polars, written in Rust
We should also decide whether the library will depend on a single framework or be open to extension with multiple backends, similar to the ibis project. Since our code is written against an abstract API, implementing multiple backends should be possible, but we will have to make sure that results are consistent across them (which requires high-quality tests). With different backends, either the output types will differ (Dask DataFrame, DuckDB relation, Sedona object, GeoDataFrame, GeoParquet/GeoFeather file path), or we will have to write an abstraction around each object to make the outputs consistent and backend-agnostic.
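To make the backend-agnostic option more concrete, here is a minimal sketch of what such an abstraction could look like. Everything in it is hypothetical: the `Backend` protocol, the `EagerBackend` and `LazyBackend` classes, and `run_filter` are illustrative names, not part of the library, and the two toy backends only stand in for an eager engine (like GeoPandas) and a lazy one (like Dask or DuckDB).

```python
from typing import Any, Callable, Dict, List, Protocol

Record = Dict[str, Any]
Predicate = Callable[[Record], bool]


class Backend(Protocol):
    """Hypothetical interface that every backend would implement."""

    def filter_rows(self, rows: List[Record], predicate: Predicate) -> Any:
        """Run the operation and return the backend's native result."""
        ...

    def to_records(self, result: Any) -> List[Record]:
        """Convert the native result into one common output type."""
        ...


class EagerBackend:
    """Toy stand-in for an eager backend: the native result is a list."""

    def filter_rows(self, rows: List[Record], predicate: Predicate) -> Any:
        return [row for row in rows if predicate(row)]

    def to_records(self, result: Any) -> List[Record]:
        return list(result)


class LazyBackend:
    """Toy stand-in for a lazy backend: the native result is a generator."""

    def filter_rows(self, rows: List[Record], predicate: Predicate) -> Any:
        return (row for row in rows if predicate(row))

    def to_records(self, result: Any) -> List[Record]:
        # Materialization happens only here, when the caller asks for it.
        return list(result)


def run_filter(backend: Backend, rows: List[Record], predicate: Predicate) -> List[Record]:
    """Library-level function: same call and same output type for any backend."""
    native = backend.filter_rows(rows, predicate)
    return backend.to_records(native)
```

A consistency test would then run the same input through every backend and compare the normalized outputs, which is where the high-quality tests mentioned above come in.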
We could also finish each operation by writing the result to a GeoParquet/Arrow/Feather file and work with files instead of loading everything into memory.
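The file-based idea could be sketched roughly as follows. This is only an illustration under loud assumptions: it uses JSON Lines files as a dependency-free stand-in for GeoParquet/Feather, and `apply_step` and `add_area_km2` are hypothetical helpers. The point is the shape of the API: each step takes a file path, streams through the data, and returns the path of the file it wrote.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

Record = Dict[str, Any]


def apply_step(source: Path, target: Path, transform: Callable[[Record], Record]) -> Path:
    """Stream records from source to target one line at a time.

    Only a single record is held in memory at any moment, and the
    returned path can be passed directly to the next step.
    """
    with source.open() as src, target.open("w") as dst:
        for line in src:
            record = transform(json.loads(line))
            dst.write(json.dumps(record) + "\n")
    return target


def add_area_km2(record: Record) -> Record:
    """Example transform: derive an area in square kilometres."""
    return {**record, "area_km2": record["area_m2"] / 1_000_000}
```

With this shape, a pipeline becomes a chain of calls where the output path of one step is the input path of the next, and the user decides when (or whether) to load the final file into a GeoDataFrame.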
Additional tools worth mentioning: