Skip to content

Faster parsing #263

@michaelkirk

Description

@michaelkirk

I'm investigating #223. Recall we currently parse to a serde_json::JsonValue and convert that to our geojson types. This entails parsing every coordinate to a Vec (which, as of #222, then gets converted to a TinyVec). It seems obvious we'd do better to parse directly to TinyVec.

I thought to myself, "Why do we need to have manual deserialization at all? Why can't we derive all of our deserialization with serde?"

It required some structural juggling, but wasn't very hard and is quite concise! Here's a POC I've been messing with:
https://github.com/michaelkirk/geojson_fast

Unfortunately, it's not necessarily any faster, and can even be slower (despite the "_fast" branding in my crate name 😞). This surprised me at first. How could it be slower if we're avoiding the intermediate JSONValue representation (and it's per coordinate Vecs!)

parse GeoJSON Object (geojson crate) (countries.geojson)
                        time:   [1.5592 ms 1.5626 ms 1.5661 ms]

parse GeoJSON Object (geojson_fast) (countries.geojson)                                                                                     
                        time:   [1.8901 ms 1.9085 ms 1.9356 ms]                      

I think it's mostly due to serde-rs/serde#1495

When parsing adjacently tagged enums (e.g. the geojson::GeoJSON enum), (I think) all the fields are effectively buffered in an intermediate representation until deserialization is complete.

My understanding is that the generic serde deserializaer can't know which variant it's encountered until it has completed all the fields in the variant. As a contrived example:

#[serde(tag)]
enum MyEnum {
   Variant1 { coordinates: Vec<TinyVec<f64 ,2>> },
   Variant2 { coordinates: Vec<Vec<f64>> }
}

let input = '{ "coordinates": [[1.0 2.0]], type: "variant2" }';
//                                                       ^________ You have to parse all the way to here 
//                                                                 before you know whether coordinates should have been a
//                                                                 Vec<TinyVec<f64 ,2>> or a Vec<Vec<f64>>

A more concrete example for geojson could be:

{ "coordinates": [[...]], "type": "Polygon" }

## coordinates have same dimensionality
{ "coordinates": [[...]], "type": "MultiLineString" }

Conceptually if we could guarantee the tag came first, we could optimize things with a finite amount of peeking:

{"type": "Polygon" , "coordinates": [[...]]}

## coordinates have same dimensionality
{"type": "MultiLineString", "coordinates": [[...]]}

But we can make no such guarantees, and serde is a generic deserializer, so I don't think we can get any tricks like this for free. It might be possible to write a custom deserializer that is smarter about this. Conceptually it seems like it should be possible to "early terminate" or perhaps provide some ambiguous intermediate structs to help make deserializing geojson optimal.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions