I'm investigating #223. Recall we currently parse to a serde_json::JsonValue and convert that to our geojson types. This entails parsing every coordinate to a Vec (which, as of #222, then gets converted to a TinyVec). It seems obvious we'd do better to parse directly to TinyVec.
I thought to myself, "Why do we need to have manual deserialization at all? Why can't we derive all of our deserialization with serde?"
It required some structural juggling, but wasn't very hard and is quite concise! Here's a POC I've been messing with:
https://github.com/michaelkirk/geojson_fast
Unfortunately, it's not necessarily any faster, and can even be slower (despite the "_fast" branding in my crate name 😞). This surprised me at first. How could it be slower if we're avoiding the intermediate JSONValue representation (and it's per coordinate Vecs!)
parse GeoJSON Object (geojson crate) (countries.geojson)
time: [1.5592 ms 1.5626 ms 1.5661 ms]
parse GeoJSON Object (geojson_fast) (countries.geojson)
time: [1.8901 ms 1.9085 ms 1.9356 ms]
I think it's mostly due to serde-rs/serde#1495
When parsing adjacently tagged enums (e.g. the geojson::GeoJSON enum), (I think) all the fields are effectively buffered in an intermediate representation until deserialization is complete.
My understanding is that the generic serde deserializaer can't know which variant it's encountered until it has completed all the fields in the variant. As a contrived example:
#[serde(tag)]
enum MyEnum {
Variant1 { coordinates: Vec<TinyVec<f64 ,2>> },
Variant2 { coordinates: Vec<Vec<f64>> }
}
let input = '{ "coordinates": [[1.0 2.0]], type: "variant2" }';
// ^________ You have to parse all the way to here
// before you know whether coordinates should have been a
// Vec<TinyVec<f64 ,2>> or a Vec<Vec<f64>>
A more concrete example for geojson could be:
{ "coordinates": [[...]], "type": "Polygon" }
## coordinates have same dimensionality
{ "coordinates": [[...]], "type": "MultiLineString" }
Conceptually if we could guarantee the tag came first, we could optimize things with a finite amount of peeking:
{"type": "Polygon" , "coordinates": [[...]]}
## coordinates have same dimensionality
{"type": "MultiLineString", "coordinates": [[...]]}
But we can make no such guarantees, and serde is a generic deserializer, so I don't think we can get any tricks like this for free. It might be possible to write a custom deserializer that is smarter about this. Conceptually it seems like it should be possible to "early terminate" or perhaps provide some ambiguous intermediate structs to help make deserializing geojson optimal.
I'm investigating #223. Recall we currently parse to a
serde_json::JsonValueand convert that to our geojson types. This entails parsing every coordinate to aVec(which, as of #222, then gets converted to aTinyVec). It seems obvious we'd do better to parse directly to TinyVec.I thought to myself, "Why do we need to have manual deserialization at all? Why can't we derive all of our deserialization with serde?"
It required some structural juggling, but wasn't very hard and is quite concise! Here's a POC I've been messing with:
https://github.com/michaelkirk/geojson_fast
Unfortunately, it's not necessarily any faster, and can even be slower (despite the "_fast" branding in my crate name 😞). This surprised me at first. How could it be slower if we're avoiding the intermediate JSONValue representation (and it's per coordinate
Vecs!)I think it's mostly due to serde-rs/serde#1495
When parsing adjacently tagged enums (e.g. the geojson::GeoJSON enum), (I think) all the fields are effectively buffered in an intermediate representation until deserialization is complete.
My understanding is that the generic serde deserializaer can't know which variant it's encountered until it has completed all the fields in the variant. As a contrived example:
A more concrete example for geojson could be:
Conceptually if we could guarantee the tag came first, we could optimize things with a finite amount of peeking:
But we can make no such guarantees, and serde is a generic deserializer, so I don't think we can get any tricks like this for free. It might be possible to write a custom deserializer that is smarter about this. Conceptually it seems like it should be possible to "early terminate" or perhaps provide some ambiguous intermediate structs to help make deserializing geojson optimal.