Faster parsing

I'm investigating #223. Recall we currently parse to a `serde_json::JsonValue` and convert that to our geojson types. This entails parsing every coordinate to a `Vec` (which, as of #222, then gets converted to a `TinyVec`). It seems obvious we'd do better to parse directly to TinyVec.

I thought to myself, "Why do we need to have manual deserialization at all? Why can't we derive all of our deserialization with serde?"

It required some structural juggling, but wasn't very hard and is quite concise! Here's a POC I've been messing with: 
https://github.com/michaelkirk/geojson_fast

Unfortunately, it's not necessarily any faster, and can even be slower (despite the "_fast" branding in my crate name 😞). This surprised me at first. How could it be slower if we're avoiding the intermediate JSONValue representation (and it's per coordinate `Vec`s!)

```
parse GeoJSON Object (geojson crate) (countries.geojson)
                        time:   [1.5592 ms 1.5626 ms 1.5661 ms]

parse GeoJSON Object (geojson_fast) (countries.geojson)                                                                                     
                        time:   [1.8901 ms 1.9085 ms 1.9356 ms]                      
```

I think it's mostly due to https://github.com/serde-rs/serde/issues/1495

When parsing adjacently tagged enums (e.g. the geojson::GeoJSON enum), (I think) all the fields are effectively buffered in an intermediate representation until deserialization is complete.

My understanding is that the generic serde deserializaer can't know which variant it's encountered until it has completed all the fields in the variant. As a contrived example:
```
#[serde(tag)]
enum MyEnum {
   Variant1 { coordinates: Vec<TinyVec<f64 ,2>> },
   Variant2 { coordinates: Vec<Vec<f64>> }
}

let input = '{ "coordinates": [[1.0 2.0]], type: "variant2" }';
//                                                       ^________ You have to parse all the way to here 
//                                                                 before you know whether coordinates should have been a
//                                                                 Vec<TinyVec<f64 ,2>> or a Vec<Vec<f64>>
```

A more concrete example for geojson could be:
```
{ "coordinates": [[...]], "type": "Polygon" }

## coordinates have same dimensionality
{ "coordinates": [[...]], "type": "MultiLineString" }
```

Conceptually if we could guarantee the tag came first, we could optimize things with a finite amount of peeking:
```
{"type": "Polygon" , "coordinates": [[...]]}

## coordinates have same dimensionality
{"type": "MultiLineString", "coordinates": [[...]]}
```

But we can make no such guarantees, and serde is a generic deserializer, so I don't think we can get any tricks like this for free. It might be possible to write a custom deserializer that is smarter about this. Conceptually it seems like it should be possible to "early terminate" or perhaps provide some ambiguous intermediate structs to help make deserializing geojson optimal.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Faster parsing #263

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Faster parsing #263

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions