Parser improvements for French/Spanish #195
Merged
The recent work in #180 & #179 introduced some regressions in Spanish parsing when improving the French `bis/ter` parsing, which never sat well with me. Specifically, the following regression was concerning:

(0.84) ➜ [ { street: 'Calle Principal' }, { housenumber: '20' } ]
vs.
(0.90) ➜ [ { street: 'Calle Principal' }, { housenumber: '20 B' } ]

At the time this appeared to be a positive change, as the unit number seems to be correctly parsed. However, it didn't work well when used in the `pelias/api` codebase: the query generation would consider this a strict query for "20 B", returning nothing if it doesn't exist, rather than returning "20 B...", where the "B" was, in this case, the beginning of "Barcelona".

So in order to avoid reverting the positive change to the `bis/ter` parsing of French addresses, I had to dig into the code and make some non-trivial changes.

There was one other beneficial change in that PR, which was unplanned:

(0.98) ➜ [ { housenumber: '10' }, { street: 'A Main Street' } ]
vs.
(0.86) ➜ [ { housenumber: '10 A' }, { street: 'Main Street' } ]

This seems to be an improvement, which I will unfortunately revert in this PR. The "A Main Street" parse is definitely not great, and I would like to improve it in the future.
For future reference, the "A Main Street" parse should be easier to solve than the "Calle Principal 20 B" parse simply because the unit number designation isn't the final token.
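To illustrate why the position of the letter matters, here is a minimal sketch. The helper `trailingLetterIsAmbiguous` is hypothetical and is not part of pelias/parser. It assumes whitespace-separated tokens and only flags a single letter that follows a number and is also the last token, since in an autocomplete context that letter may be the start of an unfinished word.

```javascript
// Hypothetical helper for illustration only; not the parser's actual logic.
// Returns true when the token at `index` is a single letter that directly
// follows a number AND is the final token, so it could equally be a unit
// designation ("20 B") or the start of an unfinished word ("20 B[arcelona]").
function trailingLetterIsAmbiguous(tokens, index) {
  const isSingleAlpha = /^[A-Za-z]$/.test(tokens[index]);
  const followsNumber = index > 0 && /^\d+$/.test(tokens[index - 1]);
  const isFinalToken = index === tokens.length - 1;
  return isSingleAlpha && followsNumber && isFinalToken;
}

const spanish = 'Calle Principal 20 B'.split(' ');
const english = '10 A Main Street'.split(' ');

console.log(trailingLetterIsAmbiguous(spanish, 3)); // true: "B" is the last token
console.log(trailingLetterIsAmbiguous(english, 1)); // false: "A" is followed by more tokens
```

In the "10 A Main Street" case the tokens after the letter are already known, so there is more context available to decide between the two readings.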
This PR consists of 3 commits:
... `bis/ter` as a `Subdivision`. This new classification is used instead of `Stopword`, as that classification is too generic to do anything really useful with.
... `StopWordClassification` in far too many places. This is kind of unfortunate, but I didn't want to attempt too many changes there. Instead I opted to create a new classification, `SingleAlphaClassification`, which just represents, yeah, a single alpha char. This allowed me to pass the failing tests for terms which relied on "a" being classified as a stopword.

resolves: #191
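The idea of a dedicated single-letter classification can be sketched roughly as follows. This is an assumption-laden illustration rather than the code in this PR: the `classifyToken` function, the label strings, and the tiny `STOPWORDS` set are all hypothetical and do not reflect the parser's real classifier classes or dictionaries.

```javascript
// Hypothetical sketch; names, labels and word list are illustrative only.
// A small stopword list that deliberately does NOT contain "a".
const STOPWORDS = new Set(['de', 'la', 'the', 'of']);

// Returns the list of labels that apply to a single token.
function classifyToken(token) {
  const labels = [];
  // A single alphabetic character gets its own, narrower label
  // instead of being lumped in with generic stopwords.
  if (/^[A-Za-z]$/.test(token)) {
    labels.push('single_alpha');
  }
  if (STOPWORDS.has(token.toLowerCase())) {
    labels.push('stopword');
  }
  return labels;
}

console.log(classifyToken('a'));    // [ 'single_alpha' ]
console.log(classifyToken('de'));   // [ 'stopword' ]
console.log(classifyToken('Main')); // []
```

Keeping the two labels separate lets downstream logic that previously depended on "a" being a stopword check for the narrower label instead, without widening what counts as a stopword.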