@missinglink missinglink commented Sep 18, 2025

The recent work in #180 & #179 introduced some regressions in Spanish parsing while improving the French bis/ter parsing; this never sat well with me.

Specifically, the following regression was concerning:

(0.84) ➜ [ { street: 'Calle Principal' }, { housenumber: '20' } ]
vs.
(0.90) ➜ [ { street: 'Calle Principal' }, { housenumber: '20 B' } ]

At the time this appeared to be a positive change, as the unit number seems to be correctly parsed. However, it didn't work well when used in the pelias/api codebase: the query generation would treat this as a strict query for "20 B", returning nothing if that housenumber doesn't exist, rather than matching "20 B..." which, in this case, was the beginning of "Barcelona".
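To illustrate the problem (this is a self-contained sketch, not the actual pelias/api query generation; the document set and function names are made up for illustration):

```javascript
// Hypothetical indexed document: the real address is 'Calle Principal 20, Barcelona'
const docs = [
  { housenumber: '20', street: 'calle principal', locality: 'barcelona' }
]

// strict query: the housenumber must equal the parsed value exactly,
// so '20 B' matches nothing
function strictMatch (docs, housenumber) {
  return docs.filter(d => d.housenumber === housenumber.toLowerCase())
}

// lenient query: the trailing single letter may instead be the start of
// an unfinished word (here 'Barcelona'), so match the numeric part and
// treat the remainder as a prefix of the next field
function lenientMatch (docs, housenumber) {
  const [num, ...rest] = housenumber.toLowerCase().split(' ')
  return docs.filter(d =>
    d.housenumber === num &&
    rest.every(prefix => d.locality.startsWith(prefix))
  )
}

console.log(strictMatch(docs, '20 B').length)  // 0 – nothing returned
console.log(lenientMatch(docs, '20 B').length) // 1 – matches Barcelona
```

The strict behaviour is what made the (0.90) parse above a regression in practice, despite looking like a better parse on paper.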

So in order to avoid reverting the positive change to the bis/ter parsing of French addresses, I had to dig into the code and make some non-trivial changes:

"1 bis Rue Ballainvilliers 63000 Clermont-Ferrand"
(0.76) ➜ [
  { street: 'Rue Ballainvilliers' },
  { housenumber: '63000' },
  { locality: 'Clermont-Ferrand' }
]
vs.
(0.95) ➜ [
  { housenumber: '1 bis' },
  { street: 'Rue Ballainvilliers' },
  { postcode: '63000' },
  { locality: 'Clermont-Ferrand' }
]

There was one other beneficial change in that PR, which was unplanned:

(0.98) ➜ [ { housenumber: '10' }, { street: 'A Main Street' } ]
vs.
(0.86) ➜ [ { housenumber: '10 A' }, { street: 'Main Street' } ]

This seems to be an improvement which I will unfortunately revert in this PR; the "A Main Street" parse is definitely not great, and I would like to improve this in the future.

For future reference, the "A Main Street" parse should be easier to solve than the "Calle Principal 20 B" parse simply because the unit number designation isn't the final token.
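The positional difference can be sketched with a small heuristic (a hypothetical illustration, not the pelias/parser implementation):

```javascript
// A lone letter after a number is a safe housenumber suffix when it is
// followed by further tokens; when it is the final token it may equally
// be the start of an unfinished word, so the decision is ambiguous.
function isLikelyHousenumberSuffix (tokens, index) {
  const isSingleAlpha = /^[A-Za-z]$/.test(tokens[index])
  const followsNumber = index > 0 && /^\d+$/.test(tokens[index - 1])
  const isFinalToken = index === tokens.length - 1
  return isSingleAlpha && followsNumber && !isFinalToken
}

// '10 A Main Street' – the 'A' is followed by street tokens
console.log(isLikelyHousenumberSuffix(['10', 'A', 'Main', 'Street'], 1)) // true

// 'Calle Principal 20 B' – the 'B' is the final token (e.g. 'B...arcelona')
console.log(isLikelyHousenumberSuffix(['Calle', 'Principal', '20', 'B'], 3)) // false
```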

This PR consists of 3 commits:

  1. Reclassify bis/ter as a Subdivision. This new classification is used instead of Stopword, as that classification is too generic to do anything really useful with.
  2. Remove the single letter 'a' from the English stopwords list. This was a mistake to have in the first place, since it's very short and exists in many languages. In doing so I discovered that our classifiers depend on StopWordClassification in far too many places; this is unfortunate, but I didn't want to attempt too many changes there. Instead I opted to create a new classification, SingleAlphaClassification, which simply represents a single alpha character. This allowed me to pass the failing tests for terms which relied on 'a' being classified as a stopword.
  3. Add tests covering the behaviour mentioned in this PR.
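The two new classifications from commits 1 and 2 can be sketched roughly like this (a simplified, self-contained approximation; the class shapes and classification names here are assumptions in the spirit of pelias/parser, not its actual API):

```javascript
// French repeat-number terms from commit 1 (only bis/ter are mentioned here)
const SUBDIVISION_TERMS = new Set(['bis', 'ter'])

class SubdivisionClassifier {
  classify (token) {
    if (SUBDIVISION_TERMS.has(token.toLowerCase())) return 'SubdivisionClassification'
  }
}

class SingleAlphaClassifier {
  classify (token) {
    // a single alphabetic character, e.g. the 'A' in '10 A Main Street'
    if (/^[a-z]$/i.test(token)) return 'SingleAlphaClassification'
  }
}

const classifiers = [new SubdivisionClassifier(), new SingleAlphaClassifier()]
const classify = (token) =>
  classifiers.map(c => c.classify(token)).filter(Boolean)

console.log(classify('bis'))  // ['SubdivisionClassification']
console.log(classify('a'))    // ['SingleAlphaClassification']
console.log(classify('main')) // []
```

The key point is that Subdivision and SingleAlpha are narrow, specific labels that downstream solvers can reason about, unlike the overly generic Stopword classification.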

resolves: #191

@missinglink missinglink merged commit d52fa62 into master Sep 22, 2025
6 checks passed
@missinglink missinglink deleted the parser-fixes branch September 22, 2025 14:56
Successfully merging this pull request may close these issues: Unable to classify Indiana as IN