@missinglink missinglink commented Sep 18, 2025

The recent work in #180 & #179 introduced some regressions in Spanish parsing while improving the French bis/ter parsing; this never sat well with me.

Specifically, the following regression was concerning:

(0.84) ➜ [ { street: 'Calle Principal' }, { housenumber: '20' } ]
vs.
(0.90) ➜ [ { street: 'Calle Principal' }, { housenumber: '20 B' } ]

At the time this appeared to be a positive change, as the unit number seems to be correctly parsed. However, it didn't work well when used in the pelias/api codebase: the query generation would treat this as a strict query for "20 B", returning nothing if that housenumber doesn't exist, rather than matching "20 B..." which, in this case, was the beginning of "Barcelona".
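To illustrate the problem (this is a self-contained sketch, not the actual pelias/api query generation; the document set and function names are made up for illustration):

```javascript
// Hypothetical indexed document: the real address is 'Calle Principal 20, Barcelona'
const docs = [
  { housenumber: '20', street: 'calle principal', locality: 'barcelona' }
]

// strict query: the housenumber must equal the parsed value exactly,
// so '20 B' matches nothing
function strictMatch (docs, housenumber) {
  return docs.filter(d => d.housenumber === housenumber.toLowerCase())
}

// lenient query: the trailing single letter may instead be the start of
// an unfinished word (here 'Barcelona'), so match the numeric part and
// treat the remainder as a prefix of the next field
function lenientMatch (docs, housenumber) {
  const [num, ...rest] = housenumber.toLowerCase().split(' ')
  return docs.filter(d =>
    d.housenumber === num &&
    rest.every(prefix => d.locality.startsWith(prefix))
  )
}

console.log(strictMatch(docs, '20 B').length)  // 0 – nothing returned
console.log(lenientMatch(docs, '20 B').length) // 1 – matches Barcelona
```

The strict behaviour is what made the (0.90) parse above a regression in practice, despite looking like a better parse on paper.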

So in order to avoid reverting the positive change to the bis/ter parsing of French addresses, I had to dig into the code and make some non-trivial changes:

"1 bis Rue Ballainvilliers 63000 Clermont-Ferrand"
(0.76) ➜ [
  { street: 'Rue Ballainvilliers' },
  { housenumber: '63000' },
  { locality: 'Clermont-Ferrand' }
]
vs.
(0.95) ➜ [
  { housenumber: '1 bis' },
  { street: 'Rue Ballainvilliers' },
  { postcode: '63000' },
  { locality: 'Clermont-Ferrand' }
]

There was one other beneficial change in that PR, which was unplanned:

(0.98) ➜ [ { housenumber: '10' }, { street: 'A Main Street' } ]
vs.
(0.86) ➜ [ { housenumber: '10 A' }, { street: 'Main Street' } ]

This seems to be an improvement which I will unfortunately revert in this PR; the "A Main Street" parse is definitely not great, and I would like to improve this in the future.

For future reference, the "A Main Street" parse should be easier to solve than the "Calle Principal 20 B" parse simply because the unit number designation isn't the final token.
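The positional difference can be sketched with a small heuristic (a hypothetical illustration, not the pelias/parser implementation):

```javascript
// A lone letter after a number is a safe housenumber suffix when it is
// followed by further tokens; when it is the final token it may equally
// be the start of an unfinished word, so the decision is ambiguous.
function isLikelyHousenumberSuffix (tokens, index) {
  const isSingleAlpha = /^[A-Za-z]$/.test(tokens[index])
  const followsNumber = index > 0 && /^\d+$/.test(tokens[index - 1])
  const isFinalToken = index === tokens.length - 1
  return isSingleAlpha && followsNumber && !isFinalToken
}

// '10 A Main Street' – the 'A' is followed by street tokens
console.log(isLikelyHousenumberSuffix(['10', 'A', 'Main', 'Street'], 1)) // true

// 'Calle Principal 20 B' – the 'B' is the final token (e.g. 'B...arcelona')
console.log(isLikelyHousenumberSuffix(['Calle', 'Principal', '20', 'B'], 3)) // false
```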

This PR consists of 3 commits:

  1. Reclassify bis/ter as a Subdivision. This new classification is used instead of Stopword, as that classification is too generic to do anything really useful with.
  2. Remove the single letter 'a' from the English stopwords list. This was a mistake to have in the first place, since it's very short and exists in many languages. In doing so I discovered that our classifiers depend on StopWordClassification in far too many places; this is unfortunate, but I didn't want to attempt too many changes there. Instead I opted to create a new classification, SingleAlphaClassification, which simply represents a single alpha character. This allowed me to pass the failing tests for terms which relied on 'a' being classified as a stopword.
  3. Add tests covering the behaviour mentioned in this PR.
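The two new classifications from commits 1 and 2 can be sketched roughly like this (a simplified, self-contained approximation; the class shapes and classification names here are assumptions in the spirit of pelias/parser, not its actual API):

```javascript
// French repeat-number terms from commit 1 (only bis/ter are mentioned here)
const SUBDIVISION_TERMS = new Set(['bis', 'ter'])

class SubdivisionClassifier {
  classify (token) {
    if (SUBDIVISION_TERMS.has(token.toLowerCase())) return 'SubdivisionClassification'
  }
}

class SingleAlphaClassifier {
  classify (token) {
    // a single alphabetic character, e.g. the 'A' in '10 A Main Street'
    if (/^[a-z]$/i.test(token)) return 'SingleAlphaClassification'
  }
}

const classifiers = [new SubdivisionClassifier(), new SingleAlphaClassifier()]
const classify = (token) =>
  classifiers.map(c => c.classify(token)).filter(Boolean)

console.log(classify('bis'))  // ['SubdivisionClassification']
console.log(classify('a'))    // ['SingleAlphaClassification']
console.log(classify('main')) // []
```

The key point is that Subdivision and SingleAlpha are narrow, specific labels that downstream solvers can reason about, unlike the overly generic Stopword classification.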

resolves: #191

@missinglink missinglink merged commit d52fa62 into master Sep 22, 2025
6 checks passed
@missinglink missinglink deleted the parser-fixes branch September 22, 2025 14:56
Successfully merging this pull request may close these issues: Unable to classify Indiana as IN