Fix filter_matches skipping parens at doc index 1 but not 0#580
Open
Chessing234 wants to merge 1 commit into
Open
Fix filter_matches skipping parens at doc index 1 but not 0#580Chessing234 wants to merge 1 commit into
Chessing234 wants to merge 1 commit into
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Bug
`scispacy/abbreviation.py:filter_matches` iterates the `( X )` matcher output and guards against spans near the start of the doc with:
```python
Ignore spans with more than 8 words in them, and spans at the start of the doc
if end - start > 8 or start == 1:
continue
if end - start > 3:
# Long form is inside the parens.
# Take one word before.
short_form_candidate = doc[start - 2 : start - 1]
long_form_candidate = doc[start:end]
...
else:
...
long_form_candidate = doc[max(start - max_words - 1, 0) : start - 1]
```
The `start == 1` guard exists so that `doc[start - 2 : start - 1]` / `doc[... : start - 1]` don't produce the `doc[-1:0]` empty slice. It only catches `start == 1` and leaves `start == 0` (paren-at-doc-start, e.g. biomedical sentences like `"(ABC) is ..."`) falling through to the same slicing code.
For `start == 0`, the long-form-in-parens branch computes `doc[-2:-1]` — a negative slice that Python silently interprets as "the token second-from-last of the entire doc", so `short_form_candidate` becomes a random unrelated token from the end of the document. In the normal branch, `doc[max(-max_words - 1, 0) : -1]` becomes `doc[0 : -1]`, the entire doc minus the final token, giving an arbitrarily large bogus long-form candidate. Both are silent wrong-answer failures rather than clean skips.
Root cause
The guard was written as `start == 1` instead of a range check. It correctly rejects `start == 1` (where `start - 1 == 0`) but does not reject `start == 0` (where `start - 1 == -1` triggers negative indexing). The two following slices both use `start - 1` (and `start - 2` in the long-form-in-parens branch), so both require `start >= 2` to index forwards from the doc start.
Why the fix is correct
Switching `start == 1` to `start < 2` covers both `start == 0` and `start == 1` with a single check, matching the actual slicing precondition (`start - 2` / `start - 1` must be `>= 0`). No other behavior changes: every `start >= 2` case still enters the existing slicing branches unchanged, and the existing `start == 1` skip is preserved.