Skip to content

Fix filter_matches skipping parens at doc index 1 but not 0#580

Open
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/abbreviation-start-zero-wraparound
Open

Fix filter_matches skipping parens at doc index 1 but not 0#580
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/abbreviation-start-zero-wraparound

Conversation

@Chessing234
Copy link
Copy Markdown

Bug

`scispacy/abbreviation.py:filter_matches` iterates the `( X )` matcher output and guards against spans near the start of the doc with:

```python

Ignore spans with more than 8 words in them, and spans at the start of the doc

if end - start > 8 or start == 1:
continue
if end - start > 3:
# Long form is inside the parens.
# Take one word before.
short_form_candidate = doc[start - 2 : start - 1]
long_form_candidate = doc[start:end]
...
else:
...
long_form_candidate = doc[max(start - max_words - 1, 0) : start - 1]
```

The `start == 1` guard exists so that `doc[start - 2 : start - 1]` / `doc[... : start - 1]` don't produce the `doc[-1:0]` empty slice. It only catches `start == 1` and leaves `start == 0` (paren-at-doc-start, e.g. biomedical sentences like `"(ABC) is ..."`) falling through to the same slicing code.

For `start == 0`, the long-form-in-parens branch computes `doc[-2:-1]` — a negative slice that Python silently interprets as "the token second-from-last of the entire doc", so `short_form_candidate` becomes a random unrelated token from the end of the document. In the normal branch, `doc[max(-max_words - 1, 0) : -1]` becomes `doc[0 : -1]`, the entire doc minus the final token, giving an arbitrarily large bogus long-form candidate. Both are silent wrong-answer failures rather than clean skips.

Root cause

The guard was written as `start == 1` instead of a range check. It correctly rejects `start == 1` (where `start - 1 == 0`) but does not reject `start == 0` (where `start - 1 == -1` triggers negative indexing). The two following slices both use `start - 1` (and `start - 2` in the long-form-in-parens branch), so both require `start >= 2` to index forwards from the doc start.

Why the fix is correct

Switching `start == 1` to `start < 2` covers both `start == 0` and `start == 1` with a single check, matching the actual slicing precondition (`start - 2` / `start - 1` must be `>= 0`). No other behavior changes: every `start >= 2` case still enters the existing slicing branches unchanged, and the existing `start == 1` skip is preserved.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant