Fix filter_matches skipping parens at doc index 1 but not 0 by Chessing234 · Pull Request #580 · allenai/scispacy

Chessing234 · 2026-04-19T04:15:13Z

Bug

`scispacy/abbreviation.py:filter_matches` iterates the `( X )` matcher output and guards against spans near the start of the doc with:

```python

Ignore spans with more than 8 words in them, and spans at the start of the doc

if end - start > 8 or start == 1:
continue
if end - start > 3:
# Long form is inside the parens.
# Take one word before.
short_form_candidate = doc[start - 2 : start - 1]
long_form_candidate = doc[start:end]
...
else:
...
long_form_candidate = doc[max(start - max_words - 1, 0) : start - 1]
```

The `start == 1` guard exists so that `doc[start - 2 : start - 1]` / `doc[... : start - 1]` don't produce the `doc[-1:0]` empty slice. It only catches `start == 1` and leaves `start == 0` (paren-at-doc-start, e.g. biomedical sentences like `"(ABC) is ..."`) falling through to the same slicing code.

For `start == 0`, the long-form-in-parens branch computes `doc[-2:-1]` — a negative slice that Python silently interprets as "the token second-from-last of the entire doc", so `short_form_candidate` becomes a random unrelated token from the end of the document. In the normal branch, `doc[max(-max_words - 1, 0) : -1]` becomes `doc[0 : -1]`, the entire doc minus the final token, giving an arbitrarily large bogus long-form candidate. Both are silent wrong-answer failures rather than clean skips.

Root cause

The guard was written as `start == 1` instead of a range check. It correctly rejects `start == 1` (where `start - 1 == 0`) but does not reject `start == 0` (where `start - 1 == -1` triggers negative indexing). The two following slices both use `start - 1` (and `start - 2` in the long-form-in-parens branch), so both require `start >= 2` to index forwards from the doc start.

Why the fix is correct

Switching `start == 1` to `start < 2` covers both `start == 0` and `start == 1` with a single check, matching the actual slicing precondition (`start - 2` / `start - 1` must be `>= 0`). No other behavior changes: every `start >= 2` case still enters the existing slicing branches unchanged, and the existing `start == 1` skip is preserved.

Fix filter_matches skipping parens at doc start 1 but not 0

9a090a8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix filter_matches skipping parens at doc index 1 but not 0#580

Fix filter_matches skipping parens at doc index 1 but not 0#580
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/abbreviation-start-zero-wraparound

Chessing234 commented Apr 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Chessing234 commented Apr 19, 2026

Bug

Ignore spans with more than 8 words in them, and spans at the start of the doc

Root cause

Why the fix is correct

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant