Close the TSV file when building the UMLS semantic-type tree#581

Open
Chessing234 wants to merge 1 commit into
allenai:mainfrom
Chessing234:fix/umls-semantic-type-tree-file-leak

Bug

construct_umls_tree_from_tsv in
scispacy/umls_semantic_type_tree.py leaks the TSV file handle on
every call: it iterates the file via

for line in open(cached_path(filepath), "r"):
    ...

and never closes the returned file object. CPython usually closes it
once the object's reference count drops to zero, but that is an
implementation detail rather than a guarantee: runtimes without
reference counting, such as PyPy, keep the file descriptor open until
a garbage-collection cycle finalizes the object. In a long-running
process that builds the tree repeatedly, each call can therefore leave
one descriptor open until then.
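
One way to observe the difference is to record the ResourceWarning that Python emits when an unclosed file object is finalized. The sketch below is illustrative only: count_lines_leaky, count_lines_safe, and resource_warnings_from are hypothetical helpers (not scispacy code), and the TSV content is made up. On CPython the leaky variant is expected to trigger a warning as soon as the loop ends; on other runtimes the timing depends on the garbage collector.

```python
import gc
import os
import tempfile
import warnings


def count_lines_leaky(path):
    # Same shape as the original loop: the file object is never bound to a name.
    count = 0
    for _ in open(path, "r"):
        count += 1
    return count


def count_lines_safe(path):
    # The with block closes the file as soon as the block is left.
    count = 0
    with open(path, "r") as handle:
        for _ in handle:
            count += 1
    return count


def resource_warnings_from(func, path):
    """Run func(path); return (result, number of ResourceWarnings recorded)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        result = func(path)
        gc.collect()  # give non-refcounting runtimes a chance to finalize
    leaks = [w for w in caught if issubclass(w.category, ResourceWarning)]
    return result, len(leaks)


# Hypothetical two-line TSV, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("Example Root\tT000\t0\nExample Child\tT001\t1\n")
    demo_path = tmp.name

leaky_result, leaky_warnings = resource_warnings_from(count_lines_leaky, demo_path)
safe_result, safe_warnings = resource_warnings_from(count_lines_safe, demo_path)
os.remove(demo_path)

print("leaky:", leaky_result, "lines,", leaky_warnings, "ResourceWarning(s)")
print("safe: ", safe_result, "lines,", safe_warnings, "ResourceWarning(s)")
```

Both variants read the same lines; only the context-manager version is guaranteed to release the descriptor at a known point.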

Root cause

The file object exists only as the implicit iterator of the `for`
statement and is never bound to a name, so there is no handle on which
to call `close()`. There is also no `with` block to tie the lifetime of
the file to the parsing.

Why the fix is correct

  • Other scispacy modules (file_cache.py, linking_utils.py, etc.)
    all open files via `with open(...) as f:`, and the new code matches
    that convention.
  • Behaviour is preserved exactly: same file, same iteration order, same
    parsing logic inside the loop; only the lifetime management changes.
  • No call sites need to be updated – the function signature and
    return value are unchanged.

Change

scispacy/umls_semantic_type_tree.py: wrap the TSV iteration in
`with open(cached_path(filepath), "r") as tsv_file:` and iterate over
`tsv_file`. The body of the `for` loop is indented one level deeper; no
other logic is touched.
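
The shape of the change can be sketched roughly as follows. This is not the actual diff: read_tsv_rows is a hypothetical stand-in for construct_umls_tree_from_tsv, scispacy's cached_path call is omitted so the example stays self-contained, and the three-column rows are invented for illustration.

```python
import os
import tempfile
from typing import List, Tuple


def read_tsv_rows(filepath: str) -> List[Tuple[str, ...]]:
    """Parse a tab-separated file, closing it when the with block exits."""
    rows: List[Tuple[str, ...]] = []
    # Binding the file to a name inside a with block ties its lifetime
    # to the parsing, even if the loop body raises.
    with open(filepath, "r") as tsv_file:
        for line in tsv_file:
            # Split on tabs and strip whitespace (including the newline).
            rows.append(tuple(field.strip() for field in line.split("\t")))
    return rows


# Hypothetical input, written to a temporary file for the demo.
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write("Example Root\tT000\t0\nExample Child\tT001\t1\n")
    demo_path = tmp.name

rows = read_tsv_rows(demo_path)
os.remove(demo_path)
print(rows)
```

The iteration order and per-line parsing are the same as with a bare `open(...)` in the `for` header; only the point at which the file is closed changes.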

construct_umls_tree_from_tsv iterated the source TSV with
    for line in open(cached_path(filepath), "r"):
which never closes the handle explicitly; the file object is only
released when its finalizer runs, and on runtimes without reference
counting (e.g. PyPy) that means the file descriptor can stay open well
past the function's return. Every other file open in scispacy uses an
explicit context manager (see file_cache.py, linking_utils.py, etc.),
so match that convention by wrapping the loop in
`with open(...) as tsv_file:`. Behaviour is otherwise unchanged: the
same lines are parsed in the same order.