Skip to content

Fix memory leak in TEI processing: cache JAXP factories, close streams#20

Merged
lfoppiano merged 2 commits into
devfrom
claude/fix-tei-memory-leak-1gLLI
Apr 16, 2026
Merged

Fix memory leak in TEI processing: cache JAXP factories, close streams#20
lfoppiano merged 2 commits into
devfrom
claude/fix-tei-memory-leak-1gLLI

Conversation

@lfoppiano
Copy link
Copy Markdown
Collaborator

PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no DocumentSource, but heap and file-descriptor usage still grew under sustained /processDatasetTEI load. Root causes:

  1. JAXP factory churn. DocumentBuilderFactory.newInstance(), XPathFactory.newInstance(), TransformerFactory.newInstance(), and SAXParserFactory.newInstance() ran on every request (and, in XMLUtilities.segment(), on every sentence). Each call re-runs ServiceLoader discovery and produces factories whose classloader- backed caches are not reclaimed promptly.
  2. DocumentBuilder.parse(File) defers FileInputStream closure to Xerces, accumulating FDs under sustained load.
  3. The parsed DOM Document was left as a local reference until method return, delaying young-gen reclaim of a large node graph.

Fix:

  • Cache factories as private static finals in DatasetParser and XMLUtilities, with synchronized accessors (newDocumentBuilder, newXPath, newSAXParser, newTransformer). Factories' new*() methods are not guaranteed thread-safe; synchronized access keeps contention negligible versus XML parse cost and avoids ThreadLocal leaks in the Dropwizard thread pool.
  • Parse TEI via try-with-resources FileInputStream so the handle is released deterministically.
  • Null the parsed Document reference in finally blocks to aid GC.
  • Remove dead DocumentBuilderFactory allocation in processXML(File).

No PDF-path changes; PR #7's DocumentSource.close() is preserved.

PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no
DocumentSource, but heap and file-descriptor usage still grew under
sustained /processDatasetTEI load. Root causes:

1. JAXP factory churn. DocumentBuilderFactory.newInstance(),
   XPathFactory.newInstance(), TransformerFactory.newInstance(), and
   SAXParserFactory.newInstance() ran on every request (and, in
   XMLUtilities.segment(), on every sentence). Each call re-runs
   ServiceLoader discovery and produces factories whose classloader-
   backed caches are not reclaimed promptly.
2. DocumentBuilder.parse(File) defers FileInputStream closure to
   Xerces, accumulating FDs under sustained load.
3. The parsed DOM Document was left as a local reference until
   method return, delaying young-gen reclaim of a large node graph.

Fix:
- Cache factories as private static finals in DatasetParser and
  XMLUtilities, with synchronized accessors (newDocumentBuilder,
  newXPath, newSAXParser, newTransformer). Factories' new*() methods
  are not guaranteed thread-safe; synchronized access keeps contention
  negligible versus XML parse cost and avoids ThreadLocal leaks in
  the Dropwizard thread pool.
- Parse TEI via try-with-resources FileInputStream so the handle is
  released deterministically.
- Null the parsed Document reference in finally blocks to aid GC.
- Remove dead DocumentBuilderFactory allocation in processXML(File).

No PDF-path changes; PR #7's DocumentSource.close() is preserved.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to stop heap and file-descriptor growth under sustained /processDatasetTEI load by reducing JAXP factory churn and ensuring XML inputs/DOMs are released more deterministically during TEI/XML processing.

Changes:

  • Cache JAXP factories (DocumentBuilderFactory/XPathFactory/SAXParserFactory/TransformerFactory) and provide synchronized new*() helpers to avoid repeated ServiceLoader discovery.
  • Parse TEI via explicitly managed streams (try-with-resources) to deterministically close file handles.
  • Remove redundant per-call factory allocations in hot paths (notably XMLUtilities.segment() and parts of DatasetParser).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
src/main/java/org/grobid/core/utilities/XMLUtilities.java Introduces cached JAXP factories and routes multiple XML/XPath/transform operations through synchronized helper constructors.
src/main/java/org/grobid/core/engines/DatasetParser.java Caches JAXP factories, switches TEI parsing to explicit InputStream handling, and reuses cached XPath creation.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/main/java/org/grobid/core/engines/DatasetParser.java Outdated
Comment thread src/main/java/org/grobid/core/engines/DatasetParser.java
Comment thread src/main/java/org/grobid/core/utilities/XMLUtilities.java
Addresses Copilot review feedback on the previous commit:

1. XXE/SSRF hardening. Because the cached DocumentBuilderFactory /
   SAXParserFactory / TransformerFactory parse user-supplied XML/TEI,
   the static initializers now apply OWASP-recommended hardening:
   - FEATURE_SECURE_PROCESSING
   - disallow-doctype-decl
   - external-general-entities = false
   - external-parameter-entities = false
   - nonvalidating/load-external-dtd = false
   - setXIncludeAware(false), setExpandEntityReferences(false)
   - TransformerFactory: ACCESS_EXTERNAL_DTD / ACCESS_EXTERNAL_STYLESHEET
     pinned to empty string
   Each feature/attribute is set in its own try/catch so unsupported
   options on a given JAXP implementation do not break class init.

2. Preserve systemId. DocumentBuilder.parse(File) previously set the
   document systemId to the file URI, which matters for relative
   references and error locations. Restore it by setting
   inputSource.setSystemId(file.toURI().toString()) on the InputSource
   wrapper in DatasetParser.processTEI while keeping the try-with-
   resources FileInputStream for deterministic stream closure.
@lfoppiano lfoppiano merged commit d29c1c3 into dev Apr 16, 2026
3 checks passed
@lfoppiano lfoppiano deleted the claude/fix-tei-memory-leak-1gLLI branch April 16, 2026 05:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants