Fix memory leak in TEI processing: cache JAXP factories, close streams#20
Merged
Conversation
PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no DocumentSource, but heap and file-descriptor usage still grew under sustained /processDatasetTEI load. Root causes: 1. JAXP factory churn. DocumentBuilderFactory.newInstance(), XPathFactory.newInstance(), TransformerFactory.newInstance(), and SAXParserFactory.newInstance() ran on every request (and, in XMLUtilities.segment(), on every sentence). Each call re-runs ServiceLoader discovery and produces factories whose classloader- backed caches are not reclaimed promptly. 2. DocumentBuilder.parse(File) defers FileInputStream closure to Xerces, accumulating FDs under sustained load. 3. The parsed DOM Document was left as a local reference until method return, delaying young-gen reclaim of a large node graph. Fix: - Cache factories as private static finals in DatasetParser and XMLUtilities, with synchronized accessors (newDocumentBuilder, newXPath, newSAXParser, newTransformer). Factories' new*() methods are not guaranteed thread-safe; synchronized access keeps contention negligible versus XML parse cost and avoids ThreadLocal leaks in the Dropwizard thread pool. - Parse TEI via try-with-resources FileInputStream so the handle is released deterministically. - Null the parsed Document reference in finally blocks to aid GC. - Remove dead DocumentBuilderFactory allocation in processXML(File). No PDF-path changes; PR #7's DocumentSource.close() is preserved.
There was a problem hiding this comment.
Pull request overview
This PR aims to stop heap and file-descriptor growth under sustained /processDatasetTEI load by reducing JAXP factory churn and ensuring XML inputs/DOMs are released more deterministically during TEI/XML processing.
Changes:
- Cache JAXP factories (DocumentBuilderFactory/XPathFactory/SAXParserFactory/TransformerFactory) and provide synchronized
new*()helpers to avoid repeated ServiceLoader discovery. - Parse TEI via explicitly managed streams (try-with-resources) to deterministically close file handles.
- Remove redundant per-call factory allocations in hot paths (notably
XMLUtilities.segment()and parts ofDatasetParser).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
src/main/java/org/grobid/core/utilities/XMLUtilities.java |
Introduces cached JAXP factories and routes multiple XML/XPath/transform operations through synchronized helper constructors. |
src/main/java/org/grobid/core/engines/DatasetParser.java |
Caches JAXP factories, switches TEI parsing to explicit InputStream handling, and reuses cached XPath creation. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Addresses Copilot review feedback on the previous commit:
1. XXE/SSRF hardening. Because the cached DocumentBuilderFactory /
SAXParserFactory / TransformerFactory parse user-supplied XML/TEI,
the static initializers now apply OWASP-recommended hardening:
- FEATURE_SECURE_PROCESSING
- disallow-doctype-decl
- external-general-entities = false
- external-parameter-entities = false
- nonvalidating/load-external-dtd = false
- setXIncludeAware(false), setExpandEntityReferences(false)
- TransformerFactory: ACCESS_EXTERNAL_DTD / ACCESS_EXTERNAL_STYLESHEET
pinned to empty string
Each feature/attribute is set in its own try/catch so unsupported
options on a given JAXP implementation do not break class init.
2. Preserve systemId. DocumentBuilder.parse(File) previously set the
document systemId to the file URI, which matters for relative
references and error locations. Restore it by setting
inputSource.setSystemId(file.toURI().toString()) on the InputSource
wrapper in DatasetParser.processTEI while keeping the try-with-
resources FileInputStream for deterministic stream closure.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
PR #7 fixed a leak in the PDF path by closing DocumentSource; TEI has no DocumentSource, but heap and file-descriptor usage still grew under sustained /processDatasetTEI load. Root causes:
Fix:
No PDF-path changes; PR #7's DocumentSource.close() is preserved.