Instantiate DatasetContextClassifier in DatasetParser and guard null case #19
Merged
…case

The datasetContextClassifier field was never initialized: the getInstance factory accepts it as a parameter but the private constructor discarded it, so `this.datasetContextClassifier = datasetContextClassifier` was a no-op field-to-itself assignment. This caused an NPE at DatasetParser.java:1485 on every /service/processDatasetPDF request.

Apply the same pattern as #18 (DatasetDisambiguator): eagerly instantiate via DatasetContextClassifier.getInstance(configuration) in the constructor, catch model-loading failures, and null-guard the call site with a warn-once helper so the service degrades gracefully when context models are unavailable.

Also bump project version to 0.9.0 in build.gradle, config.yml, config-docker.yml, and Readme.md.

https://claude.ai/code/session_01MLUqfjKtqyvzt9ALYKpshr
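The failure mode and the fix described above can be illustrated with a minimal, self-contained sketch. All class names here (`ContextClassifierSketch`, `BuggyParserSketch`, `FixedParserSketch`) are hypothetical stand-ins, not the project's real classes, and the boolean `modelsAvailable` flag is an assumption that replaces the real configuration object and model loading.

```java
// Hypothetical stand-in for DatasetContextClassifier; the real class loads models.
final class ContextClassifierSketch {
    private ContextClassifierSketch() {}

    // Assumed factory shape: throws when the models cannot be loaded.
    static ContextClassifierSketch getInstance(boolean modelsAvailable) {
        if (!modelsAvailable) {
            throw new IllegalStateException("context models not found");
        }
        return new ContextClassifierSketch();
    }

    String classify(String mention) {
        return "used"; // placeholder label for illustration only
    }
}

// Reproduces the bug: the factory takes a classifier, the constructor does not.
final class BuggyParserSketch {
    private ContextClassifierSketch contextClassifier;

    static BuggyParserSketch getInstance(boolean modelsAvailable,
                                         ContextClassifierSketch classifier) {
        return new BuggyParserSketch(modelsAvailable); // classifier is dropped here
    }

    private BuggyParserSketch(boolean modelsAvailable) {
        // No parameter named contextClassifier is in scope, so both sides
        // resolve to the field: a self-assignment that leaves it null.
        this.contextClassifier = contextClassifier;
    }

    String classify(String mention) {
        return contextClassifier.classify(mention); // throws NullPointerException
    }
}

// Sketch of the fix: eager init, contained failure, null-guarded call site.
final class FixedParserSketch {
    private final ContextClassifierSketch contextClassifier;

    FixedParserSketch(boolean modelsAvailable) {
        ContextClassifierSketch loaded = null;
        try {
            loaded = ContextClassifierSketch.getInstance(modelsAvailable);
        } catch (RuntimeException e) {
            // Model loading failed: keep the parser usable without context labels.
            System.err.println("context classifier unavailable: " + e.getMessage());
        }
        this.contextClassifier = loaded;
    }

    // Returns a context label, or null when the classifier could not be loaded.
    String classify(String mention) {
        if (contextClassifier == null) {
            return null;
        }
        return contextClassifier.classify(mention);
    }
}
```

The self-assignment compiles without error because the field is not `final`, which is likely why this kind of bug can go unnoticed until the first request reaches the call site.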
Summary
- Fixes an NPE at `DatasetParser.java:1485` on every `/service/processDatasetPDF` request: `this.datasetContextClassifier` was permanently `null` because the `getInstance(...)` factory accepted the classifier as a parameter but the private constructor it delegated to only took `configuration`, making `this.datasetContextClassifier = datasetContextClassifier` a field-to-itself no-op. The same pattern was already fixed for `DatasetDisambiguator` in Fix NPE when disambiguator is null in DatasetParser #18. This change applies it to `DatasetContextClassifier`: eager init via `DatasetContextClassifier.getInstance(configuration)` in the constructor, `try/catch` so model-loading failures don't cascade, a warn-once helper, and a null-guard at the call site for graceful degradation.
- `this.dataTypeClassifier = dataTypeClassifier;` (real init is lazy in `dataTypeClassify` at line 649-650 and untouched).
- Bumps the project version to `0.9.0` in `build.gradle`, `resources/config/config.yml`, `resources/config/config-docker.yml`, and `Readme.md` (docker tags, `GROBID_VERSION` build-arg, sample JSON response, build instructions). GROBID dependencies are already at `0.9.0` since Remove DataseerML, upgrade to grobid 0.9.0 and Gradle 8.5 #5.

Compatibility with recent fixes
- `DatasetParser.getInstance(serviceConfiguration, null, null, null)`. The three null args were already ignored by the constructor; the classifier is now sourced from `configuration` directly.
- `GrobidEngineInitialiser.preloadModels()` still marks `datasets` and `context_*` correctly. On success the later `DatasetContextClassifier.getInstance(configuration)` call short-circuits via the singleton guard (no double-load). On context-model failure, the `try/catch` in the constructor prevents false-blaming `datasets`, and the dedicated preload step re-attempts construction and marks `context_*` failed accurately in `ModelLoadStatus`.
- `warnContextClassifierNotAvailableOnce` follows the established `AtomicBoolean.compareAndSet` pattern.

Test plan
- `./gradlew compileJava`: passes.
- Start the service (`./gradlew run`) and `curl -F "input=@<sample>.pdf" http://localhost:8060/service/processDatasetPDF`. Expect HTTP 200, no NPE in logs, dataset mentions carrying `used`/`created`/`shared` context labels.
- Make `context_used`/`context_creation`/`context_shared` under `grobid-home/models/` unavailable, restart, and resend the request. Expect HTTP 200 with mentions (no context labels), a single warn-once log line, no NPE, and `/service/health` reporting the context models as failed.
- `/service/processDatasetPDF` JSON response shows `"version": "0.9.0"`.

https://claude.ai/code/session_01MLUqfjKtqyvzt9ALYKpshr
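For readers unfamiliar with the two mechanisms referenced under "Compatibility with recent fixes", here is a rough sketch of a warn-once helper built on `AtomicBoolean.compareAndSet` and of a singleton guard that avoids a second load. Only the method name `warnContextClassifierNotAvailableOnce` comes from this PR; its body, the `WarnOnceSketch` and `SingletonSketch` classes, the counters, and the double-checked locking shape are assumptions for illustration and may differ from the project's actual code.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical holder for the warn-once helper.
final class WarnOnceSketch {
    private static final AtomicBoolean WARNED = new AtomicBoolean(false);
    // Counter exists only so the behavior can be checked in this sketch.
    static final AtomicInteger WARNINGS_EMITTED = new AtomicInteger();

    static void warnContextClassifierNotAvailableOnce() {
        // compareAndSet flips false -> true atomically, so exactly one caller
        // wins, even when several request threads hit the guard at once.
        if (WARNED.compareAndSet(false, true)) {
            WARNINGS_EMITTED.incrementAndGet();
            System.err.println("WARN: context classifier not available; "
                    + "mentions will carry no context labels");
        }
    }
}

// Hypothetical singleton whose construction stands in for model loading.
final class SingletonSketch {
    private static volatile SingletonSketch instance;
    // Counts constructions to show that a second getInstance() does not reload.
    static final AtomicInteger LOADS = new AtomicInteger();

    private SingletonSketch() {
        LOADS.incrementAndGet(); // the expensive load would happen here
    }

    static SingletonSketch getInstance() {
        SingletonSketch local = instance;
        if (local == null) {
            synchronized (SingletonSketch.class) {
                local = instance;
                if (local == null) {
                    local = new SingletonSketch();
                    instance = local;
                }
            }
        }
        return local; // later calls short-circuit on the cached instance
    }
}
```

Calling `WarnOnceSketch.warnContextClassifierNotAvailableOnce()` on every degraded request keeps the log to a single line, which is what the second manual test above checks for.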