Error handling for Grobid when not responding #35
Conversation
Pull request overview
Adds explicit error handling for Grobid failures so the Streamlit UI can surface a clear “please try later” message instead of failing ambiguously (issue #11).
Changes:
- Introduce `GrobidServiceError` and raise it when Grobid errors or returns non-200.
- Catch `GrobidServiceError` in the Streamlit upload/embedding flow and display an error message.
- Add a (currently redundant) guard in `DocumentQAEngine` for missing Grobid output.
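A minimal sketch of such an exception type, inferred from the diffs in this PR (the `status_code` keyword appears in the raise site below; the real class lives in `document_qa/grobid_processors.py` and may differ):

```python
from typing import Optional

class GrobidServiceError(Exception):
    """Raised when the Grobid service is unreachable or returns a non-200 status."""

    def __init__(self, message: str, status_code: Optional[int] = None):
        super().__init__(message)
        # Keep the HTTP status so callers can distinguish "service down"
        # (status_code is None) from "service answered with an error".
        self.status_code = status_code
```

Carrying the status code lets the UI layer decide whether to show a retry hint or a harder failure message from a single `except` clause.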
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| streamlit_app.py | Catches Grobid failures during embedding creation and shows a user-facing error. |
| document_qa/grobid_processors.py | Defines `GrobidServiceError` and raises it from Grobid processing on failure/non-200. |
| document_qa/document_qa_engine.py | Imports/raises `GrobidServiceError` when Grobid structure is missing. |
lfoppiano
left a comment
There is a missing space; the rest looks fine. I did not test it, so please make sure you test it before merge/squash.
```python
tmp_file = NamedTemporaryFile()
tmp_file.write(bytearray(binary))
st.session_state['binary'] = binary

st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(
    tmp_file.name,
    chunk_size=chunk_size,
    perc_overlap=0.1
)
```
Yes, @Sanakhamassi, here you need to either open the temp file in a `with` block or handle its cleanup somehow.
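One way to address this, sketched with the names from the snippet above (`create_memory_embeddings` here is a stand-in callable, not the real bound method):

```python
from tempfile import NamedTemporaryFile

def embed_binary(binary: bytes, create_memory_embeddings):
    # The context manager guarantees the temp file is closed and deleted
    # even if embedding creation raises.
    with NamedTemporaryFile() as tmp_file:
        tmp_file.write(binary)
        tmp_file.flush()  # ensure the bytes are on disk before the path is read
        return create_memory_embeddings(tmp_file.name)
```

Note the `flush()` call: without it, a reader opening `tmp_file.name` may see an empty file because the written bytes are still buffered.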
```python
st.session_state['doc_id'] = None
st.session_state['loaded_embeddings'] = False
st.session_state['uploaded'] = False
st.error(f"{message} Please try later.")
```
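The reset-and-report flow above can be sketched without Streamlit (a plain dict stands in for `st.session_state`, `report` for `st.error`; all names here are illustrative):

```python
class GrobidServiceError(Exception):
    """Stand-in for the exception defined in document_qa/grobid_processors.py."""

def load_document(session_state: dict, create_embeddings, report) -> None:
    try:
        session_state['doc_id'] = create_embeddings()
        session_state['loaded_embeddings'] = True
        session_state['uploaded'] = True
    except GrobidServiceError as err:
        # Clear any partial state so the UI stays consistent,
        # then surface a user-facing message.
        session_state['doc_id'] = None
        session_state['loaded_embeddings'] = False
        session_state['uploaded'] = False
        report(f"{err} Please try later.")
```

Catching only `GrobidServiceError` keeps unrelated bugs visible: an unexpected exception still propagates instead of being masked by the "try later" message.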
```diff
 if grobid_url:
-    self.grobid_processor = GrobidProcessor(grobid_url)
+    self.grobid_processor = GrobidProcessor(grobid_url, ping_server=False)
```
```python
try:
    pdf_file, status, text = self.grobid_client.process_pdf("processFulltextDocument",
                                                            input_path,
                                                            consolidate_header=True,
                                                            consolidate_citations=False,
                                                            segment_sentences=False,
                                                            tei_coordinates=coordinates,
                                                            include_raw_citations=False,
                                                            include_raw_affiliations=False,
                                                            generateIDs=True)
except Exception as exc:
    raise GrobidServiceError("Grobid service did not respond.") from exc
```
```diff
 if status != 200:
-    return
+    raise GrobidServiceError(
+        f"Grobid service returned status {status}.",
+        status_code=status
+    )
```
lfoppiano
left a comment
Better than before; however, there are a few changes that need to be done.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
Comments suppressed due to low confidence (1)
document_qa/document_qa_engine.py:266
- The type annotation says `Tuple[List[Document], list]`, but this method actually returns a list of strings (`context_as_text`) plus coordinates. Update the annotation to match the returned value (e.g., `Tuple[List[str], list]`) or return `List[Document]` if that’s what callers should get.
```python
def query_storage(self, query: str, doc_id, context_size=4) -> Tuple[List[Document], list]:
    """
    Returns the context related to a given query
    """
    documents, coordinates = self._get_context(doc_id, query, context_size)
```
```python
st.session_state['binary'] = binary

st.session_state['doc_id'] = hash = st.session_state['rqa'][model].create_memory_embeddings(
    tmp_path,
    chunk_size=chunk_size,
    perc_overlap=0.1
)
```
```diff
     output_parser=None,
     context_size=4,
     extraction_schema=None,
     verbose=False
-) -> (Any, str):
+) -> Tuple[Any, str]:
```
```diff
     return parsed_output

-def _run_query(self, doc_id, query, context_size=4) -> (List[Document], list):
+def _run_query(self, doc_id, query, context_size=4) -> Tuple[List[Document], list]:
```
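For context on why this change matters: `-> (List[Document], list)` annotates the return type as a plain two-element tuple *object* of types, which type checkers reject, whereas `typing.Tuple[...]` (or `tuple[...]` on Python 3.9+) is the correct spelling. A quick sketch with a stand-in `Document` class (both forms run at runtime, since Python does not enforce annotations, but only the second type-checks):

```python
from typing import List, Tuple

class Document:
    """Stand-in for the real Document class."""

# Questionable: the annotation evaluates to a 2-tuple of type objects,
# which is not a valid type for checkers like mypy.
def run_query_old(n: int) -> (List[Document], list):
    return [Document() for _ in range(n)], []

# Correct: Tuple[...] describes "a tuple whose items have these types".
def run_query_new(n: int) -> Tuple[List[Document], list]:
    return [Document() for _ in range(n)], []
```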
```python
                                                            include_raw_affiliations=False,
                                                            generateIDs=True)
except Exception as exc:
    raise GrobidServiceError("Grobid service did not respond.") from exc
```
```diff
     grobid_url=None,
-    memory=None
+    memory=None,
+    ping_grobid_server: bool = False
```
Related to issue #11