Convert a P&ID PDF into a structured graph, cross-reference it against SOP limits, and generate both machine-readable findings and a simple interactive HTML viewer.
- Renders P&ID pages from `data/pid/diagram.pdf`
- Runs OCR + clustering to detect text regions
- Uses an LLM to normalize OCR text into equipment tags and attributes
- Uses a vision model to infer process connectivity (edges) between confirmed tags and populate missed details
- Builds a graph (`graph.json`) with node attributes and edges
- Parses SOP records from `data/sop/sop.docx`
- Audits SOP vs P&ID for: `missing_in_pid`, `pressure_mismatch`, `temperature_mismatch`
- Generates: `outputs/findings.json`, `outputs/report.md`, `outputs/graph_viewer.html`
- `src/pid_audit/main.py` - pipeline entrypoint
- `src/pid_audit/ingest.py` - PDF render
- `src/pid_audit/ocr.py` - OCR token detection + clustering
- `src/pid_audit/ocr_correct.py` - LLM OCR correction + tag extraction
- `src/pid_audit/vision.py` - vision graph extraction
- `src/pid_audit/graph_build.py` - merges OCR + vision into graph
- `src/pid_audit/sop_parse.py` - SOP table parsing
- `src/pid_audit/audit.py` - deterministic SOP/P&ID checks
- `src/pid_audit/graph_ui.py` - interactive HTML graph viewer
- `src/pid_audit/report.py` - markdown report generation
- Python 3.11+
- Poetry
- Tesseract OCR installed
macOS:

```bash
brew install tesseract
```

Install dependencies:

```bash
poetry install
```

Ensure your `.env` already contains:

```bash
OPENROUTER_API_KEY=your_key_here
```

Model aliases are set in `src/pid_audit/client.py` (see the sketch below):

- `TEXT_MODEL` (used for OCR correction)
- `VISION_MODEL` (used for graph extraction)
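A minimal sketch of what those aliases can look like, assuming OpenRouter is reached through the OpenAI-compatible SDK (the model IDs and the helper are placeholders, not the repository's actual code):

```python
# Illustrative sketch of src/pid_audit/client.py -- the real module may differ.
import os

from openai import OpenAI  # OpenRouter exposes an OpenAI-compatible API

# Placeholder aliases; substitute the model IDs you want to benchmark.
TEXT_MODEL = "your-provider/your-text-model"      # used for OCR correction
VISION_MODEL = "your-provider/your-vision-model"  # used for graph extraction


def get_client() -> OpenAI:
    """Build a client pointed at OpenRouter using the key from .env."""
    return OpenAI(
        base_url="https://openrouter.ai/api/v1",
        api_key=os.environ["OPENROUTER_API_KEY"],
    )
```

Swapping models for a benchmark run then only touches these two constants.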
```bash
poetry run pid-audit
```

This executes all steps and writes outputs under `outputs/`.
You can view generated graphs for multiple VLM options via the direct viewer links:
- Sonnet 4.6: https://kartik34.github.io/pid-entity-detection-graph-construction/viewers/sonnet-4.6.html
- GPT-5.4: https://kartik34.github.io/pid-entity-detection-graph-construction/viewers/gpt-5.4.html
- Grok 4.2 Beta: https://kartik34.github.io/pid-entity-detection-graph-construction/viewers/grok-4.2-beta.html
- Gemini 3 Flash Preview: https://kartik34.github.io/pid-entity-detection-graph-construction/viewers/gemini-3-flash-preview.html
Notes:
- Grok 4.2 Beta detected edges and populated relevant details reasonably well at a relatively low cost: roughly 10 cents in tokens per pipeline run on this 3-page PDF.
- Sonnet 4.6 also performed very well, but each pipeline run cost about 30-50 cents in tokens.
- At scale (assuming 10,000 documents per client), this could cost up to 5-7k USD in VLM tokens.
`main.py` runs 9 steps (see the sketch after this list):
- Ingest documents (render PDF pages)
- Parse SOP
- OCR and clustering
- OCR correction with LLM
- Vision graph extraction
- Graph build
- Audit
- Report
- HTML graph viewer
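A simplified sketch of that orchestration; the per-step function names below are hypothetical (the real entrypoints live in the modules listed earlier):

```python
# Hypothetical orchestration sketch of src/pid_audit/main.py.
# Real function names and signatures in the modules may differ.
from pid_audit import (
    ingest, sop_parse, ocr, ocr_correct, vision,
    graph_build, audit, report, graph_ui,
)


def main() -> None:
    pages = ingest.render_pages("data/pid/diagram.pdf")       # 1. ingest documents
    sop = sop_parse.parse("data/sop/sop.docx")                # 2. parse SOP
    clusters = ocr.detect_and_cluster(pages)                  # 3. OCR + clustering
    tags = ocr_correct.correct(clusters)                      # 4. LLM OCR correction
    edges = vision.extract_graph(pages, tags)                 # 5. vision extraction
    graph = graph_build.build(tags, edges)                    # 6. graph build
    findings = audit.run(graph, sop)                          # 7. audit
    report.write(findings, "outputs/report.md")               # 8. report
    graph_ui.write(graph, "outputs/graph_viewer.html")        # 9. HTML viewer
```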
Core outputs:
- `outputs/sop_structured.json`
- `outputs/confirmed_tags.json`
- `outputs/graph.json`
- `outputs/findings.json`
- `outputs/report.md`
- `outputs/graph_viewer.html`
Debug images:
- `outputs/debug_1_ocr_raw_page_{n}.png`
- `outputs/debug_2_ocr_clusters_page_{n}.png`
- `outputs/debug_3_confirmed_tags_page_{n}.png`
`graph.json` contains:

- `nodes`: each with `id` (instance key, e.g. `TAG@p2`), `tag` (human-readable tag), `component_type`, `equipment_class` (major/minor/external), `page`, `bbox`, `confidence`, `needs_review`, `family`, `source`, and extracted attributes
- `edges`: each with `source`, `target`, `pipe_label`, `flow_direction`
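A minimal illustrative example of the shape (the values, the `bbox` convention, and the `attributes` key are invented for illustration):

```json
{
  "nodes": [
    {
      "id": "P-101@p2",
      "tag": "P-101",
      "component_type": "pump",
      "equipment_class": "major",
      "page": 2,
      "bbox": [412, 880, 468, 912],
      "confidence": 0.93,
      "needs_review": false,
      "family": "P",
      "source": "ocr+vision",
      "attributes": {"design_pressure": "10 barg"}
    }
  ],
  "edges": [
    {
      "source": "P-101@p2",
      "target": "E-201@p2",
      "pipe_label": "4\"-PW-1001",
      "flow_direction": "source_to_target"
    }
  ]
}
```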
- Used Tesseract to extract raw text boxes, then clustered nearby detections into candidate tag regions.
- Dense P&ID areas and linework (pipes/symbols) reduced OCR quality and made deterministic parsing brittle.
- Added an LLM correction step with strict JSON schema output validation.
- This normalizes noisy OCR text into ISA-style tags and structured attributes.
- This approach was chosen because pure deterministic parsing was not robust enough on messy OCR output (see the sketch below).
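A condensed sketch of those two steps under stated assumptions: proximity clustering of Tesseract word boxes, then accepting the LLM's correction only when it is valid JSON with the expected keys. The helper names, the `gap` heuristic, and the required-key set are illustrative, not the repository's actual code.

```python
# Illustrative sketch: Tesseract boxes -> clusters -> validated LLM output.
import json

import pytesseract
from PIL import Image


def detect_boxes(page_png: str, min_conf: float = 40.0) -> list[dict]:
    """Raw word boxes from Tesseract, dropping low-confidence junk."""
    data = pytesseract.image_to_data(
        Image.open(page_png), output_type=pytesseract.Output.DICT
    )
    return [
        {"text": t, "x": x, "y": y, "w": w, "h": h}
        for t, x, y, w, h, c in zip(
            data["text"], data["left"], data["top"],
            data["width"], data["height"], data["conf"],
        )
        if t.strip() and float(c) >= min_conf
    ]


def cluster(boxes: list[dict], gap: int = 12) -> list[list[dict]]:
    """Greedy clustering: attach a box to the first cluster that has a
    member within roughly one box-size plus `gap` pixels of it."""
    clusters: list[list[dict]] = []
    for box in boxes:
        for group in clusters:
            if any(
                abs(box["x"] - b["x"]) <= b["w"] + gap
                and abs(box["y"] - b["y"]) <= b["h"] + gap
                for b in group
            ):
                group.append(box)
                break
        else:
            clusters.append([box])
    return clusters


REQUIRED_KEYS = {"tag", "component_type", "confidence"}  # assumed schema


def parse_llm_tag(raw: str) -> dict | None:
    """Accept the LLM's correction only if it is valid, complete JSON."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys() else None
```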
- Used a multimodal VLM to infer connectivity (edges) between confirmed OCR tags.
- Inputs to the VLM include confirmed tags + bounding boxes, the base page image, an overlay image with highlighted boxes, and image dimensions for spatial context.
- If the VLM returns empty or invalid JSON, confirmed nodes are still emitted with `needs_review=true` (sketched below).
- A fully deterministic graph-construction approach was explored first, but required significantly more modeling/tuning for dense diagrams.
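A minimal sketch of that fallback, assuming the vision step returns a JSON string with an `edges` array (function and field names are illustrative):

```python
# Illustrative sketch of the vision-step fallback, not the actual code.
import json


def build_graph_payload(vlm_response: str, confirmed_tags: list[dict]) -> dict:
    """Use VLM edges when the JSON is valid; otherwise keep nodes, flag review."""
    try:
        parsed = json.loads(vlm_response)
        edges = parsed.get("edges", [])
    except (json.JSONDecodeError, AttributeError):
        parsed, edges = None, []

    nodes = [dict(t) for t in confirmed_tags]
    if parsed is None or not edges:
        # VLM output was empty or invalid: still emit confirmed nodes,
        # but mark every node for human review.
        for node in nodes:
            node["needs_review"] = True
    return {"nodes": nodes, "edges": edges}
```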
- Audit logic is deterministic and rule-based (missing equipment, pressure mismatch, temperature mismatch).
- The same pattern can be extended with configurable rule sets for different plant or spec standards (see the sketch below).
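A sketch of what such a rule set can look like, assuming SOP rows and node attributes share `pressure`/`temperature` fields; the finding shape is an assumption, and only the three `kind` values come from this pipeline:

```python
# Illustrative sketch of the deterministic audit rules; actual field names
# in sop_structured.json / graph.json may differ.


def audit(sop_rows: list[dict], graph: dict) -> list[dict]:
    nodes_by_tag = {n["tag"]: n for n in graph["nodes"]}
    findings = []
    for row in sop_rows:
        node = nodes_by_tag.get(row["tag"])
        if node is None:
            # Equipment listed in the SOP but absent from the P&ID graph.
            findings.append({"kind": "missing_in_pid", "tag": row["tag"]})
            continue
        attrs = node.get("attributes", {})
        for kind, field in [
            ("pressure_mismatch", "pressure"),
            ("temperature_mismatch", "temperature"),
        ]:
            sop_val, pid_val = row.get(field), attrs.get(field)
            if sop_val is not None and pid_val is not None and sop_val != pid_val:
                findings.append({
                    "kind": kind, "tag": row["tag"],
                    "sop": sop_val, "pid": pid_val,
                })
    return findings
```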
- Dense pages can reduce edge recall/precision in vision extraction.
- SOP parsing currently expects key limits from the first table in the DOCX.
- The only implemented audit rules are missing equipment in the P&ID, pressure mismatch, and temperature mismatch.
- Missing `OPENROUTER_API_KEY`: add the key to `.env`.
- OCR quality is low: tune `PID_AUDIT_OCR_UPSCALE_FACTOR`, `PID_AUDIT_OCR_CROP_PAD`, and `PID_AUDIT_OCR_LENIENT_MIN_BOX_AREA`, then rerun (example below).
- Viewer looks stale: rerun the pipeline or regenerate with `pid-audit-graph-ui`.
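For example, assuming these knobs are read straight from the environment, a rerun with more aggressive upscaling might look like this (the values shown are illustrative, not recommended defaults):

```bash
PID_AUDIT_OCR_UPSCALE_FACTOR=3 PID_AUDIT_OCR_CROP_PAD=8 poetry run pid-audit
```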