OCR - GLM-OCR upgrade, removing GOOGLE CLOUD VISION API completely #19

@staticvoidmainmaui

https://github.com/zai-org/GLM-OCR

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
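To make the two-stage idea above concrete, here is a minimal sketch of "layout analysis, then parallel recognition" in Python. It is not the GLM-OCR API: `Region`, `run_two_stage_ocr`, `FAKE_PAGE`, and the two `fake_*` functions are hypothetical names invented for illustration, and the top-to-bottom reading-order sort is an assumption. In a real integration the two callables would wrap the layout model and the recognition model.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class Region:
    """One block found by layout analysis (hypothetical type)."""
    kind: str  # e.g. "title", "table", "text"
    box: Box


def run_two_stage_ocr(
    page: Any,
    detect_layout: Callable[[Any], List[Region]],
    recognize_region: Callable[[Any, Region], str],
    max_workers: int = 4,
) -> List[Tuple[Region, str]]:
    # Stage 1: layout analysis splits the page into regions.
    regions = detect_layout(page)
    # Assumed reading order: top to bottom, then left to right.
    regions = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    # Stage 2: recognize every region in parallel; map() keeps input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(lambda r: recognize_region(page, r), regions))
    return list(zip(regions, texts))


# Stand-ins for the real models so the sketch runs without any weights.
FAKE_PAGE = {
    (0, 120, 400, 300): ("table", "2 cups flour"),
    (0, 0, 400, 100): ("title", "Pancakes"),
}


def fake_detect_layout(page):
    return [Region(kind, box) for box, (kind, _text) in page.items()]


def fake_recognize_region(page, region):
    return page[region.box][1]


if __name__ == "__main__":
    for region, text in run_two_stage_ocr(
        FAKE_PAGE, fake_detect_layout, fake_recognize_region
    ):
        print(region.kind, "->", text)
```

Passing the two stages in as callables keeps the pipeline independent of any one OCR backend, which should make it easier to swap the existing vision API call for GLM-OCR behind the same function.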

We are integrating this model in place of a boring vision API. Since we are already using agentic development, this should give our implementation more depth, as we are integrating a repo with an open-source license. The professor said this is completely allowed as long as GatorChef doesn't become a paid service.

WHY WOULD YOU TAKE THIS ISSUE?

  • NOVELTY: integrating this in our app is something new and powerful. You're going to love it once you have basically commandeered the essence of what GatorChef was supposed to be
  • RESUME BOOSTER: integrating this OCR model is a resume booster in itself - "engineered an advanced text-extraction pipeline around a state-of-the-art OCR model trained with Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning"
  • THE Pipe Absorber prize: the project manager will hold a private coding session with you once he likes what he sees 🥇

Metadata

Labels

  • enhancement: New feature or request
  • help wanted: Extra attention is needed
