bratjay01/ZeroShotVLM

Visual Text Grounding with Florence-2

This project implements visual text grounding: it identifies objects in images from text descriptions, using Florence-2-large with a Gradio interface for interactivity. Performance metrics such as FLOPs, FPS, and inference time (ms) were planned, but due to time constraints they have not been added yet. I will attach articles to justify the edge compatibility of Florence-2. I have attached a ChatGPT document with a brief literature survey.

Detection and Grounding Strategy

  • Input Processing: Converts images to RGB, resizes to 512x512 pixels for efficiency, and combines text prompts (e.g., "pen") with <CAPTION_TO_PHRASE_GROUNDING>.
  • Model Inference: Florence-2-large generates bounding box coordinates and labels, rescaled to the original image size.
  • Post-Processing: Draws red bounding boxes and labels on the image using PIL's ImageDraw.
  • Output: Saves the annotated image (output_image.jpg) along with the grounding results.
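
The prompt construction and box rescaling steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the repository's actual code: the helper names build_prompt and rescale_boxes are hypothetical, and it assumes boxes arrive as [x1, y1, x2, y2] pixel coordinates in the 512x512 resized image.

```python
from typing import List, Sequence, Tuple

TASK = "<CAPTION_TO_PHRASE_GROUNDING>"
MODEL_INPUT_SIZE = (512, 512)  # (width, height) used for the resize step


def build_prompt(phrase: str) -> str:
    """Prefix the user's phrase with the grounding task token."""
    return TASK + phrase.strip()


def rescale_boxes(
    boxes: Sequence[Sequence[float]],
    resized_size: Tuple[int, int],
    original_size: Tuple[int, int],
) -> List[List[float]]:
    """Map [x1, y1, x2, y2] boxes from the resized image back to the original.

    Hypothetical helper: width and height are scaled independently because
    a fixed 512x512 resize does not preserve the aspect ratio.
    """
    sx = original_size[0] / resized_size[0]
    sy = original_size[1] / resized_size[1]
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]
```

For example, a box covering the left half of the 512x512 input maps to the left half of a 1024x768 original: rescale_boxes([[0, 0, 256, 512]], MODEL_INPUT_SIZE, (1024, 768)) gives [[0.0, 0.0, 512.0, 768.0]].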

Models and Tools Used

  • Florence-2-large-no-flash-attn: Multimodal model (multimodalart/Florence-2-large-no-flash-attn) for visual grounding, using torch.float16 (GPU) or torch.float32 (CPU).
  • Libraries: transformers (AutoProcessor, AutoModelForCausalLM), torch (tensor operations, memory metrics), Pillow (image handling), Gradio (web interface)
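
A rough sketch of how these pieces might fit together is shown below. It follows the usage pattern commonly shown for Florence-2 checkpoints and is not taken from this repository: the function name ground_phrase and the max_new_tokens value are assumptions, and the 512x512 resize is left out for brevity. Running it downloads the model weights.

```python
MODEL_ID = "multimodalart/Florence-2-large-no-flash-attn"
TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def ground_phrase(image_path: str, phrase: str) -> dict:
    """Run phrase grounding on one image (hypothetical helper)."""
    # Heavy imports live inside the function so importing this file stays cheap.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # float16 on a CUDA GPU, float32 on CPU, as described above.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    dtype = torch.float16 if device == "cuda" else torch.float32

    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(
        text=TASK + phrase, images=image, return_tensors="pt"
    ).to(device, dtype)

    with torch.no_grad():  # inference only, no gradient tracking
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,  # assumed limit, not from the repository
            num_beams=3,
        )

    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Passing the image size lets the processor return boxes in pixel coordinates.
    return processor.post_process_generation(
        text, task=TASK, image_size=(image.width, image.height)
    )
```

A call such as ground_phrase("photo.jpg", "red pen") (the file name is a placeholder) should return a dictionary keyed by the task token, containing "bboxes" and "labels" lists.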

Handling Unseen Objects

  • Capabilities: Florence-2 generalizes to unseen objects via its diverse training data, matching text to visual features. Specific prompts (e.g., "red pen") improve accuracy.
  • Limitations: Unfamiliar objects may occasionally go undetected, in which case the app returns "No bounding boxes detected." Fine-tuning on custom datasets can improve performance on such objects.
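
The fallback for undetected objects could be handled along these lines. This is a sketch under assumptions: summarize_detections is a hypothetical helper, and the input is assumed to be a dictionary keyed by the task token with "bboxes" and "labels" lists.

```python
TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def summarize_detections(result: dict, task: str = TASK) -> str:
    """Turn a grounding result into text, with a fallback for empty results."""
    entry = result.get(task, {})
    boxes = entry.get("bboxes", [])
    labels = entry.get("labels", [])
    if not boxes:
        # Nothing matched the prompt, e.g. an object the model does not recognize.
        return "No bounding boxes detected."
    lines = []
    for label, (x1, y1, x2, y2) in zip(labels, boxes):
        lines.append(f"{label}: ({x1:.0f}, {y1:.0f}, {x2:.0f}, {y2:.0f})")
    return "\n".join(lines)
```

When the fallback message appears, retrying with a more specific prompt (for example "red pen" instead of "pen") is a cheap first step before considering fine-tuning.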

Efficient On-Device Execution

  • Optimization: Uses torch.float16 on CUDA GPUs or torch.float32 on CPU, and runs inference under torch.no_grad() to skip gradient tracking. Decoding uses beam search with a small beam width (num_beams=3). Images are resized to 512x512 to reduce computational load.
  • Edge Device Suitability: Florence-2 can run efficiently on edge devices when optimized. See the attached ChatGPT document containing a brief literature survey, and https://www.hackster.io/mjrobot/vision-language-models-vlm-at-the-edge-9c6656
  • Recommendations: Apply model quantization, pruning, or use Florence-2-base to reduce memory (~500 MB) and FLOPs for edge compatibility.
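
The memory figure above can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes the parameter counts listed on the Florence-2 model cards (0.23B for base, 0.77B for large) and counts weights only, so activations and runtime overhead are not included.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Approximate parameter counts as listed on the model cards (assumption).
PARAM_COUNTS = {"Florence-2-base": 0.23e9, "Florence-2-large": 0.77e9}


def weight_memory_mb(model_name: str, dtype: str) -> float:
    """Estimate weight storage in megabytes (1 MB = 10^6 bytes)."""
    return PARAM_COUNTS[model_name] * BYTES_PER_PARAM[dtype] / 1e6


if __name__ == "__main__":
    for name in PARAM_COUNTS:
        for dtype in BYTES_PER_PARAM:
            print(f"{name} @ {dtype}: ~{weight_memory_mb(name, dtype):.0f} MB")
```

Under these assumptions Florence-2-base in float16 needs about 460 MB for its weights, in line with the ~500 MB figure above, while Florence-2-large in float32 needs about 3080 MB.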

Sample Output

About

Zero-shot object detection and localization using a VLM, with no prior task-specific training
