Visual Text Grounding with Florence-2

This project implements visual text grounding: it locates objects in an image from a text description, using Florence-2-large with a Gradio interface for interactivity. I planned to add performance metrics for the project (FLOPs, FPS, inference time in ms, and so on), but due to time constraints they have not been added yet. I will attach articles to justify the edge compatibility of Florence-2. I have attached a ChatGPT document with a brief literature survey.

Detection and Grounding Strategy

Input Processing: Converts images to RGB, resizes them to 512x512 pixels for efficiency, and prefixes the text prompt (e.g., "pen") with the <CAPTION_TO_PHRASE_GROUNDING> task token.
Model Inference: Florence-2-large generates bounding box coordinates and labels, rescaled to the original image size.
Post-Processing: Draws red bounding boxes and labels on the image using PIL's ImageDraw.
Output: Saves the annotated image (output_image.jpg) and reports the grounding results.
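
The steps above can be sketched with a few small helpers. This is a minimal illustration, not the code in vlm.py: the helper names (build_prompt, preprocess, rescale_boxes, draw_boxes) are hypothetical, and boxes are assumed to be [x1, y1, x2, y2] lists in pixel coordinates.

```python
TASK_TOKEN = "<CAPTION_TO_PHRASE_GROUNDING>"
MODEL_INPUT_SIZE = (512, 512)  # (width, height) used for resizing, as described above


def build_prompt(phrase: str) -> str:
    """Prefix the user's phrase with the grounding task token."""
    return TASK_TOKEN + phrase.strip()


def preprocess(image):
    """Convert a PIL image to RGB and resize it to the model input size."""
    return image.convert("RGB").resize(MODEL_INPUT_SIZE)


def rescale_boxes(boxes, from_size, to_size):
    """Map [x1, y1, x2, y2] boxes from one (width, height) to another."""
    sx = to_size[0] / from_size[0]
    sy = to_size[1] / from_size[1]
    return [[x1 * sx, y1 * sy, x2 * sx, y2 * sy] for x1, y1, x2, y2 in boxes]


def draw_boxes(image, boxes, labels):
    """Draw red boxes and labels on a copy of a PIL image."""
    from PIL import ImageDraw  # local import: only this helper needs Pillow

    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    for (x1, y1, x2, y2), label in zip(boxes, labels):
        draw.rectangle([x1, y1, x2, y2], outline="red", width=3)
        draw.text((x1 + 4, y1 + 4), label, fill="red")
    return annotated
```

For example, a box predicted on the 512x512 input can be mapped back with rescale_boxes(boxes, (512, 512), original.size) before calling draw_boxes(original, ...) and saving the result as output_image.jpg.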

Models and Tools Used

Florence-2-large-no-flash-attn: Multimodal model (multimodalart/Florence-2-large-no-flash-attn) for visual grounding, using torch.float16 (GPU) or torch.float32 (CPU).
Libraries: transformers (AutoProcessor, AutoModelForCausalLM), torch (tensor operations, memory metrics), Pillow (image handling), Gradio (web interface)
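
As a rough sketch of how these pieces fit together, the snippet below loads the model and runs one grounding query. It follows the usage pattern published on the Florence-2 model card rather than this repository's vlm.py; the function names and max_new_tokens=1024 are assumptions, and trust_remote_code=True is needed because the model ships its own modeling code.

```python
MODEL_ID = "multimodalart/Florence-2-large-no-flash-attn"
TASK_TOKEN = "<CAPTION_TO_PHRASE_GROUNDING>"


def select_precision(cuda_available: bool) -> tuple[str, str]:
    """Pick (device, dtype name): float16 on CUDA, float32 on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def load_florence(model_id: str = MODEL_ID):
    """Load model and processor; heavy imports are kept local to this function."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    device, dtype_name = select_precision(torch.cuda.is_available())
    dtype = getattr(torch, dtype_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=dtype, trust_remote_code=True
    ).to(device).eval()
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor, device, dtype


def ground_phrase(model, processor, image, phrase, device, dtype):
    """Run phrase grounding on a PIL image and return the parsed result dict."""
    import torch

    inputs = processor(
        text=TASK_TOKEN + phrase, images=image, return_tensors="pt"
    ).to(device, dtype)
    with torch.no_grad():  # inference only, no gradient tracking
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,  # assumption, taken from the model card example
            num_beams=3,
            do_sample=False,
        )
    text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # image_size tells the processor which pixel space the boxes should be in.
    return processor.post_process_generation(
        text, task=TASK_TOKEN, image_size=(image.width, image.height)
    )
```

A Gradio interface could then wrap ground_phrase in a function that takes an image and a text prompt and returns the annotated image.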

Handling Unseen Objects

Capabilities: Florence-2 generalizes to unseen objects via its diverse training data, matching text to visual features. Specific prompts (e.g., "red pen") improve accuracy.
Limitations: Unfamiliar objects may occasionally go undetected, in which case the app returns "No bounding boxes detected." Fine-tuning on custom datasets can enhance performance.
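
The fallback message can be produced by checking the parsed result before drawing anything. The sketch below assumes the result dictionary has the shape shown on the Florence-2 model card (a task-token key holding "bboxes" and "labels" lists); summarize_grounding is a hypothetical helper name.

```python
TASK_TOKEN = "<CAPTION_TO_PHRASE_GROUNDING>"
NO_BOXES_MESSAGE = "No bounding boxes detected."


def summarize_grounding(parsed: dict) -> str:
    """Return one line per detected box, or the fallback message if none."""
    result = parsed.get(TASK_TOKEN, {})
    boxes = result.get("bboxes", [])
    labels = result.get("labels", [])
    if not boxes:
        return NO_BOXES_MESSAGE
    lines = []
    for label, (x1, y1, x2, y2) in zip(labels, boxes):
        lines.append(f"{label}: ({x1:.0f}, {y1:.0f}) to ({x2:.0f}, {y2:.0f})")
    return "\n".join(lines)
```

When the fallback is returned, retrying with a more specific prompt (e.g., "red pen" instead of "pen") is a cheap first step before considering fine-tuning.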

Efficient On-Device Execution

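Optimization: Uses torch.float16 on CUDA GPUs or torch.float32 on CPU, and wraps inference in torch.no_grad() to skip gradient tracking. Decoding uses beam search (num_beams=3), which costs more compute than greedy decoding but tends to give more reliable outputs. Images are resized to 512x512 to reduce computational load.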
Edge Device Suitability: Florence-2 can run efficiently on edge devices when optimized. See the attached ChatGPT document containing a brief literature survey, and this article: https://www.hackster.io/mjrobot/vision-language-models-vlm-at-the-edge-9c6656
Recommendations: Apply model quantization, pruning, or use Florence-2-base to reduce memory (~500 MB) and FLOPs for edge compatibility.
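
To compare these options on a target device, one could estimate weight memory from the parameter count and time the inference call. The sketch below is a minimal illustration with hypothetical helper names. It assumes the parameter counts listed on the Microsoft model cards (about 0.23B for Florence-2-base and 0.77B for Florence-2-large), and it counts weights only, not activations or framework overhead.

```python
import time
from statistics import mean

BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_params: int, dtype: str) -> float:
    """Approximate weight storage in MB (10^6 bytes) for a given precision."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e6


def benchmark(fn, warmup: int = 2, runs: int = 10) -> dict:
    """Time a zero-argument callable; report mean latency (ms) and FPS."""
    for _ in range(warmup):  # warm-up runs are not timed
        fn()
    times_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        times_ms.append((time.perf_counter() - start) * 1000.0)
    avg = mean(times_ms)
    fps = 1000.0 / avg if avg > 0 else float("inf")
    return {"latency_ms": avg, "fps": fps}


if __name__ == "__main__":
    print(weight_memory_mb(230_000_000, "float16"))  # Florence-2-base, fp16
    print(weight_memory_mb(770_000_000, "float16"))  # Florence-2-large, fp16
    print(benchmark(lambda: time.sleep(0.01)))       # stand-in for model inference
```

Under these assumptions, Florence-2-base in float16 comes to about 460 MB of weights, in line with the ~500 MB figure above. When timing on a CUDA GPU, the callable should call torch.cuda.synchronize() before returning, since GPU work is launched asynchronously.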

