This project implements visual text grounding: it identifies objects in images from text descriptions, using Florence-2-large with a Gradio interface for interactivity. I intended to add performance metrics for the project (FLOPs, FPS, inference time in ms, etc.), but due to time constraints they have not been added yet. I will attach articles to justify the edge compatibility of Florence-2. I have also attached a ChatGPT document with a brief literature survey.
- Input Processing: Converts images to RGB, resizes them to 512x512 pixels for efficiency, and combines text prompts (e.g., "pen") with the `<CAPTION_TO_PHRASE_GROUNDING>` task prompt.
- Model Inference: Florence-2-large generates bounding box coordinates and labels, which are rescaled to the original image size.
- Post-Processing: Draws red bounding boxes and labels on the image using PIL's ImageDraw.
- Output: Saves the annotated image (`output_image.jpg`) and the grounding results.
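
The prompt construction, box rescaling, and drawing steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names `build_prompt`, `rescale_boxes`, and `draw_boxes` are hypothetical, and it assumes boxes arrive as `[x0, y0, x1, y1]` in the coordinates of the 512x512 resized image.

```python
TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def build_prompt(phrase: str) -> str:
    """Prefix the user's phrase with the grounding task prompt (hypothetical helper)."""
    return TASK + phrase.strip()


def rescale_boxes(boxes, model_size, original_size):
    """Map [x0, y0, x1, y1] boxes from the resized image back to the original.

    model_size and original_size are (width, height) tuples.
    """
    sx = original_size[0] / model_size[0]
    sy = original_size[1] / model_size[1]
    return [[x0 * sx, y0 * sy, x1 * sx, y1 * sy] for x0, y0, x1, y1 in boxes]


def draw_boxes(image, boxes, labels):
    """Draw red boxes and labels on a PIL image, modifying it in place."""
    # Imported lazily so the two helpers above work without Pillow installed.
    from PIL import ImageDraw

    draw = ImageDraw.Draw(image)
    for (x0, y0, x1, y1), label in zip(boxes, labels):
        draw.rectangle([x0, y0, x1, y1], outline="red", width=3)
        # Place the label just above the box, clamped to the top edge.
        draw.text((x0, max(0, y0 - 12)), label, fill="red")
    return image
```

For example, a box covering the left half of the 512x512 input maps to the left half of a 1024x768 original: `rescale_boxes([[0, 0, 256, 512]], (512, 512), (1024, 768))` gives `[[0.0, 0.0, 512.0, 768.0]]`.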
- Florence-2-large-no-flash-attn: Multimodal model (`multimodalart/Florence-2-large-no-flash-attn`) for visual grounding, using `torch.float16` (GPU) or `torch.float32` (CPU).
- Libraries: `transformers` (AutoProcessor, AutoModelForCausalLM), `torch` (tensor operations, memory metrics), `Pillow` (image handling), `Gradio` (web interface).
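
As a rough sketch of how the model and libraries listed above fit together, the snippet below loads the checkpoint and runs one grounding query. It follows the usage pattern published on the Florence-2 model card rather than this project's own code. The function names and `max_new_tokens=1024` are assumptions; `num_beams=3` and the dtype choice come from the description in this document.

```python
MODEL_ID = "multimodalart/Florence-2-large-no-flash-attn"
TASK = "<CAPTION_TO_PHRASE_GROUNDING>"


def pick_device_and_dtype(cuda_available: bool):
    """Return (device, dtype name): half precision on GPU, full precision on CPU."""
    return ("cuda", "float16") if cuda_available else ("cpu", "float32")


def load_model():
    """Load the processor and model; heavy imports stay inside the function."""
    import torch
    from transformers import AutoModelForCausalLM, AutoProcessor

    device, dtype_name = pick_device_and_dtype(torch.cuda.is_available())
    dtype = getattr(torch, dtype_name)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=dtype, trust_remote_code=True
    ).to(device).eval()
    return processor, model, device, dtype


def ground(image, phrase, processor, model, device, dtype):
    """Return the parsed grounding result (boxes and labels) for one PIL image."""
    import torch

    inputs = processor(text=TASK + phrase, images=image, return_tensors="pt").to(device, dtype)
    with torch.no_grad():
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=1024,  # assumed value, taken from the model card example
            num_beams=3,
        )
    text = processor.batch_decode(ids, skip_special_tokens=False)[0]
    # Passing the image size lets the processor return pixel coordinates for that image.
    parsed = processor.post_process_generation(
        text, task=TASK, image_size=(image.width, image.height)
    )
    return parsed[TASK]
```

Calling `load_model()` downloads the checkpoint on first use, so it is best done once at startup (for example, before launching the Gradio interface) and reused across requests.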
- Capabilities: Florence-2 generalizes to unseen objects via its diverse training data, matching text to visual features. Specific prompts (e.g., "red pen") improve accuracy.
- Limitations: Unfamiliar objects may occasionally go undetected, in which case the result is "No bounding boxes detected." Fine-tuning on custom datasets can enhance performance.
- Optimization: Uses `torch.float16` on CUDA GPUs or `torch.float32` on CPU, with `torch.no_grad()` to skip gradient tracking and beam search (`num_beams=3`) for decoding. Images are resized to 512x512 to reduce computational load.
- Edge Device Suitability: Florence-2 can be efficient enough for edge devices when optimized. Please see the attached ChatGPT document containing a brief literature survey, and https://www.hackster.io/mjrobot/vision-language-models-vlm-at-the-edge-9c6656
- Recommendations: Apply model quantization, pruning, or use Florence-2-base to reduce memory (~500 MB) and FLOPs for edge compatibility.
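
The inference-time and FPS metrics that have not yet been added could be collected with a small timing helper along these lines. This is only a sketch under stated assumptions: `measure_latency` is a hypothetical name, and `fn` stands for any zero-argument callable that runs one full inference. It does not measure FLOPs, which would need a separate profiling tool.

```python
import time


def measure_latency(fn, warmup=2, runs=10):
    """Time a zero-argument callable and report latency in ms plus FPS.

    Note: when timing GPU inference, fn should call torch.cuda.synchronize()
    before returning, otherwise asynchronous kernels make timings look too short.
    """
    # Warm-up runs are excluded so one-time setup costs do not skew the mean.
    for _ in range(warmup):
        fn()

    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = sum(timings_ms) / len(timings_ms)
    fps = 1000.0 / mean_ms if mean_ms > 0 else float("inf")
    return {
        "mean_ms": mean_ms,
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
        "fps": fps,
    }
```

Running the same helper on Florence-2-large and Florence-2-base, or before and after quantization, would give a like-for-like comparison on the target device.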
