Grounding DINO Fine-tuning 🦖

This repository builds upon the original work by IDEA-Research/GroundingDINO and the open-source implementation by Techwolf (Gitee).

🧩 Additional Info:

This implementation adds the ability to train the model with image-to-text grounding, which matters in applications where textual descriptions must align with image regions.
For instance, when the model is given a caption "a cat on the sofa", it should be able to localize both the "cat" and the "sofa" in the image.
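
One way to picture this alignment is as a mapping from each phrase to its position in the caption, which the model then has to associate with boxes in the image. The sketch below is only an illustration: `locate_phrases` is a hypothetical helper that is not part of this repository, and it works on raw character offsets, whereas the model itself operates on tokenizer positions.

```python
def locate_phrases(caption, phrases):
    """Map each phrase to its (start, end) character span in the caption.

    Hypothetical helper for illustration; phrases that do not occur
    in the caption are skipped.
    """
    spans = {}
    for phrase in phrases:
        start = caption.find(phrase)  # first occurrence only
        if start != -1:
            spans[phrase] = (start, start + len(phrase))
    return spans


spans = locate_phrases("a cat on the sofa", ["cat", "sofa"])
print(spans)  # {'cat': (2, 5), 'sofa': (13, 17)}
```

Each span marks the part of the caption that a predicted box for that phrase should be matched to.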

🧠 Author’s Note:

I have extended and customized this implementation for my own experiments on fine-tuning, evaluation, and visualization of the Grounding DINO model on a custom dataset.
All my scripts are saved in the myscripts/ folder.

In addition, since many developers run into environment setup issues, I have included my working Conda environment file (gdino_env.yml) in the conda_env_file/ folder to make setup and replication easier.

✨ Features:

Fine-tuning DINO: This extension allows you to fine-tune Grounding DINO on your custom dataset.
Bounding Box Regression: Uses Generalized IoU and Smooth L1 loss for improved bounding box prediction.
Position-aware Logit Losses: The model learns not only to detect objects but also where they are mentioned in the caption.
NMS: We also implemented phrase-based NMS to remove redundant boxes for the same object.
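
To make the bounding box regression term concrete, here is a minimal pure-Python sketch of a Generalized IoU plus Smooth L1 loss for a single box pair. It is not the code used in this repository: the function names, the (x1, y1, x2, y2) box format, and the weights `giou_weight=2.0` and `l1_weight=5.0` are illustrative assumptions, and boxes are assumed to have positive area.

```python
def giou(a, b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    # Intersection (zero if the boxes do not overlap).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (enclose - union) / enclose


def smooth_l1(pred, target, beta=1.0):
    """Mean Smooth L1 distance between two coordinate sequences."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


def box_loss(pred, target, giou_weight=2.0, l1_weight=5.0):
    """Weighted sum of (1 - GIoU) and Smooth L1; weights are assumptions."""
    return giou_weight * (1.0 - giou(pred, target)) + l1_weight * smooth_l1(pred, target)


print(box_loss((0.0, 0.0, 1.0, 1.0), (0.0, 0.0, 1.0, 1.0)))  # 0.0 for a perfect match
```

Unlike plain IoU, the GIoU term stays informative for non-overlapping boxes, because the enclosing-box penalty still grows as the boxes move apart.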

⚙️ Installation:

Follow the installation steps from the original GroundingDINO repository.
Ensure all prerequisites are installed before running training or testing.

🧩 Train:

Prepare your dataset with images and associated textual captions. A tiny dataset is provided in multimodal-data/ to demonstrate the expected data format.
Run train.py to start training.

python train.py

Test:

Visualize the results of training on test images.

python test.py
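
The phrase-based NMS listed under Features is typically applied at this stage, before boxes are drawn. The sketch below shows the general idea under stated assumptions: the `(box, score, phrase)` tuple layout, the function names, and the `iou_threshold=0.5` default are hypothetical and may differ from what test.py actually does.

```python
def iou(a, b):
    """Plain IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def phrase_nms(detections, iou_threshold=0.5):
    """Suppress overlapping boxes, but only among boxes sharing a phrase.

    detections: list of (box, score, phrase) tuples (assumed layout).
    Returns the kept detections, highest score first.
    """
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        box, _, phrase = det
        duplicate = any(
            k[2] == phrase and iou(box, k[0]) > iou_threshold for k in kept
        )
        if not duplicate:
            kept.append(det)
    return kept


detections = [
    ((0, 0, 10, 10), 0.9, "fruit"),
    ((1, 1, 10, 10), 0.6, "fruit"),     # overlaps the first "fruit" box
    ((1, 1, 10, 10), 0.8, "peduncle"),  # same region, different phrase
]
for box, score, phrase in phrase_nms(detections):
    print(phrase, score, box)
```

Because suppression is restricted to boxes with the same phrase, a "peduncle" box is not removed just because it overlaps a "fruit" box.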

Known Limitations/TO DO:

Currently supports only one image at a time; allow batching.
Add model evaluations.
We did not add the auxiliary losses described in the original paper, since we are only fine-tuning an already trained model. Feel free to add auxiliary losses and compare results.
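
For anyone who wants to try auxiliary losses: in DETR-style detectors, the same loss is usually evaluated on the output of every decoder layer, and the intermediate terms are added to the final-layer loss. Here is a minimal sketch, assuming the per-layer loss values are already computed; `loss_with_auxiliary` and `aux_weight` are hypothetical names and not part of this repository.

```python
def loss_with_auxiliary(layer_losses, aux_weight=1.0):
    """Combine the final-layer loss with losses from earlier decoder layers.

    layer_losses: per-decoder-layer loss values, final layer last
                  (must contain at least one value).
    aux_weight:   scale applied to the intermediate-layer terms (assumption).
    """
    *aux, final = layer_losses
    return final + aux_weight * sum(aux)


print(loss_with_auxiliary([3.0, 2.0, 1.0]))                  # 6.0
print(loss_with_auxiliary([3.0, 2.0, 1.0], aux_weight=0.0))  # 1.0, final layer only
```

Setting `aux_weight=0.0` reproduces the current behaviour (final layer only), which makes it straightforward to compare runs with and without the auxiliary terms.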

Visual Results

For the input text "peduncle.fruit." and an input test image:

Before Fine-tuning

Initially, the model detects the wrong category and does not detect the peduncle (green part) of the fruits.

After Fine-tuning

After fine-tuning, the model detects the right category of objects with high confidence and detects all parts of the fruits mentioned in the text.

Contributing

Feel free to open issues, suggest improvements, or submit pull requests. If you found this repository useful, consider giving it a star to make it more visible to others!

