Charm: The Missing Piece in ViT fine-tuning for Image Aesthetic Assessment

We introduce Charm, a novel tokenization approach that simultaneously preserves Composition, High-resolution, Aspect Ratio, and Multi-scale information. By preserving this critical information, Charm works like a charm for image aesthetic and quality assessment 🌟🌟🌟.

Quick Inference

Step 1) Check our GitHub Page and install the requirements.

pip install -r requirements.txt

Step 2) Install Charm tokenizer.

pip install Charm-tokenizer

Step 3) Tokenization + Position embedding preparation

from Charm_tokenizer.ImageProcessor import Charm_Tokenizer

img_path = r"img.png"

charm_tokenizer = Charm_Tokenizer(patch_selection='frequency', training_dataset='tad66k', backbone='facebook/dinov2-small', without_pad_or_dropping=True)
tokens, pos_embed, mask_token = charm_tokenizer.preprocess(img_path)

Charm Tokenizer has the following input args:

patch_selection (str): The method for selecting important patches
- Options: 'saliency', 'random', 'frequency', 'gradient', 'entropy', 'original'.
training_dataset (str): Used to set the number of ViT input tokens to match a specific training dataset from the paper.
- Aesthetic assessment datasets: 'ava', 'aadb', 'tad66k', 'para', 'baid'.
- Quality assessment datasets: 'spaq', 'koniq10k'.
backbone (str): The ViT backbone model (default: 'facebook/dinov2-small' for all datasets except AVA, and 'facebook/dinov2-large' for AVA only).
factor (float): The downscaling factor for less important patches (default: 0.5).
scales (int): The number of scales used for multiscale processing (default: 2).
random_crop_size (tuple): Used for the 'original' patch selection strategy (default: (224, 224)).
downscale_shortest_edge (int): Used for the 'original' patch selection strategy (default: 256).
without_pad_or_dropping (bool): Whether to avoid padding or dropping patches (default: True).
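
The dataset-to-backbone rule described above can be captured in a small helper. This is a minimal sketch: `build_tokenizer_kwargs` and the three constant sets are hypothetical names that are not part of the Charm-tokenizer package, and the defaults simply mirror the argument list above.

```python
# Hypothetical helper; not part of the Charm-tokenizer package.
# It only assembles keyword arguments using the names listed above.
AESTHETIC_DATASETS = {"ava", "aadb", "tad66k", "para", "baid"}
QUALITY_DATASETS = {"spaq", "koniq10k"}
PATCH_SELECTIONS = {
    "saliency", "random", "frequency", "gradient", "entropy", "original",
}


def build_tokenizer_kwargs(training_dataset, patch_selection="frequency"):
    """Return keyword arguments for Charm_Tokenizer with the documented defaults."""
    dataset = training_dataset.lower()
    if dataset not in AESTHETIC_DATASETS | QUALITY_DATASETS:
        raise ValueError(f"Unknown training_dataset: {training_dataset!r}")
    if patch_selection not in PATCH_SELECTIONS:
        raise ValueError(f"Unknown patch_selection: {patch_selection!r}")

    # Per the argument list: the large backbone is the default for AVA only.
    if dataset == "ava":
        backbone_name = "facebook/dinov2-large"
    else:
        backbone_name = "facebook/dinov2-small"

    return {
        "patch_selection": patch_selection,
        "training_dataset": dataset,
        "backbone": backbone_name,
        "without_pad_or_dropping": True,
    }
```

With the package installed, the result could be passed along as `Charm_Tokenizer(**build_tokenizer_kwargs('ava'))`.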

The output is the preprocessed tokens, their corresponding positional embeddings, and a mask token that indicates which patches are in high resolution and which are in low resolution.

Step 4) Predicting aesthetic/quality score

from Charm_tokenizer.Backbone import backbone

model = backbone(training_dataset='tad66k', device='cpu')
prediction = model.predict(tokens, pos_embed, mask_token)
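
To score several images, Steps 3 and 4 can be wrapped in a loop that reuses one tokenizer and one model. The sketch below uses only the calls shown above; `score_images` is a hypothetical helper name, and the type of the value returned by `model.predict` is not documented here, so it is stored as-is.

```python
def score_images(image_paths, training_dataset="tad66k",
                 backbone_name="facebook/dinov2-small", device="cpu"):
    """Return a dict mapping each image path to its predicted score."""
    # Imported inside the function so this file can be loaded even when
    # Charm-tokenizer is not installed; the calls mirror Steps 3 and 4.
    from Charm_tokenizer.ImageProcessor import Charm_Tokenizer
    from Charm_tokenizer.Backbone import backbone

    tokenizer = Charm_Tokenizer(
        patch_selection="frequency",  # deterministic choice for inference
        training_dataset=training_dataset,
        backbone=backbone_name,
        without_pad_or_dropping=True,
    )
    model = backbone(training_dataset=training_dataset, device=device)

    scores = {}
    for path in image_paths:
        tokens, pos_embed, mask_token = tokenizer.preprocess(path)
        scores[path] = model.predict(tokens, pos_embed, mask_token)
    return scores
```

For example, `score_images(["a.png", "b.png"])` would build the tokenizer and model once and reuse them for both files.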

Note:

While random patch selection during training helps avoid overfitting, use a fully deterministic patch selection approach during inference to get consistent results.
For the training code, check our GitHub Page.
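
The difference between random and deterministic selection can be illustrated with a toy example. This is a sketch and not Charm's implementation: `patch_scores` stands in for any per-patch importance measure, and both function names are hypothetical.

```python
import random


def select_random(scores, k, rng=None):
    """Pick k patch indices uniformly at random; varies between calls unless seeded."""
    rng = rng or random
    return sorted(rng.sample(range(len(scores)), k))


def select_top_k(scores, k):
    """Pick the k highest-scoring patch indices; ties are broken by lower index."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])


# Toy importance scores for six patches (illustrative values only).
patch_scores = [0.1, 0.9, 0.4, 0.9, 0.2, 0.7]

# Deterministic selection returns the same indices on every call.
first = select_top_k(patch_scores, 3)
second = select_top_k(patch_scores, 3)
print(first == second)

# Unseeded random selection may differ from one call to the next.
print(select_random(patch_scores, 3))
```

Because the score-based choice depends only on the input, repeated inference on the same image selects the same patches and therefore yields the same prediction.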
