This repository provides an implementation of MobileViViT (Mobile Video Vision Transformers) — an adaptation of MobileViT designed for higher-dimensional tasks such as video.
⚠️ Note: This is not a research-based project. It is an adaptation of MobileViT with careful consideration to preserve the integrity of the original work.
Three model variants are provided; a minimal construction sketch follows the list:
- MobileViViT-S
- MobileViViT-XS
- MobileViViT-XXS
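The variants are presumably constructed the same way. Only MobileViViTXXS appears in the usage example further down, so the MobileViViTS and MobileViViTXS class names (and their export from the package root) are assumptions based on the file names in the project structure:

from MobileViViT import MobileViViTS, MobileViViTXS, MobileViViTXXS  # S/XS class names are assumed

# Each variant is built with the number of output units for its classification head
model_s = MobileViViTS(num_output_units=2)      # largest of the three
model_xs = MobileViViTXS(num_output_units=2)    # mid-sized
model_xxs = MobileViViTXXS(num_output_units=2)  # smallest; used in the usage example below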
The implementation is built from modular components that can be reused independently or combined to construct new architectures. Utility modules (e.g., custom training loops, video data generators) are included under utils/ and come with documentation for easier understanding and extension.
This repository includes complete or partial implementations inspired by or directly adapted from the following works:
- Squeeze-and-Excitation Networks
- A Closer Look at Spatiotemporal Convolutions for Action Recognition
- MobileNetV2: Inverted Residuals and Linear Bottlenecks
- Bag of Tricks for Image Classification with Convolutional Neural Networks
- Searching for MobileNetV3
- Mish: A Self Regularized Non-Monotonic Neural Activation Function
- Implicit Neural Representations with Periodic Activation Functions
- An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
- ViViT: A Video Vision Transformer
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
All references have been properly acknowledged. If any material violates terms of use, please contact me, and I will promptly address it.
The repository was developed with Python 3.11.x, TensorFlow, and Keras.
Clone the repository and install dependencies:
git clone https://github.com/AliKHaliliT/MobileViViT.git
cd MobileViViT
pip install -r requirements.txt
(Optional) Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Linux/Mac
venv\Scripts\activate # On Windows
⚠️ Depending on your system, you may need to install additional packages beyond those listed in requirements.txt.
The project is structured as follows:
MobileViViT/
├── __init__.py
├── assets/
│ ├── __init__.py
│ ├── activations/
│ │ ├── __init__.py
│ │ ├── hard_swish.py
│ │ ├── mish.py
│ │ └── sine.py
│ ├── blocks/
│ │ ├── __init__.py
│ │ ├── mlp.py
│ │ ├── mobilevivit.py
│ │ ├── mvnblock.py
│ │ ├── siren.py
│ │ └── transformer.py
│ ├── layers/
│ │ ├── __init__.py
│ │ ├── conv2plus1d.py
│ │ ├── conv_layer.py
│ │ ├── fc_layer.py
│ │ ├── fold.py
│ │ ├── positional_encoder.py
│ │ ├── sae_3d.py
│ │ ├── sine_layer.py
│ │ ├── ssae_3d.py
│ │ ├── transformer_layer.py
│ │ ├── tubelet_embedding.py
│ │ └── unfold.py
│ └── utils/
│ ├── __init__.py
│ ├── activation_function.py
│ ├── low_resource_training_scheme.py
│ ├── move_column_to_the_beginning.py
│ ├── progress_bar.py
│ ├── sine_layer_initializer.py
│ ├── squeeze_and_excitation.py
│ ├── video_data_generator.py
│ ├── video_file_to_numpy_array.py
│ └── video_frame_unifier.py
├── mobilevivit_s.py
├── mobilevivit_xs.py
└── mobilevivit_xxs.py
The following example trains a MobileViViT-XXS model on a single sample video using the bundled VideoDataGenerator:
import pandas as pd
from MobileViViT.assets.utils.video_data_generator import VideoDataGenerator
from MobileViViT import MobileViViTXXS
# Config
num_output_units = 2
batch_size = 1
epochs = 1
# Sample input video
path_to_video = "util_resources/test_video.mp4"
video_data = pd.DataFrame({
    "Address + FileName": [path_to_video],
    "0": [0],
    "1": [1]
})
# Data generator
data_generator = VideoDataGenerator(dataframe=video_data, batch_size=batch_size)
# Initialize and train model
model = MobileViViTXXS(num_output_units=num_output_units)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(data_generator, epochs=epochs)
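After training, the model can be used like any other Keras model. A minimal follow-up sketch, assuming MobileViViTXXS behaves as a standard tf.keras.Model (as its compile/fit usage above implies) and reusing the generator defined above; the output file name is illustrative:

# Run inference on the generator; one row of class scores per video
predictions = model.predict(data_generator)
print(predictions.shape)  # expected: (number of videos, num_output_units)

# Persist the trained weights (file name is an example, not part of the repository)
model.save_weights("mobilevivit_xxs.weights.h5")

This work is released under the MIT License.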