This project evaluates the performance of a Long Short-Term Memory (LSTM) architecture for human movement classification on the UCF50 dataset.
- 1. Introduction
- 2. Architecture
- 3. Methodology
- 4. Module.py methods
- 5. Results and Discussion
- 6. Conclusion
- 7. Future Steps
- 8. Running Locally
- 9. Developer Team
- 10. References
- 11. License
LSTM_Classifier/
├── code/
│   ├── examples/ # videos to test the network
│   ├── module.py
│   ├── test.py
│   ├── train.py
│   └── requirements.txt
├── statistics/ # statistic results
├── README.md
└── LICENSE
Human movements are actions that cannot be reliably classified from a single image; they are defined by a set of images in a specific sequence. The goal of this project is therefore to identify movements from multi-frame containers (videos) using a time-series neural network module. A Long Short-Term Memory (LSTM) architecture was chosen for its ability to retain information from previous time steps. To evaluate the network, input lengths from 15 to 120 frames per video were tested.
Regarding the dataset, this study uses Realistic Action Recognition: UCF50 [1]. It was chosen for its variety of human movements and its widespread use.
This study uses the following LSTM architecture, based on Bleed AI Academy's YouTube video [2]:
------------------------------------------------
|                  ConvLSTM2D                  |
------------------------------------------------
|   Filters=4, Kernel=(3,3), Activation=Tanh   |
------------------------------------------------
                       ↓
------------------------------------------------
|                 MaxPooling3D                 |
------------------------------------------------
|       Padding=Same, Pool_Size=(1,2,2)        |
------------------------------------------------
                       ↓
------------------------------------------------
|          TimeDistributed + Dropout           |
------------------------------------------------
|                 Dropout=0.2                  |
------------------------------------------------
                       ↓
------------------------------------------------
|                  ConvLSTM2D                  |
------------------------------------------------
|  Filters=14, Kernel=(3,3), Activation=Tanh   |
------------------------------------------------
                       ↓
------------------------------------------------
|                 MaxPooling3D                 |
------------------------------------------------
|       Padding=Same, Pool_Size=(1,2,2)        |
------------------------------------------------
                       ↓
------------------------------------------------
|          TimeDistributed + Dropout           |
------------------------------------------------
|                 Dropout=0.2                  |
------------------------------------------------
                       ↓
------------------------------------------------
|                  ConvLSTM2D                  |
------------------------------------------------
|  Filters=16, Kernel=(3,3), Activation=Tanh   |
------------------------------------------------
                       ↓
------------------------------------------------
|                 MaxPooling3D                 |
------------------------------------------------
|       Padding=Same, Pool_Size=(1,2,2)        |
------------------------------------------------
                       ↓
------------------------------------------------
|                   Flatten                    |
------------------------------------------------
                       ↓
------------------------------------------------
|                    Dense                     |
------------------------------------------------
|        6 classes, Activation=SoftMax         |
------------------------------------------------
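To make the diagram easier to follow, the sketch below traces how the tensor shape could change from layer to layer. It is not taken from module.py: the 64x64 RGB input, the 60-frame sequence, `valid` padding with stride 1 in each ConvLSTM2D layer, and a full sequence returned by every ConvLSTM2D layer are all assumptions made for illustration. Dropout layers are skipped because they do not change the shape.

```python
import math


def conv_valid(size, kernel=3):
    """Spatial size after a stride-1 convolution with 'valid' padding (assumed)."""
    return size - kernel + 1


def pool_same(size, pool=2):
    """Spatial size after pooling with 'same' padding and stride equal to the pool size."""
    return math.ceil(size / pool)


def trace_shapes(frames, height, width, channels=3, filters=(4, 14, 16)):
    """Return (layer name, output shape) pairs for the stack in the diagram."""
    shape = (frames, height, width, channels)
    trace = [("Input", shape)]
    for n_filters in filters:
        shape = (frames, conv_valid(shape[1]), conv_valid(shape[2]), n_filters)
        trace.append(("ConvLSTM2D", shape))
        # Pool_Size=(1,2,2): the time axis is untouched, height and width are halved.
        shape = (frames, pool_same(shape[1]), pool_same(shape[2]), n_filters)
        trace.append(("MaxPooling3D", shape))
    trace.append(("Flatten", (math.prod(shape),)))
    return trace


# Hypothetical input: 60 frames of 64x64 RGB images.
for name, shape in trace_shapes(frames=60, height=64, width=64):
    print(f"{name:<13}{shape}")
```

Under these assumptions the spatial size goes 64, 62, 31, 29, 15, 13, 7, so the flattened vector fed to the final Dense layer has 60 × 7 × 7 × 16 = 47,040 elements. The vector grows linearly with the number of frames, which is one reason the frame count affects training cost.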
Initially, to assess which configuration performs best, the number of classes was fixed at seven: WalkingWithDog, Skiing, Swing, Diving, Mixing, HorseRace, and HorseRiding. The labels are one-hot encoded, since the classes have no inherent order.
Example frames of the Skiing, HorseRiding, Swing, and WalkingWithDog classes in the UCF50 dataset
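As a minimal sketch of the one-hot encoding mentioned above, the snippet below maps each class name to a vector with a single 1. The class order is an assumption (it simply follows the order in which the classes are listed in the text), and `one_hot` is a hypothetical helper, not a function from module.py.

```python
# Class order is an assumption; any fixed order works as long as it is
# the same during training and prediction.
CLASSES = [
    "WalkingWithDog", "Skiing", "Swing", "Diving",
    "Mixing", "HorseRace", "HorseRiding",
]


def one_hot(label, classes=CLASSES):
    """Return a vector of zeros with a single 1.0 at the label's index."""
    vector = [0.0] * len(classes)
    vector[classes.index(label)] = 1.0
    return vector


print(one_hot("Skiing"))
```

Because every vector is equally distant from every other, the encoding does not suggest that, say, Skiing is "closer" to Swing than to Diving, which an integer label could imply.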
With the classes fixed, the next step was to vary the number of frames collected from each video, from 15 to 120. I first trained each network for 5 epochs to get a rough estimate of its performance, and then selected the most promising ones for longer training (30 epochs).
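One common way to collect a fixed number of frames from videos of different lengths is to sample them at evenly spaced positions. The sketch below illustrates that idea only; `sample_frame_indices` is a hypothetical helper, and repeating the last frame for videos shorter than the requested length is an assumption, not necessarily what module.py does.

```python
def sample_frame_indices(total_frames, sequence_length):
    """Pick `sequence_length` evenly spaced frame indices from a video."""
    if total_frames <= 0 or sequence_length <= 0:
        raise ValueError("total_frames and sequence_length must be positive")
    # Distance between two sampled frames; at least 1 for short videos.
    step = max(total_frames // sequence_length, 1)
    # Clamp to the last frame so short videos still yield a full sequence.
    return [min(i * step, total_frames - 1) for i in range(sequence_length)]


# Hypothetical 150-frame video sampled at two of the tested lengths.
print(sample_frame_indices(150, 15))
print(sample_frame_indices(150, 60)[:5])
```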
For evaluation, loss, accuracy, recall, and precision were the basis for selecting the best network for this context. Finally, the assessment was deemed successful.
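For reference, the sketch below shows how accuracy and per-class precision and recall can be computed from predicted labels. The labels are made up for illustration, and the per-class convention used here is an assumption; the module may aggregate these metrics differently.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall(y_true, y_pred, classes):
    """Return {class: (precision, recall)} computed one class at a time."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (precision, recall)
    return scores


# Made-up labels, for illustration only.
y_true = ["Skiing", "Skiing", "Swing", "Diving"]
y_pred = ["Skiing", "Swing", "Swing", "Diving"]
print(accuracy(y_true, y_pred))
print(precision_recall(y_true, y_pred, ["Skiing", "Swing", "Diving"]))
```

Precision answers "of the videos predicted as a class, how many were right", while recall answers "of the videos that belong to a class, how many were found", so the two can diverge even when accuracy looks good.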
As a by-product of this study, I created a structured module for the LSTM architecture shown above. The main methods are:
| Method | Description |
|---|---|
| create_dataset | creates a dataset from the input path |
| frame_features_extraction | extracts features from each class and stores them to create the dataset |
| architecture | assembles the LSTM architecture |
| predict | predicts the class of an input video and stores the result in an output file |
| train | trains the LSTM model (split: 70% train, 15% validation, 15% test) |
| evaluate | generates a .json file with loss, accuracy, precision, and recall metrics |
| load_model | loads an existing model |
| save_architecture_image | saves an image of the LSTM architecture |
| save_metric | saves the training metrics over epochs in a .csv file |
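The 70/15/15 split used by `train` can be illustrated with a short sketch. This is not the code from module.py: `split_dataset`, the fixed seed, and the shuffle-then-slice strategy are assumptions chosen to keep the example self-contained.

```python
import random


def split_dataset(samples, train=0.70, val=0.15, seed=42):
    """Shuffle the samples and split them into train, validation, and test lists."""
    items = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * train)
    n_val = round(len(items) * val)
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]  # whatever remains (about 15%)
    return train_set, val_set, test_set


# Hypothetical dataset of 100 video identifiers.
train_set, val_set, test_set = split_dataset(range(100))
print(len(train_set), len(val_set), len(test_set))
```

Shuffling before slicing matters when the dataset is stored class by class, as otherwise the test set could end up containing only the last classes.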
Figure 01 shows that the best accuracy, loss, recall, and precision are obtained when 60 frames are collected from each video. For longer videos (more than 10 seconds), however, increasing the number of collected frames may be desirable to capture the action in more detail.
Regarding the epochs, I chose thirty because the network results start declining after this threshold. Even so, with a patience parameter of 10 epochs, none of the configurations trained beyond epoch 23, which means that the epoch limit, and with it the training time, could be reduced.
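The interaction between the 30-epoch limit and the patience of 10 can be sketched in plain Python. The loss values below are synthetic, and monitoring the validation loss is an assumption; the sketch only illustrates how patience-based early stopping ends a run before the epoch limit.

```python
def run_with_early_stopping(val_losses, patience=10, max_epochs=30):
    """Return (last epoch run, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop training.
            break
    return last_epoch, best_epoch


# Synthetic curve: the loss improves for 13 epochs, then stays at a worse value.
synthetic_losses = [1.0 - 0.05 * i for i in range(13)] + [0.5] * 17
stopped_at, best_at = run_with_early_stopping(synthetic_losses)
print(stopped_at, best_at)
```

With a patience of 10, a run always stops 10 epochs after its best epoch, so a run that ends at epoch 23 reached its best result around epoch 13.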
Figure 01: Accuracy, loss, recall, and precision plots
In terms of classification, Figure 02 compares the original video with the predicted one. Note that there is a short delay before the video is classified, because the network must receive a number of frames before it can make a proper inference.
These results on UCF50 show that Long Short-Term Memory networks are a viable solution to human movement classification problems.
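The delay can be illustrated with a sliding window over the incoming frames: no label is available until the window holds a full sequence. This is a sketch under stated assumptions, not the `predict` method itself; `classify_stream` and the stand-in `predict` callable are hypothetical.

```python
from collections import deque


def classify_stream(frames, predict, sequence_length):
    """Label each incoming frame once a full window of frames is available."""
    window = deque(maxlen=sequence_length)  # drops the oldest frame automatically
    labels = []
    for frame in frames:
        window.append(frame)
        if len(window) < sequence_length:
            labels.append(None)  # still filling the window: no prediction yet
        else:
            labels.append(predict(list(window)))
    return labels


# Stand-in for the trained network: it always answers "Skiing".
labels = classify_stream(range(10), lambda window: "Skiing", sequence_length=4)
print(labels)
```

The first `sequence_length - 1` frames carry no label, which corresponds to the short delay visible at the start of the classified video.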
The study demonstrated that LSTMs are a viable solution to human movement classification problems. Despite the use of a small, educational dataset, the trained model produced satisfactory results. For the UCF50 dataset, the best overall configuration captures 60 frames from each video.
This repository only scratches the surface of the LSTM's potential for identifying human movements. Possible future work includes adding continuous learning, designing an accessible user terminal for running functions (such as training, creating a dataset, and evaluating performance), and testing different architectures.
Clone the repository:
git clone https://github.com/MarcosTavar3s/LSTM_Classifier.git
cd LSTM_Classifier/code
Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate # Linux or macOS
venv\Scripts\activate # Windows
Install the dependencies:
pip install -r requirements.txt
Run the project:
python train.py # for training
python test.py # for testing
To use only module.py, import it in your Python code:
from lstm import classifier_model
| Marcos Aurélio | Helton Maia |
|---|---|
| Researcher | Academic Advisor |
[1] P. Ahmad, "Realistic Action Recognition - UCF50," Kaggle, 2022. [Online]. Available: https://www.kaggle.com/datasets/pypiahmad/realistic-action-recognition-ucf50.
[2] Bleed AI Academy, "Human Activity Recognition using TensorFlow (CNN + LSTM) | 2 Methods," YouTube, 2021. [Online]. Available: https://www.youtube.com/watch?v=QmtSkq3DYko.
This project is licensed under the terms of the MIT License.










