---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
pipeline_tag: video-classification
tags:
- Action
- Vit
- Vit-LSTM
- video
- classification
- sportlabels
- pytorch
- LSTM
---
|
**ViT-LSTM Action Recognition**

**Overview**

This project implements an action recognition model using a ViT-LSTM architecture. It takes a short video as input and predicts the action performed in it. The model extracts frame-wise ViT features and processes them with an LSTM to capture temporal dependencies.
|
|
|
**Model Details**

- Base Model: ViT-Base-Patch16-224 (`google/vit-base-patch16-224`)
- Architecture: ViT (feature extractor) + LSTM (temporal modeling)
- Number of Classes: 5
- Dataset: custom dataset with the following action categories:
  - BaseballPitch
  - Basketball
  - BenchPress
  - Biking
  - Billiards
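
The model definition itself is not included on this card. Below is a minimal sketch of a compatible ViT-LSTM head, assuming the hyperparameters listed under Model Training Details (768-d features, 512 hidden units, 2 bidirectional LSTM layers, 0.3 dropout, 5 classes); the class name and the last-time-step pooling are illustrative choices, not confirmed details of the released checkpoint:

````python
import torch
import torch.nn as nn

class ViTLSTM(nn.Module):
    """Hypothetical reconstruction: classifies a sequence of per-frame ViT features."""

    def __init__(self, feature_dim=768, hidden_dim=512, num_layers=2,
                 num_classes=5, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout,
        )
        # A bidirectional LSTM doubles the size of the output features.
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, num_frames, feature_dim)
        out, _ = self.lstm(x)
        # Classify from the representation at the last time step (assumption).
        return self.classifier(out[:, -1, :])
````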
|
**Working**

1. Extract Frames → up to 16 frames are sampled from the uploaded video.
2. Feature Extraction → each frame is passed through ViT to obtain a feature vector.
3. Temporal Processing → the LSTM processes these features to capture motion information.
4. Prediction → the final output is classified into one of the 5 action categories.
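
The card does not specify the exact sampling strategy for step 1; a minimal sketch assuming evenly spaced frame indices, with `sample_frames` as a hypothetical helper:

````python
import cv2
import numpy as np
from PIL import Image

def sample_frames(video_path, max_frames=16):
    """Read a video and return up to `max_frames` evenly spaced RGB frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices across the clip (assumption: uniform sampling).
    indices = set(np.linspace(0, max(total - 1, 0),
                              num=min(max_frames, total), dtype=int).tolist())
    frames, idx = [], 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if idx in indices:
            # OpenCV returns BGR; convert to RGB for the ViT processor.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames
````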
|
|
|
**Model Training Details**

- Feature Dimension: 768
- LSTM Hidden Dimension: 512
- Number of LSTM Layers: 2 (bidirectional)
- Dropout: 0.3
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
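
For reference, a minimal training-loop sketch using the optimizer and loss named above. It assumes the `ViTLSTM` class sketched earlier; the learning rate, batch size, epoch count, and dummy data are illustrative placeholders, not values from this card:

````python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data just to make the sketch runnable:
# 8 clips, 16 frames each, 768-d ViT features, 5 classes.
dummy = TensorDataset(torch.randn(8, 16, 768), torch.randint(0, 5, (8,)))
train_loader = DataLoader(dummy, batch_size=4, shuffle=True)

model = ViTLSTM()  # hypothetical reconstruction from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):  # epoch count is a placeholder
    for features, labels in train_loader:
        optimizer.zero_grad()
        logits = model(features)          # (batch, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
````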
|
**Example Usage (Code Snippet)**

If you want to use this model locally:
|
````python
import cv2
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Load the pretrained ViT backbone
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit_model.eval()

# Load the custom ViT-LSTM model (saved as a full module; on recent PyTorch
# versions, unpickling a module requires weights_only=False)
model = torch.load("Vit-LSTM.pth", weights_only=False)
model.eval()

# Read frames from an example video
video_path = "example.mp4"
cap = cv2.VideoCapture(video_path)
frames = []

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV returns BGR
    frames.append(Image.fromarray(frame))

cap.release()

# Keep at most 16 evenly spaced frames, matching the pipeline described above
if len(frames) > 16:
    step = len(frames) / 16
    frames = [frames[int(i * step)] for i in range(16)]

# Extract per-frame ViT features (mean-pooled over patch tokens)
inputs = vit_processor(images=frames, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    features = vit_model(inputs).last_hidden_state.mean(dim=1)  # (num_frames, 768)

    # Predict: add a batch dimension and classify the frame sequence
    features = features.unsqueeze(0)  # (1, num_frames, 768)
    output = model(features)

predicted_class = torch.argmax(output, dim=1).item()

LABELS = ["BaseballPitch", "Basketball", "BenchPress", "Biking", "Billiards"]
print("Predicted Action:", LABELS[predicted_class])
````
|
|
|
**Contributors**

- Saurav Dhiani – Model Development & Deployment
- ViT & LSTM – Core ML Architecture