---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
pipeline_tag: video-classification
tags:
- Action
- Vit
- Vit-LSTM
- video
- classification
- sportlabels
- pytorch
- LSTM
---
**ViT-LSTM Action Recognition**
**Overview**
This project implements an action recognition model using a ViT-LSTM architecture. It takes a short video as input and predicts the action performed in it. A ViT extracts a feature vector from each frame, and an LSTM processes the resulting sequence to capture temporal dependencies.
**Model Details**
Base Model: ViT-Base-Patch16-224
Architecture: ViT (Feature Extractor) + LSTM (Temporal Modeling)
Number of Classes: 5
Dataset: Custom dataset with the following action categories:
BaseballPitch
Basketball
BenchPress
Biking
Billiards
**How It Works**
Extract Frames: The model extracts up to 16 frames from the uploaded video.
Feature Extraction: Each frame is passed through ViT, and feature vectors are obtained.
Temporal Processing: The LSTM processes these features to capture motion information.
Prediction: The final output is classified into one of the 5 action categories.
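The frame extraction step can be illustrated with a small helper that chooses which frames to keep. This is only a sketch: the function name `sample_frame_indices` is hypothetical, and evenly spaced sampling is an assumption, since the model card states the 16-frame limit but not how frames are chosen.

```python
from typing import List


def sample_frame_indices(total_frames: int, max_frames: int = 16) -> List[int]:
    """Pick up to `max_frames` evenly spaced frame indices (hypothetical helper)."""
    if total_frames <= 0 or max_frames <= 0:
        return []
    if total_frames <= max_frames:
        # Short clip: keep every frame.
        return list(range(total_frames))
    if max_frames == 1:
        return [0]
    # Spread the picks across the whole clip, from the first frame to the last.
    step = (total_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


# Example: a 300-frame clip is reduced to 16 indices spanning the whole video.
indices = sample_frame_indices(total_frames=300)
```

Sampling across the full clip, rather than taking the first 16 frames, keeps the start, middle, and end of the action visible to the LSTM.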
**Model Training Details**
Feature Dimension: 768
LSTM Hidden Dimension: 512
Number of LSTM Layers: 2 (Bidirectional)
Dropout: 0.3
Optimizer: Adam
Loss Function: Cross-Entropy Loss
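The hyperparameters above could be wired together roughly as follows. This is a minimal sketch, not the author's actual model definition: the class name `ViTLSTMClassifier`, the mean pooling over time, where dropout is applied, the learning rate, and the random stand-in data are all assumptions.

```python
import torch
import torch.nn as nn


class ViTLSTMClassifier(nn.Module):
    """Hypothetical LSTM classifier over per-frame ViT features."""

    def __init__(self, feature_dim=768, hidden_dim=512, num_layers=2,
                 num_classes=5, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout,  # applied between the stacked LSTM layers
        )
        self.dropout = nn.Dropout(dropout)
        # A bidirectional LSTM concatenates forward and backward states.
        self.fc = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, features):
        # features: (batch, num_frames, feature_dim)
        outputs, _ = self.lstm(features)
        # Mean-pool over time (an assumption; using the last step is also common).
        pooled = outputs.mean(dim=1)
        return self.fc(self.dropout(pooled))


model = ViTLSTMClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is an assumption
criterion = nn.CrossEntropyLoss()

# One training step on random stand-in data: 4 clips of 16 frames each.
features = torch.randn(4, 16, 768)
labels = torch.randint(0, 5, (4,))

model.train()
optimizer.zero_grad()
logits = model(features)          # (4, 5) raw class scores
loss = criterion(logits, labels)  # cross-entropy expects raw logits
loss.backward()
optimizer.step()
```

In a real training run, `features` would come from the ViT feature extractor rather than `torch.randn`, and the step would be repeated over batches from a data loader.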
**Example Usage (Code Snippet)**
If you want to use this model locally:
````
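import cv2
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Load the pretrained ViT used as the frame feature extractor
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit_model.eval()

# Load the custom ViT-LSTM model. The file is loaded as a fully pickled model,
# so its class definition must be importable here. Recent PyTorch releases
# need weights_only=False for this; only load checkpoints you trust.
model = torch.load("Vit-LSTM.pth", map_location="cpu", weights_only=False)
model.eval()

# Read the frames of an example video
video_path = "example.mp4"
cap = cv2.VideoCapture(video_path)
frames = []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV decodes frames as BGR
    frames.append(Image.fromarray(frame))
cap.release()

# Keep at most 16 frames (evenly spaced sampling is an assumption)
if len(frames) > 16:
    step = len(frames) / 16
    frames = [frames[int(i * step)] for i in range(16)]

with torch.no_grad():
    # Extract one 768-dim vector per frame by averaging the ViT token embeddings
    inputs = vit_processor(images=frames, return_tensors="pt")["pixel_values"]
    features = vit_model(inputs).last_hidden_state.mean(dim=1)

    # Predict
    features = features.unsqueeze(0)  # Add batch dimension: (1, num_frames, 768)
    output = model(features)

predicted_class = torch.argmax(output, dim=1).item()
LABELS = ["BaseballPitch", "Basketball", "BenchPress", "Biking", "Billiards"]
print("Predicted Action:", LABELS[predicted_class])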
````
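The usage snippet reports only the top class. If a confidence score per class is useful, the model output can be passed through a softmax. The sketch below assumes the model returns raw logits (consistent with training under cross-entropy loss) and uses made-up stand-in values in place of a real `output` tensor.

```python
import torch

LABELS = ["BaseballPitch", "Basketball", "BenchPress", "Biking", "Billiards"]

# Stand-in logits for one video; in practice this would be the model's `output`.
output = torch.tensor([[2.0, 0.5, 0.1, -1.0, 0.3]])

# Softmax turns the logits into probabilities that sum to 1.
probs = torch.softmax(output, dim=1).squeeze(0)

# Rank the classes from most to least likely.
ranked = sorted(zip(LABELS, probs.tolist()), key=lambda pair: pair[1], reverse=True)
for label, p in ranked:
    print(f"{label}: {p:.3f}")
```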
**Contributors**
Saurav Dhiani: Model Development & Deployment
ViT & LSTM: Core ML Architecture