---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
pipeline_tag: video-classification
tags:
- Action
- Vit
- Vit-LSTM
- video
- classification
- sportlabels
- pytorch
- LSTM
---

**ViT-LSTM Action Recognition**

**Overview**

This project implements an action recognition model with a ViT-LSTM architecture. It takes a short video as input and predicts the action performed in it. The model extracts frame-wise ViT features and processes them with an LSTM to capture temporal dependencies.

**Model Details**

- Base Model: ViT-Base-Patch16-224
- Architecture: ViT (feature extractor) + LSTM (temporal modeling)
- Number of Classes: 5
- Dataset: custom dataset with the following action categories:
  - BaseballPitch
  - Basketball
  - BenchPress
  - Biking
  - Billiards

**How It Works**

1. Frame extraction – up to 16 frames are sampled from the uploaded video.
2. Feature extraction – each frame is passed through ViT to obtain a feature vector.
3. Temporal processing – the LSTM processes the sequence of frame features to capture motion information.
4. Prediction – the final output is classified into one of the 5 action categories.

**Model Training Details**

- Feature Dimension: 768
- LSTM Hidden Dimension: 512
- Number of LSTM Layers: 2 (bidirectional)
- Dropout: 0.3
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss

**Example Usage (Code Snippet)**

If you want to use this model locally (loading `Vit-LSTM.pth` requires a matching ViT-LSTM class definition to be importable; see the architecture sketch at the end of this card):

````python
import torch
from transformers import ViTImageProcessor, ViTModel
from PIL import Image
import cv2

# Load pretrained ViT feature extractor
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit_model.eval()

# Load the custom ViT-LSTM model (saved as a full model object with torch.save)
model = torch.load("Vit-LSTM.pth", map_location="cpu")
model.eval()

# Read the example video frame by frame
video_path = "example.mp4"
cap = cv2.VideoCapture(video_path)
frames = []
while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV reads BGR; ViT expects RGB
    frames.append(Image.fromarray(frame))
cap.release()

# Sample up to 16 evenly spaced frames, matching the model's input length
num_frames = 16
if len(frames) > num_frames:
    indices = torch.linspace(0, len(frames) - 1, num_frames).long().tolist()
    frames = [frames[i] for i in indices]

with torch.no_grad():
    # Extract one 768-d feature vector per frame (mean over ViT patch tokens)
    inputs = vit_processor(images=frames, return_tensors="pt")["pixel_values"]
    features = vit_model(pixel_values=inputs).last_hidden_state.mean(dim=1)

    # Predict
    features = features.unsqueeze(0)  # Add batch dimension: (1, num_frames, 768)
    output = model(features)
    predicted_class = torch.argmax(output, dim=1).item()

LABELS = ["BaseballPitch", "Basketball", "BenchPress", "Biking", "Billiards"]
print("Predicted Action:", LABELS[predicted_class])
````

**Contributors**

- Saurav Dhiani – Model Development & Deployment
- ViT & LSTM – Core ML Architecture
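
**Model Architecture (Reference Sketch)**

The checkpoint `Vit-LSTM.pth` stores a full model object, so a matching ViT-LSTM class definition must be importable when `torch.load` runs. The exact class is not published in this card; the sketch below is a minimal reconstruction inferred from the training details listed above (768-d input features, 512 hidden units, 2 bidirectional LSTM layers, dropout 0.3, 5 classes). The class name `ViTLSTMClassifier` and its attribute names are assumptions, so deserializing the released checkpoint will only work if the original definition used the same structure.

````python
import torch
import torch.nn as nn

class ViTLSTMClassifier(nn.Module):
    """Hypothetical reconstruction of the ViT-LSTM head described in this card."""

    def __init__(self, feature_dim=768, hidden_dim=512, num_layers=2,
                 num_classes=5, dropout=0.3):
        super().__init__()
        # Bidirectional LSTM over the sequence of per-frame ViT features
        self.lstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout,
        )
        self.dropout = nn.Dropout(dropout)
        # 2 * hidden_dim because forward and backward states are concatenated
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x):
        # x: (batch, num_frames, feature_dim)
        lstm_out, _ = self.lstm(x)
        # Classify from the representation of the last time step
        last_step = lstm_out[:, -1, :]
        return self.fc(self.dropout(last_step))


# Quick shape check with dummy input: one clip of 16 frame features
logits = ViTLSTMClassifier()(torch.randn(1, 16, 768))
print(logits.shape)  # torch.Size([1, 5])
````

Training would pair this module with `torch.optim.Adam` and `nn.CrossEntropyLoss`, as listed in the training details, but the training script itself is not included in this card.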