---
license: mit
language:
- en
base_model:
- google/vit-base-patch16-224
pipeline_tag: video-classification
tags:
- Action
- Vit
- Vit-LSTM
- video
- classification
- sportlabels
- pytorch
- LSTM
---
|
**ViT-LSTM Action Recognition**

**Overview**

This project implements an action recognition model using a ViT-LSTM architecture. It takes a short video as input and predicts the action performed in it. The model extracts frame-wise ViT features and processes them with an LSTM to capture temporal dependencies.
|
|
|
**Model Details**

- Base Model: ViT-Base-Patch16-224 (`google/vit-base-patch16-224`)
- Architecture: ViT (feature extractor) + LSTM (temporal modeling)
- Number of Classes: 5
- Dataset: custom dataset with the following action categories:
  - BaseballPitch
  - Basketball
  - BenchPress
  - Biking
  - Billiards
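
The model definition itself is not included on this card. Below is a minimal sketch of a compatible ViT-LSTM head, assuming the hyperparameters listed under Model Training Details (768-d features, 512 hidden units, 2 bidirectional LSTM layers, 0.3 dropout, 5 classes); the class name and the last-time-step pooling are illustrative choices, not confirmed details of the released checkpoint:

````python
import torch
import torch.nn as nn

class ViTLSTM(nn.Module):
    """Hypothetical reconstruction: classifies a sequence of per-frame ViT features."""

    def __init__(self, feature_dim=768, hidden_dim=512, num_layers=2,
                 num_classes=5, dropout=0.3):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=feature_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout,
        )
        # A bidirectional LSTM doubles the size of the output features.
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        # x: (batch, num_frames, feature_dim)
        out, _ = self.lstm(x)
        # Classify from the representation at the last time step (assumption).
        return self.classifier(out[:, -1, :])
````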
|
**Working**

1. Extract Frames → up to 16 frames are sampled from the uploaded video.
2. Feature Extraction → each frame is passed through ViT to obtain a feature vector.
3. Temporal Processing → the LSTM processes these features to capture motion information.
4. Prediction → the final output is classified into one of the 5 action categories.
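
The card does not specify the exact sampling strategy for step 1; a minimal sketch assuming evenly spaced frame indices, with `sample_frames` as a hypothetical helper:

````python
import cv2
import numpy as np
from PIL import Image

def sample_frames(video_path, max_frames=16):
    """Read a video and return up to `max_frames` evenly spaced RGB frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Evenly spaced indices across the clip (assumption: uniform sampling).
    indices = set(np.linspace(0, max(total - 1, 0),
                              num=min(max_frames, total), dtype=int).tolist())
    frames, idx = [], 0
    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        if idx in indices:
            # OpenCV returns BGR; convert to RGB for the ViT processor.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames
````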
|
|
|
**Model Training Details**

- Feature Dimension: 768
- LSTM Hidden Dimension: 512
- Number of LSTM Layers: 2 (bidirectional)
- Dropout: 0.3
- Optimizer: Adam
- Loss Function: Cross-Entropy Loss
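
For reference, a minimal training-loop sketch using the optimizer and loss named above. It assumes the `ViTLSTM` class sketched earlier; the learning rate, batch size, epoch count, and dummy data are illustrative placeholders, not values from this card:

````python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data just to make the sketch runnable:
# 8 clips, 16 frames each, 768-d ViT features, 5 classes.
dummy = TensorDataset(torch.randn(8, 16, 768), torch.randint(0, 5, (8,)))
train_loader = DataLoader(dummy, batch_size=4, shuffle=True)

model = ViTLSTM()  # hypothetical reconstruction from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):  # epoch count is a placeholder
    for features, labels in train_loader:
        optimizer.zero_grad()
        logits = model(features)          # (batch, num_classes)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()
````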
|
**Example Usage (Code Snippet)**

If you want to use this model locally:
|
````python
import cv2
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

# Load the pretrained ViT backbone
vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224")
vit_model.eval()

# Load the custom ViT-LSTM model (saved as a full module; on recent PyTorch
# versions, unpickling a module requires weights_only=False)
model = torch.load("Vit-LSTM.pth", weights_only=False)
model.eval()

# Read frames from an example video
video_path = "example.mp4"
cap = cv2.VideoCapture(video_path)
frames = []

while cap.isOpened():
    ret, frame = cap.read()
    if not ret:
        break
    frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)  # OpenCV returns BGR
    frames.append(Image.fromarray(frame))

cap.release()

# Keep at most 16 evenly spaced frames, matching the pipeline described above
if len(frames) > 16:
    step = len(frames) / 16
    frames = [frames[int(i * step)] for i in range(16)]

# Extract per-frame ViT features (mean-pooled over patch tokens)
inputs = vit_processor(images=frames, return_tensors="pt")["pixel_values"]
with torch.no_grad():
    features = vit_model(inputs).last_hidden_state.mean(dim=1)  # (num_frames, 768)

    # Predict: add a batch dimension and classify the frame sequence
    features = features.unsqueeze(0)  # (1, num_frames, 768)
    output = model(features)

predicted_class = torch.argmax(output, dim=1).item()

LABELS = ["BaseballPitch", "Basketball", "BenchPress", "Biking", "Billiards"]
print("Predicted Action:", LABELS[predicted_class])
````
|
|
|
**Contributors**

- Saurav Dhiani – Model Development & Deployment
- ViT & LSTM – Core ML Architecture