Scene Classifier - ResNet18

This model is a fine-tuned version of ResNet-18 pretrained on ImageNet. It classifies images into 4 scene categories: cafe, gym, library, and outdoor.

Model description

This model was trained on a dataset of video frames extracted from recordings of different indoor and outdoor locations. The ResNet-18 architecture was chosen for its balance of accuracy and computational efficiency, using transfer learning from ImageNet-pretrained weights. Only the final fully-connected layer was retrained for the 4-class classification task.

The model is part of a larger pipeline that generates contextual music based on scene classification combined with weather and temporal metadata.

Intended uses & limitations

Intended use:

  • Scene classification for context-aware applications
  • Image-to-music generation pipelines
  • Indoor/outdoor scene detection
  • Educational demonstrations of transfer learning

Limitations:

  • Limited to 4 specific scene categories (cafe, gym, library, outdoor)
  • Training data was collected only on the Carnegie Mellon University (CMU) campus
  • Trained on relatively small dataset extracted from videos
  • May not generalize well to significantly different scene compositions
  • Performance may degrade on low-quality or heavily edited images
  • Indoor scenes may be confused if they share similar visual features

Training and evaluation data

Dataset: madhavkarthi/project-1-location-classification-dataset, consisting of video frames extracted from 4 scene categories.

  • Classes: cafe, gym, library, outdoor
  • Source: Personal video recordings of various locations
  • Extraction: Sampled every 10th frame from videos (see the extraction sketch below)
  • Total frames: approximately 500 images
  • Format: JPEG, 224x224 resolution after preprocessing

The dataset represents real-world indoor and outdoor environments with varying lighting conditions, angles, and compositions.
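
The extraction script itself is not included in this card; a minimal sketch of every-10th-frame sampling using OpenCV (the extract_frames helper, its paths, and the JPEG naming are illustrative assumptions) could look like this:

import cv2
import os

def extract_frames(video_path, out_dir, step=10):
    """Save every `step`-th frame of a video to out_dir as a JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved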

Training procedure

Data preprocessing

Images were resized to 224x224 and converted to tensors.
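
Expressed with torchvision transforms, mirroring the inference-time transform in the Usage section (note that no ImageNet normalization is applied, consistent with that code):

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # fixed input size expected by ResNet-18
    transforms.ToTensor(),          # PIL image -> float tensor in [0, 1]
])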

Model architecture

  • Base model: ResNet-18 (ImageNet pretrained)
  • Modified layer: Final fully-connected layer changed from 1000 classes to 4 classes
  • Transfer learning: All layers except final FC layer retained pretrained weights (setup sketched below)
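
A minimal sketch of this setup; explicitly freezing the backbone is an assumption here, but it is consistent with the trainable-parameter count quoted later in this card:

import torch
from torchvision import models

# Load the ImageNet-pretrained backbone
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze all pretrained layers
for param in model.parameters():
    param.requires_grad = False

# Swap the 1000-class ImageNet head for a 4-class head;
# a freshly constructed Linear layer is trainable by default.
model.fc = torch.nn.Linear(model.fc.in_features, 4)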

Training hyperparameters

The following hyperparameters were used during training (a minimal training-loop sketch follows the list):

  • learning_rate: 1e-4
  • train_batch_size: 32
  • optimizer: Adam with default betas=(0.9, 0.999)
  • loss_function: CrossEntropyLoss
  • num_epochs: 3
  • device: CUDA (GPU accelerated)
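
Continuing from the setup and preprocessing sketches above, a training loop consistent with these hyperparameters might look as follows; the frames/ directory layout and the choice to pass only the head's parameters to Adam are assumptions, not stated in the card:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Assumed layout: frames/<class_name>/<frame>.jpg, one folder per scene category
train_dataset = datasets.ImageFolder("frames/", transform=preprocess)
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)  # default betas=(0.9, 0.999)

for epoch in range(3):
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    print(f"epoch {epoch + 1}: mean loss {running_loss / len(train_dataset):.4f}")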

Training results

Training was conducted over 3 epochs with consistent loss reduction:

Epoch   Training Loss   Status
1       0.3395          ✓
2       0.0111          ✓
3       0.0041          ✓

Note: Formal validation metrics were not computed during training; the model was validated qualitatively on held-out images.

Usage

The model classifies any input image into one of four classes: cafe, gym, library, outdoor.

Loading the model

import torch
from torchvision import models, transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)
model.load_state_dict(torch.load("pytorch_model.pth", map_location=device))
model = model.to(device)
model.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

class_labels = ["cafe", "gym", "library", "outdoor"]

Inference

image = Image.open("your_image.jpg")
if image.mode != 'RGB':
    image = image.convert('RGB')

input_tensor = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    outputs = model(input_tensor)
    predicted_idx = outputs.argmax(dim=1).item()
    predicted_class = class_labels[predicted_idx]
    confidence = torch.softmax(outputs, dim=1)[0][predicted_idx].item()

print(f"Predicted: {predicted_class} (confidence: {confidence:.2%})")

Framework versions

  • PyTorch: 2.0+
  • Torchvision: 0.15+
  • Python: 3.8+
  • Pillow: 9.0+

Model Architecture Details

ResNet-18 Structure:

  • Input: 3x224x224 RGB image
  • Convolutional layers with residual connections
  • Global average pooling
  • Final FC layer: 512 to 4 classes
  • Total parameters: approximately 11.7M (only about 2K trainable, in the final layer; see the check below)
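
These counts are easy to check: the final Linear(512, 4) layer has 512 × 4 weights plus 4 biases, i.e. 2,052 parameters. Assuming the frozen-backbone setup sketched earlier:

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total:,}  trainable: {trainable:,}")  # expect ~11.7M total, 2,052 trainable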

Additional Information

This model was developed as part of a course project (24-679) exploring multimodal AI systems. It serves as the visual classification component in an image-to-music generation pipeline that combines scene recognition, metadata extraction, weather context, and music synthesis.

AI tools (ChatGPT, Claude) were used in the creation of this model and dataset.
