Scene Classifier - ResNet18

This model is a fine-tuned version of ResNet-18 pretrained on ImageNet. It classifies images into 4 scene categories: cafe, gym, library, and outdoor.

Model description

This model was trained on a dataset of video frames extracted from recordings of different indoor and outdoor locations. The ResNet-18 architecture was chosen for its balance of accuracy and computational efficiency, using transfer learning from ImageNet-pretrained weights. Only the final fully-connected layer was retrained for the 4-class classification task.

The model is part of a larger pipeline that generates contextual music based on scene classification combined with weather and temporal metadata.

Intended uses & limitations

Intended use:

  • Scene classification for context-aware applications
  • Image-to-music generation pipelines
  • Indoor/outdoor scene detection
  • Educational demonstrations of transfer learning

Limitations:

  • Limited to 4 specific scene categories (cafe, gym, library, outdoor)
  • Training data was collected only on the Carnegie Mellon University (CMU) campus
  • Trained on relatively small dataset extracted from videos
  • May not generalize well to significantly different scene compositions
  • Performance may degrade on low-quality or heavily edited images
  • Indoor scenes may be confused if they share similar visual features

Training and evaluation data

Dataset: madhavkarthi/project-1-location-classification-dataset, consisting of video frames extracted from 4 scene categories.

  • Classes: cafe, gym, library, outdoor
  • Source: Personal video recordings of various locations
  • Extraction: Sampled every 10th frame from videos (see the extraction sketch below)
  • Total frames: approximately 500 images
  • Format: JPEG, 224x224 resolution after preprocessing

The dataset represents real-world indoor and outdoor environments with varying lighting conditions, angles, and compositions.
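
The extraction script itself is not included in this card; a minimal sketch of every-10th-frame sampling using OpenCV (the extract_frames helper, its paths, and the JPEG naming are illustrative assumptions) could look like this:

import cv2
import os

def extract_frames(video_path, out_dir, step=10):
    """Save every `step`-th frame of a video to out_dir as a JPEG."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved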

Training procedure

Data preprocessing

Images were resized to 224x224 and converted to tensors.
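
Expressed with torchvision transforms, mirroring the inference-time transform in the Usage section (note that no ImageNet normalization is applied, consistent with that code):

from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # fixed input size expected by ResNet-18
    transforms.ToTensor(),          # PIL image -> float tensor in [0, 1]
])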

Model architecture

  • Base model: ResNet-18 (ImageNet pretrained)
  • Modified layer: Final fully-connected layer changed from 1000 classes to 4 classes
  • Transfer learning: All layers except final FC layer retained pretrained weights (setup sketched below)
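
A minimal sketch of this setup; explicitly freezing the backbone is an assumption here, but it is consistent with the trainable-parameter count quoted later in this card:

import torch
from torchvision import models

# Load the ImageNet-pretrained backbone
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

# Freeze all pretrained layers
for param in model.parameters():
    param.requires_grad = False

# Swap the 1000-class ImageNet head for a 4-class head;
# a freshly constructed Linear layer is trainable by default.
model.fc = torch.nn.Linear(model.fc.in_features, 4)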

Training hyperparameters

The following hyperparameters were used during training (a minimal training-loop sketch follows the list):

  • learning_rate: 1e-4
  • train_batch_size: 32
  • optimizer: Adam with default betas=(0.9, 0.999)
  • loss_function: CrossEntropyLoss
  • num_epochs: 3
  • device: CUDA (GPU accelerated)
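
Continuing from the setup and preprocessing sketches above, a training loop consistent with these hyperparameters might look as follows; the frames/ directory layout and the choice to pass only the head's parameters to Adam are assumptions, not stated in the card:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Assumed layout: frames/<class_name>/<frame>.jpg, one folder per scene category
train_dataset = datasets.ImageFolder("frames/", transform=preprocess)
loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-4)  # default betas=(0.9, 0.999)

for epoch in range(3):
    model.train()
    running_loss = 0.0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        running_loss += loss.item() * images.size(0)
    print(f"epoch {epoch + 1}: mean loss {running_loss / len(train_dataset):.4f}")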

Training results

Training was conducted over 3 epochs with consistent loss reduction:

Epoch   Training Loss   Status
1       0.3395          ✓
2       0.0111          ✓
3       0.0041          ✓

Note: Formal validation metrics were not computed during training; the model was validated qualitatively on held-out images.

Usage

The model classifies any input image into one of four classes: cafe, gym, library, outdoor.

Loading the model

import torch
from torchvision import models, transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = models.resnet18(weights=None)
model.fc = torch.nn.Linear(model.fc.in_features, 4)
model.load_state_dict(torch.load("pytorch_model.pth", map_location=device))
model = model.to(device)
model.eval()

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

class_labels = ["cafe", "gym", "library", "outdoor"]

Inference

image = Image.open("your_image.jpg")
if image.mode != 'RGB':
    image = image.convert('RGB')

input_tensor = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    outputs = model(input_tensor)
    predicted_idx = outputs.argmax(dim=1).item()
    predicted_class = class_labels[predicted_idx]
    confidence = torch.softmax(outputs, dim=1)[0][predicted_idx].item()

print(f"Predicted: {predicted_class} (confidence: {confidence:.2%})")

Framework versions

  • PyTorch: 2.0+
  • Torchvision: 0.15+
  • Python: 3.8+
  • Pillow: 9.0+

Model Architecture Details

ResNet-18 Structure:

  • Input: 3x224x224 RGB image
  • Convolutional layers with residual connections
  • Global average pooling
  • Final FC layer: 512 to 4 classes
  • Total parameters: approximately 11.7M (only about 2K trainable, in the final layer; see the check below)
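
These counts are easy to check: the final Linear(512, 4) layer has 512 × 4 weights plus 4 biases, i.e. 2,052 parameters. Assuming the frozen-backbone setup sketched earlier:

total = sum(p.numel() for p in model.parameters())
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"total: {total:,}  trainable: {trainable:,}")  # expect ~11.7M total, 2,052 trainable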

Additional Information

This model was developed as part of a course project (24-679) exploring multimodal AI systems. It serves as the visual classification component in an image-to-music generation pipeline that combines scene recognition, metadata extraction, weather context, and music synthesis.

AI tools (ChatGPT, Claude) were used in the creation of this model and dataset.
