The model code and documentation repository is at https://github.com/RichardScottOZ/comic-analysis

This model uses transformer-based multimodal fusion of image and text to build embeddings for querying comics by similarity or by text.

More detail is available in the repo above.


---
language: en
tags:
- vision
- text
- multimodal
- comics
- contrastive-learning
- vit
- roberta
license: mit
---

ClosureLiteSimple (Version 1 - Comic Panel Encoder)

ClosureLiteSimple is the Version 1 precursor to the Stage 3 panel encoder within the Comic Analysis Framework.

It is a multimodal neural network designed to fuse image crops, textual dialogue, and compositional metadata into a unified 384-dimensional embedding per comic panel, and can also aggregate these panels into a single Page-level embedding using an attention mechanism.

(Note: This model is considered deprecated in favor of the newer comic-panel-encoder-v1 which utilizes SigLIP, ResNet, and an improved Adaptive Fusion Gate).

Model Architecture

The ClosureLiteSimple model consists of the PanelAtomizerLite and a SimpleAttention mechanism:

  1. Vision Encoder (google/vit-base-patch16-224):
    • Extracts features from $224 \times 224$ panel image crops.
    • Outputs projected to $384$-d.
  2. Text Encoder (roberta-base):
    • Encodes panel dialogue, narration, or OCR text.
    • Outputs projected to $384$-d.
  3. Compositional Encoder (MLP):
    • Takes a 7-dimensional vector representing the bounding box geometry (e.g., aspect ratio, relative area, normalized center coordinates).
    • Projects through hidden layers to $384$-d.
  4. Gated Fusion (GatedFusion):
    • Concatenates the three modality outputs and computes a learned softmax gate.
    • Outputs a weighted sum of the Vision, Text, and Composition features, resulting in the final $384$-d Panel Embedding.
  5. Page Aggregation (SimpleAttention):
    • Uses multi-head attention to pool the variable number of Panel Embeddings on a single page into a unified $384$-d Page Embedding.
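Steps 1–4 above can be sketched in a few lines of PyTorch. This is an illustrative reconstruction only: the class name `GatedFusionSketch`, the single-linear-layer gate, and its sizes are assumptions; the actual `GatedFusion` implementation lives in the repository's src/version1/ code.

```python
import torch
import torch.nn as nn

class GatedFusionSketch(nn.Module):
    """Illustrative softmax-gated fusion over three 384-d modality vectors.

    The gate architecture here (one linear layer producing one logit per
    modality) is an assumption, not the repository's exact implementation.
    """
    def __init__(self, d=384):
        super().__init__()
        # One scalar gate logit per modality, from the concatenated features
        self.gate = nn.Linear(3 * d, 3)

    def forward(self, vision, text, comp):
        # vision, text, comp: (..., d) projected modality features
        stacked = torch.stack([vision, text, comp], dim=-2)       # (..., 3, d)
        logits = self.gate(torch.cat([vision, text, comp], dim=-1))
        weights = torch.softmax(logits, dim=-1).unsqueeze(-1)     # (..., 3, 1)
        return (weights * stacked).sum(dim=-2)                    # (..., d)

# Two panels' worth of dummy 384-d modality features
fused = GatedFusionSketch()(torch.randn(2, 384),
                            torch.randn(2, 384),
                            torch.randn(2, 384))
print(fused.shape)  # torch.Size([2, 384])
```

Because the softmax weights always sum to 1 across the three modalities, an uninformative modality can still receive substantial weight, which is the dominance issue noted under Limitations.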

Usage

The codebase for this model resides in the src/version1/ directory of the repository.

Example: Loading and Inference

```python
import torch
from PIL import Image
import torchvision.transforms as T
from transformers import AutoTokenizer

# Requires cloning the GitHub repo
from closure_lite_simple_framework import ClosureLiteSimple

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# 1. Initialize Model
model = ClosureLiteSimple(d=384, num_heads=4, temperature=0.1).to(device)

# Load weights from Hugging Face
state_dict = torch.hub.load_state_dict_from_url(
    "https://huggingface.co/RichardScottOZ/closure-lite-simple/resolve/main/best_model.pt",
    map_location=device
)
if 'model_state_dict' in state_dict:
    state_dict = state_dict['model_state_dict']
model.load_state_dict(state_dict)
model.eval()

# 2. Prepare Inputs (Example: A page with 2 panels)
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Dummy Image Crops (B=1 page, N=2 panels, C=3, H=224, W=224)
images = torch.stack([
    transform(Image.new('RGB', (224, 224))),
    transform(Image.new('RGB', (224, 224)))
]).unsqueeze(0).to(device)

# Dummy Text
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
text_enc = tokenizer(["Panel 1 text", "Panel 2 text"], return_tensors='pt', padding=True)
input_ids = text_enc['input_ids'].unsqueeze(0).to(device)
attention_mask = text_enc['attention_mask'].unsqueeze(0).to(device)

# Dummy Composition (B=1, N=2, F=7)
comp_feats = torch.zeros((1, 2, 7)).to(device)

# Valid Panel Mask (B=1, N=2)
panel_mask = torch.tensor([[True, True]]).to(device)

# 3. Generate Embeddings
with torch.no_grad():
    panel_embeddings, page_embedding = model(
        images, input_ids, attention_mask, comp_feats, panel_mask
    )

print(f"Panel Embeddings Shape: {panel_embeddings.shape}") # (1, 2, 384)
print(f"Page Embedding Shape: {page_embedding.shape}")     # (1, 384)
```
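The embeddings produced above can then drive the similarity queries mentioned at the top of this card. The following is a minimal sketch using plain cosine similarity; the `query_embedding` tensor is a random stand-in here, and in practice you would encode the query (text or panel) through the same model.

```python
import torch
import torch.nn.functional as F

# Stand-ins for real model outputs: (B=1, N=2, 384) panel embeddings
# and a single 384-d query embedding (an assumption for illustration).
panel_embeddings = torch.randn(1, 2, 384)
query_embedding = torch.randn(384)

# Cosine similarity = dot product of L2-normalized vectors
panels = F.normalize(panel_embeddings.squeeze(0), dim=-1)  # (2, 384)
query = F.normalize(query_embedding, dim=-1)               # (384,)
scores = panels @ query                                    # (2,)

best = int(scores.argmax())
print(f"Most similar panel: {best}, score {scores[best]:.3f}")
```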

Intended Use & Limitations

  • Intended Use: Originally designed for exploring multimodal embedding spaces and building basic visual/textual retrieval prototypes (like CoMiX v1).
  • Limitations:
    • Modality Dominance: Analysis of this model revealed that if one modality (e.g., text) was missing or uninformative during inference, the GatedFusion mechanism struggled to fall back gracefully to the visual features, often resulting in collapsed or non-discriminative embeddings for single-modality queries.
    • Deprecated: This architecture has been superseded by Stage 3 (comic-panel-encoder-v1), which utilizes independent modality projection and a masked Adaptive Fusion gate to solve the dominance issues.
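
To make the fallback behavior concrete: one way to let a gate ignore an absent modality is to mask its logit before the softmax, so the remaining weights renormalize over the modalities that are present. This is only an illustrative sketch of that idea, not the actual Stage 3 Adaptive Fusion implementation.

```python
import torch

def masked_gate_sketch(logits, present):
    """Illustrative masked softmax gate.

    logits:  (3,) gate logits for vision, text, composition
    present: (3,) bool mask; absent modalities receive exactly zero weight
    """
    masked = logits.masked_fill(~present, float('-inf'))
    return torch.softmax(masked, dim=-1)

# Text modality missing: its weight is forced to 0, and the vision and
# composition weights renormalize to sum to 1.
w = masked_gate_sketch(torch.tensor([0.2, 1.5, -0.3]),
                       torch.tensor([True, False, True]))
print(w)
```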

Citation

Please reference the Comic Analysis GitHub Repository when utilizing this architecture.
