---
license: apache-2.0
language:
- en
base_model:
- google/bert_uncased_L-4_H-256_A-4
- WinKawaks/vit-small-patch16-224
pipeline_tag: image-to-text
library_name: transformers
tags:
- vit
- bert
- vision
- caption
- captioning
- image
---

An image captioning model, based on bert-mini and vit-small, weighing only 133 MB! Works very fast on CPU.

```python
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, torch, time
from PIL import Image

model_path = "cnmoro/mini-image-captioning"
device = torch.device("cpu")

# load the image captioning model and the corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained(model_path).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# preprocess an image
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor(image, return_tensors="pt").pixel_values

start = time.time()

# generate a caption - suggested settings
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3  # you can use 1 for even faster inference with a small drop in quality
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

end = time.time()

print(generated_text)
# a large group of people walking through a busy city.

print(f"Time taken: {end - start} seconds")
# Time taken: 0.19002342224121094 seconds - on CPU!
```
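
Since this is a standard `VisionEncoderDecoderModel`, several images can also be captioned in a single call: the image processor stacks a list of images into one batch, and `batch_decode` returns one caption per image. Below is a minimal sketch (not from the original card); the file names are placeholders for your own images.

```python
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
from PIL import Image

model_path = "cnmoro/mini-image-captioning"
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# hypothetical local files - replace with your own images
paths = ["street.jpg", "dog.jpg", "kitchen.jpg"]
images = [Image.open(p).convert("RGB") for p in paths]

# the processor stacks the images into a single (batch, 3, 224, 224) tensor
pixel_values = image_processor(images, return_tensors="pt").pixel_values

# beam search as in the example above; one caption is generated per image
generated_ids = model.generate(pixel_values, num_beams=3)
captions = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)

for path, caption in zip(paths, captions):
    print(f"{path}: {caption}")
```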