timm/MobileCLIP2-S3-OpenCLIP


Model card for MobileCLIP2-S3-OpenCLIP

These weights and model card are adapted from the original Apple model at https://huggingface.co/apple/MobileCLIP2-S3. This version uses canonical OpenCLIP configs and weight naming.
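Because this version follows canonical OpenCLIP naming, the checkpoint can also be loaded directly from the Hub through OpenCLIP's `hf-hub:` support. A minimal sketch, assuming the hub tag below (taken from this repository's name) resolves correctly:

```python
import open_clip

# Load model, preprocessing transform, and tokenizer straight from the Hugging Face Hub.
# The 'hf-hub:timm/MobileCLIP2-S3-OpenCLIP' tag is assumed from this repository's name.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:timm/MobileCLIP2-S3-OpenCLIP')
tokenizer = open_clip.get_tokenizer('hf-hub:timm/MobileCLIP2-S3-OpenCLIP')
model.eval()
```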

MobileCLIP2 was introduced in MobileCLIP2: Improving Multi-Modal Reinforced Training (TMLR August 2025 Featured), by Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander T Toshev, Oncel Tuzel, Hadi Pouransari.

This repository contains the MobileCLIP2-S3 checkpoint.

Highlights

  • MobileCLIP2-S4 matches the accuracy of SigLIP-SO400M/14 with 2x fewer parameters and surpasses DFN ViT-L/14 at 2.5x lower latency measured on iPhone 12 Pro Max.
  • MobileCLIP-S3/S4 are our new architectures trained on MobileCLIP's training dataset, DataCompDR-1B (shown as dashed lines in the original results figure).
  • Our smallest variant MobileCLIP-S0 obtains similar zero-shot performance as OpenAI's ViT-B/16 model while being 4.8x faster and 2.8x smaller.
  • MobileCLIP-S2 obtains better average zero-shot performance than SigLIP's ViT-B/16 model while being 2.3x faster and 2.1x smaller, and is trained on 3x fewer seen samples.
  • MobileCLIP-B (LT) attains a zero-shot ImageNet accuracy of 77.2%, which is significantly better than recent works such as DFN and SigLIP with similar architectures, and even better than OpenAI's ViT-L/14@336.

Checkpoints and Results (Original Apple links)

| Model | # Seen Samples (B) | # Params (M) (img + txt) | Latency (ms) (img + txt) | IN-1k Zero-Shot Top-1 Acc. (%) | Avg. Perf. (%) on 38 datasets |
|---|---|---|---|---|---|
| MobileCLIP2-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 71.5 | 59.7 |
| MobileCLIP2-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 77.2 | 64.1 |
| MobileCLIP2-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 79.4 | 65.8 |
| MobileCLIP2-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 80.7 | 66.8 |
| MobileCLIP2-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 81.9 | 67.8 |
| MobileCLIP2-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 81.9 | 67.5 |
| MobileCLIP-S0 | 13 | 11.4 + 42.4 | 1.5 + 1.6 | 67.8 | 58.1 |
| MobileCLIP-S1 | 13 | 21.5 + 63.4 | 2.5 + 3.3 | 72.6 | 61.3 |
| MobileCLIP-S2 | 13 | 35.7 + 63.4 | 3.6 + 3.3 | 74.4 | 63.7 |
| MobileCLIP-B | 13 | 86.3 + 63.4 | 10.4 + 3.3 | 76.8 | 65.2 |
| MobileCLIP-B (LT) | 36 | 86.3 + 63.4 | 10.4 + 3.3 | 77.2 | 65.8 |
| MobileCLIP-S3 | 13 | 125.1 + 123.6 | 8.0 + 6.6 | 78.3 | 66.3 |
| MobileCLIP-L/14 | 13 | 304.3 + 123.6 | 57.9 + 6.6 | 79.5 | 66.9 |
| MobileCLIP-S4 | 13 | 321.6 + 123.6 | 19.6 + 6.6 | 79.4 | 68.1 |

How to Use

```python
import torch
import open_clip
from PIL import Image
from urllib.request import urlopen
from timm.utils import reparameterize_model

model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S3', pretrained='dfndr2b')
model.eval()
tokenizer = open_clip.get_tokenizer('MobileCLIP2-S3')

# For inference/model exporting purposes, optionally reparameterize for better performance
model = reparameterize_model(model)

# Load an example image and candidate class prompts
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat", "a doughnut"])

# Encode, L2-normalize, and compare image and text embeddings
with torch.no_grad(), torch.amp.autocast(image.device.type):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
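The `reparameterize_model` call above folds the model's reparameterizable blocks into their inference-time form, which also simplifies export. Below is a minimal TorchScript export sketch for the image tower, continuing from the variables defined in the snippet above; the `ImageEncoder` wrapper and output filename are illustrative, not part of open_clip, and tracing behavior may vary with the tower implementation:

```python
import torch

# Wrap the image tower so it can be traced as a standalone module.
# ImageEncoder and the output path are illustrative, not part of open_clip.
class ImageEncoder(torch.nn.Module):
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        feats = self.clip_model.encode_image(x)
        # L2-normalize so the exported encoder emits unit-length embeddings
        return feats / feats.norm(dim=-1, keepdim=True)

encoder = ImageEncoder(model).eval()
with torch.no_grad():
    # `image` is the preprocessed example tensor from the snippet above
    traced = torch.jit.trace(encoder, image)
traced.save("mobileclip2_s3_image_encoder.pt")
```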