iSEE-Laboratory/llmdet_base


LLMDet (base variant)

The LLMDet model was proposed in LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.

LLMDet improves upon MM Grounding DINO and Grounding DINO by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the LLMDet collection. Note that these checkpoints are inference-only: they do not include the large language model that was used during training. Inference is identical to that of MM Grounding DINO.

Intended uses

You can use the raw model for zero-shot object detection.

Here's how to use the model for zero-shot object detection:

import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image


# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_base"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve results for the first (and only) image
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")

Training Data

This model was trained on:

  • Objects365 (O365)
  • GoldG
  • V3Det
  • GroundingCap-1M

Evaluation results

  • Here's a table of LLMDet models and their performance on LVIS (results from official repo):

    | Model | Pre-Train Data | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
    |---|---|---|---|---|---|---|---|---|---|
    | llmdet_tiny | (O365, GoldG, GRIT, V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
    | llmdet_base | (O365, GoldG, V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
    | llmdet_large | (O365V2, OpenImageV6, GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |

BibTeX entry and citation info

@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}