iSEE-Laboratory/llmdet_base


LLMDet (base variant)

The LLMDet model was proposed in LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models by Shenghao Fu, Qize Yang, Qijie Mo, Junkai Yan, Xihan Wei, Jingke Meng, Xiaohua Xie, Wei-Shi Zheng.

LLMDet improves upon MM Grounding DINO and Grounding DINO by co-training the detector with a large language model.

You can find all the LLMDet checkpoints under the LLMDet collection. Note that these checkpoints are inference-only: they do not include the large language model that was used during training. Inference is identical to that of MM Grounding DINO.

Intended uses

You can use the raw model for zero-shot object detection.

Here's how to use the model for zero-shot object detection:

import torch
from transformers import AutoModelForZeroShotObjectDetection, AutoProcessor
from transformers.image_utils import load_image


# Prepare processor and model
model_id = "iSEE-Laboratory/llmdet_base"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id).to(device)

# Prepare inputs
image_url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(image_url)
text_labels = [["a cat", "a remote control"]]
inputs = processor(images=image, text=text_labels, return_tensors="pt").to(device)

# Run inference
with torch.no_grad():
    outputs = model(**inputs)

# Postprocess outputs
results = processor.post_process_grounded_object_detection(
    outputs,
    threshold=0.4,
    target_sizes=[(image.height, image.width)]
)

# Retrieve results for the first (and only) image
result = results[0]
for box, score, label in zip(result["boxes"], result["scores"], result["labels"]):
    box = [round(x, 2) for x in box.tolist()]
    print(f"Detected {label} with confidence {round(score.item(), 3)} at location {box}")

Training Data

This model was trained on:

  • Objects365 (O365)
  • GoldG
  • V3Det
  • GroundingCap-1M

Evaluation results

  • Here's a table of LLMDet models and their performance on LVIS (results from official repo):

    | Model | Pre-Train Data | MiniVal APr | MiniVal APc | MiniVal APf | MiniVal AP | Val1.0 APr | Val1.0 APc | Val1.0 APf | Val1.0 AP |
    |---|---|---|---|---|---|---|---|---|---|
    | llmdet_tiny | (O365, GoldG, GRIT, V3Det) + GroundingCap-1M | 44.7 | 37.3 | 39.5 | 50.7 | 34.9 | 26.0 | 30.1 | 44.3 |
    | llmdet_base | (O365, GoldG, V3Det) + GroundingCap-1M | 48.3 | 40.8 | 43.1 | 54.3 | 38.5 | 28.2 | 34.3 | 47.8 |
    | llmdet_large | (O365V2, OpenImageV6, GoldG) + GroundingCap-1M | 51.1 | 45.1 | 46.1 | 56.6 | 42.0 | 31.6 | 38.8 | 50.2 |

BibTeX entry and citation info

@article{fu2025llmdet,
  title={LLMDet: Learning Strong Open-Vocabulary Object Detectors under the Supervision of Large Language Models},
  author={Fu, Shenghao and Yang, Qize and Mo, Qijie and Yan, Junkai and Wei, Xihan and Meng, Jingke and Xie, Xiaohua and Zheng, Wei-Shi},
  journal={arXiv preprint arXiv:2501.18954},
  year={2025}
}