# deepseek-ai/deepseek-vl2-tiny


Specify the path to the model, then load the processor and the model (this assumes the `deepseek_vl2` package from the official DeepSeek-VL2 repository is installed):

```python
import torch
from transformers import AutoModelForCausalLM

from deepseek_vl2.models import DeepseekVLV2Processor, DeepseekVLV2ForCausalLM
from deepseek_vl2.utils.io import load_pil_images

# specify the path to the model
model_path = "deepseek-ai/deepseek-vl2-tiny"
vl_chat_processor: DeepseekVLV2Processor = DeepseekVLV2Processor.from_pretrained(model_path)
tokenizer = vl_chat_processor.tokenizer

vl_gpt: DeepseekVLV2ForCausalLM = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()
```
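If no CUDA GPU is available, a hedged variant is to keep the model on the CPU (an assumption, not covered by the original example; generation will be much slower):

```python
# CPU-only fallback (assumption): load in float32 and skip .cuda()
vl_gpt = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True)
vl_gpt = vl_gpt.to(torch.float32).eval()
```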

A single-image conversation example; `<|ref|>` and `<|/ref|>` are special tokens for the visual grounding feature:

```python
# single image conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\n<|ref|>The giraffe at the back.<|/ref|>.",
        "images": ["./images/visual_grounding.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]
```
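The `<|ref|>`/`<|/ref|>` tokens are only needed for grounding; for ordinary visual question answering a plain prompt works (a minimal sketch with a hypothetical image path):

```python
# plain VQA conversation without grounding tokens; the image path is a placeholder
conversation = [
    {
        "role": "<|User|>",
        "content": "<image>\nDescribe this image in detail.",
        "images": ["./images/example.jpeg"],
    },
    {"role": "<|Assistant|>", "content": ""},
]
```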

A multi-image (or in-context learning) conversation example:

```python
# multiple images (or in-context learning) conversation example
conversation = [
    {
        "role": "<|User|>",
        "content": "<image_placeholder>A dog wearing nothing in the foreground, "
                   "<image_placeholder>a dog wearing a santa hat, "
                   "<image_placeholder>a dog wearing a wizard outfit, and "
                   "<image_placeholder>what's the dog wearing?",
        "images": [
            "images/dog_a.png",
            "images/dog_b.png",
            "images/dog_c.png",
            "images/dog_d.png",
        ],
    },
    {"role": "<|Assistant|>", "content": ""},
]
```

Load the images and prepare the model inputs:

```python
pil_images = load_pil_images(conversation)
prepare_inputs = vl_chat_processor(
    conversations=conversation,
    images=pil_images,
    force_batchify=True,
    system_prompt=""
).to(vl_gpt.device)
```
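A quick way to sanity-check what the processor produced (only fields already used elsewhere in this example are shown):

```python
# the batched attention mask and the fully formatted chat prompt
print(prepare_inputs.attention_mask.shape)
print(prepare_inputs["sft_format"][0])
```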

Run the image encoder to get the image embeddings:

```python
inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
```

Run the language model to generate the response, then decode it:

```python
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True
)

answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
print(f"{prepare_inputs['sft_format'][0]}", answer)
```
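To print tokens as they are generated instead of waiting for the full answer, here is a minimal sketch using transformers' `TextStreamer` (not part of the original example; assumes the setup above):

```python
from transformers import TextStreamer

# stream decoded tokens to stdout while generating (sketch; reuses inputs from above)
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
outputs = vl_gpt.language_model.generate(
    inputs_embeds=inputs_embeds,
    attention_mask=prepare_inputs.attention_mask,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
    max_new_tokens=512,
    do_sample=False,
    use_cache=True,
    streamer=streamer,
)
```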


### Gradio Demo (TODO)
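Until an official demo is published, here is a minimal sketch of how the Quick Start code above could be wrapped in Gradio. Everything in it is an illustrative assumption (including the `<image>` placeholder usage, the `answer_question` helper, and passing the PIL image straight to the processor); it reuses the `vl_chat_processor`, `vl_gpt`, and `tokenizer` objects loaded earlier.

```python
import gradio as gr

def answer_question(image, question):
    # hypothetical helper: single-image chat using the processor/model loaded above
    conversation = [
        {"role": "<|User|>", "content": f"<image>\n{question}", "images": []},
        {"role": "<|Assistant|>", "content": ""},
    ]
    prepare_inputs = vl_chat_processor(
        conversations=conversation,
        images=[image],
        force_batchify=True,
        system_prompt=""
    ).to(vl_gpt.device)
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=512,
        do_sample=False,
        use_cache=True,
    )
    return tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
    title="DeepSeek-VL2-tiny",
)
demo.launch()
```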


## 4. License

This code repository is licensed under [MIT License](./LICENSE-CODE). The use of DeepSeek-VL2 models is subject to [DeepSeek Model License](./LICENSE-MODEL). DeepSeek-VL2 series supports commercial use.

## 5. Citation

```bibtex
@misc{wu2024deepseekvl2mixtureofexpertsvisionlanguagemodels,
      title={DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding},
      author={Zhiyu Wu and Xiaokang Chen and Zizheng Pan and Xingchao Liu and Wen Liu and Damai Dai and Huazuo Gao and Yiyang Ma and Chengyue Wu and Bingxuan Wang and Zhenda Xie and Yu Wu and Kai Hu and Jiawei Wang and Yaofeng Sun and Yukun Li and Yishi Piao and Kang Guan and Aixin Liu and Xin Xie and Yuxiang You and Kai Dong and Xingkai Yu and Haowei Zhang and Liang Zhao and Yisong Wang and Chong Ruan},
      year={2024},
      eprint={2412.10302},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.10302},
}
```


## 6. Contact

If you have any questions, please raise an issue or contact us at [service@deepseek.com](mailto:service@deepseek.com).