omni-research/Tarsier2-Recap-7b

arXiv: 2501.07888 · License: apache-2.0

Tarsier Model Card

Introduction

Tarsier2-Recap-7b is built upon Qwen2-VL-7B-Instruct by distilling the video description capabilities of Tarsier2-7b. Specifically, we fine-tuned Qwen2-VL-7B-Instruct on Tarsier2-Recap-585K for 2 epochs with a learning rate of 2e-5. Tarsier2-Recap-7b has video captioning ability close to that of Tarsier2-7b, reaching an overall F1 score of 40.7% on DREAM-1K, behind only Tarsier2-7b (42.0%) and surpassing GPT-4o (39.2%). See the Tarsier2 technical report for more details.
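The fine-tuning recipe above can be summarized as a small config. A minimal sketch, assuming Hugging Face-style argument names; the field names are assumptions, and only the values (2 epochs, learning rate 2e-5) come from this card:

```python
# Hypothetical summary of the distillation fine-tuning described above.
# Field names follow Hugging Face TrainingArguments conventions and are
# assumptions; the authors' actual training script may differ.
finetune_config = {
    "base_model": "Qwen/Qwen2-VL-7B-Instruct",
    "train_dataset": "Tarsier2-Recap-585K",
    "num_train_epochs": 2,
    "learning_rate": 2e-5,
}
```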

Note: Please use Tarsier2-7b if you need the full-strength Tarsier2.

Model details

Model date: Tarsier2-Recap-7b was trained in December 2024.

Paper or resources for more information: the Tarsier2 technical report (arXiv:2501.07888) and https://github.com/bytedance/tarsier.

License

Qwen/Qwen2-VL-7B-Instruct license.

Intended use

Primary intended uses: The primary use of Tarsier is research on large multimodal models, especially video description.

Primary intended users: The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

Model Performance

Video Description

We evaluate Tarsier2-Recap-7b on DREAM-1K, a detailed video description benchmark featuring dynamic and diverse videos that assesses the model's ability to describe fine-grained actions and events. Here is the evaluation result: [DREAM-1K results figure]

Note: The results of Tarsier2-Recap-7b differ from those reported in Table 11 of the Tarsier2 technical report, as Tarsier2-Recap-7b is more fully trained (2 epochs vs. 1 epoch).
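The overall DREAM-1K score is an F1 combining the precision and recall of described events. A minimal sketch of the harmonic-mean combination only; the benchmark's actual pipeline, which extracts and matches events with an LLM, is omitted here:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of event-level precision and recall, as combined
    into the overall DREAM-1K score."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```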

Video Question-Answering

We evaluate Tarsier2-Recap-7b on TVBench, a novel multiple-choice question-answering benchmark that requires a high level of temporal understanding. As Tarsier2-Recap-7b is trained only on video captioning data, it needs an additional prompt to induce it to perform multiple-choice question-answering; see the TVBench samples for an example. Here is the evaluation result:

| Task | Tarsier2-Recap-7b | Tarsier2-7b |
|---|---|---|
| Action Antonym | 91.2 | 94.1 |
| Action Count | 43.1 | 40.5 |
| Action Localization | 42.5 | 37.5 |
| Action Sequence | 70.5 | 72.3 |
| Egocentric Sequence | 22.0 | 24.5 |
| Moving Direction | 37.1 | 33.2 |
| Object Count | 46.6 | 62.8 |
| Object Shuffle | 36.9 | 31.6 |
| Scene Transition | 85.9 | 88.1 |
| Unexpected Action | 28.0 | 41.5 |
| OVERALL | 54.0 | 54.7 |
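Because the model is caption-trained, multiple-choice evaluation relies on a prompt that steers it toward answering with an option letter. A minimal sketch of such a prompt builder; the template wording here is a hypothetical illustration, not the exact prompt used in the TVBench samples:

```python
def build_mcq_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice question so that a caption-trained model
    is steered toward answering with a single option letter."""
    letters = "ABCDEFGH"
    lines = [question]
    for letter, option in zip(letters, options):
        lines.append(f"({letter}) {option}")
    lines.append("Answer with the letter of the best option.")
    return "\n".join(lines)
```

For example, `build_mcq_prompt("Which direction is the object moving?", ["Left", "Right"])` yields the question followed by `(A) Left`, `(B) Right`, and the answer instruction.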

How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage.
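Since the model is built on Qwen2-VL-7B-Instruct, a minimal inference sketch assuming it loads through the standard Hugging Face Qwen2-VL interface may also work; the official, supported usage is the repository linked above, so treat this as an illustration only (requires `transformers` and `qwen-vl-utils`):

```python
def build_video_messages(video_path: str, prompt: str) -> list:
    """Build a Qwen2-VL style chat message with one video and a text prompt."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "video", "video": video_path},
                {"type": "text", "text": prompt},
            ],
        }
    ]


def describe_video(video_path: str) -> str:
    # Heavy imports and the model load are kept inside the function so the
    # sketch can be read without a GPU. Whether the checkpoint loads through
    # this interface is an assumption based on its Qwen2-VL base.
    from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
    from qwen_vl_utils import process_vision_info

    model_id = "omni-research/Tarsier2-Recap-7b"
    model = Qwen2VLForConditionalGeneration.from_pretrained(
        model_id, torch_dtype="auto", device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = build_video_messages(video_path, "Describe the video in detail.")
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    _, video_inputs = process_vision_info(messages)
    inputs = processor(text=[text], videos=video_inputs, return_tensors="pt").to(
        model.device
    )
    generated = model.generate(**inputs, max_new_tokens=512)
    # Strip the prompt tokens before decoding.
    generated = generated[:, inputs.input_ids.shape[1]:]
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```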

Where to send questions or comments about the model: https://github.com/bytedance/tarsier/issues
