Alibaba-NLP/gte-large-en-v1.5


gte-large-en-v1.5

We introduce the gte-v1.5 series, upgraded gte embeddings that support a context length of up to 8192 tokens while further improving model performance. The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU).

The gte-v1.5 series achieves state-of-the-art scores on the MTEB benchmark within the same model size category and delivers competitive results on the LoCo long-context retrieval tests (refer to Evaluation).

We also present gte-Qwen1.5-7B-instruct, a SOTA instruction-tuned multilingual embedding model that ranked 2nd on MTEB and 1st on C-MTEB.
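
For readers who have not seen the transformer++ naming before, the sketch below is a minimal, hedged illustration of the two components that distinguish this backbone from a vanilla BERT encoder: rotary position embeddings (RoPE) applied inside attention and a gated (GLU) feed-forward block. It is not the released implementation; the exact RoPE channel-pairing layout and the gating/activation variant used in the actual checkpoints may differ.

```python
# Minimal sketch of the transformer++ ingredients (RoPE + GLU). Illustrative only;
# not the released gte implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

def rope_rotate(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings to x of shape (batch, seq, heads, head_dim).

    Pairs of channels are rotated by an angle proportional to the token position;
    the exact channel-pairing layout varies across implementations.
    """
    _, seq_len, _, head_dim = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * inv_freq[None, :]  # (seq, head_dim/2)
    cos = angles.cos()[None, :, None, :]
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class GLUFeedForward(nn.Module):
    """Gated feed-forward block: an up projection modulated by a gated activation."""
    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))

# Illustrative shapes only (sizes are not the real model's hyperparameters)
q = rope_rotate(torch.randn(2, 16, 8, 64))             # queries/keys get RoPE before attention
y = GLUFeedForward(256, 512)(torch.randn(2, 16, 256))  # gated FFN replaces the plain GELU MLP
print(q.shape, y.shape)
```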

Model list

| Models | Language | Model Size (M) | Max Seq. Length | Dimension | MTEB-en | LoCo |
|:--|:--|--:|--:|--:|--:|--:|
| gte-Qwen1.5-7B-instruct | Multiple | 7720 | 32768 | 4096 | 67.34 | 87.57 |
| gte-large-en-v1.5 | English | 434 | 8192 | 1024 | 65.39 | 86.71 |
| gte-base-en-v1.5 | English | 137 | 8192 | 768 | 64.11 | 87.44 |

How to Get Started with the Model

Use the code below to get started with the model.

```python
# Requires transformers>=4.36.0

import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
```

It is recommended to install xformers and enable unpadding for acceleration; refer to enable-unpadding-and-xformers.
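
The snippet below is a hedged sketch of what that might look like. The keyword arguments `unpad_inputs` and `use_memory_efficient_attention` are assumptions, not verified option names; check the enable-unpadding-and-xformers section of the model repository for the exact flags. It also assumes a CUDA device and an installed xformers.

```python
# Hedged sketch: memory-efficient attention (xformers) plus unpadding.
# The flag names below are assumptions; confirm them in the repo's
# enable-unpadding-and-xformers section. Requires `pip install xformers` and CUDA.
import torch
from transformers import AutoModel, AutoTokenizer

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    unpad_inputs=True,                    # assumed flag: skip compute on padding tokens
    use_memory_efficient_attention=True,  # assumed flag: route attention through xformers
).to('cuda')

batch_dict = tokenizer(
    ["a long document ..."], max_length=8192, padding=True, truncation=True, return_tensors='pt'
).to('cuda')

with torch.inference_mode(), torch.autocast(device_type='cuda', dtype=torch.float16):
    embeddings = model(**batch_dict).last_hidden_state[:, 0]
```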

Use with sentence-transformers:

```python
# Requires sentence_transformers>=2.7.0

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = ['That is a happy person', 'That is a very happy person']

model = SentenceTransformer('Alibaba-NLP/gte-large-en-v1.5', trust_remote_code=True)
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
```

Use with transformers.js:

```js
// npm i @xenova/transformers
import { pipeline, dot } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Alibaba-NLP/gte-large-en-v1.5', {
    quantized: false, // Comment out this line to use the quantized version
});

// Generate sentence embeddings
const sentences = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
];
const output = await extractor(sentences, { normalize: true, pooling: 'cls' });

// Compute similarity scores
const [source_embeddings, ...document_embeddings] = output.tolist();
const similarities = document_embeddings.map(x => 100 * dot(source_embeddings, x));
console.log(similarities); // [41.86354093370361, 77.07076371259589, 37.02981979677899]
```

Training Details

Training Data

  • Masked language modeling (MLM): c4-en
  • Weak-supervised contrastive pre-training (CPT): GTE pre-training data
  • Supervised contrastive fine-tuning: GTE fine-tuning data

Training Procedure

To enable the backbone model to support a context length of 8192, we adopted a multi-stage training strategy. The model first undergoes preliminary MLM pre-training on shorter sequences. We then resample the data, reducing the proportion of short texts, and continue MLM pre-training; the rope_base increase that accompanies the final MLM stage is illustrated with a short sketch after the list below.

The entire training process is as follows:

  • MLM-512: lr 2e-4, mlm_probability 0.3, batch_size 4096, num_steps 300000, rope_base 10000
  • MLM-2048: lr 5e-5, mlm_probability 0.3, batch_size 4096, num_steps 30000, rope_base 10000
  • MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 1024, num_steps 30000, rope_base 160000
  • CPT: max_len 512, lr 5e-5, batch_size 28672, num_steps 100000
  • Fine-tuning: TODO
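
As a rough intuition for the rope_base jump from 10000 to 160000 in the MLM-8192 stage, the snippet below computes the wavelength of the slowest-rotating RoPE channel pair: raising the base stretches that wavelength far beyond 8192 positions, so long sequences remain well covered. This is a hedged back-of-the-envelope calculation, not training code, and head_dim = 64 is an assumption.

```python
# Hedged back-of-the-envelope illustration of the rope_base change, not training code.
# head_dim = 64 is an assumption used only for this calculation.
import math

def longest_rope_wavelength(rope_base: float, head_dim: int = 64) -> float:
    # The slowest RoPE channel pair has inverse frequency rope_base ** (-(head_dim - 2) / head_dim);
    # its wavelength (positions per full rotation) is 2*pi divided by that frequency.
    return 2 * math.pi * rope_base ** ((head_dim - 2) / head_dim)

print(f"rope_base  10000: ~{longest_rope_wavelength(10_000):,.0f} positions")   # roughly 47,000
print(f"rope_base 160000: ~{longest_rope_wavelength(160_000):,.0f} positions")  # roughly 690,000
```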

Evaluation

MTEB

The results of the other models are retrieved from the MTEB leaderboard.

The gte evaluation settings: mteb==1.2.0, fp16 automatic mixed precision, max_length=8192, and NTK scaling factor set to 2 (equivalent to rope_base * 2).
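
A hedged sketch of how that evaluation-time extension could be applied at load time is shown below. The `rope_theta` attribute name is an assumption (inspect the model's config for the actual field), and this is not the official evaluation script.

```python
# Hedged sketch: doubling the RoPE base before loading, which the card states is
# equivalent to an NTK scaling factor of 2. `rope_theta` is an assumed field name;
# check the model's config.json for the real one.
from transformers import AutoConfig, AutoModel

model_path = 'Alibaba-NLP/gte-large-en-v1.5'
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
config.rope_theta = getattr(config, 'rope_theta', 160000) * 2  # assumed attribute
model = AutoModel.from_pretrained(model_path, config=config, trust_remote_code=True)
```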

| Model Name | Param Size (M) | Dimension | Sequence Length | Average (56) | Class. (12) | Clust. (11) | Pair Class. (3) | Reran. (4) | Retr. (15) | STS (10) | Summ. (1) |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| gte-large-en-v1.5 | 409 | 1024 | 8192 | 65.39 | 77.75 | 47.95 | 84.63 | 58.50 | 57.91 | 81.43 | 30.91 |
| mxbai-embed-large-v1 | 335 | 1024 | 512 | 64.68 | 75.64 | 46.71 | 87.2 | 60.11 | 54.39 | 85 | 32.71 |
| multilingual-e5-large-instruct | 560 | 1024 | 514 | 64.41 | 77.56 | 47.1 | 86.19 | 58.58 | 52.47 | 84.78 | 30.39 |
| bge-large-en-v1.5 | 335 | 1024 | 512 | 64.23 | 75.97 | 46.08 | 87.12 | 60.03 | 54.29 | 83.11 | 31.61 |
| gte-base-en-v1.5 | 137 | 768 | 8192 | 64.11 | 77.17 | 46.82 | 85.33 | 57.66 | 54.09 | 81.97 | 31.17 |
| bge-base-en-v1.5 | 109 | 768 | 512 | 63.55 | 75.53 | 45.77 | 86.55 | 58.86 | 53.25 | 82.4 | 31.07 |

LoCo

| Model Name | Dimension | Sequence Length | Average (5) | QMSumRetrieval | SummScreenRetrieval | QasperAbstractRetrieval | QasperTitleRetrieval | GovReportRetrieval |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|
| gte-qwen1.5-7b | 4096 | 32768 | 87.57 | 49.37 | 93.10 | 99.67 | 97.54 | 98.21 |
| gte-large-v1.5 | 1024 | 8192 | 86.71 | 44.55 | 92.61 | 99.82 | 97.81 | 98.74 |
| gte-base-v1.5 | 768 | 8192 | 87.44 | 49.91 | 91.78 | 99.82 | 97.13 | 98.58 |

Citation

If you find our paper or models helpful, please consider citing them as follows:

```bibtex
@article{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  journal={arXiv preprint arXiv:2407.19669},
  year={2024}
}
```


```bibtex
@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
```
