cl-nagoya/ruri-base


Ruri: Japanese General Text Embeddings

Notes: v3 models are out!
We recommend using the following v3 models going forward.

ID                        #Param.  Max Len.  Avg. JMTEB
cl-nagoya/ruri-v3-30m     37M      8192      74.51
cl-nagoya/ruri-v3-70m     70M      8192      75.48
cl-nagoya/ruri-v3-130m    132M     8192      76.55
cl-nagoya/ruri-v3-310m    315M     8192      77.24

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library along with the Japanese tokenizer dependencies (fugashi, sentencepiece, unidic-lite):

pip install -U sentence-transformers fugashi sentencepiece unidic-lite

Then you can load this model and run inference.

import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("cl-nagoya/ruri-base")

# Don't forget to add the prefix "クエリ: " for query-side or "文章: " for passage-side texts.
sentences = [
    "クエリ: 瑠璃色はどんな色?",
    "文章: 瑠璃色(るりいろ)は、紫みを帯びた濃い青。名は、半貴石の瑠璃(ラピスラズリ、英: lapis lazuli)による。JIS慣用色名では「こい紫みの青」(略号 dp-pB)と定義している[1][2]。",
    "クエリ: ワシやタカのように、鋭いくちばしと爪を持った大型の鳥類を総称して「何類」というでしょう?",
    "文章: ワシ、タカ、ハゲワシ、ハヤブサ、コンドル、フクロウが代表的である。これらの猛禽類はリンネ前後の時代(17~18世紀)には鷲類・鷹類・隼類及び梟類に分類された。ちなみにリンネは狩りをする鳥を単一の目(もく)にまとめ、vultur(コンドル、ハゲワシ)、falco(ワシ、タカ、ハヤブサなど)、strix(フクロウ)、lanius(モズ)の4属を含めている。",
]

embeddings = model.encode(sentences, convert_to_tensor=True)
print(embeddings.size())
# torch.Size([4, 768])

similarities = F.cosine_similarity(embeddings.unsqueeze(0), embeddings.unsqueeze(1), dim=2)
print(similarities)
# [[1.0000, 0.9421, 0.6844, 0.7167],
#  [0.9421, 1.0000, 0.6626, 0.6863],
#  [0.6844, 0.6626, 1.0000, 0.8785],
#  [0.7167, 0.6863, 0.8785, 1.0000]]
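For retrieval, the same similarity computation ranks passages against a query: encode the query (with the "クエリ: " prefix) and the candidate passages (with "文章: "), then sort by cosine similarity. The ranking step itself is independent of the model, so it is sketched here with toy vectors standing in for `model.encode(...)` output:

```python
import torch
import torch.nn.functional as F

# Toy embeddings standing in for model.encode(...) output:
# row 0 is the query, rows 1-3 are candidate passages.
embeddings = torch.tensor([
    [1.0, 0.0, 0.0],   # query
    [0.9, 0.1, 0.0],   # passage A (nearly parallel to the query)
    [0.0, 1.0, 0.0],   # passage B (orthogonal)
    [0.5, 0.5, 0.0],   # passage C (in between)
])

query, passages = embeddings[:1], embeddings[1:]
scores = F.cosine_similarity(query, passages, dim=1)  # shape [3]
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist())  # [0, 2, 1] -> passage A, then C, then B
```

With real embeddings from this model, the same `argsort` over cosine scores yields the passage ranking; in the four-sentence example above, the first passage would rank highest for the first query, as the similarity matrix shows.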

Benchmarks

JMTEB

Evaluated with JMTEB.

Model                              #Param.  Avg.   Retrieval  STS    Classification  Reranking  Clustering  PairClassification
cl-nagoya/sup-simcse-ja-base       111M     68.56  49.64      82.05  73.47           91.83      51.79       62.57
cl-nagoya/sup-simcse-ja-large      337M     66.51  37.62      83.18  73.73           91.48      50.56       62.51
cl-nagoya/unsup-simcse-ja-base     111M     65.07  40.23      78.72  73.07           91.16      44.77       62.44
cl-nagoya/unsup-simcse-ja-large    337M     66.27  40.53      80.56  74.66           90.95      48.41       62.49
pkshatech/GLuCoSE-base-ja          133M     70.44  59.02      78.71  76.82           91.90      49.78       66.39
sentence-transformers/LaBSE        472M     64.70  40.12      76.56  72.66           91.63      44.88       62.33
intfloat/multilingual-e5-small     118M     69.52  67.27      80.07  67.62           93.03      46.91       62.19
intfloat/multilingual-e5-base      278M     70.12  68.21      79.84  69.30           92.85      48.26       62.26
intfloat/multilingual-e5-large     560M     71.65  70.98      79.70  72.89           92.96      51.24       62.15
OpenAI/text-embedding-ada-002      -        69.48  64.38      79.02  69.75           93.04      48.30       62.40
OpenAI/text-embedding-3-small      -        70.86  66.39      79.46  73.06           92.92      51.06       62.27
OpenAI/text-embedding-3-large      -        73.97  74.48      82.52  77.58           93.58      53.32       62.35
Ruri-Small                         68M      71.53  69.41      82.79  76.22           93.00      51.19       62.11
Ruri-Base (this model)             111M     71.91  69.82      82.87  75.58           92.91      54.16       62.38
Ruri-Large                         337M     73.31  73.02      83.13  77.43           92.99      51.82       62.29

Model Details

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
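The Pooling module above has `pooling_mode_mean_tokens` enabled: the sentence embedding is the mean of the token embeddings, with padding positions excluded via the attention mask. A minimal sketch of that operation (the tensors below are toy values, not actual model outputs):

```python
import torch

def mean_pool(token_embeddings: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    # Expand the mask to the embedding dimension so padded positions contribute nothing.
    mask = attention_mask.unsqueeze(-1).float()    # [batch, seq_len, 1]
    summed = (token_embeddings * mask).sum(dim=1)  # [batch, dim]
    counts = mask.sum(dim=1).clamp(min=1e-9)       # [batch, 1], number of real tokens
    return summed / counts                         # [batch, dim]

# Toy example: batch of 1, seq_len 3, dim 2; the last position is padding.
emb = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
mask = torch.tensor([[1, 1, 0]])
print(mean_pool(emb, mask))  # tensor([[2., 3.]])
```

This is the standard masked mean pooling used by Sentence Transformers' `Pooling` layer; the `clamp` guards against division by zero for fully padded rows.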

Framework Versions

  • Python: 3.10.13
  • Sentence Transformers: 3.0.0
  • Transformers: 4.41.2
  • PyTorch: 2.3.1+cu118
  • Accelerate: 0.30.1
  • Datasets: 2.19.1
  • Tokenizers: 0.19.1

Citation

@misc{ruri,
  title={{Ruri: Japanese General Text Embeddings}}, 
  author={Hayato Tsukagoshi and Ryohei Sasano},
  year={2024},
  eprint={2409.07737},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2409.07737}, 
}

License

This model is published under the Apache License, Version 2.0.
