This repository provides large language models developed by the Research and Development Center for Large Language Models at the National Institute of Informatics. The checkpoints are runnable with vLLM as well as Hugging Face Transformers.
| Model Variants |
|---|
| llm-jp-3-1.8b |
| llm-jp-3-1.8b-instruct |
| llm-jp-3-3.7b |
| llm-jp-3-3.7b-instruct |
| llm-jp-3-13b |
| llm-jp-3-13b-instruct |
| llm-jp-3-172b-beta1 |
| llm-jp-3-172b-beta1-instruct |
Checkpoints format: Hugging Face Transformers
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-3.7b-instruct")
model = AutoModelForCausalLM.from_pretrained("llm-jp/llm-jp-3-3.7b-instruct", device_map="auto", torch_dtype=torch.bfloat16)
chat = [
    # System prompt: "Below is an instruction that describes a task. Write a response that appropriately satisfies the request."
    {"role": "system", "content": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"},
    # User prompt: "What is natural language processing?"
    {"role": "user", "content": "自然言語処理とは何か"},
]
tokenized_input = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,
        top_p=0.95,
        temperature=0.7,
        repetition_penalty=1.05,
    )[0]
print(tokenizer.decode(output))
```
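The models can also be served with vLLM, as noted above. The following is a minimal offline-inference sketch, assuming a recent vLLM release; the sampling values simply mirror the Transformers example above and are not prescribed by this card.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_id = "llm-jp/llm-jp-3-3.7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
llm = LLM(model=model_id, dtype="bfloat16")

chat = [
    {"role": "system", "content": "以下は、タスクを説明する指示です。要求を適切に満たす応答を書きなさい。"},
    {"role": "user", "content": "自然言語処理とは何か"},
]
# Render the chat template to a plain prompt string, then generate with vLLM.
prompt = tokenizer.apply_chat_template(chat, add_generation_prompt=True, tokenize=False)
sampling_params = SamplingParams(max_tokens=100, top_p=0.95, temperature=0.7, repetition_penalty=1.05)
outputs = llm.generate([prompt], sampling_params)
print(outputs[0].outputs[0].text)
```

Rendering the chat template to a string keeps the prompt identical to the Transformers example; for the larger variants, `tensor_parallel_size` can be passed to `LLM` to shard the model across multiple GPUs.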
| Params | Layers | Hidden size | Heads | Context length | Embedding parameters | Non-embedding parameters |
|---|---|---|---|---|---|---|
| 1.8b | 24 | 2048 | 16 | 4096 | 407,896,064 | 1,459,718,144 |
| 3.7b | 28 | 3072 | 24 | 4096 | 611,844,096 | 3,171,068,928 |
| 13b | 40 | 5120 | 40 | 4096 | 1,019,740,160 | 12,688,184,320 |
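As a quick consistency check on the table above, each embedding parameter count equals 199,168 × hidden size, which matches two untied embedding matrices (input embedding and output head) over a vocabulary of 99,584 entries; note that this vocabulary size is inferred from the arithmetic, not quoted from the card.

```python
# Reproduce the "Embedding parameters" column, assuming untied input/output
# embeddings and a vocabulary of 99,584 entries (an inference, not an official figure).
VOCAB_SIZE = 99_584  # assumption

for hidden_size, reported in [(2048, 407_896_064), (3072, 611_844_096), (5120, 1_019_740_160)]:
    expected = 2 * VOCAB_SIZE * hidden_size
    print(f"hidden={hidden_size}: expected={expected:,} reported={reported:,} match={expected == reported}")
```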
The tokenizer of this model is a Unigram byte-fallback model built with huggingface/tokenizers.
The vocabulary entries were converted from llm-jp-tokenizer v3.0.
Please refer to the README.md of llm-jp-tokenizer for details on the vocabulary construction procedure (pure SentencePiece training does not reproduce our vocabulary).
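As an illustrative sketch (the sample strings are arbitrary), the tokenizer can be loaded on its own to inspect how Japanese text is segmented and how byte fallback handles characters absent from the vocabulary.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("llm-jp/llm-jp-3-3.7b-instruct")

# Segment a Japanese sentence ("What is natural language processing?") into subword pieces.
print(tokenizer.tokenize("自然言語処理とは何か"))

# With byte fallback, a character missing from the Unigram vocabulary is encoded
# as raw-byte tokens rather than an unknown token, so decoding recovers it losslessly.
ids = tokenizer.encode("🦙", add_special_tokens=False)
print(ids, tokenizer.decode(ids))
```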
The models have been pre-trained using a blend of the following datasets.
| Language | Dataset | Tokens |
|---|---|---|
| Japanese | Wikipedia | 2.6B |
| | Common Crawl | 762.8B |
| | WARP/PDF | 237.3B |
| | WARP/HTML | 2.7B |
| | Kaken | 1.8B |
| English | Wikipedia | 4.7B |
| | Dolma/CC-head | 608.5B |
| | Dolma/C4 | 181.6B |
| | Dolma/Reddit | 83.1B |
| | Dolma/PeS2o | 62.9B |
| | Dolma/Gutenberg | 5.5B |
| | Dolma/Wiki | 3.9B |
| Code | The Stack | 114.1B |
| Chinese | Wikipedia | 0.8B |
| Korean | Wikipedia | 0.3B |
The models have been fine-tuned on the following datasets.
| Language | Dataset | Description |
|---|---|---|
| Japanese | ichikara-instruction-004-002 | A manually constructed instruction dataset. |
| | answer-carefully-002 | A manually constructed instruction dataset focusing on LLM safety. |
| | ichikara-instruction-format | A small instruction dataset edited from ichikara-instruction, with constraints on the output format. |
| | AutoMultiTurnByCalm3-22B | A synthetic instruction dataset. |
| | ramdom-to-fixed-multiturn-Calm3 | A synthetic instruction dataset. |
| | wizardlm8x22b-logical-math-coding-sft_additional-ja | A synthetic instruction dataset. |
| | Synthetic-JP-EN-Coding-Dataset-567k | A synthetic instruction dataset. We used a sampled subset. |
| English | FLAN | We used a sampled subset. |
We evaluated the models using 100 examples from the dev split.
| Model name | average | EL | FA | HE | MC | MR | MT | NLI | QA | RC |
|---|---|---|---|---|---|---|---|---|---|---|
| llm-jp-3-1.8b | 0.3767 | 0.3725 | 0.1948 | 0.2350 | 0.2500 | 0.0900 | 0.7730 | 0.3080 | 0.4629 | 0.7040 |
| llm-jp-3-1.8b-instruct | 0.4596 | 0.4280 | 0.1987 | 0.3250 | 0.3300 | 0.4200 | 0.7900 | 0.3520 | 0.4698 | 0.8224 |
| llm-jp-3-3.7b | 0.4231 | 0.3812 | 0.2440 | 0.2200 | 0.1900 | 0.3600 | 0.7947 | 0.3800 | 0.4688 | 0.7694 |
| llm-jp-3-3.7b-instruct | 0.5188 | 0.4191 | 0.2504 | 0.3400 | 0.5000 | 0.5800 | 0.8166 | 0.4500 | 0.4881 | 0.8247 |
| llm-jp-3-13b | 0.5802 | 0.5570 | 0.2593 | 0.4600 | 0.7000 | 0.6300 | 0.8292 | 0.3460 | 0.5937 | 0.8469 |
| llm-jp-3-13b-instruct | 0.6168 | 0.5408 | 0.2757 | 0.4950 | 0.9200 | 0.7100 | 0.8317 | 0.4640 | 0.4642 | 0.8500 |
We evaluated the models using gpt-4-0613 as the judge. Please refer to the evaluation code for details.
| Model name | average | coding | extraction | humanities | math | reasoning | roleplay | stem | writing |
|---|---|---|---|---|---|---|---|---|---|
| llm-jp-3-1.8b-instruct | 4.93 | 1.50 | 4.70 | 7.80 | 1.55 | 2.60 | 7.80 | 6.10 | 7.40 |
| llm-jp-3-3.7b-instruct | 5.50 | 1.95 | 4.05 | 8.25 | 2.25 | 4.00 | 8.80 | 7.25 | 7.45 |
| llm-jp-3-13b-instruct | 6.47 | 3.15 | 7.05 | 9.15 | 3.75 | 5.40 | 8.30 | 7.50 | 7.45 |
The models released here are in the early stages of our research and development and have not been tuned to ensure outputs align with human intent and safety considerations.
llm-jp(at)nii.ac.jp
The names are listed in alphabetical order: Hirokazu Kiyomaru and Takashi Kodama.