| Benchmark | Mistral-7B-Instruct-v0.3 | Mistral-7B-Instruct-v0.3-GPTQ-4bit (this model) |
|---|---|---|
| ARC-c (25-shot) | 63.48 | 63.40 |
| MMLU (5-shot) | 61.13 | 60.89 |
| HellaSwag (10-shot) | 84.49 | 84.04 |
| WinoGrande (5-shot) | 79.16 | 79.08 |
| GSM8K (5-shot) | 43.37 | 45.41 |
| TruthfulQA (0-shot) | 59.65 | 57.48 |
| Average accuracy | 65.21 | 65.05 |
| Recovery | 100.00% | 99.75% |
This model is ready for optimized inference with the Marlin mixed-precision kernels in [vLLM](https://github.com/vllm-project/vllm). Start an OpenAI-compatible inference server with:

```shell
python -m vllm.entrypoints.openai.api_server --model neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit
```
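Once the server is up, it can be queried like any OpenAI-compatible endpoint. The sketch below uses only the Python standard library and assumes the server's default address (`http://localhost:8000`); the endpoint path and response shape follow the OpenAI chat-completions format that vLLM's server implements. Adjust `base_url` if you passed `--host`/`--port` when launching.

```python
# Minimal client sketch for the vLLM OpenAI-compatible server started above.
# Assumes the default address http://localhost:8000; adjust base_url otherwise.
import json
import urllib.request


def build_chat_request(prompt: str, max_tokens: int = 128) -> bytes:
    """Build the JSON body for a /v1/chat/completions request."""
    body = {
        "model": "neuralmagic/Mistral-7B-Instruct-v0.3-GPTQ-4bit",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return json.dumps(body).encode("utf-8")


def chat(prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request to the running server and return the assistant reply."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=build_chat_request(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        out = json.loads(resp.read())
    return out["choices"][0]["message"]["content"]


# Usage (with the server running):
#   reply = chat("Summarize 4-bit GPTQ quantization in one sentence.")
```

The `openai` Python client works just as well here (point its `base_url` at the server); the raw-`urllib` version is shown only to keep the example dependency-free.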
