abhishekchohan/gemma-3-12b-it-quantized-W4A16

Gemma 3 Quantized Models

This repository contains W4A16 quantized versions of Google's Gemma 3 instruction-tuned models, reducing the memory needed to deploy them on consumer hardware while preserving most of the original models' quality.

Models

  • abhishekchohan/gemma-3-27b-it-quantized-W4A16
  • abhishekchohan/gemma-3-12b-it-quantized-W4A16
  • abhishekchohan/gemma-3-4b-it-quantized-W4A16

Repository Structure

gemma-3-{size}-it-quantized-W4A16/
├── README.md
├── templates/
│   └── chat_template.jinja
├── tools/
│   └── tool_parser.py
└── [model files]

Quantization Details

These models use W4A16 quantization via LLM Compressor:

  • Weights quantized to 4-bit precision
  • Activations use 16-bit precision
  • Significantly reduced memory footprint: 4-bit weights need roughly a quarter of the memory of the original 16-bit weights (see the sketch below)
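
For reference, the sketch below shows a minimal LLM Compressor one-shot GPTQ run that produces a W4A16 checkpoint. It is an illustration only, not the exact recipe used for these repositories: the calibration dataset, sample count, sequence length, and ignore list are placeholder assumptions.

# Minimal W4A16 quantization sketch using LLM Compressor (llmcompressor).
# Illustrative settings; not the exact recipe used to build these models.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

recipe = GPTQModifier(
    targets="Linear",      # quantize the weights of all Linear layers
    scheme="W4A16",        # 4-bit weights, 16-bit activations
    ignore=["lm_head"],    # keep the output head in full precision
)

oneshot(
    model="google/gemma-3-12b-it",   # base instruction-tuned checkpoint
    dataset="open_platypus",         # placeholder calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="gemma-3-12b-it-quantized-W4A16",
)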

Usage with vLLM

Replace {size} with 4b, 12b, or 27b to match one of the models listed above:

vllm serve abhishekchohan/gemma-3-{size}-it-quantized-W4A16 \
    --chat-template templates/chat_template.jinja \
    --enable-auto-tool-choice \
    --tool-call-parser gemma \
    --tool-parser-plugin tools/tool_parser.py
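
The server exposes vLLM's OpenAI-compatible API, by default at http://localhost:8000/v1. Below is a minimal sketch of querying it with the openai Python client; the api_key value is a dummy, since vLLM ignores it unless a key was configured at startup.

# Query the running vLLM server through its OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="abhishekchohan/gemma-3-12b-it-quantized-W4A16",
    messages=[
        {"role": "user", "content": "Explain W4A16 quantization in one sentence."},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)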

License

These models are subject to the Gemma license. Users must acknowledge and accept the license terms before using the models.

Citation

@article{gemma_2025,
    title={Gemma 3},
    url={https://goo.gle/Gemma3Report},
    publisher={Kaggle},
    author={Gemma Team},
    year={2025}
}