ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4

Runnable with vLLM.

Quantization script (AutoAWQ), starting from meta-llama/Llama-3.3-70B-Instruct:

```python
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Empty cache
torch.cuda.empty_cache()

# Memory limits - set this according to your hardware limits
max_memory = {0: "22GiB", 1: "22GiB", "cpu": "160GiB"}

model_path = "meta-llama/Llama-3.3-70B-Instruct"
quant_path = "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model - Note: while this loads the layers onto the CPU, the GPUs (and their VRAM)
# are still required for quantization! (Verified with nvidia-smi)
model = AutoAWQForCausalLM.from_pretrained(
    model_path, use_cache=False, max_memory=max_memory, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
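Once saved, the quantized checkpoint can be reloaded for a quick sanity check. A minimal sketch using the Transformers AWQ integration, assuming `autoawq` and `accelerate` are installed; the prompt and generation settings are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4"

# Transformers picks up the AWQ config from the checkpoint; device_map="auto"
# spreads the INT4 weights across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(
    quant_path, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative chat prompt
messages = [{"role": "user", "content": "Summarize what AWQ INT4 quantization does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```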

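Since the card marks the model as runnable with vLLM, here is a minimal offline-inference sketch. The two-GPU tensor parallelism, context length, and sampling settings are assumptions; adjust them to your hardware:

```python
from vllm import LLM, SamplingParams

# Assumed two-GPU setup; the explicit quantization flag is optional, as vLLM
# can usually detect AWQ from the checkpoint config.
llm = LLM(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    quantization="awq",
    tensor_parallel_size=2,
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(
    ["Explain activation-aware weight quantization in one paragraph."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served over an OpenAI-compatible API, e.g. `vllm serve ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 --tensor-parallel-size 2`.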