ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4

Runnable with vLLM.

Quantization script (AutoAWQ), starting from meta-llama/Llama-3.3-70B-Instruct:

```python
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Empty cache
torch.cuda.empty_cache()

# Memory limits - set this according to your hardware limits
max_memory = {0: "22GiB", 1: "22GiB", "cpu": "160GiB"}

model_path = "meta-llama/Llama-3.3-70B-Instruct"
quant_path = "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load model - Note: while this loads the layers onto the CPU, the GPUs (and their VRAM)
# are still required for quantization! (Verified with nvidia-smi)
model = AutoAWQForCausalLM.from_pretrained(
    model_path, use_cache=False, max_memory=max_memory, device_map="cpu"
)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')
```
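Once saved, the quantized checkpoint can be reloaded for a quick sanity check. A minimal sketch using the Transformers AWQ integration, assuming `autoawq` and `accelerate` are installed; the prompt and generation settings are illustrative only:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4"

# Transformers picks up the AWQ config from the checkpoint; device_map="auto"
# spreads the INT4 weights across the available GPUs.
tokenizer = AutoTokenizer.from_pretrained(quant_path)
model = AutoModelForCausalLM.from_pretrained(
    quant_path, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative chat prompt
messages = [{"role": "user", "content": "Summarize what AWQ INT4 quantization does."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```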

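Since the card marks the model as runnable with vLLM, here is a minimal offline-inference sketch. The two-GPU tensor parallelism, context length, and sampling settings are assumptions; adjust them to your hardware:

```python
from vllm import LLM, SamplingParams

# Assumed two-GPU setup; the explicit quantization flag is optional, as vLLM
# can usually detect AWQ from the checkpoint config.
llm = LLM(
    model="ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4",
    quantization="awq",
    tensor_parallel_size=2,
    max_model_len=8192,
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)
outputs = llm.generate(
    ["Explain activation-aware weight quantization in one paragraph."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

The same checkpoint can also be served over an OpenAI-compatible API, e.g. `vllm serve ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 --tensor-parallel-size 2`.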