GadflyII/GLM-4.7-Flash-NVFP4


Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending):

https://github.com/Gadflyii/vllm/tree/main

GLM-4.7-Flash NVFP4 (Mixed Precision)

This is a mixed precision NVFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

Quantization Strategy

This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts calibrated) based on NVIDIA's approach for DeepSeek-V3. It uses mixed precision to preserve accuracy (a sketch of such a recipe follows the table below):

| Component | Precision | Rationale |
| --- | --- | --- |
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive |
| Norms, Gates, Embeddings | BF16 | Standard practice |
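
The exact scripts are not published in this card. For orientation, here is a minimal, hypothetical sketch of how a similar mixed-precision recipe could be expressed with llm-compressor (which emits the compressed-tensors format this repo ships); the NVFP4 scheme string, the ignore patterns, and the dataset handling are assumptions, not the author's actual configuration.

# Hypothetical sketch of a mixed-precision NVFP4 recipe with llm-compressor.
# The scheme string, module-name regexes, and dataset handling are guesses;
# the author's actual scripts (based on NVIDIA's DeepSeek-V3 approach) may differ.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize Linear layers to NVFP4, but keep the MLA attention projections,
# router gates, and lm_head in higher precision, mirroring the table above.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*self_attn.*", r"re:.*mlp\.gate$"],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",       # may additionally need trust_remote_code handling
    dataset="neuralmagic/calibration",   # calibration set named in this card
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
    output_dir="GLM-4.7-Flash-NVFP4",
)

The key point is simply that attention, gates, and embeddings are excluded from the FP4 targets so they stay in higher precision.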

Performance

| Metric | BF16 | Uniform FP4 | This Model |
| --- | --- | --- | --- |
| MMLU-Pro | 24.83% | 16.84% | 23.55% |
| Size | 62.4 GB | 18.9 GB | 20.4 GB |
| Compression | 1x | 3.3x | 3.1x |
| Accuracy Loss | - | -8.0% | -1.3% |

Usage

Requirements

  • vLLM: 0.14.0+ (for compressed-tensors NVFP4 support)
  • transformers: 5.0.0+ (for the glm4_moe_lite architecture)
  • GPU: NVIDIA Blackwell for native FP4 tensor cores; Hopper and Ada Lovelace GPUs run this format through vLLM's fallback kernels (see the environment check below)
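
Before loading the model, a quick environment check can confirm the package versions and whether the GPU has native FP4 tensor cores. The compute-capability mapping in the comment is the only assumption here (SM 10.x / 12.x corresponds to Blackwell):

# Quick environment check before loading the model.
import torch
import transformers
import vllm

print("vLLM:", vllm.__version__)                   # want >= 0.14.0
print("transformers:", transformers.__version__)   # want >= 5.0.0

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: {major}.{minor}")
# SM 10.x / 12.x (Blackwell) has native FP4 tensor cores; older GPUs rely on
# vLLM's fallback kernels for NVFP4 checkpoints.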

Installation

pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git

Inference with vLLM

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)

Serving with vLLM

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
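
The server exposes vLLM's OpenAI-compatible API (port 8000 by default), so any OpenAI client can query it; a minimal example:

# Query the vLLM OpenAI-compatible server started above (default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)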

Model Details

  • Base Model: zai-org/GLM-4.7-Flash
  • Architecture: Glm4MoeLiteForCausalLM
  • Parameters: 30B total, 3B active per token (30B-A3B)
  • MoE Configuration: 64 routed experts, 4 active, 1 shared expert
  • Layers: 47
  • Context Length: 202,752 tokens (max)
  • Languages: English, Chinese

Quantization Details

  • Format: compressed-tensors (NVFP4)
  • Block Size: 16
  • Scale Format: FP8 (E4M3)
  • Calibration: 128 samples from neuralmagic/calibration dataset
  • Full Expert Calibration: All 64 experts calibrated per sample
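
The format, block size, and scale dtype above can be confirmed directly from the quantization_config embedded in this repo's config.json. The field names inside that section follow the compressed-tensors convention and may vary between versions:

# Inspect the compressed-tensors quantization config shipped with this repo.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="GadflyII/GLM-4.7-Flash-NVFP4",
    filename="config.json",
)
with open(path) as f:
    config = json.load(f)

# Pretty-print the quantization section (scheme, group size, ignored modules).
print(json.dumps(config.get("quantization_config", {}), indent=2))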

Evaluation

MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
| --- | --- | --- | --- |
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| NVFP4 (this model) | 23.55% | 2834 | 12032 |
| Difference | -1.28% | -154 | - |

MMLU-Pro by Category

| Category | BF16 | NVFP4 | Difference |
| --- | --- | --- | --- |
| Social Sciences | 32.70% | 31.43% | -1.27% |
| Other | 31.57% | 30.08% | -1.49% |
| Humanities | 23.78% | 22.56% | -1.22% |
| STEM | 19.94% | 18.70% | -1.24% |

MMLU-Pro by Subject

| Subject | BF16 | NVFP4 | Difference |
| --- | --- | --- | --- |
| Biology | 50.35% | 47.42% | -2.93% |
| Psychology | 44.99% | 42.48% | -2.51% |
| Economics | 36.37% | 34.48% | -1.89% |
| Health | 35.21% | 34.84% | -0.37% |
| History | 33.60% | 30.71% | -2.89% |
| Philosophy | 31.46% | 30.06% | -1.40% |
| Other | 28.35% | 25.87% | -2.48% |
| Computer Science | 26.10% | 21.46% | -4.64% |
| Business | 16.35% | 16.98% | +0.63% |
| Law | 16.89% | 16.35% | -0.54% |
| Engineering | 16.00% | 14.04% | -1.96% |
| Physics | 15.32% | 14.70% | -0.62% |
| Math | 14.06% | 14.29% | +0.23% |
| Chemistry | 14.13% | 13.34% | -0.79% |
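
The harness and exact settings used for these numbers are not documented in this card. As a hedged starting point, MMLU-Pro can be run against this checkpoint with lm-evaluation-harness and its vLLM backend; the task name, batch handling, and result keys below are assumptions:

# Hypothetical reproduction of the MMLU-Pro evaluation with lm-evaluation-harness.
# The settings actually used for this card are not stated and may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=GadflyII/GLM-4.7-Flash-NVFP4,"
        "tensor_parallel_size=1,max_model_len=4096,"
        "trust_remote_code=True,gpu_memory_utilization=0.85"
    ),
    tasks=["mmlu_pro"],
    batch_size="auto",
)
# Per-subject and aggregate scores land under the "results" key.
print(results["results"])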

Citation

If you use this model, please cite the original GLM-4.7-Flash:

@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}

License

This model inherits the Apache 2.0 license from the base model.
