GadflyII/GLM-4.7-Flash-NVFP4


Note: If you have a multi-GPU SM120 Blackwell system (RTX 50 / RTX Pro), try my vLLM fork to resolve P2P / TP=2 issues (PR into upstream pending):

https://github.com/Gadflyii/vllm/tree/main

GLM-4.7-Flash NVFP4 (Mixed Precision)

This is a mixed precision NVFP4 quantization of zai-org/GLM-4.7-Flash, a 30B-A3B (30B total, 3B active) Mixture-of-Experts model.

Quantization Strategy

This model was produced with custom quantization and calibration scripts (128 samples, 2048 max sequence length, the neuralmagic/calibration dataset, all 64 experts calibrated) based on NVIDIA's approach for DeepSeek-V3. It uses mixed precision to preserve accuracy (a sketch of such a recipe follows the table below):

| Component | Precision | Rationale |
| --- | --- | --- |
| MLP Experts | FP4 (E2M1) | 64 routed experts, 4 active per token |
| Dense MLP | FP4 (E2M1) | First layer dense MLP |
| Attention (MLA) | BF16 | Low-rank compressed Q/KV projections are sensitive |
| Norms, Gates, Embeddings | BF16 | Standard practice |
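
The exact scripts are not published in this card. For orientation, here is a minimal, hypothetical sketch of how a similar mixed-precision recipe could be expressed with llm-compressor (which emits the compressed-tensors format this repo ships); the NVFP4 scheme string, the ignore patterns, and the dataset handling are assumptions, not the author's actual configuration.

# Hypothetical sketch of a mixed-precision NVFP4 recipe with llm-compressor.
# The scheme string, module-name regexes, and dataset handling are guesses;
# the author's actual scripts (based on NVIDIA's DeepSeek-V3 approach) may differ.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize Linear layers to NVFP4, but keep the MLA attention projections,
# router gates, and lm_head in higher precision, mirroring the table above.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*self_attn.*", r"re:.*mlp\.gate$"],
)

oneshot(
    model="zai-org/GLM-4.7-Flash",       # may additionally need trust_remote_code handling
    dataset="neuralmagic/calibration",   # calibration set named in this card
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=128,
    output_dir="GLM-4.7-Flash-NVFP4",
)

The key point is simply that attention, gates, and embeddings are excluded from the FP4 targets so they stay in higher precision.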

Performance

| Metric | BF16 | Uniform FP4 | This Model |
| --- | --- | --- | --- |
| MMLU-Pro | 24.83% | 16.84% | 23.55% |
| Size | 62.4 GB | 18.9 GB | 20.4 GB |
| Compression | 1x | 3.3x | 3.1x |
| Accuracy Loss | - | -8.0% | -1.3% |

Usage

Requirements

  • vLLM: 0.14.0+ (for compressed-tensors NVFP4 support)
  • transformers: 5.0.0+ (for the glm4_moe_lite architecture)
  • GPU: NVIDIA Blackwell for native FP4 tensor cores; Hopper and Ada Lovelace GPUs run this format through vLLM's fallback kernels (see the environment check below)
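
Before loading the model, a quick environment check can confirm the package versions and whether the GPU has native FP4 tensor cores. The compute-capability mapping in the comment is the only assumption here (SM 10.x / 12.x corresponds to Blackwell):

# Quick environment check before loading the model.
import torch
import transformers
import vllm

print("vLLM:", vllm.__version__)                   # want >= 0.14.0
print("transformers:", transformers.__version__)   # want >= 5.0.0

major, minor = torch.cuda.get_device_capability(0)
print(f"GPU compute capability: {major}.{minor}")
# SM 10.x / 12.x (Blackwell) has native FP4 tensor cores; older GPUs rely on
# vLLM's fallback kernels for NVFP4 checkpoints.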

Installation

pip install "vllm>=0.14.0"
pip install git+https://github.com/huggingface/transformers.git

Inference with vLLM

from vllm import LLM, SamplingParams

model = LLM(
    "GadflyII/GLM-4.7-Flash-NVFP4",
    tensor_parallel_size=1,
    max_model_len=4096,
    trust_remote_code=True,
    gpu_memory_utilization=0.85,
)

params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = model.generate(["Explain quantum computing in simple terms."], params)
print(outputs[0].outputs[0].text)

Serving with vLLM

vllm serve GadflyII/GLM-4.7-Flash-NVFP4 \
    --tensor-parallel-size 1 \
    --max-model-len 4096 \
    --trust-remote-code
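
The server exposes vLLM's OpenAI-compatible API (port 8000 by default), so any OpenAI client can query it; a minimal example:

# Query the vLLM OpenAI-compatible server started above (default port 8000).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="GadflyII/GLM-4.7-Flash-NVFP4",
    messages=[{"role": "user", "content": "Explain quantum computing in simple terms."}],
    max_tokens=512,
    temperature=0.7,
)
print(response.choices[0].message.content)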

Model Details

  • Base Model: zai-org/GLM-4.7-Flash
  • Architecture: Glm4MoeLiteForCausalLM
  • Parameters: 30B total, 3B active per token (30B-A3B)
  • MoE Configuration: 64 routed experts, 4 active, 1 shared expert
  • Layers: 47
  • Context Length: 202,752 tokens (max)
  • Languages: English, Chinese

Quantization Details

  • Format: compressed-tensors (NVFP4)
  • Block Size: 16
  • Scale Format: FP8 (E4M3)
  • Calibration: 128 samples from neuralmagic/calibration dataset
  • Full Expert Calibration: All 64 experts calibrated per sample
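
The format, block size, and scale dtype above can be confirmed directly from the quantization_config embedded in this repo's config.json. The field names inside that section follow the compressed-tensors convention and may vary between versions:

# Inspect the compressed-tensors quantization config shipped with this repo.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="GadflyII/GLM-4.7-Flash-NVFP4",
    filename="config.json",
)
with open(path) as f:
    config = json.load(f)

# Pretty-print the quantization section (scheme, group size, ignored modules).
print(json.dumps(config.get("quantization_config", {}), indent=2))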

Evaluation

MMLU-Pro Overall Results

| Model | Accuracy | Correct | Total |
| --- | --- | --- | --- |
| BF16 (baseline) | 24.83% | 2988 | 12032 |
| NVFP4 (this model) | 23.55% | 2834 | 12032 |
| Difference | -1.28% | -154 | - |

MMLU-Pro by Category

| Category | BF16 | NVFP4 | Difference |
| --- | --- | --- | --- |
| Social Sciences | 32.70% | 31.43% | -1.27% |
| Other | 31.57% | 30.08% | -1.49% |
| Humanities | 23.78% | 22.56% | -1.22% |
| STEM | 19.94% | 18.70% | -1.24% |

MMLU-Pro by Subject

| Subject | BF16 | NVFP4 | Difference |
| --- | --- | --- | --- |
| Biology | 50.35% | 47.42% | -2.93% |
| Psychology | 44.99% | 42.48% | -2.51% |
| Economics | 36.37% | 34.48% | -1.89% |
| Health | 35.21% | 34.84% | -0.37% |
| History | 33.60% | 30.71% | -2.89% |
| Philosophy | 31.46% | 30.06% | -1.40% |
| Other | 28.35% | 25.87% | -2.48% |
| Computer Science | 26.10% | 21.46% | -4.64% |
| Business | 16.35% | 16.98% | +0.63% |
| Law | 16.89% | 16.35% | -0.54% |
| Engineering | 16.00% | 14.04% | -1.96% |
| Physics | 15.32% | 14.70% | -0.62% |
| Math | 14.06% | 14.29% | +0.23% |
| Chemistry | 14.13% | 13.34% | -0.79% |
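
The harness and exact settings used for these numbers are not documented in this card. As a hedged starting point, MMLU-Pro can be run against this checkpoint with lm-evaluation-harness and its vLLM backend; the task name, batch handling, and result keys below are assumptions:

# Hypothetical reproduction of the MMLU-Pro evaluation with lm-evaluation-harness.
# The settings actually used for this card are not stated and may differ.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=GadflyII/GLM-4.7-Flash-NVFP4,"
        "tensor_parallel_size=1,max_model_len=4096,"
        "trust_remote_code=True,gpu_memory_utilization=0.85"
    ),
    tasks=["mmlu_pro"],
    batch_size="auto",
)
# Per-subject and aggregate scores land under the "results" key.
print(results["results"])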

Citation

If you use this model, please cite the original GLM-4.7-Flash:

@misc{glm4flash2025,
  title={GLM-4.7-Flash},
  author={Zhipu AI},
  year={2025},
  howpublished={\url{https://huggingface.co/zai-org/GLM-4.7-Flash}}
}

License

This model inherits the Apache 2.0 license from the base model.
