NVFP4 model releases | LibertAI Labs

NVFP4 is NVIDIA’s 4-bit floating-point format. Blackwell GPUs execute NVFP4 matrix multiplication directly in their tensor cores, and llama.cpp gained support for the format in 2026. We publish GGUF builds that preserve the calibrated NVFP4 feed-forward tensors from NVIDIA ModelOpt sources and quantize the remaining weights to Q8_0 or Q4_K_M. We also publish safetensors builds for vLLM, SGLang and TensorRT-LLM.
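To make the format concrete, here is a minimal Python sketch of how one block of NVFP4 values decodes. It assumes the layout in NVIDIA's public description of the format: 4-bit E2M1 codes (one sign bit, two exponent bits, one mantissa bit) grouped in blocks of 16 that share one FP8 (E4M3) scale, plus an FP32 per-tensor scale. The function names are ours, chosen for illustration. The sketch treats both scales as ordinary Python floats and says nothing about how llama.cpp or ModelOpt pack the tensors in memory.

```python
# Magnitudes representable by the 3 low bits of an E2M1 code (exponent bias 1).
# Code 0b001 is the single subnormal value, 0.5.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(code: int) -> float:
    """Decode one 4-bit code (0..15): bit 3 is the sign, bits 0-2 the magnitude."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[code & 0x7]


def dequantize_block(codes, block_scale, tensor_scale=1.0):
    """Recover approximate weights for one block.

    block_scale stands in for the shared FP8 scale and tensor_scale for the
    FP32 per-tensor scale (both assumptions about the layout, see above).
    """
    return [decode_e2m1(c) * block_scale * tensor_scale for c in codes]


if __name__ == "__main__":
    # Illustrative codes only, not taken from a real checkpoint.
    print(dequantize_block([0b0001, 0b0010, 0b1111], block_scale=2.0))
```

Because every block carries its own scale, the eight available magnitudes can track the local range of the weights rather than the range of the whole tensor.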

The claim we make is deliberately modest. On dense models, our NVFP4 builds match stock K-quants on end-to-end throughput and gain 5 to 14% on generation throughput on an RTX 5090. On mixture-of-experts models, the heavily tuned stock kernels outperformed our builds until June 2026, when upstream optimization flipped the result: our Nex-N2-mini NVFP4 build now wins batched serving by 3 to 5% while stock keeps a small single-stream edge, and the model cards report both results. The consolidated numbers across all five models are in the performance writeup, and the original methodology, including the benchmark number we had to correct, is in the launch writeup.
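For readers who want to reproduce this kind of comparison, the percentages above are relative gains over a stock baseline. A minimal sketch of the arithmetic follows; the helper name and the tokens-per-second figures in the example are placeholders of ours, not measurements from the writeups.

```python
def relative_gain_percent(candidate_tps: float, baseline_tps: float) -> float:
    """Percent by which candidate throughput exceeds baseline throughput.

    Both arguments are tokens per second. A negative result means the
    baseline was faster.
    """
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return (candidate_tps / baseline_tps - 1.0) * 100.0


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    gain = relative_gain_percent(candidate_tps=105.0, baseline_tps=100.0)
    print(f"{gain:.1f}% generation throughput gain")
```

Running both builds on the same GPU with the same prompt and generation lengths keeps the ratio meaningful.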

Releases

Download counts are from June 2026.

Repo	What it is	Downloads
Qwen3.6-27B-NVFP4-GGUF	Dense 27B multimodal, NVFP4 FFN	5,384
Qwen3.6-27B-W4A16-G128	GPTQ W4A16 for vLLM	5,067
Qwen3.6-35B-A3B-NVFP4-GGUF	MoE 35B (3B active), NVFP4 experts	3,622
Qwen3.6-27B-NVFP4-MTP-GGUF	27B with multi-token-prediction draft head	3,311
Qwen3.6-35B-A3B-NVFP4-MTP-GGUF	MoE 35B with MTP draft head	2,216
Gemma-4-12B-IT-NVFP4-GGUF	Dense 12B	1,790
Gemma-4-31B-IT-NVFP4-GGUF	Dense 31B multimodal	1,587
Gemma-4-12B-IT-NVFP4	Safetensors for vLLM and TensorRT-LLM	1,322
Nex-N2-mini-NVFP4-GGUF	MoE 35B (3B active), NVFP4 experts, fixed chat template	new
Nex-N2-mini-GGUF	Imatrix K-quants with the fixed chat template, any GPU	new

What is in each GGUF repo

Every NVFP4 GGUF repo follows the same layout. There are three variants that share the calibrated NVFP4 feed-forward tensors and differ in how the remaining weights are stored: BF16 for maximum source fidelity, Q8_0, and Q4_K_M, which is the variant we recommend for serving. Multimodal models additionally ship an F16 projector for vision input, extracted from the official BF16 source.

The interesting property of the Q4_K_M variant is that it lands in the same file-size class as a stock Q4_K_M quant. Benchmarking the two against each other isolates exactly one difference: whether the feed-forward matmul runs through the NVFP4 hardware path or the K-quant path.
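The size parity can be checked with a rough bits-per-weight calculation. This sketch rests on block layouts as we understand them: an NVFP4 block of 16 four-bit codes plus a one-byte scale, a ggml Q4_K super-block of 256 weights in 144 bytes, and a ggml Q8_0 block of 32 weights in 34 bytes. It ignores the NVFP4 per-tensor scale and file metadata. A real Q4_K_M file also stores some tensors at higher precision, so its average sits somewhat above the Q4_K figure.

```python
def bits_per_weight(block_bytes: int, block_weights: int) -> float:
    """Average storage cost per weight for a block-quantized format."""
    return block_bytes * 8 / block_weights


# Assumed block layouts (see the note above).
NVFP4_BPW = bits_per_weight(block_bytes=8 + 1, block_weights=16)  # 8 bytes of codes + 1 scale byte
Q4_K_BPW = bits_per_weight(block_bytes=144, block_weights=256)
Q8_0_BPW = bits_per_weight(block_bytes=34, block_weights=32)

if __name__ == "__main__":
    for name, bpw in [("NVFP4", NVFP4_BPW), ("Q4_K", Q4_K_BPW), ("Q8_0", Q8_0_BPW)]:
        print(f"{name}: {bpw} bits per weight")
```

Under these assumptions the NVFP4 and Q4_K blocks cost the same per weight, which is consistent with the two Q4_K_M-class files landing close together in size.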