
Quantizing Kimi K2.6 to NVFP4 for Blackwell Inference
Wafer and Parasail released wafer-ai/Kimi-K2.6-NVFP4: Blackwell NVFP4 weights for production Kimi K2.6 inference.

ATOM on 8×MI355X leads public Qwen3.6-35B-A3B on Artificial Analysis decode and sustains ~15k tok/s per node at production latency.

Wafer has raised $4 million in seed funding led by Fifty Years to build AI that optimizes AI infrastructure.

In our NVFP4 KernelArena suite, cpp_extension.load() hid a blind spot: JIT and CPU migration inflated kernel launch times. Here’s the fix.

Ten ways LLMs game GPU kernel benchmarks: timer tricks, garbage reads, caching, and the checks that catch them.

Benchmark AI-generated GPU kernels, with first results on WaferBench NVFP4 (B200) and KernelBench HIP (MI300X).

1:1 kernel mappings across providers. Diff huge vLLM traces fast, with a clean prefill vs decode split.

Give your AI coding assistant direct GPUs, with no manual SSH, Docker, or infra babysitting.

Cloud CUDA builds with PTX/SASS, PyTorch headers, and VS Code integration. No local CUDA install.

A non-kernel expert hit an 8× speedup on latency-critical CUDA clustering with Wafer profile-guided optimization.

Profiling in our CLI broke a theory-only plateau: 11.65× on the Kimi Delta Attention kernel.

Our agent optimized AMD’s topk_sigmoid kernel: 9× over PyTorch, step by step.

A fused kernel claimed 104× speedup while reading garbage, and passed checks until we added a determinism guard.

Frontier models wrote HIP kernels for KernelBench on MI300X. We measured which ones were correct and how fast.

GPU docs, trace analysis, and remote kernel eval for your coding agent via the wafer-ai CLI.

The GPU docs tool from our IDE extension, now a standalone web app.

AMD profiling in VS Code and Cursor: metrics, roofline, and kernel stats without leaving the editor.

Open Chrome trace JSON in your IDE with Perfetto: timeline, flamegraphs, SQL, metrics.

Profiling (NCU), compiler explorer, and GPU docs: your GPU stack inside the editor.

Chip Benchmark is our open suite for open-weight LLMs across accelerators, so you can pick hardware with real numbers, not vibes.

MI300X packs 192 GB of HBM3 and 5.3 TB/s of bandwidth, yet it is often skipped for LLM inference. We show the quantization and tuning that unlock it.