Blog Posts
Red rose floating above a landscape in a pixelated sky

Quantizing Kimi K2.6 to NVFP4 for Blackwell Inference

Wafer and Parasail released wafer-ai/Kimi-K2.6-NVFP4: Blackwell NVFP4 weights for production Kimi K2.6 inference.

Qwen3.6-35B-A3B on AMD MI355X: Being The Fastest With ATOM on AMD

ATOM on 8×MI355X leads public Qwen3.6-35B-A3B decode results on Artificial Analysis and sustains ~15k tok/s per node at production latency.

Wafer seed round announcement

Announcing our Seed Round

Wafer has raised $4 million in seed funding led by Fifty Years to build AI that optimizes AI infrastructure.

Harbor painting with ASCII text overlay - data obscured by noise, like benchmarks corrupted by cold cache

Where Did My Microseconds Go?

In our NVFP4 KernelArena suite, cpp_extension.load() hid a blind spot: JIT and CPU migration inflated kernel launch times. Here’s the fix.

The Cheat with the Ace of Diamonds (c. 1635) - Georges de La Tour

A Field Guide to Reward Hacking in AI Kernel Generation

Ten ways LLMs game GPU kernel benchmarks: timer tricks, garbage reads, caching, and the checks that catch them.

Interior of the Colosseum, Rome (1832) - Thomas Cole

Introducing KernelArena

Benchmark AI-generated GPU kernels, with first results on WaferBench NVFP4 (B200) and KernelBench HIP (MI300X).

Trace Compare: Compare vLLM traces across platforms

1:1 kernel mappings across providers. Diff huge vLLM traces fast, with prefill and decode cleanly separated.

Wafer Workspaces - GPU compute for coding agents

Workspaces: GPU Compute for Your Coding Agent

Give your AI coding assistant direct GPUs, with no manual SSH, Docker, or infra babysitting.

CUDA Compiler Explorer in VS Code

Cloud Compiler Analyzer (PTX/SASS) Inside Your IDE

Cloud CUDA builds with PTX/SASS, PyTorch headers, and VS Code integration. No local CUDA install.

Nordlys Labs case study - 8x faster CUDA kernel optimization

Nordlys Labs: 8x Faster Routing with Wafer-Guided Kernel Optimization

A non-kernel expert hit 8× on latency-critical CUDA clustering with Wafer profile-guided optimization.

Profile-guided GPU kernel optimization with ncu

Profile-Guided GPU Kernel Optimization

Profiling in our CLI broke a theory-only plateau: 11.65× on the Kimi Delta Attention kernel.

The Year of the LLM GPU Kernel Engineer

Our agent optimized AMD’s topk_sigmoid kernel: 9× over PyTorch, step by step.

Reward hacking in LLM-generated kernels

Case Study: A 104x (?) Speedup on KernelBench

A fused kernel claimed 104× speedup while reading garbage, and passed checks until we added a determinism guard.

Water lilies painting representing HIP kernel optimization

Which models are the most HIP?

Frontier models wrote HIP kernels for KernelBench on MI300X. We measured which ones were correct and how fast.

wafer-ai CLI: GPU Superpowers for Your Coding Agent

GPU docs, trace analysis, and remote kernel eval for your coding agent via the wafer-ai CLI.

GPU Docs Web App

GPU Docs: Now Available on the Web

The GPU docs tool from our IDE extension, now a standalone web app.

ROCprofiler Compute in VS Code showing GPU architecture diagram

Introducing ROCprofiler Compute: AMD GPU Profiling in Your IDE

AMD profiling in VS Code and Cursor: metrics, roofline, and kernel stats without leaving the editor.

Wafer Perfetto Trace Viewer in VS Code

Introducing Wafer's Built-in Perfetto Trace Viewer

Open Chrome trace JSON in your IDE with Perfetto: timeline, flamegraphs, SQL, metrics.

Wafer Extension - Your GPU Development Stack

Introducing the Wafer Extension for VS Code and Cursor

Profiling (NCU), compiler explorer, and GPU docs: your GPU stack inside the editor.

Chip Benchmark visualization showing hardware performance comparison

Introducing Chip Benchmark: Hardware-Centric Performance Insights for AI Workloads

Chip Benchmark is our open suite for open-weight LLMs across accelerators, so you can pick hardware with real numbers, not vibes.

AMD MI300X optimization visualization showing performance improvements

Unlocking AMD MI300X for High-Throughput, Low-Cost LLM Inference

MI300X packs 192 GB of HBM3 and 5.3 TB/s of bandwidth, yet it is often skipped for LLM inference. We show the quantization and tuning that unlock it.