Blog Posts

The Cheat with the Ace of Diamonds (c. 1635) — Georges de La Tour

March 12, 2026

A Field Guide to Reward Hacking in AI Kernel Generation

10 patterns we've tracked where LLMs game GPU kernel benchmarks (manipulating timers, returning garbage, caching results, and more), along with the defenses that catch them.

Interior of the Colosseum, Rome (1832) — Thomas Cole

March 11, 2026

Introducing KernelArena

An open platform for benchmarking AI-generated GPU kernels — with initial results from WaferBench NVFP4 on B200 and KernelBench HIP on MI300X.

February 10, 2026

Trace Compare: Compare vLLM traces across platforms

Get accurate 1:1 kernel mappings across hardware providers. Compare large vLLM traces in seconds with clean prefill vs. decode separation.

Wafer Workspaces - GPU compute for coding agents

February 5, 2026

Workspaces: GPU Compute for Your Coding Agent

Give your AI coding assistant direct access to GPUs. No manual SSH setup, Docker, or infrastructure management.

February 3, 2026

Cloud Compiler Analyzer (PTX/SASS) Inside Your IDE

Cloud CUDA compilation with PTX/SASS output, PyTorch headers, and VS Code integration. No local CUDA install required.

Nordlys Labs case study - 8x faster CUDA kernel optimization

February 1, 2026

Nordlys Labs: 8x Faster Routing with Wafer-Guided Kernel Optimization

How a non-kernel-expert achieved an 8x speedup on latency-critical CUDA clustering code using profile-guided optimization with Wafer.

January 30, 2026

Profile-Guided GPU Kernel Optimization

How adding profiling tools to our CLI helped an agent break through a theory-based optimization plateau, achieving an 11.65x speedup on the Kimi Delta Attention kernel.

January 29, 2026

The Year of the LLM GPU Kernel Engineer

We used an AI agent to optimize AMD's topk_sigmoid kernel, achieving a 9x speedup over PyTorch. Here's exactly how our agent did it.

January 27, 2026

Case Study: A 104x (?) Speedup on KernelBench

How a fused kernel claiming 104x speedup passed our correctness checks while reading garbage memory, and the determinism check that catches it.

Water lilies painting representing HIP kernel optimization

January 23, 2026

Which models are the most HIP?

LLM-generated kernels are all the rage right now. We used frontier AI models to write HIP kernels for KernelBench and ran them on MI300Xs. Which ones performed the best?

January 20, 2026

wafer-ai CLI: GPU Superpowers for Your Coding Agent

Give your AI coding assistant direct access to GPU documentation, trace analysis, and remote kernel evaluation with the wafer-ai CLI.

January 16, 2026

GPU Docs: Now Available on the Web

The GPU documentation tool that thousands of engineers loved in our IDE extension is now available as a standalone web app.

ROCprofiler Compute in VS Code showing GPU architecture diagram

January 13, 2026

Introducing ROCprofiler Compute: AMD GPU Profiling in Your IDE

Profile AMD GPUs directly in VS Code and Cursor. View hardware metrics, roofline analysis, and kernel stats — all without leaving your editor.

January 8, 2026

Introducing Wafer's Built-in Perfetto Trace Viewer

Open Chrome trace JSON files directly in your IDE with full Perfetto functionality — timeline, flamegraphs, SQL, and metrics.

Wafer Extension - Your GPU Development Stack

December 19, 2025

Introducing the Wafer Extension for VS Code and Cursor

Wafer is the GPU development stack that lives inside your editor: profiling (NCU), compiler explorer, and enhanced GPU docs.

Chip Benchmark visualization showing hardware performance comparison

July 14, 2025

Introducing Chip Benchmark: Hardware-Centric Performance Insights for AI Workloads

As the AI hardware ecosystem rapidly expands, choosing the right accelerator has become increasingly complex. We're excited to introduce Chip Benchmark, an open-source benchmarking suite purpose-built to evaluate the performance of open-weight LLMs across diverse hardware platforms.

AMD MI300X optimization visualization showing performance improvements

July 14, 2025

Unlocking AMD MI300X for High-Throughput, Low-Cost LLM Inference

Large language models are driving a surge in inference workloads. While the AI community often gravitates toward better-known GPUs, AMD's MI300X quietly stands out with 192 GB of HBM3 and 5.3 TB/s of memory bandwidth. We explore how targeted optimization and quantization can unlock its potential.