Blog Posts
Profile-Guided GPU Kernel Optimization

How adding profiling tools to our CLI helped an agent break through a theory-based optimization plateau, achieving 11.65x speedup on the Kimi Delta Attention kernel.

The Year of the LLM GPU Kernel Engineer

We used an AI agent to optimize AMD's topk_sigmoid kernel, achieving a 9x speedup over PyTorch. Here's exactly how our agent did it.

Case Study: A 104x (?) Speedup on KernelBench

How a fused kernel claiming a 104x speedup passed our correctness checks while reading garbage memory, and the determinism check that catches it.

Which models are the most HIP?

LLM-generated kernels are all the rage right now. We used frontier AI models to write HIP kernels for KernelBench and ran them on MI300Xs. Which ones performed the best?

wafer-cli: GPU Superpowers for Your Coding Agent

Give your AI coding assistant direct access to GPU documentation, trace analysis, and remote kernel evaluation with wafer-cli.

GPU Docs: Now Available on the Web

The GPU documentation tool that thousands of engineers loved in our IDE extension is now available as a standalone web app.

Introducing ROCprofiler Compute: AMD GPU Profiling in Your IDE

Profile AMD GPUs directly in VS Code and Cursor. View hardware metrics, roofline analysis, and kernel stats — all without leaving your editor.

Introducing Wafer's Built-in Perfetto Trace Viewer

Open Chrome trace JSON files directly in your IDE with full Perfetto functionality — timeline, flamegraphs, SQL, and metrics.

Introducing the Wafer Extension for VS Code and Cursor

Wafer is the GPU development stack that lives inside your editor: profiling (NCU), compiler explorer, and enhanced GPU docs.

Introducing Chip Benchmark: Hardware-Centric Performance Insights for AI Workloads

As the AI hardware ecosystem rapidly expands, choosing the right accelerator has become increasingly complex. We're excited to introduce Chip Benchmark, an open-source benchmarking suite purpose-built to evaluate the performance of open-weight LLMs across diverse hardware platforms.

Unlocking AMD MI300X for High-Throughput, Low-Cost LLM Inference

Large language models are driving a surge in inference workloads. While the AI community often gravitates toward more well-known GPUs, AMD's MI300X quietly stands out, equipped with 192 GB of HBM3 and 5.3 TB/s of memory bandwidth. We explore how targeted optimization and quantization can unlock its potential.