From trace to speedup

Profile, diagnose, and optimize your kernels with an agent built for GPU performance work

Backed by
Fifty Years
Y Combinator
Liquid 2
NVIDIA Inception
Jeff Dean, Chief Scientist at Google
Woj Zaremba, Co-Founder at OpenAI
Dan Fu, Head of Kernels at Together
Charlie Songhurst, Meta Board of Directors
Arash Ferdowsi, Co-Founder at Dropbox
Kawal Gandhi, Office of the CTO at Google
The Optimization Loop

Supports NVIDIA and AMD.

01 Profile

Kernel          Time
matmul_kernel   223 µs
softmax_fwd     89 µs
layer_norm      45 µs

matmul_kernel is 48% of runtime

02 Diagnose

Wafer Agent:
Low L2 hit rate (23%)
Uncoalesced global loads
Use shared memory tiling

03 Implement

Wafer Agent:
 out[tid] = in[tid] * scale;
+__shared__ float tile[256];
+tile[tx] = in[tid];
+__syncthreads();

04 Verify

Before: 223 µs
After: 223 µs
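
To make the "uncoalesced global loads, use shared memory tiling" diagnosis concrete, here is a minimal sketch using a matrix transpose, a textbook case where a naive kernel must stride through global memory on either the read or the write. It is not Wafer's generated code: the kernel name `transposeTiled`, the helper `checkTranspose`, and the 32×32 tile size are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 32;  // assumed tile size; one warp covers one tile row

// Transpose an n x n row-major matrix.
// Both the global load and the global store touch consecutive addresses
// within a warp; the strided access is moved into shared memory.
__global__ void transposeTiled(float* out, const float* in, int n) {
    // One extra column of padding avoids shared-memory bank conflicts
    // on the column-wise read below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n) tile[threadIdx.y][threadIdx.x] = in[y * n + x];

    __syncthreads();

    // Swap the block indices so the write is also contiguous along x.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n) out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

// Hypothetical helper: runs the kernel and returns the number of
// mismatching elements (0 means the transpose is correct).
int checkTranspose(int n) {
    std::vector<float> hIn(static_cast<size_t>(n) * n), hOut(hIn.size(), 0.0f);
    for (size_t i = 0; i < hIn.size(); ++i) hIn[i] = static_cast<float>(i);

    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = hIn.size() * sizeof(float);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dOut, dIn, n);
    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            if (hOut[static_cast<size_t>(c) * n + r] != hIn[static_cast<size_t>(r) * n + c])
                ++mismatches;

    cudaFree(dIn);
    cudaFree(dOut);
    return mismatches;
}

int main() {
    int bad = checkTranspose(1000);  // 1000 is not a multiple of TILE, so edge tiles are exercised
    std::printf("mismatches: %d\n", bad);
    return bad != 0;
}
```

Whether tiling pays off depends on the access pattern. For a purely elementwise kernel whose loads are already contiguous, staging through shared memory adds work without improving coalescing, so a before/after measurement is still needed.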
01 — Profile

Run real profilers from within your IDE or CLI: PyTorch, Nsight, ROCprofiler, and more.

NCU Profiler

GPU: NVIDIA B200
Selected: elementwise_kernel_with_index

ID  Speedup  Function Name                     Duration   Compute  Memory
1   50.0%    distribution_elementwise_grid...  223.46 µs  62.3%    5.3%
2   40.7%    distribution_elementwise_grid...  12.06 µs   38.4%    2.9%
3   98.8%    distribution_elementwise_grid...  5.92 µs    1.5%     3.2%
4   50.0%    distribution_elementwise_grid...  37.38 µs   55.7%    4.4%
5   46.5%    unrolled_elementwise_kernel       37.76 µs   62.2%    8.1%
6   100%     elementwise_kernel_with_index     4.80 µs    0.0%     4.0%
7   22.2%    elementwise_kernel                37.92 µs   71.0%    29.6%

Optimization Opportunities

This kernel grid is too small to fill available resources, resulting in only 0.0 full waves across all SMs.
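
As a rough illustration of what "full waves" means in that message, the sketch below estimates waves as grid size divided by the number of thread blocks the device can keep resident at once (SM count times resident blocks per SM). The kernel `scaleKernel`, the helper `wavesPerSM`, the block size, and the problem size are all assumptions for illustration; the profiler's own calculation may account for more limits than this.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel used only to query occupancy (assumed for illustration).
__global__ void scaleKernel(float* out, const float* in, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale;
}

// Pure host arithmetic: how many times the grid fills every SM's
// resident-block slots. Values below 1.0 mean the grid cannot fill the GPU once.
double wavesPerSM(long long gridBlocks, int numSMs, int blocksPerSM) {
    if (numSMs <= 0 || blocksPerSM <= 0) return 0.0;
    return static_cast<double>(gridBlocks) / (static_cast<double>(numSMs) * blocksPerSM);
}

int main() {
    const int blockSize = 256;  // assumed launch configuration
    const int n = 1 << 12;      // assumed (small) problem size
    const int gridBlocks = (n + blockSize - 1) / blockSize;

    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    // Resident blocks per SM for this kernel at this block size, no dynamic shared memory.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scaleKernel, blockSize, 0);

    double waves = wavesPerSM(gridBlocks, numSMs, blocksPerSM);
    std::printf("grid=%d blocks, SMs=%d, blocks/SM=%d, waves=%.2f (full waves: %d)\n",
                gridBlocks, numSMs, blocksPerSM, waves, static_cast<int>(waves));
    return 0;
}
```

When the result is below one full wave, typical responses are launching more blocks (for example by batching work) or giving each thread less work, then re-profiling.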

NCU Profiler
GPU Speed Of Light Throughput

High-level overview of throughput for compute and memory resources. Throughput is reported as percentage of theoretical maximum.

Memory Throughput: 5.25%
DRAM Throughput: 0.38%
L1/TEX Cache Throughput: 5.57%
L2 Cache Throughput: 4.49%
Compute (SM) Throughput: 62.31%
DRAM Frequency: 3.99 GHz
SM Frequency: 1.13 GHz
Elapsed Cycles: 257,296 cycles
Duration: 223.46 µs
SM Active Cycles: 239,788.51 cycles

GPU Throughput
Compute (SM): 62.3%
Memory: 5.3%
PM Sampling

Timeline view of performance monitor metrics sampled periodically over the workload duration.

02 — Diagnose

Turn counters into explanations. Ask questions against docs + traces.

03 — Patch

Edit with the compiler open. Inspect PTX / SASS / IR. Change a line. See what changed.

// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: CL-36424714
// Cuda compilation tools, release 13.0, V13.0.88
// Based on NVVM 20.0.0

.version 9.0
.target sm_100
.address_size 64

// .globl _Z9vectorAddPfPKfS1_i
// _ZZ9reduceSumPfPKfiE5sdata has been demoted

.visible .entry _Z9vectorAddPfPKfS1_i(
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_0,
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_1,
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_2,
    .param .u32 _Z9vectorAddPfPKfS1_i_param_3
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<6>;
    .reg .b32 %f<4>;
    .reg .b64 %rd<11>;
    .loc 1 4 0

    ld.param.u64 %rd1, [_Z9vectorAddPfPKfS1_i_param_0];
    ld.param.u64 %rd2, [_Z9vectorAddPfPKfS1_i_param_1];
    ld.param.u64 %rd3, [_Z9vectorAddPfPKfS1_i_param_2];
    ld.param.u32 %r2, [_Z9vectorAddPfPKfS1_i_param_3];
    .loc 1 5 5
    mov.u32 %r3, %ctaid.x;
    mov.u32 %r4, %ntid.x;
    mov.u32 %r5, %tid.x;
    mad.lo.s32 %r1, %r3, %r4, %r5;
    .loc 1 6 5
    setp.ge.s32 %p1, %r1, %r2;
    @%p1 bra $L__BB0_2;
    .loc 1 7 5
    cvta.to.global.u64 %rd4, %rd2;
    cvta.to.global.u64 %rd5, %rd3;
    cvta.to.global.u64 %rd6, %rd1;
    mul.wide.s32 %rd7, %r1, 4;
    add.s64 %rd8, %rd4, %rd7;
    add.s64 %rd9, %rd5, %rd7;
    ld.global.f32 %f1, [%rd8];
    ld.global.f32 %f2, [%rd9];
    add.f32 %f3, %f1, %f2;
    add.s64 %rd10, %rd6, %rd7;
    st.global.f32 [%rd10], %f3;

$L__BB0_2:
    .loc 1 9 1
    ret;
}
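
For readers who want to map the PTX above back to source, here is a plausible reconstruction of the CUDA kernel it was compiled from. The signature follows from the mangled name `_Z9vectorAddPfPKfS1_i`, which demangles to `vectorAdd(float*, const float*, const float*, int)`. The parameter names, the host-side helper `checkVectorAdd`, and the file name in the comments are assumptions, not taken from the page.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Assumed source for the PTX listing; parameter names are illustrative.
__global__ void vectorAdd(float* out, const float* a, const float* b, int n) {
    // mov.u32 %ctaid.x / %ntid.x / %tid.x followed by mad.lo.s32
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // setp.ge.s32 plus the predicated branch to the return label
    if (i < n) {
        // two ld.global.f32, one add.f32, one st.global.f32
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: runs the kernel and returns the number of wrong elements.
int checkVectorAdd(int n) {
    std::vector<float> hA(n), hB(n), hOut(n, 0.0f);
    for (int i = 0; i < n; ++i) { hA[i] = static_cast<float>(i); hB[i] = 2.0f * i; }

    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;
    vectorAdd<<<(n + block - 1) / block, block>>>(dOut, dA, dB, n);
    cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hA[i] + hB[i]) ++mismatches;

    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    return mismatches;
}

int main() {
    int bad = checkVectorAdd(1000);
    std::printf("mismatches: %d\n", bad);
    return bad != 0;
}

// To inspect the compiler output yourself (file name is hypothetical):
//   nvcc -arch=sm_100 -lineinfo -ptx vector_add.cu -o vector_add.ptx
//   nvcc -arch=sm_100 -cubin vector_add.cu -o vector_add.cubin
//   cuobjdump -sass vector_add.cubin
```

The `.loc` directives in the listing refer to source file, line, and column, which is how a compiler view can highlight the PTX that a given source line produced.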
04 — Verify

Persistent CPU environment. Spin up GPU only when you run. Save 90% on GPU costs.

Your IDE
CPU container: persistent
GPU: on-demand (standby until you run)

Everything you need for the optimization loop. Built for kernel engineers who want to ship faster.

NCU Integration

Run NVIDIA Nsight Compute (NCU) profiles directly from your editor. Get insights without context switching.

ROCprofiler Compute

AMD GPU profiling with the same workflow. One interface for both NVIDIA and AMD hardware.

Documentation Search

Search CUDA programming guides, API references, and optimization best practices instantly.

Compiler Explorer

See PTX and SASS from your CUDA code. Like Godbolt, but for GPU kernels.

Trace Analysis

Ask questions about your profiler output. Get explanations, not just numbers.

Code Diff

Review agent-suggested changes before applying. Accept, reject, or modify.

Benchmark Harness

Reproducible perf measurements. Guard against regressions.
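
As a minimal sketch of what a reproducible measurement can look like, the example below times a kernel with CUDA events after a warmup phase, takes the median of several runs, and flags a regression against a baseline. It is not Wafer's harness: `scaleKernel`, `timeScaleKernelMs`, `median`, `regressed`, and the 10% tolerance are assumptions for illustration.

```cuda
#include <algorithm>
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Placeholder workload (assumed for illustration).
__global__ void scaleKernel(float* out, const float* in, float scale, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * scale;
}

// Median is less sensitive to outlier runs than the mean.
float median(std::vector<float> v) {
    if (v.empty()) return 0.0f;
    std::sort(v.begin(), v.end());
    size_t m = v.size() / 2;
    return (v.size() % 2) ? v[m] : 0.5f * (v[m - 1] + v[m]);
}

// True if the current time exceeds the baseline by more than the tolerance.
bool regressed(float currentMs, float baselineMs, float tolerance) {
    return currentMs > baselineMs * (1.0f + tolerance);
}

// Runs `warmup` untimed launches, then returns `iters` per-launch times in ms.
std::vector<float> timeScaleKernelMs(int n, int warmup, int iters) {
    float *dIn = nullptr, *dOut = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemset(dIn, 0, bytes);

    const int block = 256;
    const int grid = (n + block - 1) / block;

    for (int i = 0; i < warmup; ++i) scaleKernel<<<grid, block>>>(dOut, dIn, 2.0f, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    std::vector<float> samples;
    for (int i = 0; i < iters; ++i) {
        cudaEventRecord(start);
        scaleKernel<<<grid, block>>>(dOut, dIn, 2.0f, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);  // wait until the kernel and stop event finish
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        samples.push_back(ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dIn);
    cudaFree(dOut);
    return samples;
}

int main(int argc, char** argv) {
    float med = median(timeScaleKernelMs(1 << 24, 10, 50));
    std::printf("median: %.4f ms\n", med);
    if (argc > 1) {  // optional baseline in ms passed on the command line
        float baseline = static_cast<float>(std::atof(argv[1]));
        if (regressed(med, baseline, 0.10f)) {
            std::printf("regression: baseline %.4f ms\n", baseline);
            return 1;
        }
    }
    return 0;
}
```

Fixing the problem size, warmup count, and iteration count, and recording the GPU model and clocks alongside the result, makes numbers from different runs easier to compare.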

Autonomy Control

Same workflow, different levels of hands-on involvement. Pick what works for you.

Simple, transparent pricing

Start free, scale as you need. Credits work for both AI agent calls and GPU compute time.

Start

$0/mo
$5 credits/month
Includes
$5 free credits/month
B200s with hardware counters
Access to all tools

Hacker (Popular)

$16/mo
$20 credits/month
Includes
Everything in Start
$20 in credits/month
Credit top-ups available
Support Slack channel

Pro

$100/mo
$128 credits/month
Includes
Everything in Hacker
$128 in credits/month
Direct Slack access to founders
< 2 hour response time
Priority support

Enterprise

Contact Us
Unlimited credits/month
Includes
Everything in Pro
Custom credit allocation
Dedicated infrastructure
Custom SLAs
On-premise deployment