
Your GPU Development Stack

Profile, optimize, and ship GPU kernels faster, all while staying in your own editor

Backed by
Fifty Years
Y Combinator
Liquid 2
NVIDIA Inception
Jeff Dean, Chief Scientist at Google
Woj Zaremba, Co-Founder at OpenAI
Dan Fu, Head of Kernels at Together
Charlie Songhurst, Meta Board of Directors
Arash Ferdowsi, Co-Founder at Dropbox
Kawal Gandhi, Office of the CTO at Google

GPU Profiling

Profile your code directly in your IDE and easily pass the results as context to your coding agent.

NCU Profiler
GPU: NVIDIA B200
Selected: elementwise_kernel_with_index

ID  Speedup  Function Name                     Duration   Compute  Memory
1   50.0%    distribution_elementwise_grid...  223.46 µs  62.3%    5.3%
2   40.7%    distribution_elementwise_grid...  12.06 µs   38.4%    2.9%
3   98.8%    distribution_elementwise_grid...  5.92 µs    1.5%     3.2%
4   50.0%    distribution_elementwise_grid...  37.38 µs   55.7%    4.4%
5   46.5%    unrolled_elementwise_kernel       37.76 µs   62.2%    8.1%
6   100%     elementwise_kernel_with_index     4.80 µs    0.0%     4.0%
7   22.2%    elementwise_kernel                37.92 µs   71.0%    29.6%

Optimization Opportunities

This kernel grid is too small to fill available resources, resulting in only 0.0 full waves across all SMs.
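
The "waves" figure in that message can be reasoned about with simple arithmetic: a wave is one full batch of thread blocks that the GPU can run concurrently. The sketch below assumes the usual definition (grid size divided by SM count times resident blocks per SM). The SM count and blocks-per-SM values are illustrative assumptions, not numbers taken from the profile above.

```python
import math


def waves(grid_blocks: int, num_sms: int, max_blocks_per_sm: int) -> float:
    """Return how many waves a grid occupies.

    One wave = num_sms * max_blocks_per_sm thread blocks running at once.
    """
    if min(grid_blocks, num_sms, max_blocks_per_sm) <= 0:
        raise ValueError("all arguments must be positive")
    concurrent_blocks = num_sms * max_blocks_per_sm
    return grid_blocks / concurrent_blocks


def full_waves(grid_blocks: int, num_sms: int, max_blocks_per_sm: int) -> int:
    """Number of complete waves (the fractional tail wave is dropped)."""
    return math.floor(waves(grid_blocks, num_sms, max_blocks_per_sm))


# Assumed, illustrative hardware numbers: 148 SMs, 8 resident blocks per SM.
small = waves(grid_blocks=32, num_sms=148, max_blocks_per_sm=8)
large = waves(grid_blocks=4736, num_sms=148, max_blocks_per_sm=8)

print(f"small grid: {small:.1f} waves")  # far below one wave
print(f"large grid: {large:.1f} waves")
```

A grid that rounds to 0.0 waves leaves most SMs idle, so the usual remedies are launching more blocks or batching several small launches together.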

GPU Speed Of Light Throughput

High-level overview of throughput for compute and memory resources. Throughput is reported as percentage of theoretical maximum.

Memory Throughput         5.25 %
DRAM Throughput           0.38 %
L1/TEX Cache Throughput   5.57 %
L2 Cache Throughput       4.49 %
Compute (SM) Throughput   62.31 %
DRAM Frequency            3.99 GHz
SM Frequency              1.13 GHz
Elapsed Cycles            257,296 cycles
Duration                  223.46 µs
SM Active Cycles          239,788.51 cycles
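
As a rough illustration of how these numbers can be read, the sketch below derives two quantities from the metrics above: the fraction of elapsed cycles in which SMs were active, and which unit (compute or memory) is closer to its peak. The 60% cutoff for calling a kernel latency-bound is a heuristic assumed here for illustration, and the function names are hypothetical.

```python
def sm_active_fraction(sm_active_cycles: float, elapsed_cycles: float) -> float:
    """Share of elapsed cycles during which SMs were doing work."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    return sm_active_cycles / elapsed_cycles


def dominant_unit(compute_pct: float, memory_pct: float,
                  busy_threshold: float = 60.0) -> str:
    """Classify a kernel from its two top-level throughput percentages.

    busy_threshold is an assumed heuristic: if neither unit reaches it,
    the kernel is treated as latency-bound rather than throughput-bound.
    """
    if max(compute_pct, memory_pct) < busy_threshold:
        return "latency"
    return "compute" if compute_pct >= memory_pct else "memory"


# Values copied from the Speed Of Light section above.
active = sm_active_fraction(sm_active_cycles=239_788.51, elapsed_cycles=257_296)
bound = dominant_unit(compute_pct=62.31, memory_pct=5.25)

print(f"SM active fraction: {active:.3f}")
print(f"closest to peak: {bound}")
```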
PM Sampling

Timeline view of performance monitor metrics sampled periodically over the workload duration.

GPU Documentation

Fast search over the most complete GPU documentation, in your own editor.

Ask anything about GPU optimization...
Compiler Explorer

Compile CUDA and CuTe DSL code directly into PTX and SASS, mapped to source and available as agent context.

// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: CL-36424714
// Cuda compilation tools, release 13.0, V13.0.88
// Based on NVVM 20.0.0

.version 9.0
.target sm_100
.address_size 64

// .globl _Z9vectorAddPfPKfS1_i
// _ZZ9reduceSumPfPKfiE5sdata has been demoted

.visible .entry _Z9vectorAddPfPKfS1_i(
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_0,
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_1,
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_2,
    .param .u32 _Z9vectorAddPfPKfS1_i_param_3
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<6>;
    .reg .b32 %f<4>;
    .reg .b64 %rd<11>;
    .loc 1 4 0

    ld.param.u64 %rd1, [_Z9vectorAddPfPKfS1_i_param_0];
    ld.param.u64 %rd2, [_Z9vectorAddPfPKfS1_i_param_1];
    ld.param.u64 %rd3, [_Z9vectorAddPfPKfS1_i_param_2];
    ld.param.u32 %r2, [_Z9vectorAddPfPKfS1_i_param_3];
    .loc 1 5 5
    mov.u32 %r3, %ctaid.x;
    mov.u32 %r4, %ntid.x;
    mov.u32 %r5, %tid.x;
    mad.lo.s32 %r1, %r3, %r4, %r5;
    .loc 1 6 5
    setp.ge.s32 %p1, %r1, %r2;
    @%p1 bra $L__BB0_2;
    .loc 1 7 5
    cvta.to.global.u64 %rd4, %rd2;
    cvta.to.global.u64 %rd5, %rd3;
    cvta.to.global.u64 %rd6, %rd1;
    mul.wide.s32 %rd7, %r1, 4;
    add.s64 %rd8, %rd4, %rd7;
    add.s64 %rd9, %rd5, %rd7;
    ld.global.f32 %f1, [%rd8];
    ld.global.f32 %f2, [%rd9];
    add.f32 %f3, %f1, %f2;
    add.s64 %rd10, %rd6, %rd7;
    st.global.f32 [%rd10], %f3;

$L__BB0_2:
    .loc 1 9 1
    ret;
}
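
To make the PTX listing easier to follow, here is a CPU-side sketch in Python of what each GPU thread in `vectorAdd` appears to do, with the matching PTX instructions noted in comments. The Python function and argument names are hypothetical; the mapping of `param_0` to the output and `param_1`/`param_2` to the inputs is read off the loads and the store in the listing. Python floats are 64-bit, whereas the kernel works on 32-bit floats, so this only illustrates the control flow and indexing.

```python
def vector_add_thread(out, a, b, n, ctaid_x, ntid_x, tid_x):
    """Emulate one GPU thread of the vectorAdd kernel."""
    # mad.lo.s32 %r1, %r3, %r4, %r5  -> global index from block and thread ids
    i = ctaid_x * ntid_x + tid_x
    # setp.ge.s32 %p1, %r1, %r2; @%p1 bra $L__BB0_2  -> skip out-of-range threads
    if i >= n:
        return
    # ld.global.f32 x2, add.f32, st.global.f32  -> out[i] = a[i] + b[i]
    out[i] = a[i] + b[i]


def launch(out, a, b, n, grid_dim, block_dim):
    """Run every (block, thread) pair sequentially; a GPU runs them in parallel."""
    for ctaid_x in range(grid_dim):
        for tid_x in range(block_dim):
            vector_add_thread(out, a, b, n, ctaid_x, block_dim, tid_x)


n = 10
a = [float(i) for i in range(n)]
b = [2.0 * i for i in range(n)]
out = [0.0] * n

# 3 blocks of 4 threads = 12 threads; the last 2 fail the bounds check.
launch(out, a, b, n, grid_dim=3, block_dim=4)
print(out)
```

The bounds check is what lets the launch round the grid size up past `n` without writing out of range.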
GPU Workspaces

Develop kernels on GPUs while spending ~95% less. Your CPU environment is persistent; a GPU spins up only when you run code.

[Diagram: Your IDE connected to a persistent CPU container, with an on-demand GPU on standby while you write code. Separate timers track session time and GPU time.]
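
One way a figure like ~95% could arise is if the GPU is billed only while it is attached. The sketch below works through that arithmetic under two loud assumptions that the page does not confirm: billing is proportional to GPU-attached time, and the cost of the CPU container is ignored. The session numbers are hypothetical.

```python
def gpu_cost_savings(session_seconds: float, gpu_seconds: float) -> float:
    """Fraction saved versus holding a GPU for the whole session.

    Assumes cost is proportional to GPU-attached time only.
    """
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    if not 0 <= gpu_seconds <= session_seconds:
        raise ValueError("gpu_seconds must be between 0 and session_seconds")
    return 1.0 - gpu_seconds / session_seconds


# Hypothetical day: an 8-hour editing session with 24 minutes of GPU runs.
session = 8 * 60 * 60   # 28,800 s
gpu_time = 24 * 60      # 1,440 s
savings = gpu_cost_savings(session, gpu_time)

print(f"GPU attached {gpu_time / session:.0%} of the session")
print(f"savings vs. always-on GPU: {savings:.0%}")
```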

10x your GPU engineering productivity

Available as a Cursor and VS Code extension. All your GPU development tools in one place.

Everything you need for GPU development. Built for kernel engineers who want to ship faster.

NCU Integration

Run NVIDIA Nsight Compute (NCU) profiles directly from your editor. Get insights without context switching.

Documentation Search

Search CUDA programming guides, API references, and optimization best practices instantly.

GPU Workspaces

Develop on GPUs while spending ~95% less. Your CPU environment is persistent; a GPU spins up only when needed.

Compiler Explorer

See the generated PTX and SASS from your CUDA code. Like Godbolt, but for GPU kernels.

AI Agent

An agent that reads your profiling data and suggests the next optimization to implement.

Tool Calling

The agent can call NCU, search docs, and run code: the same actions you can take, but automated.

Code Diff

Review agent-suggested changes before applying. Accept, reject, or modify the proposed optimizations.

Hyperparameter Tuning

Ask the agent to automatically sweep common kernel hyperparameters like tile sizes, thread counts, and unroll factors.
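
A sweep like this is, at its simplest, an exhaustive search over a small grid of settings. The sketch below shows that idea only; it is not the product's implementation. The `benchmark` callable, the parameter names, and the stand-in cost model are all hypothetical, and a real sweep would compile and time the kernel for each configuration instead.

```python
import itertools


def sweep(benchmark, search_space):
    """Try every combination in search_space and return the fastest one.

    benchmark: callable taking a config dict and returning a time (lower is better).
    search_space: dict mapping parameter name -> list of candidate values.
    """
    names = list(search_space)
    best_config, best_time = None, float("inf")
    for values in itertools.product(*(search_space[name] for name in names)):
        config = dict(zip(names, values))
        elapsed = benchmark(config)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


def fake_benchmark(config):
    """Stand-in cost model so the example runs without a GPU (not real timings)."""
    return (abs(config["tile_size"] - 64)
            + abs(config["threads"] - 256) / 32
            + abs(config["unroll"] - 4))


space = {
    "tile_size": [16, 32, 64, 128],
    "threads": [128, 256, 512],
    "unroll": [1, 2, 4, 8],
}
best_config, best_time = sweep(fake_benchmark, space)
print(best_config, best_time)
```

The grid here has 4 x 3 x 4 = 48 configurations; exhaustive search grows multiplicatively with each added parameter, which is why sweeps are usually limited to a few values per knob.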

Simple, transparent pricing

Start free, scale as you need. Credits work for both AI agent calls and GPU compute time.

Start

$0/mo
$5 credits/month
Includes:
$5 free credits/month
B200s with hardware counters
Access to all tools

Hacker (Popular)

$16/mo
$20 credits/month
Includes:
Everything in Start
$20 in credits/month
Credit top-ups available
Support Slack channel

Pro

$100/mo
$128 credits/month
Includes:
Everything in Hacker
$128 in credits/month
Direct Slack access to founders
< 2 hour response time
Priority support

Enterprise

Contact Us
Unlimited credits/month
Includes:
Everything in Pro
Custom credit allocation
Dedicated infrastructure
Custom SLAs
On-premise deployment