Your GPU Development Stack

Profile, optimize, and ship GPU kernels faster, all while staying in your own editor

Install Extension

Backed by

Fifty Years

Y Combinator

Liquid 2

NVIDIA Inception

Jeff Dean, Chief Scientist at Google

Woj Zaremba, Co-Founder at OpenAI

Dan Fu, Head of Kernels at Together

Charlie Songhurst, Meta Board of Directors

Arash Ferdowsi, Co-Founder at Dropbox

Kawal Gandhi, Office of the CTO at Google

GPU Profiling

Profile your code directly in your IDE and pass the results as context to your coding agent.

NCU Profiler

GPU: NVIDIA B200
Selected: elementwise_kernel_with_index

ID  Speedup  Function Name                     Duration   Compute  Memory
1   50.0%    distribution_elementwise_grid...  223.46 µs  62.3%    5.3%
2   40.7%    distribution_elementwise_grid...  12.06 µs   38.4%    2.9%
3   98.8%    distribution_elementwise_grid...  5.92 µs    1.5%     3.2%
4   50.0%    distribution_elementwise_grid...  37.38 µs   55.7%    4.4%
5   46.5%    unrolled_elementwise_kernel       37.76 µs   62.2%    8.1%
6   100%     elementwise_kernel_with_index     4.80 µs    0.0%     4.0%
7   22.2%    elementwise_kernel                37.92 µs   71.0%    29.6%

Optimization Opportunities

This kernel grid is too small to fill available resources, resulting in only 0.0 full waves across all SMs.
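The "full waves" figure in this message can be estimated as the grid size divided by the number of thread blocks the GPU can keep resident at once. The sketch below illustrates that arithmetic only. The SM count and resident blocks per SM are hypothetical values (neither appears in the profile above), and `waves_per_gpu` is an illustrative helper, not part of any profiler API.

```python
def waves_per_gpu(grid_blocks: int, num_sms: int, blocks_per_sm: int) -> float:
    """Estimate how many full waves of thread blocks a launch occupies.

    One wave is the number of blocks the whole GPU can hold resident at the
    same time: num_sms * blocks_per_sm.
    """
    if num_sms <= 0 or blocks_per_sm <= 0:
        raise ValueError("num_sms and blocks_per_sm must be positive")
    return grid_blocks / (num_sms * blocks_per_sm)


# Hypothetical device: 100 SMs, each able to hold 8 resident blocks.
small_launch = waves_per_gpu(grid_blocks=16, num_sms=100, blocks_per_sm=8)
print(f"{small_launch:.1f} full waves")  # prints "0.0 full waves"
```

A launch that rounds to 0.0 waves leaves most SMs idle, which is why the profiler flags the grid as too small.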

GPU Speed Of Light Throughput

High-level overview of throughput for compute and memory resources. Throughput is reported as percentage of theoretical maximum.

Memory Throughput: 5.25%
DRAM Throughput: 0.38%
L1/TEX Cache Throughput: 5.57%
L2 Cache Throughput: 4.49%
Compute (SM) Throughput: 62.31%
DRAM Frequency: 3.99 GHz
SM Frequency: 1.13 GHz
Elapsed Cycles: 257,296 cycles
Duration: 223.46 µs
SM Active Cycles: 239,788.51 cycles
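Two quick sanity checks can be derived from the cycle counts above: the fraction of elapsed cycles in which the SMs were active, and a rough duration estimate from elapsed cycles and SM frequency. This is a minimal sketch using the displayed values. The variable names are illustrative, and the duration estimate is only approximate because the displayed frequency is rounded and the metrics may be collected differently.

```python
# Values copied from the profile above.
elapsed_cycles = 257_296
sm_active_cycles = 239_788.51
sm_frequency_hz = 1.13e9  # 1.13 GHz as displayed (rounded)

# Share of elapsed cycles during which SMs had work resident.
sm_active_fraction = sm_active_cycles / elapsed_cycles

# Rough wall-clock estimate: cycles divided by clock rate, in microseconds.
estimated_duration_us = elapsed_cycles / sm_frequency_hz * 1e6

print(f"SM active: {sm_active_fraction:.1%}")                  # SM active: 93.2%
print(f"Estimated duration: {estimated_duration_us:.1f} us")   # Estimated duration: 227.7 us
```

The estimate lands within a few percent of the reported 223.46 µs, which is about as close as rounded inputs allow.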

GPU Throughput

Compute (SM): 62.3%

Memory: 5.3%

PM Sampling

Timeline view of performance monitor metrics sampled periodically over the workload duration.

GPU Documentation

Fast search over the most complete GPU documentation, in your own editor.

Compiler Explorer

Compile CUDA and CuTe DSL code directly to PTX and SASS, mapped back to source and available as agent context.

// Generated by NVIDIA NVVM Compiler
// Compiler Build ID: CL-36424714
// Cuda compilation tools, release 13.0, V13.0.88
// Based on NVVM 20.0.0

.version 9.0
.target sm_100
.address_size 64

// .globl _Z9vectorAddPfPKfS1_i
// _ZZ9reduceSumPfPKfiE5sdata has been demoted

.visible .entry _Z9vectorAddPfPKfS1_i(
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_0,
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_1,
    .param .u64 .ptr .align 1 _Z9vectorAddPfPKfS1_i_param_2,
    .param .u32 _Z9vectorAddPfPKfS1_i_param_3
)
{
    .reg .pred %p<2>;
    .reg .b32 %r<6>;
    .reg .b32 %f<4>;
    .reg .b64 %rd<11>;
    .loc 1 4 0

    ld.param.u64 %rd1, [_Z9vectorAddPfPKfS1_i_param_0];
    ld.param.u64 %rd2, [_Z9vectorAddPfPKfS1_i_param_1];
    ld.param.u64 %rd3, [_Z9vectorAddPfPKfS1_i_param_2];
    ld.param.u32 %r2, [_Z9vectorAddPfPKfS1_i_param_3];
    .loc 1 5 5
    mov.u32 %r3, %ctaid.x;
    mov.u32 %r4, %ntid.x;
    mov.u32 %r5, %tid.x;
    mad.lo.s32 %r1, %r3, %r4, %r5;
    .loc 1 6 5
    setp.ge.s32 %p1, %r1, %r2;
    @%p1 bra $L__BB0_2;
    .loc 1 7 5
    cvta.to.global.u64 %rd4, %rd2;
    cvta.to.global.u64 %rd5, %rd3;
    cvta.to.global.u64 %rd6, %rd1;
    mul.wide.s32 %rd7, %r1, 4;
    add.s64 %rd8, %rd4, %rd7;
    add.s64 %rd9, %rd5, %rd7;
    ld.global.f32 %f1, [%rd8];
    ld.global.f32 %f2, [%rd9];
    add.f32 %f3, %f1, %f2;
    add.s64 %rd10, %rd6, %rd7;
    st.global.f32 [%rd10], %f3;

$L__BB0_2:
    .loc 1 9 1
    ret;
}
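To make the listing easier to follow, here is a plain-Python emulation of what the PTX appears to compute: each thread derives a global index from its block and thread IDs, exits if the index is out of range, and otherwise adds one pair of elements. The mangled name `_Z9vectorAddPfPKfS1_i` demangles to `vectorAdd(float*, const float*, const float*, int)`. The function name `vector_add_emulated` and the launch parameters below are illustrative assumptions, and Python floats are 64-bit rather than the 32-bit floats the kernel uses.

```python
def vector_add_emulated(a, b, n, block_dim, grid_dim):
    """Emulate the vectorAdd kernel one thread at a time (sequentially)."""
    c = [0.0] * n
    for ctaid in range(grid_dim):          # %ctaid.x: block index
        for tid in range(block_dim):       # %tid.x: thread index within block
            # mad.lo.s32 %r1, %r3, %r4, %r5 -> ctaid * ntid + tid
            i = ctaid * block_dim + tid
            # setp.ge.s32 + predicated branch: out-of-range threads exit early
            if i >= n:
                continue
            # Two ld.global.f32, one add.f32, one st.global.f32
            c[i] = a[i] + b[i]
    return c


# Hypothetical launch: 2 blocks of 2 threads covering 3 elements.
result = vector_add_emulated([1.0, 2.0, 3.0], [10.0, 20.0, 30.0],
                             n=3, block_dim=2, grid_dim=2)
print(result)  # [11.0, 22.0, 33.0]
```

The fourth emulated thread (index 3) takes the early-exit path, mirroring the branch to `$L__BB0_2` in the PTX.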

GPU Workspaces

Develop kernels on GPUs while spending ~95% less. A persistent CPU environment stays up, and a GPU spins up only when you run code.

[Diagram: your IDE connected to a persistent CPU container, with an on-demand GPU on standby; counters track session time and GPU time.]
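The savings figure follows from how little of a typical session actually needs the GPU. The sketch below shows the arithmetic under loud assumptions: GPU time is the dominant cost, the CPU container's cost is negligible, and the session lengths are hypothetical. It does not reflect actual billing rules.

```python
def gpu_time_savings(session_seconds: float, gpu_seconds: float) -> float:
    """Fraction of GPU time saved versus holding a GPU for the whole session."""
    if session_seconds <= 0:
        raise ValueError("session_seconds must be positive")
    if not 0 <= gpu_seconds <= session_seconds:
        raise ValueError("gpu_seconds must be between 0 and session_seconds")
    return 1.0 - gpu_seconds / session_seconds


# Hypothetical session: one hour of editing, three minutes of GPU runs.
savings = gpu_time_savings(session_seconds=3600, gpu_seconds=180)
print(f"{savings:.0%} less GPU time")  # prints "95% less GPU time"
```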

10x your GPU engineering productivity

Available as a Cursor and VS Code extension. All your GPU development tools in one place.

Everything you need for GPU development. Built for kernel engineers who want to ship faster.

NCU Integration

Run NVIDIA Nsight Compute (NCU) profiles directly from your editor. Get insights without context switching.

Documentation Search

Search CUDA programming guides, API references, and optimization best practices instantly.

GPU Workspaces

Develop on GPUs while spending ~95% less. A persistent CPU environment stays up, and a GPU spins up only when needed.

Compiler Explorer

See the generated PTX and SASS from your CUDA code. Like Godbolt, but for GPU kernels.

AI Agent

An agent that reads your profiling data and suggests the next optimization to implement.

Tool Calling

The agent can call NCU, search docs, and run code: the same actions you can take, but automated.

Code Diff

Review agent-suggested changes before applying. Accept, reject, or modify the proposed optimizations.

Hyperparameter Tuning

Ask the agent to automatically sweep common kernel hyperparameters like tile sizes, thread counts, and unroll factors.

Simple, transparent pricing

Start free, scale as you need. Credits work for both AI agent calls and GPU compute time.

Start

$0/mo

$5 credits/month

Includes

$5 free credits/month

B200s with hardware counters

Access to all tools

Get Started

Hacker (Popular)

$16/mo

$20 credits/month

Includes

Everything in Start

$20 in credits/month

Credit top-ups available

Support Slack channel

Pro

$100/mo

$128 credits/month

Includes

Everything in Hacker

$128 in credits/month

Direct Slack access to founders

< 2 hour response time

Priority support

Enterprise

Unlimited credits/month

Includes

Everything in Pro

Custom credit allocation

Dedicated infrastructure

Custom SLAs

On-premise deployment

Talk to Founders
