Slides from the 2022 WebGL & WebGPU Meetup

1 Use the label attribute wherever it can be used

Every object in WebGPU has a label: you can either pass the label field of the descriptor when creating the object, or assign its label property directly after creation. The label acts like an id; it makes the object easier to identify when debugging and profiling. Setting it costs almost nothing, and it pays off enormously while debugging.

const projectionMatrixBuffer = gpuDevice.createBuffer({
  label: 'Projection Matrix Buffer',
  size: 12 * Float32Array.BYTES_PER_ELEMENT, // deliberately 12; a 4x4 matrix actually needs 16
  usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
})
const projectionMatrixArray = new Float32Array(16)

gpuDevice.queue.writeBuffer(projectionMatrixBuffer, 0, projectionMatrixArray)

The code above deliberately creates the matrix buffer with the wrong size; the resulting error message includes the label:

// Console output
Write range (bufferOffset: 0, size: 64) does not fit in [Buffer "Projection Matrix Buffer"] size (48).
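
Because label is a writable attribute on every WebGPU object, it can also be assigned (or changed) after creation:

projectionMatrixBuffer.label = 'Projection Matrix Buffer (v2)'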

2 Use debug groups

Command encoders and pass encoders let you push and pop debug groups, which are essentially a stack of labelled regions marking which part of the code is executing. When an error is reported, the message includes the debug group stack:

// --- First debug point: mark the current frame ---
commandEncoder.pushDebugGroup(`Frame ${frameIndex}`);
  // --- First nested debug point: mark the light update ---
  commandEncoder.pushDebugGroup('Clustered Light Compute Pass');
    // e.g. update the light sources here
    updateClusteredLights(commandEncoder);
  commandEncoder.popDebugGroup();
  // --- End of the first nested debug point ---
  // --- Second nested debug point: mark the main render pass ---
  commandEncoder.pushDebugGroup('Main Render Pass');
    // issue the scene draw calls
    renderScene(commandEncoder);
  commandEncoder.popDebugGroup();
  // --- End of the second nested debug point ---
commandEncoder.popDebugGroup();
// --- End of the first debug point ---

This way, if an error occurs, the message includes the stack:

// Console output
Binding sizes are too small for bind group [BindGroup] at index 0

Debug group stack:
> "Main Render Pass"
> "Frame 234"

3 Load texture image from blob

Creating an ImageBitmap from a Blob gives the best JPG/PNG texture decoding performance.

/**
 * Asynchronously create a texture from an image URL and copy the image data into it
 * @param {GPUDevice} gpuDevice the device
 * @param {string} url URL of the texture image
 */
async function createTextureFromImageUrl(gpuDevice, url) {
  const blob = await fetch(url).then((r) => r.blob())
  const source = await createImageBitmap(blob)
  
  const textureDescriptor = {
    label: `Image Texture ${url}`,
    size: {
      width: source.width,
      height: source.height,
    },
    format: 'rgba8unorm',
    usage: GPUTextureUsage.TEXTURE_BINDING | GPUTextureUsage.COPY_DST | GPUTextureUsage.RENDER_ATTACHMENT, // copyExternalImageToTexture requires COPY_DST and RENDER_ATTACHMENT
  }
  const texture = gpuDevice.createTexture(textureDescriptor)
  gpuDevice.queue.copyExternalImageToTexture(
    { source },
    { texture },
    textureDescriptor.size,
  )
  
  return texture
}
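
Usage is then a single await (the URL below is just a placeholder):

const diffuseTexture = await createTextureFromImageUrl(gpuDevice, './assets/diffuse.png')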

It is even better to use compressed texture formats; use them whenever you can.

WebGPU supports at least 3 compressed texture types:

  • texture-compression-bc
  • texture-compression-etc2
  • texture-compression-astc

How much of this is available depends on the hardware. According to the official discussion (GitHub issue 2083), every platform is required to support either the BC formats (a.k.a. DXT, S3TC) or the ETC2 and ASTC formats, so some form of texture compression is always available.
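
As a minimal sketch (assuming a GPUAdapter named adapter has already been requested), you can check which of these features the adapter exposes and request them when creating the device:

// Keep only the compressed-texture features this adapter actually supports
const compressionFeatures = [
  'texture-compression-bc',
  'texture-compression-etc2',
  'texture-compression-astc',
].filter((feature) => adapter.features.has(feature))

// Request them so the device may create and sample textures in those formats
const gpuDevice = await adapter.requestDevice({
  requiredFeatures: compressionFeatures,
})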

It is strongly recommended to use a supercompressed texture format such as Basis Universal: it is device-agnostic and can be transcoded at runtime into whatever format the device supports, so you do not have to ship textures in several formats.

The original author wrote a library for loading compressed textures in both WebGL and WebGPU; see GitHub: toji/web-texture-tool.

WebGL's support for compressed textures was never great, but WebGPU supports them natively, so use them as much as possible!

4 Use the glTF processing library gltf-transform

This is an open-source library that you can find on GitHub; it also provides a command-line tool.

For example, you can use this to compress glb textures:

> gltf-transform etc1s paddle.glb paddle2.glb
paddle.glb (11.92 MB) → paddle2.glb (1.73 MB)

The result is visually lossless, yet the file exported from Blender gets much smaller. The original model's textures are five 2048 x 2048 PNG images.

Besides compressing textures, the library can also resize textures, resample animations, apply Google Draco compression to the geometry, and much more. After all of the optimizations below, the glb is less than 5% of its original size.

> gltf-transform resize paddle.glb paddle2.glb --width 1024 --height 1024
> gltf-transform etc1s paddle2.glb paddle2.glb
> gltf-transform resample paddle2.glb paddle2.glb
> gltf-transform dedup paddle2.glb paddle2.glb
> gltf-transform draco paddle2.glb paddle2.glb

  paddle.glb (11.92 MB) → paddle2.glb (596.46 KB)

5 Buffer data uploads

There are many ways to get data into a buffer, and writeBuffer() is not necessarily the wrong one. When calling WebGPU from WASM, prefer the writeBuffer() API: it avoids an extra buffer copy.

const projectionMatrixBuffer = gpuDevice.createBuffer({
  label: 'Projection Matrix Buffer',
  size: 16 * Float32Array.BYTES_PER_ELEMENT,
  usage: GPUBufferUsage.VERTEX | GPUBufferUsage.COPY_DST,
});

// Called whenever the projection matrix changes (e.g. the window was resized)
function updateProjectionMatrixBuffer(projectionMatrix) {
  const projectionMatrixArray = projectionMatrix.getAsFloat32Array();
  gpuDevice.queue.writeBuffer(projectionMatrixBuffer, 0, projectionMatrixArray);
}

The original author also mentions mappedAtCreation for filling a buffer while creating it; sometimes you can skip mapping at creation entirely, for example when loading the buffers of a glTF file.
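
For comparison, a minimal sketch of the mappedAtCreation path (vertexData here is a hypothetical Float32Array that is already available when the buffer is created):

const vertexBuffer = gpuDevice.createBuffer({
  label: 'Vertex Buffer',
  size: vertexData.byteLength,
  usage: GPUBufferUsage.VERTEX,
  mappedAtCreation: true, // the buffer starts out mapped, so it can be filled immediately
})
new Float32Array(vertexBuffer.getMappedRange()).set(vertexData)
vertexBuffer.unmap() // unmapping hands the data over to the GPU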

6 It is recommended to create a pipeline asynchronously

If you are not going to use a render or compute pipeline immediately, prefer the createRenderPipelineAsync and createComputePipelineAsync APIs over their synchronous counterparts.

When a pipeline is created synchronously, the implementation may compile its underlying resources on the spot, which stalls other GPU work.

With asynchronous creation, the Promise does not resolve until the pipeline is ready, which means the GPU can keep doing whatever it is currently busy with and get the new pipeline ready without blocking.

Take a look at the comparison code below:

// Create the compute pipeline synchronously
const computePipeline = gpuDevice.createComputePipeline({/* ... */})

computePass.setPipeline(computePipeline)
computePass.dispatchWorkgroups(32, 32) // The dispatch happens now; the shader may still be compiling, so this can stall

Now compare with the asynchronous version:

// Create the compute pipeline asynchronously
const asyncComputePipeline = await gpuDevice.createComputePipelineAsync({/* ... */})

computePass.setPipeline(asyncComputePipeline)
computePass.dispatchWorkgroups(32, 32) // By now the shader has long been compiled, so there is no stall

7 Use implicit pipeline layouts with caution

Implicit pipeline layouts, especially for standalone compute pipelines, may feel convenient while writing JavaScript, but they bring two potential problems:

  • They break sharing of resource bind groups between pipelines
  • Odd things can happen when the shader is updated

If your use case is particularly simple, an implicit pipeline layout is fine, but create the pipeline layout explicitly whenever you can.

The following shows how a so-called implicit pipeline layout is used: the pipeline object is created first, and getBindGroupLayout() then returns the bind group layout that was inferred from the shader code.

const computePipeline = await gpuDevice.createComputePipelineAsync({
  // No explicit layout object; 'auto' asks the implementation to infer one from the shader
  layout: 'auto',
  compute: {
    module: computeModule,
    entryPoint: 'computeMain'
  }
})

const computeBindGroup = gpuDevice.createBindGroup({
  // Fetch the implicitly created bind group layout
  layout: computePipeline.getBindGroupLayout(0),
  entries: [{
    binding: 0,
    resource: { buffer: storageBuffer },
  }]
})
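
For contrast, here is a sketch of the explicit alternative for the same pipeline, assuming the shader's binding 0 is a storage buffer:

// Declare the bind group layout and pipeline layout up front
const computeBindGroupLayout = gpuDevice.createBindGroupLayout({
  label: 'Compute BindGroupLayout',
  entries: [{
    binding: 0,
    visibility: GPUShaderStage.COMPUTE,
    buffer: { type: 'storage' },
  }]
})

const explicitComputePipeline = await gpuDevice.createComputePipelineAsync({
  layout: gpuDevice.createPipelineLayout({ bindGroupLayouts: [computeBindGroupLayout] }),
  compute: {
    module: computeModule,
    entryPoint: 'computeMain'
  }
})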

8 Share resource bind groups and bind group layouts

If some values do not change but are used frequently during rendering/computing, you can create a fairly simple bind group layout for them and reuse it with any pipeline that binds that group at the same index.

First, create the bind group layout and the bind group itself:

// Create a bind group layout for the camera UBO, plus the bind group itself
const cameraBindGroupLayout = gpuDevice.createBindGroupLayout({
  label: `Camera uniforms BindGroupLayout`,
  entries: [{
    binding: 0,
    visibility: GPUShaderStage.VERTEX | GPUShaderStage.FRAGMENT,
    buffer: {},
  }]
})

const cameraBindGroup = gpuDevice.createBindGroup({
  label: `Camera uniforms BindGroup`,
  layout: cameraBindGroupLayout,
  entries: [{
    binding: 0,
    resource: { buffer: cameraUniformsBuffer, },
  }],
})

Then create two render pipelines. Note that both pipelines use two bind groups; the material bind groups differ between them, while the camera bind group layout is shared:

const renderPipelineA = gpuDevice.createRenderPipeline({
  label: `Render Pipeline A`,
  layout: gpuDevice.createPipelineLayout({ bindGroupLayouts: [cameraBindGroupLayout, materialBindGroupLayoutA] }),
  /* Etc... */
});

const renderPipelineB = gpuDevice.createRenderPipeline({
  label: `Render Pipeline B`,
  layout: gpuDevice.createPipelineLayout({ bindGroupLayouts: [cameraBindGroupLayout, materialBindGroupLayoutB] }),
  /* Etc... */
});

Finally, in each frame of the render loop you only need to set the camera bind group once, which reduces CPU-to-GPU traffic:

const renderPass = commandEncoder.beginRenderPass({/* ... */});

// Set the camera bind group only once per frame
renderPass.setBindGroup(0, cameraBindGroup);

for (const pipeline of activePipelines) {
  renderPass.setPipeline(pipeline.gpuRenderPipeline)
  for (const material of pipeline.materials) {
    // whereas each material's bind group inside the pipeline is set individually
    renderPass.setBindGroup(1, material.gpuBindGroup)
    
    // Set the vertex buffers and issue the draw calls
    for (const mesh of material.meshes) {
      renderPass.setVertexBuffer(0, mesh.gpuVertexBuffer)
      renderPass.draw(mesh.drawCount)
    }
  }
}

renderPass.end()
