Translated article: WebGPU - focus on the processing cores (GPU Cores), not the drawing canvas (Canvas)

Originally published on March 8, 2022. Original article: https://surma.dev/things/webgpu

This article is very long: about 10,000 words, not counting the code. It aims to build a solid understanding of the concepts in WebGPU's compute pipeline, illustrated with a simple 2D physics simulation. The focus is on the compute pipeline and compute shaders; the drawing part is done with Canvas2D.


WebGPU is an upcoming Web API that gives you low-level access to the Graphics Processing Unit (GPU).

The original author has little experience with graphics programming. He learned WebGL by working through OpenGL tutorials while building a game engine, and learned about shaders from Inigo Quilez's examples on ShaderToy. That was enough to build effects like the background animation in PROXX, but he was never quite comfortable with WebGL. Why will be explained shortly.

When the author started paying attention to WebGPU, most people told him to expect even more boilerplate than in WebGL. Undeterred, and bracing for the worst, he went looking for tutorials and specification documents; there weren't many at the time, because WebGPU was still in an early stage of development. Once he got into it, though, he found that WebGPU did not bring more of the dreaded "boilerplate" than WebGL; it actually felt like meeting an old friend.

So, this article is here to share what I have learned.

The author is explicit that he will not cover how to draw graphics here; instead he covers how to use WebGPU to access the GPU for its most basic function: computation (translator's note: that is, general-purpose computing).

He feels there is already plenty of material on drawing with WebGPU, for example Austin's samples; he may write some articles about rendering later if he feels like it.

He will go fairly deep here, in the hope that readers can then use WebGPU correctly and effectively, but he does not promise you will come out the other end a GPU performance expert.

With the preamble out of the way, let's get going.

1. WebGL

WebGL was released in 2011 and, until now, has been the only low-level API for accessing the GPU on the web. It is essentially a thin wrapper around OpenGL ES 2.0 for use on the web. Both WebGL and OpenGL are standardized by the Khronos Group, a working group you can think of as the W3C of the graphics world.

OpenGL itself is a rather old API, and not a very good one by today's standards. It is centered around an internal global state object. This design minimizes the amount of data that needs to be transferred to the GPU for any given call, but it also incurs a lot of additional overhead.

internalstate.27cc7a6d.png

Above: Visualization of WebGL internal global state objects, from WebGL Fundamentals

The internal state object is, bluntly put, mostly a collection of pointers. Calling an OpenGL API changes what these pointers point to, so the order in which state is changed matters, which makes writing abstractions and libraries much harder. You must know exactly which state the upcoming API call needs, set it up beforehand, and restore the previous values afterwards.

He says he has spent plenty of time staring at a black canvas (which is how WebGL signals most of its errors), frantically trying to figure out which API call he forgot or which piece of global state he failed to set correctly.

He admits he doesn't know how Three.js structures its state management, but it does a good job of it, and he thinks this is the main reason most people use Three.js rather than raw WebGL.

"Can't agree with WebGL very well" This is only for the original author himself, not for the readers. He said that people smarter than him have done a lot of nice things with WebGL and OpenGL, but he has never been satisfied.

With the advent of machine learning, neural networks, and cryptocurrencies, GPUs have proven they can do more than draw triangles. Using the GPU for arbitrary data computation is called GPGPU, but WebGL 1.0 was not designed for it. If you wanted to do it in WebGL 1.0, you had to encode your data into a texture, decode it in a shader, compute, and re-encode the result as a texture. WebGL 2.0 made this a little easier with transform feedback, but Safari didn't support WebGL 2.0 until September 2021 (most browsers had it since January 2017), so it was never really a good option.

Still, WebGL 2.0 doesn't change the essence of WebGL, which is global state.

2. WebGPU

Outside the web, a new generation of graphics APIs has taken shape. They expose lower-level interfaces to the graphics card and fix many of the constraints of OpenGL.

This mainly refers to DirectX 12, Vulkan, and Metal.

On the one hand, GPUs are now everywhere; even mobile devices have decent ones. As a result, modern graphics programming (3D rendering, ray tracing) and GPGPU will become increasingly common.

On the other hand, most devices have multi-core processors, so optimizing how multiple threads interact with the GPU is an important topic.

The people designing these new APIs took this status quo into account: validation work is done up front, before work is loaded onto the GPU, which leaves developers more energy to focus on squeezing performance out of it.

The most popular next-generation GPU APIs are:

  • The Khronos Group's Vulkan
  • Apple's Metal
  • Microsoft's DirectX 12

To bring these technologies together and bring them to the Web, WebGPU was born.

WebGL is a thin wrapper around OpenGL, but WebGPU takes a different approach. It does not wrap any of these lower-level APIs directly; instead it introduces its own system of abstractions, drawing on the strengths of all of them.

The reason is simple: none of these three APIs is available on every platform, and each has very low-level concepts of its own that don't make much sense on the web.

Instead, WebGPU was designed so that it "feels like it was made for the web", while still mapping straightforwardly onto whichever GPU API your machine provides. Its abstractions have been standardized by the W3C, and all browsers have to implement them. Since WebGPU is relatively low-level, its learning curve is steep, but the author promises to break things down as much as possible.

2.1. Adapter and Device

The first WebGPU abstractions you come into contact with are the adapter and the (logical) device.

image-20220311111520333.png

Above: The abstraction layer, from the physical GPU to the logical device.

The physical device is the GPU itself, which can be a built-in GPU (integrated graphics) or an external GPU (discrete graphics card). Usually a given device has just one GPU, but there are setups with two or more. Microsoft's Surface Book, for example, famously has two graphics cards so the operating system can switch between them as the situation demands.

The operating system uses drivers provided by the GPU vendor to access the GPU; in turn, the operating system exposes the GPU's capabilities to applications through OS-specific APIs (such as Vulkan or Metal).

The GPU is a shared resource: it is used by many programs at once, and it is also responsible for output to the display. Something is needed to let multiple processes use the GPU concurrently, so that each can put its own things on screen.

Each process appears to have sole control over the GPU, but that is only an illusion; in reality, the driver and the operating system handle all of this scheduling and arbitration.

The adapter is the intermediary between the OS-specific APIs and WebGPU.

However, since the browser is itself a "mini operating system" that runs multiple web apps, it needs to share the adapter at the browser level, so that each web app still feels as if it had sole control of the GPU. For that, each web app gets yet another abstraction: the logical device.

To get an adapter object, call navigator.gpu.requestAdapter(). At the time of writing, this method takes very few options; they let you request either a high-performance adapter (usually a powerful discrete GPU) or a low-power one (usually an integrated GPU).
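For illustration, here is a minimal sketch of requesting a specific kind of adapter. The powerPreference option and its two values ("high-performance" and "low-power") are part of the WebGPU API, but whether you actually get the GPU you asked for is up to the browser:

// Prefer the discrete GPU if the machine has one; "low-power" would prefer the integrated GPU.
const adapter = await navigator.gpu.requestAdapter({
  powerPreference: "high-performance",
})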

Translator's note: unless stated otherwise, the WebGPU code in this article is the browser-side WebGPU JavaScript API.

Software rendering: some systems (for example, certain Linux setups) may have no GPU, or a GPU that isn't capable enough, in which case a "fallback adapter" may be provided. Such an adapter is implemented entirely in software, emulated on the CPU; it won't be fast, but it keeps things working.

If you get a non-null adapter object, you can go on to asynchronously call adapter.requestDevice() to request the logical device. Here is sample code:

if (!navigator.gpu) throw Error("WebGPU not supported.");

const adapter = await navigator.gpu.requestAdapter();
if (!adapter) throw Error("Couldn’t request WebGPU adapter.");

const device = await adapter.requestDevice();
if (!device) throw Error("Couldn’t request WebGPU logical device.");

If requestDevice() is called without options, it returns a device that does not necessarily match the capabilities of the physical device, but rather what the WebGPU team considers a reasonable lowest common denominator across all GPUs.

The "limits" that apply when requesting a device object are listed in the specification.

For example, even if the author's GPU could easily handle 4 GB of data, the returned device object will only allow up to 1 GB; no matter how much you ask for, you get at most 1 GB. That way, the same code is likely to run on other machines without problems.

You can inspect adapter.limits to see the actual limits of the physical GPU. If you need more, you can pass the higher limits you require when requesting the device, and they will be validated against what the adapter supports.
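As a minimal sketch of that (the specific limit name is only an example; check adapter.limits for what your adapter actually offers):

// What can the physical adapter actually do?
console.log(adapter.limits.maxStorageBufferBindingSize)

// Ask for more than the default when requesting the device.
// If the adapter cannot satisfy the request, requestDevice() will reject.
const device = await adapter.requestDevice({
  requiredLimits: {
    maxStorageBufferBindingSize: 1024 * 1024 * 1024, // 1 GiB, assuming the adapter supports it
  },
})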

2.2. Shaders

If you have used WebGL, you are familiar with vertex shaders and fragment shaders. The traditional approach is not overly complicated: you upload a buffer of triangle data to the GPU and tell it how the buffer is structured into triangles. Each vertex in the vertex buffer describes a position, and usually auxiliary data such as color, texture coordinates, normals, and so on. Each vertex is processed by the vertex shader, which applies translation, rotation, perspective distortion, etc.

The word "shader" used to confuse the original author, because it does more than shading. But back when the term was coined (the late 1980s) it was apt: its job on the GPU was to compute the color value of each pixel. Today it refers to any program that runs on the GPU.

The GPU then rasterizes the triangles, working out which pixels each triangle covers on screen. Each of those pixels is processed by the fragment shader, which has access to the pixel coordinates and, optionally, auxiliary data to determine the pixel's final color. Used correctly, this can produce stunning 3D effects.

This flow of passing buffer data through the vertex shader, on to the fragment shader, and finally onto the screen can loosely be called a pipeline. In WebGPU, the pipeline must be defined explicitly.

2.3. Pipeline

Currently, WebGPU supports two pipelines:

  • render pipeline
  • Compute pipeline

As the name suggests, the render pipeline draws something; the result is a 2D image, which doesn't have to go to the screen but can be rendered into memory (a framebuffer). The compute pipeline is more general: its result is a buffer, which can contain arbitrary data.

The rest of this article will focus on the introduction of the compute pipeline, because the author considers the rendering pipeline to be a special case of the compute pipeline.

That framing somewhat turns history on its head, but conceptually the compute pipeline can be seen as the "foundation" on which the render pipeline is built, even though these so-called pipelines are, in the actual GPU, to some extent different physical circuits.

With that view, if more kinds of pipelines, such as a "ray tracing pipeline", are added to WebGPU in the future, it will be a natural extension.

In the WebGPU API, a pipeline consists of one or more programmable stages, each defined by a shader module and an entry point function. A compute pipeline has a single compute shader stage, while a render pipeline has a vertex shader stage and a fragment shader stage. Below is a compute shader module together with a compute pipeline:

const module = device.createShaderModule({
  code: `
    @stage(compute) @workgroup_size(64)
    fn main() {
      // ...
    }
  `,
})

const pipeline = device.createComputePipeline({
  compute: {
    module,
    entryPoint: "main",
  },
})

This is the first appearance of WebGPU's Shading Language (WGSL, pronounced /wig-sal/ ).

The author's first impression of WGSL is "Rust + GLSL": it has a lot of Rust-like syntax, along with GLSL-like global functions (such as dot(), norm(), len()), GLSL-like types (vec2, mat4x4, ...), and swizzling (e.g. some_vec.xxy).

The browser compiles the WGSL source into whatever the underlying system expects: HLSL for D3D12, MSL for Metal, or SPIR-V for Vulkan.

SPIR-V: an open, binary intermediate format standardized by the Khronos Group. You can think of it as the LLVM of parallel programming languages: multiple languages compile into it, and it can be translated into multiple target languages.

The shader code above creates a single function, main, and marks it as the compute shader entry point using the @stage(compute) attribute (attribute is a WGSL term).

You can mark more than one function with @stage(compute) in the same shader code, which lets you reuse one shader module object across several pipelines; you just pass a different entryPoint to select a different entry function.
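As a minimal sketch of that idea, following the style of the snippet above (the entry point names simulate and postprocess are made up for this example):

const multiEntryModule = device.createShaderModule({
  code: `
    @stage(compute) @workgroup_size(64)
    fn simulate() {
      // ...
    }

    @stage(compute) @workgroup_size(64)
    fn postprocess() {
      // ...
    }
  `,
})

// Two pipelines share the same module but use different entry points.
const simulatePipeline = device.createComputePipeline({
  compute: { module: multiEntryModule, entryPoint: "simulate" },
})
const postprocessPipeline = device.createComputePipeline({
  compute: { module: multiEntryModule, entryPoint: "postprocess" },
})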

But what is that @workgroup_size(64) attribute?

2.4. Parallelism

GPUs optimize for data throughput at the expense of latency. Going deeper into this requires looking at GPU architecture, which the author doesn't feel confident enough to explain, so he recommends Fabian Giesen's series of articles.

As is well known, GPUs have a very large number of cores capable of massively parallel work. However, these cores do not operate as independently as CPU cores do in parallel programming. First, GPU cores are grouped hierarchically, and the architecture and terminology differ between vendors. Intel has a nice document giving a high-level description of their architecture.

In Intel's terminology, the smallest unit is called an Execution Unit (EU), and each EU has seven SIMT cores. SIMT means the seven cores run in lock-step, executing the same instruction in parallel. Each core has its own registers and its own pointers, so while all cores perform the same operation, the data they operate on can differ.

This is also why if/else branches are sometimes discouraged on the GPU, and it's the EU's fault: when the cores in an EU hit a branch and disagree on which side to take, every core ends up stepping through both branches (with the irrelevant results discarded), which throws away the advantage of parallelism.

The same goes for loops: if one core finishes early, it has to pretend to keep running while it waits for the other cores in the EU to finish.

And although the cores can crunch numbers quickly, fetching data from memory or sampling pixels from a texture takes noticeably longer; Fabian puts it at at least a couple of hundred clock cycles. Those cycles could obviously be spent computing. To make use of them, each EU is kept heavily loaded with work.

Whenever a task would go idle, for example while waiting for a value from memory, the EU does not simply wait; it immediately switches to another task and only switches back when that new task itself has to wait. This switching is very, very quick.

This is the trick GPUs use to optimize throughput at the expense of latency: scheduling these tasks keeps the EUs busy at all times.

intel.078dbef9.jpeg

Above: the architecture of an Intel Iris Xe graphics chip. It is divided into 8 SubSlices, each with 8 EUs; each EU has 7 SIMT cores.

The EU, however, is only the lowest level of Intel's hierarchy, as the figure shows. Multiple EUs are grouped into what Intel calls a "SubSlice", and all the EUs in a SubSlice can access a small amount of Shared Local Memory (SLM), about 64 KiB. If the program being run uses synchronization instructions, it has to run within a single SubSlice, because only there can memory be shared.

Going up another level, the SubSlices form Slices, which form the GPU. A GPU integrated into a CPU ends up with roughly 170 to 700 cores; a discrete GPU has 1500 or more.

Other manufacturers may use other terms, but the architecture can basically be understood in this way.

To take full advantage of this architecture, programs need to be written specifically for it, so that the GPU's performance can be squeezed to the maximum. Consequently, graphics APIs have to expose a threading model that matches it for dispatching compute work.

In the WebGPU API, this threading model is called "Workgroup".

2.5. Workgroup

Each vertex is processed once by the vertex shader, and each fragment once by the fragment shader (this is a simplification that glosses over a lot of detail).

In GPGPU, the equivalent of vertices and fragments must be defined by the developer. It is called a work item, and each work item is processed by the compute shader.

A set of work items forms a "workgroup", and all the workgroups dispatched together form what the author calls the "workload". Every work item in a workgroup is processed by a compute shader invocation running concurrently. In WebGPU, a workload can be pictured as a three-dimensional grid: the smallest cells are work items, work items form the slightly larger workgroups, and workgroups together make up the workload.

workgroups.d747bc94.jpeg

Above: a workload. The white cubes are work items; the red cube, made of 4³ = 64 white cubes, is a workgroup.

With these concepts in place, we can talk about the @workgroup_size(x, y, z) attribute in WGSL. Its job is simple: it tells the GPU how big the workgroup for this compute shader is, i.e. the size of the small red cube in the picture above. x*y*z is the number of work items per workgroup. Any dimension left out defaults to 1, so @workgroup_size(64) is equivalent to @workgroup_size(64, 1, 1).

The actual EUs are, of course, not laid out as cells in this 3D grid. The point of picturing work items this way is locality: the assumption is that neighboring workgroups are likely to access similar regions of memory, so when adjacent workgroups (small red cubes) are run in sequence, the chance of hitting data already in the cache is higher, and the GPU does not have to go all the way out to video memory, which saves a lot of cycles.

In practice, however, most hardware executes workgroups sequentially, so shaders declared with @workgroup_size(64) and @workgroup_size(8, 8) end up behaving much the same; in that sense the design is slightly vestigial.

Workgroups cannot be arbitrarily large; they are constrained by the device object's limits, which you can see by printing device.limits:

console.log(device.limits)

/*
{
  // ...
  maxComputeInvocationsPerWorkgroup: 256,
  maxComputeWorkgroupSizeX: 256,
  maxComputeWorkgroupSizeY: 256,
  maxComputeWorkgroupSizeZ: 64,
  maxComputeWorkgroupsPerDimension: 65535,
  // ...
}
*/

As you can see, each dimension has a maximum limit, and the cumulative product also has a maximum limit.
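As a minimal sketch, a hypothetical helper (not part of the article's code) could check a candidate workgroup size against these limits:

function workgroupSizeFits(device, [x, y, z]) {
  const limits = device.limits
  return (
    x <= limits.maxComputeWorkgroupSizeX &&
    y <= limits.maxComputeWorkgroupSizeY &&
    z <= limits.maxComputeWorkgroupSizeZ &&
    x * y * z <= limits.maxComputeInvocationsPerWorkgroup
  )
}

console.log(workgroupSizeFits(device, [64, 1, 1])) // true with the limits shown above
console.log(workgroupSizeFits(device, [256, 2, 1])) // false: 512 > maxComputeInvocationsPerWorkgroup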

Tip: avoid maxing out the thread count in every dimension. Even though the GPU is scheduled by the operating system underneath, a WebGPU program that occupies the GPU for too long can freeze the system.

So, what is a sensible workgroup size? It depends on what the workgroup dimensions mean for your problem. The author finds that answer vague, so he quotes Corentin: "Use 64 as your workgroup size (the product of the dimensions) unless you know precisely what GPU you are targeting or that your workload needs something different."

64 seems to be a safe number of threads: it works fine across most GPUs and keeps the EUs as fully occupied as possible.

2.6. Commands

At this point the shader is written and the pipeline is set up; all that's left is to actually tell the GPU to execute it. Because the GPU may be a discrete card with its own memory, it is controlled through a so-called command buffer or command queue.

The command queue is a chunk of memory that contains encoded commands for the GPU to execute. The encoding is tightly tied to the specific GPU and is produced by the driver. WebGPU exposes a "CommandEncoder" API that maps onto this concept.

const commandEncoder = device.createCommandEncoder()
const passEncoder = commandEncoder.beginComputePass()
passEncoder.setPipeline(pipeline)
passEncoder.dispatch(1)
passEncoder.end()
const commands = commandEncoder.finish()
device.queue.submit([commands])

The commandEncoder object has methods for copying data from one GPU buffer to another and for manipulating textures. It can also create a PassEncoder, which configures a pipeline and dispatches the encoded commands.

The example above uses a compute pipeline, so a compute pass encoder is created. setPipeline() selects the pipeline, and dispatch() tells the GPU how many workgroups to create along each dimension.

In other words, the total number of compute shader invocations equals the workgroup size in each dimension multiplied by the dispatch count in that dimension, all multiplied together.

For example, a workgroup with dimensions 2, 4, 1 dispatched 4, 2, 2 times along the three dimensions runs the compute shader a total of (2×4×1) × (4×2×2) = 8 × 16 = 128 times.
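As a small JavaScript sketch of that arithmetic in the one-dimensional case this article sticks to (the variable names are ours; it anticipates the buffer-sized dispatch used later on):

const WORKGROUP_SIZE = 64 // must match @workgroup_size(64) in the shader
const itemCount = 1000 // how many work items we actually want
const workgroupCount = Math.ceil(itemCount / WORKGROUP_SIZE) // 16 workgroups
passEncoder.dispatch(workgroupCount) // 16 × 64 = 1024 shader invocations in total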

Incidentally, the pass encoder is the WebGPU abstraction that replaces the WebGL global state machine the author complained about at the start of the article: all the data and state needed to run a GPU pipeline are passed through the pass encoder.

To step back: the command buffer is the hook through which the driver or the operating system lets multiple programs use the GPU without interfering with each other. Pushing commands into the queue effectively saves a program's state so it can be picked up later when the commands actually run; because the hardware is so fast, it looks to every program as if it had the GPU to itself.

If you run this code: @workgroup_size explicitly specifies 64 work items per workgroup, and dispatch(1) dispatches a single workgroup along that dimension, so in the end 64 threads are spawned. The pipeline does nothing yet (the shader body is empty), but at least it runs; isn't that cool?

Then, let's get some data to make it work.

3. Data exchange

As mentioned at the start, the author does not intend to use WebGPU for rendering directly; he uses it for a physics simulation, visualized simply with Canvas2D. "Physics simulation" is a grand name for what is really just a bunch of circles moving around a plane at random velocities and colliding with one another.

To do this, pass some simulation parameters and initial state to the GPU, then run the computation pipeline, and finally read the results.

This is arguably the scariest part of WebGPU, because there is a bunch of data-related terminology and machinery to learn. However, the author believes it is precisely this data model that makes WebGPU a high-performance, device-independent API.

3.1. Binding group layout (GPUBindGroupLayout)

To exchange data with the GPU, the pipeline definition needs to be extended with a binding group layout (type GPUBindGroupLayout).

First, the binding group itself (type GPUBindGroup): it is the collection of resources, buffers, textures and samplers, that are made available to a pipeline while it executes on the GPU.

The binding group layout is defined before the binding group; it records metadata such as the types and purposes of these resources, so the GPU knows ahead of time what to expect, which lets it run more efficiently.

Here a binding group layout is created with a single entry, a buffer resource of storage type ( type: "storage" ), to keep things simple:

const bindGroupLayout = device.createBindGroupLayout({
  entries: [{
    binding: 1,
    visibility: GPUShaderStage.COMPUTE,
    buffer: {
      type: "storage",
    }
  }]
})

// Next, pass it to the pipeline
const pipeline = device.createComputePipeline({
  layout: device.createPipelineLayout({
    bindGroupLayouts: [bindGroupLayout]
  }),
  compute: {
    module,
    entryPoint: 'main'
  }
})

binding is set to 1 here; the value can be chosen freely (within reason), and its job is to tie this entry to the buffer variable with the same binding value in the WGSL code.

@group(0) @binding(1)
var<storage, write> output: array<f32>;

The type field is "storage", declaring this buffer as a storage buffer. Other options exist: "read-only-storage" declares a storage buffer the shader may only read, never write (which lets the driver skip some read/write synchronization work), and "uniform" declares a uniform buffer, which in this context works much like a storage buffer.
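For illustration, here is a hedged sketch of how a uniform buffer entry could sit next to a storage entry; the binding numbers and the WGSL declarations in the comment are our own example, not code from the article's demo:

const layoutWithUniform = device.createBindGroupLayout({
  entries: [
    {
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      buffer: { type: "uniform" }, // e.g. simulation parameters every invocation reads
    },
    {
      binding: 1,
      visibility: GPUShaderStage.COMPUTE,
      buffer: { type: "storage" }, // results the shader writes
    },
  ],
})

// The matching WGSL declarations would look roughly like:
//   @group(0) @binding(0) var<uniform> params: SimParams;
//   @group(0) @binding(1) var<storage, write> output: array<f32>;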

At this point the binding group layout has been created. The binding group itself could be created now, but we'll hold off for a moment; once the matching binding group and storage buffer exist, the GPU can start reading data.

Before that, however, there is one more concept to discuss: the staging buffer.

3.2. Staging Buffer

This section is a bit long, please read it with patience.

The author stresses it again: GPUs heavily optimize data throughput at the cost of latency. The GPU needs to be able to feed data to its cores very quickly. In a 2011 blog post, Fabian ran the numbers and concluded that a GPU needs to sustain about 3.3 GB/s just for the texture sampling of a 1280x720 computation.

To satisfy today's graphics workloads, GPUs have to be even faster. That is only achievable if the GPU's cores and their memory are very tightly integrated, which unfortunately means those memory regions can't easily be handed over to the CPU for reading and writing.

GPUs do also have memory, the familiar video memory, that is accessible by both CPU and GPU, but it is not tightly integrated with the GPU cores; it sits elsewhere on the board and is not nearly as fast.

Staging buffers sit in between: they can be mapped into the CPU's address space for reading and writing. To read data produced by the GPU, the data is first copied from a GPU-internal buffer into a staging buffer, the staging buffer is mapped to the CPU, and the data can then be read back into main memory. Getting data to the GPU works analogously.

Back to the code: create a writable buffer and add it to the binding group so the compute shader can write to it, and create a second buffer of the same size to serve as the staging buffer. Each buffer is created with a bitmask declaring its intended usage; the GPU decides where and how to allocate the buffer based on these flags, and throws an error if the combination violates WebGPU's rules:

const BUFFER_SIZE = 1000
const output = device.createBuffer({
  size: BUFFER_SIZE,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC
})
const stagingBuffer = device.createBuffer({
  size: BUFFER_SIZE,
  usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST
})

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [{
    binding: 1,
    resource: {
      buffer: output
    }
  }]
})

Note that createBuffer() returns a GPUBuffer, not an ArrayBuffer, and the buffer cannot be read or written immediately after creation. Reading and writing require separate API calls, and the buffer must have the GPUBufferUsage.MAP_READ or GPUBufferUsage.MAP_WRITE usage to be read or written.

TypeScript hint: while the WebGPU API hasn't landed in every development environment yet, install the @webgpu/types package maintained by the Chrome WebGPU team to get type hints in your project.

So now we have not only the binding group layout but also the binding group itself. The pass encoder code needs to be updated to use this binding group, and afterwards the computed values are read back from the buffer into JavaScript:

const commandEncoder = device.createCommandEncoder();
const passEncoder = commandEncoder.beginComputePass();
passEncoder.setPipeline(pipeline)
passEncoder.setBindGroup(0, bindGroup)
passEncoder.dispatch(Math.ceil(BUFFER_SIZE / 64))
passEncoder.end()
commandEncoder.copyBufferToBuffer(
  output,
  0, // source offset: where to start reading from
  stagingBuffer,
  0, // destination offset: where to start writing
  BUFFER_SIZE
)
const commands = commandEncoder.finish()
device.queue.submit([commands])

await stagingBuffer.mapAsync(
  GPUMapMode.READ,
  0, // offset: where to start reading
  BUFFER_SIZE // how many bytes to read
 )
const copyArrayBuffer = stagingBuffer.getMappedRange(0, BUFFER_SIZE)
const data = copyArrayBuffer.slice()
stagingBuffer.unmap()
console.log(new Float32Array(data))

Recall that the pipeline object was created with a pipeline layout listing this binding group layout, so dispatching will fail if no binding group is set during pass encoding.

After the compute pass is ended with end(), the command encoder records a buffer-to-buffer copy that moves the data from the output buffer into stagingBuffer, and finally the encoded command buffer is submitted to the queue.

The GPU works through the queue, and there is no way to predict when the computation will be done. However, the request to map stagingBuffer can be submitted asynchronously: when mapAsync resolves, stagingBuffer has been mapped, but JavaScript has not read anything yet. Calling stagingBuffer.getMappedRange() then returns an ArrayBuffer covering the requested chunk. This ArrayBuffer is backed by the mapped GPU memory, which means it becomes invalid the moment stagingBuffer is unmapped, so slice() is used to make a copy.

Obviously, you can see the output in the console:

emptybuffer.e9fb85f6.jpeg

Above: it works, though all it shows is a pile of zeros read back from GPU memory.

Producing something other than zeros would be more convincing. Before getting into any advanced computation, let's write some hand-crafted values into the buffer to prove that the compute pipeline really is running as expected:

@group(0) @binding(1)
var<storage, write> output: array<f32>;

@stage(compute) @workgroup_size(64)
fn main(

  @builtin(global_invocation_id)
  global_id : vec3<u32>,

  @builtin(local_invocation_id)
  local_id : vec3<u32>,

) {
  output[global_id.x] =
    f32(global_id.x) * 1000. + f32(local_id.x);
}

The first two lines declare a module-scoped variable named output, an array with element type f32. Its two attributes declare where the data comes from: @group(0) selects the first (index 0) binding group, and @binding(1) selects the resource whose binding value is 1 within that group. The output array has a dynamic length that automatically reflects the length of the bound buffer.

WGSL variables: a let declaration is immutable (as in Rust); if you want a mutable variable, declare it with var.

Now look at the main function. Its signature takes two parameters, global_id and local_id; the names are up to you, and their values come from the built-ins global_invocation_id and local_invocation_id, which are, respectively, the global x/y/z coordinates of this shader invocation within the workload, and its local x/y/z coordinates within its workgroup.

coordinates.beac9b11.jpeg

Above: three work items, a, b and c, labeled in green.

The figure uses a workgroup size of @workgroup_size(4, 4, 4); using the axis orientation shown in the figure, the work items a, b and c have the following coordinates:

  • a: local_id = (x=0, y=0, z=0), global_id = (x=0, y=0, z=0)
  • b: local_id = (x=0, y=0, z=0), global_id = (x=4, y=0, z=0)
  • c: local_id = (x=1, y=1, z=0), global_id = (x=5, y=5, z=0)

In our example, the workgroup size is @workgroup_size(64, 1, 1), so local_id.x ranges from 0 to 63. To be able to inspect both local_id and global_id, the author encodes the two values into a single number. Note that WGSL is strictly typed: local_id and global_id are both vec3<u32>, so their components have to be explicitly cast to f32 before being written to the output buffer.

fullbuffer.00fbc049.jpeg

Above: the actual values written by the GPU. Note how local_id wraps around after 63, while global_id keeps counting up.

This proves that the compute shader really does write values into the buffer. It is also easy to see that the values appear somewhat out of order; that is intentional, as the execution order is left up to the GPU.

3.3. Over-dispatching

You may have noticed that the dispatch count passed to the pass encoder, Math.ceil(BUFFER_SIZE / 64), multiplied by the workgroup size of 64 gives 1024:

passEncoder.dispatch(Math.ceil(BUFFER_SIZE / 64))

As a result, global_id.x in the shader runs up to 1023, i.e. there are 1024 invocations, which is more than the output array can hold (the buffer is only BUFFER_SIZE = 1000 bytes, i.e. 250 f32 values).

Luckily, WGSL has a mechanism to protect against out-of-range array accesses: an out-of-bounds write ends up writing to the last element of the array instead. That avoids illegal memory accesses, but it can still produce garbage data; for example, the last three elements of the Float32Array returned to JavaScript come out as 247055, 248056 and 608032. How do you avoid the invalid data that an out-of-bounds index could cause? With an early-return guard clause:

fn main( /* ... */ ) {
  if (global_id.x >= arrayLength(&output)) {
    return;
  }

  output[global_id.x] =
    f32(global_id.x) * 1000. + f32(local_id.x);
}

If the reader is interested, you can run this example to see the effect.

3.4. Troublesome structs (memory alignment issues)

Remember the goal? To move some circles around a 2D canvas and have them collide merrily.

Each circle therefore needs a radius, a position, and a velocity vector. You could keep using array<f32> and decide, say, that the first number is the x coordinate, the second the y coordinate, and so on.

That gets awkward quickly, though. WGSL lets you define custom structs that group several pieces of data together.

Note: if you already know what memory alignment is, you can skip this section. If you don't, the author won't explain it in depth either; he will simply show what it does and why it matters.

So, define a struct Ball representing a circle in 2D, and use array<Ball> to represent the list of them.

With structures, you have to discuss memory alignment issues.

struct Ball {
  radius: f32;
  position: vec2<f32>;
  velocity: vec2<f32>;
}

@group(0) @binding(1)
var<storage, write> output: array<Ball>;

@stage(compute) @workgroup_size(64)
fn main(
  @builtin(global_invocation_id) global_id: vec3<u32>,
  @builtin(local_invocation_id) local_id: vec3<u32>,
) {
  let num_balls = arrayLength(&output);
  if (global_id.x >= num_balls) {
    return;
  }
  
  output[global_id.x].radius = 999.;
  output[global_id.x].position = vec2<f32>(global_id.xy);
  output[global_id.x].velocity = vec2<f32>(local_id.xy);
}

You can run this code, open the console, and see:

alignment.2543c155.jpeg

Above: because of memory alignment, the TypedArray shows obvious padding in the data.

The shader writes 999.0 into each struct's radius field so the boundary between structs is easy to spot in the output. In the printed Float32Array, however, consecutive 999 values are 6 elements apart: elements 0 to 5 are 999, 0, 0, 0, 0, 0 and elements 6 to 11 are 999, 0, 1, 0, 1, 0. Each struct therefore occupies 6 numbers, even though a Ball clearly only needs 5: radius, position.x, position.y, velocity.x and velocity.y. There is an extra 0 after every radius. Why?

The reason is memory alignment. Every data type in WGSL strictly enforces the alignment requirement .

If a data type has an alignment of N (bytes), values of that type can only be stored at memory addresses that are multiples of N. For example, f32 has an alignment of 4 (N = 4), while vec2<f32> has an alignment of 8 (N = 8).

Assume a Ball struct starts at memory address 0. radius can be stored at address 0, since 0 is a multiple of 4. The next field, position, has type vec2<f32> with an alignment of 8, and here the problem appears: the next free address after radius is byte 4, which is not a multiple of 8. To satisfy the alignment, the compiler inserts 4 bytes of padding after radius, so the position field is stored starting at byte 8. This explains why the number after each 999 in the console is always 0.
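To make the resulting layout concrete, here is a small sketch of the byte offsets (the constant names are ours, not from the article):

// Byte layout of one Ball struct under WGSL alignment rules:
//   radius   : f32       -> offset  0 (size 4, align 4)
//   padding  : 4 bytes   -> offset  4 (so position can start at a multiple of 8)
//   position : vec2<f32> -> offset  8 (size 8, align 8)
//   velocity : vec2<f32> -> offset 16 (size 8, align 8)
// Total: 24 bytes, i.e. 6 f32 values per Ball.
const FLOATS_PER_BALL = 6
const BYTES_PER_BALL = FLOATS_PER_BALL * Float32Array.BYTES_PER_ELEMENT // 24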

Now that you know how the struct lays out its data in memory, the next step moves back to JavaScript.

3.5. Input and output

We've seen how to read data back from the GPU; now for the other direction: encode data in JavaScript by generating the initial state of all the balls, and upload it to the GPU so the compute shader can run and "animate" them. Initialization is straightforward:

let inputBalls = new Float32Array(new ArrayBuffer(BUFFER_SIZE))
for (let i = 0; i < NUM_BALLS; i++) {
  inputBalls[i * 6 + 0] = randomBetween(2, 10) // radius
  inputBalls[i * 6 + 1] = 0 // padding
  inputBalls[i * 6 + 2] = randomBetween(0, ctx.canvas.width) // x position
  inputBalls[i * 6 + 3] = randomBetween(0, ctx.canvas.height) // y position
  inputBalls[i * 6 + 4] = randomBetween(-100, 100) // x velocity component
  inputBalls[i * 6 + 5] = randomBetween(-100, 100) // y velocity component
}
Tip: if your programs later use more complex data structures, assembling these bytes by hand in JavaScript gets very tedious. Google's buffer-backed-object library can help create and read complex binary data layouts (a form of serialization).

Remember how buffers are exposed to the shader? If not, scroll back up. The compute pipeline's binding group layout just needs to be extended to accept the new buffer:

const bindGroupLayout = device.createBindGroupLayout({
  entries: [
    {
      binding: 0,
      visibility: GPUShaderStage.COMPUTE,
      buffer: {
        type: 'read-only-storage'
      }
    },
    {
      binding: 1,
      visibility: GPUShaderStage.COMPUTE,
      buffer: {
        type: 'storage'
      }
    }
  ]
})

Then create a new binding group that also passes in the initialized ball data:

const input = device.createBuffer({
  size: BUFFER_SIZE,
  usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_DST
})

const bindGroup = device.createBindGroup({
  layout: bindGroupLayout,
  entries: [
    {
      binding: 0,
      resource: {
        buffer: input // the initial ball data goes in here
      }
    },
    {
      binding: 1,
      resource: {
        buffer: output
      }
    }
  ]
})

Just as with reading data back, technically a mappable staging buffer would be needed as an intermediary to get the initialized ball data into the input buffer.

However, WebGPU provides a convenience API that writes data into the input buffer for us:

device.queue.writeBuffer(input, 0, inputBalls)

It's that simple: no command encoder is needed (and therefore no command buffer); writeBuffer() operates directly on the queue.

The device.queue object also provides some convenient APIs for manipulating textures.

Now, use a new variable in the shader code to bind to this new input buffer resource:

// ... Ball struct definition ...

@group(0) @binding(0)
var<storage, read> input: array<Ball>;

// ... definition of the output buffer ...

let TIME_STEP: f32 = 0.016;

@stage(compute) @workgroup_size(64)
fn main(
  @builtin(global_invocation_id)
  global_id: vec3<u32>
) {
  let num_balls = arrayLength(&output);
  if (global_id.x >= num_balls) {
    return;
  }
  
  // update position
  output[global_id.x].position = 
    input[global_id.x].position +
    input[global_id.x].velocity * TIME_STEP;
}

Hopefully you can understand most of the shader code.

The last step is to read the output buffer back into JavaScript once more and write some Canvas2D code to visualize the balls' motion (driven by requestAnimationFrame). You can see the result here: demo
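A minimal sketch of what such a visualization loop could look like; drawBalls and runSimulationStep are hypothetical helpers (runSimulationStep standing in for the submit/copy/mapAsync steps shown earlier), and ctx is the Canvas2D context:

function drawBalls(ctx, balls) { // balls: Float32Array, 6 floats per ball
  ctx.clearRect(0, 0, ctx.canvas.width, ctx.canvas.height)
  for (let i = 0; i < balls.length; i += 6) {
    const radius = balls[i + 0]
    const x = balls[i + 2]
    const y = balls[i + 3]
    ctx.beginPath()
    ctx.arc(x, y, radius, 0, 2 * Math.PI)
    ctx.stroke()
  }
}

async function frame() {
  const balls = await runSimulationStep() // submits the compute pass and reads back the output buffer
  drawBalls(ctx, balls)
  requestAnimationFrame(frame)
}
requestAnimationFrame(frame)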

4. Performance

The code at the end of section 3.5 just moves the balls along; there is no particularly heavy computation yet. Before looking at performance, let's add some proper physics to the shader.

The author does not intend to explain the physics; this post is long enough as it is. He only states the core of the effect: every ball performs collision detection against every other ball.

If you are curious, the final demo code is here: final-demo; the WGSL source also contains a link to the physics reference used.

The author optimized neither the physics collision algorithm nor the WebGPU code, and even so it ran well on his MacBook Air (M1 processor).

Above roughly 2500 balls, the frame rate drops below 60 fps. Profiling in Chrome DevTools shows, however, that the dropped frames are not WebGPU's fault but rather Canvas2D's insufficient rendering performance; drawing with WebGL or WebGPU would not have this problem.

performance.6042822a.jpeg

Above: even with 14,000 balls, WebGPU needs only about 16 milliseconds of compute time per frame on the M1 MacBook Air.

The author turned off the Canvas2D drawing and added performance.measure() calls to see how many balls he could simulate while staying within the 16 millisecond frame budget.
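One possible way to take such a measurement (a sketch, not necessarily how the author instrumented his demo; device and commands come from the earlier snippets, and GPUQueue.onSubmittedWorkDone() resolves once the GPU has finished the submitted work):

performance.mark("compute-start")
device.queue.submit([commands])
await device.queue.onSubmittedWorkDone()
performance.mark("compute-end")
performance.measure("compute pass", "compute-start", "compute-end")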

That is without any optimization at all, and the author is already quite intoxicated by this level of performance.

5. Stability and Availability

WebGPU has been in development for quite some time, and the author believes the spec authors are eager to declare the API stable.

Be that as it may, the WebGPU API currently only runs in Chromium-based browsers and Firefox. The author is optimistic about Safari, although at the time of writing there was not much to see in Safari Technology Preview yet.

As for stability: things were still changing even while this article was being written.

For example, the attribute syntax for WGSL shader code changed from double square brackets to @ notation:

[[stage(compute), workgroup_size(64)]]
↓
@stage(compute) @workgroup_size(64)

The pass encoder's ending method is still endPass() in Firefox, while Chromium-based browsers have already moved to the newer end().

There are also parts of the spec that are not yet implemented in all browsers, such as the API on mobile devices and some shader constants.

In short, until WebGPU reaches its stable stage, further significant changes cannot be ruled out.

Summary

A modern API for "talking directly to the GPU on the web" looks like a lot of fun. After the initial steep learning curve, the author believes you really can use JavaScript to harness the GPU for massively parallel computation.

wgpu is an implementation of the WebGPU API in Rust: you can program against the WebGPU spec from Rust, outside the browser. wgpu also compiles to WebAssembly, so you can even write Rust against wgpu, compile it to wasm, and run that high-performance code in the browser.

Another interesting project: Deno, which has built-in WebGPU support via wgpu.

If you have questions, you can ask in the WebGPU Matrix channel (which may be hard to reach from mainland China); it is frequented by WebGPU users, browser engineers, and spec authors.

Thanks to Brandon Jones for proofreading this article, and thanks to the WebGPU Matrix channel for clarification.

Thanks also to the original author for sharing this long article.
