The improvements in PyTorch Profiler v1.9 focus on the execution steps that are most expensive in runtime and/or memory, and on visualizing the workload distribution between GPU and CPU.
Profiler v1.9 adds five major features:
1. Distributed training view: This helps you understand the time and memory consumed in distributed training tasks. Suppose you have a training job: when you split the load across worker nodes to run in parallel, all kinds of problems can appear, and the whole process can behave like a black box. The overall goal of distributing the work is to increase training speed, and this view helps you diagnose and debug problems within a single node.
2. Memory view: With this view, you can better understand memory usage. The tool shows the program's active memory allocations at different stages of execution, helping you avoid out-of-memory errors.
3. GPU utilization visualization: This tool helps you check whether the GPU is being fully utilized.
4. Cloud storage support: The TensorBoard plug-in can now read profiling data from Azure Blob Storage, Amazon S3, and Google Cloud Storage.
5. Jump to source code: This feature visualizes stack trace information and lets you jump directly to the source code, so you can quickly optimize and iterate on the code based on the profiling results.
Colab content at a glance:
- Prepare data and model
- Use Profiler to record execution events
- Run Profiler
- Use TensorBoard to view results and analyze model performance
- Improve performance with Profiler
- Use other advanced features to analyze performance
Start using the PyTorch profiling tool
First, install the plug-in:
$ pip install torch-tb-profiler
Then import the profiler:
import torch.profiler as profiler
and wrap the code you want to profile in a with profiler.profile(...) block.
Note: for details on CUDA and CPU profiling, see Here.
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA],
) as prof:
    # code to be profiled goes here
- profiler.record_function("$NAME"): lets you label an arbitrary block of code so that it appears under that name in the profiling results.
- The profile_memory=True argument of profiler.profile enables analysis of CPU and GPU memory usage (see the sketch below).
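Putting these pieces together, a minimal sketch might look like the following; the model, the input batch, and the ./log output directory are placeholders, and CUDA events are recorded only when a GPU is available:

```python
import torch
import torch.profiler as profiler

model = torch.nn.Linear(128, 10)   # placeholder model
inputs = torch.randn(32, 128)      # placeholder input batch

with profiler.profile(
    activities=[profiler.ProfilerActivity.CPU,
                profiler.ProfilerActivity.CUDA],       # CUDA events are recorded only if a GPU is present
    profile_memory=True,                               # also record CPU/GPU memory usage
    on_trace_ready=profiler.tensorboard_trace_handler("./log"),  # write a trace the TensorBoard plug-in can read
) as prof:
    with profiler.record_function("forward_pass"):     # label this block in the results
        model(inputs)
```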
Visualize PyTorch model performance
### Distributed training
Recent advances in deep learning demonstrate the value of large datasets and large models, which also means that model training requires more computing resources.
DistributedDataParallel (DDP) with the NVIDIA Collective Communications Library (NCCL) backend is a widely adopted paradigm in PyTorch for accelerating deep learning training.
This version of PyTorch Profiler now supports DDP with the NCCL backend.
### Computation/communication overview
In the "Computation/Communication Overview" in the distributed training view, users can observe the calculation and communication ratios of the "load balancer" nodes among all Workers, which are measured in terms of granularity.
Load balancer related link: Here
Scenario 1:
If the computation plus overlap time of one worker is much longer than that of the other workers, it may indicate a workload-balancing problem, or that this node is a straggler. Computation is the sum of GPU kernel time minus the overlap time. Overlap time is the time saved by interleaving communication with computation.
The longer the overlap time, the better the parallelism between computation and communication; ideally, computation and communication overlap completely. Communication is the total communication time minus the overlap time.
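A quick numeric illustration of these definitions, using made-up per-step timings:

```python
# Hypothetical per-step timings for one worker (milliseconds).
gpu_kernel_time = 80.0   # sum of GPU kernel time
total_comm_time = 50.0   # total communication time
overlap_time = 30.0      # communication interleaved with computation

computation = gpu_kernel_time - overlap_time     # 50.0 ms
communication = total_comm_time - overlap_time   # 20.0 ms
```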
The following example shows how this situation appears on TensorBoard.
straggler example
Scenario 2:
If the batch size is small (that is, each worker does relatively little computation) or the data to be transmitted is large, the computation-to-communication ratio can also be small. In the Profiler, you will see low GPU utilization and long waiting times.
Based on this computation/communication view, users can review their code and reduce communication by using gradient accumulation (see the sketch below), or reduce the communication proportion by increasing the batch size. DDP communication time depends on the model size, while the batch size is independent of the model size; therefore, increasing the batch size lengthens the computation time and raises the computation-to-communication ratio.
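A minimal sketch of the gradient-accumulation remedy, assuming an already-constructed DDP model, data loader, loss function, and optimizer (the four arguments below are placeholders); no_sync() is DDP's context manager for skipping the gradient all-reduce:

```python
import contextlib

def train_with_grad_accumulation(ddp_model, loader, criterion, optimizer, accum_steps=4):
    """Synchronize gradients only every `accum_steps` micro-batches,
    so DDP's all-reduce communication happens less often."""
    optimizer.zero_grad()
    for step, (data, target) in enumerate(loader):
        is_sync_step = (step + 1) % accum_steps == 0
        # For intermediate micro-batches, skip DDP's gradient all-reduce.
        ctx = contextlib.nullcontext() if is_sync_step else ddp_model.no_sync()
        with ctx:
            loss = criterion(ddp_model(data), target) / accum_steps
            loss.backward()
        if is_sync_step:
            optimizer.step()
            optimizer.zero_grad()
```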
### Sync/communication overview
In the synchronization/communication view, users can observe communication efficiency. The remaining "other" time is obtained by subtracting computation and communication time from the step time; it includes initialization, the data loader, CPU computation, and so on. Synchronization time is the part of the total communication time that is spent waiting for and synchronizing with other workers.
From this view, we can see how much of the total communication time is really used to exchange data, and how much is idle time spent waiting for data from other workers.
For example, if there is an inefficient workload balance or a straggler problem, it can be spotted in the synchronization/communication view: this view will show that some workers wait longer than others.
The table above gives detailed statistics for all communication operators on each node. From it, you can see which operator types are called, how many times each operator is called, the size of the data transmitted by each operator, and so on.
### Memory View
Using this tool, you can understand the hardware resource consumption of the operators in your model. Understanding time and memory consumption at the operator level helps you resolve performance bottlenecks and speed up your model. Since GPU memory is limited, improving memory-usage efficiency helps to:
- Run larger models, which can perform better on end-level tasks.
- Use larger batch sizes, which can improve training speed.
Profiler records all memory allocations during the profiler interval. Select "Device" to see the memory usage details of each operator on the GPU side or the host side.
Note: profile_memory=True must be set to generate the memory data below.
Related links: Here
with torch.profiler.profile(
    profile_memory=True,  # note: this may take 1–2 minutes to complete
) as prof:
    # code to be profiled goes here
Key definitions:
- "Size Increase" displays the sum of all allocated bytes, minus all memory released bytes.
- "Allocation Size" shows the sum of all allocated bytes excluding memory release.
- "Self" means that the allocated memory does not come from any child operator, but is allocated by the operator itself.
### GPU metrics on the timeline
With this feature, you can easily debug performance issues when one or more GPUs are under-utilized. Ideally, your program should have high GPU utilization (as close to 100% as possible), minimal CPU-to-GPU communication cost, and no overhead.
Overview: The overview page highlights three important GPU usage metrics (GPU Utilization, Est. SM Efficiency, and Est. Achieved Occupancy) at different levels of detail.
Essentially, each GPU has many SMs (streaming multiprocessors), and each SM has many warps that can execute many threads concurrently; how many threads a warp executes depends on the GPU. From a higher-level perspective, the GPU metrics on the timeline help developers get a global view of the whole stack, which is very important.
If the GPU utilization is very low, it indicates a potential problem with the model. The common reasons are as follows:
- Insufficient parallelism in the kernel, that is, the batch size is too small
- Small kernels called in a loop, i.e., the launch overhead is not amortized
- CPU or I/O bottlenecks that leave the GPU without enough work, resulting in low GPU utilization
The performance recommendations on the overview page are actionable suggestions for improving GPU utilization. In this example, GPU utilization is very low, so the recommendation is to increase the batch size. Following the recommendation and increasing the batch size from 4 to 32 raised GPU utilization by 60.68%.
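A minimal sketch of acting on that recommendation with a standard DataLoader (the dataset below is synthetic placeholder data):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 3, 32, 32), torch.randint(0, 10, (256,)))  # placeholder data

# loader = DataLoader(dataset, batch_size=4)   # before: too little work per step, low GPU utilization
loader = DataLoader(dataset, batch_size=32)    # after: follow the recommendation and give the GPU more work
```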
GPU Utilization: the fraction of the step interval time during which a GPU engine was executing a workload. The higher the utilization percentage, the better. However, judging performance bottlenecks by GPU utilization alone is not accurate: this metric cannot tell you how many streaming multiprocessors are in use.
Note that although this metric is very helpful for detecting idle periods, a high value does not necessarily mean the GPU is being used effectively. For example, a kernel running continuously with a single thread will show 100% GPU utilization.
Estimated SM Efficiency (Est. SM Efficiency) is a finer-grained metric. It represents the percentage of SMs in use over the trace, i.e., the percentage of time during which there is at least one active warp on an SM, as opposed to SMs whose warps are all idle.
NVIDIA document: Here
Est. SM Efficiency also has limitations. For example, a kernel with only one thread per block cannot fully utilize every SM. SM Efficiency alone does not tell us how busy each SM is, only that it is doing something at all, which may include stalling while waiting for the result of a memory load.
To keep an SM at a high level of utilization, a sufficient number of ready warps must be available so that another warp can run whenever one stalls.
For diagnosing performance issues, Estimated Achieved Occupancy (Est. Achieved Occupancy) goes a level deeper than Est. SM Efficiency and GPU Utilization. It indicates how many warps can be active at once on each SM. Having a sufficient number of active warps is usually the key to achieving good throughput. Unlike GPU Utilization and SM Efficiency, making this value as high as possible is not the ultimate goal.
Empirically, good throughput gains can be obtained by raising this metric to 15% or above, but at some point diminishing returns set in; for example, once the value reaches 30%, further gains become uncertain. This metric reports the average across all warp schedulers for the duration of the kernel execution.
NVIDIA document: Here
The larger the value of Est. Achieved Occupancy, the better.
Details: Resnet50_batchsize4
Details: Resnet50_batchsize32
Kernel view: each kernel has "Blocks per SM" and "Est. Achieved Occupancy" columns.
Est. Achieved Occupancy is a useful tool for comparing the running status of models.
Mean Blocks per SM:
Blocks per SM = number of blocks of the kernel / number of SMs of the GPU. If this number is less than 1, the GPU's multiprocessors are not fully utilized. "Mean Blocks per SM" is the weighted average over all runs of this kernel name, using the duration of each run as the weight.
Mean Est. Achieved Occupancy:
The definition of Est. Achieved Occupancy is the same as that outlined above. Mean Est. Achieved Occupancy is the weighted average of all runs of this kernel name, using the duration of each run as the weight.
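A small sketch of how such duration-weighted means can be computed from per-run records (the run list below is made-up data):

```python
# Hypothetical runs of one kernel name: (duration_us, blocks_per_sm, est_achieved_occupancy_percent)
runs = [
    (120.0, 0.8, 25.0),
    (300.0, 1.6, 40.0),
]

total_duration = sum(duration for duration, _, _ in runs)
mean_blocks_per_sm = sum(duration * blocks for duration, blocks, _ in runs) / total_duration
mean_occupancy = sum(duration * occ for duration, _, occ in runs) / total_duration
```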
Trace view:
The trace view shows a timeline of the duration of the operators in the model and which system performed each operation. This view can help you determine whether high cost and long execution time are caused by the input pipeline or by model training. Currently, the trace view can display GPU Utilization and Est. SM Efficiency on the timeline.
In the example above, the GPU utilization during "ProfilerStep5" on thread 28022 is higher than that during "Optimizer.step". You can zoom in to see why.
As can be seen from the figure above, the former has longer kernels than the latter; the latter's kernels execute for too short a time, which reduces GPU utilization.
Est. SM Efficiency: each kernel has a calculated Est. SM Efficiency between 0 and 100%. For example, if the kernel below has only 64 blocks and the GPU has 80 SMs, then its Est. SM Efficiency is 64/80, i.e., 0.8.
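A quick check of that arithmetic (the block and SM counts are the ones from the example above):

```python
kernel_blocks = 64   # blocks launched by the kernel
gpu_sms = 80         # SMs on the example GPU

est_sm_efficiency = min(kernel_blocks / gpu_sms, 1.0)  # 0.8, i.e. 80%
```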
### Cloud storage support
After running pip install tensorboard, install the corresponding extra in order to read profiling data stored with a cloud provider:
pip install torch-tb-profiler[blob]   # Azure Blob Storage
pip install torch-tb-profiler[gs]     # Google Cloud Storage
pip install torch-tb-profiler[s3]     # Amazon S3
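Once the extra is installed, TensorBoard can be pointed at a cloud location via its --logdir option; the bucket and path below are placeholders, and the exact URL scheme for each provider should be checked against the plug-in's documentation:
tensorboard --logdir=s3://my-bucket/my-profiler-logs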
For more information, please refer to: Here
### Jump to source code
One of the major benefits of integrating TensorBoard and PyTorch Profiler directly into Visual Studio Code (VS Code) is that you can jump directly from the profiler's stack trace to the source code (files and lines). The VS Code Python extension now supports TensorBoard integration.
Jump to source code is only available when TensorBoard is launched inside VS Code. If you profile with with_stack=True, stack traces appear in the plug-in UI. Clicking a stack trace in PyTorch Profiler makes VS Code open the corresponding file and jump directly to the relevant code for debugging. In this way, the code can be quickly optimized and modified based on the profiling results and suggestions (see the sketch below).
Using the Visual Studio Code plug-in UI to jump to the source code
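A minimal sketch of enabling stack recording (the model, the input batch, and the ./log output directory are placeholders):

```python
import torch
import torch.profiler as profiler

model = torch.nn.Linear(128, 10)   # placeholder model
inputs = torch.randn(32, 128)      # placeholder input batch

with profiler.profile(
    with_stack=True,  # record the source file and line for each operator
    on_trace_ready=profiler.tensorboard_trace_handler("./log"),
) as prof:
    model(inputs)
```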
For how to optimize batch size performance, please check the detailed tutorial: Here
PyTorch Profiler can also be integrated with PyTorch Lightning: just start your Lightning training job with trainer.profiler=pytorch to generate the trace (a sketch follows below).
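A minimal sketch of the equivalent in code, assuming the pytorch_lightning package is installed; the ToyModel, the synthetic dataset, and max_epochs=1 are all placeholders:

```python
import torch
import pytorch_lightning as pl

class ToyModel(pl.LightningModule):
    # Minimal placeholder LightningModule for illustration.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

dataset = torch.utils.data.TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# profiler="pytorch" asks Lightning to wrap training in the PyTorch Profiler.
trainer = pl.Trainer(profiler="pytorch", max_epochs=1)
trainer.fit(ToyModel(), loader)
```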
Detailed example: Here
Original address: Here