How to use GPU hardware layer acceleration to improve game smoothness on Android

Our app is a real-time VR game: it must continuously monitor the phone's orientation through the gravity sensor and render the VR image for the corresponding position. Because Android devices use chipsets and GPUs with different architectures, game performance varies from device to device. For example, a game may render at 60 fps on the Galaxy S20+ but perform very differently on the HUAWEI P50 Pro. Newer phones have strong hardware, but the game still has to take the underlying hardware of each device into account.

If players experience frame rate drops or slow loading times, they quickly lose interest in the game.
If the game drains the battery or overheats the device, we also lose players who play on long journeys.
If game assets are pre-rendered before they are needed, startup time increases considerably and players lose patience.
If the frame rate is more than the phone can sustain, the game may crash when the phone's self-protection mechanism kicks in, resulting in a very poor gaming experience.

Based on this, we need to optimize the code to adapt to the different frame rates of different mobile phones on the market.

Challenges encountered

First, we used Streamline to capture a profile of the game running on an Android device. While running the test scenario, we visualized CPU and GPU performance counter activity to understand exactly what workload the CPU and GPU were processing, and so locate the main cause of the frame rate drop.

The frame rate analysis chart below shows how the application runs over time.

In the figure below, we can see the correlation between execution engine cycles and the FPS drop. The GPU is evidently busy with arithmetic operations, and the shaders may be too complex.

To test the frame rate on different devices, we used U-Meng+ U-APM to check for freezes on different models. The freezes were found to occur during rendering in the onSurfaceCreated function. This is consistent with the previous analysis and suggests that the GPU is stalling during arithmetic operations:

Because different devices have different performance capabilities, each device needs its own performance budget. For example, if the maximum GPU frequency of a device is known and a target frame rate is given, the absolute upper limit of the GPU cost per frame can be calculated.

Mathematical formula: $ \text{GPU cost per frame} = \dfrac{\text{GPU maximum frequency}}{\text{target frame rate}} $
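As a worked illustration of this formula, the sketch below computes a per-frame cycle budget. The class name `FrameBudget`, the method `cyclesPerFrame`, and the 800 MHz figure are assumptions chosen for the example, not values measured on any of the devices discussed here.

```java
/** Sketch: per-frame GPU cycle budget from maximum GPU frequency and target frame rate. */
final class FrameBudget {

    /** Returns the upper bound on GPU cycles available for one frame. */
    static long cyclesPerFrame(long gpuMaxFrequencyHz, int targetFps) {
        if (gpuMaxFrequencyHz <= 0 || targetFps <= 0) {
            throw new IllegalArgumentException("frequency and frame rate must be positive");
        }
        // Integer division rounds down, which keeps the budget conservative.
        return gpuMaxFrequencyHz / targetFps;
    }

    public static void main(String[] args) {
        long hypotheticalGpuHz = 800_000_000L; // assumed 800 MHz GPU, for illustration only
        System.out.println("60 fps budget: " + cyclesPerFrame(hypotheticalGpuHz, 60));
        System.out.println("30 fps budget: " + cyclesPerFrame(hypotheticalGpuHz, 30));
    }
}
```

Under these assumed numbers, halving the target frame rate doubles the cycles each frame may spend, which is why the budget has to be set per device rather than once for all devices.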

The scheduling of work from the CPU to the GPU is subject to certain constraints, and because of these scheduling limits we could not reach the target frame rate.

In addition, because the workload is serialized at the CPU-GPU interface, rendering is performed asynchronously: the CPU places new rendering work in a queue, and the GPU processes it later.

`Data resource issues`

The CPU controls the rendering process and provides the latest data in real time, such as the transformations and light positions for each frame. However, GPU processing is asynchronous. This means that data resources are referenced by queued commands and stay in the command stream for a period of time. OpenGL ES requires rendering to reflect the state of the resources at the moment the draw call was made, so a resource cannot be modified until the GPU workload referencing it has completed.
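The hazard described above can be shown with a toy model. This is only a sketch in plain Java, not how an OpenGL ES driver is implemented: `CommandQueueSketch`, `DrawCommand`, and the method names are hypothetical, and a `float[]` stands in for a GPU resource.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of a CPU-side queue whose commands are consumed later by the "GPU". */
final class CommandQueueSketch {

    /** A queued draw command that remembers which data it must render. */
    static final class DrawCommand {
        final float[] lightPosition;

        DrawCommand(float[] lightPosition) {
            this.lightPosition = lightPosition;
        }
    }

    private final Queue<DrawCommand> pending = new ArrayDeque<>();

    /** CPU side: enqueue a draw that references the caller's array directly. */
    void submitByReference(float[] lightPosition) {
        pending.add(new DrawCommand(lightPosition));
    }

    /** CPU side: enqueue a draw with a private snapshot, at the cost of one copy. */
    void submitSnapshot(float[] lightPosition) {
        pending.add(new DrawCommand(lightPosition.clone()));
    }

    /** "GPU" side: runs later and returns the x coordinate it actually rendered with. */
    float processNext() {
        return pending.remove().lightPosition[0];
    }

    public static void main(String[] args) {
        CommandQueueSketch queue = new CommandQueueSketch();
        float[] light = {1f, 0f, 0f};

        queue.submitByReference(light);
        light[0] = 5f; // modified while the command is still queued
        System.out.println("by reference: " + queue.processNext());

        light[0] = 1f;
        queue.submitSnapshot(light);
        light[0] = 5f; // the queued command keeps its own copy
        System.out.println("snapshot:     " + queue.processNext());
    }
}
```

In this model the by-reference draw sees the later value, while the snapshot draw sees the value from submission time. Preserving the submission-time state is what forces either waiting for the GPU or copying the resource.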

`Debugging process`

We tried to edit and optimize the code that modifies the referenced resources, but attempting to modify such a resource triggers the creation of a new copy of it. This achieves our goal to some extent, but it generates a lot of CPU overhead.

So we used Streamline to identify instances of high CPU load. In the function view, the libGLES_Mali.so path inside the graphics driver shows an extremely high share of time.

Since we want to adapt to different frame rates on different phones, we need to find out whether libGLES_Mali.so takes a very high share of time on different device models. Here we used U-Meng+ U-APM to measure the proportion of time taken by this function on different models.

After U-Meng+ U-APM custom exception testing, the following models were found to show high libGLES_Mali.so usage. We therefore need to solve the smoothness problem at the level of the underlying hardware, and since more than one model is affected, we need to start from the memory level: consider how to use fewer memory buffers and release memory promptly.

`Solutions and optimization`

Based on the previous analysis, we first try to optimize from the buffer zone.

Single-buffer solution: use glMapBufferRange with GL_MAP_UNSYNCHRONIZED_BIT, then rotate through sub-regions of a single buffer. This avoids the need for multiple buffers, but it still has a problem: we have to manage the dependencies between sub-regions ourselves, and that code is extra work for us.
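The rotation through sub-regions can be sketched as plain offset arithmetic. The class `SubRegionRing` and its methods are hypothetical names for this example. The sketch only computes which byte range to map each frame. It makes no OpenGL ES calls and does not track whether the GPU has finished with a region, which is exactly the dependency management mentioned above.

```java
/** Sketch: rotate through equally sized sub-regions of one large buffer. */
final class SubRegionRing {
    private final int regionSizeBytes;
    private final int regionCount;
    private int nextRegion = 0;

    SubRegionRing(int regionSizeBytes, int regionCount) {
        if (regionSizeBytes <= 0 || regionCount <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        this.regionSizeBytes = regionSizeBytes;
        this.regionCount = regionCount;
    }

    /** Size to allocate once for the single backing buffer. */
    int totalSizeBytes() {
        return regionSizeBytes * regionCount;
    }

    /** Length of the range to map each frame. */
    int regionSizeBytes() {
        return regionSizeBytes;
    }

    /** Returns the byte offset to map for this frame, then advances to the next region. */
    int nextOffset() {
        int offset = nextRegion * regionSizeBytes;
        nextRegion = (nextRegion + 1) % regionCount; // wrap around after the last region
        return offset;
    }

    public static void main(String[] args) {
        SubRegionRing ring = new SubRegionRing(256, 3); // assumed sizes, for illustration
        for (int frame = 0; frame < 4; frame++) {
            System.out.println("frame " + frame + " maps offset " + ring.nextOffset());
        }
    }
}
```

In a real renderer, the returned offset and `regionSizeBytes()` would be the offset and length passed to glMapBufferRange, and the caller would still need some way to confirm that the GPU is done with a region before it is reused.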

Multi-buffer solution: we create multiple buffers and use them in rotation. By calculating a suitable number of buffers, the code can reuse these circular buffers in subsequent frames. Because we use a large number of circular buffers, a large amount of logging and database writing is needed. But several factors can cause poor performance here:

1. Additional memory usage and GC pressure are generated.
2. The Android operating system actually writes log messages to the system log (logcat) rather than to a file, which takes additional time.
3. With a single call the performance cost is minimal, but because circular buffers are used, many calls are needed.

We enabled memory allocation tracking in the Mono profiler (our code is C#) to locate the problem:

$ adb shell setprop debug.mono.profile log:calls,alloc

We can see that the method takes time every time it is called:

Method call summary
Total(ms)  Self(ms)  Calls  Method name
      782         5    100  MyApp.MainActivity:Log (string,object[])
      775         3    100  Android.Util.Log:Debug (string,string,object[])
      634        10    100  Android.Util.Log:Debug (string,string)

Our logging calls account for a lot of time here. The next step is either to make each individual call cheaper or to look for a completely different approach.

log:alloc also allows us to see memory allocation; log calls directly lead to a large number of unreasonable memory allocations:

Allocation summary
Bytes   Count  Average  Type name
41784     839       49  System.String
 4280     144       29  System.Object[]
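One common way to make a single log call cheaper is to avoid building the message at all when debug logging is off. The sketch below shows the idea in Java, while the profile above comes from Mono and C#. `LazyLog` is a hypothetical helper written for this example, not part of Android or Mono, and it is not the fix the profile itself prescribes.

```java
import java.util.function.Supplier;

/** Sketch: skip message construction entirely when debug logging is disabled. */
final class LazyLog {
    private final boolean debugEnabled;
    private int written = 0;

    LazyLog(boolean debugEnabled) {
        this.debugEnabled = debugEnabled;
    }

    /** Builds and writes the message only when debug logging is enabled. */
    void debug(Supplier<String> message) {
        if (!debugEnabled) {
            return; // the supplier is never invoked, so the message string is never built
        }
        System.out.println(message.get());
        written++;
    }

    /** Number of messages actually written so far. */
    int written() {
        return written;
    }

    public static void main(String[] args) {
        LazyLog log = new LazyLog(false); // assumed release configuration
        for (int i = 0; i < 100; i++) {
            final int frame = i;
            log.debug(() -> "buffer reused in frame " + frame);
        }
        System.out.println("messages written: " + log.written());
    }
}
```

With logging disabled, the 100 calls in `main` build no message strings. Whether that is enough depends on how many of the allocations in the summary above come from message formatting rather than from the logging API itself.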

`Hardware acceleration`

Finally, I introduced hardware acceleration, which provides a new drawing model for rendering the application to the screen. It introduces the DisplayList structure, which records a view's drawing commands to speed up rendering.

At the same time, you can render a View into an off-screen buffer and modify it as you like without worrying about it being referenced. This feature is mainly intended for animation, which makes it well suited to our frame rate problem, and it lets complex views be animated faster.

Without a layer, changing an animated property invalidates the animated view. For complex views, this invalidation propagates to all child views, which in turn redraw themselves.

With a hardware-backed view layer, the GPU creates a texture for the view, so we can animate complex views on screen and the animation is smoother.

Code example:

    // Using the Object animator
    view.setLayerType(View.LAYER_TYPE_HARDWARE, null);
    ObjectAnimator objectAnimator = ObjectAnimator.ofFloat(view, View.TRANSLATION_X, 20f);
    objectAnimator.addListener(new AnimatorListenerAdapter() {
        @Override
        public void onAnimationEnd(Animator animation) {
            view.setLayerType(View.LAYER_TYPE_NONE, null);
        }
    });
    objectAnimator.start();

    // Using the Property animator
    view.animate().translationX(20f).withLayer().start();

In addition, there are several points that still need to be noted when using the hardware layer:

(1) Clean up after use:

The hardware layer takes up memory on the GPU. In the ObjectAnimator code above, the listener removes the layer when the animation ends. In the Property animator example, the withLayer() method automatically creates the layer at the start of the animation and removes it at the end.

(2) Visualize hardware layer updates:

In the developer options, you can enable "Show hardware layers updates", which flashes a hardware layer green whenever it is updated. If you change the view after applying the hardware layer, the layer is invalidated and the view is re-rendered into the off-screen buffer.

`Hardware acceleration optimization`

This brings a problem, however: hardware layers are also used in interfaces that do not need the faster rendering, such as scrolling views. When a ViewPager is scrolled sideways, its pages are highlighted in green for the entire scrolling phase.

So while scrolling the ViewPager, I ran TraceView from DDMS, sorted the method calls by name, searched for "android/view/View.setLayerType", and then traced its callers:

ViewPager#enableLayers():

    private void enableLayers(boolean enable) {
        final int childCount = getChildCount();
        for (int i = 0; i < childCount; i++) {
            final int layerType = enable
                    ? ViewCompat.LAYER_TYPE_HARDWARE
                    : ViewCompat.LAYER_TYPE_NONE;
            ViewCompat.setLayerType(getChildAt(i), layerType, null);
        }
    }

This method is responsible for enabling or disabling the hardware layer for the children of the ViewPager. It is called from one place, ViewPager#setScrollState():

    private void setScrollState(int newState) {
        if (mScrollState == newState) {
            return;
        }

        mScrollState = newState;
        if (mPageTransformer != null) {
            enableLayers(newState != SCROLL_STATE_IDLE);
        }
        if (mOnPageChangeListener != null) {
            mOnPageChangeListener.onPageScrollStateChanged(newState);
        }
    }

As the code shows, when a PageTransformer is set, the hardware layer is disabled when the scroll state is IDLE and enabled otherwise (DRAGGING or SETTLING). PageTransformer aims to "use animation properties to apply custom transitions to page views" (Source).

For our needs, the hardware layer should be enabled only while an animation is being rendered, so I wanted to override these ViewPager methods. But since they are private, we cannot override them.

So I took another approach: in ViewPager#setScrollState(), after enableLayers() is called, OnPageChangeListener#onPageScrollStateChanged() is called as well. So I set up a listener that, whenever the ViewPager scroll state is anything other than IDLE, resets the layer type of all the ViewPager's children to NONE:

    @Override
    public void onPageScrollStateChanged(int scrollState) {
        // A small hack to remove the HW layer that the ViewPager adds to each page when scrolling.
        if (scrollState != ViewPager.SCROLL_STATE_IDLE) {
            final int childCount = <your_viewpager>.getChildCount();
            for (int i = 0; i < childCount; i++) {
                <your_viewpager>.getChildAt(i).setLayerType(View.LAYER_TYPE_NONE, null);
            }
        }
    }

In this way, after ViewPager#setScrollState() sets a hardware layer for each page, I reset the pages to NONE, which disables the hardware layer. The resulting frame rate difference is most visible on Nexus.

