New audio and video consumption scenarios keep generating new technical requirements. From today's live streaming, video on demand, and RTC to tomorrow's XR and the metaverse, audio and video technology supports new scenarios more and more comprehensively. AI algorithms have developed rapidly in recent years, but good algorithm results often require a great deal of computing power, which makes commercializing them very challenging. How can we bring out the full capability of software and hardware together? How can we effectively balance algorithm quality and performance?
At the LiveVideoStackCon 2021 Beijing summit, Yang Fenghai, senior algorithm expert at Alibaba Cloud Intelligence Video Cloud, started from Alibaba Cloud Video Cloud's exploration of the latest scenarios and shared its best innovative practices in virtual background and video super-resolution.
Text | Yang Fenghai
Compiled | LiveVideoStack
This talk has five parts: introduction, innovation and optimization at the algorithm level, deep optimization at the software and hardware level, future prospects, and Q&A.
1. Introduction
In terms of business form, audio and video covers live streaming, video on demand, RTC, media production and processing, cloud gaming, cloud desktops, and so on. In terms of the technical pipeline, it covers capture, encoding, transmission, cloud transcoding/processing/rendering, network distribution, and decoding and rendering at the receiving end. The parts that involve algorithms are pre-processing before encoding, post-processing after decoding, and video processing in the cloud.
Ranked by computing power, the cloud server is the strongest and the end side is relatively weak, so the situation can be roughly described as a normal distribution. Following that curve, most algorithms are currently deployed in the cloud, and a few are deployed on the end side.
Judging from how audio, video, and algorithms are deployed today, the whole software and hardware system is heterogeneous, covering cloud servers, edge servers, all kinds of terminal devices, and IoT devices.
In terms of hardware, it includes CPU+GPU, CPU+FPGA, CPU+DSP, and CPU+ASIC combinations. Many chip vendors are involved; the mainstream ones include Intel, AMD, ARM, Nvidia, Qualcomm, and Apple, and a wide range of operating systems is covered. In addition, software standards, development and compilation environments, and deep learning training and inference frameworks all differ considerably.
Such a complex heterogeneous software and hardware environment poses unprecedented challenges to deploying algorithms. How can we take full advantage of software and hardware working together? How can we effectively balance algorithm quality and performance? These are two issues that must be resolved.
2. Algorithmic innovation and optimization
Below I will use two algorithms, virtual background and super-resolution, to show how we balance performance and quality at the algorithm level and thereby create value for the business.
Virtual background
Let's first look at the background of the algorithm. Since the epidemic began last year, I believe everyone has experienced online video conferences or helped their children attend online classes. Those without children have surely watched live streams and short videos or followed online matchmaking events. In all of these scenarios, people want to place themselves in a virtual environment: it protects privacy on the one hand and makes the experience more fun and immersive on the other.
With that background in mind, what stands in the way of deploying it in a real business?
First, the scene can be very complex: lighting, noise, the setting (indoor, outdoor, single or multiple people, meeting rooms, office areas, venues, homes, and so on), unclear boundaries between foreground and background, hand-held objects, and highly varied clothing and accessories;
Second, very little data is available for training. Open-source datasets follow different labeling standards and their accuracy does not meet commercial requirements, while manual labeling is time-consuming and laborious;
Finally, in terms of computing performance, the cloud requires high concurrency and low latency, and the terminal requires low latency, low power consumption and low resource occupation.
This requires an algorithm that is very robust and can meet the performance requirements of different deployments. There are two families of pixel-level algorithms for separating out a portrait: segmentation and matting. Each has subfields, such as semantic segmentation, instance segmentation, video object segmentation (VOS), panoptic segmentation, blue/green-screen matting, and natural-scene matting.
Which one should we choose for deployment? Our target scenarios are currently mainly education, conferencing, and pan-entertainment. After a comprehensive evaluation of quality and performance, we concluded that semantic segmentation of portraits meets the business needs.
Having chosen the direction, the first thing to do is to stand on the shoulders of giants, which means understanding how algorithms in this field have developed over the years.
From the original FCN to the later SegNet, the U-Net series, the DeepLab series, HRNet, and so on, algorithm design has basically followed the encoder-decoder structure, trying different backbones and blocks to balance quality and performance, and then adding multi-branch designs, multi-resolution fusion, coarse-to-fine structures, various attention mechanisms, and so on.
From the perspective of publishing papers, many models are designed to be deeper (more layers), wider (more parallel branches, more channels, larger feature maps), more densely connected, with larger receptive fields and more global information. From the perspective of business deployment, however, such complex algorithms are hard to run in real time on end devices.
The greatest truths are the simplest. Our algorithm is designed from the start for deployment on different heterogeneous platforms, so we adopted the U-Net framework and integrated lightweight block design ideas from SqueezeNet, MobileNet, ShuffleNet, EfficientNet, GhostNet, and others.
In addition, we make full use of attention along the spatial and channel dimensions and fully fuse multi-resolution features while making sure computation is not slowed down. We differentiate the design for different hardware platforms, and finally design specific structures and loss functions for the business scenario, including a dedicated edge loss, online hard example mining, and so on.
Improving a neural network model is inseparable from scenarios and data. Before designing the algorithm, we first define the current business scenario, then build the dataset and train and iterate the algorithm on it. In turn, bad cases are collected from online business practice, the dataset is cleaned and expanded, and the model is fine-tuned again. Only when scenario, data, and algorithm are combined and iterated together can the result approach perfection.
Because the distribution of the dataset itself is limited, data augmentation is essential. With traditional cut-and-paste portrait compositing, the color temperature difference between foreground and background is large and the composite looks poor. Adding such data to training brings little overall benefit, and poor composites may even hurt. We therefore use dynamic white balance and pyramid fusion so that the composited foreground and background look more realistic.
Manual data collection is expensive, and it is hard to cover every body pose, environment, and outfit. As shown in the lower left corner, we use 3D-animated renderings of specific characters, actions, and scenes to expand the data. The augmentations on the right improve robustness to lighting, noise, and motion blur respectively. We compute a consistency loss between the network outputs for the original data and the augmented data to improve the robustness of the model.
No matter how well the algorithm is designed, some bad cases will inevitably occur in real business scenarios. For example, when a person sits at a table, the arm and the body may not appear as one connected region, and the previous model may mistake the arm for background. For this reason we developed a multi-task joint learning method that trains the model jointly on portrait segmentation, human keypoint detection, and human parsing.
At inference time the other tasks do not participate; they are only used during training to help the segmentation model extract and learn relevant information. This improves the model's quality without increasing its complexity.
In addition, most of you have probably tried a virtual background application. Whichever vendor you use, the edges are not handled very well, so we designed dedicated loss constraints for the edges, which significantly improves edge accuracy.
For final deployment, the model needs to be lightweight. Common methods include pruning, compression, distillation, quantization, and NAS. Here we take distillation as an example of how we make the model lightweight.
Let's first look at how knowledge distillation developed. In 2014, Hinton published the pioneering work Distilling the Knowledge in a Neural Network. Simply put, a complex teacher network and a lightweight student network are trained, and KL divergence is used to pull the student's output toward the teacher's output. Later follow-up work also discussed transferring the spatial attention of a classification network to the student network.
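The KL-divergence idea described above can be sketched in a few lines. This is a minimal illustration, not the production loss: the function names, the temperature value of 4.0, and the sample logits are all assumptions chosen for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits for one sample with three classes.
teacher = [6.0, 2.0, 1.0]
close_student = [5.5, 2.2, 1.1]
far_student = [1.0, 2.0, 6.0]
print("loss, student close to teacher:", kd_loss(close_student, teacher))
print("loss, student far from teacher:", kd_loss(far_student, teacher))
```

For a pixel-level task such as segmentation, the same loss would be evaluated per pixel over the class dimension and averaged.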
Distillation works relatively well on classification tasks, but on pixel-level tasks, whether segmentation, matting, or super-resolution, it is very hard to tune. Papers in recent years have discussed this issue. For example, Microsoft's paper Structured Knowledge Distillation for Semantic Segmentation proposes distilling the pairwise similarity between any two pixels in the feature map, and also uses pixel-wise KL-divergence distillation of the class predictions and GAN-based adversarial distillation.
In addition, some papers distill using the idea of focal loss: first compute the student network's loss, give greater weight to positions where the loss is large, and finally compute the distillation loss with these weights.
We combine pixel-wise KL-divergence distillation with GAN-based adversarial distillation.
This is the actual effect of the algorithm: simple background replacement, virtual classrooms, virtual meeting rooms, and so on, which add a little fun to dull meetings and classes while protecting privacy.
Video super-resolution algorithm
At present, video super-resolution consumes a great deal of power on end devices, so it is mainly used on the server side for ultra-high-definition scenarios such as upscaling 2K to 4K and 4K to 8K.
How can we run super-resolution on the end device? Our current deployment scenario is mainly RTC, which is extremely demanding in terms of latency, package size, power consumption, and resource usage. Based on these characteristics, Alibaba Cloud chose to apply on-device super-resolution in weak-network scenarios.
When the network is weak, the QoS strategy can lower the resolution, bit rate, and frame rate so that transmission stays smooth. At the playback end, the picture is reconstructed in high definition by the super-resolution algorithm. This both guarantees transmission quality under weak network conditions and gives users a good viewing experience.
Let's review how super-resolution models have developed in recent years. They fall into two categories: traditional algorithms and deep learning algorithms. The traditional algorithms are the well-known interpolation methods. Since SRCNN in 2014, new deep learning papers have appeared continuously, but few of them can actually run on end devices.
From these network structures we can summarize some design ideas. Most use residual, recursive, or dense structures; another family is based on GANs, which are harder to run on end devices. To extract the correlation between video frames, methods such as 3D convolution, deformable convolution, optical flow, and motion estimation can be used, but they also consume a lot of computing resources.
Super-resolution itself is an ill-posed problem. Whether going from low resolution to high resolution or from high to low, there is no definite mapping function, so it is very hard to learn. For video, using inter-frame information for alignment makes it hard to meet real-time and power requirements on end devices.
In the RTC scenario, the algorithm's quality must be guaranteed at low power consumption, and the requirements on package size, CPU usage, GPU usage, and heat generation are very strict. Moreover, even if mid- to high-end phones are covered, failing to cover the large share of mid- to low-end phones would lose some customers from a business and commercialization perspective. Our goal is therefore to cover all device models.
The figure roughly describes our super-resolution network structure. Algorithms are inseparable from real scenarios, and only scenario-specific algorithms deliver good business value. In RTC scenarios, the first things to consider are the loss caused by encoding compression and by downsampling.
When designing the model we took both losses into account and added a distortion repair module in the first half of the model. In the second half, the upsampled image is enhanced to approximate the ground truth. Both parts use attention structures to assist feature extraction.
Super-resolving portraits and ordinary pictures is relatively easy. In scenes with text and subtitles, however, high-frequency text information is easily lost during downsampling and easily damaged during encoding, so bad cases occur very readily. We made a series of optimizations for this problem.
First, we apply extensive data augmentation to text, covering fonts, colors, angles, and so on. In addition, introducing an edge-oriented EdgeLoss effectively improves super-resolution quality on text.
Lightweight design is always a concern in deployment. We design the network with structural re-parameterization, whose essence is to add parallel feature-extraction branches during training.
For example, where the network originally has a single 3x3 convolution for feature extraction, several other convolutions can be placed in parallel with it during training, and at inference time they are merged into one convolution through the re-parameterization formula. This adds a lot of computation during training but none at inference, while allowing more information and features to be extracted. With this structure we strike a good balance between algorithm quality and power consumption.
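The merging of parallel branches at inference time can be illustrated on a single channel. The sketch below assumes a hypothetical training-time block with a 3x3 branch, a 1x1 branch, and an identity branch; the branch set, the kernel values, and the function names are illustrative assumptions, and the batch-norm folding that real re-parameterized networks also perform is left out.

```python
def conv3x3_same(image, kernel):
    """Single-channel 3x3 cross-correlation with zero padding."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += image[yy][xx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out

def forward_train(image, k3, k1):
    """Training-time block: 3x3 branch + 1x1 branch + identity branch."""
    y3 = conv3x3_same(image, k3)
    h, w = len(image), len(image[0])
    return [[y3[y][x] + k1 * image[y][x] + image[y][x] for x in range(w)]
            for y in range(h)]

def merge_kernels(k3, k1):
    """Fold the 1x1 weight and the identity into the centre of the 3x3 kernel."""
    merged = [row[:] for row in k3]
    merged[1][1] += k1 + 1.0
    return merged

image = [[1.0, 2.0, 0.0], [0.5, -1.0, 3.0], [2.0, 1.5, -0.5]]
k3 = [[0.25, -0.5, 0.0], [1.0, 0.5, -0.25], [0.0, 0.75, 0.5]]
k1 = -0.5

multi_branch = forward_train(image, k3, k1)               # three branches
single_conv = conv3x3_same(image, merge_kernels(k3, k1))  # one convolution
print(multi_branch == single_conv or "equal up to rounding")
```

Because convolution is linear, the single merged kernel reproduces the multi-branch output, which is why the extra branches cost nothing at inference.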
Besides structural re-parameterization, sparse pruning can also make a model lightweight. With purely unstructured sparse connections, however, computation on CPU and GPU does not necessarily get faster. GPUs favor highly parallel, regular data. Unstructured sparsity appears to simplify parameters and computation by removing some connections, but in practice, channel alignment, non-contiguous memory access, and similar factors mean the latency does not necessarily drop.
Therefore most of the industry currently uses structured sparsity. The two figures on the left track the parameters associated with one convolution kernel: we plot the absolute values of the parameters over training time, and if they gradually tend to 0, the branch itself is very sparse. Connections with very small values can be pruned, but pruning must also take the connections to the preceding and following layers into account and remove the structure as a whole.
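As a rough sketch of pruning a structure as a whole, the example below removes entire filters from one layer and the matching input channels from the next. Ranking filters by L1 norm, the 50% keep ratio, and the toy weights are assumptions for illustration; the talk does not specify the criterion used in production.

```python
def prune_filters(conv1, conv2, keep_ratio=0.5):
    """Structured pruning of two consecutive layers.

    conv1: list of filters, each a flat list of weights (one per output channel).
    conv2: list of rows, each holding one weight per conv1 output channel.
    """
    norms = [sum(abs(w) for w in filt) for filt in conv1]
    n_keep = max(1, int(len(conv1) * keep_ratio))
    ranked = sorted(range(len(conv1)), key=lambda i: norms[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # indices of surviving filters, in order
    pruned1 = [conv1[i] for i in keep]
    # The next layer must drop the input channels fed by the removed filters.
    pruned2 = [[row[i] for i in keep] for row in conv2]
    return pruned1, pruned2, keep

# Toy weights: filters 1 and 3 have drifted close to zero during training.
conv1 = [[0.9, -1.1, 1.0], [0.01, -0.02, 0.0], [0.5, 0.7, -0.8], [0.0, 0.03, -0.01]]
conv2 = [[0.2, 0.4, -0.6, 0.1], [1.0, -0.3, 0.5, 0.2]]

pruned1, pruned2, kept = prune_filters(conv1, conv2)
print("kept filters:", kept)
print("next layer input channels:", len(pruned2[0]))
```

The result is a smaller but still dense pair of layers, so no sparse kernels or irregular memory access are needed at inference.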
The two figures here compare the super-resolution algorithm against baselines. The left figure shows that the super-resolution algorithm beats the traditional algorithms at every quality level, by a clear margin. For the right figure, we measured the approximate PSNR distribution of the super-resolution algorithm at different bit rates and frame rates. This distribution can in turn guide the QoS strategy in lowering bit rate and frame rate sensibly under different bandwidths.
This is the effect of the super-resolution algorithm in a live streaming scene.
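For readers unfamiliar with the metric, PSNR for 8-bit images is 10·log10(255² / MSE). The helper below is a minimal sketch on nested lists; the function name and the tiny sample frames are illustrative assumptions and are not taken from the talk's measurements.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, restored):
        for r, o in zip(ref_row, out_row):
            squared_error += (r - o) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical 2x3 luma patches: original and a reconstruction off by 1 everywhere.
original = [[52, 55, 61], [59, 79, 61]]
reconstructed = [[53, 54, 62], [58, 80, 60]]
print("PSNR (dB):", psnr(original, reconstructed))
```

Evaluating such a function over a grid of bit rates and frame rates is one way a distribution like the one in the right figure could be assembled.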
This is the effect of the super-resolution algorithm for the text scene.
3. Deep optimization at the software and hardware level
As mentioned at the beginning, there are many kinds of heterogeneous hardware, but in real business scenarios CPU and GPU optimization accounts for more than 90% of the work, so this part mainly uses the CPU and GPU as examples. By comparison, the CPU is better suited to complex, serial control logic, while the GPU, with its large number of ALUs, is better suited to parallel computation.
The figure briefly introduces the hardware and software architecture of the CPU. By overall design, CPU architectures fall into two types: complex instruction set (CISC) and reduced instruction set (RISC). CISC is represented by x86, and RISC by MIPS, ARM, and RISC-V. An important feature of the CPU is its three-level cache. When optimizing, you need to understand the hardware and software structure before choosing a suitable optimization method.
The figure describes the CPU calculation process. To compute, the CPU must first fetch the data. The virtual address is handed to the memory management unit, which translates it via the TLB. If the data is found in the cache, it can be fetched directly for computation, which is very efficient. If it is not found (cache miss), the data must be loaded over the bus from main memory or even from disk, which is much less efficient. In that case the impact on power consumption, performance, and latency is very large, so this deserves particular attention during optimization.
This leads to the optimization method of the CPU. It is roughly divided into code level, loop level, memory level, instruction level and thread level. The code level needs to minimize memory reads and writes, choose small data types as much as possible, and also needs structure alignment and branch optimization.
Common loop-level methods are loop unrolling, loop fusion, and loop splitting. The memory level mainly follows the principles of temporal locality and spatial locality, making sure that data read once is used by multiple compute instructions.
In addition, frequent memory allocation and release should be minimized, and memory access should be contiguous, aligned, and coalesced, using forced aligned loads, cache prefetching, memory reuse, and so on. At the instruction level, reduce data dependencies, optimize multiplication and division, make full use of register resources, use the instruction pipeline to hide memory access and instruction latency, and apply SIMD instruction optimization.
For SIMD instruction optimization, there is NEON on ARM and SSE/AVX on x86. Take a matrix computation A×B+C=D as an example of vectorization: with scalar instructions it requires many memory accesses and operations, whereas vector instructions greatly reduce the number of both.
In the CPU's instruction pipeline, executing an instruction goes through fetch, decode, execute, and write-back stages. The next instruction can be fetched as soon as the previous one has been fetched and starts decoding. This both maximizes CPU throughput and hides memory access and computation latency. Thread-level optimization includes multithreading over data blocks, multithreading over computation branches, and multithreading for asynchronous processing.
For compilation and assembly, you can rely on automatic compiler optimization, inline assembly, or hand-written assembly. As for CPU affinity, frequent migration between CPU cores causes frequent context switches, which hurts performance considerably, so you can selectively bind threads to cores to improve performance. Mobile phones have big and little cores, and threads should be bound to cores of the appropriate performance level as needed.
Let's take a look at the hardware and software architecture of the GPU. The main server GPU vendors are Nvidia and AMD. PC GPUs are mainly Intel's HD series and AMD's APU and Radeon series. The mainstream mobile GPUs are Qualcomm's Adreno series (in Snapdragon), ARM's Mali series, and the GPUs in Apple's A-series chips; Imagination's PowerVR series is currently used relatively little.
GPU software standards mainly include Microsoft's DirectX series; OpenGL, Vulkan, OpenGL ES, OpenCL, and others maintained by the Khronos Group; Apple's Metal; AMD's Mantle; and Intel's oneAPI.
Use this picture to briefly understand the hardware and software structure of the GPU. At the hardware level there is a host and a device: the host is generally the CPU side and the device is generally the GPU side. Here we take OpenCL and CUDA as examples to introduce the architecture at the software and hardware level.
At the hardware level, CUDA mainly involves streaming multiprocessors (SM), the many stream processors (SP) within them, registers, and so on, while OpenCL has compute units (CU) and processing elements (PE). In terms of memory, there is main memory on the CPU side and video memory on the GPU side. Video memory is further divided into global memory, constant memory, texture memory, local memory, shared memory, private memory, and so on. In terms of thread execution, CUDA has three levels (Grid, Block, and Thread), while OpenCL has WorkGroup and WorkItem.
After understanding the basic structure of the GPU, let's now introduce the optimization methods of the GPU.
At the code level, try to use the CPU for serial calculations, and try to use GPU for a large number of parallel calculations. Try to use asynchronous methods between CPU and GPU to reduce direct interaction, otherwise it will be limited by memory access and IO and affect performance. Large-scale sequential reading and writing of data, single-load multi-instruction calculation, use of vectorized load and store, reduction of division operations, low-bit quantization, and use of built-in functions can also be classified as code-level optimization methods.
Kernel level optimization methods include reasonable adjustment of thread grouping according to the kernel, hiding instruction execution delays with a large number of threads, and multiple trials to determine the optimal kernel.
For thread grouping in OpenCL, the system can decide the WorkGroup size automatically. As a rule of thumb, the WorkGroup size is usually set to a factor of the NDRange size or to a power of 2. The fallback is auto-tuning to search for the best grouping.
In CUDA, the configuration of the SM affects the number of concurrent thread blocks and warps it supports. In addition, since a warp generally has 32 threads, the number of threads per block is generally set to a multiple of 32.
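The multiple-of-32 rule can be captured in a small host-side helper that picks a launch configuration. This is a sketch in Python rather than CUDA: the function names and the default of 256 threads per block are assumptions, and the 1024-thread cap reflects the per-block limit of current CUDA GPUs.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024  # per-block limit on current CUDA GPUs

def round_up(value, multiple):
    """Smallest multiple of `multiple` that is >= value."""
    return -(-value // multiple) * multiple

def launch_config(n_elements, requested_block=256):
    """Return (grid, block) with block rounded up to a whole number of warps."""
    block = min(round_up(requested_block, WARP_SIZE), MAX_THREADS_PER_BLOCK)
    grid = -(-n_elements // block)  # ceiling division: enough blocks to cover all elements
    return grid, block

print(launch_config(1000, requested_block=100))
print(launch_config(1_000_000))
```

A block size that is not a multiple of 32 leaves the last warp partly idle, so rounding up wastes no scheduling slots; the kernel then only needs a bounds check for the surplus threads in the final block.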
Memory-level optimization includes using linear access and data structures with high locality, optimizing data layout to maximize cache utilization, maximizing spatial and temporal locality, coalescing memory accesses, and choosing appropriately among Zero Copy, Image, and Buffer.
In theory, Image objects perform better on Qualcomm's Snapdragon series and Buffer objects have the advantage on ARM's Mali series, but this is not absolute. You can run a trial in advance to pick the memory mode with the better inference performance. In addition, avoid bank conflicts, reduce off-chip memory access, and make full use of local memory and tiled data reuse.
Data-block reuse on the GPU is similar to that on the CPU. For the same computational complexity, appropriately enlarging the data blocks effectively reduces the number of memory accesses, which plays an important role in improving performance.
How can we design more lightweight operators based on these software and hardware characteristics? I have listed the following six points:
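The access pattern behind data-block reuse can be shown with a blocked matrix multiplication. The sketch below only illustrates the loop structure: in pure Python it will not run faster than the naive version, and the tile size of 2 and the function names are assumptions for the example.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A x B."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def matmul_tiled(a, b, tile=2):
    """Blocked C = A x B: each tile of A and B is reused while it is 'hot'."""
    n, inner, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, inner, tile):
                # Work on one (tile x tile) block at a time.
                for i in range(i0, min(i0 + tile, n)):
                    for p in range(p0, min(p0 + tile, inner)):
                        a_ip = a[i][p]  # loaded once, reused across the tile's columns
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c

a = [[1, 2, 3, 4, 5], [0, -1, 2, 1, 3], [4, 0, 1, -2, 2]]
b = [[1, 0, 2, 1], [3, 1, 0, -1], [0, 2, 1, 1], [2, -2, 1, 0], [1, 1, 0, 3]]
print(matmul_tiled(a, b) == matmul_naive(a, b))
```

On real hardware the tile would be sized to fit in cache (CPU) or local/shared memory (GPU), so that each loaded element serves several multiply-accumulates before it is evicted.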
First, a model with a small amount of calculation and a low degree of parallelism is suitable for CPU calculations.
Second, a large number of parallel models are suitable for GPU computing.
Third, reduce the use of operators with little or no computation but heavy memory access; they can be fused with other operators to reduce the number of memory accesses.
Fourth, avoid using operators that are not good for parallelism.
Fifth, the convolution kernel should not be too large. If necessary, several small kernels can be used instead. For example, one 5×5 kernel can be replaced by two 3×3 kernels, reducing the parameter count to 18/25 of the original; one 7×7 kernel can be replaced by three 3×3 kernels, reducing the parameter count to 27/49 of the original.
Sixth, the alignment of the number of channels should be adjusted to match the hardware and the inference framework. When MNN runs inference, data is generally organized in the NC4HW4 layout, while Huawei's HiAI generally uses 16-channel alignment. If you do not understand the underlying implementation of these frameworks, computing resources will be wasted.
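The arithmetic behind the fifth and sixth points can be checked directly. The sketch below assumes every layer keeps the same number of input and output channels and ignores biases; the function names and the sample channel counts (3 and 17) are illustrative.

```python
from fractions import Fraction

def stacked_params(kernel, layers):
    """Weights per (input channel, output channel) pair for stacked k x k convs."""
    return layers * kernel * kernel

def receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 k x k convolutions."""
    return 1 + layers * (kernel - 1)

def padded_channels(channels, align):
    """Channel count after padding up to the framework's alignment."""
    return -(-channels // align) * align

def wasted_fraction(channels, align):
    """Share of the padded channels that carry no real data."""
    padded = padded_channels(channels, align)
    return Fraction(padded - channels, padded)

# Fifth point: same receptive field, fewer parameters.
print(receptive_field(3, 2), Fraction(stacked_params(3, 2), stacked_params(5, 1)))
print(receptive_field(3, 3), Fraction(stacked_params(3, 3), stacked_params(7, 1)))

# Sixth point: a channel count just above the alignment wastes the most.
print(padded_channels(3, 4), wasted_fraction(3, 4))
print(padded_channels(17, 16), wasted_fraction(17, 16))
```

With 16-channel alignment, a layer with 17 channels is padded to 32, so nearly half of the computation runs on padding; choosing 16 or 32 channels in the model design avoids that waste.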
For deep learning, computational graph optimization is a core part. It includes node replacement, branch optimization, subgraph transformation, and operator fusion, of which operator fusion is used the most. On the right is an example of fusing convolution and BN. Mathematically, the formula shows that BN can be folded completely into the convolution, leaving a single convolution and thereby reducing memory access and computation at inference.
In the other example below, the multiple 1×1 convolutions can be merged, followed by the 3×3 and 5×5 convolutions. Since the final concat is a pure memory access operation, it can be merged with the preceding convolutions: the concat is completed at the same time, before the memory is released.
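The convolution and BN fusion follows from substituting the convolution output into the BN formula: with scale = gamma / sqrt(var + eps), the fused weights are w · scale and the fused bias is (b - mean) · scale + beta. The sketch below checks this on one input patch; the function names and all numeric values are made up for illustration.

```python
import math

def conv_patch(patch, weights, biases):
    """Convolution output at one spatial position: one dot product per filter."""
    return [sum(w * x for w, x in zip(filt, patch)) + b
            for filt, b in zip(weights, biases)]

def batch_norm(values, gamma, beta, mean, var, eps=1e-5):
    """Inference-time BN applied per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(values, gamma, beta, mean, var)]

def fuse_conv_bn(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold BN statistics into the convolution's weights and biases."""
    fused_w, fused_b = [], []
    for filt, b, g, bt, m, s in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(s + eps)
        fused_w.append([w * scale for w in filt])
        fused_b.append((b - m) * scale + bt)
    return fused_w, fused_b

# Two filters over a flattened three-element patch (hypothetical values).
weights = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.75]]
biases = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.25, 4.0]
patch = [1.0, 2.0, -1.0]

two_ops = batch_norm(conv_patch(patch, weights, biases), gamma, beta, mean, var)
fused_w, fused_b = fuse_conv_bn(weights, biases, gamma, beta, mean, var)
one_op = conv_patch(patch, fused_w, fused_b)
print(two_ops, one_op)
```

The fusion is done once offline, so at inference the BN layer disappears entirely along with its memory reads and writes.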
Convolution and matrix multiplication are the two most frequently used operators. Convolution can be implemented in many ways. It can be implemented directly with a sliding window and optimized through loop unrolling, data-block multithreading, and instruction-level parallelism. It can also be implemented through matrix multiplication, in which case im2col must be performed first.
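The two implementation routes can be compared on a small single-channel example. This is a minimal sketch with stride 1 and no padding; the function names and the sample image are assumptions for illustration.

```python
def conv_direct(image, kernel):
    """Sliding-window convolution (cross-correlation), stride 1, no padding."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(image[y + dy][x + dx] * kernel[dy][dx]
                 for dy in range(k) for dx in range(k))
             for x in range(out_w)] for y in range(out_h)]

def im2col(image, k):
    """Unfold every k x k window into one row of a matrix."""
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[image[y + dy][x + dx] for dy in range(k) for dx in range(k)]
            for y in range(out_h) for x in range(out_w)]

def conv_via_im2col(image, kernel):
    """Convolution as (im2col matrix) x (flattened kernel)."""
    k = len(kernel)
    flat_kernel = [v for row in kernel for v in row]
    columns = im2col(image, k)
    flat_out = [sum(a * b for a, b in zip(col, flat_kernel)) for col in columns]
    out_w = len(image[0]) - k + 1
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]

image = [[1, 2, 3, 0], [4, 5, 6, 1], [7, 8, 9, 2], [0, 1, 2, 3]]
kernel = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
print(conv_direct(image, kernel))
print(conv_via_im2col(image, kernel))
```

im2col duplicates overlapping pixels, trading extra memory for the ability to reuse a highly tuned matrix multiplication routine.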
FFT is used less now. It suits convolutions with large kernels; for small kernels, a direct implementation is faster than FFT. Winograd is one of the most widely used optimizations today. Its idea is simple: exploit the fact that addition costs fewer instruction cycles than multiplication, precompute the constant terms, and replace multiplications with additions so as to reduce the total amount of memory access and computation.
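The smallest Winograd case, F(2,3), produces two outputs of a 3-tap 1D convolution with 4 multiplications instead of the 6 a direct computation needs. The sketch below uses the standard F(2,3) formulas; the function names and sample values are illustrative.

```python
def transform_filter(g):
    """Precompute the filter-side terms once; they are constant at inference."""
    g0, g1, g2 = g
    return (g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2)

def winograd_f23(d, u):
    """Two outputs from four inputs using 4 multiplications."""
    d0, d1, d2, d3 = d
    m1 = (d0 - d2) * u[0]
    m2 = (d1 + d2) * u[1]
    m3 = (d2 - d1) * u[2]
    m4 = (d1 - d3) * u[3]
    return (m1 + m2 + m3, m2 - m3 - m4)

def direct_f23(d, g):
    """Same two outputs computed directly with 6 multiplications."""
    return (d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2])

signal = (1.0, 2.0, 3.0, 4.0)
taps = (0.5, -1.0, 2.0)
print(winograd_f23(signal, transform_filter(taps)))
print(direct_f23(signal, taps))
```

The 2D variant used for 3×3 convolutions nests this construction, and the filter transform is computed once per model rather than once per inference.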
On the right is Strassen's algorithm for matrix multiplication. It computes the matrix in blocks, eliminates some of the multiplications, and introduces intermediate matrices to further reduce the amount of computation. For a 2×2 block partition, it reduces the work to 7 multiplications and 18 additions.
These are optimizations at the algorithm level. Engineering optimization then also needs to cover loop optimization, data blocking and rearrangement, increasing data reuse, reducing cache misses, vectorization, converting floating point to fixed point, low-bit quantization, multithreading, and so on.
The overall optimization process is roughly as shown in the figure. First design a lightweight algorithm model, then optimize the software, including the CV filters used in pre- and post-processing and the inference engine, then adapt to the platform and operating system, and finally apply hardware acceleration.
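The 7 multiplications and 18 additions can be seen in the 2×2 base case. The sketch below works on scalars; in the full algorithm each entry is itself a matrix block and the scheme is applied recursively. The function names are illustrative.

```python
def strassen_2x2(a, b):
    """2x2 product with 7 multiplications and 18 additions/subtractions."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b
    # Seven products of sums (10 additions/subtractions here).
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    # Recombination (8 more additions/subtractions).
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

def naive_2x2(a, b):
    """Reference product with 8 multiplications and 4 additions."""
    return [[sum(a[i][p] * b[p][j] for p in range(2)) for j in range(2)]
            for i in range(2)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
print(strassen_2x2(a, b))
print(naive_2x2(a, b))
```

Saving one block multiplication per level is what lowers the asymptotic cost below cubic, at the price of more additions and extra temporary storage.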
Under the OpenCL framework, you first create many kernels and dispatch the computing tasks to the various hardware units. The execution of threads and instructions is managed by command queues, and the queue itself can be optimized. You can rely on OpenCL's own scheduling, or use the flush interface that OpenCL provides to add a manual, dynamic queue-flushing mechanism that improves performance.
Optimize the business pipeline and reduce data copies between host and device. Determine whether the performance bottleneck is memory access or computation before choosing an optimization strategy. Watch for the impact of CPU core switching, frequency throttling, and so on. Watch for resource preemption by business processes or other algorithms. Monitor system resource usage, power consumption, and the audio and video stall rate. When optimizing for heterogeneous devices, consider pre-tuning to select the best-performing computation method. Open-source tools can also be used for performance optimization.
The above is my sharing on algorithms and optimization methods at the software and hardware level. Let's look forward to the future together.
4. Future prospects
With the development of audio and video technology, the ways people interact keep evolving, and online and offline, virtual and real, will blend ever more closely. New ways of interacting will inevitably give rise to a new generation of real-time audio and video processing architecture that integrates cloud, edge, and end as well as software and hardware. They will also pose the ultimate challenge of lower latency, greater computing power, and lower power consumption for hardware and software development and algorithm optimization.
This is Alibaba Cloud Video Cloud's cloud-end integrated solution. Given the limited computing resources of end devices, algorithms with high demands on computing power and latency can be processed and rendered on the cloud GRTP engine. Once processing is complete, the end device only needs ordinary rendering, achieving "zero processing" on the end. On the right is a video demonstration of our cloud-end integrated real-time background replacement, virtual anchor construction, and voice beautification.
In the future, AI, AR, VR, XR, the metaverse, and so on will demand more and more computing power. Relying purely on algorithm and software optimization is inherently limited, so we must jointly promote the rapid development of hardware to truly raise the ceiling on computing power and performance. We believe the future requires deeper cloud-edge-end integration and software-hardware integration to further reduce costs and increase efficiency, so that algorithms can truly empower thousands of industries.
The above is my sharing this time, thank you all!