Foreword
This article is based on a reorganized transcript of the talk given by veteran entrepreneur @Uncle Jian at a promotional event for the "RTE 2022 Innovative Programming Challenge".
Guest profile: Chen Jian (Uncle Jian) is a veteran AR/VR entrepreneur in China and one of the first domestic practitioners in the field of virtual space digitalization. He holds a government-issued digital city expert certificate, is among the first batch of digital simulation experts, and was one of the earliest people to propose the concept of the "interdimensional idol" and put it into practice.
01 The virtual world in science fiction works
Nowadays a large number of sci-fi works depict virtual worlds; "Ready Player One" and "Sword Art Online", for example, both portray fully immersive virtual worlds. You could say this kind of virtual world is the dream goal of everyone on the virtual-world track, of every RTE entrepreneur. Since these are science fiction, full realization is certainly impossible today, but does that mean we should just lie flat and do nothing? Of course not.
Today, let's look at how the virtual worlds everyone yearns for, as depicted in these works set decades or more in the future, can be approached with existing technology.
We chose "Ready Player One" as the prototype for this introduction, because the direct neural interface depicted in "Sword Art Online" is still too far away, while "Ready Player One" gives a very specific time frame: the story is set in 2045, with the product launched around 2030. That is relatively close to us, and there is a chance of reaching it with existing technology.
Let's analyze some of the technological elements in "Ready Player One", starting with the short-focus VR glasses. If you follow the VR industry, you will know many VR manufacturers; for example, the Pancake (ultra-short-focus optical) solution released by newcomer Netfish is similar to the film's short-focus VR glasses. KAT VR was the first to work on omnidirectional treadmills, and StepVR entered the field this year. That already covers the two major pieces of hardware; the rest includes full-body motion capture, somatosensory (haptic) simulation, and so on. Figure 1 sorts out the relevant technical elements.
■Figure 1
We will find that some of these technologies can already be realized, but the cost is very high and the experience falls short of what the movie depicts, so they cannot yet be put into practice.
02 How we do it
1. Criteria for screening science fiction concepts
To turn these sci-fi concepts into product ideas, we screened them against the criteria shown in Figure 2.
■Figure 2
The three points in the figure are critical. First, if the technology cannot be realized at least in an initial form, the product cannot exist at all. Second, take the depiction of hundreds of thousands of people in one scene: that is definitely impossible with current technology, so the question becomes whether it still holds up in the specific application scenario when scaled down to thousands or even hundreds of people. Finally, there are the VR glasses shown in the picture: high-spec short-focus VR glasses are still very expensive today; whether it is Meta's Oculus Quest 2 or the domestic Pico 3, they demand a certain level of consumer spending.
2. Our final choice
(1) User avatars (of a certain precision)
Following the screening criteria above, we determined our final direction. The first element is a user avatar of a certain precision. The avatars in "Ready Player One" are extremely detailed; today, if we used Unreal Engine 5 and a 3090 or even 4090 graphics card to render a single character, we could actually approach the effect in the film, but putting a certain number of such avatars into one scene and computing them in real time is basically impossible. So we target reasonable terminals, such as a PC with an ordinary graphics card (say a 750) or even a mobile phone, and choose an avatar that can run there to represent the user, as shown in Figure 3.
■Figure 3
On the left are two anime-style characters. In our tests, a domestic Android phone that cost about 2,000 yuan two years ago can run about 200 such characters at a sustained 30 to 40 frames per second. With the TV-headed figure on the right, it can run several hundred.
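To make this kind of trade-off concrete, here is a minimal TypeScript sketch of how a client might pick an avatar precision tier from a measured frame rate. The tier names, thresholds, and on-screen caps are illustrative assumptions, not figures from our product:

```typescript
// Hypothetical avatar-tier selection based on a measured frame rate.
// Tier names, thresholds, and caps are illustrative assumptions.
type AvatarTier = "high-precision" | "anime" | "tv-head";

interface TierBudget {
  tier: AvatarTier;
  maxOnScreen: number; // rough avatar cap for this device class
}

function pickTier(measuredFps: number, targetFps = 30): TierBudget {
  // Little headroom above target: fall back to the cheapest style.
  if (measuredFps < targetFps * 1.2) return { tier: "tv-head", maxOnScreen: 400 };
  // Moderate headroom: the anime-style models (~200 on a 2,000-yuan phone).
  if (measuredFps < targetFps * 2) return { tier: "anime", maxOnScreen: 200 };
  // Plenty of headroom (PC / cloud rendering): allow high-precision avatars.
  return { tier: "high-precision", maxOnScreen: 50 };
}

console.log(pickTier(45)); // mid-range phone → { tier: "anime", maxOnScreen: 200 }
```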
(2) Multiplayer real-time interaction (of a certain scale)
Multiplayer real-time interaction is more complicated. Besides the rendering-engine bottlenecks just mentioned, there are limits on the volume of interaction data: each user's status, voice, actions, expressions, and so on all add to the load, so there is a carrying limit here too. The number of users therefore has to be controlled within a certain range, but an operating scenario of roughly 100 people on the same screen, or in the same scene, is basically workable, as shown in Figure 4 (a rough bandwidth sketch follows the figure).
■Figure 4
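As a back-of-the-envelope illustration of why the headcount has to be capped (every number here is an assumption for the sketch, not a measured limit), the state-sync traffic each client receives grows linearly with user count and update rate:

```typescript
// Rough per-client downstream estimate for user-state synchronization.
// bytesPerUpdate and tickHz are illustrative assumptions.
function stateSyncKbps(userCount: number, bytesPerUpdate = 64, tickHz = 10): number {
  // Each client receives updates from every other user in the scene.
  const bytesPerSecond = (userCount - 1) * bytesPerUpdate * tickHz;
  return (bytesPerSecond * 8) / 1000;
}

console.log(stateSyncKbps(100));    // ≈ 507 kbps per client — workable
console.log(stateSyncKbps(10_000)); // ≈ 51,195 kbps per client — clearly not
```

Voice, actions, and expressions add further streams on top of this, which is part of why roughly 100 people in one scene is a practical ceiling.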
(3) User creation and transactions (within a certain range)
In "Ready Player One", users create and trade by themselves. At present, we will also develop editors so that users can build scenes or do interactive events of activities by themselves, but it is equivalent to a map editor similar to a game. The resource materials include the mall. Channels, as well as the conversion of mainstream 3D formats, so a certain range of user creation and transactions can be achieved, as shown in Figure 5.
■Figure 5
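As a tiny sketch of what such a pipeline gates on (the format whitelist and size cap below are assumptions for illustration, not our editor's actual rules), user uploads are typically validated before being converted to the engine's runtime format:

```typescript
// Hypothetical pre-conversion check for user-uploaded 3D assets.
const SUPPORTED_FORMATS = new Set(["glb", "gltf", "fbx", "obj"]); // mainstream 3D formats
const MAX_SIZE_BYTES = 50 * 1024 * 1024; // illustrative upload cap

function canIngest(fileName: string, sizeBytes: number): boolean {
  const ext = fileName.split(".").pop()?.toLowerCase() ?? "";
  return SUPPORTED_FORMATS.has(ext) && sizeBytes <= MAX_SIZE_BYTES;
}

console.log(canIngest("stage_prop.glb", 12_000_000)); // true → convert for the runtime
console.log(canIngest("avatar.blend", 8_000_000));    // false → unsupported format
```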
3. The application scenarios we choose
The three points above are mostly technical considerations; what application scenarios should we combine them with? There are many virtual applications on the market today, covering conferences, online offices, shopping, concerts, exhibitions, and so on. Investigating these scenarios one by one, we find they can all be analyzed along the three dimensions above. Precision concerns the user's avatar: if every user's avatar must be highly precise, users can be allowed to import relatively high-precision models for real-time rendering. Scenarios with many people on the same screen basically revolve around a stage with performers or a host, so the audience can be relatively de-emphasized and given lower-precision models; what these scenarios demand instead is a higher headcount for multiplayer real-time interaction.
Combining our company's and team's own strengths, we chose application scenarios around entertainment, youth culture, social interaction, and the "second dimension" (ACG). For entertainment scenes such as concerts, we can guarantee a certain scale, precision, and real-time interactivity, because we entered the virtual world from the virtual idol business. Beyond that, we approach it from youth-oriented scenarios such as e-sports, music, and ACG, moving them online in virtualized form, and we have built a large number of virtual spaces to support all kinds of social activities. In the productized scenes we actually operate, roughly 100 users share the screen; at that scale the precision of user avatars is basically satisfactory, and the baseline expectation is that enough resources are left over for the scene, the props, and even the stage. In addition, we have delivered many projects in the ACG field, and that accumulation of capability and industry resources was a very important basis for our choice.
On this basis we narrowed our focus further, choosing virtual idol performances and virtual comic conventions as the sharpest entry points. Besides being feasible at both the technical and application levels, they share one very important trait: the market is large enough, the usage frequency is high, and users accept them widely. So even if a product could serve larger scenarios or reach more people, it pays to concentrate on one or two breakthroughs first; that keeps the company's development more stable.
4. Technical selection
(1) Cross-terminal rendering pipeline
After settling on the entry point, there was still plenty of work to do, such as technical selection. A virtual world inevitably involves a 3D engine, and since our goal is cross-terminal operation, covering PC, tablet, mobile phone, and even VR, it also involves a cross-terminal rendering pipeline; our engineers did a great deal of work to solve cross-terminal rendering problems. In the end we chose the Unity engine's URP (Universal Render Pipeline): Unity's development threshold is relatively friendly, and much of the 3D UGC ecosystem is compatible with Unity. Weighing Unity's development threshold, our product's positioning, UGC friendliness, and performance optimization, we made this choice.
(2) Asset precision management
Beyond the rendering pipeline, asset precision also needs to be managed. As just mentioned, the question is usually not how many people a virtual event or scene can support, but how many can be displayed on the same screen. Today, rendering performance, meaning the engine's rendering capability plus the device's computing power, limits concurrent users far more than RTC does. Taking Shengwang (Agora) as an example, one of its audio channels can basically hold tens of thousands of people; by default 16 microphones can be open simultaneously, and we have been testing 128 simultaneously open microphones in a single channel. So RTC's online carrying capacity is not a big problem at present, and we think about carrying capacity mainly at the rendering level. That comes down to the device's computing power: on a PC, and especially on cloud-provided rendering nodes, computing power is entirely sufficient, yet even then it cannot support a large number of ultra-high-precision users online at once. I will use the four models shown in Figure 6 as an example.
■Figure 6
You will find that these four models support very different on-screen headcounts. One of the four assets is cartoonish in shape but has realistic textures. One is a paper-doll-style figure that we made look like a 3D model through cel-shading ("3D rendered as 2D") technology. There is also a higher-precision digital human, and a fusion style that is anime-flavored but higher precision.
Among these four asset styles, the two on the left can run on mobile phones, while the two on the right cannot. The first character's hair uses a shader we wrote ourselves, so with multiple such characters it cannot be sustained. The second is the cel-shaded style: it can run with multiple characters on the same screen, but the number is limited. To be concrete, we add an outline to the character, and anti-aliasing that outline costs a lot of performance. As for the two characters on the right, the effect is basically impossible to achieve on a mobile phone without heavy simplification. Therefore, beyond settling the algorithms, we also manage asset precision according to the application scenario. For performance scenes, we concentrate resources on the performers: we raise the performers' asset precision and keep the audience's under control. For social scenarios such as user chat rooms, where at most a dozen or so people are present, we distribute computing power and resources evenly across every user's avatar, so each avatar's assets are richer and more precise and the user experience is better.
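As a minimal sketch of this precision management (role names and budget numbers are illustrative assumptions, not our production values), think of it as allocating a fixed rendering budget by role and scene type:

```typescript
// Hypothetical per-scene asset-precision budgeting by role.
type Role = "performer" | "audience";
type SceneKind = "performance" | "chatroom";

interface AvatarSpec {
  role: Role;
  triangleBudget: number; // max triangles for this avatar's mesh
  customShaders: boolean; // expensive hair/outline shaders allowed?
}

function specFor(role: Role, scene: SceneKind): AvatarSpec {
  if (scene === "performance") {
    // Concentrate the budget on the few performers; cap the audience hard.
    return role === "performer"
      ? { role, triangleBudget: 100_000, customShaders: true }
      : { role, triangleBudget: 5_000, customShaders: false };
  }
  // Chat room with a dozen or so users: split the budget evenly.
  return { role, triangleBudget: 30_000, customShaders: true };
}
```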
(3) Real-time interactive data
The third, and most complex, item is real-time interaction data. Take a virtual-event application as an example; it contains at least the data shown in Figure 7.
■Figure 7
Analyzing the data dimensions of this real-time interaction, we finally split it into four kinds of data, as shown in Figure 8.
■Figure 8
The first is audio, such as voice interaction between users, for which we adopted Shengwang's RTC solution. Second, virtual worlds also contain virtual screens, such as the projection screen at a press conference or a concert. The video on them cannot simply be placed on the client, because real-time operation and data synchronization would be impossible; instead the video stream is distributed and pushed in real time. For this we also build on Shengwang, in what I call "RTV"; the term is my own coinage, and it belongs to the video side of RTC. The third is information, which covers the user's state. In real scenarios not every user can wear a capture device; most users are represented in the scene by state updates such as walking forward, turning left, applauding, or sending text messages, which we collectively call user status. Here we adopted Shengwang's RTM solution, which transmits state as a message queue and can stay synchronized with Shengwang's RTC. Finally, there is our own original layer, which we call structured data. At present, structured data focuses on the actions and expressions of virtual characters and is mainly used for performers. For a performance scene, take the stage in the upper right corner of Figure 9 as an example.
■Figure 9
Here the user is in a virtual three-dimensional space: they can walk their avatar freely around the stage and control the viewing angle in real time. Because this view is generated on the user's side, it cannot simply be video; the actions and expressions of the idol on stage must be transmitted so that the user's client can receive them and render in real time. Structured data maps the actions and expressions of the virtual character, or of a user's avatar, onto the avatars in the virtual world, while keeping multiple users in different locations synchronized, to achieve a convincing stage effect.
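To make the four data planes concrete, here is a minimal client-side sketch using Shengwang's (Agora's) Web SDKs for the audio/video and message planes, plus a hypothetical packet format for the structured-data plane. The app ID, channel name, and the StructuredFrame layout are placeholder assumptions; Shengwang has no structured-data product, so that part only illustrates the idea:

```typescript
import AgoraRTC from "agora-rtc-sdk-ng"; // audio + "RTV" video plane (RTC)
import AgoraRTM from "agora-rtm-sdk";    // user-status plane (RTM)

// Hypothetical structured-data frame for a performer's motion/expression.
interface StructuredFrame {
  avatarId: string;
  t: number;             // timestamp (ms), for multi-site synchronization
  bones: number[];       // compressed skeleton pose
  blendShapes: number[]; // facial-expression weights
}

async function joinEvent(appId: string, channel: string, uid: string) {
  // 1) Audio: publish the user's microphone into the RTC channel.
  const rtc = AgoraRTC.createClient({ mode: "rtc", codec: "vp8" });
  await rtc.join(appId, channel, null, uid);
  const mic = await AgoraRTC.createMicrophoneAudioTrack();
  await rtc.publish([mic]);

  // 2) Video ("RTV"): the virtual screen arrives as a remote video track.
  rtc.on("user-published", async (user, mediaType) => {
    await rtc.subscribe(user, mediaType);
    if (mediaType === "video") user.videoTrack?.play("virtual-screen");
    if (mediaType === "audio") user.audioTrack?.play();
  });

  // 3) Information: user status (walk, turn, applaud, chat) over RTM.
  const rtm = AgoraRTM.createInstance(appId);
  await rtm.login({ uid });
  const room = rtm.createChannel(channel);
  await room.join();
  await room.sendMessage({ text: JSON.stringify({ type: "status", action: "applaud" }) });

  // 4) Structured data: piggybacked on RTM text messages purely for
  //    illustration — the real transport and encoding would be custom.
  room.on("ChannelMessage", (msg) => {
    if (msg.messageType !== "TEXT" || !msg.text) return;
    const frame = JSON.parse(msg.text) as Partial<StructuredFrame>;
    if (frame.bones) { /* apply the pose to the local copy of that avatar */ }
  });
}
```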
So if we want to reproduce activities and behaviors in a virtual world, the data dimensions of interaction are very diverse. People define the metaverse in many ways, but however it is defined, I believe the metaverse must be a highly simulated virtual world with high-dimensional data interaction; that premise must hold.
03 Some suggestions
Next, I will share with you some small suggestions.
First, product design is the art of compromise. Building a product, especially a real-time interactive virtual product, runs into many technical limitations, and you must strike the best balance among technology, cost, and experience. The art of compromise is really an art of balance: we must decide what can be compromised and to what degree. That is exactly what product managers need to think about.
Second, use product design to break through technical limits. For example, when users run a virtual stage performance on their terminals, we generally display 100 characters on the same screen, also weighing visual density, power consumption and heat, and user experience; even using simplified models, we would still show only about 100 audience members in one scene. But what about a concert that needs thousands of viewers online at the same time? This calls for product design: users' real-time avatars are displayed on separate "lines". Since it is a performance scene, we designed the lines so that audiences on different lines cannot see each other but all see the stage content, which includes the performers and any audience members invited on stage. The per-line display is constrained by computing power and other objective factors, but audiences on different lines can still interact normally, including text and bullet-screen (danmaku) interaction, which works across lines. We combined the line concept with Shengwang's channel concept. Technology has limits in every era; product design can break through those limits so the product better serves the user experience and opens up more application scenarios.
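A minimal sketch of the line idea (the hashing scheme, channel naming, and per-line cap are illustrative assumptions, not Shengwang features):

```typescript
// Hypothetical line (shard) assignment for a large virtual concert.
const LINE_CAPACITY = 100; // avatars rendered per line, matching the on-screen cap

// Deterministically map a user to a line; stage content is shared by all lines.
function lineFor(uid: string, lineCount: number): number {
  let h = 0;
  for (const ch of uid) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h % lineCount;
}

// Channel layout: every client joins the shared stage channel (performers and
// invited guests) plus its own line channel (its ~100 visible neighbors).
// Cross-line text/danmaku fans out over a chat channel shared by all lines.
function channelsFor(uid: string, event: string, audience: number): string[] {
  const lineCount = Math.max(1, Math.ceil(audience / LINE_CAPACITY));
  return [`${event}-stage`, `${event}-chat`, `${event}-line-${lineFor(uid, lineCount)}`];
}

console.log(channelsFor("user-42", "concert", 5000)); // 50 lines for 5,000 viewers
```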
Finally, communicate often with your underlying technology partners and engage with the developer community. The version on Shengwang's official website currently supports 16 open microphones per channel, but the internal beta can support 128, and access to that beta came through our relationship with the technical partner. We have very good relationships with many technology partners, whether in RTC, engines, or capture solutions, so we often receive internal betas or invite-only test builds. That way you learn partners' future plans and target metrics, can apply them in product development ahead of time, and gain a first-mover advantage when the product officially launches.
04 Q&A session
1. What is the prospect of real-time interactive applications?
The prospects for real-time interaction are definitely huge. In the past, even viewing a picture online was a luxury, never mind video; interaction was basically text, or even conducted offline, which may sound like a fantasy now. Information interaction is bound to evolve toward higher and higher dimensions. Looking at the evolution of Internet information interaction, you will find two laws: first, content moves from one and two dimensions toward higher dimensions, on to three dimensions; second, the efficiency of interaction keeps rising. Take live streaming, which everyone knows best: today's live streaming is not true real-time interaction. Because of CDN delay, interaction latency is usually more than 3 seconds, and on a poor network the delay between streamer and viewer can exceed 10 seconds. Everyone naturally hopes the delay can shrink further, until a live broadcast has the immediacy of a video call with picture quality no different from today's streams. I believe this will be realized soon, because the industry, including Shengwang, is already testing lower-latency live-streaming solutions. The field of real-time interaction is very broad; if anything, the term is too modest. In my eyes, "real-time interaction" describes the characteristics of the next-generation Internet better than "metaverse" does.
2. What is the difference between the Metaverse and AR, VR, etc.?
There is a running joke online: the metaverse is a basket, and anything can be thrown into it. If we regard the metaverse as the next-generation Internet, a 3D Internet, or a 3D virtual world, then in my understanding AR and VR are critical components of the metaverse at the visual level, and may even become its main carriers. The 3D environments we run now are mostly presented to viewers through 2D screens; that is not a true 3D virtual world, because it has no sense of depth. In VR, by contrast, I can judge the other party's distance and position very accurately, and with hand capture even reach out and touch their ear. Running 3D virtual events on the 2D Internet is just a transitional stage. My view is that when the metaverse era truly arrives, or is about to, AR and VR will be the mainstream carriers. Even before VR becomes widespread, designing content and interaction around a 3D environment is a step that must be taken.
3. What technologies are mainly used in the metaverse scene?
The metaverse involves a great many technologies. First, pay attention to engines: the mainstream UE and Unity, domestic emerging engines, Cocos3D, WebGL 2.0, and 3D-related engines and development environments in general; you must understand them. There are also 2D metaverses like Gather.town, but I personally think that is just a transitional state: the threshold and computing demands of 3D are high today, and a 2D metaverse is more like a graphical chat room, which I think will hit bottlenecks. So I favor entering the business directly from 3D, because the 3D development environment is maturing and the supporting ecosystem keeps improving.
The second technology to master is RTC, because interaction is inevitable in any virtual world, so RTC is indispensable. Twelve years ago I built a product similar to Gather.town, implemented in Flash, with essentially the same shape as today's Gather.town, but it did not take off. That really had nothing to do with network speed or machine performance; my own analysis is that it failed for lack of RTC technology. Communication in the product I built then was still mainly text, which raised neither the efficiency nor the experience of interaction, and a text-based communication environment is out of step with a graphical, visual one. So besides 3D engine technology, the second driver of the metaverse's development is the real-time interaction technology represented by RTC, which determines the dimension and density of information interaction.
4. Which technologies of Shengwang can support virtual activities?
Mainly the four data dimensions I mentioned earlier: audio (RTC), video, information, and structured data. Although Shengwang has not launched a dedicated structured-data product, it has been exploring the transmission of higher-dimensional structured data, so these capabilities are genuinely helpful for virtual events. Shengwang is also launching peripheral capabilities worth following, for example its AI voiceprint technology, which enables real-time voice changing in chat rooms. "Real-time" here is relative, with a delay of a few hundred milliseconds, but in an online chat that is basically indistinguishable from real time.
In addition, on top of its current RTC and RTM technologies, Shengwang has released a number of "Meta" series solutions, such as meta-live-streaming, meta-voice-chat, meta-karaoke, and interactive games, all of which can be found on Shengwang's official website, along with more detailed technical material.
5. What do individuals need to pay attention to when implementing virtual activities?
Developing real-time virtual events as an individual developer is somewhat stressful and difficult, but certainly not impossible. For those who want to attempt solo development, my suggestions are as follows. First, do not attempt overly complex data interaction; get the most basic parts, audio and user status, working first. Second, use ready-made art assets as much as possible. For independent game developers, the biggest headache is art assets; we are not encouraging piracy, but if the project is non-commercial you can look for freely available art assets online. Third, be clear whether the product is for yourself, a technical challenge, or just a graduation project. If you want to develop virtual events commercially, choose the application scenario very precisely, because more than 100 teams have already entered the virtual-event space. Against that many teams, pick a precise scenario; do not be afraid that it is small. Best of all is a scenario so small that other teams are unwilling to touch it.
About "RTE 2022 Innovative Coding Challenge"
The RTE (Real-Time Engagement) Innovative Programming Challenge is an annual online hackathon that Shengwang has held since 2019 for RTC (Real-Time Communication) developers, programming enthusiasts, and geeks around the world.
This year's competition has two tracks. Track 1 continues the classic topic "Shengwang SDK Application Development". Track 2 introduces a new topic, "Scenario-based Whiteboard Plug-in Application Development", which gives developers a more focused problem direction and explores the boundary between scenario applications and technical capabilities.