Developer Practice丨The Future of Agora Home AI Audio and Video

The author of this article is Li Xinchun of Shanghai Jiao Tong University, winner of the RTE 2021 Innovative Programming Challenge. Here he shares the ideas, system design, and development experience behind his entry.

01 Background not to be ignored

At the national level, one direction for China's artificial intelligence development during the 14th Five-Year Plan period is that new product designs and platforms based on AI hardware will become mainstream. AI solutions are shifting from a "software" model to a "software + hardware" model. With intelligent computing chips and systems, new kinds of intelligent sensor devices, and integrated platforms, a new generation of basic AI support platforms is maturing. Built on AI hardware and the coordinated development of "device + cloud + chip", products are expected to achieve breakthroughs in perception, understanding, reasoning, and decision-making.

From the perspective of enterprise development, AI technology is being applied ever more widely across society, from basic face recognition to driverless vehicles. Whether the technique is machine learning or deep learning, and although this is still the era of weak AI, it already has a significant impact on people's lives. Building distinctive AI products that draw on enterprise application practice and follow the direction of AI development is therefore a strategic direction worth exploring.

02 Origin of this project

I have long worked in the technology field, moving from AR and VR to AI, accumulating experience through continuous practice and thinking about how future technology will change everyday life. AI is currently applied in areas such as factory production, life services, and social governance, but each field stands alone, with its own algorithms and models. Is it possible to build a cloud platform that accepts all kinds of audio and video for real-time analysis and feedback, forming an AI cloud service platform? The starting point of this entry is therefore to work out a feasible application practice and to propose a system architecture for an AI platform in the cloud.

03 System configuration

This project uses YOLO V3 as the basic recognition engine, Agora audio and video transmission as the data source for the intelligent terminals, and the open-source hardware NodeMCU with its supporting equipment as the representative intelligent hardware terminal. Together they form a smart home platform on the home LAN.

YOLO V3: The third version of the YOLO (You Only Look Once) series of object detection algorithms. This version improves both the detection of small objects and the speed. The series has since been updated to V5, which further improves speed and recognition results. Simply put, the algorithm can recognize data in real time, its accuracy meets basic requirements, and it is cheap to configure, use, and learn.

Agora SDKs: This project uses two SDKs provided by Agora. The RTC (real-time audio and video communication) SDK transmits real-time video and audio. The RTM (cloud signaling) SDK provides efficient, highly concurrent real-time messaging. The SDKs support more than 20 development platforms, including iOS, Android, Windows, macOS, Web, and mini programs, which makes expansion and multi-platform interactive development easier. Registered users also get 10,000 free minutes each month, which fully covers the daily needs of ordinary developers, and the measured end-to-end delay on a 4G network was under 400 ms. The development and testing experience was also very good.

Smart hardware: The hardware actually used in the project includes a NodeMCU that controls a smart car, plus terminals for audio and video transmission (Raspberry Pi, Android Things, and so on). Because the terminals at hand were too underpowered, old mobile phones were developed into smart hardware control terminals, and all the smart hardware devices in the home are managed together over the local area network.

Development environment: Because the project uses machine learning algorithms, it places some demands on the equipment. My current development environment is as follows:

Hardware environment:

CPU: Intel Core i7-9700K
GPU: GTX 1050 Ti
Memory: 16 GB, 500 GB SSD

Software Environment:

VS2015
Arduino IDE
Unity 3D 2019.2

04 System Design

The platform was initially planned as a home-only AI application: a smart management platform for use within the home, built from family laptops, network surveillance cameras, smart hardware devices, Internet of Things terminals, and the Agora audio and video platform. Its main functions, such as smart hardware access, real-time audio and video communication, and real-time signaling control, cover the complete logic from device management to event processing.

Platform functions: The platform wraps YOLO V3 so that it can be called for real-time recognition in Unity3D; adapts the Agora audio and video transmission script to recognize the video frames delivered by its callback in real time and output recognition events; builds an Agora signaling message group for smart hardware control; and uses Unity Charts for data display and analysis.

Device management: As smart devices become more powerful and networks faster, audio- and video-based smart hardware is developing rapidly. To stay decoupled, the management platform only needs to interface with hardware devices that are already well packaged. The current home management platform focuses on video management and supports several kinds of video access: Agora video streams, web cameras, HTTP video streams (HLS), video files, and so on.

AI algorithm: The system uses the open-source YOLO V3 algorithm to process the video data. Other AI recognition algorithms can also be plugged in, provided the problems of calling them and returning their data in Unity are solved. Through its YOLO V3 wrapper, the current home management platform supports calls from C# and C++. The project currently supports the basic 80-class object recognition (bundled), helmet recognition (downloaded from the Internet), hockey puck recognition (custom-trained), and so on. All of these can be downloaded from the files of the GitHub project, and a different type of recognition is enabled by replacing the files of the same name.
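As a rough illustration of why swapping the model files changes what the platform reports, the sketch below maps detection class ids to labels read from a names list and keeps only detections above a confidence threshold. The `Detection` struct and the functions `load_names` and `label_detections` are hypothetical names chosen for this example; they are not part of the project's YoloWrapper, and the struct is only a simplified stand-in for a YOLO bounding box.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified detection record (not the wrapper's real type).
struct Detection {
    unsigned int obj_id;  // index into the class-name list
    float prob;           // confidence in [0, 1]
};

// Read one class name per line; the line index is the class id.
std::vector<std::string> load_names(std::istream& in) {
    std::vector<std::string> names;
    std::string line;
    while (std::getline(in, line)) {
        // Tolerate files saved with Windows (CRLF) line endings.
        if (!line.empty() && line[line.size() - 1] == '\r') {
            line.erase(line.size() - 1);
        }
        names.push_back(line);
    }
    return names;
}

// Return the labels of all detections whose confidence is at least `thresh`.
// Class ids that fall outside the name list are reported as "unknown".
std::vector<std::string> label_detections(const std::vector<Detection>& dets,
                                          const std::vector<std::string>& names,
                                          float thresh) {
    assert(thresh >= 0.0f && thresh <= 1.0f);
    std::vector<std::string> labels;
    for (std::size_t i = 0; i < dets.size(); ++i) {
        if (dets[i].prob < thresh) continue;
        if (dets[i].obj_id < names.size()) {
            labels.push_back(names[dets[i].obj_id]);
        } else {
            labels.push_back(std::string("unknown"));
        }
    }
    return labels;
}
```

With this arrangement, replacing the names list (together with the matching configuration and weights) changes the reported labels without touching the calling code.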

Model construction: As mentioned above, the platform already supports YOLO's basic 80-class object recognition, helmet recognition, and hockey puck recognition. Any model trained independently for YOLO V3 can be used in this project, which greatly reduces the difficulty of program development.

Hardware control: The platform currently uses Agora cloud signaling for remote device control. It builds group rooms for real-time message exchange within the home and supports smart hardware control through custom protocols. For some smart hardware enthusiasts, a single script may be all that is needed to connect to this platform.

Event processing: Agora cloud signaling sends recognition messages and control commands within the home smart management group, delivering them to mobile users and smart devices as instant messages. Both online and offline messages are supported, so users do not miss a message.

Mobile application: The project uses Agora live audio and video together with cloud signaling so that users can control smart hardware in real time while receiving messages, for example switching devices on and off or rotating the camera. Because Agora is extensible and well encapsulated, the application can be decoupled, which makes it easier to program each module separately and improves system availability.
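The article does not spell out the custom protocol, so the following is only a sketch of what a text command carried in a signaling message could look like. It assumes a hypothetical `device:action[:value]` format such as `camera:rotate:90` or `car:stop`; the `Command` struct and the functions `parse_command` and `format_command` are invented for this illustration and do not come from the project.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical command carried as the text of a signaling message.
struct Command {
    std::string device;  // e.g. "camera" or "car"
    std::string action;  // e.g. "rotate" or "stop"
    int value;           // optional numeric argument, 0 when absent
};

// Parse "device:action" or "device:action:value".
// Returns false and leaves `out` untouched if the text is malformed.
bool parse_command(const std::string& text, Command& out) {
    std::vector<std::string> parts;
    std::stringstream ss(text);
    std::string part;
    while (std::getline(ss, part, ':')) parts.push_back(part);

    if (parts.size() < 2 || parts.size() > 3) return false;
    if (parts[0].empty() || parts[1].empty()) return false;

    Command cmd;
    cmd.device = parts[0];
    cmd.action = parts[1];
    cmd.value = 0;
    if (parts.size() == 3) {
        try {
            std::size_t used = 0;
            cmd.value = std::stoi(parts[2], &used);
            if (used != parts[2].size()) return false;  // trailing junk
        } catch (const std::exception&) {
            return false;  // not a number, or out of range
        }
    }
    out = cmd;
    return true;
}

// Build the message text on the sending side.
std::string format_command(const Command& cmd) {
    assert(!cmd.device.empty() && !cmd.action.empty());
    return cmd.device + ":" + cmd.action + ":" + std::to_string(cmd.value);
}
```

A device-side script would then only need to receive the message text, call `parse_command`, and dispatch on `device` and `action`.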

05 System development

Compared with last year, Agora has released an updated RTM instant messaging SDK for Unity3D. The developer platform is also growing richer under professional management and operation, and support for Unity 3D development keeps improving, with many tutorials and examples as well as rich development documentation.

In this project, image recognition has to run on the video frames that Agora delivers, so the VideoSurface.cs script is modified as follows:

// Call the wrapped YOLO SDK; detectedCallback is a managed delegate that processes the detection results
var container = RTCGameManager.rtcYoloManager.GetYolo().Detect(nativeTexture.EncodeToJPG());
if (detectedCallback != null)
{
    detectedCallback.Invoke(new DeviceItem(), container, nativeTexture, width, height);
}

Because the project uses the YOLO machine learning framework, it also involves a lot of C++ development. Beyond configuring, learning, and training YOLO V3, the data exchange between C++ and C# took a lot of effort before Unity 3D could call YOLO smoothly. The project source code includes a YoloWrapper, and the wrapped DLL can be called directly.

YOLO V3 wrapper
At this stage, AlexeyAB's Darknet fork is used to configure YOLO V3 on Windows and to run it for video recognition.
The wrapper currently exposes the following main functions:

// Function pointer type for a C# debug callback, to make debugging easier
typedef void(*FuncPtr)(const char *);
// Pass in the managed debug callback
extern "C" YOLODLL_API void set_debug(FuncPtr fp);
// Set whether to show the OpenCV-rendered output
extern "C" YOLODLL_API void set_show(bool s);
// Test the callback
extern "C" YOLODLL_API void test(char* s);
// Detection on network video streams and web video streams only
extern "C" YOLODLL_API bool detect_net(char* filename, char* type, float thresh, bool use_mean);
// Used together with the previous function
extern "C" YOLODLL_API int update_tracking(uchar* data, bbox_t_container &container, int &w, int &h);
// Helper: color conversion
extern "C" YOLODLL_API int bgr_to_rgb(const uint8_t* src, const size_t data_length, uchar* des);
// Helper: resize (used when the image is too large)
extern "C" YOLODLL_API int resize(const uint8_t* src, const size_t data_length, int w, int h, uchar* des);
// Initialize with the YOLO V3 cfg, weights, and names files
extern "C" YOLODLL_API int init(const char *configurationFilename, const char *weightsFilename, const char* names, int gpu);
// Detect objects in a single image file
extern "C" YOLODLL_API int detect_image(const char *filename, bbox_t_container &container);
// Detect objects in a single image passed as bytes
extern "C" YOLODLL_API int detect_mat(uint8_t* data, const size_t data_length, bbox_t_container &container, float thresh, bool use_mean);
// Shut down the system
extern "C" YOLODLL_API int dispose();
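The article does not show the body of the color conversion helper, so the following is only a sketch of what a BGR-to-RGB conversion generally involves. It assumes the input is a raw buffer of interleaved 3-byte pixels, which may differ from what the wrapper's `bgr_to_rgb` actually receives; the name `swap_bgr_rgb` and its return convention are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Copy interleaved 3-byte pixels from `src` to `des`, swapping the first and
// third channel of every pixel (BGR -> RGB, or equally RGB -> BGR).
// Returns the number of pixels converted, or -1 if `data_length` is not a
// multiple of 3. `des` must have room for `data_length` bytes.
int swap_bgr_rgb(const std::uint8_t* src, std::size_t data_length, std::uint8_t* des) {
    assert(src != nullptr && des != nullptr);
    if (data_length % 3 != 0) return -1;
    for (std::size_t i = 0; i < data_length; i += 3) {
        // Read all three channels first so that src == des (in-place) also works.
        const std::uint8_t b = src[i];
        const std::uint8_t g = src[i + 1];
        const std::uint8_t r = src[i + 2];
        des[i] = r;
        des[i + 1] = g;
        des[i + 2] = b;
    }
    return static_cast<int>(data_length / 3);
}
```

A conversion of this kind is typically needed because OpenCV stores pixels in BGR order, while most texture and display APIs expect RGB.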

For smart hardware control, this project uses the open-source hardware NodeMCU. Although it is hardware, once wrapped it can be programmed with familiar C++ habits to handle port control and data exchange smoothly. In this project, a servo steers the camera, motors drive the car, and an ultrasonic sensor provides obstacle avoidance.

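// loop(): repeatedly check for an obstacle ahead and service the web server
void loop() {
  // Ultrasonic ranging: send a 10-microsecond trigger pulse, then time the echo
  digitalWrite(trigPin, LOW);
  delayMicroseconds(2);
  digitalWrite(trigPin, HIGH);
  delayMicroseconds(10);
  digitalWrite(trigPin, LOW);
  duration = pulseIn(echoPin, HIGH);
  distance = ((duration / 2) / 29.1);  // echo time (microseconds) to distance (cm)
  // Keep the three most recent readings
  if (disLeng < 3) {
    disAvg[disLeng] = distance;
    disLeng++;
  } else {
    disAvg[0] = disAvg[1];
    disAvg[1] = disAvg[2];
    disAvg[2] = distance;
  }
  // Distance control: stop when the average of the last three readings is under 10 cm.
  // Checked only once three readings exist, so empty slots cannot trigger a stop.
  if (disLeng == 3 && (disAvg[0] + disAvg[1] + disAvg[2]) / 3 < 10) {
    stop_motors();
  }
  delay(50);
  // Handle requests to the server
  server.handleClient();
  delay(50);
}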
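The obstacle check in the loop above rests on two small calculations: converting the echo time to a distance, and averaging the last three readings so that a single noisy sample does not stop the car. The sketch below isolates both so they can be tried on a desktop compiler without the board. `echo_to_cm`, `DistanceFilter`, and `too_close` are hypothetical names for this illustration, not code from the project.

```cpp
#include <cassert>
#include <cstddef>

// Convert an ultrasonic echo time in microseconds to centimeters.
// Sound needs roughly 29.1 microseconds to travel 1 cm, and the echo covers
// the distance twice (out and back), hence the division by 2.
double echo_to_cm(unsigned long duration_us) {
    return (duration_us / 2.0) / 29.1;
}

// Sliding window over the three most recent distance readings.
class DistanceFilter {
public:
    DistanceFilter() : count_(0), next_(0) {}

    // Store a reading, overwriting the oldest one once the window is full.
    void add(double cm) {
        samples_[next_] = cm;
        next_ = (next_ + 1) % kSize;
        if (count_ < kSize) ++count_;
    }

    // True once three readings have been collected.
    bool ready() const { return count_ == kSize; }

    // Mean of the readings collected so far.
    double average() const {
        assert(count_ > 0);
        double sum = 0.0;
        for (std::size_t i = 0; i < count_; ++i) sum += samples_[i];
        return sum / static_cast<double>(count_);
    }

private:
    static const std::size_t kSize = 3;
    double samples_[kSize];
    std::size_t count_;
    std::size_t next_;
};

// Stop condition: a full window whose average is below `limit_cm`.
bool too_close(const DistanceFilter& filter, double limit_cm) {
    return filter.ready() && filter.average() < limit_cm;
}
```

Requiring a full window before testing the threshold avoids a false stop during the first two loop iterations, at the cost of a short delay before obstacle avoidance becomes active.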

In the end, after more than a month of development, the project produced a preliminary, demonstrable system. I hope that, with the help of the Agora Challenge platform, we can gather the strength of like-minded friends to enrich and expand this platform together, and offer developers in various fields some new ideas and practices.

06 Final thoughts

About this project: At present the project is only a small integrated management platform that combines local AI recognition, the Agora audio and video platform, cloud signaling, and smart hardware. The final form of the platform should be:

A cloud-service AI recognition middle platform, with a configurable library of models and algorithms and a complete event handling process.
A cloud-service device management platform that provides comprehensive management and data input and output for all connected hardware and sensor devices.
A smart ecosystem: use Agora audio and video, cloud signaling, and so on to develop apps for set-top boxes, home robots, and monitoring devices, and use smart hardware to develop applications for home devices such as smoke detectors, infrared sensors, and one-button SOS.

Of course, such a system still requires a great deal of manpower and material resources, as well as more system design and development; the road is long and difficult.
