In the recently announced results of the 2021 Music Information Retrieval Evaluation eXchange (MIREX), the NetEase Shufan Yizhi voice team, working with the NetEase Cloud Music Audio and Video Lab and drawing on production-grade AI capabilities, took first place in both lyrics recognition and set list identification, breaking the world record in each track.
MIREX is a leading international evaluation campaign in the field of music information retrieval. Participants submit models, which are scored on a public validation set and a non-public test set, so that cutting-edge techniques in audio information retrieval and music signal processing receive a fair and credible evaluation. Since its launch in 2005, it has attracted broad participation from world-renowned universities, research institutions, and technology companies. Well-known teams in the field, such as the National University of Singapore and Queen Mary University of London, have taken part.
Breaking the world record by a wide margin
In MIREX 2021, NetEase Shufan and the Cloud Music team entered two tracks: Automatic Lyrics Transcription (lyrics recognition) and Set List Identification (song list recognition). The latter takes a recording of a live concert and, matching it against the studio versions of the songs, outputs the songs performed in the concert in chronological order (task1) and the start and end times of each song (task2).
In the lyrics recognition track, NetEase lowered the best WER (word error rate) from 37.02 (the best score in 2020) to 11.45. As an experimental result, that cuts the error rate by more than two thirds; in production, it is the difference between unusable and usable.
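For readers unfamiliar with the metric, the following is a minimal sketch of how WER is conventionally computed (word-level edit distance divided by the number of reference words) and of the relative reduction implied by the two scores. It assumes the scores quoted above are percentages; the function names are illustrative and are not taken from the MIREX evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


def relative_reduction(old_score: float, new_score: float) -> float:
    """Fraction by which an error score dropped relative to the old score."""
    return (old_score - new_score) / old_score


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("let it be let it be", "let it be let me be"))
    # Relative WER reduction between the 2020 best and the 2021 result.
    print(round(relative_reduction(37.02, 11.45), 3))
```

Because WER is an error rate, a drop from 37.02 to 11.45 means roughly one word in nine is wrong instead of more than one in three.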
The set list identification track, which had been absent from MIREX for several years because progress in the field had stalled, returned this year and became a stage for NetEase's results. As shown in the table below, the metrics of NetEase's submitted model improved significantly over previous years, and on individual metrics the gap exceeded a factor of 12.
Comparison of the best results in the validation set over the years:
Here, ED is the edit distance between the song sequence predicted in task1 and the ground truth; smaller is better. sBD and eBD are the errors, in seconds, of the song start times and end times predicted in task2; again, smaller is better.
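As an illustration of these metrics, here is a minimal sketch of a sequence edit distance and a mean boundary deviation. It assumes that the boundary deviation is an average absolute time difference over songs that have already been paired with their ground-truth entries; the exact MIREX scoring protocol is not described in this article, and the function names are hypothetical.

```python
from typing import Sequence


def edit_distance(predicted: Sequence[str], truth: Sequence[str]) -> int:
    """Levenshtein distance between two song-ID sequences (for ED)."""
    prev = list(range(len(truth) + 1))
    for i, pred_song in enumerate(predicted, start=1):
        curr = [i]
        for j, true_song in enumerate(truth, start=1):
            cost = 0 if pred_song == true_song else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def mean_boundary_deviation(predicted: Sequence[float],
                            truth: Sequence[float]) -> float:
    """Mean absolute difference in seconds between paired boundaries.

    Assumption: predicted[k] and truth[k] refer to the same song.
    Apply to start times for sBD and to end times for eBD.
    """
    if len(predicted) != len(truth) or not truth:
        raise ValueError("need equally sized, non-empty boundary lists")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


if __name__ == "__main__":
    # One spurious song in the predicted set list.
    print(edit_distance(["song_a", "song_b", "song_c"], ["song_a", "song_c"]))
    # Start times off by 2 s and 5 s.
    print(mean_boundary_deviation([10.0, 200.0], [12.0, 195.0]))
```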
Comparison of the best results in the test set over the years:
Several innovations to improve the model's robustness to interference
According to participants from the NetEase Shufan Yizhi team, the tasks in this competition differ from ordinary speech recognition. The dataset for the lyrics recognition track comes from an overseas karaoke app, which means the training data has noisy backgrounds and more interference, and it includes lower-quality singing such as skipped lines, wrongly sung lines, and improvised dialogue or monologue. Even when the sung lyrics closely match the original lyrics, the task is still more complex than ordinary speech recognition, because the background music is still present and the same words tend to be sung with different pitch, intonation, and speed across styles and rhythms. Such a complex setting poses a major challenge for model training: the model must be highly robust to background music and noise in order to recognize the lyrics correctly.
For lyrics recognition, NetEase made many targeted optimizations to both data and models. Building on a speech recognition pipeline, the team refined the framework extensively, borrowed ideas from pre-trained language models to improve robustness to interference, and tuned the model in stages to raise accuracy, which is what allowed it to break the world record by such a wide margin.
Specifically, the audio with accompaniment is fed directly into the model so that as much of the original information as possible is preserved, and the singing voice is then modeled on its own. For the background music, labels for various kinds of noise are introduced, and, following the idea behind popular pre-trained language models, the acoustic model is trained with a masking objective to improve its context awareness and robustness to interference. For the singing voice, training and tuning are staged: a speech model serves as the seed model, and it is then tuned in stages on lyrics data.
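The masking idea can be illustrated with a small sketch. The article does not say how the masks were applied, so everything below is an assumption for illustration only: contiguous spans of feature frames are replaced with a constant value, and the returned flags mark which frames a model would have to reconstruct or recognize from context. Only the input corruption is shown, not the training objective itself.

```python
import random
from typing import List, Optional, Tuple

Frame = List[float]


def mask_time_spans(frames: List[Frame],
                    span: int = 3,
                    mask_prob: float = 0.2,
                    mask_value: float = 0.0,
                    seed: Optional[int] = None) -> Tuple[List[Frame], List[bool]]:
    """Replace random contiguous spans of frames with mask_value.

    span, mask_prob and mask_value are hypothetical hyperparameters,
    not values reported by the NetEase team.
    """
    rng = random.Random(seed)
    masked = [list(frame) for frame in frames]  # copy, leave input intact
    is_masked = [False] * len(frames)
    t = 0
    while t < len(frames):
        if rng.random() < mask_prob:
            # Mask a span starting at t, clipped to the sequence length.
            for k in range(t, min(t + span, len(frames))):
                masked[k] = [mask_value] * len(frames[k])
                is_masked[k] = True
            t += span
        else:
            t += 1
    return masked, is_masked


if __name__ == "__main__":
    features = [[float(i), float(i) + 0.5] for i in range(10)]
    masked_features, flags = mask_time_spans(features, seed=0)
    for frame, flag in zip(masked_features, flags):
        print("MASK" if flag else "keep", frame)
```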
The lyrics in the karaoke data may contain errors or redundant information, such as lyricist and songwriter credits that are not part of the sung lyrics. This content also interferes with model training and needs to be filtered out. To this end, NetEase Shufan developed a process for automatically screening lyrics data, which uses the confidence of a pre-trained model to filter the data. The process is iterative: model accuracy improves through repeated rounds of screening.
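The screening loop described above can be sketched roughly as follows. This is only an outline under stated assumptions: `train` stands for whatever procedure produces a model that can score a sample's confidence, and the threshold, the number of rounds, and the toy scoring rule are all hypothetical. The article does not give these details.

```python
from typing import Callable, List, Tuple

Sample = Tuple[str, str]               # (audio_id, lyrics_text)
Scorer = Callable[[Sample], float]     # confidence in [0, 1]
Trainer = Callable[[List[Sample]], Scorer]


def filter_by_confidence(samples: List[Sample], score: Scorer,
                         threshold: float) -> List[Sample]:
    """Keep only samples the current model is confident about."""
    return [sample for sample in samples if score(sample) >= threshold]


def iterative_screening(samples: List[Sample], train: Trainer,
                        threshold: float, rounds: int) -> List[Sample]:
    """Alternate between (re)training a scorer and filtering the data."""
    kept = list(samples)
    for _ in range(rounds):
        score = train(kept)            # model trained on the current data
        kept = filter_by_confidence(kept, score, threshold)
    return kept


def toy_train(data: List[Sample]) -> Scorer:
    """Stand-in for real training: distrusts lines that look like credits."""
    def score(sample: Sample) -> float:
        _, lyrics = sample
        return 0.1 if "written by" in lyrics.lower() else 0.9
    return score


if __name__ == "__main__":
    data = [("clip_1", "hello from the other side"),
            ("clip_2", "Written by an example songwriter")]
    print(iterative_screening(data, toy_train, threshold=0.5, rounds=2))
```

In a real pipeline the scorer would come from the acoustic model itself, for example from how well the transcript aligns with the audio, so each round of cleaner data would yield a better scorer for the next round.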
For set list identification, the industry's traditional approach is based on signal processing, but there has been no major breakthrough in that technology, which is why the track had been dormant for so long. This time NetEase brought a combination of lyrics recognition and text retrieval to the task, and that is what produced the leap in results.
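To make the "lyrics recognition + text retrieval" idea concrete, here is a minimal sketch of the retrieval half: a transcript produced by a lyrics recognizer is compared against a catalog of known lyrics and the closest song is returned. The use of `difflib.SequenceMatcher`, the catalog structure, and the function name are assumptions chosen for illustration; the article does not describe the retrieval method NetEase actually used.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple


def best_matching_song(transcript: str,
                       catalog: Dict[str, str]) -> Tuple[Optional[str], float]:
    """Return (title, similarity) of the catalog lyrics closest to transcript.

    Similarity is SequenceMatcher.ratio() over word sequences, so a few
    misrecognized words lower the score without destroying the match.
    """
    words = transcript.lower().split()
    best_title: Optional[str] = None
    best_score = 0.0
    for title, lyrics in catalog.items():
        matcher = SequenceMatcher(None, words, lyrics.lower().split(),
                                  autojunk=False)
        score = matcher.ratio()
        if score > best_score:
            best_title, best_score = title, score
    return best_title, best_score


if __name__ == "__main__":
    # Hypothetical two-song catalog.
    catalog = {
        "Song A": "twinkle twinkle little star how i wonder what you are",
        "Song B": "row row row your boat gently down the stream",
    }
    # Transcript with one recognition error ("car" instead of "star").
    print(best_matching_song("twinkle twinkle little car how i wonder", catalog))
```

Applied to consecutive segments of a concert recording, the matched titles in time order would give a candidate set list, and the segment times would give candidate song boundaries.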
Production-level innovation based on the music business
Breaking the record by a wide margin is not the whole story. NetEase Shufan's technical solution also scales well. With more training data (the dataset used in the competition is not very large) it should perform even better, and it can easily be extended to Japanese and Korean lyrics and set lists. In fact, these technologies are already in use in NetEase Cloud Music's business. In other words, this is a production-grade breakthrough in industry rather than academic AI research in a laboratory.
Over the years, NetEase Cloud Music has been committed to promoting the diversity and prosperity of China's music industry through the Internet and digital technology. Since the platform launched its "NetEase Musician" service at the end of 2016, it had gathered more than 400,000 original musicians by the end of 2021. NetEase Cloud Music continues to improve product features and user experience and to expand the value of music, for example through song retrieval for community videos and LOOK live streaming. Lyrics recognition was already in use in the production system in 2020.
"Chinese Music Trend Report (2022)"
During this process, NetEase Cloud Music ran into challenges similar to those of the MIREX tracks described above. For example, when different original musicians perform the same song, the timbre, the rhythm, and even the lyrics can change, and mixing Chinese and English is also common; all of these are sources of interference. NetEase Cloud Music therefore worked with the NetEase Shufan Yizhi team to improve the product experience with this set of techniques.
According to experts from the NetEase Cloud Music Audio and Video Lab, the main benefits of the competition techniques within Cloud Music are lower labor costs and better business results.
On labor costs, one need of the music library is to upgrade line-by-line lyrics to word-by-word lyrics (as in a karaoke display). The technology aligns the boundaries of the lyrics with the melody automatically, which saves a great deal of manual work. Another scenario is the security of the music library: NetEase Cloud Music built a sensitive-lyrics checking system on top of lyrics recognition, so that sensitive words can be detected automatically at low cost.
On business results, a typical scenario is humming recognition. NetEase Cloud Music has markedly improved recognition quality by combining melody matching with lyrics recognition, and the feature is currently in a gray release to 20% of traffic. A second scenario applies the set list identification solution to song recognition in mlog videos, combined with audio fingerprinting and cover recognition to form a unified song recognition solution. For videos that NetEase Cloud Music users post in mlog, this solution can identify the songs sung in the video and match them to the corresponding songs in the music library, linking the video to the library so that each drives traffic to the other. There is also a live streaming application: audio analysis of LOOK live streams based on this technology can accurately identify the songs sung by the streamer.
The co-construction model accelerates the implementation of AI
The successful application of the MIREX techniques once again validates the cross-business-unit (cross-BU) co-construction model commonly used within NetEase. Co-creation and co-construction let the two teams play to each other's strengths and offset each other's weaknesses, keep the research and development direction close to business needs, and speed up deployment.
Take the mlog video song recognition application above as an example. The audio fingerprinting used in the solution was developed by the NetEase Cloud Music Audio and Video Lab, and it is the technology that broke a six-year record at MIREX 2020. It is fast and highly robust to noise, but it cannot identify different versions of a song, and that is exactly the strength of the lyrics recognition technology developed by the NetEase Shufan Yizhi team: as long as the lyrics are the same, different versions can be identified.
Of course, lyrics recognition alone cannot handle songs without lyrics or songs in various foreign languages. That requires NetEase Cloud Music's cover recognition technology, which can handle both cases but is less robust to noise, so the two are complementary.
In humming recognition, NetEase Cloud Music's technology can handle users humming or even whistling. In practice, however, users can often sing the lyrics but sing them out of tune, and in that case adding lyrics recognition gives better results.
In short, these four music recognition technologies, each with its own strengths, can be integrated into a comprehensive solution that greatly expands the range of business scenarios and delivers good results.
Co-creation and co-construction also point to a more exciting future. Participants from the two teams expect the technology from this competition to prove valuable in scenarios such as security detection, music copyright detection, music content providers, and the media industry.