CBLUE (Chinese Biomedical Language Understanding Evaluation Benchmark) covers four common medical natural language processing task families: medical information extraction, medical terminology normalization, medical text classification, and medical question answering.
1. Introduction
With the continuous development of artificial intelligence (AI), more and more researchers have turned their attention to the research and application of AI in healthcare. Building standard datasets and scientific evaluation systems is an important part of accelerating the industrial adoption of AI. CBLUE [1], a Chinese medical information processing challenge initiated by the Medical Health and Biological Information Processing Committee of the Chinese Information Processing Society, was launched in April this year. The benchmark covers 8 classic medical natural language understanding tasks and is the industry's first public evaluation benchmark for Chinese medical information processing. It has attracted wide attention since going live, with more than 100 teams participating on the leaderboard. Recently, the CBLUE working group published a paper [2] and open-sourced the benchmark baselines [3], hoping to advance the Chinese medical AI community. This article gives a comprehensive introduction to common medical natural language understanding tasks and modeling methods.
2. Task introduction
The full name of CBLUE is Chinese Biomedical Language Understanding Evaluation Benchmark. It includes four common medical natural language processing task families: medical information extraction, medical terminology normalization, medical text classification, and medical question answering. CBLUE provides researchers with real-world data along with a unified evaluation method across tasks, with the aim of encouraging researchers to focus on the generalization ability of AI models.
Each subtask is briefly introduced below:
(1) Medical information extraction:
- CMeEE (Chinese Medical Entity Extraction dataset): a medical named entity recognition task that identifies key terms in medical text, such as "disease", "drug", and "examination and test". The task focuses on common pediatric diseases, and the data come from authoritative medical textbooks and expert guidelines.
- CMeIE (Chinese Medical Information Extraction dataset): a medical relation extraction task that determines the relation between two entities in medical text, such as the "disease-examination" relation between "rheumatoid arthritis" and "joint tenderness count". The data source is the same as CMeEE. Entity recognition and relation extraction are foundational techniques in medical natural language processing, applicable to structuring electronic medical records and constructing medical knowledge graphs.
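Entity annotations in such datasets are typically given as character spans over the text. The sketch below (the sentence, offsets, and label names are illustrative only, not the actual CMeEE schema) shows how gold spans can be converted to the per-character BIO tags commonly used to train a Chinese sequence labeler:

```python
def spans_to_bio(text, entities):
    """Convert (start, end, label) entity spans to per-character BIO tags.

    Chinese NER is usually tagged at the character level, so each
    character of the text receives exactly one tag. `end` is exclusive.
    """
    tags = ["O"] * len(text)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# Hypothetical example: a disease span and a drug span in
# "儿童哮喘可用布地奈德治疗" (offsets and labels are illustrative).
text = "儿童哮喘可用布地奈德治疗"
entities = [(2, 4, "dis"), (6, 10, "dru")]
print(spans_to_bio(text, entities))
```

A model then learns to predict one tag per character, and predicted spans are recovered by grouping consecutive `B-`/`I-` tags.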
(2) Medical terminology normalization:
- CHIP-CDN (CHIP-Clinical Diagnosis Normalization dataset): a clinical terminology normalization task. In clinical practice, the same diagnosis, operation, drug, examination, or symptom is often written in hundreds of different ways (for example, "Type 2 diabetes" and "Diabetes (Type 2)" denote the same concept). Normalization maps these varied clinical writings to a corresponding standard term (such as an ICD code). In real applications, terminology normalization plays an important role in medical insurance settlement and DRG (diagnosis-related group) products. The dataset comes from "diagnosis" entries written by real doctors and does not involve patient privacy.
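As a rough illustration of the lookup step, here is a naive character-overlap sketch; real CDN systems use learned retrieval and ranking models, and the standard vocabulary below is a made-up stand-in for the actual ICD terminology:

```python
def char_jaccard(a, b):
    """Character-set Jaccard similarity between two Chinese strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def normalize(mention, standard_terms):
    """Map a raw clinical mention to the most similar standard term."""
    return max(standard_terms, key=lambda t: char_jaccard(mention, t))

# Hypothetical standard vocabulary (stand-in for real ICD terms).
standard = ["2型糖尿病", "1型糖尿病", "类风湿关节炎"]
print(normalize("糖尿病(2型)", standard))  # "2型糖尿病"
```

Surface-form matching like this only resolves easy cases; the hard part of CDN is mentions whose standard term shares little surface text with the raw writing.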
(3) Medical text classification:
- CHIP-CTC (CHIP-Clinical Trial Criterion dataset): a classification task over clinical trial eligibility criteria. Clinical trials are scientific studies conducted on human volunteers (also called subjects) to determine the efficacy, safety, and side effects of a drug or treatment; they play a key role in the advancement of medicine and the improvement of human health. Eligibility criteria are the main indicators (such as "age") that trial investigators use to decide whether a subject qualifies for a given trial. Subject recruitment is generally done by manually comparing medical records against the criteria, which is time-consuming, laborious, and inefficient. This dataset was built to promote the use of AI for automatic screening and classification of clinical trial criteria and to improve research efficiency. The data come from a public Chinese clinical trial registry and consist entirely of real clinical trials.
- KUAKE-QIC (KUAKE-Query Intention Classification dataset): a medical search query intent recognition task whose goal is to improve the relevance of search results. For example, the intent behind the query "what should be done for diabetes?" is to find a "treatment plan". The data come from user queries submitted to a search engine.
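A toy keyword-rule baseline makes the intent-classification setup concrete. Both the intent labels and the cue words below are illustrative inventions, not the actual KUAKE-QIC label set; real baselines fine-tune a pre-trained classifier instead:

```python
# Toy cue-word rules mapping query fragments to intent labels.
# Labels and cues are illustrative only (not the real QIC schema).
INTENT_RULES = [
    ("治疗方案", ["怎么办", "怎么治", "如何治疗"]),   # treatment plan
    ("病因分析", ["为什么", "什么原因"]),             # cause analysis
    ("指标解读", ["正常值", "偏高", "偏低"]),         # lab-value reading
]

def classify_intent(query, default="其他"):
    """Return the first intent whose cue words appear in the query."""
    for label, cues in INTENT_RULES:
        if any(cue in query for cue in cues):
            return label
    return default

print(classify_intent("糖尿病应该怎么办？"))  # "治疗方案"
```

Rule systems like this fail on paraphrases ("糖尿病咋处理" matches no cue), which is precisely the gap learned intent classifiers are meant to close.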
(4) Medical search and question answering:
- CHIP-STS (CHIP-Semantic Textual Similarity dataset): a medical sentence semantic matching task. Given question pairs about different diseases, determine whether the two sentences are semantically similar. For example, "What should diabetics eat?" and "Diabetic recipes?" are semantically related, while "Hazards of hepatitis B 'small three yang'" and "Hazards of hepatitis B 'big three yang'" are semantically unrelated. The data come from de-identified online medical consultations.
- KUAKE-QTR (KUAKE-Query/Title Relevance dataset): a medical search "query-page title" relevance matching task, used to judge the relevance between a user's query and the title of a returned page in a search engine scenario. The goal is to improve the relevance of search results.
- KUAKE-QQR (KUAKE-Query/Query Relevance dataset): a medical search "query-query" relevance matching task. Like the QTR task, it judges the semantic relevance between two queries; the goal is to improve recall for long-tail queries in search scenarios.
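A crude lexical baseline for these sentence-pair tasks (character-bigram Jaccard overlap; the 0.3 threshold is arbitrary) also shows why they are hard:

```python
def bigrams(s):
    """Set of character bigrams of a Chinese string."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Jaccard overlap of character bigrams: a crude lexical baseline."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / max(len(ba | bb), 1)

def is_match(a, b, threshold=0.3):  # threshold chosen arbitrarily
    return similarity(a, b) >= threshold

# The two "hepatitis B" questions from the STS example above:
print(similarity("乙肝小三阳的危害", "乙肝大三阳的危害"))
```

Note that this pair, which the dataset labels as semantically unrelated, scores high lexically (5 of 9 bigrams shared). Surface overlap cannot distinguish "small three yang" from "big three yang", which is exactly why these tasks call for fine-tuned semantic encoders rather than string matching.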
3. Task characteristics
The CBLUE working group summarized the characteristics of the eight tasks included in the benchmark:
- Data anonymity and privacy protection: Biomedical data usually contain sensitive information, so using such data may violate personal privacy. Before publishing the benchmark, we anonymized the data without affecting its validity and manually inspected the records one by one.
- Rich task data sources: For example, the "medical information extraction" tasks come from medical textbooks and expert guidelines; the "medical text classification" tasks come from real, public clinical trial data; and the "medical question answering" tasks come from search engine logs or online medical consultation corpora. This diversity of scenarios and data gives researchers a valuable resource for studying AI algorithms, while also posing a greater challenge to the generality of AI models.
- Realistic task distribution: All data on the CBLUE leaderboard come from the real world; the data are authentic and noisy, which places higher demands on model robustness. Take the "medical information extraction" tasks as an example: the datasets follow a long-tailed distribution, as shown in Figure (a); in addition, some datasets (such as CMeIE) have a hierarchy of coarse-grained and fine-grained relation labels, consistent with medical common sense and human cognition, as shown in Figure (b). Real-world data distributions place higher demands on the generalization ability and scalability of AI models.
4. Method introduction
Large-scale pre-trained language models, as represented by BERT [4], have become the new paradigm for solving NLP problems. The CBLUE working group therefore selected 11 of the most common Chinese pre-trained language models as baselines, conducted thorough experiments, and evaluated their performance on each dataset in detail. This is currently the industry's most comprehensive set of baselines for Chinese medical natural language understanding, and it can help practitioners solve common problems in this area.
The 11 pre-trained language models used in the experiments are introduced below:
- BERT-base [4]: the BERT base model, with 12 layers, 768-dimensional hidden representations, 12 attention heads, and 110M parameters in total;
- BERT-wwm-ext-base [5]: a Chinese BERT base model pre-trained with Whole Word Masking (WWM);
- RoBERTa-large [6]: compared with BERT, RoBERTa removes the Next Sentence Prediction (NSP) task and dynamically masks the training data;
- RoBERTa-wwm-ext-base/large: pre-trained models that combine the advantages of RoBERTa and BERT-wwm;
- ALBERT-tiny/xxlarge [7]: ALBERT shares weights across transformer layers and is pre-trained on two objectives, Masked Language Model (MLM) and Sentence Order Prediction (SOP);
- ZEN [8]: an n-gram enhanced Chinese text encoder based on BERT;
- MacBERT-base/large [9]: MacBERT is an improved BERT that uses MLM as correction as its pre-training task, reducing the gap between the pre-training and fine-tuning stages;
- PCL-MedBERT [10]: a medical pre-trained language model proposed by the Intelligent Medicine Research Group of Peng Cheng Laboratory, with excellent performance on medical question matching and named entity recognition.
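The whole-word-masking idea behind BERT-wwm can be sketched in a few lines. This is a simplified illustration, not the actual pre-training code: word segmentation is given as input here (real implementations run a Chinese segmenter), and a fixed seed is used so the sketch is reproducible:

```python
import random

def whole_word_mask(words, mask_prob=0.15, rng=None):
    """Mask all characters of a word together, instead of masking
    individual characters independently as vanilla Chinese BERT does."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility here
    chars, labels = [], []
    for word in words:
        if rng.random() < mask_prob:
            chars.extend("[MASK]" for _ in word)  # whole word masked
            labels.extend(word)                   # each char is a target
        else:
            chars.extend(word)
            labels.extend("-" for _ in word)      # not a prediction target
    return chars, labels

# Pre-segmented toy sentence: "使用 语言 模型 预测"
chars, labels = whole_word_mask(["使用", "语言", "模型", "预测"], mask_prob=0.3)
print(chars)
print(labels)
```

Because the mask decision is made per word, the model can never recover a masked character from its unmasked neighbors within the same word, which makes the MLM objective harder and more meaningful for Chinese.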
5. Performance Evaluation & Analysis
The following table shows the baseline performance of the 11 pre-trained models on CBLUE:
As the table above shows, larger pre-trained language models achieve better performance. On some tasks (CTC, QIC, QTR, and QQR), models trained with whole word masking do not outperform the other models, indicating that the CBLUE tasks are challenging and require better models. We also found that ALBERT-tiny achieves performance comparable to the base models on the CDN, STS, QTR, and QQR tasks, suggesting that smaller models can also be effective on specific tasks. Finally, we noticed that the medical pre-trained language model PCL-MedBERT does not perform as well as expected, which further confirms the difficulty of CBLUE: current models may struggle to quickly achieve excellent results.
6. Concluding remarks
The goal of the CBLUE challenge leaderboard is to let researchers make effective use of real-world data under the principles of legality, openness, and sharing, and to direct more attention to model generalization through its multi-task setting. We also hope the publicly available baseline code can effectively promote technological progress in the medical AI community. The baseline code is available at https://github.com/CBLUEbenchmark/CBLUE ; readers who find it helpful are welcome to star the project. Those who want to test their skills on the challenge leaderboard can visit: https://tianchi.aliyun.com/specials/promotion/2021chinesemedicalnlpleaderboardchallenge
7. References
[1].https://mp.weixin.qq.com/s/wIqPaa7WBgkxUGLku0RBEw
[2]. https://arxiv.org/pdf/2106.08087.pdf
[3]. https://github.com/CBLUEbenchmark/CBLUE
[4]. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2018.
[5]. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for chinese bert. arXiv preprint arXiv:1906.08101, 2019.
[6]. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[7]. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.
[8]. Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. Zen: pre-training chinese text encoder enhanced by n-gram representations. arXiv preprint arXiv:1911.00720, 2019.
[9]. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting pre-trained models for chinese natural language processing. arXiv preprint arXiv:2004.13922, 2020.
[10]. https://code.ihub.org.cn/projects/1775