Brief introduction: CBLUE (Chinese Biomedical Language Understanding Evaluation) covers four kinds of common medical natural language processing tasks: medical text information extraction, medical terminology normalization, medical text classification, and medical question answering.

1. Introduction

With the continuous development of artificial intelligence (AI) technology, more and more researchers are turning their attention to the research and application of AI in healthcare, and an important step in accelerating the industrial adoption of AI is the establishment of standard datasets and a scientific evaluation system. CBLUE [1], the Chinese medical information processing challenge leaderboard launched in April by the Medical Health and Bioinformation Processing Professional Committee of the Chinese Information Processing Society of China, covers eight classic medical natural language understanding tasks and is the industry's first public evaluation benchmark for Chinese medical information processing. It has received widespread attention since its launch, attracting more than 100 participating teams. Recently, the CBLUE working group published a paper [2] and open-sourced the evaluation baselines [3], hoping to promote the technical development of the Chinese medical AI community. This article gives a comprehensive introduction to the common medical natural language understanding tasks and the modeling methods.

2. Task introduction

The full name of CBLUE is Chinese Biomedical Language Understanding Evaluation, and it includes four kinds of common medical natural language processing tasks: medical text information extraction, medical terminology normalization, medical text classification, and medical question answering. While providing researchers with real-world data, CBLUE also provides a unified evaluation method across multiple tasks, aiming to encourage researchers to focus on the generalization ability of AI models.

The following is a brief description of each subtask:

(1) Medical information extraction:

  • CMeEE (Chinese Medical Entity Extraction dataset): a medical entity recognition task that identifies key terms in medical texts, such as "disease", "drug", and "examination and test". The task focuses on common pediatric diseases, with data drawn from authoritative medical textbooks and expert guidelines.
  • CMeIE (Chinese Medical Information Extraction dataset): a medical relation extraction task that determines the relationship between two entities in a medical text; for example, "rheumatoid arthritis" and "joint tenderness count" have a "disease-examination" relationship. The data source is the same as CMeEE. Entity recognition and relation extraction are fundamental technologies in medical natural language processing and can be applied to the structuring of electronic medical records and the construction of medical knowledge graphs.
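As a toy illustration of the entity extraction output these tasks expect, the sketch below matches a hand-made term dictionary against a sentence. The terms, labels, and sentence are invented for illustration only; the actual CBLUE baselines fine-tune pretrained language models rather than use dictionary lookup.

```python
# Illustrative term dictionary (not from the CMeEE dataset).
MEDICAL_TERMS = {
    "肺炎": "disease",        # pneumonia -> disease
    "头孢": "drug",           # cephalosporin -> drug
    "血常规": "examination",  # routine blood test -> examination
}

def extract_entities(text: str) -> list[dict]:
    """Return every dictionary term found in the text with its span and type."""
    entities = []
    for term, etype in MEDICAL_TERMS.items():
        start = text.find(term)
        while start != -1:
            entities.append({"entity": term, "type": etype,
                             "start": start, "end": start + len(term)})
            start = text.find(term, start + 1)
    return sorted(entities, key=lambda e: e["start"])

result = extract_entities("患儿肺炎，建议血常规检查后使用头孢。")
```

Dictionary matching is only a starting point: it cannot recognize unseen terms or resolve ambiguity, which is why the benchmark baselines use learned sequence-labeling models instead.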

(2) Medical terminology normalization:

  • CHIP-CDN (CHIP-Clinical Diagnosis Normalization dataset): a clinical terminology normalization task. Clinically, the same diagnosis, operation, drug, examination, or symptom often has hundreds of different written forms (for example, "type Ⅱ diabetes", "diabetes (type 2)", and "type 2 diabetes" all refer to the same concept). The normalization task is to find the corresponding standard term (e.g., an ICD code) for each of these clinical writings. In real-world applications, terminology normalization plays an important role in healthcare insurance settlement and DRG (Diagnosis Related Groups) products. The dataset is derived from "diagnosis" entries written by real doctors and contains no patients' personal information.
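A minimal sketch of the normalization step, assuming a tiny illustrative standard vocabulary and a simple character-overlap score. Real systems rank candidates from the full ICD vocabulary with learned models; everything below is an assumption for illustration.

```python
# Illustrative standard vocabulary (real systems use the full ICD terminology).
STANDARD_TERMS = ["2型糖尿病", "1型糖尿病", "类风湿关节炎"]

def char_overlap(a: str, b: str) -> float:
    """Jaccard similarity over the character sets of the two strings."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

def normalize(raw: str) -> str:
    """Return the standard term most similar to the raw clinical writing."""
    return max(STANDARD_TERMS, key=lambda t: char_overlap(raw, t))
```

For example, `normalize("糖尿病(2型)")` maps the free-text writing to the standard term "2型糖尿病" because they share the most characters.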

(3) Classification of medical texts:

  • CHIP-CTC (CHIP-Clinical Trial Criterion dataset): a clinical trial screening criteria classification task. A clinical trial is a scientific study conducted on human volunteers (also known as subjects) to determine the efficacy, safety, and side effects of a drug or a treatment method; it plays a key role in advancing medicine and improving human health. Screening criteria are the conditions (such as "age") set by the clinical trial leader to determine whether a subject qualifies for a clinical trial. Subjects are usually recruited by manually comparing medical records against the screening criteria, which is time-consuming, laborious, and inefficient. The purpose of this dataset is to promote the use of AI for the automatic classification of screening criteria and improve research efficiency. The data were obtained from an open Chinese clinical trial registration website and consist of real clinical trials.
  • KUAKE-QIC (KUAKE-Query Intention Classification dataset): a medical search query intent recognition task that aims to improve the relevance of search results. For example, the intent of the user query "What tests should be done for diabetes?" is to search for a related "treatment plan". The data come from a search engine's user queries.
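The intent classification idea can be sketched with keyword rules. The intent labels and keywords below are assumptions for illustration; the benchmark baselines fine-tune a pretrained classifier instead of using rules.

```python
# Illustrative keyword-to-intent rules (not the real KUAKE-QIC label set).
INTENT_KEYWORDS = {
    "检查": "diagnosis",      # "examination" -> diagnosis intent
    "吃什么药": "treatment",  # "what medicine to take" -> treatment intent
    "挂什么科": "department", # "which department" -> registration intent
}

def classify_intent(query: str) -> str:
    """Return the first intent whose keyword appears in the query."""
    for keyword, intent in INTENT_KEYWORDS.items():
        if keyword in query:
            return intent
    return "other"
```

Rules like these break down quickly on paraphrases ("化验" vs. "检查"), which is the gap a learned classifier is meant to close.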

(4) Medical Search and Q&A:

  • CHIP-STS (CHIP-Semantic Textual Similarity dataset): a medical sentence semantic matching task. Given a pair of questions about different diseases, determine whether the two sentences have the same meaning. For example, "What should a diabetic eat?" and "A diabetes diet?" are semantically related, while "the harm of hepatitis B 'small three positives'" and "the harm of hepatitis B 'big three positives'" are semantically unrelated. The data are derived from de-identified online patient consultation records.
  • KUAKE-QTR (KUAKE-Query/Title Relevance dataset): a medical search "query-page title" relevance matching task that determines the relevance between a user's query and the title of a returned page in a search engine, with the goal of improving the relevance of search results.
  • KUAKE-QQR (KUAKE-Query/Query Relevance dataset): a medical search "query-query" relevance matching task that, together with the QTR task, determines the semantic relevance between two queries. The goal is to improve the recall of long-tail queries in search scenarios.
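An unsupervised sketch of sentence-pair matching scores two texts by character-bigram overlap and thresholds the score; the 0.3 threshold is an arbitrary assumption for illustration, not part of the benchmark.

```python
def bigrams(s: str) -> set:
    """Set of character bigrams, e.g. '糖尿病' -> {'糖尿', '尿病'}."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def similarity(a: str, b: str) -> float:
    """Jaccard similarity over character bigrams."""
    ba, bb = bigrams(a), bigrams(b)
    return len(ba & bb) / len(ba | bb) if ba | bb else 0.0

def is_match(a: str, b: str, threshold: float = 0.3) -> bool:
    return similarity(a, b) >= threshold
```

Note that surface overlap wrongly matches the hepatitis B "small three positives"/"big three positives" pair above (the sentences differ by a single character yet mean different things), which is precisely why the benchmark baselines rely on fine-tuned pretrained models rather than lexical similarity.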

3. Task characteristics

The CBLUE working group summarized the characteristics of the eight tasks included in the benchmark:

  1. Data anonymization and privacy: biomedical data often contain sensitive information, so using such data may violate individuals' privacy. In response, before publishing the benchmark the working group anonymized the data without affecting its validity and manually checked it item by item.
  2. Rich task data sources: for example, the "medical information extraction" tasks come from medical textbooks and authoritative expert guidelines; the "medical text classification" tasks come from real, open clinical trial data; and the "medical question answering" tasks are derived from search engine queries or online consultation corpora. This diversity of scenarios and data is a valuable resource for researchers studying AI algorithms, while also posing greater challenges to the generality of AI models.
  3. Real task distribution: all data on the CBLUE leaderboard come from the real world. The data are real and noisy, which places higher demands on model robustness. Taking the "medical information extraction" task as an example, the dataset follows a long-tail distribution, as shown in Figure (a). In addition, some datasets (such as CMeIE) have a hierarchy of coarse-grained and fine-grained relation labels, consistent with medical common-sense logic and human cognition, as shown in Figure (b). Real-world data distributions place higher demands on the generalization and extensibility of AI models.
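The long-tail property mentioned above is easy to check on any labeled dataset: count label frequencies and sort them. The toy label list below is invented for illustration and is not the real CMeEE statistics.

```python
from collections import Counter

# Invented toy label list: a few head types dominate, rare types form the tail.
labels = ["disease"] * 50 + ["symptom"] * 20 + ["drug"] * 5 + ["equipment"] * 1
freq = Counter(labels)

# most_common() sorts types from the head to the tail of the distribution.
distribution = freq.most_common()
head_type, head_count = distribution[0]
tail_type, tail_count = distribution[-1]
```

Plotting these counts in descending order produces exactly the kind of long-tail curve shown in Figure (a).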

4. Method introduction

Represented by BERT [4], large-scale pretrained language models have become the new paradigm for solving NLP problems. The CBLUE working group therefore selected 11 of the most common Chinese pretrained language models as baselines, conducted extensive experiments, and evaluated their performance on the datasets in detail. This is currently the industry's most complete set of baselines for Chinese medical natural language understanding tasks and can help practitioners solve common medical natural language understanding problems.

The 11 pretrained language models used in the experiments are summarized as follows:

  • BERT-base [4]: the base BERT model with 12 layers, 768-dimensional representations, and 12 attention heads, for a total of 110M parameters;
  • BERT-wwm-ext-base [5]: a Chinese BERT baseline pretrained with Whole Word Masking (WWM);
  • RoBERTa-large [6]: compared with BERT, RoBERTa removes the Next Sentence Prediction (NSP) task and dynamically selects the masking pattern of the training data;
  • RoBERTa-wwm-ext-base/large: pretrained models combining the advantages of RoBERTa and BERT-wwm;
  • ALBERT-tiny/xxlarge [7]: ALBERT shares parameters across Transformer layers and is pretrained with two objectives, Masked Language Modeling (MLM) and Sentence Order Prediction (SOP);
  • ZEN [8]: a BERT-based Chinese text encoder enhanced with N-gram representations;
  • MacBERT [9]: a modified BERT that uses MLM-as-correction as the pretraining task, reducing the gap between the pretraining and fine-tuning stages;
  • PCL-MedBERT [10]: a medical pretrained language model proposed by the Intelligent Medicine Research Group of Peng Cheng Laboratory, with excellent performance on medical question matching and named entity recognition.

5. Performance evaluation & analysis

The following table shows the baseline performance of the 11 pretrained models on CBLUE:

As the table shows, larger pretrained language models generally achieve better performance. On some tasks, such as CTC, QIC, QTR, and QQR, models using whole word masking did not outperform the others, indicating that the tasks in CBLUE are challenging and call for better models. In addition, ALBERT-tiny achieves performance comparable to the base models on the CDN, STS, QTR, and QQR tasks, indicating that smaller models can also be effective on specific tasks. Finally, the performance of the medical pretrained language model PCL-MedBERT is not as good as expected, which further demonstrates the difficulty of CBLUE: current models may not be able to achieve excellent results quickly.

6. Conclusion

The goal of the CBLUE challenge leaderboard is to enable researchers to make effective use of real-world data under the principles of legality, openness, and sharing, and, through its multi-task setting, to encourage researchers to pay more attention to the generalization performance of models. It is also hoped that the publicly available baseline evaluation code will contribute to the technological advancement of the medical AI community. The baseline code address is: ; readers who find it helpful are welcome to star the project. Those who wish to take on the leaderboard challenge, please click:

7. References




[4]. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2018.

[5]. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Ziqing Yang, Shijin Wang, and Guoping Hu. Pre-training with whole word masking for chinese bert. arXiv preprint arXiv:1906.08101, 2019.

[6]. Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019.

[7]. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942, 2019.

[8]. Shizhe Diao, Jiaxin Bai, Yan Song, Tong Zhang, and Yonggang Wang. ZEN: Pre-training Chinese text encoder enhanced by n-gram representations. arXiv preprint arXiv:1911.00720, 2019.

[9]. Yiming Cui, Wanxiang Che, Ting Liu, Bing Qin, Shijin Wang, and Guoping Hu. Revisiting pre-trained models for Chinese natural language processing. arXiv preprint arXiv:2004.13922, 2020.
