This article comes from Xi Xiaoyao's Maimeng House; the author is Xi Xiaoyao.

Zhihu: How to build high-quality machine learning data sets? www.zhihu.com/question/33…

So I had a warm impulse to answer it here at Maimeng House (~ ∇)

Whether you are doing research or solving business problems, building a dataset is an unavoidable task. Many newcomers feel that releasing a dataset is the easiest way to pad a publication, but if you actually try it you will find that while randomly generating a dataset is very easy, building a working, high-quality, moderately difficult dataset that solves a real problem, or that everyone can meaningfully build on, is not easy at all: it is extremely time-consuming, brain-draining, and expensive.

Although I have never deliberately studied how to build datasets, projects and research have forced me to build close to 10 of them, though only for question answering, dialogue, and some classification problems. So for private messages like "how to build a knowledge graph", please forgive me for passing ╮( ̄▽ ̄)╭

Since I have not studied this problem systematically, I will just share the points I think are most important:

  1. What is high quality
  2. Basic tools
  3. Data and label sources
  4. Preprocessing: enough is enough
  5. Verify usability and build an iterative dataset loop as early as possible
  6. About complex NLP tasks

What is high quality

When first entering the pit, some people may think that "high quality" = "super clean", and so in pursuit of "high quality" they preprocess like crazy and end up in tears.

There are two general motivations for building a dataset. One is research, that is, benefiting fellow researchers and advancing the field;

I have to say, the release of SQuAD gave a big boost to this wave of NLP research

Another is to use a data-driven approach to optimize business metrics or solve real problems in a project.

The definition of “high quality” behind these two seemingly unrelated purposes is very similar: solve the problem!

For the latter purpose, the problem generally comes from the online system.

In general, before you build a dataset there is usually already a system in place (to cold-start the system, a rule-driven version is usually developed first). Once the system is online it naturally generates logs, and after analyzing the bad cases you can see which problems the existing system cannot handle well. These are the problems you can consider solving with data-driven methods, and that is why you build the dataset. Solving these problems is the first goal of your dataset.

For the former purpose, the problem generally comes from the status quo of academic research

At present, NLP research is mostly data-driven, or even dataset-driven. Although this is not a healthy phenomenon, we have to admit that it has driven the development of NLP and the current research boom to a large extent. When an existing dataset fails to cover a domain's pain points, or fails to exercise the potential of available mathematical tools, or has already been essentially solved, a new dataset, or more precisely a new benchmark, is needed.

In other words: what are the industry's pain points? Can we further tap the potential of current mathematical tools? Or are the current mathematical tools simply not mature enough to solve the problem? This should be the first question to consider before building a high-quality dataset.

Think of SNLI[1] in 2015, SQuAD[2] in 2016, GLUE[3] and CoQA[4] in 2018, and now SuperGLUE[5] and MRQA (mrqa.github.io). All of them are problem driven: when the existing datasets no longer cover the pain points, fail to exercise the potential of the mathematical tools, or have been solved nearly to saturation, a new dataset emerges to tackle the next pain point.

Once you have identified the problem to solve, half of the dataset's quality is already guaranteed; the rest depends on how you build it. The most critical issues are the choice of data and label sources and the degree of preprocessing. In addition, building an iterative closed loop and handling complex NLP tasks also have a big impact on how efficiently, and how well, the problem gets solved. These are introduced in turn below (~ ∇)-☆

Basic tools

Unless you are in a terrible hurry, master some useful tools and tricks before building datasets; they can greatly reduce repetitive, inefficient labor and speed up iteration.

  • Check GitHub before writing crawlers or raw-data cleaning scripts yourself

  • **Regular expressions** The standard tool for text cleaning; no explanation needed

  • **Hadoop/Spark** If your corpus is beyond the tens-of-millions level, don't torture your poor little server; move to a cluster

  • **Vim** For eyeballing samples. If the dataset has only tens of thousands, or one or two hundred thousand samples, Vim's performance is generally enough, but the default Vim configuration is rather weak and user-unfriendly, so get familiar with it and configure it in advance. If Vim is not an option, any decent editor with regex search and highlighting will do

  • **awk, grep, cut, wc** and other command-line tools

    Also for sample analysis. If your dataset is too big, Vim will choke, but these tools won't. Of course you can use ipython instead if you don't like these commands, but writing code is slower, saving analysis results is more awkward, and whatever you do, don't open(file).readlines() a huge file.

  • **ipython + screen/tmux** When analyzing important statistics of a dataset, such as the sample length distribution, writing one-off Python scripts in Vim is inefficient, and for a large dataset the repeated IO is unbearable. It is much more efficient to open an ipython session, load the dataset (or a sample of it) into memory once, and then run all kinds of analyses on it (a minimal sketch follows this list). To avoid having to restart after an SSH disconnect, hang the ipython session in a screen or tmux window. Of course, if you load a lot of data, remember to occasionally del useless intermediate results so you don't exhaust the server's memory. Oh, and learn a few common magic commands such as %save, which make it easy to back up complex operations.
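
For example, here is a minimal sketch of the kind of ad-hoc analysis I mean, run inside an ipython session. The file name and the assumption of one JSON sample per line are made up for illustration; the point is to stream the file instead of calling readlines(), and to only pull a manageable slice into memory.

```python
import json
from collections import Counter

def iter_samples(path):
    """Stream samples line by line instead of open(path).readlines()."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)  # assumes one JSON object per line

# Load only a manageable slice into memory for interactive poking around.
samples = []
for i, s in enumerate(iter_samples("train.jsonl")):  # hypothetical file name
    if i >= 200_000:
        break
    samples.append(s)

# Rough length distribution (in whitespace tokens), bucketed by tens.
lengths = Counter(len(s["text"].split()) // 10 * 10 for s in samples)
for bucket, count in sorted(lengths.items()):
    print(f"{bucket:>4}-{bucket + 9:<4} {count}")

# Label distribution: a quick check for wildly imbalanced classes.
print(Counter(s.get("label") for s in samples).most_common())

# del samples when done, and use %save in ipython to back up the session.
```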

Data and label sources

A second key influence on dataset quality is the choice of data and label sources. The data can be constructed and written by hand, crawled from the Internet, or obtained by re-processing public datasets; the labels can likewise be annotated manually or obtained by distant supervision.

Manual construction and annotation

The easiest approach to think of is to have both the data and the labels come from humans (~ ∇) Unfortunately, Xiao Xi does not have the budget to help you accumulate experience on crowdsourcing platforms ( ́ ︿ ̀ ) For many NLP tasks, relatively simple data can usually be found on the Internet, but for some tasks the data is hard to come by and in general can only be carefully constructed by hand (for example natural language inference, most subtasks of task-oriented dialogue, and some sequence labeling tasks such as word segmentation, NER, and extraction). If you want to study annotation systematically, Xiao Xi recommends the book Natural Language Annotation, which I got halfway through in the library. The book is quite nice, and I also recommended it to a PM little sister (//∇//)\ (hope she doesn't see this, hhhh)

Fortunately, for most NLP tasks it is possible to find suitable data sources on the Internet or to modify existing public datasets.

Crawl

If you want to crawl data yourself, you can crawl, or even download, English corpora on demand from foreign sites such as Twitter, Quora, Wiki and Reddit. If the official data-access scripts cannot meet your needs, search GitHub; there is almost always some wild third-party crawler that works around the restrictions. If the target data is Chinese, there are of course Weibo, Tieba, Douban, Baidu Baike, Zhihu and other sites waiting to be crawled.

Of course, the downside of Twitter, Weibo, Tieba and similar sites is that they are full of filler content, so remember to look for the corresponding preprocessing scripts on GitHub to slim the text down. Just be careful not to use scripts that are too aggressive; cleaning too thoroughly can cause problems of its own, for reasons explained later.
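
As a toy illustration of "crawl on demand" (not any site's official API; the URL and the cleanup threshold are made up, and you should respect robots.txt and terms of use), here is a sketch that fetches a page with requests + BeautifulSoup and keeps only lines long enough to look like real text:

```python
import requests
from bs4 import BeautifulSoup  # pip install requests beautifulsoup4

def fetch_text(url):
    """Download a page and strip it down to visible text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop scripts, styles and navigation boilerplate before extracting text.
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n")

def slim(text, min_chars=30):
    """Keep only lines long enough to look like real sentences."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if len(line) >= min_chars]

# Hypothetical target page used purely as an example.
corpus_lines = slim(fetch_text("https://en.wikipedia.org/wiki/Distant_supervision"))
print(corpus_lines[:5])
```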

Modify

Honestly, crawling your own data is a lot of dirty work, especially if you need a large amount of it or have to crawl less mainstream sites! So Xiao Xi recommends first trying to work something out from existing datasets; modifying something ready-made can save an enormous amount of effort!

In fact, many datasets were made this way. For example, early on Socher took the sentiment classification dataset MR[16], which has only about 10,000 samples, decomposed its sentences into phrases and clauses, and annotated those separately. The result was SST[17], with more than 200,000 samples and multiple granularities ╮( ̄▽ ̄)╭. Also, I recently happened to read a paper on text style control[18] that used a parser to break up the Yelp sentiment classification dataset[19] and then processed it heavily, turning it into a structure->text stylized text generation dataset (a parser really is a great dataset-building tool). In short, once you try it you will see that modifying is much more convenient than crawling.
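
As a rough sketch of the "decompose and re-use" idea (not the actual SST pipeline, which used a constituency parser and human re-annotation; here clauses are split naively on punctuation and simply inherit the parent label, a noisy assumption you would need to spot-check):

```python
import re

# A couple of toy (text, label) pairs standing in for an existing dataset.
parent_samples = [
    ("The plot is thin, but the acting is superb and the score is haunting.", "positive"),
    ("A tedious, overlong film that wastes a talented cast.", "negative"),
]

def split_clauses(text):
    """Naive clause splitting on commas/semicolons and a few conjunctions."""
    parts = re.split(r"[,;]| but | and ", text)
    return [p.strip() for p in parts if len(p.strip().split()) >= 3]

derived = []
for text, label in parent_samples:
    for clause in split_clauses(text):
        # Assumption: the clause inherits the sentence-level label.
        # This is exactly the kind of hypothesis you should verify by hand.
        derived.append((clause, label))

for clause, label in derived:
    print(f"{label:>8} | {clause}")
```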

Distant supervision

When it comes to annotation, the most obvious option is crowdsourcing. Nothing more to say about that; next.

A more economical approach is distant supervision, which is very playable, leaves lots of room for imagination, and can yield high quality.

The premise of distant supervision is a reliable hypothesis, for example: "Given a query-answer pair, if the answer string appears in a document recalled by a search engine, then that document can answer the query." That gives you the machine reading comprehension datasets TriviaQA[6] and SearchQA[7]. Another example: "The emoji contained in a tweet reflects the (fine-grained) emotion of that tweet," which gives you the sentiment classification dataset Twitter Sentiment[8] and the emotion-controlled dialogue generation dataset MojiTalk[9].

If you are not sure your hypothesis holds, sample some examples yourself and roughly estimate the fraction for which it is valid. As long as it holds in most cases, there is hope. Then add some more detailed constraints to the hypothesis (for example, in TriviaQA the answer must appear frequently enough in the document; in MojiTalk, tweets with multimedia content are discarded outright, and when a tweet contains several emojis only the most frequent one is kept). Under a sound hypothesis, a few iterations are usually enough to produce a workable dataset.
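
A minimal sketch of the emoji-style hypothesis above (the emoji-to-emotion mapping and the tweet fields are made up for illustration; MojiTalk's real construction keeps the emojis themselves as labels and applies more constraints):

```python
from collections import Counter

# Toy mapping from a handful of emojis to coarse emotion labels (an assumption
# for this sketch, not MojiTalk's actual label set).
EMOJI_TO_LABEL = {"😂": "joy", "😍": "love", "😭": "sadness", "😡": "anger"}

def label_tweet(tweet):
    """Apply the distant-supervision hypothesis to one tweet dict.

    Returns (text, label) or None if the hypothesis' constraints are not met.
    """
    if tweet.get("has_media"):          # constraint: drop multimedia tweets
        return None
    text = tweet["text"]
    found = [ch for ch in text if ch in EMOJI_TO_LABEL]
    if not found:
        return None
    # Constraint: if several emojis occur, keep only the most frequent one.
    top_emoji, _ = Counter(found).most_common(1)[0]
    clean_text = "".join(ch for ch in text if ch not in EMOJI_TO_LABEL).strip()
    return clean_text, EMOJI_TO_LABEL[top_emoji]  # emoji stripped to avoid label leakage

tweets = [
    {"text": "finally got tickets 😍😍", "has_media": False},
    {"text": "missed my flight 😭", "has_media": False},
    {"text": "check out this pic 😂", "has_media": True},
]
dataset = [pair for pair in map(label_tweet, tweets) if pair]
print(dataset)
```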

In short, being good at distant supervision means mastering reverse thinking: forget the word "labeling" and instead think "hold the label and go find the data."

OK, take a five-second break to let that sink in ↓(~ ∇)↓

Preprocessing: enough is enough

In fact, being a neat freak is not a good thing when building datasets, especially when lexical diversity and semantic richness are high; a regular expression that seems to make the dataset cleaner is likely to

  1. Wipe out valid patterns associated with the class labels, making some of the existing X->Y mappings disappear
  2. Reduce the model's opportunities to learn to fight noise. You can never eliminate all the noise, but you do eliminate many of the model's chances to learn to recognize and adapt to it

Xiao Xi has shed bitter tears over this. I once spent half an afternoon writing dozens of cleaning rules, and the result was a model that converged less well and performed worse on the dev set. In the end I found that as long as the data and the model are not too small, the principle of minimal preprocessing is generally enough. Apart from some conventional operations (filtering HTML tags and URLs, de-identification, deduplication, truncation, etc.), Xiao Xi generally only handles the following cases:

  1. Cases that cause "label leakage". This happens easily in simple tasks with typical labels, especially when there are many data sources. For example, if the goal of your task is for the model to judge emotion from the text's semantics, don't be lazy about emojis and emoticons: clean them up or at least control their proportion in the dataset.
  2. Cases where the sample is absurdly long, such as 100 identical emojis or "ha"s in a row
  3. Cases where reserved special tokens (e.g., BERT's [UNK], [PAD], [CLS], [SEP]) appear in the sample

Of course, if your dataset is for a specific task, remember to filter out NSFW content =,=. As for high-frequency typos and the little bits of grime that make you feel dirty, let them go unless you have a special need... (If you really want to eliminate them completely, change the data source instead; don't delude yourself that you can fight the hot garbage produced by the broad masses of the people!!)
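
To make this concrete, here is a minimal sketch of the kind of restrained cleaning described above (the regexes and thresholds are illustrative, not a canonical set; tune them against your own bad cases):

```python
import re

RESERVED = ("[UNK]", "[PAD]", "[CLS]", "[SEP]")  # e.g. BERT special tokens

HTML_TAG = re.compile(r"<[^>]+>")
URL = re.compile(r"https?://\S+")
# Collapse runs of 4+ identical characters (e.g. "hahahaha", long emoji runs) to 3.
LONG_REPEAT = re.compile(r"(.)\1{3,}")

def clean(text, max_len=512):
    """Conventional, minimal preprocessing: strip HTML/URLs, cap repeats, truncate."""
    text = HTML_TAG.sub(" ", text)
    text = URL.sub(" ", text)
    text = LONG_REPEAT.sub(r"\1\1\1", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text[:max_len]

def keep(text):
    """Drop only what we must: empty samples and samples containing reserved tokens."""
    return bool(text) and not any(tok in text for tok in RESERVED)

raw = [
    "<p>So good I cried 😭😭😭😭😭😭 http://t.co/xyz</p>",
    "[CLS] corrupted export line [SEP]",
]
cleaned = [clean(s) for s in raw]
dataset = [s for s in cleaned if keep(s)]
print(dataset)  # deduplication and de-identification would go here as well
```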

Verify usability and build an iterative dataset loop as early as possible

Whether the labels come from manual annotation or distant supervision, a dataset that looks finished is not necessarily usable. If the noise is too large or the label boundaries are too vague (lots of annotation errors, or annotation guidelines so loose and fuzzy that even humans cannot tell certain classes apart), a complex model may simply fail to converge on the dataset. On the other hand, if the dataset leaks labels (for example, you used emojis for distant supervision of sentiment labels but forgot to filter the emojis out), or there is a very direct mapping between labels and content (classes that are too specific, or annotation rules written too rigidly), then a very simple model can easily score nearly full marks on the dataset. In other words, such a simple and direct task could be handled by a few rules and a few lines of code, and there is no need for data-driven model training at all.

So never try to build the dataset in one shot. Instead, build a "generate dataset -> run baseline -> bad-case study -> update strategy -> regenerate dataset" loop as early as possible. Note that the baseline should not be fussy to use (forget models that are sensitive to all kinds of hyperparameters); ideally it is widely validated, has open-source code, is easy to run, and works reasonably well with almost no tuning (such as the BERT family).

In the early phase of iteration, the first goal is to get the baseline to converge properly on your dataset. In the middle phase, focus on how the baseline performs on your dev set: if it is suspiciously good, watch out for label leakage or data leakage (Y hiding inside X, or forgetting to deduplicate between splits); if it is too bad, tune the hyperparameters. Later on, pay more attention to bad cases, and figure out whether a bad case is a data problem (annotation noise) or a genuine lack of model capability.
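
The baseline suggested above is the BERT family; as an even quicker first-pass sanity check, here is a sketch with scikit-learn (a stand-in baseline of my own, not the author's recommendation), mainly to catch train/dev overlap and suspiciously easy labels before spending GPU time:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def sanity_check(train, dev):
    """train/dev are lists of (text, label). Returns dev accuracy of a trivial model."""
    # Data-leakage check: identical texts appearing in both splits.
    overlap = {t for t, _ in train} & {t for t, _ in dev}
    if overlap:
        print(f"warning: {len(overlap)} texts appear in both train and dev")

    vec = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X_train = vec.fit_transform([t for t, _ in train])
    X_dev = vec.transform([t for t, _ in dev])

    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_train, [y for _, y in train])
    acc = accuracy_score([y for _, y in dev], clf.predict(X_dev))

    # If a bag-of-ngrams model already scores near-perfectly, suspect label
    # leakage or an overly rule-like mapping from surface form to label.
    print(f"dev accuracy of trivial baseline: {acc:.3f}")
    return acc
```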

About complex NLP tasks

Of course, everything above is fairly general, and datasets can differ greatly across NLP problems. For simple NLP tasks such as text classification, the basic principles above are almost enough, but for complex NLP tasks such as task-oriented dialogue or knowledge graphs, even fully manual generation and annotation is hard to get right.

For example, datasets for task-oriented dialogue are hard to construct in the lazy, distant-supervision way, and the samples and labels may be difficult to produce without human annotation (the MultiWOZ[10] paper alone covers the subtasks of DST, act-to-text generation and context-to-text generation). That paper gives a good summary of the three ways of building task-oriented dialogue datasets: machine-to-machine (e.g. M2M[11]), machine-to-human (e.g. the DSTC series[12][13][14]) and human-to-human (e.g. ATIS[15] and the WOZ series[10]). It will make you appreciate how challenging producing a high-quality task-oriented dialogue dataset is; try to explore it from scratch on your own and you may well end up thoroughly confused ╮( ̄▽ ̄)╭

Therefore, when facing a complex NLP task, remember to first read the papers for the latest and most authoritative datasets carefully; the construction experience for this kind of dataset may not be found anywhere on WeChat or Zhihu.

References

[1] Bowman S R, Angeli G, Potts C, et al. A large annotated corpus for learning natural language inference[J]. arXiv preprint arXiv:1508.05326, 2015.
[2] Rajpurkar P, Zhang J, Lopyrev K, et al. SQuAD: 100,000+ questions for machine comprehension of text[J]. arXiv preprint arXiv:1606.05250, 2016.
[3] Wang A, Singh A, Michael J, et al. GLUE: A multi-task benchmark and analysis platform for natural language understanding[J]. arXiv preprint arXiv:1804.07461, 2018.
[4] Reddy S, Chen D, Manning C D. CoQA: A conversational question answering challenge[J]. Transactions of the Association for Computational Linguistics, 2019, 7: 249-266.
[5] Wang A, Pruksachatkun Y, Nangia N, et al. SuperGLUE: A stickier benchmark for general-purpose language understanding systems[J]. arXiv preprint arXiv:1905.00537, 2019.
[6] Joshi M, Choi E, Weld D S, et al. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension[J]. arXiv preprint arXiv:1705.03551, 2017.
[7] Dunn M, Sagun L, Higgins M, et al. SearchQA: A new Q&A dataset augmented with context from a search engine[J]. arXiv preprint arXiv:1704.05179, 2017.
[8] Go A, Bhayani R, Huang L. Twitter sentiment classification using distant supervision[J]. CS224N Project Report, Stanford, 2009.
[9] Zhou X, Wang W Y. MojiTalk: Generating emotional responses at scale[J]. arXiv preprint arXiv:1711.04090, 2017.
[10] Budzianowski P, Wen T H, Tseng B H, et al. MultiWOZ - A large-scale multi-domain Wizard-of-Oz dataset for task-oriented dialogue modelling[J]. arXiv preprint, 2018.
[11] Shah P, Hakkani-Tür D, Tür G, et al. Building a conversational agent overnight with dialogue self-play. 2018.
[12] Williams J, Raux A, Ramachandran D, et al. The dialog state tracking challenge. In Proceedings of the SIGDIAL 2013 Conference, pages 404-413, 2013.
[13] Henderson M, Thomson B, Young S J. Word-based dialog state tracking with recurrent neural networks. In Proceedings of SIGDIAL, 2014.
[14] Henderson M, Thomson B, Williams J D. The third dialog state tracking challenge. In Spoken Language Technology Workshop (SLT), pages 324-329. IEEE, 2014.
[15] Hemphill C T, Godfrey J J, Doddington G R. The ATIS spoken language systems pilot corpus. In Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, 1990.
[16] Pang B, Lee L. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL 2005.
[17] Socher R, Perelygin A, Wu J, et al. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of EMNLP 2013.
[18] Oraby S, Harrison V, Ebrahimi A, et al. Curate and Generate: A corpus and method for joint control of semantics and style in neural NLG[J]. arXiv preprint arXiv:1906.01334, 2019.
[19] Zhang X, Zhao J, LeCun Y. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, 2015: 649-657.


