This article should take about 20 minutes to read. It aims to leave you with two main takeaways:

  • Understand the development history and task boundaries of NLP
  • Learn how to quickly call an NLP model

1. Introduction

WantWords (万词王, literally "king of ten thousand words")

Some time ago, an open source project on GitHub gained 4K stars shortly after its release.

It is a reverse dictionary, and its greatest use is solving the tip-of-the-tongue problem: you know roughly what you want to say, but the exact word escapes you.

For example, enter a description: the person who flies the plane

Output candidates: pilot, captain, etc…

The site is wantwords.thunlp.org/

When I first saw this repository I was curious, so I took a quick look at how it works.

How can a target word be inferred from a description?

Overall approach

Reference: zhuanlan.zhihu.com/p/100382190…

Take a look at the repository's README for the general idea.

Core model

Processing flow:

First, the input description is segmented with a word segmentation tool; the result determines which of the two branches below is taken (a rough sketch of this dispatch follows the list):

  • If segmentation yields only one word, related words are scored against the pre-trained word-embedding weight table, and candidates that share a synonym-forest group with the input word get their scores boosted
  • If segmentation yields multiple words, the description is encoded into tokens by a BERT tokenizer, the multi-channel reverse-dictionary model computes a relevance score for each candidate word, the top-N indices are kept, and those indices are finally mapped back through the dictionary to the corresponding Chinese words
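To make the branching concrete, here is a minimal sketch of the dispatch logic. It is written under stated assumptions: score_single_word and score_multi_channel are hypothetical helper names standing in for the embedding lookup and the multi-channel model shown in the snippets further below, and index2word comes from the initialization step.

import torch

# Minimal sketch of the dispatch described above (helper names are hypothetical)
def reverse_lookup(description, lac, top_n=100):
    # Segment the description into words
    def_words = [w for w, p in lac.cut(description)]
    if len(def_words) == 1:
        # One word: score candidates by word-vector similarity,
        # boosting entries from the same synonym-forest group
        score = score_single_word(def_words[0])
    else:
        # Several words: BERT-encode the description and let the
        # multi-channel reverse-dictionary model score every candidate
        score = score_multi_channel(description)
    # Map the highest-scoring indices back to Chinese words via the dictionary
    indices = torch.argsort(score, descending=True)[:top_n]
    return [index2word[i] for i in indices]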

General implementation

This is not the focus of the article, so it doesn't matter if you don't follow every line. Just skim it.

Initialization work

# Tokenizer for Chinese BERT
tokenizer_Ch = BertTokenizer.from_pretrained('bert-base-chinese')

# Load the pre-trained lookup tables and feature matrices
(word2index, index2word, (wd_C, wd_sems, wd_POSs, wd_charas), mask_) = load_data()

# Build the synonym-forest lookup: word index -> indices of its synonyms
index2synset = [[] for i in range(len(word2index))]
for line in open(BASE_DIR + 'word2synset_synset.txt').readlines():
    wd = line.split()[0]
    synset = line.split()[1:]
    for syn in synset:
        index2synset[word2index[wd]].append(word2index[syn])

# Load the trained reverse-dictionary model
MODEL_FILE = BASE_DIR + 'en.model'
model = torch.load(MODEL_FILE, map_location=lambda storage, loc: storage)
model.eval()

Word segmentation

import thulac

# THULAC segments the description into (word, part-of-speech) pairs
lac = thulac.thulac()
fenci = lac.cut(description)
def_words = [w for w, p in fenci]
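For reference, THULAC returns (word, part-of-speech) pairs, so def_words keeps only the words. The example below is illustrative only; the exact segmentation and tags depend on the THULAC model you have installed.

# Illustrative only -- actual output depends on the installed THULAC model
# lac.cut("开飞机的人")  ->  [['开', 'v'], ['飞机', 'n'], ['的', 'u'], ['人', 'n']]
# def_words             ->  ['开', '飞机', '的', '人']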

Word count = 1: single channel

# Use the input word's vector to find related words and sort them; if a candidate
# is in the input word's synonym-forest group, its score is multiplied by 2.
# Tensor math: (137422, 200) x (200, 1) -> (137422, 1)
score = (model.embedding.weight.data).mm(model.embedding.weight.data[def_word_idx[0]])
if RD_mode == 'CC':
    # In CC (Chinese-to-Chinese) mode, exclude the input word itself; EC mode keeps it
    score[def_word_idx[0]] = -10.
score[np.array(index2synset[def_word_idx[0]])] *= 2
sc, indices = torch.sort(score, descending=True)
predicted = indices[:NUM_RESPONSE].cpu().numpy()
score = sc[:NUM_RESPONSE].detach().numpy()

Word count >1: multi-channel

Model repository: github.com/thunlp/Mult…

defi = '[CLS] ' + description
# Encode the text input with the BERT tokenizer (truncate to 80 tokens)
def_word_idx = tokenizer_Ch.encode(defi)[:80]
def_word_idx.extend(tokenizer_Ch.encode('[SEP]'))
# Convert the indexed tokens into a PyTorch tensor
definition_words_t = torch.tensor(np.array(def_word_idx), dtype=torch.int64, device=device)
# The multi-channel model scores every candidate word against the description
score = model('test', x=definition_words_t, w=words_t, ws=wd_sems, wP=wd_POSs,
              wc=wd_charas, wC=wd_C, msk_s=mask_s, msk_c=mask_c, mode=mode)
sc, indices = torch.sort(score, descending=True)
predicted = indices[:NUM_RESPONSE].detach().cpu().numpy()

Result conversion

# Look the predicted indices up in the dictionary to get the Chinese words
predicted_words = index2word[predicted]

That was a brief walkthrough of one open source NLP project. The key takeaway is the general processing outline (a minimal end-to-end sketch follows this list):

Step1, input processing

  • Segment and tokenize the text content so the model receives a numerical representation it can work with

Step2. Model processing

  • Feature extraction: apply a series of transformations, predict the most likely candidates, and obtain a score/weight tensor

Step3. Output processing

  • Translate the numerical weight tensor back into text content according to the dictionary
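As a minimal end-to-end sketch of these three steps with the Transformers library (previewing the hands-on section below; the model name, the sample sentence, and the top-k value are just example choices, not taken from the WantWords project):

import torch
from transformers import BertTokenizer, BertForMaskedLM

# Step 1: input processing -- turn text into token ids the model can read
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
inputs = tokenizer('今天天气很[MASK]。', return_tensors='pt')  # "The weather today is very [MASK]."

# Step 2: model processing -- run the model and get a score (logit) tensor
model = BertForMaskedLM.from_pretrained('bert-base-chinese')
model.eval()
with torch.no_grad():
    logits = model(**inputs).logits

# Step 3: output processing -- translate the highest-scoring ids back into text
mask_pos = (inputs['input_ids'][0] == tokenizer.mask_token_id).nonzero().item()
top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))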

Having looked at a single model, we still lack a clear picture of how NLP developed as a field and what directions people are researching today. Let's go over the history and current state of NLP so we have a basic global understanding of it.

2. Introduction to NLP

The development history

Reference: zhuanlan.zhihu.com/p/148007742

  • 1950-1970 – Adopting a rules-based approach

    • Researchers assumed that natural language processing works the way humans learn and understand language, and defined large numbers of hand-written rules based on this theory. But rules are limited and can only solve simple problems
  • 1970 to the early 21st century – Use of statistics-based methods

    • As technology developed and corpora grew richer, statistics-based approaches gradually replaced rule-based ones. NLP began to move from rationalism to empiricism, and from the laboratory to practical applications
  • 2008-2018 – Introduction of deep learning RNN, LSTM, GRU

    • Encouraged by successes in image recognition and speech recognition, researchers gradually introduced deep learning into NLP. From early word vectors to word2vec in 2013, the combination of deep learning and natural language processing reached a peak, with notable results in machine translation, question answering, reading comprehension, and other areas
  • Today,

    • In 2017 Google proposed the Transformer architecture, and at the end of 2018 it released BERT, a model built on the Transformer. As soon as it appeared, BERT showed excellent performance on 11 basic NLP tasks (a well-known benchmark leaderboard). Many current models are modified from, or reference, BERT
    • If you want to learn more about Transformer and BERT, these videos are worth watching:

      www.bilibili.com/video/BV1P4…

      www.bilibili.com/video/BV1Mt…

  • The BERT family of models

Current research directions

There are two directions

zhuanlan.zhihu.com/p/56802149

  • Natural language understanding (NLU)
  • Natural language generation (NLG)

11 common tasks, in four categories

  1. Sequence tagging: word segmentation / POS tagging / NER (named entity recognition) / semantic tagging

  2. Classification tasks: text classification / sentiment analysis

  3. Sentence relation judgment: textual entailment / QA / natural language inference

  4. Generative tasks: machine translation / text summarization (see the pipeline sketch after this list for one ready-made example per category)
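To give one ready-made example per category, here is a hedged sketch using Transformers pipelines; the task identifiers are standard pipeline names, and the default models they pull down are chosen by the library, not by this article:

from transformers import pipeline

# 1. Sequence tagging: named entity recognition
ner = pipeline('ner')
print(ner('ByteDance is headquartered in Beijing.'))

# 2. Classification tasks: sentiment analysis
clf = pipeline('sentiment-analysis')
print(clf('I really enjoyed this movie.'))

# 3. Sentence relation judgment: zero-shot classification (built on NLI models)
nli = pipeline('zero-shot-classification')
print(nli('The pilot landed the plane safely.', candidate_labels=['aviation', 'cooking']))

# 4. Generative tasks: summarization
summarizer = pipeline('summarization')
print(summarizer('Replace this placeholder with a long article to be summarized.'))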

With that, we understand the history and task boundaries of NLP (its common research directions).

Next, taking BERT as an example, let's look at how to call a model in practice.

3. Hands-on: calling a model

Before getting hands-on, two questions probably leave you puzzled 😳

Q1: Where can I find models?

A1: There is a well-established community that has collected a large number of models we can use directly: HuggingFace. (You can of course search GitHub yourself, but the results are scattered.)

Q2: How do you use a model?

A2: Don’t panic, HuggingFace provides a very detailed tutorial to get started quickly!

👇👇👇

A quick start on Hugging Face

Hugging Face

huggingface.co/

The company primarily provides NLP services, but it also runs an excellent 🐂 open source community where most open source models can be found.

  • It provides a library called Transformers, which offers thousands of pre-trained models supporting text classification, information extraction, question answering, summarization, translation, and text generation in over 100 languages. Transformers integrates seamlessly with PyTorch and TensorFlow

A small aside: the library's name has changed over time, so articles published at different times refer to it differently. Don't be confused; it is all the same library…

It was first called pytorch-pretrained-bert, then pytorch-transformers, and then Transformers 2.0.

On the site, switch to the Models list.

Use the left sidebar to filter by the task you want to do.

Click into any model; most model pages include a README description and an online inference demo widget.

A model can be loaded either remotely or offline. For offline use, download the resources from the Files and versions tab.
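A small sketch of both loading styles; the local directory path below is a hypothetical example and should point at wherever you saved the files from the Files and versions tab:

from transformers import BertTokenizer, BertForMaskedLM

# Remote: pass the model id; files are downloaded from the Hub and cached locally
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForMaskedLM.from_pretrained('bert-base-chinese')

# Offline: pass a local directory containing the downloaded files
# ('./models/bert-base-chinese' is a hypothetical path)
tokenizer = BertTokenizer.from_pretrained('./models/bert-base-chinese')
model = BertForMaskedLM.from_pretrained('./models/bert-base-chinese')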

Calling a model: hands-on with bert-base-chinese

Method 1: calling with PyTorch

PyTorch is a machine learning library from Facebook (TensorFlow's main rival). With friendly debugging and a stable API, it caught on quickly and is now used by more people than TensorFlow 🐶

import numpy as np
import torch
from transformers import BertTokenizer, BertForMaskedLM

# BERT uses [CLS] and [SEP] to mark the beginning and end of a sentence.
# The sample sentence means "Zhuge [MASK] is a figure from the Three Kingdoms period"
samples = ['[CLS]诸葛[MASK]是三国时期的人物[SEP]']
mask_index = 3  # position of [MASK]: [CLS]=0, 诸=1, 葛=2, [MASK]=3

# ---- step1: input (token) handling ----
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
# Tokenize into individual Chinese characters plus the special tokens
tokenized_text = [tokenizer.tokenize(i) for i in samples]
# Convert each token to its index in the vocabulary
input_ids = [tokenizer.convert_tokens_to_ids(i) for i in tokenized_text]
input_ids = torch.tensor(input_ids)

# ---- step2: model processing ----
model = BertForMaskedLM.from_pretrained('bert-base-chinese')
model.eval()  # inference only, no training
# The model's inputs and outputs are documented at
# https://huggingface.co/docs/transformers/main_classes/output
outputs = model(input_ids)

# ---- step3: output processing ----
sample = outputs.logits[0].detach().numpy()
pred = np.argsort(-sample[mask_index], axis=0)[:20]
print(tokenizer.convert_ids_to_tokens(pred))

Method 2: calling with a pipeline

Calling the model through PyTorch directly is still a bit of a hassle; we can also call it quickly with the pipeline API provided by HuggingFace.

A pipeline wraps model input and output in a unified way, which is more convenient.

A list of currently available pipelines

Huggingface. Co/docs/transf…

Call examples:

from transformers import pipeline

unmasker = pipeline('fill-mask', model='bert-base-chinese')
# "巴黎是[MASK]国的首都。" means "Paris is the capital of [MASK] country."
print(unmasker('巴黎是[MASK]国的首都。'))
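The pipeline returns a list of candidate fills; each entry is roughly a dict of the following shape (key names from the Transformers fill-mask pipeline, values here are only placeholders):

# Each candidate is a dict roughly like:
# {'sequence': <sentence with [MASK] filled in>, 'score': <probability>,
#  'token': <vocabulary id>, 'token_str': <predicted token>}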

Developing API interfaces

In Python you can build the API with frameworks such as Flask or Django (roughly analogous to Koa and Egg.js on the front end).
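As a minimal sketch of wrapping the pipeline behind an HTTP interface with Flask (the route name and request format are arbitrary choices for illustration, not from any existing project):

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)
# Load the model once at startup so each request only runs inference
unmasker = pipeline('fill-mask', model='bert-base-chinese')

@app.route('/fill-mask', methods=['POST'])
def fill_mask():
    # Expected request body: {"text": "巴黎是[MASK]国的首都。"}
    text = request.get_json()['text']
    return jsonify(unmasker(text))

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)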


Once the front end has an API to call, a finished product is not far away.

4. Summary

Having read this far, have the two goals set at the beginning been achieved?

  • Understand the boundaries of NLP and what the latest mainstream NLP models are
  • How to quickly call an NLP model

A few takeaways:

Let go of the psychological burden: as front-end developers who have never worked with artificial intelligence, it is easy to feel intimidated by it at first. In fact, as technology matures, a division of labor emerges: some of the work sinks deeper into specialization, while the barrier to entry for the rest keeps getting lower. As front-end engineers, we can understand new technology at the macro level and know where its boundaries lie.

In plain words: there are plenty of pre-trained models out there that can be used directly. We don't have to become "alchemists" (model trainers) ourselves to understand the general logic of a model.

Play to front-end strengths: front-end engineers are very sensitive to user interaction and quick to spot product pain points. By wrapping a layer around an NLP model, we can build more competitive and more interesting small products.

Broadly speaking: learn how to pick a model from Hugging Face, and you can integrate technologies such as NLP into your own products and build your own smart products.

❤️ Thank you

That's all for this share. I hope it helps you.

If you found it useful, don't forget to share, like, and bookmark it.

Welcome to follow our public account, ELab team, for more quality articles from big tech ~

We are from the front-end department of ByteDance, responsible for the front-end development of all ByteDance education products.

We focus on product quality, development efficiency, creativity, and cutting-edge technology, sharing professional knowledge and case studies to contribute experience and value to the industry, including but not limited to performance monitoring, component libraries, multi-platform technology, Serverless, visual page building, audio and video, artificial intelligence, and product design and marketing.

Bytedance internal promotion code: C4QC2V7

Post links: jobs.bytedance.com/campus/posi…