It is recruitment season again, so here are some NLP questions that are frequently asked in interviews.

Part One: Basic questions

1. What are the common NLP tasks?

  • Sequence labeling tasks: POS tagging, NER, …
  • Classification tasks: sentiment analysis, intent recognition, …
  • Sentence-pair tasks: question answering, paraphrase/rewriting, …
  • Text generation tasks: machine translation, text summarization, …

2. Please introduce the text representation (word embedding) methods you know.

  • Count-based: one-hot, TF-IDF, TextRank (a small TF-IDF sketch follows this list);
  • Topic models: LSA (SVD), pLSA, LDA;
  • Static word embeddings: Word2Vec, FastText, GloVe;
  • Contextual (dynamic) word embeddings: ELMo, GPT, BERT
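As a quick illustration of the count-based representations, here is a minimal sketch using scikit-learn's TfidfVectorizer (assuming scikit-learn is installed); the toy corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, purely for illustration
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

vectorizer = TfidfVectorizer()           # for plain counts/one-hot-style features, use CountVectorizer
X = vectorizer.fit_transform(corpus)     # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # learned vocabulary (get_feature_names() in older scikit-learn)
print(X.shape)                             # (3, vocab_size)
```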

3. How do you generate a sentence vector?

  • doc2vec
  • BERT
  • Pooling word vectors: concatenation, averaging, or TF-IDF weighted averaging (a pooling sketch follows this list)
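A minimal sketch of the pooling approach: averaging pre-trained word vectors to get a sentence vector. The `word_vectors` lookup is a hypothetical dict-like mapping (e.g. loaded from a Word2Vec or GloVe file); the dimension is an assumption.

```python
import numpy as np

def sentence_vector(tokens, word_vectors, dim=300):
    """Average the word vectors of the tokens that are in the vocabulary."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vecs:
        return np.zeros(dim)      # fall back to a zero vector if every token is OOV
    return np.mean(vecs, axis=0)
```

A TF-IDF weighted variant simply replaces the mean with a weighted sum, using each token's TF-IDF weight.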

4. How do you compute text similarity?

  • Character-based: minimum edit distance
  • Vector-based: cosine or Euclidean distance after converting the texts to word/sentence vectors (see the sketch below)
  • Supervised: build a classifier
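A short sketch of the character-based and vector-based approaches; the edit distance is a plain dynamic-programming implementation, and the cosine similarity assumes vectors such as those from the previous question.

```python
import numpy as np

def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i
    for j in range(len(b) + 1):
        dp[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[-1][-1]

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence/word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))
```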

5. How do you handle class (sample) imbalance?

  • Oversampling
  • Undersampling
  • Text (data) augmentation

6. What are the symptoms of overfitting, and how do you address it?

Generally, when training accuracy is very high but test accuracy is much lower, the model is probably overfitting. Ways to address overfitting (a sketch combining a few of these follows the list):

  • Increase the amount of training data
  • Data augmentation
  • Add L1/L2 regularization
  • Dropout
  • Batch Normalization
  • Early stopping
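A minimal PyTorch sketch showing three of the remedies above in one place: dropout, L2 regularization via `weight_decay`, and a simple early-stopping check. The data, architecture, and patience value are arbitrary placeholders.

```python
import torch
import torch.nn as nn

# Toy data and classifier; sizes and architecture are placeholders
x_train, y_train = torch.randn(256, 128), torch.randint(0, 2, (256,))
x_val, y_val = torch.randn(64, 128), torch.randint(0, 2, (64,))

model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),                      # dropout
    nn.Linear(64, 2),
)
loss_fn = nn.CrossEntropyLoss()
# weight_decay applies L2 regularization to the parameters
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

best_val, patience, bad_epochs = float("inf"), 3, 0
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss_fn(model(x_train), y_train).backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val).item()

    # early stopping: quit when validation loss has not improved for `patience` epochs
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```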

7. Have you used jieba for word segmentation? Do you understand how it works?

  • Efficient word-graph scanning based on a prefix dictionary, building a directed acyclic graph (DAG) of all possible word segmentations of the characters in the sentence
  • Dynamic programming to find the maximum-probability path, i.e. the most probable segmentation based on word frequencies
  • For unknown words, an HMM model of Chinese word-formation ability, decoded with the Viterbi algorithm (a usage sketch follows below)
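A quick usage sketch of jieba (the example sentence is arbitrary): `jieba.lcut` returns the segmentation as a list, and the `HMM` flag toggles the HMM-based handling of unknown words described above.

```python
import jieba

sentence = "我来到北京清华大学"  # example sentence

print(jieba.lcut(sentence))                # default (precise) mode, HMM enabled
print(jieba.lcut(sentence, HMM=False))     # disable HMM-based unknown-word discovery
print(jieba.lcut(sentence, cut_all=True))  # full mode: every possible word in the DAG
```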

8. Do you know named entity recognition? What methods are usually used, and what are their characteristics?

  • CRF: requires hand-crafted features; fast to train
  • BiLSTM-CRF: learns features automatically; slower to train; needs a lot of annotated data
  • BERT-CRF: learns features automatically; slower to train; needs a lot of annotated data

9. Do you know HMM and CRF?

  • Both are probabilistic graphical models commonly used for sequence labeling; CRF generally performs better than HMM
  • HMM is a generative model, while CRF is a discriminative model

10. What about RNN and LSTM? What does LSTM add compared with a plain RNN?

  • RNN (Recurrent Neural Network) is a neural network for processing sequence data; unlike an ordinary feed-forward network, it can handle sequential inputs
  • LSTM (Long Short-Term Memory) is a special kind of RNN designed mainly to alleviate vanishing and exploding gradients when training on long sequences. Put simply, LSTM performs better than a plain RNN on longer sequences

11. Can you use regular expressions? What is the difference between re.match() and re.search()?

  • re.match() matches only at the beginning of the string; it returns a Match object on success and None on failure
  • re.search() scans the whole string and returns a Match object for the first occurrence, or None if there is no match (example below)
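A quick illustration of the difference using the standard library `re` module; the pattern and string are arbitrary.

```python
import re

text = "version: 3.8"

print(re.match(r"\d+\.\d+", text))           # None: the pattern is not at the start of the string
print(re.search(r"\d+\.\d+", text))          # Match object for '3.8'
print(re.search(r"\d+\.\d+", text).group())  # '3.8' - only the first occurrence is returned
```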

Part Two: Intermediate questions

1. What are the differences among ELMo, GPT, and BERT?

  • Similarities:
    • ELMo and BERT are both bidirectional models, but ELMo is really a concatenation of the outputs of two unidirectional networks, so its feature fusion is weaker than BERT's
    • The input is a sentence rather than isolated words
    • ELMo, GPT, and BERT all address polysemy (a word's representation depends on its context)
    • All are pre-trained on massive amounts of text
  • Differences:
    • GPT is a unidirectional (left-to-right) model
    • BERT masks part of its input (masked language modeling)
    • ELMo uses LSTM as its encoder, while GPT and BERT use the Transformer

2. How well do you know BERT? How does it work? Have you used it?

  • The Transformer network architecture
  • The principle of self-attention and how it is computed (see the sketch below)
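A minimal numpy sketch of scaled dot-product self-attention (single head, no learned projection matrices, no masking) to illustrate the computation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) pairwise attention scores
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

# Toy example: 4 tokens with dimension 8; Q = K = V for self-attention
X = np.random.randn(4, 8)
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```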

3. What is the difference between Word2Vec and FastText?

  • Similarities:
    • Both are unsupervised algorithms
    • Both can train word vectors
  • Differences:
    • FastText can also be used for supervised training (text classification). Its structure is similar to CBOW (CBOW takes the context of the target word as input and the target word as the label, whereas skip-gram is the opposite), but the learning target is a manually annotated label
    • FastText uses subword (character n-gram) information, which alleviates the OOV problem to some extent
    • FastText also introduces word n-grams, which capture some word-order information (see the sketch after this list)
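A minimal training sketch using gensim, assuming gensim 4.x (where the parameter is `vector_size`; older versions use `size`). The toy corpus and hyperparameters are made up for illustration.

```python
from gensim.models import Word2Vec, FastText

# Toy tokenized corpus, purely for illustration
corpus = [
    ["natural", "language", "processing", "is", "fun"],
    ["word", "vectors", "capture", "semantics"],
    ["fasttext", "uses", "subword", "information"],
]

w2v = Word2Vec(sentences=corpus, vector_size=50, window=3, min_count=1, sg=1)
ft = FastText(sentences=corpus, vector_size=50, window=3, min_count=1)

print(w2v.wv["language"].shape)   # (50,)
# FastText can compose a vector for an unseen word from its character n-grams,
# which is what alleviates the OOV problem
print(ft.wv["languages"].shape)   # (50,) even though "languages" is not in the corpus
```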

4. What is the time complexity of LSTM vs. the Transformer?

  • LSTM (recurrent layer): O(n · d²), where n is the sequence length and d is the representation dimension
  • Transformer (self-attention layer): O(n² · d)

So in terms of computational complexity, when the sequence length n is smaller than the representation dimension d, the self-attention layer is faster than the recurrent layer.

5. How do you reduce the inference time of a trained neural network?

  • Serve the model on GPU/TPU/FPGA
  • Pruning to reduce the number of parameters
  • Knowledge distillation (into a smaller Transformer or a simpler neural network)
  • Hierarchical softmax