GitChat author: Li Jiaxuan. Translated transcript: starting from scratch, how to read an artificial intelligence paper and connect the paper with its code implementation.

The first part of this Chat:

First, I will explain how to read a machine learning paper from scratch and how to deal with the mathematics in it. Then, starting from a classic paper, I will explain how to quickly sort out and understand a deep learning framework and model.

There have been a lot of papers on artificial intelligence and machine learning recently. How should an engineer with an engineering background and little or some academic experience read an artificial intelligence paper?

At the beginning of my own academic exploration, I preferred to read papers closely from start to finish, especially the classic papers in deep learning, but I found that this approach took so much time that it crowded out my real goals: engineering implementation and engineering integration. Moreover, because I tried to absorb too much at once, I failed to grasp the core of each article, which made it easy to forget; a paper read yesterday slipped away like water I had drunk.

I’m going to talk to you about two things.

I. Reading a paper in levels, starting from scratch

“Starting from zero” here means starting from zero knowledge of the article: knowing what it does, what methods it uses, and what results it obtains, and deciding whether those methods and results are useful to me.

It does not mean plunging into papers the moment you enter a brand-new field. For unfamiliar territory I have never touched, my method is to read Chinese review articles first, then Chinese doctoral theses, and then English reviews. Through the Chinese reviews, we can first learn the basic terminology of the field and its common experimental methods.

Otherwise, if you start directly from the papers, the author's vantage point will not match our level, and it is easy to jump to conclusions or to understand nothing at all. So before reading an article, make sure you have a thorough grasp of the basic knowledge it involves.

At that point, we can return to understanding an article from scratch.

Reading a paper typically proceeds through three ascending levels:

Level 1. Skim the general information of the paper (5-10 minutes)

  • Read the title, abstract, and introduction carefully.

  • Read the section and sub-section headings, but skip the content in between.

  • Read the conclusion and discussion (the author usually discusses the shortcomings and deficiencies of this study here, and provides suggestions and directions for future research).

  • Go through the references and make a note of what you have already read.

After level 1, you should be able to answer the following five questions:

  1. Category: Is it an implementation paper? An analysis of existing systems? A description of a research theory?

  2. Context: Which other papers is it related to? What theoretical bases does the article build on?

  3. Correctness: Do the assumptions in the article appear to be valid?

  4. Contributions: Does the article significantly improve the state of the art? Does it innovate in methodology? Or does it refine the basic theory?

  5. Clarity: Is the paper clearly written?

Once you finish the first level, you can decide whether to go on to the second. The first level is enough for papers whose knowledge you may use someday, rather than needing right now.

Level 2. Grasp the content of the paper, ignoring the details (about 1 hour)

The second level requires careful reading to grasp the main points:

  1. Understand the meaning of diagrams and tables and the conclusions they support.

  2. Mark the unread references you consider important; they give you insight into the background of the article.

Completing the second level means knowing what evidence the paper uses and how it supports its conclusions.

In this level especially, if you run into something you do not understand (for many possible reasons: too many formulas, unfamiliar terminology, unfamiliar experimental methods, too many cited works), note that we are not yet on the same footing as the author. It is best to start with a few important references and fill in the background knowledge.

Level 3. Understand the paper in depth (5-6 hours)

If the article is something you want to apply to your current project, you need the third level. The goal is to be able to reproduce (re-implement) the paper under the same assumptions.

At the same time, look for the paper's corresponding code on GitHub; jumping into the program can accelerate understanding.

By comparing your reproduced results with the original paper, you can truly understand what is new about it, along with its implicit premises or assumptions. The reproduction can also suggest directions for your future work.

The advantage of this three-level approach is that you can reasonably estimate how long a paper will take to read, and even adjust the depth of reading to fit your time and the needs of your job.

II. How to read a machine learning paper heavy with mathematics

Heavy mathematics is common in AI papers, so at the first level it is generally unnecessary to follow every step of a formula. Skip the formulas; read the descriptions, the results, and the conclusions.

As you accumulate math in your daily work, at the second level you may be able to understand the author’s purpose and steps directly by looking at the formula.

If you must reach the third level, you may need to work through some derivations alongside the article. In practice, if the code is available, you can often understand the mathematics better from an engineering perspective.

Finally, I suggest you pick from the 128 papers below according to your own needs and field of interest, adjusting the reading level as you go. (I am reading through them myself; you are welcome to join.)

128 papers on machine learning in 21 major fields

The following two topics are explained using the TensorFlow architecture and system design paper, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems:

#### TensorFlow programming model and basic concepts

We explain the static graph model with fewer than 20 lines of code.

TensorFlow operates in the following four steps:

  1. Load data and define hyperparameters;

  2. Build a network;

  3. Train the model;

  4. Evaluate the model and make predictions.

Let’s take a neural network as an example to explain how TensorFlow works. In this example, we construct raw data satisfying the quadratic function y = ax² + b, then build the simplest neural network, with only an input layer, a hidden layer, and an output layer. Through TensorFlow we learn the weights and biases of the hidden and output layers, and watch whether the loss decreases as the number of training iterations grows.

Generate and load data

Start by generating the input data. Assuming the equation to be learned is y = x² − 0.5, we construct a set of x and y that satisfy it, and add some noise points that do not.

import tensorflow as tf
import numpy as np

# Construct data satisfying a quadratic equation of one variable.
# To make the points dense, generate 300 points evenly spaced in [-1, 1]
# with np.linspace, and use np.newaxis to turn the 1-D array of 300
# points into a 300x1 2-D array.
x_data = np.linspace(-1, 1, 300)[:, np.newaxis]
# Add noise with the same shape as x_data, drawn from a normal
# distribution with mean 0 and standard deviation 0.05.
noise = np.random.normal(0, 0.05, x_data.shape)
y_data = np.square(x_data) - 0.5 + noise  # y = x^2 - 0.5 + noise

Next define placeholders for x and y as variables to be entered into the neural network:

xs = tf.placeholder(tf.float32, [None, 1])
ys = tf.placeholder(tf.float32, [None, 1])

Building a Network Model

Here we need to build one hidden layer and one output layer. A layer of the neural network takes four parameters: the input data, the input dimension, the output dimension, and the activation function. Each layer is vectorized (y = weights × x + biases), and the final output is obtained after the nonlinearity of the activation function.

To define the hidden layer and output layer, sample code is as follows:

def add_layer(inputs, in_size, out_size, activation_function=None):
  # Construct the weights: an in_size x out_size matrix
  weights = tf.Variable(tf.random_normal([in_size, out_size]))
  # Construct the biases: a 1 x out_size matrix
  biases = tf.Variable(tf.zeros([1, out_size]) + 0.1)
  # Matrix multiplication
  Wx_plus_b = tf.matmul(inputs, weights) + biases
  if activation_function is None:
    outputs = Wx_plus_b
  else:
    outputs = activation_function(Wx_plus_b)
  return outputs  # the output data

# Build the hidden layer; here it has 20 neurons
h1 = add_layer(xs, 1, 20, activation_function=tf.nn.relu)
# Build the output layer with 1 neuron, the same as the input layer
prediction = add_layer(h1, 20, 1, activation_function=None)

Next, we construct the loss function: compute the error between the output layer's predictions and the true values, that is, square the difference between the two, sum, and take the mean. Gradient descent is then used to minimize the loss with a learning rate of 0.1:

# Compute the error between the predicted value and the true value
loss = tf.reduce_mean(tf.reduce_sum(tf.square(ys - prediction),
                                    reduction_indices=[1]))
# Minimize the loss by gradient descent with a learning rate of 0.1
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

Training the model

We let TensorFlow train for 1000 iterations and print the training loss every 50 iterations:

init = tf.global_variables_initializer()  # initialize all variables
sess = tf.Session()
sess.run(init)

for i in range(1000):  # train for 1000 iterations
  sess.run(train_step, feed_dict={xs: x_data, ys: y_data})
  if i % 50 == 0:  # print the loss value every 50 iterations
    print(sess.run(loss, feed_dict={xs: x_data, ys: y_data}))
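Step 4 of the four-step flow, evaluating the model and making predictions, is not shown in the original walkthrough. Here is a minimal sketch that reuses the `sess`, `prediction`, and `xs` defined above; the test points `x_test` are my assumption for illustration:

# Hypothetical evaluation step, reusing the session and graph above
x_test = np.linspace(-1, 1, 50)[:, np.newaxis]  # 50 fresh input points
y_pred = sess.run(prediction, feed_dict={xs: x_test})
print(y_pred[:5])  # should approximate y = x^2 - 0.5 at those points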

#### Basic implementation of TensorFlow

This covers: devices, the distributed execution mechanism, cross-device communication, and gradient computation.
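As a small illustration of devices and cross-device communication (a sketch I am adding, not code from the paper): when ops are pinned to different devices with tf.device, TensorFlow inserts Send/Recv node pairs on any edge that crosses a device boundary.

import tensorflow as tf

with tf.device("/cpu:0"):
  a = tf.constant([[1.0, 2.0]])      # placed on the CPU
with tf.device("/gpu:0"):
  b = tf.matmul(a, tf.transpose(a))  # placed on the GPU; the a -> b edge
                                     # crosses devices, so TensorFlow adds
                                     # Send/Recv nodes for it automatically

# allow_soft_placement falls back to the CPU if no GPU is present
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
  print(sess.run(b))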

TensorFlow has two distributed modes, data parallelism and model parallelism, of which data parallelism is the more common. The principle of data parallelism is simple, as shown in figure 1: the CPU is mainly responsible for gradient averaging and parameter updating, while GPU1 and GPU2 are responsible for training model replicas. They are called “model replicas” because each trains on a subset of the training samples, so the models have some independence from one another.

[Figure 1: data-parallel training. The CPU averages gradients and updates parameters; GPU1 and GPU2 train model replicas]

The specific training steps are as follows.

  1. The model network structure is defined on GPU1 and GPU2 respectively.

  2. Each GPU reads a different data block from the data pipeline, performs forward propagation to compute the loss, and then computes the gradients of the current variables.

  3. The gradients output by all GPUs are transferred to the CPU, which first averages them and then updates the model variables.

  4. Repeat steps 2 and 3 until the model variables converge.

The main purpose of data parallelism is to improve the efficiency of SGD. For example, if one SGD mini-batch holds 1000 samples, then splitting it into 10 shards of 100 samples each, with 10 model replicas, lets the whole batch be computed simultaneously.
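To make this concrete, here is a minimal sketch of the gradient-averaging pattern in TF 1.x. This is my own illustration, not code from the paper; `tower_loss` is a hypothetical function that runs the forward pass on one GPU's data shard and returns its loss.

opt = tf.train.GradientDescentOptimizer(0.1)
tower_grads = []
for i in range(2):                         # one model replica ("tower") per GPU
  with tf.device("/gpu:%d" % i):
    loss = tower_loss(i)                   # hypothetical: loss on this GPU's shard
    tower_grads.append(opt.compute_gradients(loss))

with tf.device("/cpu:0"):                  # the CPU averages and applies updates
  avg_grads = []
  for grad_and_vars in zip(*tower_grads):  # group the gradients by variable
    grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
    mean_grad = tf.reduce_mean(tf.concat(grads, 0), 0)
    avg_grads.append((mean_grad, grad_and_vars[0][1]))
  train_op = opt.apply_gradients(avg_grads)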

However, the 10 replicas may compute at different speeds, some fast and some slow. When the CPU updates the variables, should it wait for all replicas to finish their mini-batch and then average, or let whichever finishes first update first, overwriting the earlier values?

This brings up the issue of synchronous versus asynchronous updates.

Distributed stochastic gradient descent means that model parameters are stored across different parameter servers, while worker nodes train on data in parallel and communicate with the parameter servers to obtain the model parameters. Parameter updating can be synchronous or asynchronous, namely Sync-SGD and Async-SGD, as shown in the figure:

[Figure: synchronous versus asynchronous stochastic gradient descent with parameter servers]

Synchronous stochastic gradient descent (also called synchronous updating or synchronous training) means that during training, the task on each worker node reads the shared parameters and performs gradient computation in parallel. Synchronization requires waiting until all worker nodes have computed their local gradients, merging and accumulating all of them, and applying a single update to the model parameters; in the next batch, each worker node trains only after it has received the updated model parameters.

The advantage of this scheme is that each training batch takes all worker nodes into account, so the loss decreases relatively stably. The disadvantage is that performance is bottlenecked by the slowest worker node; on heterogeneous devices, worker performance often differs, which makes this a significant drawback.

Asynchronous stochastic gradient descent (also called asynchronous updating or asynchronous training) means that the task on each worker node independently computes its local gradients and updates the model parameters asynchronously, without coordination or waiting.

The advantage of this scheme is that there is no performance bottleneck. The disadvantage is that when each worker node sends its computed gradients back to the parameter server, the parameter updates conflict, which affects the algorithm's convergence speed to some extent and makes the loss jitter as it decreases.

How to choose between synchronous update and asynchronous update? Is there a way to optimize it?

The implementation difference between synchronous and asynchronous updating lies mainly in the parameter server's update policy. When the data volume is small and the nodes' computing power is balanced, the synchronous mode is recommended; when the data volume is large and machines differ in computing performance, the asynchronous mode is recommended. Which to use can also be decided by experiment; generally, when there is enough data, asynchronous updating works better.
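For synchronous updating, TF 1.x provides tf.train.SyncReplicasOptimizer, which wraps an ordinary optimizer and aggregates gradients from a set number of workers before each update. A minimal sketch; the worker count and the reuse of `loss` from earlier are my assumptions for illustration:

opt = tf.train.GradientDescentOptimizer(0.1)
num_workers = 4  # assumed cluster size for illustration
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=num_workers,  # wait for this many worker gradients
    total_num_replicas=num_workers)
global_step = tf.train.get_or_create_global_step()
train_step = sync_opt.minimize(loss, global_step=global_step)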

The following shows how to create a TensorFlow server cluster and how to distribute the computation of a static graph across the cluster.

All nodes in the TensorFlow distributed cluster execute the same code. Distributed task code has a fixed structure:

# Step 1: Parse command-line flags for the cluster description
# (ps_hosts and worker_hosts) and this node's job_name and task_index
tf.app.flags.DEFINE_string("ps_hosts", "", "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("worker_hosts", "", "Comma-separated list of hostname:port pairs")
tf.app.flags.DEFINE_string("job_name", "", "One of 'ps', 'worker'")
tf.app.flags.DEFINE_integer("task_index", 0, "Index of task within the job")
FLAGS = tf.app.flags.FLAGS
ps_hosts = FLAGS.ps_hosts.split(",")
worker_hosts = FLAGS.worker_hosts.split(",")

# Step 2: Create a server for the current task node
cluster = tf.train.ClusterSpec({"ps": ps_hosts, "worker": worker_hosts})
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)

# Step 3: If the current node is a parameter server, call server.join()
# and wait indefinitely; if it is a worker node, continue to step 4
if FLAGS.job_name == "ps":
  server.join()

# Step 4: Build the model to be trained, i.e., the computation graph
elif FLAGS.job_name == "worker":
  # build the TensorFlow graph model here

  # Step 5: Create a tf.train.Supervisor to manage the model's training process
  sv = tf.train.Supervisor(is_chief=(FLAGS.task_index == 0), logdir="/tmp/train_logs")
  # The supervisor handles session initialization and restoring the model from checkpoints
  sess = sv.prepare_or_wait_for_session(server.target)
  # Loop until the supervisor stops
  while not sv.should_stop():
    pass  # train the model here

Using this code framework, distributed training can be run on the MNIST dataset. See the code at:

Github.com/tensorflow/…

Part 2 of this Chat will explain how to translate your own requirements into the kind of problem described in papers, and then implement it.

Take the recommendation system as an example:

www.textkernel.com/building-la…

We take the construction of the knowledge base in a recruitment recommendation system as an example to explain where and how NLP and knowledge graphs are introduced.

(1) Why build a knowledge base? Here is an example of knowledge-based search:

[Figure: an example of knowledge-based search]

That is, we want to structure the corresponding job description as a knowledge graph:

[Figure: a job description structured as a knowledge graph]

We know that a knowledge graph consists of entities and entity relations. Combined with the recruitment scenario, the entity databases should include a job database, an occupation database, a resume database, and an entity thesaurus. Entity relations may include affiliation, hierarchy, and association relations.

We then perform structured extraction on the job description to design the label system for entity relations, as follows:

[Figure: the entity-relation label system for job descriptions]

How do we extract it? (A code sketch of steps 1 and 2 follows the list below.)

  1. Look for positioning words and punctuation marks, and cut the text into short sentences.

    Job Description :(Assistant/apprentice): Responsible for the bar, follow the master to make drinks, cut and match fruit bowls, snacks; Remuneration: The basic salary of regular employees is 3000-3500 yuan/month + bonus + five insurances and one housing fund, and the company provides free food and accommodation

  2. Locate the core content based on feature words in the short sentences

    (Assistant/apprentice) bar, follow the master to make drinks, cut and match fruit dishes, snacks basic salary of regular staff 3000-3500 yuan/month + bonus + five insurances and one housing fund The company will arrange the nearest working place according to the residence of employees

  3. Core word extraction

    Assistant apprentice Bar counter make drinks cut with compote snacks Base salary 3000-3500 yuan/month bonus five insurances and one housing fund the nearest place to work
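A minimal sketch of steps 1 and 2 above; this is my illustration, and the seed-word list and the punctuation regex are assumptions rather than details from the article:

import re

SEED_WORDS = ["responsible", "salary", "bonus", "insurance"]  # assumed seeds

def extract_segments(job_description):
  # Step 1: cut the text into short sentences on punctuation marks
  segments = re.split(r"[;,:.;,::。、]", job_description)
  # Step 2: keep segments where a feature/positioning word locates core content
  return [s.strip() for s in segments
          if s.strip() and any(w in s.lower() for w in SEED_WORDS)]

text = ("(Assistant/apprentice): Responsible for the bar, follow the master "
        "to make drinks; Remuneration: basic salary 3000-3500 yuan/month + bonus")
print(extract_segments(text))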

In this process, how do we find the positioning words? There are generally three methods:

(1) Positioning words -> seed words -> more positioning words. For example:

[Figure: expanding positioning words through seed words]

(2) Based on part-of-speech tagging. For example:

Segment the text into words, tag each word's part of speech, and take the verbs, numerals, and quantifiers as positioning words.

[Figure: positioning words found via part-of-speech tagging]
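A minimal sketch of method (2) using the jieba library (my assumption; the article does not name a segmenter). In jieba's tagset, v marks verbs, m numerals, and q quantifiers:

import jieba.posseg as pseg

def find_positioning_words(text):
  # Segment and POS-tag, keeping verbs (v), numerals (m), and quantifiers (q)
  return [word for word, flag in pseg.cut(text)
          if flag in ("v", "m", "q")]

print(find_positioning_words("负责吧台,跟着师傅制作饮品,切配果盘"))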

(3) Based on grammar. For example:

A run of nouns, or an abbreviation, following a verb;

Phrase combinations with high co-occurrence frequency, such as verb + adjective or verb + adverb.

[Figure: positioning words found via grammar rules]

For part-of-speech tagging, see the Chinese Part-of-speech Tagging set:

Gist.github.com/luw2007/601…

Finally, we build up the recruitment knowledge base:

[Figure: the assembled recruitment knowledge base]

Finally, I hope you will read more papers, summarize and review what you have read, and practice on TensorFlow with the open-source implementations on GitHub. Accumulating a large number of papers in one field reveals many existing problems and opportunities.


Transcript: Li Jiaxuan: How to Learn TensorFlow from Artificial Intelligence Papers