What is front-end intelligence

Front-end intelligence refers to the set of schemes for applying AI in the front-end field. One of the best known is Alibaba's ImgCook, which greatly speeds up turning design drafts into UI code.

However, AI has so far been explored very little in the front-end field; most of its value still lies in future possibilities. What matters is finding solutions that improve efficiency, standardize processes, and raise project quality in the front-end development workflow, or that bring richer functionality to front-end products.

juejin.cn/post/696640…

Introduction to deep learning

The process of machine learning can be understood as follows: through a large number of input and output samples, the model f(x, \theta^*) learns the parameters \theta so as to reduce the error of the model's predictions as much as possible.

  1. The relationship between machine learning and deep learning

  • Machine learning

Linear regression, logistic regression, decision tree, support vector machine, Bayesian model, regularization model, model integration, neural network…

There are many common machine learning methods. Most are nonlinear models (nonlinear through the feature mapping \phi) that can be summarized by the following formula:


f(x, \theta) = w^T \phi(x) + b
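As a minimal sketch of this formula, here is a tiny NumPy example with a hypothetical quadratic feature map and made-up weight values:

```python
import numpy as np

# Hypothetical feature map phi(x) = [x, x^2]
def phi(x):
    return np.array([x, x ** 2])

w = np.array([2.0, 1.0])  # weights (illustrative values)
b = 0.5                   # bias

def f(x):
    return w @ phi(x) + b  # f(x, theta) = w^T phi(x) + b

print(f(3.0))  # 2*3 + 1*9 + 0.5 = 15.5
```

The model is linear in the parameters w and b, but nonlinear in the input x thanks to the feature mapping.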

Traditional machine learning methods mostly rely on manually designed algorithms, which are simple and explicable, and can achieve good results in appropriate scenarios.

However, in many complex tasks, manually designed algorithms cannot achieve good results and face a significant performance bottleneck.

  • The neural network

The values of the neurons in layer l can be calculated with the following formulas, where W and b are the weights and biases respectively, and f is the activation function used to increase nonlinearity:


z_l = W_l \cdot a_{l-1} + b_l


a_l = f_l(z_l)

A common activation function is ReLU:


ReLU(x) = \begin{cases} x & x \geq 0 \\ 0 & x < 0 \end{cases}

There is a rigorous proof (the universal approximation theorem) that neural networks can approximate any continuous function.
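The layer formulas above can be sketched as a tiny two-layer forward pass; the weights here are made-up illustrative values, not trained parameters:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # ReLU(x) = max(0, x)

# Hypothetical 2-layer network:
# z_l = W_l @ a_{l-1} + b_l ;  a_l = f_l(z_l)
W1 = np.array([[1.0, -1.0], [0.5, 0.5]])
b1 = np.array([0.0, -0.5])
W2 = np.array([[1.0, 2.0]])
b2 = np.array([0.1])

def forward(x):
    a1 = relu(W1 @ x + b1)  # hidden layer with ReLU activation
    a2 = W2 @ a1 + b2       # output layer (no activation)
    return a2

print(forward(np.array([1.0, 2.0])))  # [2.1]
```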

  • Deep learning

Deep learning is based on artificial neural networks. By stacking very deep layers it obtains very strong nonlinear capacity and can fit almost any complex function. However, such models have hundreds of millions of parameters, which means they need a huge number of samples to keep adjusting those parameters before they finally reach a very good fit.

Loss function

We need to specify a loss function to measure how far the model's predictions deviate from the true answers. Only by accurately describing this error can the neural network know in which direction to adjust.

Our goal is to make the loss function as low as possible!

The common loss functions are cross entropy loss, MSE loss…

Cross-entropy loss: mainly used to measure the difference between two probability distributions, i.e. how far the predicted probability distribution is from the actual probability distribution.
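A minimal sketch of cross-entropy for a single sample, with made-up probability vectors: the closer the prediction is to the one-hot label, the lower the loss.

```python
import numpy as np

def cross_entropy(y_true, y_pred):
    # y_true: one-hot label; y_pred: predicted probability distribution
    return -np.sum(y_true * np.log(y_pred))

y_true = np.array([0.0, 1.0, 0.0])  # correct class is index 1
good = np.array([0.1, 0.8, 0.1])    # confident and correct
bad = np.array([0.6, 0.2, 0.2])     # confident and wrong

print(cross_entropy(y_true, good))  # ≈ 0.223 (low loss)
print(cross_entropy(y_true, bad))   # ≈ 1.609 (high loss)
```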

Parameter optimization algorithm – gradient descent method

With the loss function, we know how much room there is to optimize the parameters of the current neural network.

As we know, the direction of steepest descent on a surface is the negative gradient direction, so we update the parameters along the negative gradient of the loss function and back-propagate it through the network.

But the update magnitude needs to decay, or we might oscillate around the optimal solution.
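The update rule and step-size decay described above can be sketched on a toy loss L(w) = (w - 3)^2, whose gradient is 2(w - 3); the learning rate and decay factor are illustrative values:

```python
# Minimal gradient descent on L(w) = (w - 3)^2
w = 0.0   # initial parameter
lr = 0.1  # learning rate
for step in range(100):
    grad = 2 * (w - 3)  # gradient of the loss at w
    w -= lr * grad      # move against the gradient
    lr *= 0.99          # decay the step size to avoid oscillation
print(w)  # converges close to the optimum w = 3
```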

  1. Common deep learning models

  • CNN convolutional neural network

Unlike ordinary neural networks, convolutional neural networks are mainly used for image tasks; their input is a three-dimensional image matrix (height × width × RGB color channels).

convolution

pooling

  • RNN recurrent neural network

In feedforward neural network, the information transfer is unidirectional. But the connections between neurons in biological neural networks are more complex. RNN is a neural network with short-term memory ability.


h_t = f(h_{t-1}, x_t)

  • LSTM: a recurrent neural network based on gating

LSTM can control the rate at which information accumulates, including selectively adding new information and selectively forgetting previously accumulated information.
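As a simplified sketch of this gating idea, here is a scalar LSTM cell step with made-up shared weights and no biases (not the full formulation, just the structure of selective forgetting and adding):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# One step of a (hypothetical, tiny) scalar LSTM cell.
def lstm_step(x_t, h_prev, c_prev, params):
    Wf, Wi, Wo, Wc = params                  # illustrative scalar "weights"
    f = sigmoid(Wf * (x_t + h_prev))         # forget gate: how much old memory to drop
    i = sigmoid(Wi * (x_t + h_prev))         # input gate: how much new info to admit
    o = sigmoid(Wo * (x_t + h_prev))         # output gate
    c_tilde = np.tanh(Wc * (x_t + h_prev))   # candidate new memory
    c_t = f * c_prev + i * c_tilde           # selectively forget + selectively add
    h_t = o * np.tanh(c_t)                   # expose part of the memory as output
    return h_t, c_t

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.5]:
    h, c = lstm_step(x, h, c, (1.0, 1.0, 1.0, 1.0))
print(h, c)
```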

Case analysis

  1. Pix2Code: Automatically generates UI code from design drawings

The Pix2Code paper shows how to convert a plain image into front-end/client code!

Links to papers: arxiv.org/abs/1705.07…

GitHub: github.com/tonybeltram…

YouTube: youtu.be/pqKeXkhFA3I

(a) is the expected GUI, (b) is the GUI generated by this method; likewise for figures (c) and (d).

  1. A UI prototype image is first converted into a DSL (a domain-specific language, which can be understood as a structured object describing the UI)
  2. Source code is then generated from the DSL; this step is a manually programmed, rule-based conversion

The network structure

The method combines CNN and LSTM modules.

The overall architecture diagram is as follows:

CNN is good at extracting image features for understanding the semantic features of the input design draft: UI elements, layout, style, etc.

LSTM and RNN are good at learning text and sequence rules.

  1. A CNN network is used to understand the input GUI image and learn the visual features of the design draft;

  2. An LSTM network (left) is used to learn the contextual rules of the DSL: how one token (A) leads to the next token (B)

    1. This part has no relationship to the prototype image
    2. A token is a word, such as switch, button, etc.
  3. Another LSTM network (right) is used to learn the relationship between the DSL and the prototype image x, and outputs the next context token C

The data set

The data set consists of about 1,700 pairs of GUI images and DSL text.

Here is a data pair, with the image as the input and the DSL text as the expected output.

| The input | The label |
| --- | --- |
| UI image (the input during model training) | `header { btn-inactive, btn-inactive, btn-inactive, btn-active, btn-inactive } row { single { small-title, text, btn-red } }` (the DSL code is the ground truth of model training, i.e. the expected output) |
  • During the training phase, the network takes both UI images and DSL code as input.
  • In the inference phase, the network only takes the image to be predicted as input.

Training phase

First, preprocess the DSL text:

  1. Split to obtain a token list
  • Add every word in the DSL (including `,` and `\n`) to obtain a token array;
  • The token array begins with `<START>` and ends with `<END>`.

```python
token_sequence = [START_TOKEN]  # begin with <START>
for line in gui:
    line = line.replace(",", " ,").replace("\n", " \n")  # keep "," and "\n" as tokens too
    tokens = line.split(" ")
    for token in tokens:
        voc.append(token)             # collect the vocabulary
        token_sequence.append(token)  # collect tokens
token_sequence.append(END_TOKEN)      # end with <END>
```

```
['<START>', 'header', '{', '\n', 'btn-inactive', ..., '<END>']
```

```
[1, 3, 2, 0, ...]  # converted to vocabulary indices
```
  1. Slide a window of 48 tokens as the input context; the token that follows is the real label GT of the prediction (the prediction is compared with the real label to compute the loss)

Why 48: it is a hyperparameter; if the window is too small the model cannot capture global information (e.g. it cannot close parentheses). TODO

  • Slide 48 tokens from the token array of length N as a single context input for network training.
  • The array is padded in front with 48 empty strings, ensuring that the first input window is 48 empty strings and the 49th token is `<START>`.

```python
suffix = [PLACEHOLDER] * CONTEXT_LENGTH       # 48 empty tokens
a = np.concatenate([suffix, token_sequence])  # pad in front, so the first window is all placeholders
for j in range(0, len(a) - CONTEXT_LENGTH):
    context = a[j:j + CONTEXT_LENGTH]  # slide out 48 tokens as the input context
    label = a[j + CONTEXT_LENGTH]      # the next token: the label to predict
```

Network training

Now we have a picture I, 48 tokens forming x_t, and one token to be predicted as the label y_t.

One forward pass of the network is given by:


\begin{aligned} p &= CNN(I) \\ q_t &= LSTM(x_t) \\ r_t &= (q_t, p) \\ y_t &= softmax(LSTM'(r_t)) \\ x_{t+1} &= y_t \end{aligned}

Softmax converts a vector into a new vector whose entries sum to 1; semantically, it outputs a probability value for each category.


softmax(z)_i = \cfrac{e^{z_i}}{\sum_{c=1}^{C}{e^{z_c}}}
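A minimal NumPy sketch of this formula (the max subtraction is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # each entry is in (0, 1)
print(probs.sum())  # the entries sum to 1
```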

  1. First, compute the loss.

The prediction result y_t represents the probability of each token (the label is a one-hot vector).

The cross-entropy loss between y_t and the label measures the difference between the predicted result and the correct result, indicating the quality of the prediction.


L(I, X) = - \sum_{t=1}^{T}x_{t+1}\log(y_t)

This error is propagated backward, the model parameters are updated, and one iteration is complete.

  1. Append the label y_t from the last step to form a new 48-token window (the oldest token dequeues), use this window as the next round's x_t, and repeat the training process above until all token predictions are complete.
  2. Move on to a new data pair.

Code generation phase

  1. With the model trained above, first input 48 blank tokens as x_t together with the image I to be predicted, and predict the first token.
  2. The newly predicted token enqueues into x_t and the oldest token dequeues; the model then predicts the next token. Repeat this process until `<END>` is predicted.
  3. The predicted token list is decoded into DSL code and finally translated into GUI code.
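The decoding loop above can be sketched as follows; `predict_next` stands in for the trained model and is a hypothetical interface, not the pix2code API:

```python
CONTEXT_LENGTH = 48

def generate_dsl(predict_next, image, max_len=150):
    """Greedy decoding: slide a 48-token window, predicting one token at a time."""
    context = [""] * CONTEXT_LENGTH  # start from 48 blank tokens
    output = []
    for _ in range(max_len):
        token = predict_next(image, context)  # hypothetical model call
        if token == "<END>":
            break
        output.append(token)
        context = context[1:] + [token]  # oldest token dequeues, new one enqueues
    return output

# Toy stand-in for the trained model: emits a fixed DSL then <END>.
script = iter(["header", "{", "btn-active", "}", "<END>"])
print(generate_dsl(lambda img, ctx: next(script), image=None))
# → ['header', '{', 'btn-active', '}']
```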

The experiment

About 77% accuracy across all three platforms.

  1. SketchCode

Generate code from hand-drawn sketches

Code: github.com/ashnkumar/s…

Blog: blog.insightdatascience.com/automated-f…

The principle is not much different from Pix2Code, except that the dataset consists of hand-drawn sketches.

  1. Sketch2Code

sketch2code.azurewebsites.net/

Microsoft's version, which also turns hand-drawn sketches into code.

Improvement ideas

www.zhihu.com/question/43…

Final thoughts

Pure visual vs. non-visual

Tesla: Anyone who relies on lidar is doomed to failure

Andrej Karpathy, senior director of AI at Tesla, says "people don't drive with lasers shooting from their eyes."

As someone with only a superficial understanding of computer vision, I was not optimistic about Tesla's pure-vision scheme at the time. I thought CV's development bottleneck and safety were big problems, and that the results would surely be worse than a radar scheme, since radar can immediately provide accurate depth and distance information.

zhuanlan.zhihu.com/p/30856685

But Tesla's Autopilot performance proved me wrong. I later realized that visual information is semantically richer and has a higher ceiling. Compared with a radar scheme that only provides depth and distance information, a vision scheme is more intelligent and can make more complex decisions based on rich image details.

  • Radar solution: immediate depth and range information, but not much more.
  • Visual scheme: It is relatively difficult to read depth and distance information from pictures, but vision (light) can carry more information than other carriers. If a good enough model can be trained, better effects can be achieved.

The same is true of intelligently generated front-end code!

An accurate DSL can be delivered directly via software such as Sketch, but this solution will never be smart (though it is the most reliable solution that will work in the short term).

When we see a design draft, I believe we do not need to ask the designers to confirm:

  1. Which parts are dynamic copy
  2. Where the interactions are
  3. Where the layout is responsive
  4. Where links can be added

This is because we have so much prior knowledge about UI interactions in our brains. If you want the generated code to understand these relationships, you must use the semantic understanding of AI.

However, current AI schemes are still very primitive and far from usable, because too few people are studying this and the research is still at an early stage. But I believe AI will make us "unemployed" one day ~