Author | Emil Wallner

Translator | Yi Ying
Editor | Emily

AI Front introduction: In the next three years, deep learning will transform front-end development, speeding up prototyping and making software development easier.





The space only came to prominence last year when Tony Beltramelli published his pix2Code paper and Airbnb launched Sketch2Code.







Currently, the biggest obstacle to automating front-end development is computing power. However, we can use current deep learning algorithms and synthesized training data to start exploring artificial front-end automation right now.

In this article, we will teach a neural network how to write basic HTML and CSS code based on an image of a design prototype. Here is a brief overview of the process:

1) Feed a design image to the trained neural network

2) The neural network converts the image into HTML code

3) Render the output

We will build the neural network through three iterations.

The first iteration is the most basic version, built to get a feel for the moving parts. The second iteration, the HTML version, focuses on automating all the steps and explaining the layers of the neural network. The final iteration is the Bootstrap version, in which we create a model that can generalize, and explore the LSTM layer.

All the code is on Github and FloydHub.

The models are based on Beltramelli's pix2code paper and Jason Brownlee's tutorial on generating natural-language descriptions of images, and are written in Python and Keras, a framework on top of TensorFlow.

If you are new to deep learning, it is recommended that you have a general understanding of Python, back propagation, and convolutional neural networks.

The core logic

Let’s recap our objective briefly. We want to build a neural network that generates HTML/CSS code that corresponds to the screen capture.

When you train a neural network, you can give it a few screenshots and the corresponding HTML code.

As it learns, it predicts all the matching HTML tags one by one. When predicting the next tag, it receives a screenshot and all matching tags at that point.

This Google Sheet contains a simple sample of training data.

Creating a model that predicts word by word is the most common approach today, and it's the approach we'll use in this tutorial, although there are others.

Notice that the neural network gets the same screenshot for each prediction. That is, if it were to predict 20 words, it would receive the same pattern 20 times. At this stage, don’t worry about how the neural network works, but focus on the input and output of the neural network.

Let's look at how the previous tags are used. Suppose we want to train the network to predict the sentence "I can code". When it receives "I", it predicts "can". Next it receives "I can" and predicts "code". Each time it receives all the previous words and only has to predict the next one.
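As a rough illustration (a minimal sketch, not the exact code used here), one screenshot and its token sequence can be expanded into training pairs like this; the token list and file name are made up:

    # Expand one screenshot + token sequence into (image, context) -> next-token pairs.
    tokens = ["<start>", "I", "can", "code", "<end>"]

    def make_training_pairs(image, tokens):
        pairs = []
        for i in range(1, len(tokens)):
            context = tokens[:i]        # everything generated so far
            next_token = tokens[i]      # the token the network must predict
            pairs.append((image, context, next_token))
        return pairs

    # Every pair reuses the same screenshot; only the text context grows.
    for _, context, target in make_training_pairs("screenshot.png", tokens):
        print(context, "->", target)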

The neural network creates features from the data to connect the input and output. It builds an understanding of what is in each screenshot and of the HTML syntax, which gives it the knowledge it needs to predict the next tag.

When the trained model is used in the wild, the process is similar to training: text is generated one tag at a time, using the same screenshot each time. The difference is that instead of receiving the correct HTML tags directly, the network receives the tags it has generated so far and then predicts the next one. The prediction starts with the "start tag" and stops when it predicts the "end tag" or reaches the maximum length.

Hello World

Let's start by building a Hello World version. We will give the neural network a screenshot of a webpage that displays "Hello World!" and train it to generate the corresponding markup.

First, the neural network maps the screenshot into a list of pixel values. Each pixel has three RGB channels, and the value of each channel is between 0 and 255.

To make the neural network understand these tags, I use one hot encoding. So the sentence “I can code” can be mapped to:

The figure above includes the start and end tags. These tags tell the neural network when to start and when to stop its predictions.
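As a rough illustration (the real vocabulary and index order depend on the implementation), a one-hot encoder for this tiny vocabulary could look like this:

    # Illustrative only: a tiny vocabulary and its one-hot vectors.
    vocab = ["<start>", "I", "can", "code", "<end>"]

    def one_hot(token):
        vector = [0] * len(vocab)
        vector[vocab.index(token)] = 1
        return vector

    print(one_hot("can"))   # [0, 0, 1, 0, 0]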

For input data, we will use different sentences, starting with the first word and gradually adding each word. The output data is always one word.

Sentences follow the same logic as words. They also require the same input length, but here we are limiting the maximum sentence length, not the number of words. If the sentence is shorter than the maximum length, it is padded with empty words that consist entirely of zeros.

As you can see, the words are printed from right to left. This forces each word to change position with every training round, so the model learns the word order rather than memorizing the position of each word.
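Keras has a helper for this kind of padding. The sketch below assumes a hypothetical word-to-index mapping where "start", "I", and "can" map to 1, 2, and 3; padding on the left (the default) produces the right-aligned layout described above.

    from keras.preprocessing.sequence import pad_sequences

    # Token ids for "start I can" under an assumed word-to-index mapping.
    sequence = [[1, 2, 3]]
    # padding="pre" (the default) right-aligns the words, so the zeros come first.
    print(pad_sequences(sequence, maxlen=6))   # [[0 0 0 1 2 3]]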

There are four predictions in the figure below, with each row representing one prediction. On the left are the images represented by their three color channels (red, green, and blue) together with the previous words. Outside the brackets are the predictions, one after another, ending with a red square that marks the end.

In the Hello World version, we use three tokens: "start", "Hello World!", and "end". A token can be anything: a character, a word, or a sentence. Using character tokens requires a smaller vocabulary but constrains the neural network. Word-level tokens tend to perform best.

Here are our predictions:

The output

• 10 epochs: start start start

• 100 epochs: start Hello World! Hello World!

• 300 epochs: start Hello World! end
Pitfalls I ran into:
• The first version was built before I had collected any data. Early in the project, I managed to get hold of an old archive of over 38 million websites from Geocities. At the time, however, I was so carried away by the data that I didn't see how much work it would take to shrink a 100K-word vocabulary.

• Handling terabytes of data requires good hardware or a lot of patience. After running into several problems on my Mac, I finally opted for a powerful remote server. To keep things running smoothly, expect to rent a remote machine with an 8-core CPU and a 1 Gbps network connection.

• A lot of things didn't make sense until I understood the inputs and outputs. Input X is the screenshot plus the previously predicted tags, and output Y is the next predicted tag. Once I understood this, the relationship between them became easier to grasp, and it also became easier to build different architectures.

• Watch out for rabbit holes. Because this project intersects with many areas of deep learning, I repeatedly got sidetracked during the research. I spent a week writing an RNN from scratch, got obsessed with vector spaces, and was confused by other implementations.

• An image-to-code network is really an image captioning model. Even though I was aware of this, I still ignored many of the image captioning papers simply because I didn't think they were cool. Once I accepted this, my understanding of the problem accelerated.

Run the code on FloydHub

FloydHub is a training platform for deep learning. I came across it when I first got into deep learning, and I've been using it ever since to train and manage my experiments. You can install it and have your first model running within 10 minutes. It's the best option for training models on cloud GPUs.

If you're new to FloydHub, I recommend checking out their 2-minute installation tutorial and my 5-minute overview tutorial.

Clone the repository:


    Log in and start the FloydHub command-line tool:


    Running Jupyter Notebook on FloydHub’s cloud GPU machine:


All the notebooks are prepared in the FloydHub directory. Once it's running, you can find the first notebook here: floydhub/Helloworld/Helloworld.ipynb.

For more detailed guidance and an explanation of the flags, see my earlier article.

    The HTML version

    In this release, we will automate many of the steps in Hello World. This chapter will focus on creating a scalable implementation and the dynamic parts of a neural network.

    While this version is not ready to predict HTML based on random sites, it is still good for exploring the dynamic parts of a problem.


Overview

    The following figure shows what architectural components look like when expanded.

There are two main parts: an encoder and a decoder. The encoder creates image features and tag features. Features are the building blocks the network creates to connect the design images with the markup. At the end of the encoding stage, we attach the image features to each word in the previous tags.

The decoder then takes the combined image and tag features and uses them to create a next-tag feature, which is run through a fully connected neural network to predict the next tag.

Image features

Because we need to insert one screenshot for each word, this becomes a bottleneck when training the network (example). So instead of using the images directly, we extract only the information needed to generate the tags.

    We then encode the extracted information into image features using a convolutional neural network pre-trained on ImageNet.

    Features are extracted from layers before final classification.


In the end, we get 1536 images of 8×8 pixels, known as features. Although these features are hard for us to interpret, a neural network can extract the locations of objects and elements from them.
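For reference, these numbers match what you get from a pre-trained InceptionResNetV2 in Keras with the classification layers removed, which is one plausible way to extract such features; the screenshot path below is a placeholder.

    import numpy as np
    from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
    from keras.preprocessing.image import load_img, img_to_array

    # Pre-trained on ImageNet; include_top=False drops the final classification layers.
    extractor = InceptionResNetV2(weights="imagenet", include_top=False)

    image = img_to_array(load_img("screenshot.png", target_size=(299, 299)))
    image = preprocess_input(np.expand_dims(image, axis=0))

    features = extractor.predict(image)
    print(features.shape)   # (1, 8, 8, 1536)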

Tag features

In the Hello World version, we used one-hot encoding to represent the tags. In this version, we will use word embeddings for the input and keep one-hot encoding for the output.

We keep the way each sentence is constructed but change how we map the tokens. One-hot encoding treats every word as an isolated unit. Here, instead, we convert each word in the input data into a list of values. These values represent the relationships between different tags.


    The dimension of the word vector is 8, but it often varies between 50 and 500 depending on the size of the vocabulary.

    The eight numbers per word are similar to weights in general neural networks, used to map relationships between different words.
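A minimal Keras sketch of this mapping (the vocabulary size and sentence length are placeholders) uses an Embedding layer with an output dimension of 8:

    from keras.layers import Input, Embedding

    vocab_size, max_length = 17, 48   # placeholders, not the real values
    token_ids = Input(shape=(max_length,))
    # Each token id is mapped to a trainable vector of 8 values.
    word_vectors = Embedding(vocab_size, 8, input_length=max_length)(token_ids)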

    Neural networks can use these features to connect input and output data. For now, don’t worry about what they are, we’ll delve into this in the next section.

    The encoder

We run the word embeddings through an LSTM, which returns a sequence of tag features. These tag features are then run through a TimeDistributed dense layer.

In parallel with the word embeddings, the image features are processed. First, the image features are flattened so that all the values become one long list of numbers. We then apply a dense layer to this list to extract high-level features, which are subsequently linked to the tag features.

    This may be a little hard to understand, so let’s break down the process.

Tag features

We first run the word vectors through the LSTM layer. As shown below, all sentences are padded to the maximum length of three tokens.


    To mix signals and find higher-level patterns, we apply TimeDistributed dense layers to tag features. The TimeDistributed dense layer is the same as a normal dense layer, but with multiple inputs and outputs.

Image features

In the meantime, we prepare the images. We take all the mini image features and transform them into one long list. The information is unchanged, just reorganized.


As mentioned earlier, to mix signals and extract higher-level concepts, we apply a dense layer. And since we only have one input value to deal with, we can use a regular dense layer. Then, to connect the image features to the tag features, we copy the image features.

    In this case, we have three tag characteristics. Therefore, we get the same number of image features and tag features.

    Connect image features and tag features

    All sentences are padded to create three tag features. Since we have preprocessed the image features, we can now add an image feature for each tag feature.


    After adding each image feature to the corresponding tag feature, we finally get three groups of image tag feature combinations. We then use them as input to the decoder.
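Putting the encoder steps together, here is a hedged Keras sketch; the layer sizes, vocabulary size, and sentence length are illustrative rather than the exact values used here.

    from keras.layers import (Input, Embedding, LSTM, TimeDistributed, Dense,
                              Flatten, RepeatVector, concatenate)

    vocab_size, max_length = 17, 3   # placeholders: vocabulary size, padded sentence length

    # Tag branch: token ids -> embeddings -> LSTM -> TimeDistributed dense
    markup_input = Input(shape=(max_length,))
    markup_features = Embedding(vocab_size, 8, input_length=max_length)(markup_input)
    markup_features = LSTM(128, return_sequences=True)(markup_features)
    markup_features = TimeDistributed(Dense(128))(markup_features)

    # Image branch: pre-extracted features -> flatten -> dense -> repeated per token position
    image_input = Input(shape=(8, 8, 1536))
    image_features = Dense(128, activation="relu")(Flatten()(image_input))
    image_features = RepeatVector(max_length)(image_features)

    # One image feature is attached to every tag feature
    encoder_output = concatenate([image_features, markup_features])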

The decoder

    Here, we use image tag feature combinations to predict the next tag.


    In the example below, we use a combination of three image label features to output the next label feature.

Note that the return sequences setting of the LSTM layer is set to false here. The LSTM layer therefore returns a single predicted feature rather than a sequence with the length of the input. In our case, this is the feature for the next tag, which contains the information needed for the final prediction.
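A sketch of this decoder step in Keras (the input shape of three positions with 256 combined feature values is illustrative):

    from keras.layers import Input, LSTM

    # The concatenated image + tag features go in; a single 512-value feature comes out.
    combined_input = Input(shape=(3, 256))
    next_tag_feature = LSTM(512, return_sequences=False)(combined_input)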


    Final prediction

Like a traditional feedforward neural network, the dense layer connects the 512 values in the next-tag feature to the four final predictions. Suppose our vocabulary has four words: start, hello, world, and end.

The prediction for the vocabulary could be [0.1, 0.1, 0.1, 0.7]. The softmax activation in the dense layer produces a probability distribution between 0 and 1, with all predictions summing to 1. In this case, it predicts that the fourth word is the next tag. The one-hot encoding [0, 0, 0, 1] is then translated back into its mapped value, here "end".
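A minimal sketch of that final layer and of turning the softmax output back into a token (the 512-value input and the four-word vocabulary follow the example above):

    import numpy as np
    from keras.layers import Input, Dense

    vocab = ["start", "hello", "world", "end"]
    next_tag_feature = Input(shape=(512,))
    prediction = Dense(len(vocab), activation="softmax")(next_tag_feature)

    # Turning a softmax output back into a token:
    probabilities = np.array([0.1, 0.1, 0.1, 0.7])
    print(vocab[np.argmax(probabilities)])   # "end"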


    The output


    Here is the original website for reference.

Pitfalls I ran into:

• LSTMs are harder for me to understand than CNNs. Once I unrolled all the LSTMs, they became easier to grasp. Fast.ai's video on RNNs was very useful. Also, before trying to understand how the features work, focus on the input and output features themselves.

• It's much easier to build a vocabulary from the ground up than to shrink a huge one. This includes fonts, div sizes, hex color values, variable names, and plain words.

• Most libraries are created to parse text documents rather than code. In documents everything is separated by spaces, but for code you need custom parsing.

• You can extract features with a model trained on ImageNet. This may seem counterintuitive, since ImageNet has almost no web images. However, the loss was 30% higher compared to a pix2code model trained from scratch. I'd also be interested in using a pre-trained Inception-ResNet type of model based on web screenshots.

    The Bootstrap version

In our final version, we will use a dataset of generated Bootstrap websites from the pix2code paper. By using Twitter's Bootstrap, we can combine HTML and CSS and shrink the size of the vocabulary.

    We’ll make sure it generates labels for screenshots it hasn’t seen before, and we’ll delve into how it builds awareness of screenshots and labels.

    Instead of training on Bootstrap tags, we’ll use 17 simplified tokens and convert them to HTML and CSS. This dataset includes 1500 test screenshots and 250 validation images. Each screenshot has an average of 65 tokens, and a total of 96,925 training samples will be generated.

With some tweaks to the model from the pix2code paper, the model can predict web components with 97% accuracy (BLEU, 4-gram greedy search; more on this later).


    An end-to-end approach

In image captioning models, extracting features from a pre-trained model works very well. But after a few experiments, I found that pix2code's end-to-end approach works better for this problem. The pre-trained models have not been trained on web data and are tuned for classification.

In this model, we replace the pre-trained image features with a lightweight convolutional neural network. Instead of using max-pooling to increase the information density, we increase the strides, which preserves the position and color of the elements.
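As a rough sketch of what such a lightweight, strided convolutional encoder could look like in Keras (the filter counts and input size are illustrative, not the exact pix2code configuration):

    from keras.models import Sequential
    from keras.layers import Conv2D

    # Strides of 2 shrink the feature maps instead of max-pooling,
    # which helps preserve element positions and colors.
    image_model = Sequential([
        Conv2D(16, (3, 3), strides=2, activation="relu", input_shape=(256, 256, 3)),
        Conv2D(32, (3, 3), strides=2, activation="relu"),
        Conv2D(64, (3, 3), strides=2, activation="relu"),
    ])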


Two core models are used here: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The most common RNN is the long short-term memory (LSTM) network, which is what I'll cover.

    I’ve covered a lot of great CNN tutorials in previous posts, so I’ll just focus on LSTM.

    Understand the time step in LSTM

One of the difficulties with LSTMs is the concept of time steps. A vanilla neural network can be thought of as having two time steps: if you give it "hello", it predicts "world". But it struggles to predict more time steps. In the example below, the input has four time steps, one for each word.

An LSTM is a neural network designed for ordered information, which makes it suitable for input with time steps. If you unroll our model, you'll see something like the figure below. For each step down the recursion, you keep the same weights: one set of weights is applied to the previous output and another set to the new input.



The weighted input and output are combined and passed through an activation function, which becomes the output for that time step. Because we reuse the weights, they draw information from several inputs and build up knowledge of the sequence.

    Below is a simplified version of each time step in LSTM.
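To see why reusing the same weights lets the network accumulate knowledge about a sequence, here is a deliberately simplified recurrent step in NumPy; it is closer to a vanilla RNN than to a full LSTM, and the sizes are arbitrary.

    import numpy as np

    # One simplified time step: the same two weight matrices are reused every step.
    def step(previous_output, new_input, W_output, W_input):
        return np.tanh(W_output @ previous_output + W_input @ new_input)

    hidden_size, input_size = 4, 3
    W_output = np.random.randn(hidden_size, hidden_size) * 0.1
    W_input = np.random.randn(hidden_size, input_size) * 0.1

    state = np.zeros(hidden_size)
    for word_vector in np.random.randn(4, input_size):   # four time steps, one per word
        state = step(state, word_vector, W_output, W_input)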


    To understand this logic, I recommend you follow Andrew Trask’s excellent tutorial and build an RNN yourself from scratch.

    Understand the different cells in the LSTM layer

    The number of units in each LSTM layer determines its memory capacity and the size of each output feature. Again, one feature is a long string of numbers used to transfer information from layer to layer.

Each unit in the LSTM layer learns to keep track of a different aspect of the syntax. Below is a visualization of a unit that tracks the information in the row div, one of the simplified tags we used to train the Bootstrap model.


Each LSTM cell maintains a cell state. Think of the cell state as memory; the weights and activation functions modify the state in different ways. This lets the LSTM layers fine-tune which information to keep and which to discard for each input.

For each input, the LSTM layer passes on both its output features and the cell state, where each unit of the cell state holds one value. To understand how the components inside an LSTM interact, I recommend Colah's tutorial, Jayasiri's Numpy implementation, and Karpathy's lecture and article.



    Test accuracy

Finding a fair way to measure accuracy is tricky. Say you compare word by word: if your prediction is one word out of sync, your accuracy might be 0%. If you remove one word and that brings the prediction back into sync, you might end up with 99% accuracy.

I used BLEU, a best practice for machine translation and image captioning models. It breaks the sentence into four n-grams, sequences of 1 to 4 words. In the prediction below, "cat" is supposed to be "code".


To get the final score, you multiply each score by 25%: (4/5)*0.25 + (2/4)*0.25 + (1/3)*0.25 + (0/2)*0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408. The sum is then multiplied by a sentence-length penalty. Since the sentence length in our example is correct, the sum is the final score.

You can make it harder by increasing the number of grams; a four-gram model is the one that corresponds best to human translations. I recommend running a few examples with the code below and reading the Wikipedia page to deepen your understanding.
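Here is a small self-contained sketch of the simplified scoring described above; note that standard BLEU combines the n-gram precisions with a geometric mean and a brevity penalty rather than a plain weighted average, and the example sentences are invented.

    from collections import Counter

    def ngrams(tokens, n):
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    def precision(reference, prediction, n):
        ref_counts = Counter(ngrams(reference, n))
        matches = sum(min(count, ref_counts[gram])
                      for gram, count in Counter(ngrams(prediction, n)).items())
        return matches / max(len(ngrams(prediction, n)), 1)

    reference = "start i can code end".split()
    prediction = "start i can cat end".split()

    # Average of the 1- to 4-gram precisions, each weighted 25%, as in the text above.
    score = sum(0.25 * precision(reference, prediction, n) for n in range(1, 5))
    print(round(score, 3))   # roughly 0.408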


    The output


    Some links to output samples:

• Generated site 1 – Original 1
https://emilwallner.github.io/bootstrap/pred_1/

• Generated site 2 – Original 2
https://emilwallner.github.io/bootstrap/real_2/

• Generated site 3 – Original 3
https://emilwallner.github.io/bootstrap/pred_3/

• Generated site 4 – Original 4
https://emilwallner.github.io/bootstrap/pred_4/

• Generated site 5 – Original 5
https://emilwallner.github.io/bootstrap/pred_5/

Pitfalls I ran into:
• Understand the weaknesses of the different models instead of testing random ones. I started out trying random things such as batch normalization and bidirectional networks, and attempted to implement attention. Then I looked at the test data, saw that it could not accurately predict color and position, and realized the CNN had a weakness. This led me to replace max-pooling with larger strides. The validation loss went from 0.12 to 0.02, and the BLEU score increased from 85% to 97%.

• Only use pre-trained models if they are relevant. Given the small size of the dataset, I thought a pre-trained image model would improve performance. From my experiments, the end-to-end model is slower to train and requires more memory, but is 30% more accurate.

• Plan for some variance when you run the model on a remote server. On my Mac, the files are read in alphabetical order. On the remote server, they were located randomly. This created a mismatch between the screenshots and the code. The model still converged, but the validation results were 50% worse than after I fixed the issue.

• Make sure you understand the library functions, including the empty padding token in the vocabulary. When I didn't add the padding, the predictions never included any empty tokens. I only noticed this after looking at the output several times and seeing that the model never predicted the padding token; a quick check showed it wasn't even in the vocabulary. Also, use the same vocabulary order for training and testing.

    • Use a more lightweight model for testing. Using GRU instead of LSTM reduces each epoch cycle by 30% and has no significant impact on performance.

    The next step

Front-end development is an ideal field for applying deep learning: it's easy to generate data, and today's deep learning algorithms can map most of the logic.

    One of the most exciting areas is the application of attentional mechanisms to LSTM. This will not only improve accuracy, but also allow us to intuitively see where CNN is focusing when generating tags.

    The attention mechanism is also key to the communication between tags, stylesheets, scripts, and the back end. The attention layer tracks variables to ensure that the neural network can communicate between programming languages.

    But in the near future, the big question is how to find a scalable way to generate data so that fonts, colors, text and animation can be gradually added.

    Most of the progress so far has been in the process of turning sketches into template applications. In less than two years, we’ll be able to draw an application on a piece of paper and have front-end code in less than a second. Airbnb’s design team and Uizard have built two working prototypes.

    Some experiments
Getting started:
    • Run all models

    • Try different hyperparameters

    • Test a different CNN architecture

    • Add a bidirectional LSTM model

    • Implement the model with different data sets

Further experiments
    • Use the appropriate syntax to create a reliable random application or web page generator.

• Data for a sketch-to-app model. Automatically convert app or web screenshots into sketches, and use a GAN to create a variety of sketch types.

    • Apply an attention layer to visualize the focus of each prediction on the image, similar to this model.

• Create a framework for a modular approach. For example, have one encoder model for fonts, one for color, and one for typography, and combine them into a single decoder. Obtaining stable, solid image features would be a good start.

    • Provide the neural network with simple HTML components and teach it to generate animations using CSS.

    Original link:

    https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/
