
A month ago, we published an article titled “Will Artificial Intelligence Revolutionize Front-end Development in Three Years?”, which introduced “Screenshot-to-code-in-keras”, a project trending on GitHub at the time. In this project, a neural network uses deep learning to automatically turn design mockups into HTML and CSS code, and its author, Emil Wallner, claims that “AI will revolutionize front-end development within three years.”

As soon as this flag was planted, it sparked a heated discussion at home and abroad, with both excitement and worry, both praise and opposition. Emil Wallner has written a series of articles on the topic, most notably “Turning Design Mockups Into Code With Deep Learning”, in which he explains in detail how he built a front-end code generation model based on papers such as pix2code, and how he used CNNs and LSTMs to turn design mockups into HTML and CSS websites.

The following is the full text:

In the next three years, deep learning will transform front-end development, allowing rapid prototyping and lowering the barriers to software development.

Last year saw a breakthrough in the field, with Tony Beltramelli publishing a paper on Pix2Code [1] and Airbnb launching Sketch2Code [2].

Currently, the biggest obstacle to front-end development automation is computing power. However, we can now use deep learning algorithms, as well as synthetic training data, to explore the automation of manual front-end development.

In this article, we’ll show you how to train a neural network to write basic HTML and CSS code from a design diagram. Here is a brief overview of the process:

  • Feed a design image to the trained neural network

  • The neural network converts the design into HTML code

For a larger version, see: https://blog.floydhub.com/generate_html_markup-b6ceec69a7c9cfd447d188648049f2a4.gif

  • Render the output into a web page

We will build this neural network through three iterations.

First, we’ll build a simplified version to get familiar with the moving parts. The second version, the HTML version, will focus on automating each step and explaining the layers of the neural network. In the final version, the Bootstrap version, we will create a general model and explore the LSTM layer.

You can access our code as Jupyter Notebooks on GitHub [3] and FloydHub [4]. All FloydHub notebooks are in the ‘FloydHub’ directory, and everything local is in the ‘local’ directory.

These models are based on Beltramelli’s pix2code paper and Jason Brownlee’s image captioning tutorial [5]. The code is written in Python and Keras (a high-level framework on top of TensorFlow).

If you’re new to deep learning, I recommend familiarizing yourself with Python, back-propagation algorithms, and convolutional neural networks. You can read my three previous posts:

  • First week of learning Deep Learning [6]

  • Exploring the history of deep learning through programming [7]

  • Color black and white photos using neural network [8]

The core logic

Our goal can be summarized as: to build a neural network that can generate HTML and CSS code that matches the design diagram.

When training the neural network, we give it several screenshots together with the corresponding HTML.

The neural network learns by predicting matching HTML tags one by one. When predicting the next tag, the neural network looks at the screenshot and all the correct HTML tags up to that point.

Here’s a simple training example in a Google Sheet:

https://docs.google.com/spreadsheets/d/1xXwarcQZAHluorveZsACtXRdmNFbwGtN3WMNhcTdEyQ/edit?usp=sharing

Of course, there are other ways [9] to train neural networks, but creating a word-by-word prediction model is by far the most common practice, so we’ll use that in this tutorial as well.

Note that each prediction must be based on the same screenshot, so if the neural network needs to predict 20 words, it needs to look at the same screenshot 20 times. Putting aside how neural networks work for a moment, let’s look at the inputs and outputs of neural networks.

Let’s take a look at “previous HTML tags” first. Suppose we need to train a neural network to predict a sentence like: “I can code.” When it receives “I”, it predicts “can”. Next it receives “I can” and continues predicting “code”. That is, each time the neural network receives all the previous words, but only predicts the next word.
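To make the pairing concrete, here is a minimal sketch of how such (previous words, next word) training pairs can be built; the make_pairs helper is hypothetical and not part of the original code:

# A minimal sketch of building (previous words -> next word) training pairs.
# The make_pairs helper is illustrative, not from the original code.
def make_pairs(tokens):
    pairs = []
    for i in range(1, len(tokens)):
        pairs.append((tokens[:i], tokens[i]))  # all previous words -> next word
    return pairs

print(make_pairs(["I", "can", "code"]))
# [(['I'], 'can'), (['I', 'can'], 'code')]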

The neural network creates features from the data. It must connect the input data with the output data by creating features. It needs to build a representation to understand what is in the screenshot and the HTML syntax predicted. The knowledge accumulated during this process can be used to predict the next tag.

The practical application of a trained model is similar to the process of training the model. The model generates text one by one from the same screenshot. The difference is that you don’t have to supply the correct HTML tags; the model just takes the tags generated so far and predicts the next tag. The prediction starts with the “start” tag and terminates when the prediction reaches the “end” tag or exceeds the maximum limit. Here’s another example from Google Sheet:

https://docs.google.com/spreadsheets/d/1yneocsAb_w3-ZUdhwJ1odfsxR2kr-4e_c5FabQbNJrs/edit#gid=0

Hello World version

Let’s start by building a “Hello World” version: we give the neural network a screenshot of a web page displaying “Hello World!” and teach it to generate the corresponding HTML code.

For a larger version, see: https://blog.floydhub.com/hello_world_generation-039d78c27eb584fa639b89d564b94772.gif

First, the neural network converts the design into a series of pixel values, each containing three channels (red, blue and green) with values ranging from 0 to 255.

I use one-hot encoding [10] here to describe the way a neural network understands HTML code. The encoding of the sentence “I can code” is shown below:

The example above adds the “start” and “end” tags. These labels tell the neural network where to start and where to stop.
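As a rough sketch of what that encoding looks like (the vocabulary order here is an assumption chosen for illustration):

# A minimal sketch of one-hot encoding with "start" and "end" tokens added.
# The vocabulary order is an assumption for illustration only.
import numpy as np

vocabulary = ["start", "I", "can", "code", "end"]
word_to_index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    vector = np.zeros(len(vocabulary))
    vector[word_to_index[word]] = 1.
    return vector

sentence = ["start", "I", "can", "code", "end"]
encoded = np.array([one_hot(word) for word in sentence])
print(encoded)  # one row per token, with a single 1 marking each word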

We use sentences as input data, the first sentence contains only the first word, and then one new word is added each time. The output data is always a single word.

Sentences follow the same logic as words, but they also need to be padded to the same length. The upper bound for words is the size of the vocabulary, while the upper bound for sentences is the maximum sentence length. If a sentence is shorter than the maximum length, it is filled up with empty words — words consisting entirely of zeros.

As shown in the image above, the words are arranged from right to left, which forces each word to change position during each training round. This allows the model to learn the order of words instead of remembering the position of each word.

The chart below shows four prediction rounds, one per row. On the left of the equation are the image (represented by its red, green, and blue channel values) and the previous words. On the right are the predictions for that round, ending with the red square that marks the end.

# Imports assumed for this snippet (Keras 2.x with a TensorFlow backend)
import numpy as np
from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing.image import load_img, img_to_array
from keras.layers import Input, Dense, RepeatVector, LSTM, concatenate
from keras.models import Model

# Length of longest sentence
max_caption_len = 3
# Size of vocabulary
vocab_size = 3

# Load one screenshot for each word and turn them into digits
images = []
for i in range(2):
    images.append(img_to_array(load_img('screenshot.jpg', target_size=(224, 224))))
images = np.array(images, dtype=float)
# Preprocess input for the VGG16 model
images = preprocess_input(images)

# Turn start tokens into one-hot encoding
html_input = np.array(
            [[[0., 0., 0.],  # start
              [0., 0., 0.],
              [1., 0., 0.]],
             [[0., 0., 0.],  # start <HTML>Hello World!</HTML>
              [1., 0., 0.],
              [0., 1., 0.]]])

# Turn next word into one-hot encoding
next_words = np.array(
            [[0., 1., 0.],  # <HTML>Hello World!</HTML>
             [0., 0., 1.]])  # end

# Load the VGG16 model trained on imagenet and output the classification feature
VGG = VGG16(weights='imagenet', include_top=True)
# Extract the features from the image
features = VGG.predict(images)

# Load the feature to the network, apply a dense layer, and repeat the vector
vgg_feature = Input(shape=(1000,))
vgg_feature_dense = Dense(5)(vgg_feature)
vgg_feature_repeat = RepeatVector(max_caption_len)(vgg_feature_dense)
# Extract information from the input sequence
language_input = Input(shape=(vocab_size, vocab_size))
language_model = LSTM(5, return_sequences=True)(language_input)

# Concatenate the information from the image and the input
decoder = concatenate([vgg_feature_repeat, language_model])
# Extract information from the concatenated output
decoder = LSTM(5, return_sequences=False)(decoder)
# Predict which word comes next
decoder_output = Dense(vocab_size, activation='softmax')(decoder)
# Compile and run the neural network
model = Model(inputs=[vgg_feature, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# Train the neural network
model.fit([features, html_input], next_words, batch_size=2, shuffle=False, epochs=1000)

In the Hello World version, we use three tokens: “start”, “<HTML><center><H1>Hello World!</H1></center></HTML>”, and “end”. A token can stand for anything: a character, a word, or a sentence. Using characters as tokens requires a smaller vocabulary, but it limits what the neural network can learn. Using words as tokens tends to give the best performance.

Next, we make the predictions:

# Create an empty sentence and insert the start token
sentence = np.zeros((1, 3, 3))  # [[0,0,0], [0,0,0], [0,0,0]]
start_token = [1., 0., 0.]  # start
sentence[0][2] = start_token  # place start in empty sentence

# Making the first prediction with the start token
second_word = model.predict([np.array([features[1]]), sentence])

# Put the second word in the sentence and make the final prediction
sentence[0][1] = start_token
sentence[0][2] = np.round(second_word)
third_word = model.predict([np.array([features[1]]), sentence])

# Place the start token and our two predictions in the sentence
sentence[0][0] = start_token
sentence[0][1] = np.round(second_word)
sentence[0][2] = np.round(third_word)

# Transform our one-hot predictions into the final tokens
vocabulary = ["start", "<HTML><center><H1>Hello World!</H1></center></HTML>", "end"]
for i in sentence[0]:
    print(vocabulary[np.argmax(i)], end=' ')

The output

  • 10 epochs: start start start

  • 100 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> <HTML><center><H1>Hello World!</H1></center></HTML>

  • 300 epochs: start <HTML><center><H1>Hello World!</H1></center></HTML> end

The mistakes I made

  • Build a working first version, then collect data. Early in the project, I managed to download an old archive of the entire Geocities hosting site, which contained 38 million websites. Blinded by the potential of neural networks, I didn’t consider the huge amount of work required to generalize to a 100,000-word vocabulary.

  • Handling terabytes of data requires good hardware or a lot of patience. After running into several problems on my Mac, I had to use a powerful remote server. To keep the work flowing smoothly, expect to rent a remote machine with 8 CPU cores and a 1 Gbps network connection.

  • The key is figuring out the input and output data. The input, X, is a screenshot plus the previous HTML tags; the output, Y, is the next tag. Once I understood the input and output, everything else became easier to figure out, and experimenting with different architectures also became easier.

  • Stay focused and resist temptation. Because this project touches many areas of deep learning, I kept getting sidetracked: I spent a week writing RNNs from scratch, got absorbed in embedding vector spaces, and fell into the trap of exotic implementations.

  • An image-to-code network is really just an image captioning model in disguise. Even though I knew this, I ignored many of the image captioning papers because they seemed less cool. Being aware of this earlier would have helped me learn the problem space faster.

Run the code on FloydHub

FloydHub is a training platform for deep learning. I discovered this platform when I first started learning deep learning, and I’ve been using it to train and manage my deep learning experiments ever since. You can get your model up and running in under 10 minutes and it’s the best choice for running your model on a cloud GPU.

If you haven’t used FloydHub, please refer to the official “2-minute Setup Manual” or the “5-minute Starter Tutorial” I wrote [11].

Clone code repository:

git clone https://github.com/emilwallner/Screenshot-to-code-in-Keras.git

Login and initialize FloydHub command-line tools:

cd Screenshot-to-code-in-Keras
floyd login
floyd init s2c

Running Jupyter Notebook on FloydHub’s cloud GPU machine:

floyd run --gpu --env tensorflow-1.4 --data emilwallner/datasets/imagetocode/2:data --mode jupyter

All notebook files are stored in the ‘FloydHub’ directory, and everything local is in the ‘Local’ directory. After you run it, you can find the first notebook in the following file:

floydhub/Helloworld/helloworld.ipynb

If you want to learn more about the command parameters, please refer to my post:

https://blog.floydhub.com/colorizing-b&w-photos-with-neural-networks/

The HTML version

In this release, we will automate some of the steps in the Hello World model. In this section we will focus on how to make the model handle arbitrarily large amounts of input data and the key parts of building a neural network.

This version does not yet predict HTML for any site, but we will try to resolve key technical issues here and take a big step towards success.

Overview

We can expand the previous illustration as follows:

There are two main parts in the figure above. First, the encoder. The encoder builds the image features and the previous-tag features. Features are the building blocks the neural network creates to connect the design image with the HTML code. At the end of the encoder, we concatenate the image features with each word of the previous tags.

The other main part is the decoder. The decoder takes the combined design and HTML features and creates a feature for the next tag. This feature is then run through a fully connected neural network to predict the next tag.

Design image features

Since we need to pair a screenshot with every single word, this becomes a bottleneck when training the network. So instead of using the raw images, we extract only the information needed to generate the markup.

The extracted information is encoded into image features. This can be done with a convolutional neural network (CNN) pre-trained on ImageNet.

Since the last layer of the CNN is the classification layer, we extract the image features from the layer just before it.

We end up with 1536 feature maps of 8×8 “pixels” as our features. Although it is hard for us to interpret them directly, a neural network can extract the objects and positions of elements from them.
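As a minimal sketch of this feature extraction (consistent with the HTML-version code further down; the screenshot file name is assumed):

# Extract 8x8x1536 image features from a pre-trained CNN, dropping the
# final classification layer (include_top=False). Screenshot path is assumed.
import numpy as np
from keras.applications.inception_resnet_v2 import InceptionResNetV2, preprocess_input
from keras.preprocessing.image import load_img, img_to_array

cnn = InceptionResNetV2(weights='imagenet', include_top=False)

image = img_to_array(load_img('screenshot.jpg', target_size=(299, 299)))
image = preprocess_input(np.array([image], dtype=float))

features = cnn.predict(image)
print(features.shape)  # (1, 8, 8, 1536): 1536 feature maps of 8x8 "pixels"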

HTML tag features

In the Hello World version, we use one-hot encoding to represent HTML tags. In this version, we will use word embedding as the input information and the output will still be one-hot encoded.

We parse the sentences the same way as before, but the way each token is represented has changed. Whereas one-hot encoding treats each word as an isolated unit, word embedding converts each word in the input data into a list of numbers that capture relationships between HTML tags.

The word embedding in the example above is 8-dimensional; in practice, embeddings range from 50 to 500 dimensions, depending on the size of the vocabulary.

The eight numbers per word are weights, similar to those of a vanilla neural network. They represent the relationships between words (Mikolov et al., 2013 [12]).
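In Keras, this mapping is learned by an Embedding layer. Here is a minimal sketch with illustrative sizes (not the exact layer used later in the article):

# A minimal sketch of a learned word embedding: each token index is mapped
# to an 8-dimensional vector of trainable weights. Sizes are illustrative.
import numpy as np
from keras.models import Sequential
from keras.layers import Embedding

vocab_size = 17    # e.g. the simplified Bootstrap vocabulary
embedding_dim = 8  # 50-500 in practice, as noted above

model = Sequential()
model.add(Embedding(vocab_size, embedding_dim, input_length=3))
model.compile('rmsprop', 'mse')

tokens = np.array([[1, 5, 2]])      # a padded sentence of three token indices
print(model.predict(tokens).shape)  # (1, 3, 8): one 8-d vector per token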

That’s how we set up the HTML tag features. Neural networks use this feature to establish connections between input and output data. Don’t worry about the details just yet, we’ll talk more about this in the next section.

The encoder

We feed the word embeddings into the LSTM, which returns a sequence of tag features. These are then fed into a TimeDistributed dense layer — think of it as a dense layer with multiple inputs and outputs.

In parallel, the image features are first flattened: no matter how the values were structured, they are converted into one long list of numbers. Then a dense layer is applied to form higher-level features, which are finally concatenated with the HTML tag features.

This might be a little hard to understand, but let’s break it down.

HTML tag features

First we input the result of the word embedding into the LSTM layer. As shown in the figure below, all sentences are padded to the maximum length, i.e., three tokens.

To mix these signals and find higher-level patterns, we add a TimeDistributed dense layer to further process the tag features produced by the LSTM layer. The TimeDistributed dense layer is a dense layer with multiple inputs and outputs.

Image features

At the same time, we need to process the image. We convert all the features (small images) into a long array, where the information remains the same, just reorganized.

Similarly, to mix signals and extract higher-level information, we added a dense layer. Since there is only one input, we can use the normal Dense layer. To connect to the HTML tag features, we need to copy the image features.

In the example above we have three HTML tag features, so the number of final image features is also three.

Connect image features to HTML tag features

All the sentences are padded to create three tag features. Since the image features are now ready, we can add one image feature to each tag feature.

After the addition, we have three image-tag features, which are the input to the decoder.
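As a compact schematic of the encoder wiring described above (the layer sizes mirror the HTML-version code further down, but treat this as a sketch rather than the exact model):

# Schematic of the encoder: flatten + dense the image features, repeat them
# once per token, and concatenate with the LSTM/TimeDistributed tag features.
from keras.layers import (Input, Flatten, Dense, RepeatVector, Embedding,
                          LSTM, TimeDistributed, concatenate)

max_caption_len, vocab_size = 100, 300  # illustrative sizes

# Image side: 8x8x1536 CNN features -> flat vector -> dense -> repeated per token
image_features = Input(shape=(8, 8, 1536))
image_flat = Dense(128, activation='relu')(Flatten()(image_features))
image_repeated = RepeatVector(max_caption_len)(image_flat)

# Markup side: token indices -> embeddings -> LSTM -> TimeDistributed dense
language_input = Input(shape=(max_caption_len,))
x = Embedding(vocab_size, 200)(language_input)
x = LSTM(256, return_sequences=True)(x)
x = TimeDistributed(Dense(128, activation='relu'))(x)

# One combined image-tag feature per token position
encoder_output = concatenate([image_repeated, x])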

The decoder

Next, we use the image-tag combination to predict the next tag.

In the example below, we use the three image-tag feature pairs to output the next-tag feature.

Note that the LSTM layer here has return_sequences set to false, so instead of returning a sequence the length of the input, it only predicts one feature: the feature for the next tag, which contains the information for the final prediction.

The final prediction

The dense layer works like a traditional feedforward neural network, connecting the 512 numbers in the next-tag feature to the four final predictions — the four words in our vocabulary: start, hello, world, and end.

The dense layer’s softmax activation produces a probability distribution between 0 and 1, with all predictions summing to 1. For example, the prediction over the vocabulary might be [0.1, 0.1, 0.1, 0.7], meaning the fourth word is predicted to be the next tag. You then translate the one-hot encoding [0, 0, 0, 1] back to its mapped value, “end”.
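In code, turning the softmax output into a word is simply an argmax followed by a vocabulary lookup. A minimal sketch using the example values above:

# Map a softmax prediction back to a token: take the index with the highest
# probability and look it up in the vocabulary. Values match the example above.
import numpy as np

vocabulary = ["start", "hello", "world", "end"]
prediction = np.array([0.1, 0.1, 0.1, 0.7])  # softmax output, sums to 1

index = np.argmax(prediction)  # 3, i.e. the one-hot vector [0, 0, 0, 1]
print(vocabulary[index])       # "end"

The full code for the HTML version follows.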

# Load the images and preprocess them for inception-resnet
images = []
all_filenames = listdir('images/')
all_filenames.sort()
for filename in all_filenames:
    images.append(img_to_array(load_img('images/'+filename, target_size=(299, 299))))
images = np.array(images, dtype=float)
images = preprocess_input(images)

# Run the images through inception-resnet and extract the features without the classification layer
IR2 = InceptionResNetV2(weights='imagenet', include_top=False)
features = IR2.predict(images)

# We will cap each input sequence to 100 tokens
max_caption_len = 100
# Initialize the function that will create our vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)

# Read a document and return a string
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

# Load all the HTML files
X = []
all_filenames = listdir('html/')
all_filenames.sort()
for filename in all_filenames:
    X.append(load_doc('html/'+filename))

# Create the vocabulary from the html files
tokenizer.fit_on_texts(X)

# Add +1 to leave space for empty words
vocab_size = len(tokenizer.word_index) + 1
# Translate each word in text file to the matching vocabulary index
sequences = tokenizer.texts_to_sequences(X)
# The longest HTML file
max_length = max(len(s) for s in sequences)

# Intialize our final input to the model
X, y, image_data = list(), list(), list()
for img_no, seq in enumerate(sequences):
    for i in range(1, len(seq)):
        # Add the entire sequence to the input and only keep the next word for the output
        in_seq, out_seq = seq[:i], seq[i]
        # If the sentence is shorter than max_length, fill it up with empty words
        in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
        # Map the output to one-hot encoding
        out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
        # Add the image corresponding to the HTML file
        image_data.append(features[img_no])
        # Cut the input sentence to 100 tokens, and add it to the input data
        X.append(in_seq[-100:])
        y.append(out_seq)

X, y, image_data = np.array(X), np.array(y), np.array(image_data)

# Create the encoder
image_features = Input(shape=(8, 8, 1536,))
image_flat = Flatten()(image_features)
image_flat = Dense(128, activation='relu')(image_flat)
ir2_out = RepeatVector(max_caption_len)(image_flat)

language_input = Input(shape=(max_caption_len,))
language_model = Embedding(vocab_size, 200, input_length=max_caption_len)(language_input)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = LSTM(256, return_sequences=True)(language_model)
language_model = TimeDistributed(Dense(128, activation='relu'))(language_model)

# Create the decoder
decoder = concatenate([ir2_out, language_model])
decoder = LSTM(512, return_sequences=False)(decoder)
decoder_output = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[image_features, language_input], outputs=decoder_output)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

# Train the neural network
model.fit([image_data, X], y, batch_size=64, shuffle=False, epochs=2)

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    # seed the generation process
    in_text = 'START'
    # iterate over the whole length of the sequence
    for i in range(900):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0][-100:]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = np.argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += ' ' + word
        # Print the prediction
        print(' ' + word, end='')
        # stop if we predict the end of the sequence
        if word == 'END':
            break
    return

# Load an image, preprocess it for IR2, extract features and generate the HTML
test_image = img_to_array(load_img('images/87.jpg', target_size=(299, 299)))
test_image = np.array(test_image, dtype=float)
test_image = preprocess_input(test_image)
test_features = IR2.predict(np.array([test_image]))
generate_desc(model, tokenizer, np.array(test_features), 100)

The output

Links to the generated websites:

  • 250 epochs: https://emilwallner.github.io/html/250_epochs/

  • 350 epochs: https://emilwallner.github.io/html/350_epochs/

  • 450 epochs: https://emilwallner.github.io/html/450_epochs/

  • 550 epochs: https://emilwallner.github.io/html/450_epochs/

If you can’t see the page by clicking on the link above, you can choose “View source”. Below is a link to the original site for your reference only:

https://emilwallner.github.io/html/Original/

The mistakes I made

  • Compared with CNNs, LSTMs were far more complicated than I expected. To understand them better, I unrolled all the LSTMs; this video on RNNs also helped (http://course.fast.ai/lessons/lesson6.html). Make sense of the input and output features before digging into how they work.

  • It’s easier to create a vocabulary from scratch than to pare down large vocabularies. Vocabularies can include anything, such as fonts, div sizes, hexadecimal colors, variable names, and ordinary words.

  • Most libraries parse text documents well, but not code. In text, every word is separated by a space, but code is different, so you have to work out how to parse it yourself.

  • Extracting features from models trained on ImageNet may not be a good idea for this problem. Since ImageNet has few web page images, its loss was 30% higher than that of a pix2code-style model trained from scratch. I wonder what would happen if an inception-resnet-type model were trained on web screenshots.

The Bootstrap version

In the final version, the Bootstrap version, we use a dataset of generated Bootstrap websites from the pix2code paper. By using Twitter Bootstrap (https://getbootstrap.com/), we can combine HTML and CSS and shrink the size of the vocabulary.

Once trained, the model can take a screenshot it has never seen before and generate the corresponding HTML code. We can also dig into what it learns from these screenshots and the HTML.

Instead of training on raw Bootstrap HTML, we train on 17 simplified tokens that are later translated into HTML and CSS. The dataset [13] includes 1,500 training screenshots and 250 validation screenshots. With an average of 65 tokens per screenshot, this yields 96,925 training examples.

By tweaking the model from the pix2code paper, our model can predict the web page components with 97% accuracy (measured with BLEU, 4-n-gram greedy search; more on this later).

An end-to-end approach

An image captioning model could extract features from a pre-trained model, but after a few experiments I found that pix2code’s end-to-end approach extracts better features for our model, because pre-trained models have never seen web data and are tuned for classification.

In this model, we replace the pre-trained image features with a lightweight convolutional neural network. Instead of using max-pooling to increase information density, we increase the strides, which preserves the position and color of the front-end elements.
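Here is a minimal sketch of that idea, mirroring the convolutional encoder in the Bootstrap code below: down-sampling with strides=2 instead of max-pooling layers.

# Down-sample with strided convolutions rather than max-pooling, so the
# network keeps finer control over position and color of front-end elements.
from keras.models import Sequential
from keras.layers import Conv2D

encoder = Sequential()
encoder.add(Conv2D(16, (3, 3), activation='relu', padding='valid', input_shape=(256, 256, 3)))
encoder.add(Conv2D(16, (3, 3), activation='relu', padding='same', strides=2))  # halves H and W
encoder.add(Conv2D(32, (3, 3), activation='relu', padding='same', strides=2))
encoder.summary()  # spatial size shrinks via strides, with no pooling layers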

Two core models support this approach: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). The most commonly used recurrent neural network is the LSTM, which is what I’ll use here.

There are plenty of CNN tutorials, and I have covered CNNs in other articles, so here I will focus on LSTMs.

Understand timesteps in LSTMs

One of the hardest things to grasp about LSTMs is timesteps. A vanilla neural network can be thought of as having two timesteps: given “Hello” (the first timestep), it predicts “World” (the second timestep), but it cannot predict further timesteps. The example below has four timesteps, one for each word.

An LSTM is made for input with timesteps — a neural network that specializes in ordered information. If you unroll the model, you see that each downward step reuses the same weights: one set of weights is applied to the previous output and another set to the new input.

The weighted previous output and the weighted new input are added together and passed through an activation function; that is the output for the timestep. Because the weights are reused across timesteps, they draw information from several inputs and learn the order of the sequence.
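To see why reusing the weights lets the network learn order, here is a tiny vanilla-RNN step in NumPy (a simplification of the LSTM described here; sizes and values are illustrative):

# One recurrent step: the same two weight matrices are reused at every
# timestep, combining the new input with the previous output (hidden state).
import numpy as np

np.random.seed(0)
input_dim, hidden_dim = 3, 5
W_input = np.random.randn(hidden_dim, input_dim)    # weights for the new input
W_hidden = np.random.randn(hidden_dim, hidden_dim)  # weights for the previous output

def rnn_step(x_t, h_prev):
    # weighted input + weighted previous output, passed through an activation
    return np.tanh(W_input @ x_t + W_hidden @ h_prev)

h = np.zeros(hidden_dim)
for x_t in np.random.randn(4, input_dim):  # four timesteps, one per word
    h = rnn_step(x_t, h)
print(h)  # the final hidden state summarizes the whole sequence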

The following diagram illustrates the processing of each timestep in an LSTM using a simple legend.

To better understand this logic, I recommend that you follow Andrew Trask’s excellent tutorial [14] and try to create an RNN from scratch.

Understand the units in the LSTM layer

The number of units in an LSTM layer determines its memory capacity, as well as the size of each output feature. Again, a feature is a long list of values used to pass information from layer to layer.

Each unit in the LSTM layer is responsible for keeping track of different information in the syntax. The following figure shows an example of a cell that holds information for the layout row “div”. We simplified the HTML code and used it to train the Bootstrap model.

Each LSTM cell has a cell state. You can think of the cell state as the memory of the cell. Weights and activation functions can change states in a variety of ways. So the LSTM layer can fine-tune the information that needs to be saved and discarded for each input.

In addition to passing an output feature for each input, the LSTM also forwards the cell states, with each unit passing along its own cell state value. To understand how the components of the LSTM interact, I recommend:

Colah’s tutorial: https://colah.github.io/posts/2015-08-Understanding-LSTMs/

Jayasiri’s NumPy implementation: http://blog.varunajayasiri.com/numpy_lstm.html

Karpathy’s lecture and article: https://www.youtube.com/watch?v=yCC09vCHzF8; https://karpathy.github.io/2015/05/21/rnn-effectiveness/

dir_name = 'resources/eval_light/'

# Read a file and return a string
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

def load_data(data_dir):
    text = []
    images = []
    # Load all the files and order them
    all_filenames = listdir(data_dir)
    all_filenames.sort()
    for filename in (all_filenames):
        if filename[-3:] == "npz":
            # Load the images already prepared in arrays
            image = np.load(data_dir+filename)
            images.append(image['features'])
        else:
            # Load the boostrap tokens and wrap them in a start and end tag
            syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
            # Seperate all the words with a single space
            syntax = ' '.join(syntax.split())
            # Add a space after each comma
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

train_features, texts = load_data(dir_name)

# Initialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

# Add one spot for the empty word in the vocabulary
vocab_size = len(tokenizer.word_index) + 1
# Map the input sentences into the vocabulary indexes
train_sequences = tokenizer.texts_to_sequences(texts)
# The longest set of boostrap tokens
max_sequence = max(len(s) for s in train_sequences)
# Specify how many tokens to have in each input sentence
max_length = 48

def preprocess_data(sequences, features):
    X, y, image_data = list(), list(), list()
    for img_no, seq in enumerate(sequences):
        for i in range(1, len(seq)):
            # Add the sentence until the current count(i) and add the current count to the output
            in_seq, out_seq = seq[:i], seq[i]
            # Pad all the input token sentences to max_sequence
            in_seq = pad_sequences([in_seq], maxlen=max_sequence)[0]
            # Turn the output into one-hot encoding
            out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
            # Add the corresponding image to the boostrap token file
            image_data.append(features[img_no])
            # Cap the input sentence to 48 tokens and add it
            X.append(in_seq[-48:])
            y.append(out_seq)
    return np.array(X), np.array(y), np.array(image_data)

X, y, image_data = preprocess_data(train_sequences, train_features)

# Create the encoder
image_model = Sequential()
image_model.add(Conv2D(16, (3, 3), padding='valid', activation='relu', input_shape=(256, 256, 3,)))
image_model.add(Conv2D(16, (3, 3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(32, (3, 3), activation='relu', padding='same'))
image_model.add(Conv2D(32, (3, 3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
image_model.add(Conv2D(64, (3, 3), activation='relu', padding='same', strides=2))
image_model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))

image_model.add(Flatten())
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))
image_model.add(Dense(1024, activation='relu'))
image_model.add(Dropout(0.3))
image_model.add(RepeatVector(max_length))

visual_input = Input(shape=(256, 256, 3,))
encoded_image = image_model(visual_input)

language_input = Input(shape=(max_length,))
language_model = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
language_model = LSTM(128, return_sequences=True)(language_model)
language_model = LSTM(128, return_sequences=True)(language_model)

# Create the decoder
decoder = concatenate([encoded_image, language_model])
decoder = LSTM(512, return_sequences=True)(decoder)
decoder = LSTM(512, return_sequences=False)(decoder)
decoder = Dense(vocab_size, activation='softmax')(decoder)

# Compile the model
model = Model(inputs=[visual_input, language_input], outputs=decoder)
optimizer = RMSprop(lr=0.0001, clipvalue=1.0)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

# Save the model for every 2nd epoch
filepath = "org-weights-epoch-{epoch:04d}--val_loss-{val_loss:.4f}--loss-{loss:.4f}.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_weights_only=True, period=2)
callbacks_list = [checkpoint]

# Train the model
model.fit([image_data, X], y, batch_size=64, shuffle=False, validation_split=0.1, callbacks=callbacks_list, verbose=1, epochs=50)

Test accuracy

It’s hard to find a fair way to measure accuracy. Say you compare word by word: if your prediction is one word out of sync, you might score 0% accuracy, yet if you remove the one word that breaks the sync, you might end up with 99/100.

I used the BLEU score, which is best practice for evaluating machine translation and image captioning models. It breaks sentences into four n-grams, from sequences of one word up to four words. In the example below, “cat” in the prediction should actually be “code”.

To calculate the final score, you multiply each n-gram score by 25% and sum them: (4/5) × 0.25 + (2/4) × 0.25 + (1/3) × 0.25 + (0/2) × 0.25 = 0.2 + 0.125 + 0.083 + 0 = 0.408. The sum is then multiplied by a sentence-length penalty. Since the predicted sentence length is correct in this example, 0.408 is the final score.

Increasing the n-gram length makes it harder to get a high score; four n-grams best matches human translations. To learn more about BLEU, I recommend running a few examples with the code below and reading the wiki page [15].
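If you want to experiment with BLEU directly, NLTK implements it (the evaluation code below uses its corpus_bleu). Here is a minimal sketch with made-up token sequences and equal 25% weights per n-gram order:

# Compute a BLEU score with equal 25% weights on 1- to 4-grams.
# The reference and candidate token lists here are made up for illustration.
from nltk.translate.bleu_score import sentence_bleu

reference = [['start', 'i', 'can', 'write', 'html', 'and', 'css', 'end']]
candidate = ['start', 'i', 'can', 'write', 'html', 'and', 'cat', 'end']

score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)  # one wrong token lowers every n-gram precision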

# Create a function to read a file and return its content
def load_doc(filename):
    file = open(filename, 'r')
    text = file.read()
    file.close()
    return text

def load_data(data_dir):
    text = []
    images = []
    files_in_folder = os.listdir(data_dir)
    files_in_folder.sort()
    for filename in tqdm(files_in_folder):
        # Add an image
        if filename[-3:] == "npz":
            image = np.load(data_dir+filename)
            images.append(image['features'])
        else:
            # Add text and wrap it in a start and end tag
            syntax = '<START> ' + load_doc(data_dir+filename) + ' <END>'
            # Seperate each word with a space
            syntax = ' '.join(syntax.split())
            # Add a space between each comma
            syntax = syntax.replace(',', ' ,')
            text.append(syntax)
    images = np.array(images, dtype=float)
    return images, text

# Intialize the function to create the vocabulary
tokenizer = Tokenizer(filters='', split=" ", lower=False)
# Create the vocabulary in a specific order
tokenizer.fit_on_texts([load_doc('bootstrap.vocab')])

dir_name = '../../../../eval/'
train_features, texts = load_data(dir_name)

# load model and weights
json_file = open('../../../../model.json', 'r')
loaded_model_json = json_file.read()
json_file.close()
loaded_model = model_from_json(loaded_model_json)
# load weights into new model
loaded_model.load_weights("../../../../weights.hdf5")
print("Loaded model from disk")

# map an integer to a word
def word_for_id(integer, tokenizer):
    for word, index in tokenizer.word_index.items():
        if index == integer:
            return word
    return None
print(word_for_id(17, tokenizer))

# generate a description for an image
def generate_desc(model, tokenizer, photo, max_length):
    photo = np.array([photo])
    # seed the generation process
    in_text = '<START> '
    # iterate over the whole length of the sequence
    print('\nPrediction---->\n\n<START> ', end='')
    for i in range(150):
        # integer encode input sequence
        sequence = tokenizer.texts_to_sequences([in_text])[0]
        # pad input
        sequence = pad_sequences([sequence], maxlen=max_length)
        # predict next word
        yhat = loaded_model.predict([photo, sequence], verbose=0)
        # convert probability to integer
        yhat = argmax(yhat)
        # map integer to word
        word = word_for_id(yhat, tokenizer)
        # stop if we cannot map the word
        if word is None:
            break
        # append as input for generating the next word
        in_text += word + ' '
        # stop if we predict the end of the sequence
        print(word + ' ', end='')
        if word == '<END>':
            break
    return in_text

max_length = 48

# evaluate the skill of the model
def evaluate_model(model, descriptions, photos, tokenizer, max_length):
    actual, predicted = list(), list()
    # step over the whole set
    for i in range(len(texts)):
        yhat = generate_desc(model, tokenizer, photos[i], max_length)
        # store actual and predicted
        print('\n\nReal---->\n\n' + texts[i])
        actual.append([texts[i].split()])
        predicted.append(yhat.split())
    # calculate BLEU score
    bleu = corpus_bleu(actual, predicted)
    return bleu, actual, predicted

bleu, actual, predicted = evaluate_model(loaded_model, texts, train_features, tokenizer, max_length)

# Compile the tokens into HTML and css
dsl_path = "compiler/assets/web-dsl-mapping.json"
compiler = Compiler(dsl_path)
compiled_website = compiler.compile(predicted[0], 'index.html')

print(compiled_website)
print(bleu)

The output

Links to output examples

Site 1:

  • The generated website: https://emilwallner.github.io/bootstrap/pred_1/

  • The original web site: https://emilwallner.github.io/bootstrap/real_1/

Site 2:

  • The generated website: https://emilwallner.github.io/bootstrap/pred_2/

  • The original web site: https://emilwallner.github.io/bootstrap/real_2/

Site 3:

  • The generated website: https://emilwallner.github.io/bootstrap/pred_3/

  • The original web site: https://emilwallner.github.io/bootstrap/real_3/

Site 4:

  • The generated website: https://emilwallner.github.io/bootstrap/pred_4/

  • The original web site: https://emilwallner.github.io/bootstrap/real_4/

Site 5:

  • The generated website: https://emilwallner.github.io/bootstrap/pred_5/

  • The original web site: https://emilwallner.github.io/bootstrap/real_5/

The mistakes I made

  • Learn to understand your model’s weaknesses instead of blindly testing random changes. At first I tried random things like batch normalization and bidirectional networks, and attempted to add attention. After looking at the test data, I saw that the model could not predict color and position accurately and realized this was a weakness of the CNN. So I dropped max-pooling and increased the strides instead. As a result, the test loss went from 0.12 to 0.02 and the BLEU score increased from 85% to 97%.

  • Only use pre-trained models if they are relevant. Given the small dataset, I thought a pre-trained image model would help. My experiments showed that the end-to-end model, although slower to train and more memory-hungry, was 30% more accurate.

  • Be prepared for differences when running your model on a remote server. On my Mac, files were read in alphabetical order; on the remote server they were read randomly. This created a mismatch between the screenshots and the code. The model still converged, but after I fixed the problem the accuracy on the test data improved by 50%.

  • It is important to understand the library functions you use. The empty token in the vocabulary needs to include its space; when I didn’t add it, one token was left out. I only noticed after looking at the final output a few times and seeing that it never predicted that token — a quick check showed it wasn’t in the vocabulary at all. Also, make sure to use the same vocabulary order for training and testing.

  • Use lightweight models for testing. Replacing the LSTMs with GRUs cut the time per epoch by 30% without much impact on performance; a minimal sketch of the swap follows this list.
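A minimal sketch of that swap (GRU is a drop-in replacement for LSTM in Keras; the sizes follow the Bootstrap version above):

# GRU takes the same (units, return_sequences) interface as LSTM, so the
# language branch of the Bootstrap model can be swapped layer for layer.
from keras.layers import Input, Embedding, GRU

vocab_size, max_length = 17, 48  # sizes as in the Bootstrap version above

language_input = Input(shape=(max_length,))
x = Embedding(vocab_size, 50, input_length=max_length, mask_zero=True)(language_input)
x = GRU(128, return_sequences=True)(x)  # was LSTM(128, return_sequences=True)
x = GRU(128, return_sequences=True)(x)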

The next step

Front-end development is an ideal field for applying deep learning, because it is easy to generate data and today’s deep learning algorithms can map most of the logic.

One of the most interesting directions is applying an attention mechanism to the LSTM [16]. It not only improves accuracy, it also lets us visualize where the CNN is focusing as it generates the markup.

Attention is also key for communication between markup, stylesheets, scripts, and eventually the backend. Attention layers can keep track of variables, enabling the network to communicate across programming languages.

But in the short term, the biggest challenge is finding a scalable way to generate data. This will allow you to gradually add fonts, colors, words, and animations.

So far, most progress has been made in taking sketches and turning them into template apps. Within two years, we’ll be able to draw an app on paper and get the corresponding front-end code in under a second. The Airbnb design team [17] and Uizard [18] have already built two prototypes.

Here are some experiments worth trying.

Experiments

Getting started:

  • Run all models

  • Try different hyperparameters

  • Try different CNN architectures

  • Add a bidirectional LSTM model

  • Implement the model with a different dataset [19] (you can mount it on FloydHub with the “--data” flag: emilwallner/datasets/100k-html:data)

Advanced experiments

  • Create a generator that can reliably generate any application/web page using a specific syntax

  • Generate design-mockup data for the application models: automatically convert app or web page screenshots into design mockups and use GANs to create variation.

  • Apply an attention layer to visualize which part of the image the model focuses on for each prediction, similar to this model: https://arxiv.org/abs/1502.03044

  • Create a framework for a modular approach: for example, one model encodes fonts, one colors, another layout, and a decoder combines them. You can start with still image features.

  • Feed the neural network simple HTML components and train it to generate animations with CSS. Adding an attention module to observe the focus on the input sources would make it even more interesting.

Finally, many thanks to Tony Beltramelli and Jon Gold for their research, ideas and answers to various questions. Thank you Jason Brownlee for his stellar Keras tutorial (I added a few snippets from his tutorial to the core Keras implementation) and thank you Beltramelli for the data. Thank you Qingping Hou, Charlie Harrington, Sai Soundararaj, Jannes Klaas, Claudio Cabral, Alain Demenet, and Dylan Djian for reading this article.

Links

[1] pix2code paper: https://arxiv.org/abs/1705.07962

[2] sketch2code: https://airbnb.design/sketching-interfaces/

[3] https://github.com/emilwallner/Screenshot-to-code-in-Keras/blob/master/README.md

[4] https://www.floydhub.com/emilwallner/projects/picturetocode

[5] https://machinelearningmastery.com/blog/page/2/

[6] https://blog.floydhub.com/my-first-weekend-of-deep-learning/

[7] https://blog.floydhub.com/coding-the-history-of-deep-learning/

[8] https://blog.floydhub.com/colorizing-b&w-photos-with-neural-networks/

[9] https://machinelearningmastery.com/deep-learning-caption-generation-models/

[10] https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

[11] https://www.youtube.com/watch?v=byLQ9kgjTdQ&t=21s

[12] https://arxiv.org/abs/1301.3781

[13] https://github.com/tonybeltramelli/pix2code/tree/master/datasets

[14] https://iamtrask.github.io/2015/11/15/anyone-can-code-lstm/

[15] https://en.wikipedia.org/wiki/BLEU

[16] https://arxiv.org/pdf/1502.03044.pdf

[17] https://airbnb.design/sketching-interfaces/

[18] https://www.uizard.io/

[19] http://lstm.seas.harvard.edu/latex/