F(X) Team of Alitao Department – Minchao

Introduction

Semanticization of interface elements has long been a puzzle for the D2C and AI communities. It is a key step for applying artificial intelligence in code-generation products (such as D2C) and plays a crucial role in producing human-friendly designs. At present, most common semanticization techniques work purely on text fields, using models such as TextCNN, Attention, and BERT. Although these methods perform well, they still have limitations when applied to D2C products: D2C aims to become an end-to-end system, and it is difficult to assign semantics from the text of a field alone. For example, the field "¥200" is hard to bind to the right semantics; it could mean the original price or a promotional price. Therefore, to solve the semanticization task for interface elements, D2C needs to address at least two problems: 1. generating element semantics that match the current interface; 2. reducing the burden on users so that no additional auxiliary information has to be supplied.

In recent years, reinforcement learning has performed well in many fields, such as AlphaGo, robotics, autonomous driving, and games, and its excellent performance has attracted much research attention. In this paper, aiming at the semanticization problem of interface elements, DRL (Deep Reinforcement Learning), inspired by game playing, is introduced, and a semanticization solution based on deep reinforcement learning is proposed. A reinforcement-learning formulation and training environment suited to the semanticization problem is constructed, and two different types of deep reinforcement learning algorithms are compared experimentally: 1. the value-function-based DQN algorithm; 2. the Actor-Critic-based DPPO algorithm. The experimental results demonstrate the effectiveness of the proposed method and the superiority of the DPPO-based semanticization scheme.

In this paper, the semanticization of interface elements is treated as a decision problem in context. We start directly from the interface image and feed it as input to a deep reinforcement learning model, which learns the optimal strategy (i.e., the optimal semantics) through a continuous "trial and error" mechanism. The specific work of this paper is as follows: 1. the key technologies, the DQN algorithm and the DPPO algorithm, are introduced in detail to explain how the approach works; 2. a game-like training environment is constructed for training the semanticization model with reinforcement learning; 3. the experimental results are analyzed.

For the detailed mathematical foundations of deep reinforcement learning, see my earlier overview: [A Technical Overview of Deep Reinforcement Learning] (zhuanlan.zhihu.com/p/283438275…

An overview of the technologies used in this article

Deep reinforcement learning algorithm DQN based on value function

In this paper, a value-function-based deep reinforcement learning algorithm is used as one of the concrete techniques for semantic field recognition. Algorithms of this type use a CNN to approximate the action-value function of traditional reinforcement learning; the representative algorithm is DQN. The framework of the DQN algorithm is shown in the figure below. As the figure shows, one characteristic of DQN is the use of a deep convolutional neural network to approximate the action-value function, where the states s and s′ are multidimensional data, and the experience pool stores the successes and failures encountered during training.

There are two neural networks with the same structure in DQN, called the target network and the evaluation network. The output of the target network is used to build the discounted target score for selecting action a in state s, namely:

y = r + γ · Q(s′, a′; θ⁻),  where a′ = argmax_a Q(s′, a; θ)

Here r and s′ are, respectively, the score obtained when action a is taken in state s and the corresponding next state; γ is the discount factor; a′ is the action with the largest value in the Q-value vector output by the evaluation network for state s′; and θ⁻ denotes the weight parameters of the target network.

The output of the evaluation network, Q(s, a; θ), represents the value of action a in state s, where θ denotes the weight parameters of the evaluation network.
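To make the interplay between the two networks concrete, here is a minimal NumPy sketch of how the target value y is assembled for a batch of transitions; the array values are made up for illustration, and q_eval_next / q_target_next simply stand in for the two networks' outputs on the next states:

import numpy as np

gamma = 0.9  # discount factor

# Q-value vectors for the next states s' (shape: [batch, num_actions]).
q_eval_next = np.array([[1.2, 0.3], [0.1, 0.8]])    # evaluation network Q(s', ·; θ)
q_target_next = np.array([[1.0, 0.5], [0.2, 0.7]])  # target network Q(s', ·; θ⁻)
rewards = np.array([1.0, -1.0])
terminals = np.array([0, 1])                         # 1 if the episode ended at s'

# a' is chosen from the evaluation network; its value is read from the target network.
a_prime = np.argmax(q_eval_next, axis=1)
y = rewards + (1 - terminals) * gamma * q_target_next[np.arange(len(a_prime)), a_prime]
print(y)  # [ 1.9 -1. ]

# The evaluation network is then trained to minimize (y - Q(s, a; θ))².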

DQN training can be divided into three stages. (1) Initial stage. At this point the experience pool D is not yet full; at each time step t a behavior is selected at random to obtain an experience tuple, which is then stored in the experience pool. This stage only accumulates experience; neither of DQN's two networks is trained.

(2) Exploration stage. This stage uses an ε-greedy strategy (ε gradually decreases from 1 to 0) to obtain action a: decisions are mostly made by the network, but other potentially optimal behaviors can still be explored with a certain probability, which avoids getting stuck in a local optimum. In this stage the experience tuples in the experience pool are continuously updated and fed into the evaluation network and the target network to obtain Q(s, a; θ) and the target y. The squared difference between them is used as the loss function, and the weight parameters of the evaluation network are updated by gradient descent. To help training converge, the weight parameters of the target network are updated as follows: every fixed number of iterations, the weights of the evaluation network are copied into the target network.

(3) Exploitation stage. In this stage ε falls to 0, that is, every selected action comes from the output of the evaluation network. The evaluation network and the target network are updated in the same way as in the exploration stage.

The value-function-based deep reinforcement learning algorithm DQN trains its networks through the three stages above. When training converges, the evaluation network approximates the optimal action-value function, which achieves the goal of learning the optimal policy.
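A minimal sketch of how these three stages reduce to an ε-greedy action-selection rule (the constants mirror those used in the training code further below; q_values_of is a stand-in for a forward pass of the evaluation network):

import random
import numpy as np

INITIAL_REPLAY_SIZE = 4000   # length of the initial, experience-gathering stage
EXPLORATION_STEPS = 10000    # length of the exploration stage
INITIAL_EPSILON, FINAL_EPSILON = 1.0, 0.01

epsilon = INITIAL_EPSILON
epsilon_step = (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORATION_STEPS

def select_action(t, state, num_actions, q_values_of):
    """Sketch of the three-stage epsilon-greedy rule; q_values_of(state) is
    assumed to return the evaluation network's Q-value vector for the state."""
    global epsilon
    if t < INITIAL_REPLAY_SIZE or random.random() < epsilon:
        action = random.randrange(num_actions)       # initial stage / exploration
    else:
        action = int(np.argmax(q_values_of(state)))  # exploitation
    # Anneal epsilon linearly once the experience pool has been warmed up.
    if t >= INITIAL_REPLAY_SIZE and epsilon > FINAL_EPSILON:
        epsilon -= epsilon_step
    return action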

Distributed Proximal Policy Optimization (DPPO) algorithm

When I tried to apply the DQN algorithm to the field-binding problem, I found it hard to make DQN support training on multiple images together, so I looked for a more efficient algorithm. PPO is currently a very popular reinforcement learning algorithm and is OpenAI's default algorithm; PPO may not be the strongest algorithm, but it is probably the most widely applicable one. In addition, PPO copes more easily with complex environments, so I wanted to introduce PPO into the semanticization task. DPPO is a distributed version of PPO: for example, with 8 workers, each worker collects its own independent experience, which avoids correlation between experiences and speeds up training, so DPPO is clearly superior to PPO.

PPO algorithm

First, the PPO algorithm is introduced. PPO is based on the Actor-Critic (AC) architecture and is an AC algorithm that can adapt its learning rate. In a word, PPO solves the problem that Policy Gradient cannot determine a good learning rate (or step size): if the step size is too large, the learned policy keeps drifting and never converges; if it is too small, training takes far too long. PPO uses the ratio between the new policy and the old policy to limit how far the new policy can move in one update, making Policy Gradient much less sensitive to a somewhat larger step size.

The PPO algorithm is shown below. In general, PPO is an Actor-Critic structure: the Actor maximizes J_PPO and the Critic minimizes L_BL, i.e., the Critic's loss is the TD error it tries to reduce. The Actor modifies the new policy relative to the old policy according to the Advantage (TD error): the larger the Advantage, the larger the update, making the corresponding actions more likely under the new policy. A KL penalty is added to control the effective learning rate: if the new policy differs too much from the old one, the KL divergence grows. We do not want the new policy to drift too far from the old policy, because that amounts to using too large a learning rate, which makes training hard to converge.
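As a concrete illustration, here is a minimal NumPy sketch of the Clipped Surrogate Objective that DPPO (next section) uses in place of the KL penalty; the probability and advantage values are toy numbers, and eps is the clipping range:

import numpy as np

def clipped_surrogate(new_probs, old_probs, advantages, eps=0.2):
    """Sketch of PPO's clipped surrogate objective (to be maximized).
    new_probs / old_probs are pi_new(a|s) and pi_old(a|s) for the sampled actions."""
    ratio = new_probs / old_probs
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    # The element-wise minimum keeps the update conservative: a large ratio
    # cannot produce an arbitrarily large objective, so the new policy
    # cannot drift too far from the old one in a single update.
    return np.mean(np.minimum(unclipped, clipped))

print(clipped_surrogate(np.array([0.5, 0.1]),    # pi_new(a|s)
                        np.array([0.4, 0.2]),    # pi_old(a|s)
                        np.array([1.0, -0.5])))  # advantages -> 0.4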

DPPO algorithm

Google DeepMind proposed a parallel PPO algorithm similar to A3C, namely DPPO. Compared with PPO, the differences are summarized as follows (a minimal sketch of the resulting worker / Global PPO loop follows the list):

  • Use OpenAI’s Clipped Surrogate Objective
  • Multiple threads (workers) are used to collect data in parallel in different environments
  • Workers share a Global PPO
  • Unlike A3C, workers do not compute gradients for PPO themselves and do not push gradients to the Global Net
  • Workers only push the data they collect to Global PPO
  • Global PPO updates once it has gathered a certain batch of data from the workers (workers pause collecting while the update runs)
  • After updating, workers collect data with the latest Policy
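A minimal Python sketch of the data flow described in the list above, assuming a hypothetical GlobalPPO object with act() and update() methods; the threading and queue details are illustrative, not DeepMind's implementation:

import threading
import queue

BATCH = 32           # number of transitions Global PPO waits for before updating
data_queue = queue.Queue()
collect_allowed = threading.Event()
collect_allowed.set()

def worker(env, global_ppo):
    """Collects data with the latest policy and pushes it to Global PPO.
    Workers never compute or push gradients."""
    while True:
        collect_allowed.wait()                   # pause while an update is running
        s = env.reset()
        done = False
        while not done:
            a = global_ppo.act(s)                # always act with the latest policy
            s_next, r, done, _ = env.step(a)
            data_queue.put((s, a, r))            # push raw data only
            s = s_next

def global_updater(global_ppo):
    """Pulls a batch of worker data, pauses collection, and updates the policy."""
    while True:
        batch = [data_queue.get() for _ in range(BATCH)]
        collect_allowed.clear()                  # workers stop collecting
        global_ppo.update(batch)                 # one PPO update on the batch
        collect_allowed.set()                    # workers resume with the new policy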

The experimental process

Structure of reinforcement learning environment

The important ingredients of reinforcement learning are the agent, the environment, the design of the reward (and penalty) function, and the design of each step. The idea of this paper is to treat the semantic field recognition task as playing a game: the model continuously updates its parameters according to feedback from the environment and learns how to maximize the reward function. Concretely, the module image itself is chosen as the environment, and the bounding box of the elements in the module is the agent. During training, the agent (the element bounding box) moves from top to bottom and from left to right, as if walking through a maze. At every step it must choose an action (the action is the semantic field we want to assign). Only when the action is chosen correctly can the agent "walk" on; when the agent has "walked" through all the elements, the model has learned how to win. The environment structure is shown below:
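The gan_env_semantic.semantic_env module itself is not listed in this article, but judging from how the training code below uses it, its interface looks roughly like the following skeleton; the reward values and the way the agent advances to the next element box are illustrative assumptions, not the real implementation:

import numpy as np

class semantic_env:
    """Skeleton of the environment interface expected by the training code below.
    The module image is the environment, the current element's bounding box is
    the agent, and each action is a candidate semantic field."""

    def __init__(self):
        self.action_space = ['originalPrice', 'activityPrice', 'title']  # example fields
        self.len_actions = len(self.action_space)
        self.img_width, self.img_height = 96, 96   # module image size (example)
        self.boxes = []             # annotated element boxes with their true fields
        self.current_rec_num = 0    # number of element boxes still to be labeled

    def env_reset(self):
        """Start a new 'game': reload the module image and the first element box."""
        self.current_rec_num = len(self.boxes)
        observation = np.zeros((self.img_width, self.img_height), dtype=np.float32)
        curr_rec_class = 0          # ground-truth field index of the current box
        return observation, curr_rec_class

    def step(self, action, curr_rec_class):
        """A correct field choice earns a reward and moves on to the next element;
        a wrong choice ends the episode (the game restarts)."""
        if action == curr_rec_class:
            self.current_rec_num -= 1
            reward, terminal = 1.0, self.current_rec_num == 0
        else:
            reward, terminal = -1.0, True
        observation = np.zeros((self.img_width, self.img_height), dtype=np.float32)
        next_rec_class = curr_rec_class             # the next box's field would go here
        return observation, reward, terminal, next_rec_class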

Reinforcement learning does not require manually labeled training targets in the way supervised learning does. However, to make it easier to build the reward function, LabelImg can be used to annotate the data set so that each element has a corresponding semantic field. The annotation is organized as follows (the first picture is the module image, the second is the element annotation information in the module, and the third is the semantic field information):
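LabelImg saves its annotations as Pascal VOC XML by default, so a minimal sketch of turning one annotation file into (box, semantic field) pairs for the reward function might look like this; the file name and the convention of typing the semantic field into the LabelImg label (the <name> tag) are assumptions for illustration:

import xml.etree.ElementTree as ET

def load_labelimg_boxes(xml_path):
    """Parse a LabelImg (Pascal VOC) annotation file into a list of
    (xmin, ymin, xmax, ymax, semantic_field) tuples."""
    tree = ET.parse(xml_path)
    boxes = []
    for obj in tree.getroot().iter('object'):
        field = obj.find('name').text              # e.g. 'originalPrice'
        bb = obj.find('bndbox')
        boxes.append((int(bb.find('xmin').text), int(bb.find('ymin').text),
                      int(bb.find('xmax').text), int(bb.find('ymax').text),
                      field))
    return boxes

# Hypothetical usage:
# for xmin, ymin, xmax, ymax, field in load_labelimg_boxes('module_001.xml'):
#     print(field, (xmin, ymin, xmax, ymax))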

Model training

Model training needs to import the environment code above. Taking the DQN algorithm as an example, the training and test code is as follows:

# coding:utf-8

import os
import random
import numpy as np
# import tensorflow as tf
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()
from collections import deque
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Convolution2D, Flatten, Dense,Conv2D
from gan_env_semantic import semantic_env
from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
from matplotlib.figure import Figure
import tkinter as tk
import time
os.environ['CUDA_VISIBLE_DEVICES']='0'

ENV_NAME = 'semantic_env'#'Breakout-v0'  # Environment name
#FRAME_WIDTH = 714  # Resized frame width
#FRAME_HEIGHT = 534  # Resized frame height
NUM_EPISODES = 2000000  # Number of episodes the agent plays
STATE_LENGTH = 1  # Number of most recent frames to produce the input to the network
GAMMA = 0.9  # Discount factor
EXPLORATION_STEPS = 10000  # Number of steps over which the initial value of epsilon is linearly annealed to its final value
INITIAL_EPSILON = 1.0  # Initial value of epsilon in epsilon-greedy
FINAL_EPSILON = 0.01  # Final value of epsilon in epsilon-greedy
INITIAL_REPLAY_SIZE = 4000  #4000 Number of steps to populate the replay memory before training starts
NUM_REPLAY_MEMORY = 10000  #10000 Number of replay memory the agent uses for training
BATCH_SIZE = 64  #128 Mini batch size
TARGET_UPDATE_INTERVAL = 1000  # The frequency with which the target network is updated
TRAIN_INTERVAL = 4  # The agent selects 4 actions between successive updates
LEARNING_RATE = 0.00025  # Learning rate used by RMSProp
MOMENTUM = 0.95  # Momentum used by RMSProp
MIN_GRAD = 0.01  # Constant added to the squared gradient in the denominator of the RMSProp update
SAVE_INTERVAL = 10000  #10000 The frequency with which the network is saved
NO_OP_STEPS = 30  # Maximum number of "do nothing" actions to be performed by the agent at the start of an episode
LOAD_NETWORK = True
TRAIN = False
SAVE_NETWORK_PATH = 'saved_networks/' + ENV_NAME
SAVE_SUMMARY_PATH = 'summary/' + ENV_NAME
NUM_EPISODES_AT_TEST = 300  # Number of episodes the agent plays at test time


class Agent():
    def __init__(self, num_actions, FRAME_WIDTH, FRAME_HEIGHT):
        self.num_actions = num_actions
        self.epsilon = INITIAL_EPSILON
        self.epsilon_step = (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORATION_STEPS
        self.t = 0
        self.FRAME_WIDTH, self.FRAME_HEIGHT = FRAME_WIDTH, FRAME_HEIGHT

        # Parameters used for summary
        self.total_reward = 0
        self.total_q_max = 0
        self.total_loss = 0
        self.duration = 0
        self.episode = 0

        # Create replay memory
        self.replay_memory = deque()

        # Create q network
        self.s, self.q_values, q_network = self.build_network()
        q_network_weights = q_network.trainable_weights

        # Create target network
        self.st, self.target_q_values, target_network = self.build_network()
        target_network_weights = target_network.trainable_weights

        # Define target network update operation
        self.update_target_network = [target_network_weights[i].assign(q_network_weights[i]) for i in range(len(target_network_weights))]

        # Define loss and gradient update operation
        self.a, self.y, self.loss, self.grads_update = self.build_training_op(q_network_weights)


        self.sess = tf.InteractiveSession()
        self.saver = tf.train.Saver(q_network_weights)
        self.summary_placeholders, self.update_ops, self.summary_op = self.setup_summary()
        self.summary_writer = tf.summary.FileWriter(SAVE_SUMMARY_PATH, self.sess.graph)

        if not os.path.exists(SAVE_NETWORK_PATH):
            os.makedirs(SAVE_NETWORK_PATH)

        self.sess.run(tf.global_variables_initializer())

        # Load network
        if LOAD_NETWORK:
            self.load_network()

        # Initialize target network
        self.sess.run(self.update_target_network)

    def build_network(self):
        model = Sequential()

        #model.add(Convolution2D(32, (8, 8), subsample=(4, 4), activation='relu', input_shape=(FRAME_WIDTH, FRAME_HEIGHT,STATE_LENGTH)))
        #model.add(Convolution2D(64, (4, 4), subsample=(2, 2), activation='relu'))
        #model.add(Convolution2D(64, (3, 3), subsample=(1, 1), activation='relu'))
        model.add(Conv2D(32, (8, 8), strides=(4, 4), activation='relu',input_shape=(self.FRAME_WIDTH, self.FRAME_HEIGHT,1)))
        model.add(Conv2D(64, (4, 4), strides=(2, 2), activation='relu'))
        model.add(Conv2D(64, (3, 3), strides=(1, 1), activation='relu'))
        model.add(Flatten())
        model.add(Dense(512, activation='relu'))
        model.add(Dense(self.num_actions))
        s = tf.placeholder(tf.float32, [None,self.FRAME_WIDTH, self.FRAME_HEIGHT,1])
        q_values = model(s)
        return s, q_values, model

    def build_training_op(self, q_network_weights):
        a = tf.placeholder(tf.int64, [None])
        y = tf.placeholder(tf.float32, [None])

        # Convert action to one hot vector
        a_one_hot = tf.one_hot(a, self.num_actions, 1.0, 0.0)
        q_value = tf.reduce_sum(tf.multiply(self.q_values, a_one_hot), reduction_indices=1)

        # Clip the error, the loss is quadratic when the error is in (-1, 1), and linear outside of that region
        error = tf.abs(y - q_value)
        quadratic_part = tf.clip_by_value(error, 0.0, 1.0)
        linear_part = error - quadratic_part
        loss = tf.reduce_mean(0.5 * tf.square(quadratic_part) + linear_part)

        optimizer = tf.train.RMSPropOptimizer(LEARNING_RATE, momentum=MOMENTUM, epsilon=MIN_GRAD)
        grads_update = optimizer.minimize(loss, var_list=q_network_weights)

        return a, y, loss, grads_update

    def get_initial_state(self, observation):
        processed_observation = np.reshape(observation,(self.FRAME_WIDTH,self.FRAME_HEIGHT,1))
        state = processed_observation
        return state

    def get_action(self, state):
        if self.epsilon >= random.random() or self.t < INITIAL_REPLAY_SIZE:
            action = random.randrange(self.num_actions)
        else:
            action = np.argmax(self.q_values.eval(feed_dict={self.s: [np.float32(state)]}))

        # Anneal epsilon linearly over time
        if self.epsilon > FINAL_EPSILON and self.t >= INITIAL_REPLAY_SIZE:
            self.epsilon -= self.epsilon_step
        #print("epsilon:{},t:{}".format(self.epsilon,self.t))
        return action

    def run(self, state, action, reward, terminal, observation):
        #next_state = np.append(state[1:, :, :,:], observation, axis=0)
        next_state = observation

        # Clip all positive rewards at 1 and all negative rewards at -1, leaving 0 rewards unchanged
        reward = np.clip(reward, -1, 1)

        # Store transition in replay memory
        self.replay_memory.append((state, action, reward, next_state, terminal))
        if len(self.replay_memory) > NUM_REPLAY_MEMORY:
            self.replay_memory.popleft()

        if self.t >= INITIAL_REPLAY_SIZE:
            # Train network
            if self.t % TRAIN_INTERVAL == 0:
                self.train_network()

            # Update target network
            if self.t % TARGET_UPDATE_INTERVAL == 0:
                self.sess.run(self.update_target_network)

            # Save network
            if self.t % SAVE_INTERVAL == 0:
                save_path = self.saver.save(self.sess, SAVE_NETWORK_PATH + '/' + ENV_NAME, global_step=self.t)
                print('Successfully saved: ' + save_path)

        self.total_reward += reward
        self.total_q_max += np.max(self.q_values.eval(feed_dict={self.s: [np.float32(state)]}))
        self.duration += 1

        if terminal or self.duration % 100 == 0:
            # Write summary
            if self.t >= INITIAL_REPLAY_SIZE:
                stats = [self.total_reward, self.total_q_max / float(self.duration),
                        self.duration, self.total_loss / (float(self.duration) / float(TRAIN_INTERVAL))]
                for i in range(len(stats)):
                    self.sess.run(self.update_ops[i], feed_dict={
                        self.summary_placeholders[i]: float(stats[i])
                    })
                summary_str = self.sess.run(self.summary_op)
                self.summary_writer.add_summary(summary_str, self.episode + 1)

            if terminal:
                # Debug
                if self.t < INITIAL_REPLAY_SIZE:
                    mode = 'random'
                elif INITIAL_REPLAY_SIZE <= self.t < INITIAL_REPLAY_SIZE + EXPLORATION_STEPS:
                    mode = 'explore'
                else:
                    mode = 'exploit'

                print('EPISODE: {0:6d} / TIMESTEP: {1:8d} / DURATION: {2:5d} / EPSILON: {3:.5f} / TOTAL_REWARD: {4:3.0f} / AVG_MAX_Q: {5:2.4f} / AVG_LOSS: {6:.5f} / MODE: {7}'.format(
                    self.episode + 1, self.t, self.duration, self.epsilon,
                    self.total_reward, self.total_q_max / float(self.duration),
                    self.total_loss / (float(self.duration) / float(TRAIN_INTERVAL)), mode))

                self.total_reward = 0
                self.total_q_max = 0
                self.total_loss = 0
                self.duration = 0
                self.episode += 1

        self.t += 1

        return next_state

    def train_network(self):
        state_batch = []
        action_batch = []
        reward_batch = []
        next_state_batch = []
        terminal_batch = []
        y_batch = []

        # Sample random minibatch of transition from replay memory
        minibatch = random.sample(self.replay_memory, BATCH_SIZE)
        for data in minibatch:
            state_batch.append(data[0])
            action_batch.append(data[1])
            reward_batch.append(data[2])
            next_state_batch.append(data[3])
            terminal_batch.append(data[4])

        # Convert True to 1, False to 0
        terminal_batch = np.array(terminal_batch) + 0

        next_action_batch = np.argmax(self.q_values.eval(feed_dict={self.s: next_state_batch}), axis=1)
        target_q_values_batch = self.target_q_values.eval(feed_dict={self.st: next_state_batch})
        for i in range(len(minibatch)):
            y_batch.append(reward_batch[i] + (1 - terminal_batch[i]) * GAMMA * target_q_values_batch[i][next_action_batch[i]])

        loss, _ = self.sess.run([self.loss, self.grads_update], feed_dict={
            self.s: np.float32(np.array(state_batch)),
            self.a: action_batch,
            self.y: y_batch
        })

        self.total_loss += loss

    def setup_summary(self):
        episode_total_reward = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Total Reward/Episode', episode_total_reward)
        episode_avg_max_q = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Average Max Q/Episode', episode_avg_max_q)
        episode_duration = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Duration/Episode', episode_duration)
        episode_avg_loss = tf.Variable(0.)
        tf.summary.scalar(ENV_NAME + '/Average Loss/Episode', episode_avg_loss)
        summary_vars = [episode_total_reward, episode_avg_max_q, episode_duration, episode_avg_loss]
        summary_placeholders = [tf.placeholder(tf.float32) for _ in range(len(summary_vars))]
        update_ops = [summary_vars[i].assign(summary_placeholders[i]) for i in range(len(summary_vars))]
        summary_op = tf.summary.merge_all()
        return summary_placeholders, update_ops, summary_op

    def load_network(self):
        checkpoint = tf.train.get_checkpoint_state(SAVE_NETWORK_PATH)
        if checkpoint and checkpoint.model_checkpoint_path:
            self.saver.restore(self.sess, checkpoint.model_checkpoint_path)
            print('Successfully loaded: ' + checkpoint.model_checkpoint_path)
        else:
            print('Training new network...')

    def get_action_at_test(self, state):
        """
        if random.random() <= 0.05:
            action = random.randrange(self.num_actions)
        else:
            action = np.argmax(self.q_values.eval(feed_dict={self.s: [np.float32(state)]}))
        """
        action = np.argmax(self.q_values.eval(feed_dict={self.s: [np.float32(state)]}))
        self.t += 1
        return action

def main():
    root = tk.Tk()
    root.title("matplotlib in TK")
    f = Figure(figsize=(6, 6), dpi=100)
    canvas = FigureCanvasTkAgg(f, master=root)
    canvas.get_tk_widget().pack(side=tk.TOP, fill=tk.BOTH, expand=1)

    env = semantic_env()
    # FRAME_WIDTH, FRAME_HEIGHT= env.img_width,env.img_height
    agent = Agent(num_actions=env.len_actions, FRAME_WIDTH = env.img_width, FRAME_HEIGHT = env.img_height)

    if TRAIN:  # Train mode
        for _ in range(NUM_EPISODES):
            terminal = False

            observation, curr_rec_class = env.env_reset()
            state = agent.get_initial_state(observation)

            while not terminal:

                f.clf()
                a = f.add_subplot(111)
                a.imshow(observation[:, :], interpolation='nearest', aspect='auto', cmap='gray')
                a.axis('off')
                canvas.draw()
                root.update()

                action = agent.get_action(state)
                observation,reward, terminal, curr_rec_class = env.step(action,curr_rec_class)
                print(action,env.action_space[action],reward,terminal)
                if env.current_rec_num == 0:
                    print("hahahhahahhaaha I am Winner!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!")
                processed_observation = np.reshape(observation, (env.img_width, env.img_height, 1))

                state = agent.run(state, action, reward, terminal, processed_observation)
    else:  # Test mode
        for _ in range(NUM_EPISODES_AT_TEST):
            terminal = False
            observation, curr_rec_class = env.env_reset()
            state = agent.get_initial_state(observation)

            while not terminal:
                f.clf()
                a = f.add_subplot(111)
                a.imshow(observation[:, :], interpolation='nearest', aspect='auto', cmap='gray')
                a.axis('off')
                canvas.draw()
                root.update()

                action = agent.get_action_at_test(state)
                observation,reward, terminal, curr_rec_class = env.step(action,curr_rec_class)

                print('This semantic word is ->   ' + env.action_space[action]+'\n')
                if env.current_rec_num == 0:
                    pass
                state = np.reshape(observation, (env.img_width, env.img_height, 1))


if __name__ == '__main__':
    main()

Model evaluation and results presentation

Model evaluation

The way a reinforcement learning model is evaluated differs from classification and detection problems: there is no metric such as accuracy. The measure for a reinforcement learning model is that the score keeps rising and does not fall back, that is, the game does not have to restart. In our experiments the score does indeed keep rising without restarts. The overall score changes are shown below:
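As an aside, here is a minimal sketch of plotting such a score curve from per-episode total rewards (the values are toy numbers; the training code above also writes these statistics to the summary/ directory, where they can be viewed with TensorBoard):

import matplotlib.pyplot as plt

# Hypothetical: append agent.total_reward to this list at the end of each episode.
episode_rewards = [2, 3, 3, 5, 6, 8, 9, 11, 12, 15]   # toy values

plt.plot(range(1, len(episode_rewards) + 1), episode_rewards)
plt.xlabel('Episode')
plt.ylabel('Total reward (score)')
plt.title('Score per episode during training')
plt.show()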

Model results

The training results of DQN algorithm are shown as follows:

The training results of DPPO algorithm are shown below. It can be seen that this algorithm can support multi-image training.

Experimental analysis

Although the model can already carry out the semanticization task successfully, it still has a number of shortcomings and points worth optimizing:

  1. Since the algorithm trains on pixel images, the model is sensitive to element style: it recognizes styled elements well, but performs slightly worse on plain, unstyled text. For such cases, the plain text could instead be fed into a text classification model.
  2. As a model training framework, this approach can also be applied to many other tasks.