The author | Sandeep Bhupatiraju
The translator | Liu Zhiyong
Edit | Debra Chen
AI Front Line introduction:More than 100 years ago, in the United States, people used Morse code to send the first telegraph in human history, and mankind began a new chapter. The advent of Morse code has had a comprehensive and far-reaching impact on human society, and even changed the development direction of human history. However, with the explosive development of contemporary information, multiple communication methods have gradually pushed the single telegraph into a corner. The telegraph rose from a pioneering experiment to its zenith in the war years and was forgotten by the modern grassroots. There are still organizations and hobbyists who insist on using the telegraph, but the pitiful amount of usage is nothing like the telephone and other forms of communication. To interpret messages encoded in Morse code, you need a code table to know what the messages mean. How can I break a table without a code? Can YOU use RNN to crack Morse code? Of course you can! That’s a good idea. Cracking Morse Code with RNNs by Sandeep Bhupatiraju, a math and data science expert, let’s look at Cracking Morse Code using RNN.






Please pay attention to the wechat public account “AI Front”, (ID: AI-front)

Spoiler alert: Morse code doesn’t have to be cracked. It is useful because messages can be sent with a minimum of devices. I say it doesn’t need cracking because the code is well known, and the combination of dots and dashes isn’t a secret. But in theory, it is a substitution cipher: each letter and each number are represented by dots and dashes, as shown below:

AI: Morse code is an on-off signal code that uses different sequences of letters, numbers, and punctuation marks. It was invented in 1837. Morse code was an early form of digital communication, but unlike modern binary code that uses only zero and one or two states, it has five types of code: dots, dashes, pauses between dots, medium pauses between words, and long pauses between sentences.

Let’s assume for a moment that we receive a message from Morse code, but we don’t know how to interpret it. Suppose we also have some code examples and their corresponding words. Now, we can speculate that it’s a surrogate cipher, and then we can finally figure out the code for each letter to decode the message.

Or, we could build an encoder-decoder model to guess (almost) all the words! As masochists, of course we choose the latter. With that said, let’s take a good shot at Rocinante, Don Quixote’s steed, and start our journey against windmills.

AI Front: Rocinante, Spanish, Rocinante, the name of the horse on which Rocinante rides in Don Quixote.

Here’s a problem at hand: we have several coding sequences and their corresponding understandable examples. Using these examples, we have to learn some patterns and use this information to predict what the new coded markers (words) might be. Unlike our usual regression problems of predicting numerical results, we have some sequence-to-sequence learning problems at hand, with time structures in the data. This is an immediate reminder that recursive neural networks (RNN for speech and speech data, and a combination of CNN for image data and RNN for image letters) can be useful. Roughly speaking, this belongs to a class of problems: it also includes machine translation problems; The structure of the model is instructive here. See [1] for more information on this topic. Limited by space, we will not repeat the theory of RNN, but please refer to the series of articles in reference [2] for a brief introduction to this topic.

If you’re wondering if the problem could have been solved differently, the answer is: YES. The Markov Chain Monte Carlo method yields similar results. In this case, we will follow the process described in the first example in the excellent paper [3].

The core idea of Monte Carlo algorithm (SHORT for MCMC) is to find a Markov chain in some state space, so that the stable distribution of the Markov chain is our target distribution P (x). Thus, when we randomly walk in the state space, the residence time of each state X is proportional to the target probability P (x). When sampling with MCMC, we first introduce a reference distribution q(x) that is easy to sample, and obtain a candidate sample Y from q(x) in each step of sampling process, and then decide whether to accept the sample according to a certain principle, which is to ensure that the original obtained exactly conforms to the P (x) distribution.

Markov chains, named after Andrei Markov (A.A.Markov, 1856-1922), are discrete event stochastic processes with Markov properties in exponentials. In this process, given current knowledge or information, the past (the historical state before the present) is irrelevant to predicting the future (the future state after the present).

At each step of the Markov chain, the system can change from one state to another, or it can remain in the current state, depending on the probability distribution. A change of state is called a transition, and the probabilities associated with different changes of state are called transition probabilities. Random walks are examples of Markov chains. The state of each step in the random walk is a point in the graph, and each step can move to any neighboring point, where the probability of moving to each point is the same (regardless of the previous walk path).

synopsis

Roughly speaking, we want to start from (x_1,… ,x_n) input sequence prediction output sequence (y_1… ,y_m), which involves the learning of conditional probability.

AI Frontier: Conditional probability is the probability of event A occurring if another event B has already occurred. Conditional probability is expressed as: P (A | B), pronounced “under the condition of B, the probability of A”. Conditional probabilities can be calculated using decision trees. The fallacy of conditional probability is assuming that P (A | B) is roughly equal to P (B | A).

The main obstacle here is to predict variable-size outputs from variable-size inputs. At the meta-level, this is solved by combining two RNNS, where the first RNN maps a variable-size input to a fixed-size output, and the other RNN takes a fixed-size input and returns a variable-size output. An intermediate vector of fixed size, called a context vector, encapsulates information from an input sequence, one character at a time. The mechanism for generating the context vector is to make the RNN useful for capturing the time structure: the context vector is the hidden state after the final timestep or some of its functions. The above conditional probabilities are calculated using the chain rule.

AI Front: The chain rule is used in calculus to find the derivative of a compound function. It is a common method used in calculus to find the derivative. The derivative of the composite function will be the product of the derivatives of the finite functions that make up the composite at the corresponding point, just like a chain, so it is called the chain rule. Obviously do artificial intelligence, learn mathematics very important.

Where h is the context vector. Finally, the conditional probability on the right side of the above equation can be calculated using the Softmax function, which sets the characters y_{i-1},… , the one-hot encoded vector of Y_1 serves as input, and the output and context vector in the recurrent layer in the second RNN. The specific type of RNN used here is LSTM, which effectively overcomes the limitations of simple-RNN, which suffers from the gradient problem and is better able to capture remote dependencies.

AI Front: Simple-RNN is the simplest recurrent neural network and is the foundation of LSTM. Simple-rnn and BP both have feedforward layer and feedback layer. But simple-RNN introduces a time-based (state) loop mechanism.

Data preparation

We will introduce a special character (11) to represent the Spaces between each letter of the code. For example, the SOS code would be expressed as:… 11 – *… (replace… -…). . We do this to ensure that the word corresponding to a given code is unique. Next, we’ll use English words from our compiled data set (words_alpha) instead of randomly generated letter sets for our data. To get a feel for the data, see the histogram of word length shown below. As you can see from the histogram, there are more words that are longer than 5 letters than words that are shorter.

Networks trained on data containing encoded words will generally predict long words. Remember that the network does not compute a “formula” for the generated data, that is, it does not learn the diagram in Figure 1.

We begin data preparation by building a function that encodes the input English word into its Morse code and outputs it.


For illustrative purposes, we will generate training and validation data from a given fixed-length word. In this case, we’ve fixed the length at 9, because a number of words of length 9 is large enough (see histogram above). Note that this will mean that the length of the output word from the network will be fixed, but the Morse code input will not always be of the same length. Suppose we know that each letter has a maximum length of 4. (Instead of using this specific assumption, we can use max_length_X Morse code, which is the longest, to train the data.) Thus, if the length of a word is N, the corresponding Morse code is at most 4n+(n-1), where n-1 corresponds to the number of 11s. We populate the code with Spaces on the left so that they are of the same length, which means that our vocabulary for entering characters is {‘. ‘, ‘-‘, ‘*’, ‘ ‘}. In general, we let the output character vocabulary be all special characters for letters and Spaces. Going back to the comment about the network guessing long words, we meant that the network would tend to guess with fewer blanks because of the imbalance caused by the number of long words. In the code snippet below, output_list will contain English words, and input_list will contain the padded Morse code.


Now, we construct a unique thermal coding vector of the characters in the input to fit the input data into the neural network. To do this, we built a class object (similar to the example in the Keras documentation) that will help with encoding and decoding, and decoding Morse codes and English words. We assign classes to objects with the appropriate character set.

AI front: Single-hot coding is one-hot coding, also known as one-bit effective coding. Its method is to use n-bit status register to encode N states, each state has its own register bit, and only One of them is valid at any time.


Split the data to generate a training set X_train, and extract y_train from a quarter of the entire data set X and Y. We will reserve the remaining three quarters as verification sets X_val and y_val. Note that we are better off using one part of the training set as a validation set and the rest as a test set, but considering our arbitrary setup, we are more interested in model building than parameter tuning. Now that we have the training and testing (validation) data ready, we can continue to modify the network.


The simplest way to build a neural network is to use the Keras model and the sequential API. Since we don’t need all the functionality and flexibility of TensorFlow, we chose Keras.

Encoder Decoder model

The model topology we chose combines a powerful variant of simple RNN called the Long Short Term Memory (LSTM) network.

AI Front: The LONG short-term memory network is a temporal recursive neural network suited for processing and predicting important events with relatively long intervals and delays in time series. The system based on long and short term memory network can realize machine translation, video analysis, document summary, speech recognition, image recognition, handwriting recognition, control chat robot, music synthesis and other tasks.

The first LSTM acts as an encoder, taking a variable length input sequence, one character at a time, and converting it to a fixed length internal underlying representation. The other LSTM acts as a decoder, taking the potential representation as input and passing its output to a dense layer that predicts one character at a time using the Softmax function.

The encoder and decoder components of the model may have multiple LSTM layers, and it is often not clear which topology is best for use. In general, the deep Web works better for machine translation. As a rule of thumb, we expect layers to learn higher-level representations of time, so when the data has some hierarchy, we use it. For us, one layer of each is enough.

The model is built using Sequential() and adds only one layer at a time. The first LSTM layer takes 3D tensors as inputs and asks the user to specify input dimensions. This can be done simply with the input_shape specified in the code, where the first component represents the number of time steps and the second component represents the number of features. For us, the number of features is the number of elements in the vocabulary of the input sequence, which is 4, because we have “. , -, *, and whitespace characters. Since we provide only one unique thermal encoding vector at a time, the number of time steps is max_len_x. We will also specify the number of memory units (or blocks) in the layer (represented here by the latent_DIM parameter; we used 256), which is the dimension of potential representation. Note that we want to return the final hidden state of the LSTM as a potential representation, which contains information from all the time steps, i.e. the entire input sequence. If we use the return_SEQUENCES = true option, we get the output of the hidden state for each time step, but this only contains the sequence information to that step.


This results in a simple encoder model. We will build a similar layer as a decoder. But the output of the code snippet above is a 2D array. We repeated the output time number max_len_y by using the convenient RepeatVector layer, which was converted into a 3D tensor and used as the next LSTM layer (decoder). Now we use the return_SEQUENCES = True option in this LSTM to output hidden state sequences that we need to use for prediction. For this, we use a TimeDistibuted dense layer that outputs a vector of length max_len_y, and we use the Softmax activation function to select the most likely letters. For a quick understanding of the timedia Buted layer’s purpose, see Jason Brownlee’s blog post: How to Use the TimeDistributed Layer for Long short-term Memory Networks in Python.

How to Use the TimeDistributed Layer in Keras


Here is a quick summary of the network and the various input and output dimensions.

We fit the model into the data, train on the sets X_train and y_train, and use X_val and y_val to see how we are working. The last set of parameters you need to set are the number of rounds (EPOCHS) and the batch size. The batch size is the size of the training set passed over the network in the gradient descent algorithm, and then the weights in the network are updated. The batch size is usually the maximum amount of computer memory that can be processed. A round number is a full training through the training data of these batches. Here, we set the batch size to 1024 and used 120 rounds. As you can see in the figure below, there is no significant increase in accuracy after about 100 rounds. In general, it takes trial and error to see which parameters work. Now we use the FIT () method to fit the model.



Finally, as you can see from the chart above, we can get about 93% accuracy on validation sets, which doesn’t seem to be a bad result. Of course, we can do even better if we scale up the training data. Here are the predictions for a group of randomly selected words.

Type the code on the left, the corresponding word in the middle, and the prediction on the right. If the prediction is correct, the word is green, otherwise it is red.

As you can see, being wrong isn’t so bad. We have to remind ourselves that cracking the code is not cracking the code, that is, figuring out what each letter stands for. In fact, we can enter the code for the letters and view the code for the single letters of the network predictions, as shown below, and we’re nowhere near the goal!

As another example of the Encoder-decoder model, you can try Caesar Cipher or other code to see how effective this approach is.

AI Front: Caesar encryption, also known as Caesar encryption, Caesar transform, transform encryption, is one of the simplest and most well-known encryption techniques. It is a substitution encryption technique in which all the letters in the plaintext are replaced with ciphertext by a fixed number of alphabetical shifts backwards (or forwards). For example, when the offset is 3, all letters A will be replaced with D, B with E, and so on. The encryption method is named after Julius Caesar, who used it to communicate with his generals. Caesar ciphers are often used as a step in other, more complex encryption methods. The Caesar cipher is also used in the modern ROT13 system. But like all alphabet-substitution cryptography, Caesar passwords are very easy to crack and do not guarantee secure communication in practice.

References:

[1] Sequence to Sequence Learningwith Neural Networks

http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf

[2] Understanding LSTM Networks

http://colah.github.io/posts/2015-08-Understanding-LSTMs/

[3] THE MARKOV CHAIN MONTE CARLO REVOLUTION

http://www.ams.org/journals/bull/2009-46-02/S0273-0979-08-01238-X/S0273-0979-08-01238-X.pdf

For more content, you can follow AI Front, ID: AI-front, reply “AI”, “TF”, “big Data” to get AI Front series PDF mini-book and skill Map.