The entire training process (Encoder + Decoder) can be summarized in the diagram below. Let's now try to understand the setup required for inference. The LSTM will read this sentence word by word in 5 time steps as follows. Now a sentence can be seen as a sequence of either words or characters. If you want to improve the quality of the translations, I will list some suggestions towards the end of this blog. In our case we have nothing to output until we have read the entire English sentence. If you like my explanations, you can follow me, as I plan to release more blogs related to Deep Learning and AI. Sequence to Sequence (often abbreviated to seq2seq) models are a special class of Recurrent Neural Network architectures typically used (but not restricted) to solve complex language problems like Machine Translation, Question Answering, creating Chat-bots, Text Summarization, etc. We also compute the vocabulary sizes and the maximum sequence length for both languages. The most common architecture used to build Seq2Seq models is the Encoder-Decoder architecture. The encoder reads the input sequence and summarizes the information in what are called the internal state vectors (in the case of an LSTM, these are the hidden state and cell state vectors). What are their sizes (shapes) and what do they represent? Recall that our problem is to translate an English sentence to its Marathi equivalent. Depending on the context of the problem, they might sometimes be used and sometimes be discarded. Try using a bidirectional Encoder LSTM. We discard the outputs of the encoder and only preserve the internal states. If you are interested in improving the quality, you can try out the measures listed in this post. Basically it's the summary of information till time step 3, which is stored in the vectors h3 and c3 (thus called the states at time step 3).
In some sentences we can even note that the predicted words are not correct, but they are semantically quite close to the correct words. We train the network for 50 epochs with a batch size of 128. e. Try using beam search instead of a greedy approach. Without going into too much detail, I will assume the reader understands the (self-explanatory) steps below, which are usually a part of any language processing project. More specifically, in the case of word-level language models, each Yi is actually a probability distribution over the entire vocabulary, generated by using a softmax activation. Combined together, these are the internal state of the LSTM at time step i. Below is a very high-level view of this architecture. The purpose of this blog post was to give an intuitive explanation of how to build basic sequence to sequence models using LSTMs, not to develop a top-quality language translator. Both the encoder and the decoder are typically LSTM models (or sometimes GRU models). Unlike the Encoder LSTM, which plays the same role in both the training phase and the inference phase, the Decoder LSTM plays a slightly different role in each of these phases. Note: The size of both of these vectors is equal to the number of units (neurons) used in the LSTM cell. c. The initial input to the decoder is always the START_ token. This intuitively means that the decoder is trained to start generating the output sequence depending on the information encoded by the encoder. f. We break the loop when the decoder predicts the _END token. However, the decoder now has to predict the entire output sequence (Marathi sentence) given these thought vectors.
Then we make a 90–10 train and test split and write a Python generator function to load the data in batches as follows: Then we define the model required for training as follows: You should be able to conceptually connect each and every line with the explanation I have provided in sections 4 and 5 above. However, the technical details apply to any sequence to sequence problem in general. e. At each time step, the predicted output is fed as input in the next time step. Know high school linear algebra and probability, c. Have working knowledge of LSTM networks in Python and Keras. Top-quality translators are trained on millions of sentence pairs. The Unreasonable Effectiveness of Recurrent Neural Networks (explains how RNNs can be used to build language models) and Understanding LSTM Networks (explains the working of LSTMs with solid intuition) are two brilliant blogs that I strongly suggest going through if you haven't. Obviously, the translated Marathi sentence must depend on the given English sentence. In very simple terms, they remember what the LSTM has read (learned) till now. Below we compute the vocabulary for both English and Marathi. The decoder is just a language model conditioned on the initial states. These states coming out of the last time step are also called the "thought vectors", as they summarize the entire sequence in vector form. Since we are using Neural Networks to perform Machine Translation, it is more commonly called Neural Machine Translation (NMT). However, for now, let's see some results generated from the above model (they are not too bad either). Recall that given the input sentence "Rahul is a good boy", the goal of the training process is to train (teach) the decoder to output "राहुल चांगला मुलगा आहे". However, we will use the built-in Embedding layer of the Keras API to map each word into a fixed-length vector.
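To make the vocabulary step concrete, here is a minimal sketch in plain Python. It is not the code from this post: the two-pair toy corpus, the helper names (`build_vocab`, `max_length`) and the choice to reserve index 0 for padding are all assumptions made for illustration.

```python
# Toy parallel corpus (an assumption for illustration, not the real dataset).
# The Marathi side already carries the START_ and _END tokens.
pairs = [
    ("rahul is a good boy", "START_ राहुल चांगला मुलगा आहे _END"),
    ("rahul is good", "START_ राहुल चांगला आहे _END"),
]

def build_vocab(sentences):
    """Return the sorted list of unique whitespace-separated tokens."""
    return sorted({word for sentence in sentences for word in sentence.split()})

def max_length(sentences):
    """Return the length (in tokens) of the longest sentence."""
    return max(len(sentence.split()) for sentence in sentences)

eng_sentences = [eng for eng, _ in pairs]
mar_sentences = [mar for _, mar in pairs]

eng_vocab = build_vocab(eng_sentences)
mar_vocab = build_vocab(mar_sentences)

# Token <-> index lookups, two per language.
# Indices start at 1 so that 0 stays free for padding (an assumption here).
eng_token_index = {word: i + 1 for i, word in enumerate(eng_vocab)}
eng_index_token = {i: word for word, i in eng_token_index.items()}
mar_token_index = {word: i + 1 for i, word in enumerate(mar_vocab)}
mar_index_token = {i: word for word, i in mar_token_index.items()}

print(len(eng_vocab), len(mar_vocab))
print(max_length(eng_sentences), max_length(mar_sentences))
```

On a real corpus the same helpers would simply be fed the full lists of cleaned English and Marathi sentences.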
And in the case of characters, it can be thought of as a sequence of 19 characters ('R', 'a', 'h', 'u', 'l', ' ', ……, 'y'). Thus we will discard the Yi of the Encoder for our problem. They have real-time applications in speech recognition, Natural Language Processing (NLP) problems, time series forecasting, etc. During inference, the input to the decoder at each time step is the output from the previous time step. Finally, the loss is calculated on the predicted outputs from each time step, and the errors are backpropagated through time in order to update the parameters of the network. This helps in faster and more efficient training of the network. The purpose of this blog post is to give a detailed explanation of how sequence to sequence models are built and to give an intuitive understanding of how they solve these complex tasks. Finally, what about Yi at each time step? During training, we use a technique called teacher forcing, which helps to train the decoder faster. Finally, we create 4 Python dictionaries (two for each language) to convert a given token into an integer index and vice versa. The entire inference procedure can be summarized in the diagram below. Nothing beats the understanding developed when we actually implement the code, no matter how much effort is put into understanding the theory (that does not mean that we will not discuss any theory; what I mean is that theory must always be followed by implementation). For example: h3, c3 => These two vectors will remember that the network has read "Rahul is a" till now. e. Intuitively, the encoder summarizes the input sequence into state vectors (sometimes also called thought vectors), which are then fed to the decoder, which starts generating the output sequence given the thought vectors. Thus each Yi is a vector of size "vocab_size" representing a probability distribution.
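To see what "a probability distribution over the vocabulary" means in practice, here is a small pure-Python sketch of a softmax over made-up scores. The six-token vocabulary and the score values are assumptions chosen for illustration; in the real model the scores come from the decoder's final dense layer.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(score - highest) for score in logits]
    total = sum(exps)
    return [value / total for value in exps]

# Hypothetical target vocabulary and hypothetical raw decoder scores.
vocab = ["START_", "राहुल", "चांगला", "मुलगा", "आहे", "_END"]
logits = [0.1, 2.5, 0.3, 0.2, 0.4, -1.0]

y = softmax(logits)  # one Yi: a vector of size vocab_size

# Greedy choice: pick the token with the highest probability.
best_index = max(range(len(y)), key=lambda i: y[i])
predicted = vocab[best_index]
print(predicted)
```

The length of `y` always equals the vocabulary size, which is why the output layer of a word-level decoder grows with the number of distinct target words.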
Thus if the input is a sequence of length 'k', we say that the LSTM reads it in 'k' time steps (think of this as a for loop with 'k' iterations). And after the last word in the Marathi sentence, we make the decoder learn to predict the _END token. For some technical reasons (explained later) we will add two tokens to the output sequence as follows: Output sequence => "START_ राहुल चांगला मुलगा आहे _END". Thus the Decoder LSTM is called in a loop, processing only one time step each time. (Expected) Output Sequence => "राहुल चांगला मुलगा आहे". For example, in the case of words, the above English sentence can be thought of as a sequence of 5 words ('Rahul', 'is', 'a', 'good', 'boy'). c. The decoder is an LSTM whose initial states are initialized to the final states of the Encoder LSTM. You must also understand what type of vectors Xi, hi, ci and Yi are. Also, another point to note is that the results on the training set are a bit better than the results on the test set, which indicates that the model might be over-fitting a bit. If the article appeals to you, do provide some comments, feedback, constructive criticism, etc. Recurrent Neural Networks (or more precisely LSTM/GRU) have been found to be very effective in solving complex sequence-related problems given a large amount of data.
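One way to see why the START_ and _END tokens are useful is to look at how the decoder's inputs and targets line up during training with teacher forcing. The sketch below is a simplified illustration in plain Python; the helper name `teacher_forcing_pair` is hypothetical, and the real pipeline works on integer indices and padded arrays rather than lists of strings.

```python
def teacher_forcing_pair(target_sentence):
    """Split a tagged target sentence into decoder inputs and decoder targets.

    The decoder input at step t is the true token at step t, and the
    target at step t is the true token at step t + 1.
    """
    tokens = target_sentence.split()
    decoder_input = tokens[:-1]   # everything except the final _END
    decoder_target = tokens[1:]   # everything except the leading START_
    return decoder_input, decoder_target

dec_in, dec_out = teacher_forcing_pair("START_ राहुल चांगला मुलगा आहे _END")

for given, expected in zip(dec_in, dec_out):
    print(f"input: {given:10s} -> target: {expected}")
```

The first input is START_ and the last target is _END, so the decoder learns both how to begin a sentence and when to stop.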
Finally, we generate the output sequence by invoking the above setup in a loop as follows: At this point you should be able to conceptually connect each and every line of the code in the above two blocks with the explanation provided in section 6. As already stated, the Encoder LSTM plays the same role of reading the input sequence (English sentence) and generating the thought vectors (hk, ck). The most important point is that the initial states (h0, c0) of the decoder are set to the final states of the encoder. These vectors (states hk and ck) are called the encoding of the input sequence, as they encode (summarize) the entire input in vector form. Since we will start generating the output only once we have read the entire sequence, the outputs (Yi) of the Encoder at each time step are discarded. This blog nicely explains some of these applications. In Marathi, the subject of a sentence comes first, followed by the object and finally the verb. Hence the name 'Word Level NMT'. Say we have the following sentence: Input sentence (English) => "Rahul is a good boy", Output sentence (Marathi) => "राहुल चांगला मुलगा आहे". For now, just focus on the input, i.e. the English sentence. These are the outputs (predictions) of the LSTM model at each time step. Now we will understand all the above steps in detail by considering the example of translating an English sentence (input sequence) into its equivalent Marathi sentence (output sequence). But one question that we must answer is how to represent each Xi (each word) as a vector. This will be used as the stopping condition during the inference procedure: it denotes the end of the translated sentence, at which point we stop the inference loop (more on this later).
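The stopping condition can be illustrated with a small greedy decoding loop. This is only a sketch of the control flow: `toy_decoder_step` is a hypothetical stand-in built from a lookup table, not a trained decoder LSTM, and the `max_len` safety limit is an assumption added so the loop always terminates.

```python
def greedy_decode(encoder_states, decoder_step, max_len=20):
    """Generate tokens one at a time until _END is predicted."""
    states = encoder_states      # decoder starts from the encoder's final states
    token = "START_"             # the first decoder input is always START_
    output = []
    for _ in range(max_len):     # guard against never predicting _END
        probs, states = decoder_step(token, states)
        token = max(probs, key=probs.get)   # greedy: most probable token
        if token == "_END":
            break
        output.append(token)     # the prediction becomes the next input
    return output

VOCAB = ["START_", "राहुल", "चांगला", "मुलगा", "आहे", "_END"]
NEXT_WORD = {"START_": "राहुल", "राहुल": "चांगला", "चांगला": "मुलगा",
             "मुलगा": "आहे", "आहे": "_END"}

def toy_decoder_step(token, states):
    """Hypothetical decoder step: puts most probability on a fixed next word.

    A real step would run the decoder LSTM for one time step and return
    updated (h, c) states; here the states are passed through unchanged.
    """
    likely = NEXT_WORD[token]
    probs = {word: (0.9 if word == likely else 0.02) for word in VOCAB}
    return probs, states

translation = greedy_decode((None, None), toy_decoder_step)
print(" ".join(translation))
```

Swapping the `max(...)` line for a search that keeps several candidate sequences alive is where beam search would plug in.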
So, referring to the diagram above, we have the following input: X1 = 'Rahul', X2 = 'is', X3 = 'a', X4 = 'good', X5 = 'boy'. These vectors are typically initialized to zero, as the model has not yet started to read the input.

