Deep Learning and NLP: Attention – Welcome back. In this article, we're going to talk about the concept of attention. When an encoder sends an intermediate representation to a decoder, we normally send just one element. With attention, we're going to send more input elements from the encoder, so that the decoder can look at specific hidden states and decide which of them matter for generating each output. So what have we seen so far during the week?
We had feed-forward neural networks, which took an input, did some computation in a hidden layer, and produced an output. We also have neural networks that can transmit data across executions of themselves. For example, if you have a sequence of words and you use a recurrent neural network to predict the next word, as in "I am", you take the input "I" and produce the prediction "am", but you also produce a hidden vector that goes on to the next iteration of the recurrent neural network.
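That recurrence can be sketched in a few lines. This is a minimal, illustrative NumPy sketch, not a trained model: the weight matrices, embedding sizes, and the random input vectors standing in for "I" and "am" are all made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 4-dim word embeddings, 3-dim hidden state (made up).
emb_dim, hid_dim = 4, 3

# Parameters would be learned in practice; random stand-ins here.
W_xh = rng.normal(size=(hid_dim, emb_dim))   # input -> hidden
W_hh = rng.normal(size=(hid_dim, hid_dim))   # hidden -> hidden (the carried state)
b_h = np.zeros(hid_dim)

def rnn_step(x, h_prev):
    """One recurrence: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x + W_hh @ h_prev + b_h)

# Process the sequence "I am": each step receives the hidden state from the last.
x_I, x_am = rng.normal(size=emb_dim), rng.normal(size=emb_dim)
h0 = np.zeros(hid_dim)     # initial hidden state
h1 = rnn_step(x_I, h0)     # after reading "I"
h2 = rnn_step(x_am, h1)    # after reading "am"; h1 carried the context forward
print(h2.shape)            # (3,)
```

The key point is the second argument of `rnn_step`: the hidden vector produced at one step is fed into the next, which is what lets the network transmit information across the sequence.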
This is unidirectional transmission. You can also have bi-directional transmission, where, after training, each position gets contributions from the word that comes before it and the word that comes after it. So your recurrent neural network gets three inputs: the input itself, a hidden element from the word that came before, and another hidden element from the word that came after, making it bi-directional.
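One common way to realize this is to run two recurrences, one left-to-right and one right-to-left, and concatenate their hidden states at each position. Again a toy sketch with made-up dimensions and random parameters:

```python
import numpy as np

rng = np.random.default_rng(1)
emb_dim, hid_dim = 4, 3
seq = [rng.normal(size=emb_dim) for _ in range(3)]  # three word embeddings

# Separate (hypothetical) parameters for each direction.
def make_step():
    W_xh = rng.normal(size=(hid_dim, emb_dim))
    W_hh = rng.normal(size=(hid_dim, hid_dim))
    return lambda x, h: np.tanh(W_xh @ x + W_hh @ h)

fwd_step, bwd_step = make_step(), make_step()

# Forward pass accumulates left context.
h, fwd = np.zeros(hid_dim), []
for x in seq:
    h = fwd_step(x, h)
    fwd.append(h)

# Backward pass accumulates right context.
h, bwd = np.zeros(hid_dim), []
for x in reversed(seq):
    h = bwd_step(x, h)
    bwd.append(h)
bwd.reverse()

# Each position now sees both the word before it and the word after it.
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
print(states[1].shape)  # (6,): left-context dims + right-context dims
```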
So we have neural networks that can transmit across sequences, and we also have architectures that can encode a sequence into an intermediate form and then decode it into some other form. We call these encoder-decoder architectures, and they generally deal with sequence-to-sequence problems: the words in a question to the words in an answer, the words in an article to the words in a summary of the article. This article will look at the concept of attention.
When you're producing the decoded output, maybe you want to look at more elements than just your intermediate representation; that is attention across the encoder and decoder. We're also going to have self-attention, where a sequence looks at itself while being encoded. And we're going to look at a couple of architectures that use this.
They're called transformers. We're going to look at the concepts here, and then in your exercises you'll get to play with a few transformers. So take a look at this matrix. We have an English sentence, "The agreement on the European Economic Area was signed in August 1992", on top, and then the French translation from top to bottom: "L'accord sur la zone économique européenne a été signé en août 1992."
What we have here is the translation, but also a matrix that tells you how much attention you're paying to each element of the English sentence when you're generating each French element. For example, when you're generating the French "1992", what are you looking for in the input? You're looking at the number "1992". This one is very obvious: to generate the sequence "1 9 9 2", you only need to look at the sequence "1 9 9 2". However, most things in language are not so easy.
When you're generating, for example, the word "signé" (signed), you do need to look at the English verb "signed". But because the French verb agrees with the subject (if the subject were feminine, for example, "signé" would take another "e"), you also need to pay attention to the subject. So when you're generating "signé", most of your attention goes to the word "signed", but a little bit of your attention also goes to the word "agreement". Similarly, when you're generating "a été", you need to pay attention to "was" and "signed" in English in order to get the correct tense in the French.
So sometimes you only need to look at one element to make an input in English appear as an output in French, but sometimes you need to look at more than one. We call this process attention. How does attention happen? We get attention by transmitting richer information between the encoder and the decoder.
When we transmit the encoding, we could transmit just one intermediate vector, but we could also transmit all of the little hidden states h that were generated while the recurrent neural network, your LSTM, was running forward. All of these intermediate hidden elements get transmitted to the decoder along with the last intermediate representation, and during training the decoder can decide which of them is best for generating the correct output.
That way, it can generate a mapping of which elements it should pay attention to when generating an output. For example, take the French "je suis étudiant" being translated into the English "I am a student". When the decoder wants to generate the English "I", it has all of the hidden states available, and it pays most of its attention to "je", because "je" is obviously equivalent to "I". However, when you're generating the word "am", you need to pay most of your attention to "suis", some attention to "je" as well, and no attention to "étudiant".
When you're generating the word "a", you pay attention to "suis" and to "étudiant". So by transmitting a richer set of information from the encoder to the decoder, the decoder can take advantage of it and decide which elements matter. If you made a vector with the weight of how much the first element matters, then the second, the third, the fourth, you could implement attention, and you would also be freed from the constraint of a one-to-one relationship, because you would have vectors that map the n-to-n relationship between words and their translations. You can also have self-attention, where, given a sequence of tokens or words, you decide which words each position should be drawing the most information from.
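The mechanism described above, a vector of weights over all the encoder hidden states, can be sketched directly. This is an illustrative NumPy example with randomly generated stand-ins for the encoder states of "je suis étudiant" and for the decoder state; the actual weights would come from training, not from random vectors.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(2)
hid_dim = 3

# Hypothetical encoder hidden states, one per French word.
enc_states = {w: rng.normal(size=hid_dim) for w in ["je", "suis", "etudiant"]}
H = np.stack(list(enc_states.values()))   # (3, hid_dim)

# Hypothetical decoder state while generating "am".
dec_state = rng.normal(size=hid_dim)

# Score every encoder state against the decoder state (dot-product scoring),
# then normalize into attention weights that sum to 1.
scores = H @ dec_state
weights = softmax(scores)

# The context vector is the attention-weighted mix of all encoder states:
# this is the "richer information" the decoder gets instead of one vector.
context = weights @ H
print(dict(zip(enc_states, weights.round(2))))
```

In a trained model, the weight on "suis" would dominate when generating "am", with a smaller weight on "je" and almost none on "étudiant", exactly the pattern described above.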
For example, in the sentence "The animal didn't cross the street because it was too tired", the word "it" pays a lot of attention to "the animal", little attention to "cross", and a little bit of attention to "street". This is because, during the training phase, the model must have seen that "it" had strong positional correlations with "animal", some positional correlation with "street", and almost no correlation with "cross".
So this allows the encoding to pay attention to itself and to the other elements in the sentence. In your exercises you'll read about the programming implementation of self-attention, but in summary, during training you generate a series of matrices that tell you how much attention elements should be giving to other elements. For example, when you input the word "it" into the network so it can predict some other element, it needs to pay a lot of attention to previous nouns, like "robot", and almost no attention to previous verbs, like "obey".
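The matrices mentioned above are, in the standard formulation, query, key, and value projections, and the attention pattern falls out of scaled dot products between them. This is a toy sketch with random embeddings and random projection matrices; in a real model all of these would be learned, and the weights in the printed row would reflect the noun/verb pattern described in the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
d = 4  # toy embedding / head dimension

# Random stand-ins for the embeddings of the tokens in the sequence.
tokens = ["robot", "must", "obey", "orders", "it"]
X = rng.normal(size=(len(tokens), d))

# Query, key, and value projections (learned in practice; random here).
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

# Scaled dot-product self-attention: every token scores every other token,
# and each row of A is a probability distribution over the sequence.
A = softmax(Q @ K.T / np.sqrt(d))   # (5, 5) attention matrix
out = A @ V                          # each output row mixes the values it attends to

# The last row shows how much "it" attends to each token of its own sequence.
print(dict(zip(tokens, A[-1].round(2))))
```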
So it takes these elements and generates an attention vector with them. If you have the first law of robotics, "a robot must obey the orders given it by a human", and you have the word "it" and want to generate the next word, you're going to have two vectors: the embedding of the word "it", and a vector of weights saying how much weight we should give each of the inputs in the sequence. From training, we know that the word "robot" needs a lot of attention and that the word "the" needs very little attention.
And by the way, attention here just means the embedding of the word "robot" multiplied by an attenuating factor, 50% for example, plus the embedding of the word "obey" multiplied by 0.1, plus the embedding of the word "orders" multiplied by 0.05. You get a weighted combination that looks a lot like "robot", added to the embedding of the word "it": the input plus its attention. And this gives you really cool performance improvements.
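That weighted sum is literally the whole computation. A minimal sketch, using the 0.50 / 0.10 / 0.05 weights from the text and random stand-ins for the learned embeddings:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 4  # toy embedding dimension

# Hypothetical embeddings (random stand-ins for learned vectors).
emb = {w: rng.normal(size=d) for w in ["robot", "obey", "orders"]}

# The attention output is just an attenuated mix of the attended embeddings:
# 50% of "robot" + 10% of "obey" + 5% of "orders" (the remaining weight
# would go to the other words of the sentence).
attended = 0.50 * emb["robot"] + 0.10 * emb["obey"] + 0.05 * emb["orders"]
print(attended.shape)  # (4,): same dimensionality as the embeddings themselves
```

Because "robot" carries most of the weight, the result sits close to the "robot" embedding, which is why the text says the combination "looks a lot like robot".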
For example, if you are standing on the word "chasing", you can analyze the attention vector and see that it's paying attention to things like "is" and "FBI". You can also do this with multimodal attention. If you're encoding a picture, like the color picture in the upper left, and decoding a description of that picture, like "a woman is throwing a frisbee in a park", then you will get a vector of the hidden information in the encoding, plus the final intermediate state.
And as you decode that into English words, you will know which parts of the encoded input, meaning which pixels, it was paying attention to. For example, when it generates the word "frisbee", it's paying a lot of attention to the pixels that have to do with the frisbee. Again, in your exercises you will see more about how this is generated, but what this is really good for is that it enables more complex architectures.
What we have here is a transformer. The transformer is a kind of encoder-decoder where one half does the encoding and the other half does the decoding. By the way, where it says Nx, it means you can stack multiple levels of these, and by multiple, I mean dozens. For example, if you want to input the word "je" and have it come out as the word "I", it will take the embedding of the word "je" in French and correct it with a positional encoding, which tells you how important the order is, whether it comes in first position. It will then send this to an attention head, which tells you how important every other word in the input is to "je", and then use a feed-forward neural network to generate the right kind of intermediate representation.
It will take that, together with the start of the output, a token for a new phrase. So the new-phrase token enters here, the intermediate representation of something like "je" enters there, it passes through even more attention heads, which also receive attention information from the encoding phase, and ultimately it produces a softmax vector over the thousands of words it knows; the one with the biggest probability is going to be "I".
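That last step, turning decoder scores into a word, is just a softmax followed by an argmax. A sketch with a made-up five-word vocabulary and made-up logits (a real transformer's vocabulary has tens of thousands of entries):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy vocabulary and hypothetical final decoder scores while translating "je".
vocab = ["I", "you", "we", "student", "am"]
logits = np.array([4.2, 1.1, 0.7, 0.3, 0.9])

# Softmax turns the scores into probabilities; the highest one is emitted.
probs = softmax(logits)
print(vocab[int(np.argmax(probs))])  # "I": the biggest-probability word wins
```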
So: "je", attention, intermediate form generation, into the decoder's attention, softmax, and the output "I". That's the transformer. It's a complex model, but very powerful. There are many types of transformers; I'm just going to focus on two that are particularly important: BERT and GPT-2. Interestingly, it's been discovered that you often only need half of the transformer. You can use only the encoder, as in BERT, or only the decoder, as in GPT-2. BERT, or Bidirectional Encoder Representations from Transformers, is an algorithm that takes two inputs that you are going to encode and produces a vector, some representation of them.
What do we want this for? The cardinal use of BERT is to predict words, to function as a language model. For example, if you give it an input like "I want to [MASK] the car because it is cheap", BERT will produce the output "buy", because it receives the embeddings of all of the words, pays attention to the fact that the input had "car", "want", and "cheap", and produces a vector for the mask that carries all of that attention information, telling you that the correct candidate given those attended words is "buy".
And this is going to be in your exercises: how to watch a BERT generate this output. By the way, it does this with 340 million parameters, so none of these are light models to train or run. What's really cool about BERT is that once it generates that vector, you can do a lot of things with it. You can, for example, fine-tune a BERT so that when you get an input and generate a vector, you can use that vector for classification, say, to decide whether an email is spam or not.
You can use it to see whether a movie review is positive or negative. Here, for example, we would take the words of an email plus a special token, CLS, for the classifier. This token collects the attention information from all the other words of the email, and then we train an additional neural network to decide whether this classification vector means spam or not spam. This is what makes BERT really powerful: it can be fine-tuned for many tasks.
For example, QNLI is about language inference: if you give it two sentences, it can tell you whether one can be inferred from the other, or whether one is the answer to the other as a question. If this is a question and this is an answer, it will tell you yes or no. You can use it for sentiment analysis, telling you whether something is positive or negative. You can use it for question-answering tasks; this is what SQuAD does. You give it questions and paragraphs where the answers are, and it gives you the positions where each answer starts and ends.
BERT can also be used to find named entities, or proper names, and tell you where they start and where they end. So these models can be adapted to many tasks, and they are very powerful. A different type of transformer is GPT-2. It uses only the decoder part. It is very good at predicting what the next word will be, and it can keep generating from there: if you give it "The spaceship entered orbit around the planet", it will give you something like "Once in orbit…" and so forth.
You can again experiment with that transformer there. It was trained on 40 gigabytes of text and has 1.5 billion parameters; it has multiple attention heads and numerous stacked layers, so it is a very heavy model. As a summary: we have an idea called attention, which says that you can pass information about all of the elements in supplementary vectors or matrices, so that when you are decoding, you can take advantage of all that information and essentially establish n-to-n relationships telling you: if I want to generate the verb "signé", I have to look at the word "signed", but also at the subject for agreement.
You can do this across encoding/decoding stages, but you can also use self-attention to look at the items in your own encoding. This has been used in architectures called transformers; BERT and GPT-2 are just two examples, and they use this ability to pay attention to multiple parts of the input to generate really powerful and flexible output. But all this comes at a price, which we will analyze in the next article for the week.