So welcome to class, Wednesday, 9:30 in the morning, New York City. And today we have a guest lecturer, Awni Hannun. Awni is a research scientist at the Facebook AI Research (FAIR) lab, focusing on low-resource machine learning, speech recognition, and privacy. He earned a PhD in computer science from Stanford University. Prior to Facebook, he worked as a research scientist at Baidu's Silicon Valley AI Lab, where he co-led the Deep Speech project. And so thank you again, Awni, for joining us today. I'm looking forward to listening to your awesome lecture. And as I remind all the students, if they have any kind of question during the class, they can, as usual, type in the chat. I will be reading all questions to you live, so that we are all on the same page. Is this fine for you? Yes. So I will just disappear from here and leave you on your own, OK? OK, perfect. Great, I'll get started. Thanks for having me, everyone. So I'm going to talk about speech recognition and graph transformer networks. OK, so here's a quick outline of what I'd like to cover today. First, we'll start with a pretty high-level introduction to speech recognition, and modern speech recognition specifically: why has it become so good, and what are some of the remaining problems? Then I'd like to do a pretty deep dive into the connectionist temporal classification (CTC) loss, one of the primary criteria we use to train state-of-the-art speech recognition systems. I'd really like people to leave with a good understanding of how that works. Then a similar discussion of beam search decoding: how do we produce a good transcription given that we've trained our speech system? And then I'd like to leave you all with a kind of exciting revival of graph transformer networks that I've been working on myself, and that Yann LeCun and others at FAIR, Léon Bottou namely, worked on in the '80s and '90s.
And I'll talk about those, what we're doing with them, and how they work. Great, so if you have questions, please don't hesitate to ask, especially as we get into some of the more technical content. I really do want people to feel free to ask questions. OK, so modern speech recognition. Just so everyone's on the same page, our goal is: we start with a snippet of audio, some speech, we feed it through our speech recognizer, and we desire the transcription. So in this case, the transcription would be "the quick brown fox jumps over the lazy dog". Just a bit of fun trivia: the reason you see that phrase so often is because it's actually a pangram, which means that it uses every letter in the English alphabet. It's not the shortest one, but it is a pangram. So why has automatic speech recognition gotten so good? It has been improving for a while, but particularly since 2012 it has continued to get rapidly better. And on some academic benchmarks, if we compare the performance to, say, human-level performance, if I were to ask humans to transcribe the same speech, we'd see that the machine performance is as good or better. So it's really good on some of these benchmarks. But it's not solved. There are still places where speech recognition struggles: in particular, conversational speech. If you and your friend have a very lively conversation, that's going to be hard for a speech recognizer to transcribe well. Lots of background noise is hard too. And then importantly, when there are underrepresented groups, where by underrepresented I mean their accents or certain features about them are not well represented in the training data, speech recognition doesn't work well. And this is real, and this is recent. You'll find articles about bias in speech recognition from just a few weeks ago, and their claim is that there is still considerable bias in speech recognition.
State-of-the-art speech recognition struggles with things like gender, age, speech impairments, and accents, to name a few. So at Facebook and other places, we really strive to have speech recognition which works in many languages, hundreds of languages, and in many contexts, across the board of types of people and types of speech. So we're a long way away from having solutions for that. So I asked the question: why has ASR gotten so much better? It's not solved, but it has undeniably gotten much better. So let's talk about why. Before 2012, there were some problems with speech recognition systems. I'll call these the more traditional speech recognition systems. One of the problems was that they consisted of many hand-engineered components. I call this alphabet soup, because you'll find acronyms littered through speech systems from pre-2012. And because of this surplus of hand-engineered components, when it came to adding more data, for example, it didn't really help, because the models were overly hand designed and not able to learn well together from data. So datasets remained small, and larger datasets were not useful. Combining these modules only at inference time, instead of learning them together, allowed errors to cascade; they didn't learn to work well together. And then importantly, and often underestimated: when you have a really complex system like this, it's hard for researchers to know how to improve it. When I started my PhD, I didn't really know what to do to make a speech recognition system better. We worked on random parts of it, and it was fun to learn, but we were kind of shooting in the dark, especially at first. The learning curve was quite steep. So that was difficult. So why has ASR gotten better? You've probably seen figures like this for many different applications; the idea is the same. We replace a lot of the traditional components with deep learning (thanks, Yann) and more data.
And the two together work in a virtuous cycle: we add more data, the deep models get better, which makes more data more useful, and so on and so forth. So this figure is not meant to be fully understood; it's a picture of what speech recognition looked like pre-2012, consisting of many different components. You start with your speech on the left, and you move it through all of these components, featurization, speaker adaptation, and so on, all the way through this decoder, which takes in a bunch of different models. So it's complex. And slowly we figured out that we can get rid of some of these pieces and replace them with deep learning. We can replace speaker adaptation and some of the smaller models with just a big acoustic model. We can keep going: we can get rid of some of the components that specify how the transcription is processed and simplify the pipeline further. We can keep going and simplify even more. So now we're starting to look at what a production speech system looks like today. It's actually quite simple. It's got some featurization of the speech; it's got an acoustic model, which can be fairly complex but is a single deep neural network; and then it produces word pieces or letters, which go into a decoder, which tries to find the best transcription. So it's already looking quite a bit simpler. And then, in research, we've gone even further. We haven't quite landed these improvements into production yet, but we're on the way. Things like removing the decoder entirely: why do we need this complex decoder, can we just produce a transcription directly? Why do we even need features, can we just learn from the raw audio directly? These are things that we're doing in research that are showing promise, but they're still either too expensive, or don't help enough yet, or there just hasn't been enough time to get them into production. So that's where things are headed.
Let's talk about connectionist temporal classification. So as I said earlier, CTC is one of the more commonly used loss functions these days for training state-of-the-art speech recognition systems. So I'd like everyone to understand how it works and why it works the way that it does. Okay, so as we said earlier, we're given some input speech, and in this case it will be this utterance X, which consists of frames of audio. So X has T frames, X1 to XT. Each frame in speech recognition is usually about 20 milliseconds of speech. So we'll take a given snippet of speech and we'll chop it up into 20-millisecond slices; they overlap a little bit, and those will be the features that go into the model. And then we desire to produce a transcription, which I'm calling Y here. We'll think of our transcription as consisting of the letters of a sentence. So Y1 is the first letter, and Y sub capital U is the last letter. And so what do I need to train a model? Well, I need the ability to compute a loss. I need to know how good a transcription is given some input audio; I need to compute a score for that. And then if I'm able to compute that score, I can differentiate that score with respect to some parameters and bring all of our standard optimization methods to bear to improve the score by tuning the parameters. So to recap: we need to compute this conditional probability, the probability of a transcription given an audio snippet. And ideally it should be differentiable with respect to the model parameters, which would be the parameters of a deep neural network. And then we can train the deep neural network to maximize this probability on our training set. So we're given a training set of (Y, X) pairs, and we want to maximize the probability of all of those pairs by changing the parameters so that those scores are high. So how do we compute this score?
That's what CTC is going to do for us. We're going to build up towards CTC in a sequence of steps, and that will show us how to get this score. So the first question would be: what is the input representation? What are X1 to XT? You said they are 20-millisecond chunks, but how do we encode them? Good question. So usually, in state-of-the-art production speech systems, the 20 milliseconds would be encoded as what are called mel-scale filter banks. The way that works is: I'll take a 20-millisecond segment of audio and I'll compute the Fourier transform of just that 20 milliseconds. But instead of binning frequencies evenly, I will bin them such that they correspond to how humans actually perceive speech. It turns out that we humans perceive larger differences between lower frequencies than between higher frequencies, so the bins are going to change in size to reflect that fact. That's what the mel scale is. So it's basically a short-time Fourier transform on 20 milliseconds, and then we bin the Fourier coefficients according to how humans perceive speech. And the choice of 20 milliseconds comes from? That's a great question also. That choice has been used for a very long time now, and it has been cross-validated considerably; it's mostly an empirical finding that it works well. And it turns out it's a good trade-off between time resolution and frequency resolution. As you increase that window size, you get better frequency resolution but you lose temporal resolution: you lose the ability to distinguish between changes on a fine-grained time scale. So you need to trade that off to get good performance, and 20 milliseconds seems to work pretty well. Okay, makes sense. Great. So where are we? Okay, so we're going to compute the score. Let's look at a quick example.
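To make the featurization concrete, here is a minimal NumPy sketch of log-mel filter bank features. The parameter choices (16 kHz audio, 10 ms hop, 40 mel bins) and the function name are illustrative assumptions, not the exact configuration from the lecture:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(audio, sample_rate=16000, frame_ms=20, hop_ms=10, n_mels=40):
    frame_len = int(sample_rate * frame_ms / 1000)  # 320 samples at 16 kHz
    hop_len = int(sample_rate * hop_ms / 1000)      # overlapping frames
    n_frames = 1 + (len(audio) - frame_len) // hop_len
    frames = np.stack([audio[i * hop_len: i * hop_len + frame_len]
                       for i in range(n_frames)])
    # Short-time Fourier transform of each 20 ms window (Hann-windowed)
    window = np.hanning(frame_len)
    spectrum = np.abs(np.fft.rfft(frames * window, axis=1)) ** 2
    # Triangular filters whose edges are evenly spaced on the MEL scale,
    # so low frequencies get narrower bins than high frequencies
    n_bins = spectrum.shape[1]
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bin_edges = np.floor((frame_len + 1) * mel_to_hz(mel_edges) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, n_bins))
    for m in range(1, n_mels + 1):
        lo, c, hi = bin_edges[m - 1], bin_edges[m], bin_edges[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)   # rising slope
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)   # falling slope
    return np.log(spectrum @ fbank.T + 1e-10)  # shape: (n_frames, n_mels)
```

One second of 16 kHz audio yields 99 overlapping frames of 40 log-mel coefficients each; these frame vectors are the X1 to XT that feed the acoustic model.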
Let's just start with a really simple example. So say you're given some audio, you have three frames here, so this would be 60 milliseconds of audio. You never have such short audio in practice, but bear with me. So you have three frames, X1, X2, X3, and then you have your output transcription, which is the word "cat", in letters C, A, T. So you'll notice there are three input frames and three output letters. So we can do something really simple here. If I want to compute the score, I can just say: the first letter maps to the first input frame, the second letter maps to the second input frame, and the third letter maps to the third input frame. I'll compute the scores of those individually in log space, I'll sum them up, and I'll get my log probability. That's straightforward, and it works well when the number of input frames matches the number of output frames. So you can see here we have our correspondence between inputs and outputs: X1, X2, X3, and the letters C, A, and T. Okay, so what happens if we have a fourth input frame? What do we put there? There's this problem now where I have to decide how to score this fourth input frame, right? I already put C at the first input frame, A at the second, T at the third; what do I correspond to the fourth? And so maybe what I could do is something like this, where I decide to allow multiple inputs to map to one output. So in this example, I have four input frames, and the last two inputs both map to the T in the output. So maybe that's what I want to do. I'll call that an alignment between the input and the output. So I can allow that T to repeat one or more times. In other words, each output letter can map to one or more consecutive input frames. So there are multiple ways I could do this for this example. Actually, there are three different ways.
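The per-frame scoring described above can be sketched in a few lines. The probability table here is a made-up stand-in for the acoustic model's output, purely for illustration:

```python
import numpy as np

# Hypothetical per-frame log-probabilities from an acoustic model:
# log_probs[t, k] = log P(token k | frame x_t). Rows are frames, columns tokens.
vocab = {"c": 0, "a": 1, "t": 2}
log_probs = np.log(np.array([
    [0.7, 0.2, 0.1],   # frame x1
    [0.1, 0.8, 0.1],   # frame x2
    [0.2, 0.2, 0.6],   # frame x3
]))

def alignment_score(alignment, log_probs, vocab):
    """Log-probability of one alignment: sum the per-frame log scores."""
    return sum(log_probs[t, vocab[token]] for t, token in enumerate(alignment))

# Score the one-to-one alignment C->x1, A->x2, T->x3
score = alignment_score(["c", "a", "t"], log_probs, vocab)
```

Summing in log space is the same as multiplying the per-frame probabilities, which is exactly the independent-per-frame scoring the lecture describes.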
I could have the A repeat twice in the middle, I could have the C repeat twice at the beginning, or I could have the T repeat as in the first case. So there are three possible alignments for this case. In other cases there will be many more, but in this case there are only three. But there's still the question of which alignment I should use to compute the score. If I only had one alignment, it would be easy; I would just use that one. But since I have three possibilities, it's not obvious which one I should pick. And ahead of time, we don't know which one is the best, because we don't know how the transcription aligns to the input. It's just given to us as CAT, and the input is given to us as four frames. And the model, maybe it prefers one over the other initially, but that will just be arbitrary. So we don't really want to just let the model choose from the beginning. So in fact, what we're going to do is use all of the alignments. Rather than fix one, we're going to try to increase the score of all of them, and then just hope that the model sorts things out internally. It can decide to optimize these different alignments, weight them accordingly, and learn which one is the best. But we're not going to tell it ahead of time which one to use; we're just going to let it use all of them. So here, what I'm showing is that if we want to compute the score of Y given X, we can take the sum of the probabilities of all possible alignments of Y to X. And if you recall, when I computed the score of an individual alignment a couple of slides ago, that was in log space: I got the log of the probability of the alignment. So it turns out, and this is important for later, we're given scores of alignments that are in log space, and we need to compute the log of the sum of the probabilities. I think you're all familiar with this operation; you call it the actual softmax, which I like.
So I'm going to call it the actual softmax also. And we're going to use this actual softmax to combine the log scores of the alignments. So concretely, if I'm given two probabilities that are in log space, I'd like to compute the log of the sum of those probabilities, and the actual softmax lets me do that. And if you also recall, I think you learned this in a previous lecture, there is a stable way to do this. It's very important that you use that stable way, because otherwise things will not work; it's one of the most important tricks for sequential data, and especially long sequential data, in machine learning. It's the trick of factoring out the max, the log-one-plus trick. I think you've all seen that before, but it's important here. So it's important that we do everything in log space for numerical stability, and that also lets us use this trick. Okay, so we're going to use the actual softmax later. So we said we're going to use all possible alignments, right? So in this case, when we had CAT with four frames, we had three alignments. We're going to compute the actual softmax of all three of them, and that's going to give us our score, the log of the probability of our transcription given X. So we're good; this works for us. As an aside, I'd like to show you how we can encode the set of possible alignments of Y, the transcription CAT, to an arbitrary-length input as a graph. So this is a graph which represents the set of possible alignments of the word CAT, in letters, to an arbitrary-length input. And I want to explain this graph because it's going to come up again when we start talking about graph transformer networks. This graph, you'll sometimes hear it called a weighted finite-state acceptor (WFSA). It has a start state, which is the bold state at the beginning, zero. It has an accepting state, which is the concentric circle marked three at the end.
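The "actual softmax" of two log-probabilities, with the stabilizing max-subtraction trick, can be written as a tiny helper. This is a generic sketch of log-sum-exp, not code from the lecture:

```python
import math

def logsumexp(log_a, log_b):
    """Stable log(exp(log_a) + exp(log_b)).

    Factoring out the max means the remaining exponents are <= 0,
    so math.exp can never overflow, even for very negative inputs."""
    m = max(log_a, log_b)
    if m == float("-inf"):          # both probabilities are exactly zero
        return float("-inf")
    return m + math.log(math.exp(log_a - m) + math.exp(log_b - m))
```

For example, combining log(0.3) and log(0.2) gives log(0.5); the naive version would underflow to -inf for inputs like -1000 and -1001, while this version returns a finite, correct answer.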
On each edge there's a label, and to the right of a slash there's the weight; in this case the weights are all zeros. We don't care about the weights in this case; in some cases we will care about the weights, but here we just care about the labels. So this graph encodes a set of alignments, basically: any path through this graph is a possible alignment. So I can traverse the C one or more times, and that takes me from state zero to state one. I can traverse the A self-loop at state one as many times as I like, but I have to actually traverse the A arc at least once to get to the next state, two. And I can traverse the T on the self-loop any number of times, but I have to actually traverse the T once to get to state three, which is my accepting state. So this graph says that any alignment must output C, A, and T, each of them at least once, but it can also output each of them more than one time in succession. And so this doesn't fix a length; it encodes all possible lengths, but it encodes the fact that you have to output each letter in the transcription one or more times. So these graphs are going to come up again when we talk about graph transformer networks. Okay, so back to CTC. So we were saying we're going to use all possible alignments, and that's great, we're going to do that, but there is a problem. And the problem is that the input audio X can have lots of frames, lots of time steps; in practice, this can be as high as a thousand. The transcription Y can have lots of letters; in practice, it can be a hundred or more. If you work out the numbers, this is an astronomically large number of alignments. So we can't just do what I did earlier, which is delineate all of them, compute the score of each individual one, and sum up all of those scores using our actual softmax. That's not going to work in practice.
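The acceptor described above can be sketched as a small edge list. This is my own minimal encoding of the alignment graph (an advance arc per letter plus a self-loop for repeats; one of a couple of equivalent conventions for where the self-loops sit), not the lecture's actual graph library:

```python
def make_acceptor(word):
    """Alignment acceptor for `word`: accepts any string where each letter
    of `word` appears one or more times in succession (e.g. C+ A+ T+).
    Edges are (src_state, dst_state, label); all weights are implicitly 0."""
    edges = []
    for i, ch in enumerate(word):
        edges.append((i, i + 1, ch))      # advance arc: must be taken once
        edges.append((i + 1, i + 1, ch))  # self-loop: optional repeats
    return edges, 0, len(word)            # edges, start state, accepting state

def accepts(edges, start, accept, seq):
    """Does some path from start to accept spell out `seq`?"""
    states = {start}
    for ch in seq:
        states = {dst for (src, dst, label) in edges
                  if src in states and label == ch}
    return accept in states
```

With this graph, "ccat", "caat", and "catt" are all accepted alignments of CAT, while "ca" or "act" are rejected; the graph fixes no length, only the letter order and the one-or-more-repeats rule.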
As a fun little exercise, I encourage you to do the combinatorial math to compute that number, to show how many alignments you can actually produce given an input of length T and an output of length U. But it's a lot. So what are we going to do? Well, luckily, there's an algorithm which lets us compute the sum over all possible alignments efficiently. It's called the forward algorithm in speech recognition. It's basically a straightforward dynamic programming algorithm, and I believe you gained experience with the Viterbi algorithm in a previous assignment or lecture; the forward algorithm is actually very similar. The main difference is the operation that we use: instead of looking for the largest-scoring or shortest path, we look for the sum over all possible paths. So we use a sum instead of a max or a min, whichever you used in your assignment. So how does the forward algorithm work? We'll talk about that for the simple case of the alignments that I showed you earlier. Okay, so we start by specifying this forward variable, and we'll call it alpha subscript t superscript u, where the subscript index is where we are in the input and the superscript index is where we are in the output. And what this variable represents is the score of all alignments of length t which end in the output y_u. So let's make that more concrete. Here's an example. Suppose I have the four-frame input (there's a typo there; it should be X1, X2, X3, X4) and the output Y, "cat". And I want to compute the forward variable alpha sub 2 superscript c. In words, alpha 2 c is the score of all possible alignments of length two, over the first two frames, that end in C, the first output of my transcription. And there's only one possible alignment which satisfies that: the alignment CC. The first frame gets the C and the second frame gets the C. And so this one is simple to compute.
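If you want to check the combinatorial exercise numerically, a small recursion counts the alignments directly. This is my own sketch for the simple no-blank scheme, where each output token must cover one or more consecutive frames:

```python
from functools import lru_cache

def count_alignments(T, U):
    """Number of alignments of U output tokens to T input frames when each
    token covers one or more consecutive frames (the simple no-blank scheme)."""
    @lru_cache(maxsize=None)
    def f(t, u):
        if u == 0:
            return 1 if t == 0 else 0   # all frames consumed exactly
        if t < u:
            return 0                    # not enough frames left
        # the last token covers k >= 1 of the remaining frames
        return sum(f(t - k, u - 1) for k in range(1, t - u + 2))
    return f(T, U)
```

`count_alignments(4, 3)` gives 3, matching the CAT-with-four-frames example; in general this equals the binomial coefficient C(T-1, U-1) (compositions of T into U positive parts), which for T around a thousand and U around a hundred is indeed astronomical.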
It's just the sum of the scores at the first two frames: p of C given the first frame and p of C given the second frame, in log space. Similarly, I can compute the forward variable alpha sub 2 superscript a, which is, in words, the score of all possible alignments of length two which end in the second output, A. And again, there's only one possibility here: the first input aligns to C and the second input aligns to A. So I can do the same thing; we can compute the score of alpha 2 a. Okay, now things get a little more complicated when I want to compute alpha 3 a. I want to compute the sum of all possible alignments of length three which end in the second output, A. And if you think about it, there are two possible alignments: CCA and CAA, of length three, ending in A. So I can delineate those two alignments, and I can compute their individual scores in log space, as I show below as log p of alignment one and log p of alignment two. That's the log score of each of those two alignments. Then I combine them using my actual softmax, and that gives me alpha 3 a. So this would be the naive approach to computing alpha 3 a. But if we stare at these equations for long enough and refer back to the previous two slides, you'll notice that the first part of each equation is something we've already computed. The equation for the first alignment ending in A consists of alpha 2 c plus the new score, and the second one consists of alpha 2 a plus the new score. So we can reuse those alphas that we already computed, and that makes things simpler and more efficient. This is the recursion that we're going to build up to. So we can reuse those; we just plug them in. And instead of recomputing the prefix, we can compute the new score just by adding the score of A given the third frame. So there's another observation we can use, which is that if we plug these two scores into our actual softmax, things factorize quite nicely.
And it turns out that the actual softmax of alpha 2 c and alpha 2 a, plus the log score of A at the third frame, is exactly the alpha 3 a that we wanted to compute. So you see here, I can compute alpha 3 a just by adding the score of A at the third frame to the actual softmax of the alphas at the previous time step. So this leads us to a general recursion for computing the forward variables. This general recursion, in the simple case for the alignments we've specified, looks like this. Let's unpack it for a second. If I'm trying to compute the forward variable at time step t corresponding to output u, I take the two forward variables at the previous time step, one which corresponds to output u and one which corresponds to output u minus 1, and I consider extending them by the output u. So I take the actual softmax of those two previous forward variables, and I simply add to it the log probability of the u-th output in my transcription given the t-th frame. So this is the forward algorithm. And as we said earlier, it's very similar to the Viterbi algorithm; the main difference is how I combine these forward variables and how I define them. The Viterbi algorithm, instead of an actual softmax and a sum, would use something like a max and a sum. But otherwise the idea is the same. And then the final score, the score of P of Y given X, the sum over all possible alignments, which is what we wanted to compute in the first place, is just given by the forward variable alpha sub capital T superscript capital U, indexed by the number of input frames and the number of output frames. So let's see what that looks like more visually. This thing is a 2D graph: on the vertical axis is the transcription, CAT; on the horizontal axis are the input frames, let's say there are five.
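The recursion above, for the simplified no-blank alignment scheme, fits in a short function. This is a sketch under the lecture's simplified assumptions (no blank token yet), with a made-up interface for the per-frame log-probabilities:

```python
import math

NEG_INF = float("-inf")

def logsumexp2(a, b):
    """Stable two-argument log(exp(a) + exp(b)), i.e. the actual softmax."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def forward_score(log_probs, target, vocab):
    """Simplified forward algorithm (no blank): log P(target | X), summing
    over all alignments in which each token repeats one or more times.
    log_probs[t][k] = log P(token k | frame t)."""
    T, U = len(log_probs), len(target)
    # alpha[u] = score of all length-t alignment prefixes ending in target[u]
    alpha = [NEG_INF] * U
    alpha[0] = log_probs[0][vocab[target[0]]]
    for t in range(1, T):
        new = [NEG_INF] * U
        for u in range(U):
            stay = alpha[u]                             # repeat target[u]
            move = alpha[u - 1] if u > 0 else NEG_INF   # advance from target[u-1]
            new[u] = logsumexp2(stay, move) + log_probs[t][vocab[target[u]]]
        alpha = new
    return alpha[U - 1]   # alpha at (T, U): the full sum over alignments
```

Replacing `logsumexp2` with `max` turns this into the Viterbi score of the single best alignment, which is exactly the relationship between the two algorithms described in the lecture.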
And at each step, we're going to compute the forward variable or variables for that step based on the forward variables at the previous step, so you can get a visual sense of how this algorithm proceeds. To start, it's simple: the forward variable of length one for the output C is just the score of C given X1. So we have that; we can read it directly out of the output of the network. Then at the next step, I can compute both forward variables which correspond to alignments of length two: one for alignments ending in A and one for alignments ending in C. To compute these, I just add in the scores of p of A given the second frame and p of C given the second frame, for the first one and the second one respectively. So I keep proceeding in this fashion, and I build up my forward variables. And over time, I get to the end, and the alpha at the end equals the actual softmax of the alphas at the previous time step plus the score of the output at that level. So alpha five with the output T is the actual softmax of the two alpha fours, plus the score of T given the fifth frame. So there's a question over here. Why are we doing this instead of Viterbi? Why are we doing the summation over multiple paths rather than taking the most likely path? Right, it's a good question actually. And the answer is you could take the best path; you could do it. The simplest answer I can give is that it works better to do it this way. Rather than picking the best path from this graph, computing the sum over all possible paths leads to better, more accurate models in general. I can give some intuition for why that's the case. And the intuition is that, initially, the model does not know what a good alignment is between CAT and the input sequence, the five frames X1 through X5.
If I use the Viterbi algorithm from the start, then I'm basically saying: I'm going to force the model to choose early on which alignment to pick, and hence which alignment to increase the score for. And I'll only pick one. And maybe it's not a good one. Maybe the initialization of my model chooses some really funky, degenerate alignment, which it then tries to optimize the score for. That would be bad. So rather than letting something like that happen, which the Viterbi algorithm can, we're going to instead sum over all possible alignments. That way the model has the freedom to distribute the probability mass as it chooses between the different alignments. And given that freedom and enough data, it turns out that it will actually start to choose the right alignment. So even though we give it the freedom to increase the score over all possible alignments, as we train the model it actually starts to zero in on what a good alignment is. And you'll see that one of these paths, as the model learns, starts to get most of the probability. So the forward algorithm, in a sense, becomes a Viterbi algorithm as we train: even though we don't switch explicitly, as we train the model, it starts to choose the best path on its own and give most of the mass to that path. So you could do something like this, and it would probably work pretty well in practice: start with the forward algorithm and switch to the Viterbi algorithm after converging a little bit. And that would be fine, but it's really that early phase, figuring out which alignment to use, that's very important. That's one intuition; you can probably think of other hypotheses for why this is a good thing to do. Any other questions? Yeah, it makes sense. Also, last time in the homework, we actually pre-trained the network on a classification task in order to get a good starting point for recognizing the different phonemes or characters. Right. Cool.
Yeah, it makes sense. Good. So what I've described so far is a fairly simple algorithm. It's not CTC in its entirety. It's an algorithm that would probably work in practice, but it doesn't work that well, so people don't really use it, for speech recognition at least. And one of the reasons it doesn't work that well for speech is that when you get a snippet of audio that has human speech, it's not only human speech: there can be frames, 20-millisecond frames, which are silence, just nothing. There can be stretches of noise, there can be laughter, all sorts of things. I don't want to force the model to output a token of my transcript for every one of those frames. So if I have silence, say, at the third frame, what can I put there instead of putting one of the letters from my transcription? And so one of the things CTC does differently from what I've described so far is that it has this notion of a garbage token, or blank token. The blank token just says: there's nothing here, basically, or nothing that I care about at least. So if I align the blank token to this third frame, after I produce my final predictions for each frame, I'll just remove all those blank tokens, because I don't care about them, and that will give me the final transcription. In CTC, the blank token is actually optional. So we don't have to use it ever, but the model can use it, and it can use it one or more times before, after, or in between any of the outputs. So let's get a little familiar with what this blank token buys us. Like I said, the blank token is optional. So say we have an input with five frames and we have the output CAT, as before. This shows a set of three possible alignments. The first alignment has the blank token in the middle.
It's allowed, because when I remove the blank and I collapse the repeats, the repeated T's, I get CAT back, and that's good; that's what I want. The second alignment is also allowed. As we said, the blank is optional, and there's no blank in that alignment; that's okay, we don't need to use the blank. When I collapse the A's and collapse the T's, I get CAT back, which is what we want. And the last alignment is also valid, because it produces CAT as well, for the same reasons. What about this one? Is this alignment allowed: C, A, T, blank, T? Everyone think about this for a second. Does this alignment make sense? If I remove the blank, am I going to get the transcription that I want, CAT? It does not make sense. This alignment is not allowed, and that's because when you remove the blank, you're left with two T's, not one. So this would correspond to the transcription CATT, which is decidedly not CAT, so it doesn't work for us. And this gets us into another subtle aspect of CTC, which is that if I have repeated tokens in my transcription, I must incorporate a blank between them. Otherwise, there's no way to disambiguate repeats from non-repeats. So in general, the blank is optional between any two letters of my output; the model can always output a blank between any two letters, or not, as it so desires. But when there are consecutive repeated tokens, it must output at least one blank between them. And so, if the blank were not there, when we went to construct the final transcription, we would just get FOD, and not F-O-O-D, food, which is what we want. I do have a question here, though. So the double O in English has its own phoneme, right? Which is going to be the O sound, like in "food". So how do you actually connect the speech to the actual phoneme to the actual text? That's not clear to me. Okay, good question. So the first answer is: there is no concept of a phoneme in this system.
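The collapse rule just discussed (merge repeats first, then drop blanks) is short enough to write out. This is a generic sketch of the standard CTC collapsing step, using `_` as an arbitrary stand-in for the blank symbol:

```python
def ctc_collapse(alignment, blank="_"):
    """Map a frame-level alignment to a transcription: merge consecutive
    repeats FIRST, then drop blanks. Because repeats are merged before
    blanks are removed, a genuine double letter (the O's in "food")
    survives only if a blank separates its two copies."""
    out = []
    prev = None
    for token in alignment:
        if token != prev:          # a run of identical tokens counts once
            if token != blank:     # blanks are discarded entirely
                out.append(token)
        prev = token
    return "".join(out)

ctc_collapse(list("cc_at"))   # -> "cat"  (blank is optional filler)
ctc_collapse(list("catt_"))   # -> "cat"  (repeated T's collapse)
ctc_collapse(list("cat_t"))   # -> "catt" (blank separates a true repeat)
ctc_collapse(list("fo_od"))   # -> "food" (double O needs the blank)
```

The last two lines show exactly the subtlety from the lecture: without a blank between the two O's, "food" would collapse to "fod".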
So you can just get rid of that concept as a modeling unit; it's not something we use explicitly. When we decided to make our system more end to end, we got rid of this phoneme concept. Our model is predicting letters directly. So we don't really care about phonemes, inasmuch as they're not useful to us as an explicit modeling unit. Implicitly, the network may choose to represent things internally as phonemes. That's up to the network, and probably it does do something like that. And maybe in some cases, it would make sense to choose your representations not as letters but as higher level tokens, something like word pieces or syllables, which are more faithful to the sounds that we make. But in practice, we just let the network figure out which letters to produce from the data, and it can figure that out on its own. Did I answer your question? So I'm thinking, you said each of those X1, X2, X3 are taking 20 milliseconds, right? So in the word food, I don't see how there is a blank in between the O's. It's just the same sound, no? I don't see how there's a break in the middle of food, no? Right, right, right. Yeah, so the network has to figure that out. Okay, so I've glossed over a couple of things here. So first of all, I'm showing that each letter corresponds to one input, but in practice, the overlap is actually very large. So even though X1 and X2 differ by a few milliseconds in where they're selected, like their center point in the audio, they overlap considerably. And so to produce a given letter, the model actually has a very large window into the input. I see, okay. Not only that, usually we, yeah. So that's the main thing. So it can start to figure out, oh, this looks like a sequence, F-O-O, this looks like that sequence.
And if I see that sequence, I know I need to put a blank at this position, because it has access to enough context to do that. So it's mostly learning to do it. Even though there's nothing in the audio, it really needs to rely on the context that we give it in order to figure out to put a blank there. If you were to use just a very thin sliver of input, it wouldn't work. Okay. So there is a question here, which is asking about connections between what you just explained with the blank and our homework. So in our homework, we enforce a space, a break, after each character. So we had basically F blank, O blank, O blank, D blank, right? Instead here, you're saying that these blanks are optional. We don't necessarily have to go through the blank, right? That's correct. The blank is optional here. The only time it's not is when there are repeats in the output. So was that the question, or what was it? Yeah, I guess maybe it was more technical. Maybe we should be handling this on our side. It was more about making the connection with the homework, where we actually enforce a break after every character. I see, okay. Yeah, here we don't enforce a break after every character. We do have a space character, which I didn't show here, but the model would output a space in between words. So we can ask the model to do word segmentation for us. The blank is distinct from that. It can appear anywhere in the output, and it's optional. Okay, should I move on? Yeah. So in CTC, like we said, the blank is optional. So that means the recursion has three cases instead of just the simple case that I showed earlier, where you can compute the forward variable from the previous two forward variables. In CTC, there are three cases. So there's the simple case where the blank is in between two distinct letters, in which case it's optional.
So you can transition from the previous letter at the previous time step, you can transition from the blank at the previous time step, or you can transition from the current letter at the previous time step. So basically this is saying that I can compute my alignments of length T plus one from alignments of length T, either by extending all the alignments which ended in the previous letter of the output A, extending all the alignments which ended in blank, or extending all the alignments which ended in the current output B. This is the first case, where the blank is optional. The second case is that the output is not optional. You have to output everything in your transcription at least once. So I'm not allowed to go from blank to blank, skipping outputs. I can't transition from alignments of length T which ended in the blank before the A to alignments of length T plus one which end in the blank after the A. I have to output an A in between. So I'm only allowed to transition from these two previous nodes. And the third case, which looks like the second case and in practice the math is the same, is that the blank is not optional when there are repeats in the output. So if I have an A and an A surrounding the blank, I'm not allowed to include alignments ending in the first A; I'm only allowed to include the alignments ending in the blank and the alignments ending in the second A. So those are the three cases for the CTC recursion. From those three cases you can construct your forward algorithm. And we've accomplished our original goal, which was to compute the score of the transcription given the input, to do it efficiently, and importantly to do it in a differentiable way. Recall that all these scores are combined using a log-sum-exp, which is some combination of logs and exponentiation and addition. And so everything is differentiable.
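As a sketch of how those three cases become the forward algorithm, here's a toy pure-Python version (the function name and argument layout are made up for illustration; real implementations vectorize this and run it inside an autodiff framework): interleave blanks into the target, then at each frame combine the forward variables according to the three transition cases, all in log space.

```python
import math

def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) -- the 'logs, exponentiation
    and addition' used to combine alignment scores."""
    xs = [x for x in xs if x != -math.inf]
    if not xs:
        return -math.inf
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def ctc_forward(log_probs, target, blank=0):
    """log_probs: T x V per-frame log-probabilities; target: label indices
    (no blanks). Returns log P(target | input), summed over all alignments."""
    ext = [blank]                        # interleave blanks: [b, y1, b, y2, ..., b]
    for y in target:
        ext += [y, blank]
    S, T = len(ext), len(log_probs)
    alpha = [-math.inf] * S
    alpha[0] = log_probs[0][ext[0]]      # start with a blank...
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]  # ...or with the first label
    for t in range(1, T):
        new = [-math.inf] * S
        for s in range(S):
            score = alpha[s]                            # case 1: stay on the same token
            if s > 0:
                score = logsumexp(score, alpha[s - 1])  # case 2: advance one token
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                # case 3: skip the optional blank (forbidden between repeats)
                score = logsumexp(score, alpha[s - 2])
            new[s] = score + log_probs[t][ext[s]]
        alpha = new
    # accept alignments ending on the last label or the trailing blank
    return logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
```

As a sanity check: with two frames, one label plus the blank, and uniform probabilities of one half, the valid alignments of a one-label target are "blank-label", "label-blank" and "label-label", for a total probability of three quarters.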
We can differentiate through this graph traversal and back propagate as usual, get gradients with respect to parameters, and learn a model which optimizes the score of the transcription given our input. So if you recall, I showed you that graph earlier for the simple set of alignments, where you just have a C one or more times, an A one or more times, and a T one or more times. You can draw the same graph for CTC. It looks more complicated. It is more complicated, actually. It's definitely much less regular, which makes CTC more complex to implement in practice. But viewing this algorithm as a graph is actually quite useful. And when you start to gain that skill of looking at these computations and envisioning the graph, we can then start to think of them as operations on graphs. And that will be important for later on, when we talk about graph transformer networks. So I just wanna step through this graph very briefly, just to keep exercising that muscle of looking at this set of alignments as a graph. So like we said, we have a starting state, which is the bold state. And in this graph, there are two accepting states, the two concentric circled states at the end. On every edge, there's a label, then a slash, then the weight; on the first self-loop, the label would be the blank token. And here we don't have any weights. We don't care about the weights, but in the future we will. So what this graph encodes is: if you look at this zeroth state, you can output zero or more blanks. The blank is optional to start. But you have to output a C, and you can output more Cs as you like, or none. And once you've output your Cs, you have a choice. You can either output your A directly, transitioning from one to three along the bottom edge. Or you can output a blank one or more times, transitioning from one to two, then around the self-loop on two, and then output an A. Either way, you have to output the A. You can't ever skip producing an A.
But you can optionally choose to produce one or more blanks. And so we keep going through this graph, and what we end up with is the set of possible alignments, including optional blanks of arbitrary length, for the transcription CAT. So that was the introduction to the CTC loss and how it works. So to recap, we've gone over modern speech recognition and how it's gone from kind of complex to end-to-end. We talked about the CTC loss, which is one of the more commonly used, though not the only, loss functions used to train these more end-to-end speech systems. And so now that we have these trained models, we know how to compute the score of a transcription given some input audio. We know how to compute its conditional probability. We would like to be able to solve the problem: okay, given some new input audio, how do I find the transcription that's best according to the model that I've trained? And that's what decoding at inference is gonna do. So like I said, the goal is we're given some input speech X. We wanna find the best transcription. So we assume we have two models. We've got these models that have already been trained; they're handed to us. One is the model which gives us the score of any transcription given some input, some speech. And the second is a language model, which gives the score of just the transcription, not conditioned on speech. So the goal of the language model is to assign high probability to sequences of words or letters which are likely for human speech, and low probability otherwise. So I wanna talk a little bit about where this language model came from and why we still use it. So the main reason we still use a language model is because it can be trained on a much larger text corpus than what we can train our acoustic model on. So if you think about it, to train the acoustic model, we need these pairs, right? We need so-called paired data or transcribed data. We need audio and the corresponding transcriptions.
And typically, if you're a company like Facebook, you can pay people to transcribe, say, public videos and the speech in public videos, but that's expensive. Paying someone to transcribe what's being said in a video is costly. So that really limits the size of the paired data sets that we can use, where we have transcriptions. But it's super easy to collect huge text corpora which don't have paired audio, just by crawling the web. And so that lets us train a language model on a huge text corpus. And then I can use that language model to help figure out what the right transcription is. You can imagine that if I produce a transcription which is semantically or syntactically odd, the language model should be able to say, no, don't do that, instead favor this other one. Another really nice feature of the language model is that it lets us rapidly tune a system to a given application or even a user. So when you're using, say, your phone and you say, call so-and-so, call Alfredo, it turns out your phone actually does something really sophisticated. It constructs on-the-fly a language model of all the names and contacts in your phone. And then it uses that language model to figure out who you intended to call, because the names on different people's phones are so different and sometimes quite distinct or unusual. That language model biasing really helps figure out which contact you said. That feature would work much less well if we didn't use these kind of on-demand, user-specific language models. So typically these language models will be n-gram language models. They'll just be based on counts of co-occurrences of, say, length three sequences of words, sometimes length five; it just depends on the use case. But more and more these days, especially in research, we've been gravitating towards things like recurrent neural network based language models and even transformer based language models.
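As a toy illustration of the count-based idea, here's a bigram (rather than trigram) language model sketch in Python. The function name and the tiny corpus are made up, and there's no smoothing, which any real n-gram model would have:

```python
from collections import Counter
import math

def train_bigram_lm(corpus):
    """Count-based bigram LM: P(w2 | w1) = count(w1 w2) / count(w1).
    Returns a function scoring a sentence in log probability."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        unigrams.update(words[:-1])            # count left-context words
        bigrams.update(zip(words, words[1:]))  # count adjacent pairs
    def score(sentence):
        words = ["<s>"] + sentence.split()
        total = 0.0
        for w1, w2 in zip(words, words[1:]):
            if bigrams[(w1, w2)] == 0:
                return -math.inf  # unseen bigram (real systems smooth instead)
            total += math.log(bigrams[(w1, w2)] / unigrams[w1])
        return total
    return score

# a three-sentence "corpus" standing in for the contacts on a phone
score = train_bigram_lm(["call alfredo", "call mom", "text mom"])
```

A decoder would add this score to the acoustic model's score: likely word sequences such as "call mom" get a finite log probability, while sequences the corpus never produced score minus infinity and are pruned away.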
But typically in production systems, these are still n-gram language models, which are efficient and can be trained on gigabytes, sometimes even terabytes, of text data. Okay, so we said we were given these two models, our acoustic model and our language model. And we'd like to compute the transcription which maximizes the sum of the scores of these two models. So you hand me some speech I've never seen before. I need to find a transcription which maximizes the sum of the scores of these two models. So you can imagine searching over the space of all possible transcriptions: for each transcription, I feed it through both of these models, I look at the sum of the scores, and then I take the one which is the best. Of course, we can't do that explicitly because there's a huge number of transcriptions. So we need a way to do this efficiently. And in practice, we're also not gonna do it exactly, but approximately. So Y star, this best transcription, isn't going to perfectly maximize these scores, but it will approximately maximize them. It'll come pretty close. So first of all, basically, finding this best transcription boils down to a graph search. I'm looking for the lowest scoring path in a graph. And at each point in my graph, I can extend the node that I'm currently at by all possible next letters. So at the first node, I can output any possible letter. Let's say our alphabet is ABC, for simplicity. Then, if I output an A at my first node, that will be the first letter in my transcription. Say the edge here is labeled A slash three: the A is the first letter in the transcription, and the three is the score. And then when I get to the next node, I can consider all possible extensions, all possible letters for the second time step. So that'll be A, B, C for the second time step, and their scores. Just a quick aside here: I switched from looking for the highest scoring to the lowest scoring.
So these should be interpreted as negative log probabilities. Sorry if that's confusing, but just make the switch in your head real quick. We're looking for the lowest scoring path in this graph. So the first thing you might think to try is just a greedy search, right? You might say, okay, at the first time step, I'll choose the letter which has the lowest score according to the model. In this case, that would be C. Its score would be one. And so that will be the first letter of my transcription. At the second step, I will consider all possible extensions to the first letter C. I'll look at those scores and take the best one. I'm just looking at this local set of possible extensions. In this case, the best one is a B with a score of two. That gives me a total score of three. So my transcription is CB and has a total score of three. At the third step, I do the same thing. I consider all possible extensions of CB. I take the one that has the lowest score, which is again a B with a score of eight, and that gives me a final score of 11. And so the best path so far is the sequence CBB, which has a score of 11. And let's say that our input has only three steps. So we'd be done at this point, because we'd have processed the three steps of the input and produced the output CBB. But it turns out, because of this greedy process, I actually missed a much better path. There was this path ABA which I missed, because I didn't consider A as a possibility in the beginning. And so greedy can quickly miss really good scoring paths, especially if you have uncertainty early on when you're doing the search, which you often do. So greedy is not gonna work too well for us. We want something better. So instead, we're gonna use a beam search, and beam search is actually a very simple algorithm and also very important to the speech recognition system. And the way that it works is just these two steps.
The first step is: at any given stage in the algorithm, I have a set of candidates. Let's call that set of size N. So I have a set of N candidates, current possible transcriptions. I'll consider extending each of those N candidates by all possibilities in my alphabet, so A, B, C. Then I will get N times the size of the alphabet candidates. In the second step of my algorithm, I will sort all of those new candidates by their scores, and I'll just chop off everything but the top N. And then I'll repeat this. So the invariant is that when I start this loop, I have a set of N candidates. And when I repeat, I always have this set of size N. So I consider extending that set, then I truncate it to the best N, then I consider extending that truncated set, and so on and so forth till I've consumed all the input. So let's look at a quick example of how that looks. So say I start at the first node. If N equals three, meaning I'm gonna keep a set of three candidates at all times, well, then we'll just take all three of the first candidates, because our alphabet is also size three. So we have A as a first possibility in our transcript, B and C, and they have the corresponding scores in the nodes. At the second step, like we said, we consider extending each node by all possible letters. So there are nine total candidates. I will look at the scores of all nine of those candidates and I'll take the three best. In this case, the three best are the three that we've highlighted here. A, B, which has score six, C, A has score four, and C, C has a score of five. So that's my new n-best list, my new set of candidates. And then I just do the same thing. I extend those by all three possibilities each, I sort those by their scores, I take the top three, and I'm left with the newest set of three candidates. So let's say that at this point, if my input has length three, I'm done. I've looked at all three steps of the input and I've got three candidates.
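Those two steps can be sketched in a few lines of Python. This is a toy version: `score_fn` stands in for querying the acoustic and language models for the cost of extending a prefix by one letter, and the cost numbers below are made up, not the ones from the slides:

```python
def beam_search(score_fn, alphabet, num_steps, beam_size):
    """Keep the beam_size lowest-cost partial transcriptions at every step.
    score_fn(prefix, letter) is the cost (a negative log probability)
    of extending prefix by letter."""
    beam = [("", 0.0)]  # (partial transcription, total cost)
    for _ in range(num_steps):
        # step 1: extend every candidate by every letter in the alphabet
        candidates = [
            (prefix + letter, cost + score_fn(prefix, letter))
            for prefix, cost in beam
            for letter in alphabet
        ]
        # step 2: sort by cost and chop off everything but the top beam_size
        candidates.sort(key=lambda c: c[1])
        beam = candidates[:beam_size]
    return beam  # the n-best list, sorted by cost

# made-up costs where greedy commits too early and misses the cheap path ABA
costs = {
    ("", "A"): 3, ("", "B"): 9, ("", "C"): 1,
    ("C", "B"): 2, ("A", "B"): 1,
    ("CB", "A"): 8, ("CB", "B"): 8, ("AB", "A"): 1,
}
score = lambda prefix, letter: costs.get((prefix, letter), 9)
```

With `beam_size=1` this is exactly the greedy search from before, and on these costs it ends with a total cost of 11; widening the beam to 3 recovers the path ABA with cost 5.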
And what I'm gonna return from this algorithm is just those three candidates sorted by score. We'll call that an n-best list. And so the best one, the lowest scoring one, which will be the ultimate transcription that we'll use, is C, C, B. And then we'll also return the other two, just in case; sometimes it's useful. We don't have to. So that's beam search. And one thing you might be wondering is, where did these scores on the edges come from? Well, the scores on the edges come from a combination of the language model and the acoustic model. So that's where we integrate in the two models that we were given earlier. At each step, we have to query the language model and the acoustic model to produce the score on a given arc. But otherwise, this is the whole decoding process. It's a beam search which keeps track of an n-best list and consumes the input, and when you're done, you return the best path. And so like we said, we can use this beam search to approximately find the transcription which is optimal under the acoustic model and the language model. And so that's how inference works. It's quite straightforward at a high level. In practice, there are lots of things one has to do in the details to make it efficient, especially because these language models and acoustic models can get very large and require a lot of context to evaluate. But these are the main ideas. So we talked about... Hold on, actually, there is a question I missed. Has there been any research in differentiating the beam search so that we can directly optimize what we do at inference? Yes, yes, there has. And I myself have participated in such research. So basically, the answer is yes. There's a couple of good reasons to try to make a beam search differentiable. One reason is because then we can make training time conditions more similar to test time conditions.
So you notice there's this mismatch between training time and test time, right? At training time, we're using CTC and marginalizing over all possible alignments. At test time, we're doing this beam search using a language model. These are two very different processes. If I make my beam search differentiable, I can actually use it at training time as the loss function directly. And so it's a good thing to try to make those two consistent with one another. To some extent, you could say that graph transformer networks, which we are about to talk about, are actually an attempt at differentiating through a beam search, even though the beam search itself is not differentiable. That's exactly right, and basically what I was going to say next. So that's one of the motivations for graph transformer networks in the first place. I'm glad you said the same thing. Okay, so let's figure out what these graph transformer networks are, and also how they differ from the other graph networks that we're going to be learning about soon. Right, right, right. Okay, so let's talk about graph transformer networks in the remaining time. I'm going to start by just reintroducing this data structure that we talked about a little bit. And then I'd like to do a high level discussion of graph transformer networks, a little bit on the history, where they came from, what they're used for. And then go into some of the low level details of how we can construct these graphs, operations on them, and then a couple of examples. So just to orient everyone to start, though: the graph transformer networks, at least the ones that we're using to do research with at Facebook, are built on top of this data structure, which we call a weighted finite state automaton. And this is the same graph I showed you earlier. It encodes the set of alignments for Y, the transcription CAT.
And really what GTNs are is a way to use these graphs and perform operations on these graphs, coupled with differentiation, in our case automatic differentiation, through those operations. So you can think of GTNs as: instead of tensors, you have graphs, like this graph that I've shown you. And instead of matrix multiply, convolution and point-wise operations, you have different and new and interesting graph operations. And just like the operations that you can do on tensors, like matrix multiply and convolution, those graph operations are actually differentiable. And when I say differentiable, I mean you can differentiate the output of the operation with respect to the input, specifically the weights on the arcs of the input graphs. We'll talk a little bit about what these operations are and what the weights are and where they come from. But first I want to talk a bit about the history of graph transformer networks. So, as I said earlier, these models were developed by people like Léon Bottou, now at FAIR, Yann, and others at AT&T in the early 90s. And Yann, correct me if I'm getting any of this history wrong, please. One of the first applications was in a state-of-the-art automatic check reading system, which was actually deployed using GTNs, and deployed widely to the best of my knowledge. So it's kind of one of the early success stories of AI actually being used in practice. And I find that very cool: in the early 90s, these things were already deployed. So if you go back and follow the explosion of deep learning over the past decade or so, one of the papers that gets cited most is this paper from Yann and company, which introduces convolutional networks, and even modern deep learning, as building blocks constructed from these convolutional networks for image processing. This is actually a long paper. It's like 40-some pages.
And most of the convolutional network specific part of it is actually just the first half, even the first third, 16 pages. But it turns out the whole second half of this paper is about graph transformer networks. And it's less frequently read, unfortunately, because it's actually very interesting. I'd say even more interesting than the first half. So do yourself a favor and read it if you have time. In general, this paper is remarkably prescient in terms of how much of what we do in modern deep learning it was already doing, albeit at a smaller scale. And hopefully it's prescient of what we will be doing with GTNs, which we're starting on now, and I'm hoping people will adopt more in the future. So on the right here, you can see a little figure, taken from their paper, which really summarizes the parallel between graph transformer networks and what I'm going to call traditional deep learning, which is kind of funny because there's nothing yet traditional about deep learning, but maybe graph transformer networks will make it traditional. So really the main distinction is you're operating on a different data structure, which is a graph. And the operations will of course be different. So to draw this parallel even further: with deep learning, with neural networks, our core data structure is a tensor. It can be a 1D or ND tensor. And with GTNs, our core data structure is a graph, typically some kind of weighted finite state automaton like the one I showed earlier. Although there are variations of this graph data structure; I'll give you another variation later. And the operations that we do all have nice parallels. So for matrix multiplication, the parallel in GTNs would be something like the composition or intersection operation on two graphs.
For the reduction operations, the parallel would be shortest distance operations, which include the forward algorithm and the Viterbi algorithm, but on general graphs instead of the very finely structured graphs that we've been talking about. And similarly, there are unary and binary operations which take a single graph or two graphs and compute a new graph. There are parallels there as well. So where are these graphs used today? Because they are actually used today, and they've been used for a while, especially in speech recognition. So WFSTs and WFSAs, weighted finite state transducers and acceptors, are not new. They've been used a lot and are currently used a lot. But the main distinction between how they're used today and what Yann and Léon were trying to do with graph transformer networks, and what we're trying to do, is that today they're only used at inference. You can view your model's output as a graph, or the model itself as a graph. But if you only use them at decoding, well, you're really limiting what you can do with these models, because they can be used at training time as well. And this is one of the things we were hinting at earlier, which is that if you use them at training time, you can start to bridge the gap between what we're doing at decoding and what we're doing at training, such as making a beam search available to us at training time. So more concretely, why are we interested in GTNs? Why are we interested in these finite state acceptors combined with automatic differentiation? Well, for one, it's much easier to encode knowledge about the world in one of these graphs than it is in a generic tensor. If I have some knowledge that a word consists of letters or word pieces, it's not obvious how you encode that in a tensor. But you can encode that in a graph quite easily, actually.
If I have some knowledge about the set of alignments that should be allowed by a model, such as the blank being optional, I can encode that in a graph, as I showed you earlier. But encoding it in a tensor, it's not clear how to do that, right? So it's much easier to encode priors in these graphs. And then the second reason is basically what we were saying earlier. Using these graphs lets us bring together training time and test time conditions, which avoids the common issues that result when we treat these two things as separate processes. And then the third, which is really one of my favorite aspects of this approach, is that it facilitates research. So when you separate data from code, really any time you do this, you make it much easier to develop. In our case, the graph will be the data, and the code will be the operations on graphs. When we treat those two things as separate, rather than entangling the data with the code, it makes it much easier to explore different ideas, because then all we have to do is change the graph, and all of a sudden we have a new and interesting algorithm. I'll give an example of that later. So it turns out a lot of sequence criteria, like connectionist temporal classification, CTC, can be specified as the difference between the forward scores of two graphs, two weighted finite state acceptors. And of the two graphs in particular, one is a function of both the output and the input; we'll call that the graph A. So it's constrained by the transcription as well as the input. And one is just a function of the input; call that the graph Z. And the graph Z serves as the normalization graph. So what we're trying to do, intuitively, is: for the graph which is constrained by the target Y, we'd like to increase its score, because it encodes all the paths that we like, that we think are good.
And the graph Z, which is not constrained by the target, encodes all possible paths, including the ones that we think are good, but also many, many more. And so we want to decrease the score of those paths, because then, relatively speaking, the score of the ones we care about will be higher. And this intuition should sound familiar from things like energy-based models, which I think you all learned about. This is one way of combining graphs and computing criteria. It's not the only way. But this would be like using a softmax, where you encode all possibilities in the denominator and you have just the ground truth in the numerator. So like I said, there are many criteria we can specify with these graphs, including connectionist temporal classification, as well as others that are commonly used in speech recognition, and also many more, especially if we consider applications other than speech. So just as a little teaser: if you start to implement things in this framework of using graphs, say you take CTC and you look at some common implementations. There's an implementation called Warp CTC. There's an implementation from a speech recognition system at FAIR. There's an implementation in PyTorch. If you look at the number of lines of code of their custom CTC implementations, it's in the thousands. And if you implement the same thing using graphs, it only takes about 30 lines of code. Of course, the work is still being done. It's just being done in a more generic way, in a way in which it can be applied to lots of different algorithms. The code itself is now in the operations, which are general. And then to construct CTC, I only need to string together a few of those operations, which is why it's so much simpler. And also the same graphs can be used at decoding. I no longer have to implement a custom, separate decoding step.
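Numerically, a criterion of this shape is just the difference of two log-sum-exps. Here's a tiny sketch with made-up path scores, where plain lists stand in for the graphs whose forward scores would really be computed by the forward algorithm:

```python
import math

def forward_score(path_scores):
    """Log-sum-exp over the scores of all paths through a graph (here the
    paths are simply listed; a real forward algorithm computes this by
    dynamic programming over the graph)."""
    m = max(path_scores)
    return m + math.log(sum(math.exp(s - m) for s in path_scores))

# made-up numbers: the paths of A (constrained by the target y) are a
# subset of the paths of Z (all paths), so the loss is never negative
score_A = forward_score([2.0, 1.5])              # paths consistent with y
score_Z = forward_score([2.0, 1.5, 0.3, -1.0])   # all paths (the normalizer)
loss = score_Z - score_A  # negative log-likelihood of the target y
```

Minimizing this loss pushes up the scores of the target-consistent paths relative to everything else, exactly the softmax-denominator intuition described above.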
So there's a really big win in terms of development, which translates into the things that I can quickly explore in research. Okay, so that was a high-level discussion of graph transformer networks, weighted finite state acceptors, and so on. I'm going to talk now about some of these graphs, some of the operations we can do on them, and then get into a couple of examples. So here's a very simple graph. We said earlier the bold state at the beginning is the start state. The accepting state is the one with concentric circles at the end. And we say this graph recognizes two sequences. The first is AA, and the second is BA. So the first is an A from zero to two, and an A from two to one. And the second is a B from zero to two, and an A from two to one. And for the score, you just read off the weights of each edge and sum them up. So the score of AA would be zero plus two, which is two, and the score of BA is one plus two, which is three. So this graph, in summary, recognizes two sequences, and we can get the score of those sequences as the sum of the weights on the edges. So we call this kind of graph an acceptor, because it accepts sequences. There's another kind of graph, which we call a transducer, because it maps input sequences to output sequences. Very similar concept; the main difference is, instead of having just a label on each edge, we'll have an input label, a colon, an output label, a slash, and a weight. And so the input label basically is the input, and the corresponding output label is what it maps to. So this graph, we would say, transduces two sequences. For the first sequence, it transduces AB to XZ, because the A maps to X and the B maps to Z. And as in the first graph, we just get the weights by summing the weights on the edges. So that's a transducer: it maps sequences to sequences, instead of just accepting sequences.
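Here's a minimal sketch of such an acceptor in Python (a made-up toy class, not the API of any real WFSA library), built to match the two-path example just described:

```python
class Acceptor:
    """A tiny weighted finite-state acceptor: states, weighted labeled
    arcs, plus designated start and accept states."""
    def __init__(self, starts, accepts):
        self.starts, self.accepts = set(starts), set(accepts)
        self.arcs = []  # (src, dst, label, weight)

    def add_arc(self, src, dst, label, weight=0.0):
        self.arcs.append((src, dst, label, weight))

    def paths(self):
        """Enumerate (sequence, score) over accepted paths (acyclic graphs
        only); a path's score is the sum of its arc weights."""
        found = []
        def walk(state, seq, score):
            if state in self.accepts:
                found.append(("".join(seq), score))
            for src, dst, label, w in self.arcs:
                if src == state:
                    walk(dst, seq + [label], score + w)
        for s in self.starts:
            walk(s, [], 0.0)
        return sorted(found)

# the acceptor from the example: AA with score 0 + 2 = 2, BA with 1 + 2 = 3
g = Acceptor(starts=[0], accepts=[1])
g.add_arc(0, 2, "A", 0.0)
g.add_arc(0, 2, "B", 1.0)
g.add_arc(2, 1, "A", 2.0)
print(g.paths())  # → [('AA', 2.0), ('BA', 3.0)]
```

A transducer would be the same structure with a pair of labels (input and output) on each arc instead of a single one.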
So there are some different types of graphs that we can have, different structures. Cycles are allowed: you can go from zero to one to two and back to zero. That's fine. Self-loops are allowed. Not all operations support self-loops and cycles, but in general, we allow these. You can have multiple start nodes. The zero and the one here are both bolded circles; they're both start nodes. You can have multiple accept nodes. The three and the four states here are both accept nodes. That's fine. You'll find different flavors of this in different implementations, but in general, these things are allowed. One of the more subtle and more useful features of these graphs is epsilon transitions. These take some getting used to, but the basic idea is that epsilon is synonymous with nothing. The easiest way to think about this graph is that it accepts two sequences. One is AB and the other is just B, because I can transition from zero to one without consuming any tokens, just by following the epsilon, and then I have to use the B token to get from one to two. So this graph accepts two sequences, AB and just B, because the epsilon says I can make that transition without consuming any tokens. And just as in acceptors, epsilons are allowed in transducers. So this graph transduces the sequence ABA, for example, by following the self-loop, the arc from zero to one, and the arc from one to two. But the output would just be an X, because the outputs on the first two edges are epsilons, which correspond to nothing. So the final output is just an X. What the epsilon buys us here is the ability to map variable-length inputs to variable-length outputs. So now, instead of ABA mapping to something of length three, I can actually map ABA to something of length one.
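Here's a toy sketch (again my own representation, not the GTN API) of how an epsilon arc behaves when checking acceptance: it moves between nodes without consuming an input token.

```python
EPS = None  # stand-in for the epsilon label: consumes no input token

def accepts(arcs, start, accept, tokens):
    """True if some path from a start node to an accept node spells `tokens`."""
    stack = [(s, 0) for s in start]   # (node, position in tokens)
    seen = set()
    while stack:
        node, pos = stack.pop()
        if (node, pos) in seen:
            continue
        seen.add((node, pos))
        if node in accept and pos == len(tokens):
            return True
        for src, dst, label in arcs:
            if src != node:
                continue
            if label is EPS:                        # free transition
                stack.append((dst, pos))
            elif pos < len(tokens) and label == tokens[pos]:
                stack.append((dst, pos + 1))
    return False

# the epsilon example from the slides: 0 -> 1 on A or epsilon, 1 -> 2 on B
eps_arcs = [(0, 1, "A"), (0, 1, EPS), (1, 2, "B")]
```

With this graph, both `accepts(eps_arcs, {0}, {2}, "AB")` and `accepts(eps_arcs, {0}, {2}, "B")` hold, because the zero-to-one transition can be taken for free.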
And I can also have epsilons on the input, so I can map shorter inputs to longer outputs, for example. Okay, so let's talk about a few different operations. I'm just going to touch on a couple of them; this is definitely not comprehensive, but it should give you a flavor. There are some very simple operations we can do, like union, and then there are a few more complicated ones which are the main workhorses. So the union of graphs is the graph which accepts all sequences which are accepted by any of the input graphs. G1, G2, G3 are the input graphs on the left, and they each accept some number of sequences. The output graph on the right actually looks much the same, but the distinction is that it's a single data structure instead of three separate data structures: a single graph with three start nodes instead of three graphs, each with a start node. And you see it's very simple to construct this graph, because I just stitch them all together into a new graph which has three start nodes. It recognizes all the sequences that are recognized by the input graphs. There's another operation called Kleene closure, which computes the closure of an input graph. So if the input graph accepts a sequence, the closure of the input graph accepts zero or more repetitions of that sequence. Our input graph here accepts ABA; the closure graph accepts the empty sequence or one or more copies of ABA. Fairly straightforward. The way you can construct this graph, and it can be done very efficiently, is just by stitching the accepting node back to the start node with an epsilon transition, and making the start node accepting, to allow for accepting the empty sequence. Okay. So this next operation, intersection, is more sophisticated.
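The closure construction just described can be sketched directly. This is a toy version with my own representation; some real implementations add a fresh start node instead of reusing the existing one, but the idea is the same.

```python
EPS = None  # epsilon label: consumes no token

def closure(start, accept, arcs):
    """Kleene closure of an acceptor given as (start, accept, arcs)."""
    new_arcs = list(arcs)
    # stitch each accept node back to each start node with an epsilon arc,
    # so the accepted sequences can repeat
    for a in accept:
        for s in start:
            new_arcs.append((a, s, EPS, 0.0))
    # make the start nodes accepting, so the empty sequence is also allowed
    return set(start), set(accept) | set(start), new_arcs

# input graph accepts exactly ABA; its closure accepts "", ABA, ABAABA, ...
start, accept, arcs = closure(
    {0}, {3}, [(0, 1, "A", 0.0), (1, 2, "B", 0.0), (2, 3, "A", 0.0)]
)
```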
I'm actually going to go through how to do it, because I think it's perhaps the most important operation. It's kind of like matrix multiply or convolution for WFSAs and WFSTs, so it's good to get familiar with this one. The idea of what it's computing is straightforward, though. If I have two input graphs and I want to compute their intersection, that means I want the graph which accepts any sequence that is accepted by both input graphs. So it's just like the intersection and union of sets: the intersection of two graphs is the graph which accepts any sequence accepted by both input graphs. And the weight of those sequences will be the sum of the weights from the input graphs. So if graph one accepts a sequence X and graph two accepts a sequence X, then the intersected graph will also accept that sequence X, and the weight will be the sum of the weights that the two input graphs assign to X. So how does this operation work? Let's say we have these two input graphs and I want to compute their intersection. You can stare at this and say: okay, the graph on the right accepts the sequence AB. The graph on the left also accepts the sequence AB. So that should be in the intersection. Is there anything else? I don't think so; I think that's it. So the intersection is really going to be a simple graph which accepts only the sequence AB. We're looking for a graph with three nodes, where between the first two there's a transition on A, and between the second two there's a transition on B. So this one is really simple; you can get the answer in your head just by looking at these two. In practice, when the graphs get more complicated, you can't compute the intersection in your head like that, not even close. So let's actually go through the algorithm to see how it works for these two.
We start by considering the set of starting states in both graphs, and we construct, in our intersected graph, the combined starting state. So our intersected graph will have a starting state which is the combined starting state from the two input graphs. Then, from each of those starting states, we explore all possible outgoing transitions, and we ask the question: do these transitions have the same label? So we might consider the outgoing transitions on A; those do have the same label. So we add that edge to our intersected graph, at least to our current hypothesis for the intersected graph. And we'll also add the node pair which it leads to in each of the input graphs. In the left input graph it leads to zero, on the self-loop; in the right, it leads to one. So we'll construct a new state in our intersected graph, zero comma one, which just encodes the combined state: zero and one from the two inputs. Then I keep exploring the pairs of all possible arcs leading out from the two states in the input graphs. I look at the pair B, B; those two have the same label, they match. So I add that arc to my intersected graph, and it leads to a new state in my intersected graph that I haven't seen before, the combined state one, one. Then I consider the C, the outgoing transition from the zero state in the second graph. There's no match in the first graph; the first graph doesn't have an outgoing transition on C. So that edge gets dropped; we don't use it. Okay, and once I'm done exploring this pair of states, I move on to the next pair that I've added to my intersected graph. That would be the pair zero and one. And I ask the same question: which outgoing transitions from these two states match? I see they match on the A, so I add that A as an outgoing transition in the intersected graph.
And I've actually reached a new node in my intersected graph, the combined state zero comma two, so I add that new node as well. I look at the B; the B matches, so I add the B as an outgoing transition in the intersected graph. And something interesting has happened here: if I follow the B transitions in the two input graphs, I'm led to a state which is accepting in both input graphs. That means it should be an accepting state in the intersected graph, and we've marked it as such. So we now have an accepting state in our intersected graph. Again, there's no match for the C, so that gets ignored. Okay, now I explore the next state that I added to my intersected graph, which was the combined state one, one. It turns out this state is a dead end: there's no way to leave the state in the two input graphs on arcs which have the same label. So we've added a dead-end path to our intersected graph, and we can just remove it, because there's no use in having it. So we remove it. Then we explore the next state that we added to our intersected graph, which was the combined zero, two state. Again, this state is a dead end; there's no way to leave this state on arcs which have the same label in our input graphs. So we just remove that state. And then we look at the last state we added to our intersected graph. Again, there are no arcs to explore here; it is a dead end. But since it's an accepting state, we don't remove it. We keep it, because it's an accepting state. And there are no more arcs to explore, and no unexplored nodes in our intersected graph, so at this point we're done. We've computed the intersected graph, and lo and behold, it is exactly what we wanted: it encodes the sequence AB, and the corresponding score is just the sum of the scores from the input graphs. So that's how intersect works.
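The walkthrough above is a product construction, and it can be written out in a short sketch. This is a toy version with my own graph representation (a set of start nodes, a set of accept nodes, and a list of (src, dst, label, weight) arcs), not the GTN implementation, and it skips the dead-end trimming step described above.

```python
from collections import deque

def intersect(g1, g2):
    """Product construction: accepts sequences accepted by both inputs,
    with weights summed across the matching arcs."""
    (start1, accept1, arcs1), (start2, accept2, arcs2) = g1, g2
    start = {(s1, s2) for s1 in start1 for s2 in start2}
    queue = deque(start)
    seen = set(start)
    arcs, accept = [], set()
    while queue:
        n1, n2 = queue.popleft()
        if n1 in accept1 and n2 in accept2:
            accept.add((n1, n2))          # accepting in both inputs
        for s1, d1, l1, w1 in arcs1:
            if s1 != n1:
                continue
            for s2, d2, l2, w2 in arcs2:
                if s2 == n2 and l1 == l2:  # labels must match
                    arcs.append(((n1, n2), (d1, d2), l1, w1 + w2))
                    if (d1, d2) not in seen:
                        seen.add((d1, d2))
                        queue.append((d1, d2))
    return start, accept, arcs

# g1 accepts AB (weight 0.5) and AC; g2 accepts AB (weight 1.0) and BB;
# their intersection accepts only AB, with weight 1.5
g1 = ({0}, {2}, [(0, 1, "A", 0.5), (1, 2, "B", 0.0), (1, 2, "C", 0.0)])
g2 = ({0}, {2}, [(0, 1, "A", 1.0), (0, 1, "B", 0.0), (1, 2, "B", 0.0)])
start, accept, arcs = intersect(g1, g2)
```

In this example no dead ends arise, so the result is already trimmed: a chain of three combined states accepting AB with the summed weight.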
And intersect, if you didn't get it on that first pass, which is totally reasonable, is kind of a sophisticated algorithm. It is one of the most, if not the most, important operations for GTNs and these differentiable WFSAs. So I'd encourage you to go through some examples and work through the steps of this algorithm; you'll start to get an intuition. There's a question about how to deal with the numerical values alongside A or B while performing the intersection of graphs. Right, so let's look at a quick example. Early on, when I added this A transition, the weight I gave that new arc is the sum of the weights on the corresponding arcs from the input graphs. This isn't a great example, sorry, because the first graph has only zeros: so it's 0.4, which is zero plus 0.4. But if the weight from the first input graph had been one, it'd be 1.4. So here's another example of intersection, which I've included so that if you go back and refer to the slides, you can work through how you construct this intersected graph from the two inputs. Composition is basically the same thing as intersection, but the distinction is that instead of operating on acceptors (remember, acceptors accept a sequence), it operates on transducers, which are the graphs that map one sequence to another sequence. And instead of looking for the paths which match on what they accept, we're going to match on the inner sequence of the two graphs. What I mean by that is: if the first input graph transduces X to Y, and the second input graph transduces Y to Z, then we want our composed graph to transduce X to Z. It will match on the Y, splice it out, and transduce from X to Z; that's what we want our composed graph to do. The scores will be the sum of the scores of the paths in the input graphs, just as in intersect.
And in fact, the algorithm you use for this is virtually the same: instead of matching on the labels, you just match on those inner labels instead. So I'm not going to go through it again, but here's an analogous example which you can work through, showing that given the input graphs G1 and G2, I can construct their composition. The main thing about composition is that it lets us map between different domains. Say I have a graph which maps from letters to words, and I have another graph which maps from words to sentences or phrases. If I compose those two graphs, my composed graph will map from letters to phrases. That can be a powerful concept, composing these graphs through multiple hierarchies of representation. So, just as we had our forward algorithm with CTC, we have a general forward algorithm on these graphs. It's virtually the same algorithm, except that instead of having a fixed number of inputs at each node, we take all of the possible incoming arcs, and instead of each node being used for a fixed number of outputs, it can be used for an arbitrary number of outputs. But the way we combine scores at each node is the same. The forward algorithm does assume that the graph is a DAG: that it's directed and doesn't have cycles. What this forward algorithm is doing is giving us a way of efficiently computing the sum of the scores of all of the paths represented by a graph. That's a useful operation; it lets us ask, what's the score of this graph overall? For this example, you can compute the forward score explicitly by delineating all possible paths. This graph accepts three sequences: ACA, going from zero to one to two to three.
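As a quick aside, here's what that forward computation can look like for an acyclic acceptor, as a toy sketch (my own representation, not GTN's implementation): visit the nodes in topological order and accumulate path scores with log-sum-exp.

```python
import math

def forward_score(nodes, start, accept, arcs):
    """Log-sum-exp of the total weights of all accepted paths.
    `nodes` must be listed in topological order; arcs are
    (src, dst, label, weight) tuples."""
    NEG_INF = float("-inf")
    score = {n: NEG_INF for n in nodes}
    for s in start:
        score[s] = 0.0                      # the empty path into a start node
    for n in nodes:
        if score[n] == NEG_INF:
            continue                        # unreachable node
        for src, dst, label, w in arcs:
            if src == n:
                # log-sum-exp accumulation of all paths reaching dst
                a, b = score[dst], score[n] + w
                m = max(a, b)
                score[dst] = m + math.log(math.exp(a - m) + math.exp(b - m))
    finals = [score[a] for a in accept if score[a] > NEG_INF]
    m = max(finals)
    return m + math.log(sum(math.exp(f - m) for f in finals))

# two paths with total weights 1.0 and 2.0: forward score is log(e^1 + e^2)
nodes = [0, 1, 2]
g_arcs = [(0, 1, "A", 1.0), (0, 1, "B", 2.0), (1, 2, "A", 0.0)]
print(forward_score(nodes, {0}, {2}, g_arcs))
```

The sequence losses described below are differences of exactly this kind of score, one over a constrained graph and one over an unconstrained graph.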
The one state is also accepting here, so we can start at one, and so the path CA is also allowed, and then the path BA. The forward score of this graph will be the log-sum-exp, a soft maximum, of the scores of those paths. So, as we said earlier, we can construct sequence criteria, loss functions, from these graphs. If you remember, we had the CAT graph from earlier, which I showed you when we were looking at alignments for CTC. We'll say that we have very similar graphs here. Our target, instead of CAT, is just AB, and the graph on the upper left encodes the set of allowed alignments of the sequence AB. Then the graph on the upper right would be the set of all possible sequences of length four over the alphabet A, B, C. The way to think about that graph on the upper right, the emissions graph, is that between each pair of nodes you have the set of logits from your network, an unnormalized distribution over your alphabet at that time step; that's encoded by the weights. If I then intersect these two graphs, I compute the target constraint graph, which I'm calling A, and it represents all possible alignments of length four for the sequence AB. If I then take the forward score of this target constraint graph, that gives me the sum over all possible alignments. And I can also normalize by the sum over all possible unconstrained alignments. In summary, I have the sum over all possible alignments for the target that I care about, whose score I want to increase, and I have the sum over all possible sequences, not constrained by the target, whose score I want to decrease. That would be how we build a very simple sequence loss function using these graphs. This isn't CTC, but it's
approaching CTC. Okay, let's get familiar with a little bit of code. I'm almost done here; I'm going to give a couple of examples in code and then we'll be finished, but if there are any questions, feel free to jump in. So, just to give you a flavor: we have this framework called GTN, which lets us construct these graphs, compute on them, do the operations, and then do automatic differentiation so that we can learn using them. To show how to construct such a graph in this framework: I make my graph, and you can see in the first line, after importing the framework, that for this graph I don't want a gradient, so I specify that in the calc_grad parameter. I can add nodes to the graph and designate whether or not they should be start or accept nodes: I add the zeroth node and make it a start node, and the final node is an accept node. Then I add arcs, and the way I add an arc is I specify the source node, the destination node, the label, and in some cases the weight, if I have one. And then we can draw these graphs. This graph was drawn using exactly this function; there's a utility that lets us draw them, and I have to specify the label map so it knows what the integer labels correspond to. There are also helper functions to make graphs from arrays or tensors. So if you take the output of a network as a 2D array of logits, where you have four time steps and at each time step you have three logits, that's the emissions array here, and I can make what we call a linear graph using the logits for each time step. That graph looks like the graph you see below; that's the emissions graph. It just encodes, between each pair of nodes, the set of logits for that time step. And for this graph we do want a gradient, so calc_grad should be true. I make the graph, it has the right structure, and then I set the weights based on the array
the NumPy array, in this case. So now I can compute that loss function I was showing you earlier. In GTN, this is what the actual code looks like. I'm given my emissions and the target graph; these are two graphs. I compute their intersection to get the constrained graph, the graph which encodes the alignments of the length that we care about. I have my emissions graph, which is the Z graph; I don't need to do anything to it. I compute their forward scores (this is a function in GTN, an operation), and I compute the loss, which is essentially the difference of the two forward scores. And then in GTN, we can also clear the gradients stored on the graphs and automatically differentiate through all the operations we've just performed. So when I call backward on the loss, what happens under the hood is a chain of gradient computations through the operations, out to the very leaves of the graph of computations we constructed. And then that's it: I can return my loss, I can return the gradients, and use them as I please. So this is the full loss function. As you can see, it's not a lot of code; the main code was constructing the graphs themselves, and the loss function itself is actually very generic. In fact, if I wanted to make CTC from this loss function, what would change? Actually, nothing: it's exactly the same code. The only difference is how I specify the target alignments graph. Instead of using that simple structure, I use the graph I showed you earlier, which encodes the fact that there can be blanks and that those blanks can be optional. But the loss function, where I write the code to compute the loss, is identical. This is one of the big benefits of operating on graphs: it makes it really easy to try different algorithms. The other thing I want to point out before I conclude is that this code is really meant to give you a flavor of how things work in GTN. There are parallels between this framework and something like PyTorch. If you didn't know that this was GTN, you might think this was
PyTorch: it's computing a loss on some data structures, it's computing gradients, it's calling backward to do automatic differentiation. The only difference is the data structure and, of course, the operations we can use on that data structure. But otherwise there are quite a lot of parallels between operating on graphs and operating on tensors. So that's all I have on GTNs. Thank you all for listening and for inviting me to lecture today. That was great; I think this lesson was absolutely magnificent. Thank you so much for the lecture. My pleasure. There were quite a few questions that we answered in the chat while you were talking. Thanks for doing that. Happy to share slides or anything like that; there are some references here for people that are interested in learning more. If students are interested in communicating and writing something to you, how can they get in contact? My email is on the first slide, but if you google my name, it will lead to my website, where my email is also available. That's great. Thanks again for being with us. Everyone have a nice rest of the day and rest of the week. Take care, everyone.