Thank you for the opportunity to speak today. I'm going to talk a little bit about conversational language understanding; just wait for the slides to come up. Actually, it's going to be a bit of a review of some of the approaches people are taking. But first, I'd like to motivate this area of research, and this type of work, by giving a demo. It's a recorded demo. In this first one I'm trying to find a restaurant in Edinburgh with a view of the castle.

Me: Is there anywhere that has a nice view of the castle?
System: One review of The Outsider says to get a table with a view of the castle if you can; it's well worth it.
Me: What's it like, what else do the reviews say?
System: One review of The Outsider said: fun, good food and friendly service.
Me: Do they have vegetarian options?
System: According to one review, the restaurant had a good choice of vegetarian options.
Me: Can I book a table for two people?
System: Sure, let's make a reservation at The Outsider. What date would you like this booking for?
Me: For tomorrow.
System: What time would you like the reservation for?
Me: For 7 PM.
System: OK, that's a reservation for two people at 7 PM on the 20th of June 2018 at The Outsider.

Let me get the slides back. OK, let's start again. Cool. So there you could see that I was able to home in on a particular restaurant, ask fairly free-form questions about it, get satisfactory answers, I think, and then go into booking, where it needed some information from me: how many people, a time, and a date. Just to demonstrate the open questions you can ask, here's another quick clip.

Me: What was that cafe where J.K. Rowling wrote Harry Potter?
System: According to one review of the Elephant House, a lot of cafes say that they are where J.K. Rowling wrote Harry Potter.
Me: Does this place have good service?
System: One customer wrote: this place is rather crowded and the service was rather slow.
Me: And what's their address?
System: The address of the Elephant House is 21 George IV Bridge.
Me: OK, can I book a table?
System: Sure, let's make a reservation at the Elephant House. What day and time, and for how many people?

Cool. So that hopefully gives some motivation for the type of thing we're building at PolyAI. I'm now going to go into the conversational language understanding technology that powers these spoken dialogue systems.

What is conversational language understanding? For me, it's some method of deriving a representation which is then used by a spoken dialogue system to decide what to say next. That representation could be something explicit, like "the food has to be French, it cannot be Italian, and it needs to be in the King's Cross neighbourhood". Or it could be something like a vector in a vector space: an implicit semantic representation, a hidden layer somewhere in the middle of a neural network that's trained on a more end-to-end task.

I'm going to talk mainly about three paradigms for doing language understanding in the context of a conversation: slot-based systems, ranking systems, and generative models.

So, first, slot-based systems. Here's a dialogue in a slot-based domain: finding a restaurant in Cambridge. On the right-hand side you can see the dialogue state, which is composed of slots and values. At the beginning the user is asking for the cheap price range in the central area. In the second turn you can see that they can add to their goal, and they can change their goal, say from food=Indian to food=Chinese. You can also have special slot-value relations indicating, for example, that they're requesting the phone number at the end.
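To make the slot-value idea concrete, here's a minimal sketch in Python of a dialogue state being updated turn by turn. The slot names and the simple overwrite rule are illustrative only, not the actual DSTC tracker:

```python
# A dialogue state for the Cambridge restaurant example: a mapping
# from slots to the user's latest goal values.
state = {}

def update_state(state, turn_goals):
    """Overwrite slots with the values the user informed this turn."""
    state.update(turn_goals)
    return state

# Turn 1: "a cheap restaurant in the centre serving Indian food"
update_state(state, {"pricerange": "cheap", "area": "centre", "food": "Indian"})

# Turn 2: the user changes their goal from Indian to Chinese
update_state(state, {"food": "Chinese"})

# A special slot-value relation: the user requests the phone number
requested = ["phone"]

print(state)  # {'pricerange': 'cheap', 'area': 'centre', 'food': 'Chinese'}
```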
So that was an example of a dialogue from the Dialogue State Tracking Challenge, which has slots with categorical values: five areas, three price ranges, and many more food types. But you might be thinking: is that really covering restaurant search in a city? I would maybe ask about the atmosphere, whether there's good service, whether they have vegetarian options, like in the demo. I'll come back to that topic later. Table reservation is maybe a more obvious slot-value domain, where you need to constrain the date, the time, and the number of people. There, the values aren't really categorical: there's a whole range of times, for example, and a whole range of ways to refer to those times. Flights are another example that might be slot-value based: we need to query a database at some point to see if there is a flight from this airport to that airport on this particular date.

So how do people track the dialogue state? One method is word-based recurrent neural networks, where the recurrence is through the dialogue turns. At each step the network takes some input, the text or spoken language from the user together with what the system had just said; it updates its state internally and then produces a distribution over the slot's values for that turn. That way it learns jointly to understand the latest input, to update its state, and to output a new value. Typically you'd have one model of this type for each slot in your domain.

One of the big issues you come across with slot-based systems is that the label space is very sparse, so the training set might not contain examples of labels that you will then see in the test set. There are many different food types, and in the Dialogue State Tracking Challenge there were several food types in the test set that didn't appear in training. But there are also lots of different ways you might specify that you want a given type of food. If you did this naively, you would need an example of each different phrasing for each different food type, and that would mean you need a lot of data, which you just can't get.

One way of dealing with that is what's called delexicalisation, where you replace mentions of slots and values with a generic tag. Here, all of a sudden, the first three sentences look very similar to each other. This allows you to generalise across different slot values: you can track new values that you've never seen in training, so long as you can delexicalise them, and you can also bootstrap to a new domain. A model trained on restaurants might do something sensible for the slot values that matter in, say, hotels.

Another way of getting the generalisation needed to deal with the sparse labels is to exploit pre-trained word embedding spaces. This is a diagram of the neural belief tracker. The key idea, and it's not super important to understand the whole diagram, is the term d = c × r: the model compares a representation of the context, c, with a representation of a candidate slot value, r. So long as you can provide a vector for a food type, say Persian, and a vector for a word in the user's sentence, like Farsi, you hope that in the pre-trained vector space they will be close together, so you'd be able to identify the word as a mention of that slot value, and the model generalises in that manner.
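Here's a toy sketch of that comparison, using cosine similarity between word vectors. The three-dimensional vectors are made up for illustration; a real system would use pre-trained embeddings such as word2vec or GloVe:

```python
import numpy as np

# Made-up "pre-trained" vectors: semantically related words are close.
embeddings = {
    "persian": np.array([0.90, 0.10, 0.30]),
    "farsi":   np.array([0.85, 0.15, 0.35]),
    "cheap":   np.array([0.10, 0.90, 0.20]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# "Farsi" in the user's sentence should score highly against the
# candidate food-type value "Persian", and poorly against "cheap".
print(cosine(embeddings["farsi"], embeddings["persian"]))  # close to 1
print(cosine(embeddings["farsi"], embeddings["cheap"]))    # much lower
```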
So, I mentioned the label space. When you're developing a slot-based dialogue system, you need data to train the natural language understanding component, and there are three main ways to get this type of annotated data. The first is just to launch some system, log the interactions it's having with people, and label that data. Obviously that assumes you have some initial system you can launch, which might be an issue. The second idea is to simulate fake conversations: you have a computer talking to a computer in the label space, that is, the explicit semantic annotation space, and then hire Mechanical Turk workers to translate from the labels to natural language, real English or whatever language you're using. You end up with dialogues in natural language that are fully annotated. The last idea is a Wizard-of-Oz setup. You pair one worker with another: one plays the part of a user, say someone trying to find a restaurant in Cambridge, and the other plays the part of the system, with all the controls the system might have, and you record everything they do. You end up with fully annotated dialogues, and it also gives you some idea of what a human would do in that situation, so it might help you train a decision-making component as well.

As for the challenges of slot-based dialogue systems, the main one for me is that they can impose an artificial structure on dialogue that's suboptimal. If you were to look at real conversations about restaurants, I doubt you could annotate them with the slot-value pairs used in, for example, the Dialogue State Tracking Challenge. While that structure is useful for connecting to APIs and database queries, it's maybe not exactly how people talk. And in order to really exploit the system and make it useful, the user kind of needs to know ahead of time what slots and values it has, what kinds of things it can deal with. How do you let the user know that we deal with area, but we don't deal with the quality of the service at a restaurant?

Another issue with the large label space, apart from generalisation, is that the usual way of dealing with it is to factorise the dialogue state into different components and have a separate model that tracks or classifies each aspect of the state. That means you've got a lot of models to take care of, which is an engineering problem: every time you want to deploy something new, you need to check that each model is getting good enough accuracy and that they all play well together. Also, resulting from this difficulty in getting data, you just don't end up with as much data as you would in other areas. You can apply machine learning to this task, but machine learning excels when you've got millions of examples, and here you're not going to have millions of examples.

The next paradigm for conversational language understanding I'm going to talk about is ranking. I'll explain this in the context of the model that powers Smart Reply in Gmail. It uses a model that scores input emails and responses together: it gives a high score to a response that goes with a given email, and a low score to a response that looks random in the context of that email. That score is actually a dot product between two vectors, one representing the input email and one representing the response, and those vectors are the final hidden layers of some deep network.
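Here's a minimal sketch of that dual-encoder scoring shape. In the real system the two encoders are deep networks; here they're untrained random projections over a tiny bag-of-words vocabulary, so the scores themselves are meaningless and only the structure matters: encode each side once, then score with dot products.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny vocabulary and two separate (untrained) encoders, one for
# contexts and one for responses.
VOCAB = {"how": 0, "are": 1, "you": 2, "fine": 3, "thanks": 4, "bye": 5}
DIM = 8
W_ctx = rng.normal(size=(len(VOCAB), DIM))
W_resp = rng.normal(size=(len(VOCAB), DIM))

def encode(text, W):
    """Bag-of-words encoding: average the word projections."""
    ids = [VOCAB[w] for w in text.lower().split() if w in VOCAB]
    return W[ids].mean(axis=0)

# Pre-compute all candidate response vectors once...
candidates = ["fine thanks", "bye"]
R = np.stack([encode(r, W_resp) for r in candidates])

# ...then scoring a new context is one encoding plus dot products.
scores = R @ encode("how are you", W_ctx)
print(candidates[int(np.argmax(scores))])
```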
So there are a couple of reasons why factorising the score as a dot product is a good idea. One is during training. Imagine that these x and y are emails and responses that go together: x1 is a vector representation of the first email, x2 the second, and so on, and y1 is the vector representation of the response that goes with x1, and so on. Then this matrix is the N-by-N matrix of all possible scores in a training batch, and it's just one fast matrix multiply away from the x and y vectors. We want the diagonal, the scores of emails and responses that go together, to be high, and the off-diagonals to be low, which we can learn with a softmax loss. That gives us a lot of signal for learning, and it's very efficient: we manage to do N-squared comparisons at close to linear cost. Also, during inference, we can pre-compute the vector representations of all the responses and then just do a simple search.

Here the training data is much easier to collect. It doesn't need any special annotation; we just need pairs saying, in this conversational context, this was the response, and we learn what responses are plausible in different contexts. It's also very easy to constrain the output, because we control the candidate set of things being ranked. We can ensure, for example, that every sentence was written by a human, or that every sentence has been approved, if you're worried about what the system can say. And, a big thing, it has implicitly learned its representation of meaning. It learned that vector space; the representation wasn't hand-engineered like in the slot-based systems. Instead it was directly optimised, as part of the learning process, on the data for the task at hand.

The last paradigm I'll talk about is generative models, which is to say sequence-to-sequence. The first application of this to dialogue would have been Oriol Vinyals's neural conversation model. The idea is that you train a neural network to encode a conversational context and then produce a response word by word, in a generative fashion. Here you've fed it "how are you?" and a special transition token, and then, like a language model, it generates "fine thanks" word by word. This is how state-of-the-art translation works, where you'd feed it, say, French and it would output English. But I see translation as a somewhat simpler mapping than the mapping from inputs to responses. These models just model the data they're fed, and they can have this issue of learning a kind of blurry model over the data, so that when you go to generate from them, they produce very common responses that don't seem to depend on the input. There are special tricks to overcome that, but if you didn't do anything, the model would just say "I don't know", or "I love you", or whatever is common in the data.

Also, of course, the complexity depends on the sequence length: you have to run the RNN step by step for each word that you're outputting, whereas ranking is just one pass through the network. And it's hard to constrain the output in the way you can constrain the output of a ranking system, so it's difficult to apply this to a useful task. A demonstration of that: if you trained this model on some conversational data set and asked it, "what's your name?", it would just sample a name from the distribution of names in the data, like "I'm Bob". If you asked it again later, there's no guarantee it would be consistent with itself; it might come up with some other name from the data. So if you want it to have a particular personality, or to do a particular task, it's difficult.
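To illustrate the per-word cost, here's a toy greedy decoding loop in the same spirit. The weights are random, so the output is gibberish; the point is just that one RNN step runs for every generated word, in contrast to the single forward pass of the ranking approach:

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB = ["<eos>", "fine", "thanks", "i", "don't", "know"]
H = 16
E = rng.normal(size=(len(VOCAB), H)) * 0.1   # word embeddings
Wh = rng.normal(size=(H, H)) * 0.1           # recurrent weights
Wy = rng.normal(size=(len(VOCAB), H))        # output projection

def decode(h, max_len=10):
    """Greedy word-by-word decoding from an encoded context vector h."""
    words, prev = [], 0                      # start from the <eos> token
    for _ in range(max_len):
        h = np.tanh(Wh @ h + E[prev])        # one RNN step per output word
        prev = int(np.argmax(Wy @ h))        # greedy word choice
        if VOCAB[prev] == "<eos>":
            break
        words.append(VOCAB[prev])
    return " ".join(words)

h0 = rng.normal(size=H)  # pretend this vector encodes "how are you?"
print(decode(h0))
```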
So, to summarise the pros and cons of the different paradigms. First, slot-based systems: they're really good if you want explicit structure, if you need to make a particular API call, like booking a table, for example. However, the structure you impose can lead to artificial dialogues; who says that's how people will approach your system? It's an engineered label space, and you don't end up with a lot of data. For ranking, you can use a lot of data; you can constrain the output of the system, to make sure it's human-written and to constrain it in any other way you like; and it has learned an implicit semantic space. On the other hand, it's challenging to connect it to an API call: how would you extract an explicit representation in the way you can from a slot-based system? Generative methods have the one advantage over ranking that they can generate new sentences, but that's also a disadvantage: they can generate new sentences. If you're a company that cares about its image, and you have an assistant that represents you and can theoretically generate any possible English sentence, there are some obvious issues with that. And the issue of domain adaptation is tricky: there are ways to tweak a model trained on general text to give it a particular personality, but that's very difficult to measure and very difficult to get right.

So, thanks for your attention. I'm representing PolyAI; we're a new company of about 12 people, two of us in Singapore, including me, and we're working on conversational systems. If there's something here that you're interested in, please come and talk to me. Thank you very much.