Okay, so there will be three parts today. The first is this talk I give this morning, which is about, as I said, modelling dialogue as a sequential decision-making problem, and I will explain how, theoretically, you can optimize this kind of sequential decision-making problem. Then in the second part, this afternoon, Melissa will talk about how to do this in practice, that is, how to scale up the algorithms and how to represent dialogue in this framework. And in the third part we will have a practical session where we will have the opportunity to use real dialogue data, manipulate it, and cast it into this framework of statistical modelling of dialogue systems.

First of all, I want to thank the people who worked with me on the things I will explain today. Matthieu Geist was a colleague of mine when I was working previously at Supélec in France; Senthil is here in the room; and you also see Leila, Edouard and Bilal, who were PhD students of mine and worked on these topics with me. And of course I thank Yannis for inviting me here; it has been a wonderful week thanks to him.

So let's start with our problem of spoken dialogue systems. First, what is a spoken dialogue system? It is a system that you interact with through speech. The idea is that all the interaction you want to have with the machine goes through speech: you speak to the machine, the machine speaks to you, and that is the only modality you use to interact with it. The machine, according to some computation and some knowledge it has, will answer you; you expect it to do so smoothly and to understand what you say, and according to the task it has to achieve, it will try to build a dialogue with you. So it is a long-term interaction with the machine. When I say long-term, it can be a few minutes or a few hours, but it is not a one-shot interaction where you give some speech input, you receive a speech output, and that's it. A dialogue is more elaborate: you have to interact with the machine for a while to build the dialogue over time.

There are different types of dialogue systems around, and I will focus on one type: the systems we call goal-directed. They try to achieve some specific task, to provide you with some specific information. It is not about chatting with the machine about anything you might want to chat about with somebody; it is really about achieving some task, and doing it through a dialogue with the machine. For example, hotlines for information services: you want information about the weather for the next week, or about train schedules, so you call a machine to get that information. So it is a really focused application; I am not talking about other, fancier applications you could imagine around open-ended chat. So what is the problem with dialogue systems, actually?
You have probably already interacted with dialogue systems in the past, because most telecom operators, for example, optimize some part of their hotlines: they try to identify what problem you want to solve before passing you to a specialist in that problem. For example, your modem is not working, so you call the system, you say you want to talk about modem problems, it recognizes that, and it probably routes your call to some specialists in modems. And what do those people usually do? They follow some decision tree: they ask you specific questions, what is your problem, and so on, and they figure out that you actually have a very common problem. Usually all of this can be automated; there is usually no need to interact with a real human, because your problem is really simple. It is the same if you want to book a train ticket: you don't have a very elaborate problem, you just want to go from one place to another at a specific time on a specific day, and the human you call always asks the same questions.

So the idea of spoken dialogue systems is to automate these very simple dialogues. You want to process a maximum of easy interactions automatically, and you want to avoid bad experiences for the user. You have probably also experienced problems when interacting with a machine: it doesn't recognize correctly what you say, it always asks you to confirm what you said, and so on. Today this is not very natural, and you can have a very bad experience with machines; what we want is to avoid that.

To avoid it, we need to understand that building a dialogue system is not just putting speech recognition together with text-to-speech synthesis. It is about building a real interaction over time with a human, and building that interaction means you need a strategy for how to interact with the human. If you want to build such a system by hand, you have to list all the situations the user can be in during the dialogue and attach to each situation a decision the dialogue system should take. So you really have to enumerate all the possible contexts in which the dialogue system and the user can be, and provide decisions to the system. That is actually very difficult to do.

It is especially difficult because of the kind of system a dialogue system is. Melissa will talk a bit more about this this afternoon, but I just want to say that a dialogue system is a system through which you interact with the machine over some channel, and the channel is noisy. The channel you use to talk to a machine is composed of speech recognition, speech understanding and also text-to-speech, and all of this is noisy, because none of these systems is perfect: speech recognition doesn't work 100%, and neither does speech understanding.
So all these modules that you are implicitly using when you talk to a machine produce noisy observations, and you have to use those observations to decide what to say to the user, according to what you perceive from the user's spoken input. There are two key components of a dialogue system that will be the focus of the two talks today. One is the task model, which transforms the input into something understandable for the machine, and the other is the dialogue manager, which is the module that, according to this representation of what you said, decides what to reply to the user. The first will be the topic of Melissa's talk this afternoon, and the second is more the topic of this talk this morning. When you build such a system that takes decisions, you need to take into account that you have a user speaking to a machine through a noisy channel, and this is difficult to handle in decision-making.

More specifically, what does the dialogue manager do? It is, let's say, the brain of the system. It decides what to do, what to say, what to ask, and when to say things to the user. It is also responsible for accomplishing the task of the dialogue system: when you decide what to say to a user, it is according to the task you are trying to achieve. If you are trying to sell train tickets, for example, then the dialogue should be conducted by the machine so that at the end you get a train ticket from it.

So the dialogue manager has to take complex data as inputs, it has to determine what to say and what to ask to the user and when to say it, and it is also responsible for recovering from errors. If the system didn't understand correctly what you said, you cannot go on with the dialogue like that; you need to recover from the error, by asking for confirmations or for corrections of the information it didn't understand. This is a very big problem in dialogue systems, because when a dialogue goes on with a machine that misunderstood what you asked at the beginning, and it keeps asking you questions about the thing it misunderstood, it is very difficult to come back to the beginning, to start again or to recover from this kind of error.

So the inputs are complex, and the outputs are also complex, because the decisions are concepts, semantic representations of what you want to say to the user. After the dialogue manager has made a decision, it outputs these concepts, they go through natural language generation to be transformed into text, and then from text to speech with TTS.

Let's take an example of a dialogue so that you better understand how it works. Here is a dialogue for a task that I will often come back to during the presentation.
The task is that you want to find a restaurant in a city, and there are three different features you need to specify to the machine so that it finds the restaurant you want: the location, the price range and the type of cuisine. You interact with the machine, and the machine should give you the name and the address of a restaurant that matches your requirements.

Of course the dialogue always starts with some greeting. This column is the internal representation of the dialogue that will be used by the system to take decisions about what to say next, and this is what the user and the system actually say. At the beginning the dialogue manager decides to greet the user, so it asks "What can I do for you?" or "May I help you?". Then the user says "Well, I don't really know", so the dialogue manager knows that the user said "well", nothing more. One thing I forgot to say is that in this example I will assume the machine understands perfectly what the user says, which is a very idealized situation, but I use it to keep things simple. Because the dialogue manager didn't get any information from the human at that point, it asks in which location the user wants to find a restaurant. This is represented internally as asking for the location; there are many ways of representing this internally, but this is a convenient one that we will often use in this talk. It translates into something like "Do you have a preferred location?", and then the user says "Well, I'd like an Italian restaurant by the river", so he actually provides two pieces of information, the type of cuisine and the location. The machine therefore gets two pieces of information from the user. The dialogue goes on, and after the system has got all the information, it provides the name of a restaurant to the user; the user is happy and the dialogue completes. Everything was perfect: the system understood perfectly, it decided to provide information at the right time, and it decided to ask for missing information because it hadn't got it from the user before. So there are different decisions by the dialogue manager, taken according to the information it got from the user over time: it got two pieces of information here, it missed the price range, it asked for the price range, then found a restaurant matching the requirements and concluded the dialogue.

What we want is to learn these decisions made by the dialogue manager automatically, so that they can be transferred from task to task, learned from data, and so on. But why is it so difficult to learn the dialogue strategy? It is because you are interacting with a human, and interacting with a human makes several troubles arise. First, a human is non-deterministic. If you ask a question to a human, the same human in the same situation but on a different day will answer you differently, and a different user will also answer differently to the same question.
So you cannot expect to get the same answer to the same question in different situations, or in the same situation with different users. This is very hard to model, because if you wanted to model the human you would need a model of human cognition, of how speech is processed by the brain, and so on, and the personality of people would also modify the way they talk to a machine. This is very difficult to model, so we should avoid building models of users, of humans.

There is also a problem of risk management. It is not really a physical risk when you interact through speech, but you do risk annoying the user, or giving the user a bad experience when interacting with the machine. You should try to avoid those risks, because the user can simply hang up if he is fed up with talking to a machine. That is a risk you have to manage.

There is also a problem of non-stationarity. Even if a human is probabilistic, with some probability of answering one thing or another to a given question, these probabilities can change over time, because the user gains experience with the system. For example, this is the third time you call the system to get restaurant information, so you know how the system works, and you will change the way you interact with it, because you know it better than you did three weeks before. This makes the probability of answering something to the machine change with time. This is also very difficult to handle, at least by humans: if you try to handcraft a dialogue strategy with human experts, this non-stationarity is very hard to account for, because you have to anticipate how the user will modify his behaviour over time.

And then there is the problem of uncertainty in observations. You observe the human's activity through speech, which means through speech recognition and speech understanding. You don't observe the mind of the person directly; you don't really know what they want. You only have observations through speech; speech is already noisy, you don't always say exactly what you really mean, and then it is translated into some internal representation by automatic systems, which is noisy too, so your observation of what the user wants is uncertain.

You have to deal with all these things when you interact with people. You have to handle stochastic user behaviour. You also have to handle the fact that the user's goal can change during the dialogue: for example, you figure out that there is no cheap Italian restaurant by the river, so you may change your mind and go for an Italian restaurant close to the train station, because there is a cheap one there; you changed your goal over time. There is uncertainty due to imperfect input processing. And you want the user to have a good experience with the machine, which is easy to say but very hard to model: what is a good experience, what is a bad experience?
So it is very hard to model this before you interact with the machine: after interacting you can know that it was bad, but you cannot tell in advance that it will be bad, and that is difficult. As I said, the user's behaviour may also evolve with time. You would also like to reuse what you have done across tasks, and this is very difficult with a handcrafted strategy: if you move from train ticket booking to tourist information or restaurant information, how do you reuse what you did for the first task in the other one? That is very difficult to do by hand, so we would like to do it automatically. And there is a very big issue, which is the large number of possible states, of possible situations the system can be in: the number of things the user can say is very large, and you have to handle all of them and take decisions in all of them.

Because of all these challenges we think about using machine learning, because machine learning can automatically handle these things: stochasticity, non-determinism, maybe non-stationarity, and so on. And actually machine learning is used everywhere in dialogue systems nowadays. It has been used in speech recognition since the 70s with HMMs, and today the fancy way to do speech recognition or speech understanding is deep neural networks. Machine learning evolved over time in spoken dialogue systems, but mainly in the blue parts of the architecture: you can find a lot of literature about using machine learning to solve the problems of speech recognition, speech understanding, generation and TTS. There is much less work about task modelling, which will be the topic of this afternoon, and little work about dialogue management, about how to use machine learning to learn a good dialogue strategy. That will be the topic of this morning.

In all the blue modules you find mainly supervised or unsupervised learning, so classification or clustering and things like that. If you do speech recognition, you train an HMM; to train an HMM you need examples of speech with their phonetic transcription, and you train it by the EM algorithm, so it is a matter of supervised learning. You can also use unsupervised learning, for example for the initialization of HMMs or of deep networks; nowadays deep networks can be trained directly with supervised learning, but, say, five years ago you needed some unsupervised learning to initialize the deep network.

But can we use supervised or unsupervised learning to find dialogue strategies automatically? Actually, no — at least at first sight; I will explain later that in some sense you can, but at first sight it is very difficult to cast dialogue strategy learning as supervised learning. Why is that? Because finding a strategy for a dialogue means finding a sequence of decisions. It is not a static decision, not a matter of matching one situation to one decision; it is about finding the best sequence of decisions that will lead the system to provide you the right information over time. As I said, the dialogue is built over time, and what you want is to learn the different decisions
that will lead most naturally, or most efficiently, to providing the user with the right information. So it is a sequential process, not a static problem; it is not about matching situations, inputs, to outputs, it is about finding a sequence of decisions.

Of course, there is also the problem that the inputs are multi-dimensional and hybrid. By hybrid I mean that they can be numbers or words: you are transmitting words to the machine, which is symbolic information, but you can also have numerical information that you send to the machine, and semantic information. This is also very hard to use: how do you model such a hybrid space?

Another problem arises when you try to learn by supervised learning. Let's say there were a method that would allow you to learn a sequence of decisions by supervised learning (there isn't, but suppose there were). Then you would need examples of perfect sequences of decisions to imitate, and these do not exist. Human handcrafted strategies have no guarantee of optimality, because you cannot think about all the possible situations in which the system can end up, you cannot think about all the possible inputs from the human, so you cannot have an optimal decision for all these inputs. There is no way to guarantee the optimality of handcrafted strategies.

[There is a short exchange with the audience: a participant points out that, unlike a machine, a human has experience, and that this experience is important knowledge we could try to use.]

Okay, so for the second argument: in machine learning there is what we call the bias and variance trade-off; I will come back to that later. Having too much data is not good for generalization either, so it is not only about data; it is about how to correctly model the state space, the generalization framework. For example, SVMs are non-parametric machine learning methods where you automatically find a way to represent your data and to generalize as well as possible. So this is more about generalization than about getting a lot of data: you need the right data, not many data; you need the right data and the right representation of your input space.
That was for the second point. For the first one: humans have experience of talking with humans, without noise. With noise, you would have to model the noise arising in speech recognition so that the human receives the same corrupted information; you would have to simulate speech recognition errors, transmit them to a human, and look at what the human would decide when facing that kind of noise. Humans are not experts at interacting through speech recognition systems. This is why you cannot just transfer human knowledge to systems: the systems' perception is much noisier than a human's, and this is why handcrafted policies come with no guarantee of optimality. Of course a handcrafted strategy is a good starting point, and you can use it as a starting point for learning, but you cannot expect it to be the example of what you should learn.

And by the way, if you have any question, feel free to stop me at any point, because there might be some tough points; the idea is to have a dialogue about dialogue systems.

Okay, so we actually have other constraints that we want to take into account to find the best machine learning technique for spoken dialogue systems. As was just said, you would like to use recorded data: there are many dialogue systems around, they all produce logs, and you would like to use the logs of existing dialogue systems to learn the best strategy from them. So you would like to learn in batch. But of course all these logged dialogues were not perfect, so you want to improve over time, and you would also like to handle non-stationarity, so you would like to learn online as well. That is a constraint: you want a method that is able to learn both in batch and online. You also want a machine learning technique that is sample efficient, because dialogue data are very hard to collect: you have to annotate a lot of semantic information, and so on, so collection and annotation are expensive, and you would like to need only a small data set to learn. And if you learn online, you don't want the user to be just the trainer of the system for years; you would like to learn online very efficiently, so that each interaction brings some improvement to the system. You want to learn without disturbing the user: not just trying things to see what happens, that is something you want to avoid. You want, of course, to scale up, to deal with uncertainty, to reuse across tasks, and to track optimal solutions. Tracking is also a very important feature: since the user can change his behaviour, you want to track over time what the optimal solution is, with respect to this moving behaviour
and of course with respect to the stochastic behaviour of the user.

So now let's see what we have in the literature. In the literature we have supervised learning: neural networks, SVMs, HMMs, and so on. What is supervised learning, formally? It is a method that learns a mapping between inputs and outputs given some data, given an oracle: you have data of X and Y, and you want to find a function f(X) that gives the right Y for any X in the input space. You have labelled examples and you want to find the true mapping between inputs and outputs; that is supervised learning. Unsupervised learning is a way to structure data: you get a bunch of data, you don't have labels for them, but you know, for example, that they come from n clusters, and you want to cluster the data into n clusters so that each cluster contains similar data points. There is no labelled example provided, just a bunch of inputs to cluster.

Supervised learning will not work here, because you would need examples of perfect sequences of decisions to imitate; you would need an oracle giving you these perfect sequences of decisions. Unsupervised learning will not work either: to use it to learn dialogue strategies you would need plenty of different strategies and then cluster the good ones and the bad ones. The problem is how to represent a strategy. You can usually represent a strategy by its output, the sequences of dialogues it generates, but these sequences can have different lengths, and supervised and unsupervised learning usually assume that all the vectors you try to cluster have the same size. So it would be very difficult to use unsupervised learning tools to learn dialogue strategies.

And there is a third learning method that you can find in the literature, which is reinforcement learning; it is usually less well known. Who in the room knows about reinforcement learning? Reinforcement learning is a third type of learning, which is learning by being rewarded. You don't learn a mapping from inputs to outputs: you learn to behave, to take decisions over time. It is naturally an online kind of learning — I will show you afterwards that we can make it batch, but it is naturally online: you try things and keep the best, so you learn by interacting, not by observing a batch of data, and it learns sequential decision-making.

How does it work? It works based on rewards: you learn from being rewarded for good decisions and punished for bad decisions. The idea is that you have an agent that is supposed to learn, and this agent interacts with some environment. Interacting means: I decide to do something, I select an action in my environment, and my environment, or at least my observation of it, changes over time. For example, this agent, which is a mouse, wants to navigate a maze to find some cheese. What are the actions the mouse can do? It can go up, down, right or left.
These are the actions. So if I decide to go down, my observation of the environment changes: I don't see the same walls at the same places anymore, and my position changes. So my state, my situation, changes; both the way I perceive the environment and my actual situation change with time. If I move down I don't get any cheese, so I get a reward of zero; I will only get a reward of one when I get the cheese. So if I move down, then right, then down, and go around, and I find the cheese there, I get a reward of one.

This is actually a very important modelling point for dialogue systems, because, as I said, you can tell after you have interacted with the machine whether it was bad or good. The machine interacts with you, and at the end of the interaction the user just says: was it good? No, it was not good — minus one; yes, it was good — plus one. And you learn what to do so as to collect the maximum of rewards. The idea is not to tell the machine what to do or how to do things; it is not about saying to the machine: at that position in the dialogue you should have said this and not that. It is like the mouse: you don't tell the mouse, there you should not have gone this way but that way. It is just that once it reaches the cheese, it knows it was good. For a dialogue system it is the same: once the dialogue is over, you can say it was good or it was not good. Even if you cannot say where the error of the dialogue system was, you can at least say that it was good or it was not good. Okay, so that is the reinforcement learning paradigm; did you all get it?

[Question: is the reward procedure supervised, done by a transcriber, by a human?] Well, it is different for different applications, but it could be. You could also put expert information into the reward: you don't want the dialogue to be long, you want the user to get the information at the end, and all this can be included in the reward that you then optimize. So it could be the user saying "I liked it" or "I didn't like it", or it could be an expert saying "I want short dialogues, I want efficient dialogues, I want natural ones", and so on. You can use different ways of defining the reward. Okay, no other questions about the reinforcement learning paradigm? I will talk about this for the next hour and a half, so you'd better catch it now.

Okay, so more formally: reinforcement learning was introduced, I think, in psychology at the beginning of the previous century, so it is a very old model, but formally it was introduced like this by Bellman in the 50s. You have an agent which interacts with an environment, and the environment is a black box — it is drawn in black on purpose. You don't know how it works inside; all you know is that when you do some action, it gives you some state — you observe that the state of the environment changes — and it gives you some reward.
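Schematically, the interaction can be written as a loop. This is only a sketch: the `Environment`-style interface (`reset`, `step`) and the function names are hypothetical, just to show where states, actions and rewards appear in the trial-and-error process.

```python
import random

def run_episode(env, policy, max_steps=100):
    """One trial: interact with the black box, collect rewards, return the trace.

    `env` is assumed to expose reset() -> state and step(action) -> (state, reward, done);
    `policy` maps a state to an action. Both interfaces are illustrative assumptions.
    """
    state = env.reset()
    trace, total_reward = [], 0.0
    for _ in range(max_steps):
        action = policy(state)                        # the agent decides
        next_state, reward, done = env.step(action)   # the black box reacts
        trace.append((state, action, reward))
        total_reward += reward
        state = next_state
        if done:                                      # e.g. the mouse found the cheese
            break
    return trace, total_reward

# A purely random policy over four moves, as in the maze example.
random_policy = lambda state: random.choice(["up", "down", "left", "right"])
```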
So the idea of reinforcement learning is that I have to select the action according to the state I am in, so that I maximize a cumulative function of rewards. It is really important to understand that you want to accumulate rewards over time, not to maximize the immediate reward. You don't want to select, in a given state, the action that provides the maximum reward in that state; what you want is to collect the maximum of rewards over time. This is how you learn sequential decision-making rather than static decision-making. For example, when you play chess, what you want is to win the game. You don't care about losing pieces, you don't care if the opponent captures pieces along the way, as long as you win the game, so you can take decisions that look bad locally if they lead to winning the game. That is the whole purpose of reinforcement learning: to learn from delayed rewards; this is exactly what I will explain. By essence it is a trial-and-error process: you try, you get rewards, and you know whether it was good or not.

So the whole reinforcement learning problem is to learn a mapping, which we call a policy and which we will denote π, between states and actions. Given a state as input, you want a function that maps the state to an action, so that you accumulate the maximum of rewards over time. Note that reinforcement learning is defined as a problem and not as a solution: the problem is to find a mapping between states and actions so that I accumulate the most reward over time. There are solutions to that problem, algorithms for reinforcement learning, but reinforcement learning itself is defined as a problem.

Okay, so now coming back to dialogue: how do I cast a dialogue problem into a reinforcement learning problem? I have to define what the state is, what the actions are and what the reward is, and then I will be able to use reinforcement learning algorithms to solve the problem of dialogue strategy learning. The state is the context I am in: the information I have gathered so far from the user is the context, the situation I am in as a dialogue system. For example, in my restaurant information system, at some point I knew the location and the type of cuisine; these pieces of information are in the state. The actions the dialogue system can perform are communicative acts, like greeting the user, asking the user a question, or asking for confirmation about some information I got from the user.
These are the actions the system can perform. The reward is the most difficult thing to define when casting dialogue systems into reinforcement learning. Usually we would like to optimize user satisfaction: you got what you wanted, in a natural way, and so on. This is very difficult to model in practice, and there is still ongoing work on defining the best reward, but the easiest way to optimize a dialogue system would be to ask people: was it nice, or was it bad?

So, coming back to the dialogue task I was talking about before: there are three pieces of information you want to get from the user, and they will be part of the state — did I get this information or not? We will build here a very simple state representation which does not take into account what the answer to the question was, only whether the question was answered. So there are three pieces of information you want to get from the user, and these three pieces of information will be part of the state: did I get slot one, slot two and slot three? Then I will add two more pieces of information to the state. The first one is: did I greet the user? Because I should start with that. The second is: did I provide information to the user? Because if I provided the information, it is probably time to conclude. These pieces of information are important for taking decisions: did I get the slots, did I greet the user, and did I give the user information about a restaurant?

The actions the system can perform are asking for a slot, confirming a slot, greeting the user, informing of the results, and saying bye, which closes the dialogue. And the reward can be just plus one if you provided information about a restaurant to the user, and zero otherwise. This is a very simple way of describing the dialogue — it is not the best way, I am aware of that — but it is just to show you how it evolves over time.

At the beginning I showed you a dialogue with spoken outputs; here it is the same dialogue, but with the state evolving over time. The system greets the user, so the state goes from zero everywhere to one for the greeting flag. The user didn't say anything, so the state doesn't change. Then the system asks for the location; the user provides information about the location, but also about the type of cuisine, so location and type of cuisine are filled. Then the system asks for the price range, gets the price range, and then informs you about which restaurant corresponds to your requirements, and the state becomes one everywhere. When it is one everywhere, it is time to conclude: the system should say bye. So this is how the state evolves over time, and it is according to this vector that the machine should learn, for example, that if I have greeted the user and I don't have any information, then I should ask for one slot. That is the decision learned to match this state.

[Question from the audience:] Okay, so in this case we make the assumption that the system cannot ask the same question twice, am I right?
No, it could, but it would be stupid here, because as I said we assume that the system recognizes perfectly what you said. If you say "I want a cheap Italian restaurant by the river", the system catches it and can fill the slots internally, so it doesn't have to ask twice. This is a perfect dialogue where everything goes well, and the system has also learned to perform well by asking the right questions. But I don't make any assumption that if I asked some question I cannot ask it again; there is no such assumption. It could ask twice, but it has learned that it shouldn't.

Okay, so let's go through reinforcement learning now. I have explained what the problem is, and now I am trying to tell you about some solutions. The first thing is to cast the problem into a statistical representation. Reinforcement learning is based on what we call Markov decision processes. What is a Markov decision process? It is a kind of state machine, a graph in which you have two different kinds of nodes: nodes for states, which are the big empty circles, and nodes for actions, which are the small plain circles. We will model the environment this way: we don't know how the black box works, but we will model it as a Markov decision process. In a given state you can take decisions; that is modelled by the π here. In state one, if I select action a2, I go to state two, and this gives me a reward associated with that transition. There is also some probability that I select action a1, and if I select action a1 in state one, I can end up either in s_T, which is a terminal state, or in state two. This models the non-determinism of dialogue systems: if I ask the user a question, I may get an answer to this question, but also to other questions. That is exactly what happened in the dialogue: I asked for the location and I got the location and the type of cuisine. So I could get just the location, or the type of cuisine in addition, which means that when I select action a1 in state one there is a probability of going into two different states, and I would get two different rewards.

Formally, a Markov decision process is a tuple in which you have a set of states, a set of actions, a set of probabilities of stepping from one state to another given the action chosen in the current state (I am in the current state, I select an action, and I have some probability of stepping to each other state), a set of rewards associated with each transition from one state to another, and gamma — I will tell you later about this term. The transition probabilities are supposed to be Markovian.
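To make this casting concrete, here is a minimal sketch of the restaurant task from before written down as states, actions and a reward in Python. All names here (DialogueState, the action strings, and so on) are mine, for illustration; they do not come from any particular toolkit.

```python
from dataclasses import dataclass
from itertools import product

# State: one boolean per piece of information the dialogue manager tracks.
@dataclass(frozen=True)
class DialogueState:
    greeted: bool       # did I greet the user?
    location: bool      # is the location slot filled?
    cuisine: bool       # is the cuisine slot filled?
    price_range: bool   # is the price-range slot filled?
    informed: bool      # did I already present a restaurant?

# Actions are communicative acts, not sentences.
ACTIONS = ["greet",
           "ask(location)", "ask(cuisine)", "ask(price_range)",
           "confirm(location)", "confirm(cuisine)", "confirm(price_range)",
           "inform(restaurant)", "bye"]

def reward(state: DialogueState, action: str) -> float:
    """As in the talk: +1 when the system informs the user, 0 otherwise."""
    return 1.0 if action == "inform(restaurant)" else 0.0

# The whole state space here is tiny: 2^5 = 32 states.
STATES = [DialogueState(*flags) for flags in product([False, True], repeat=5)]

# What is *not* written down here are the transition probabilities: they depend
# on the (unknown, stochastic) user, which is exactly why learning is needed.
```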
This Markov assumption is why it is called a Markov decision process: the probability of going from one state to another is a function only of the current state and the current action, not of what I did before. This is very important, because when you build a dialogue system it means you need to meet the Markov assumption in order to use reinforcement learning. Everything you need to take a decision has to be contained in the current state: you have to gather all the information you collected in the past into the current state, so that you can take the decision for the next step. The whole history of the dialogue should be contained in the current state; you should not need to remember what happened three turns ago. The same holds for the reward: the reward is only a function of the current transition, not of what happened before. So the Markov property concerns the transition probabilities, but also the reward: the reward has to be something related to the current transition, not to what happened in the past.

Okay, so again more formally, what we want to find is a policy. A policy is a mapping from states to actions. In the general case you can have a mapping from states to probability distributions over actions — this would be a stochastic policy, meaning you learn the probabilities of selecting each action in each state — but usually you will have deterministic policies, a deterministic mapping between states and actions: in one state, one action. A probability distribution over actions also works, though.

And now comes the question of how to learn from delayed rewards. We define a specific function, which we call the value function, for each state: to each state will be associated a value, which is the expected cumulative reward over time. If I start in state s and follow the policy π, I accumulate rewards after each interaction, and I accumulate them over time. Imagine I do that many, many times: I start from the same state many times, applying the same policy, and I take the expectation over all these trials; this gives me the value function. Accept this for the moment — of course it is not possible to do exactly that in practice, but the idea behind the definition of the value function is: start from s many times, apply the policy, collect the rewards, and average over all the trials. And there is this gamma here, which is smaller than one and greater than zero. Why is this?
Convergence, yes. If I have positive rewards all the time and I sum over an infinite number of steps, the sum always goes to infinity, so every policy would look optimal, because every policy gives you an infinite return. So you use this gamma for convergence, so that the value function does not go to infinity. This is the first definition you really have to get, because the rest of the talk is largely about it.

And there is another function, which actually gives you one degree of freedom for the first action: you are in a state, you can choose any action a, you step to another state, and from that other state you follow the policy π. So there is a value associated with the current state and with any action in your action set for this state. What is the purpose of this function compared to the first one? The first one evaluates what you are currently doing, and this one says: if there is an action for which this function is greater than the first one, then probably this action is better than the one in your current policy. You selected, in the first state, an action that is different from your current policy, the one you have learned so far, and it leads to a higher accumulated reward on average, so you should change your policy to take this action instead of the one currently in your policy. This is the key. For example, you can start with a handcrafted policy, measure the value of this handcrafted policy in every state, and then try other actions in the first state and see whether this improves the handcrafted policy.

[In answer to a question about the action set:] Yes, because you have to find the maximum. You could find the maximum of a continuous function, but it is much harder — you would have to differentiate — whereas with a finite set it is easier, and usually that is what happens here: the action set is finite. Even in robotics you can discretize the action space. Doing this for truly continuous actions is still an open question.

Okay, so is everybody fine with these two functions? Okay. Now that I have defined them, what I actually want is the policy π*, the optimal one. The optimal policy is the one that, among all policies, gives you the maximum value for every state: for every state s in my state space, I want the policy that provides me the maximum cumulative reward. The same holds for Q, and for Q it means that I can actually derive the policy from the Q function of the optimal policy: if I know the Q function of the optimal policy, then in every state I just have to select the action that maximizes the cumulative reward, and then I have the optimal policy.
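Written out with standard notation (r_t is the reward received at step t), the two functions and the optimal policy I have been describing are:

```latex
\begin{align*}
V^{\pi}(s)   &= \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t} \;\Big|\; s_0 = s,\; a_t = \pi(s_t)\Big], \qquad 0 < \gamma < 1,\\
Q^{\pi}(s,a) &= \mathbb{E}\Big[\textstyle\sum_{t \ge 0} \gamma^{t} r_{t} \;\Big|\; s_0 = s,\; a_0 = a,\; a_t = \pi(s_t) \text{ for } t \ge 1\Big],\\
\pi^{*}      &= \arg\max_{\pi} V^{\pi}(s) \;\;\forall s,
\qquad \pi^{*}(s) = \arg\max_{a} Q^{*}(s,a), \quad \text{with } Q^{*} = Q^{\pi^{*}}.
\end{align*}
```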
But what does that mean? It means that finding an optimal policy is about learning this function. If I can learn this function, I can find the policy, and this is a real key point in reinforcement learning: I cast policy learning into function learning, which is easier. It is easier to learn a function than a sequence of decisions. So everything I have said so far is just about casting sequential decision-making into the learning of a function. It doesn't say yet how to learn this function, but that comes next.

There are different ways of defining the cumulative reward. I will use this one, as I said, because of its convergence properties, but we could use a finite-horizon definition: if the interaction does not go on for an infinite number of steps, the sum never goes to infinity, so there would be no convergence problem either. We could also use the average reward over time. So in the definitions here I use the discounted sum with gamma, but I could use the other two as well; it is simply that there is a lot of work in the literature about the convergence of algorithms based on the discounted definition of the cumulative gain, and the convergence of the algorithms is much easier to demonstrate with this definition than with the others. Okay, I will not talk about this one; I have actually already talked about it. Do you want to break now? I'll go on for five more minutes if that's fine.

Okay, so what we actually want in reinforcement learning is, as I said, to maximize this value, or to learn this Q function for the optimal policy. These are formal definitions of what I said before, but here is what is important now. Look at this definition of the return: at time t I am in s_t, and it is exactly the same definition as before, but you can extract the first reward from the sum; the first reward comes with gamma to the power zero, which is one, so you can pull it out of the summation, and then the sum runs from one onward with a gamma in front. What does that mean? It means that, thanks to the Markov property — because everything that will happen in the future depends only on the present and not on the past — I can rewrite this expectation as the current reward plus gamma times the value function of the next state.
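One way to write the decomposition just described, with P the transition probabilities and r the reward of the MDP:

```latex
\begin{align*}
V^{\pi}(s) &= \mathbb{E}\Big[ r_0 + \gamma \textstyle\sum_{t \ge 1} \gamma^{t-1} r_{t} \;\Big|\; s_0 = s,\; \pi \Big]
            = \mathbb{E}_{s'}\big[\, r(s, \pi(s), s') + \gamma\, V^{\pi}(s') \,\big] \\
           &= \textstyle\sum_{s'} P\big(s' \mid s, \pi(s)\big)\,\big[\, r(s, \pi(s), s') + \gamma\, V^{\pi}(s') \,\big].
\end{align*}
```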
What does this mean? It means that if I am in one state and I step to another state, I get a reward, and the cumulative reward I can associate with the current state is the immediate reward plus the cumulative reward I will get from the next state; and the cumulative reward I can get from the next state is the value function of the next state. So this is a recursive problem. And if my policy is stochastic I can average over actions, and I can also write V as a function of Q; so we have V of s as a function of V of s'. This means that the value of each state is linked to the values of the states around it: if I step from one state to another, I get the immediate reward plus the cumulative reward from the next state, and there are different probabilities of going to the different next states. This is why there is the expectation: the expectation here is over the transition probabilities; I rewrote the expectation according to the transition probabilities from state to state. The idea is that I am in a state, I do an action, I have probabilities of ending up in different states, and the reward I can expect in the first state is the average immediate reward plus the average value of the possible next states. And if my policy is stochastic, to find the value function of the current state I also have to average over all the actions I can take in this state. So if I rewrite this — the Q function of (s, a) is a function of the value of the next state — I can average over actions, and if I combine these two equations I come up with this equation, which says that the value function of my current state depends on the value functions of the possible next states, because of the Markov property.

And this is actually part of the solution: this is a system of equations, a system of linear equations, that you can solve by any method you like for linear systems. So you can evaluate your current policy: if you know the probabilities associated with the action selection in each state for your current policy, you can evaluate it and know whether it is good or not; you can know exactly what the cumulative reward associated with each state will be.

This has been done; I took this example from the book by Sutton and Barto, who wrote the most cited book about reinforcement learning. The task is a gridworld task: you are an agent in a grid world and you can go north, south, east or west. If you end up in this state, it immediately brings you to this other state and gives you a reward of 10; if you are here, it brings you there and gives you a reward of plus 5. And I want to evaluate the value of the random policy: I am in a state and I act randomly, just to see what happens. This is the value of the random policy, and you see that this state has the highest value.
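As a side note, here is a minimal sketch of this evaluation step: solving the linear Bellman system for a fixed policy. The toy three-state transition matrix and rewards below are made up for the example; they are not the gridworld from the slide.

```python
import numpy as np

def evaluate_policy(P, R, gamma):
    """Solve the Bellman expectation equations V = R + gamma * P V.

    P[s, s'] : probability of stepping from s to s' under the policy being evaluated
    R[s]     : expected immediate reward in s under that policy
    Returns V[s], the value of every state for this fixed policy.
    """
    n = P.shape[0]
    # (I - gamma * P) V = R is an ordinary linear system.
    return np.linalg.solve(np.eye(n) - gamma * P, R)

# Toy example: 3 states, with a fixed (say, random) policy already folded into P and R.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])   # the last state loops on itself
R = np.array([0.0, 0.0, 1.0])
print(evaluate_policy(P, R, gamma=0.9))
```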
So, if I want to improve my policy and not be random anymore, what should I do when I am here, for example? I am acting randomly, and I accumulate the most reward when I am near this state, so I should try to go there. So I change my policy: when I am here I am not random anymore, I just head towards that state, and I will accumulate more reward. This is the first step towards learning the best policy: evaluate the current policy. Once your current policy is evaluated, you can say: I should change my policy, because it is better to go this way than the other way.

Okay, so these are the equations for the optimal value function. The optimal value function, since it is the maximum over all policies of the value functions, is actually the maximum over actions of the Q values of the optimal policy. So you can write that the optimal value function is the maximum over a — and not the average over all possible a given the policy — of this Q value. Remember, I said before that V as a function of Q was the average over all the actions; if you want the best action, you take the maximum, not the average. I am going to skip the details, but the idea is that you can do the same for Q: if you substitute V by its expression, you will see that the only place where there is an action to choose is in the next Q value, so the max goes only in front of that term. So the Q function of the optimal policy is given by this; it is also a recursive equation, and it is also a set of equations that you could solve, but it is not a linear system anymore, because of the max, and that is the problem. Before, for evaluating a policy, it was easy because it was a linear system, but now that you want the best policy you need to introduce a maximum operator, and this operator is not linear, so you cannot solve the system as easily. For evaluating your policy it is easy; for finding the best one it is not so easy.

Actually — and I will conclude this first part here — there is a result that Bellman showed in the 50s. He rewrites this equation like this: Q is equal to an operator applied to Q; this operator is denoted B, and it is all this expression. What he showed is that this operator is a contraction. Does everybody know what a contraction is? A contraction is a mathematical operator that brings two functions closer together: if I apply B to Q1 and B to Q2, the distance between the results is smaller than alpha times the distance between Q1 and Q2, with alpha smaller than one. It means that when I apply B to two functions, the two functions get closer to each other. An example of a contraction is division by two: I have 16 and 8, I divide both by two, I get 8 and 4, and their distance goes from 8 to 4, so it is smaller; division by two makes things closer to each other. And the good thing is that any contraction has a fixed point: a fixed point is some value, some function, that does not move when I apply the operator to it. For the division by two, the fixed point is zero: zero divided by two is zero; it doesn't change.
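In symbols, the standard statement of this result looks as follows; the contraction is usually measured in max norm, and for the Bellman optimality operator the contraction factor alpha is in fact gamma:

```latex
\begin{align*}
(B^{*}Q)(s,a) &= \textstyle\sum_{s'} P(s' \mid s, a)\,\big[\, r(s,a,s') + \gamma \max_{a'} Q(s',a') \,\big],\\
\|B^{*}Q_1 - B^{*}Q_2\|_{\infty} &\le \gamma\, \|Q_1 - Q_2\|_{\infty} \quad (0 < \gamma < 1),
\qquad Q^{*} = B^{*}Q^{*}.
\end{align*}
```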
So if I want to find the fixed point of any contraction, I just have to apply the operator repeatedly. Take the distance between any starting point x and the fixed point and apply the division by two: the distance is halved, and since the fixed point itself does not move, this means that if I apply the operator recursively, starting from any point, I will converge to the fixed point. And that just means that there is a way to find the Q function of the optimal policy: I wrote it for V here, but it also works for Q*; this operator is a contraction as well, so if I apply it repeatedly to any starting Q, I will converge to Q*. So there are two different algorithms to find the best policy. The first one is called value iteration: you solve the optimality equation by applying the Bellman optimality operator recursively to any starting point, and you converge to the Q function of the optimal policy. The other algorithm is called policy iteration, and it is based on evaluation and improvement. The evaluation is linear, so it can be solved by any solver for linear systems; once you have the evaluation of your current policy, you can improve it, because you have seen that in some states there is an action with a better value than the current one; this gives you a new policy, which you evaluate again, according to its new set of actions; you look again at this evaluation, you check whether there is another action better than the one you selected before, and so on; and once, in every state, the action you take provides the largest reward, you are done: that is the best policy. So, from what I have said, two algorithms: value iteration, which applies the Bellman optimality operator iteratively from any starting point and converges to the Q function of the optimal policy; and policy iteration, which evaluates the current policy, improves it, evaluates the improved policy, improves it again, and so on until it cannot be improved anymore, and that also tends towards the optimal policy. Okay, so I will stop here for five minutes, and then we will start again.
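To make the first of these two algorithms concrete, here is a minimal value-iteration sketch on a small made-up MDP (the transition tensor `P` and reward table `R` are random placeholders; only the backup and the stopping test matter):

```python
import numpy as np

# Minimal sketch of value iteration on a small, made-up MDP.
n_states, n_actions = 4, 2
gamma = 0.9
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # P[a, s, :]
R = rng.uniform(0, 1, size=(n_states, n_actions))                  # R[s, a]

Q = np.zeros((n_states, n_actions))
for _ in range(500):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a')
    Q_new = R + gamma * np.einsum("asn,n->sa", P, Q.max(axis=1))
    if np.max(np.abs(Q_new - Q)) < 1e-8:   # contraction => convergence to the fixed point
        break
    Q = Q_new

policy = Q.argmax(axis=1)   # greedy policy w.r.t. the converged Q
print(policy)
```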
Just before starting again, I wanted to stress that everything I have described was developed in the 1950s: value iteration is from 1957 and policy iteration is from 1960. So it has been around for a while, but it was applied to reinforcement learning quite recently, and to dialogue modelling quite recently. Now, the problem with what I have said so far is that, to apply these two algorithms, you need to know this operator, and the Bellman operator includes the transition probabilities. Of course, you do not know the transition probabilities when you interact with humans: you do not know with what probability the human will say one thing rather than another. These transition probabilities are unknown. This is why we switch to another paradigm. The two algorithms I have described are called dynamic programming algorithms; they were named like this by Bellman in the 1950s, and dynamic programming assumes that you know the transition probabilities and the reward function everywhere, in every state, for every action. This is of course not known in a dialogue system: you know neither the transition probabilities nor, maybe, the rewards; if you ask the user about their feeling about the dialogue, you do not know the probability of getting this or that reward for a given dialogue, so the rewards may also be unknown. So you are in an unknown environment, but you still want to use this reinforcement learning paradigm. A first method, which is naive but works so-so, is to sample the environment to learn those probabilities: you just look at what happens when you take a given action in a given state, you see which state you step into, you count the number of times you stepped into each state, and that gives you the probability of stepping from one state to another for the action you took. This is called adaptive dynamic programming, and it has actually been applied to dialogue systems. It is not the first reinforcement learning algorithm that was applied to dialogue systems, but it is the simplest, and it was applied by Singh, Kearns, Litman and Walker in 1999.
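The counting step of adaptive dynamic programming is simple enough to sketch directly; assuming a list of logged transitions (the `transitions` list below is an illustrative placeholder), the estimated model can then be fed to value or policy iteration:

```python
from collections import defaultdict

# Minimal sketch of the counting step in adaptive dynamic programming:
# estimate P(s'|s,a) and R(s,a) from logged transitions.
transitions = [("s0", "ask", "s1", 0.0), ("s0", "ask", "s1", 0.0),
               ("s0", "ask", "s2", -1.0), ("s1", "confirm", "s2", 1.0)]

counts = defaultdict(lambda: defaultdict(int))
reward_sum = defaultdict(float)
total = defaultdict(int)

for s, a, s_next, r in transitions:
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    total[(s, a)] += 1

P_hat = {sa: {s_next: c / total[sa] for s_next, c in nxt.items()}
         for sa, nxt in counts.items()}
R_hat = {sa: reward_sum[sa] / total[sa] for sa in total}

print(P_hat[("s0", "ask")])   # e.g. {'s1': 0.66..., 's2': 0.33...}
print(R_hat[("s0", "ask")])
```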
It was published at NIPS, the machine learning conference, not at a speech processing conference. So the application of reinforcement learning to dialogue systems is only about fifteen years old, while dynamic programming is sixty years old. It was applied to a quite simple dialogue system with 64 states and five actions, and they actually learned the model from about 200 dialogues, which is a small amount of dialogues, but it is also a small system. So, these first systems: ELVIS was an e-mail system and TOOT was a train-scheduling system; a quite small number of states, five actions, and they could actually estimate error bounds on the value function, and those error bounds were quite big because they had a small number of dialogues. But there are maybe better ways to learn the dialogue policy. The first one that was proposed is to use Monte Carlo methods to estimate the value function. The idea is that you start in any state, you have a first policy pi, for example a handcrafted policy, you follow this policy, and, exactly as I defined the value function, you estimate it just by running your policy a large number of times starting from each state; you evaluate the Q value of this policy, and then you improve your policy by selecting the best action according to this evaluation. This is actually the first reinforcement learning algorithm that was proposed in the dialogue literature: in 1997, Esther Levin and Roberto Pieraccini proposed the first paper modelling a dialogue system as an MDP, for an air-travel information task: you try to find a flight by interacting with the dialogue system. It had 411 states and four actions, and using this Monte Carlo method they needed 710,000 dialogues to converge. Only four actions, a very small state space, and 710,000 dialogues. At that point most people would just have said: that is not possible, give up reinforcement learning, because collecting that amount of dialogues is simply not possible. So what they did was to simulate dialogues: they built a user simulator, with probabilities of answering, probabilities of noise and things like that, so as to generate that amount of dialogues and learn the optimal policy for this task. That was almost 20 years ago. And as I said, the Monte Carlo method is really the best you can do in terms of convergence: it will for sure converge to the true value and the best policy; you can hardly do better in terms of convergence. But it is the worst in terms of sample efficiency: you really need a lot of samples to be sure that you converge to the right value function.
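For illustration, here is a minimal first-visit Monte Carlo evaluation sketch; the `policy` and `env_step` functions are hypothetical stand-ins for a real dialogue policy and environment, with made-up dynamics:

```python
import random

# Minimal sketch of first-visit Monte Carlo evaluation of a fixed policy.
gamma = 0.95

def policy(state):
    return random.choice(["ask", "confirm"])          # e.g. a random policy

def env_step(state, action):
    next_state = random.choice(["s0", "s1", "done"])  # toy stochastic dynamics
    reward = 1.0 if next_state == "done" else -0.1
    return next_state, reward

def run_episode(start="s0", max_turns=20):
    episode, state = [], start
    for _ in range(max_turns):
        action = policy(state)
        next_state, reward = env_step(state, action)
        episode.append((state, reward))
        if next_state == "done":
            break
        state = next_state
    return episode

returns = {}   # state -> list of observed returns
for _ in range(10_000):
    episode = run_episode()
    seen = set()
    for t, (state, _) in enumerate(episode):
        if state in seen:
            continue
        seen.add(state)
        # discounted return from the first visit of `state`
        G = sum(r * gamma**k for k, (_, r) in enumerate(episode[t:]))
        returns.setdefault(state, []).append(G)

V = {s: sum(g) / len(g) for s, g in returns.items()}
print(V)
```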
So there is actually a compromise to find between dynamic programming and Monte Carlo. Dynamic programming is nice because, if you know everything, you get the optimal value function just by computation; it works perfectly. Monte Carlo is nice because you do not need to know the transition probabilities, while dynamic programming is less nice because you do need to know them. So we should try to mix both. What is the idea for mixing Monte Carlo, which does not need the transition probabilities and learns from samples, with dynamic programming, which does not need samples but requires the transition probabilities to be known in advance? When you look at dynamic programming, why is it nice? Because it links the value function of the current state to the value function of the next state. The Monte Carlo method does not use that: it does not use the fact that there is a link between the current value and the value of the next state. So we should try to introduce this link into Monte Carlo methods, so as to make them more efficient, and the way to do it is given by the principle of temporal differences, proposed by Sutton in 1988, so it is quite old as well. The idea is that if you had a deterministic system that you interact with, the value function of a state would be given by this sum of rewards, discounted with gamma, which is the immediate reward plus the discounted value function of the next state. So the idea is to say that V should tend towards r plus gamma times V of the next state: this is the target of V. If V should tend towards that, then V minus this target should be zero, so we should devise a method that makes this difference tend to zero, and this difference is called the temporal difference. The idea is therefore to create an update rule that makes V(s_t) tend towards this value. This is actually a kind of Widrow-Hoff update, or a stochastic gradient descent: you update your current value according to some portion of the error you are making on this value, and the error you are making is the difference between V and its target. You make V tend towards the target by using this update rule: after each interaction you have with the system, you get a new reward, and you update V according to this new reward and to the value function of the state in which you ended up after performing your action in state s.
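As a minimal sketch of that update rule (TD(0)), with an illustrative value table and a made-up trajectory:

```python
# Minimal sketch of the TD(0) update for evaluating the current policy.
alpha, gamma = 0.1, 0.95
V = {"s0": 0.0, "s1": 0.0, "s2": 0.0}

def td0_update(V, s, r, s_next):
    target = r + gamma * V[s_next]        # bootstrapped target
    td_error = target - V[s]              # the "temporal difference"
    V[s] += alpha * td_error              # Widrow-Hoff style update
    return V

# one observed transition per tuple: in s we got reward r and ended up in s_next
trajectory = [("s0", -0.1, "s1"), ("s1", -0.1, "s2"), ("s2", 1.0, "s0")]
for s, r, s_next in trajectory:
    V = td0_update(V, s, r, s_next)
print(V)
```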
You are in state s, you take an action, you end up in state s_{t+1}, you observe your current estimate of V(s_{t+1}), and you also get a reward. By using this update rule, you will tend towards the exact evaluation of your current policy: not the optimal policy, but your current policy. But of course what you want is the best policy, not the current one. So you can do the same for Q, for the Q value: you update your estimate of the Q value of the action you took in state s, using the same update rule, so that this Q value tends towards its target, and to learn the best policy you choose your next action according to the value function you are learning. If there is an action with a better Q value than the one you are currently selecting with your policy, then you should change your policy so as to take the action that has the highest Q value. So the algorithm is an online algorithm: you learn online by interacting with the system. At each state you choose an action according to Q, for example the maximum over a of Q, so you take the best one, and this is the way you improve your policy; then you evaluate your new choice by updating your current estimate of the Q value with this rule. It is like policy iteration: policy iteration was, I evaluate my current policy and improve it, evaluate again, improve it, evaluate again, improve it. Here I do exactly the same, but I do not evaluate the whole policy everywhere: I just evaluate it in the state in which I am, I take one action, estimate the quality of this action, update the Q function, step to another state and choose the new action according to the value function. So it is like an asynchronous policy iteration; by this I mean that I do not evaluate the whole policy and I do not improve the whole policy, I just locally improve the policy according to one sample. And there is another update rule, based not on the evaluation Bellman equation but on the optimality Bellman equation, the one with the max. With this one, you start from any estimate of Q and you modify your current estimate according to this rule, which comes from the optimal Bellman equation, and this will tend towards the Q value of the optimal policy, whatever you do, whatever actions you choose, as long as you update with this rule. You can be totally random. That is the difference with SARSA. One thing I did not say about SARSA: it is called SARSA because the update rule needs s, a, r, s' and a'. With SARSA you have to follow your current policy and keep changing it so that it improves over time, while with Q-learning you can be totally random: you do whatever you want with the system, and if you use this update rule you will still learn the optimal value function.
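Here is a minimal sketch contrasting the two update rules just mentioned; states, actions and the logged transition are illustrative placeholders:

```python
import random
from collections import defaultdict

# Minimal sketch of the SARSA (on-policy) and Q-learning (off-policy) updates.
alpha, gamma = 0.1, 0.95
actions = ["ask", "confirm", "close"]
Q = defaultdict(float)   # keyed by (state, action)

def sarsa_update(Q, s, a, r, s_next, a_next):
    # on-policy: the target uses the action actually taken next
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next):
    # off-policy: the target uses the best action in the next state,
    # regardless of the (possibly random) behaviour policy
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# example: one logged transition, behaviour policy fully random
s, a, r, s_next = "s0", "ask", -0.1, "s1"
a_next = random.choice(actions)
sarsa_update(Q, s, a, r, s_next, a_next)
q_learning_update(Q, s, a, r, s_next)
```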
Once you are done, you have done a lot of random things with your system; you just stop doing random things, and then you use what you have learned, and what you have learned is the optimal policy. These two algorithms: you can trace their first use in dialogue back to 2002, I think; the first use of SARSA or Q-learning in dialogue is from 2002, and they have been used for the last ten years or so; everybody used SARSA or Q-learning together with user simulation. But it is still not very efficient: you still need 100,000 dialogues or something like that. So how do we tackle this data sparsity problem? We do not have that many dialogues to learn from, and we do not have that much time to learn either. The first thing people did, as I said, was user modelling, user simulation; the first example of a user simulation for reinforcement learning goes back to 1997, also by Roberto Pieraccini and colleagues. The idea is that you have a small amount of real interactions between real users and a spoken dialogue system; from these you learn a user model, a user simulation model; then you generate fake dialogues, you record them or you learn online, whatever, but you learn from fake dialogues. This is what we did between, let's say, 2000 and 2010; most people just did that, because we had accepted that reinforcement learning could not learn from fewer than 100,000 dialogues. Actually that is not true, because while we were applying this to dialogue systems, people in machine learning were still improving reinforcement learning; we were using algorithms that were 30 years old in the best case, or even 50 or 60 years old in the other cases. The other way to use dialogue systems in a real setting with a lower number of samples is to do value function approximation and learn from that. What I mean is that we should cast this into some sample-efficient learning, and the sample-efficient learning methods that we know are supervised learning methods: supervised learning methods are very efficient, and so far the reinforcement learning algorithms we have seen are not efficient enough, so we should combine reinforcement learning and supervised methods. What is important in what I have shown so far is that we cast the problem of reinforcement learning into a function approximation problem: what we want now is the Q function. We want the policy in the end, of course, but we do not want to learn the policy immediately; what we want to learn is the Q function. So we can think about casting this learning into a regression problem: regressing the value function, so that we can generalize and apply very efficient supervised learning methods. So what is supervised learning, after all? You are given inputs and outputs, and you know that the outputs are a function of the inputs; what you want is to learn this function, or an approximation of it, so that for any input you can find what f(x) would be. This is regression: you are given some x and y, you know that y is f(x), and you are trying to find f. You usually do that by minimizing a cost function, and this cost function is often a quadratic one: you try to find an estimate of f that minimizes this kind of quadratic cost.
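For reference, the regression setting just described, with a linear parameterization and a quadratic cost, looks like this (the data here is synthetic, purely for illustration):

```python
import numpy as np

# Minimal sketch of ordinary least-squares regression: given samples (x, y)
# with y ≈ f(x), fit a linear parametric model by minimizing the quadratic cost.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(200, 1))
y = 3.0 * x[:, 0] + 0.5 + 0.1 * rng.standard_normal(200)   # unknown "f" plus noise

# features phi(x) = [x, 1]; minimize sum_i (y_i - theta^T phi(x_i))^2
Phi = np.hstack([x, np.ones_like(x)])
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(theta)   # approximately [3.0, 0.5]
```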
The idea is to minimize the energy of the error. So why don't we do that directly with the value function? Q is the value function and we want to regress Q, so why not do it immediately? The input would be x: for the value function it would be s, and for the Q function it would be (s, a); the output would be V or Q. The reason we cannot do simple supervised learning is that we never observe Q. In regression you have y = f(x), which means you need samples of f; in our case we would need samples of V or samples of Q, but we do not have them anywhere in the state space. What we observe are rewards. But rewards are linked to the value functions through the Bellman operator, so we can use this to cast the problem into a supervised learning one. For example, we will say that Q is a parametric function, a linear parametric function, and we will try to find the parameters of this linear function using some supervised learning. We have a way to project Q onto the hypothesis space spanned by the linear representation: we need some basis functions, these basis functions create a hypothesis space, and we want to find the parameters that project the Q function onto this hypothesis space; we assume we have a method that projects a function onto the hypothesis space, it can be least squares or whatever. So what we want is to find this projection, but we never see Q: Q is never observed, we only observe rewards. There are different methods to handle this. So let's do it anyway: we do not observe it, but let's derive the equations and see what happens. We want to minimize this, but we actually minimize the empirical version, according to the samples we have; let's say we have samples of the value function that we want to regress, and we have the current estimate, which is what we are searching for, and let's do a stochastic gradient descent on this. With stochastic gradient descent you just update your previous estimate of the parameters according to the derivative of the cost, and this derivative is given by this expression if the function is linear. But still, we do not have the target: we do not observe the value. So how do we do it? We say that the target is linked to the estimate of the next value, thanks to the Bellman operator, and we replace it accordingly: the value of the state I am trying to find is the immediate reward plus the value of the next state. This is called bootstrapping: I use my current estimate as the target for my next estimate. And it gives this update rule for the parameters, which just means that you can indeed cast reinforcement learning into supervised learning, but you have to be very careful, because what you are targeting is not directly observable: you have to use the Bellman operator to build the target, and this is one way to do it.
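Put together, this is a semi-gradient TD update with a linear parameterization; here is a minimal sketch, where the basis functions `phi()` and the transitions are hypothetical placeholders:

```python
import numpy as np

# Minimal sketch of the bootstrapped update with a linear parameterization:
# V_theta(s) = theta^T phi(s), updated by semi-gradient TD(0).
alpha, gamma = 0.05, 0.95
n_features = 4
theta = np.zeros(n_features)

def phi(state):
    # hypothetical basis functions; in a dialogue system these could be
    # ASR confidence levels, slot-filled indicators, turn counts, ...
    v = np.zeros(n_features)
    v[hash(state) % n_features] = 1.0
    return v

def td_linear_update(theta, s, r, s_next):
    v_s = theta @ phi(s)
    v_next = theta @ phi(s_next)
    target = r + gamma * v_next                 # bootstrapped target (V is never observed)
    td_error = target - v_s
    return theta + alpha * td_error * phi(s)    # gradient of theta^T phi(s) is phi(s)

for s, r, s_next in [("greet", -0.1, "ask"), ("ask", -0.1, "confirm"), ("confirm", 1.0, "end")]:
    theta = td_linear_update(theta, s, r, s_next)
print(theta)
```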
The problem with this way of doing it is that if your parameterization is not linear, for example if you use a neural network, there is no guarantee that it will converge: it can converge, but it can also diverge, and there are counter-examples showing divergence. With a linear representation it will converge. This was applied to dialogue for the first time in 2008, while it had been proposed by Tesauro back in 1995. It is actually SARSA with value function approximation: if you look at this equation and compare it to the SARSA equation, it is exactly the same when this term equals one. So it is SARSA with function approximation, and it was used for the first time in 2008 on a very big dialogue system, 10 to the power of 87 states, but they still needed some simulation to learn. Okay, so there are other methods. The method I just described tries to minimize the distance between the estimate of Q and the real Q. What you can also minimize is the distance between Q and the Bellman operator applied to Q: as I wrote here, you are supposed to have Q = BQ, so you can try to minimize the distance between Q and BQ. This is the method called Bellman residual minimization: you try to minimize this distance, you do it empirically with the samples you have, so you sum over the samples and minimize this distance, and you can again use a stochastic gradient descent on this cost function; if you replace the values by their parametric representation, you get another way to cast reinforcement learning into a supervised learning problem. But there is another problem with reinforcement learning, which is that the samples are not independent. When you do supervised learning, you usually assume that all your samples are independent and identically distributed. That is not the case in reinforcement learning, since the states you observe lie on a trajectory: they depend on the previous state; I take an action in a state, I end up in a new state, so there is a link between the current state and the next state. And if you minimize this empirical cost function, you do not actually minimize the distance between Q and BQ, because a variance term arises due to the square, so this can diverge as well. What people use is the double sampling trick: you start from s, take an action and observe the next state; then you come back to s, take the same action again, and observe another next state, so that this part of the equation and that part of the equation are not correlated anymore.
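A minimal sketch of one stochastic-gradient step on the Bellman residual with double sampling (the features, state-action names and the two independent next-state samples are illustrative; in practice the second sample requires a simulator or generative model):

```python
import numpy as np

# Minimal sketch of a stochastic-gradient step on the empirical Bellman
# residual ||Q - BQ||^2 with a linear parameterization and double sampling.
alpha, gamma = 0.05, 0.95
n_features = 6
theta = np.zeros(n_features)

def phi(s, a):
    v = np.zeros(n_features)
    v[hash((s, a)) % n_features] = 1.0
    return v

def residual_step(theta, s, a, r1, s1, a1, r2, s2, a2):
    # residual built from the first sample, gradient factor from the second:
    # two independent next states remove the bias introduced by the square
    delta1 = theta @ phi(s, a) - (r1 + gamma * theta @ phi(s1, a1))
    grad2 = phi(s, a) - gamma * phi(s2, a2)
    return theta - alpha * delta1 * grad2

# two independent transitions drawn from the same state-action pair
theta = residual_step(theta, "s0", "ask", -0.1, "s1", "confirm", -0.1, "s2", "ask")
print(theta)
```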
There is also a third way to do it, which is the one on which I will conclude: the projected fixed-point method. I come back to my drawing here. I said that we have a projection operator that makes it possible to project any function onto the hypothesis space. So I can project Q, or I can project BQ, the Q to which I applied the Bellman operator, which then has to be projected back onto the hypothesis space. So I can try to minimize this distance, which was the direct method; I can minimize this distance, which is the Bellman residual method; and I can also minimize this distance, which is the projected fixed-point method. I do not really need to go through the equations, but this has been applied recently, in 2011, to the dialogue system I mentioned before: finding a restaurant using three different slots, three different pieces of information. You have in mind a certain type of cuisine, in a certain part of town, within a certain price range. So we have three slots, but we replaced the slot values: instead of binary values (I know it / I do not know it), we used the confidence levels of the speech recognizer. Speech recognition tells you: I know this with 60 percent, 70 percent, or whatever, and this is a continuous value. So we replaced the three slot values by three confidence levels, which makes a state space that is continuous in three dimensions; that is even more than 10 to the power of 87 states. And we had 13 actions: you could ask for a slot, explicitly confirm a slot, or implicitly confirm a slot. Implicit confirmation of a slot is like this: if I ask you, in which area of town do you want an Italian restaurant, and you say, by the river, you implicitly confirm that you want an Italian restaurant. It is a special type of action which makes the dialogue more natural. The reward was +25 if the system answers with the correct slots, -25 if there is an incorrect slot in the answer, and -300 if there is an empty slot in the answer. We used two batch algorithms: we started from a handcrafted policy and collected data with this handcrafted policy, and this is the reward you get by learning from those samples; of course the handcrafted policy does not learn anything, it stays at the same reward, while with 5,000 samples we got a quasi-optimal policy. And when I say 5,000 samples, it is not 5,000 dialogues but 5,000 dialogue turns; each dialogue was approximately five to six turns, so this makes roughly 1,000 dialogues to learn an approximately optimal policy: a perfect policy would be close to 60, and this one was close to 50 with this kind of method. So it means that by casting reinforcement learning into a supervised learning problem, we ended up with methods that can learn from a reasonable amount of dialogues, and 1,000 dialogues is reasonable to collect. Then, of course, we also tried it on a more complicated domain. I will not have time to talk about KTD-Q and so on, but LSPI, for example, also performed very well here, and LSPI is one of those least-squares methods I told you about. This is the number of dialogues: you see that you can learn a very good policy from a fairly reasonable amount of dialogues in a more complicated domain; the domain before had just three slots, while this one has 12 slots, and it is the Cambridge system that Melissa will talk about later.
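For those curious what the least-squares evaluation step inside LSPI looks like, here is a minimal LSTD-Q sketch built from a batch of transitions (`phi()`, the toy batch and the fixed policy are placeholders; a real system would iterate evaluation and greedy improvement):

```python
import numpy as np

# Minimal sketch of LSTD-Q, the least-squares evaluation step used inside LSPI
# (the projected fixed-point idea): solve A theta = b built from a batch.
gamma = 0.95
n_features = 6

def phi(s, a):
    v = np.zeros(n_features)
    v[hash((s, a)) % n_features] = 1.0
    return v

def lstdq(batch, policy, reg=1e-3):
    A = reg * np.eye(n_features)
    b = np.zeros(n_features)
    for s, a, r, s_next in batch:
        a_next = policy(s_next)                 # action the evaluated policy would take
        f, f_next = phi(s, a), phi(s_next, a_next)
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)                # theta s.t. Q_theta is the projected fixed point

# e.g. one LSPI-style step: evaluate a policy on logged data, then act greedily
batch = [("s0", "ask", -0.1, "s1"), ("s1", "confirm", 25.0, "end")]
theta = lstdq(batch, policy=lambda s: "ask")
print(theta)
```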
These are some recent results showing that it is possible to scale up, to learn from a small amount of dialogues, and this helps a lot; it gives good perspectives. Maybe just one word about this, because I talked a lot at the beginning about the fact that you should not disturb the user: while you are learning, you still have to perform correctly with the user, and this is also a very challenging problem that we have tried to solve in different ways. In Cambridge they use Gaussian processes, and in my group we used Kalman temporal differences. The idea is again to cast the problem of reinforcement learning into a supervised learning problem, but to consider that what you are looking for, the Q function, is a hidden variable of your problem, and to try to find it using Bayesian filtering. You make a Bayesian update of your parameters instead of doing supervised learning: instead of minimizing a cost function directly, you infer the parameters of your Q function online with a Bayesian method. The good thing with Bayesian methods is that they do not give you only the value function, but also a distribution over possible value functions; that gives you some uncertainty about your estimate of the value function, and you can use this uncertainty so as not to disturb the user. You can say: this outcome is quite uncertain, so do not push too much in this direction, because within the confidence interval there are very bad things, so you will not go and use these actions; but there are other parts of the state space where you are uncertain and things look rather good, and the confidence intervals tell you it would probably be worth trying these actions.
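One way to picture how such an uncertainty estimate can be used (this is a simplified, hypothetical decision rule in the spirit of those Bayesian approaches, not the exact GP or KTD machinery): keep a mean and a spread for each Q estimate, avoid actions whose pessimistic bound is unacceptable, and explore among the rest.

```python
import numpy as np

# Minimal sketch of uncertainty-aware action selection.
# The means and standard deviations below are illustrative placeholders,
# standing in for the posterior a Bayesian learner would maintain.
q_mean = {"ask": 0.4, "confirm": 0.6, "close": 0.1}
q_std  = {"ask": 0.05, "confirm": 0.30, "close": 0.50}

def choose_action(q_mean, q_std, kappa=1.0, floor=-0.5):
    scores = {}
    for a in q_mean:
        lower = q_mean[a] - kappa * q_std[a]   # pessimistic bound: how bad could it be?
        upper = q_mean[a] + kappa * q_std[a]   # optimistic bound: how good could it be?
        # never pick actions whose pessimistic outcome is unacceptable for the user,
        # otherwise prefer actions that look promising under the optimistic bound
        scores[a] = upper if lower > floor else -np.inf
    return max(scores, key=scores.get)

print(choose_action(q_mean, q_std))
```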
So you drive your exploration of the possible policies according to the uncertainty you have on the estimate of the Q function. There are different methods to do that, but you can see that if you use this uncertainty (the red line here uses the uncertainty and the green line does not), you converge faster and with a lower variance. So you converge to a good policy faster and without disturbing the user, whereas disturbing the user would lead to a greater variance in the rewards you get. Okay, I could actually stop here, except that what I did not talk about is the partial observability of the inputs. This will be the topic of the talk this afternoon, but maybe I can introduce part of it. Since what you are trying to learn from are observations of what the user said, and these observations come out of speech recognition and natural language understanding systems, which are error-prone, you never really see what the user said, and even less what the user meant to say. So you only have a partial observation of the state on which you should rely to take your decisions. This was modelled, in the 1970s, as a partially observable Markov decision process (POMDP): you are still Markovian in the states, but you do not observe the states, you only have partial observations of them, and you are not Markovian in the observations. So what do you do, given that you have to take decisions according to the observations and not according to the states you do not observe? There are very nice, sophisticated methods based on Bayesian updates that I will not describe, but you can actually cast this partially observable Markov decision process into an MDP if you replace the state by a probability distribution over states; this will be explained later, but this probability distribution over states is what we call a belief state. A belief state is a probability distribution over all the possible states, and we can show that you are Markovian in belief states, and the belief state can be estimated from the observations; you have to replace the transitions between states by transitions between observations, and things like that. This has also been used to model dialogues: in 2000 there is a paper by Nicholas Roy that models dialogue systems as POMDPs, but, as in dynamic programming, they learn the transition probabilities, the observation probabilities, and so on, and then they solve it with approximate solvers like point-based value iteration, which is based on dynamic programming for POMDPs. Later, there has been work by Jason Williams and Steve Young in Cambridge to estimate the belief state, be Markovian again in the belief state, and use reinforcement learning on the belief state. This is another way of modelling the problem, but again you see that POMDPs were proposed in the 1970s, and in dialogue systems we have only used them very recently.
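The belief-state idea can be sketched very simply as a Bayesian filter over hidden user goals; all numbers below are illustrative (a toy observation model for a noisy ASR/NLU output), not taken from any real system:

```python
import numpy as np

# Minimal sketch of a belief-state update for a POMDP-style dialogue state:
# b'(s') ∝ P(o|s') * sum_s P(s'|s,a) * b(s)
states = ["wants_italian", "wants_french"]
belief = np.array([0.5, 0.5])                 # prior over hidden user goals

# P(s'|s, a): assume the user goal does not change within the dialogue
T = np.eye(2)
# P(o|s'): likelihood of the noisy ASR/NLU observation "italian" under each goal
p_obs_given_state = np.array([0.8, 0.2])

def belief_update(belief, T, p_obs):
    predicted = T.T @ belief          # prediction step through the transition model
    updated = p_obs * predicted       # correction step with the observation likelihood
    return updated / updated.sum()    # normalize to get a probability distribution

belief = belief_update(belief, T, p_obs_given_state)
print(belief)   # belief shifts towards "wants_italian"
```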
Okay, I will just stop here and conclude. What is the take-home message of my talk? The first thing is: read outside your field, because, as I said many times, all these algorithms had been there for decades and we did not use them, partly because it was not our community and partly because they were not scalable, but there are sources of inspiration everywhere. What I also wanted to say is that we use data-driven methods for learning spoken dialogue strategies because they deal with stochasticity, with long-term behaviour and with uncertainty, and now it is scalable: you can learn from a reasonable amount of dialogues, which is very important. They can also learn from batch data, which means you can use the logs of the systems you already have to learn better policies for those systems. You can improve online, because reinforcement learning is by essence an online method, so you can use all these methods online to improve with real users. It requires fewer and fewer models, because you can really learn a lot from data, and, what I did not show, it is transferable from task to task; I also worked on this. Okay, that's it. I did not talk about the problem that the reward function is not easy to define, and things like that; there are many other problems we can chat about later if you want. Thank you. That's a good question, about dialogue system evaluation. The only way to evaluate a dialogue system is to put it online and test it, have it tested online, but usually what you do is a satisfaction survey: you put the system online and ask people whether they were happy with the system, whether they found the TTS good, whether they obtained the information they wanted, and so on. This is the best way to evaluate. Then people built a logistic regression out of all these questions to predict what the user satisfaction would be on other tasks or other dialogues, but mainly it is based on surveys. Compared to a static rule-based system: all these methods have been developed to make spoken dialogue systems more robust to uncertainty about the inputs. A rule-based system would typically try to be robust by asking for confirmation many times during the dialogue, because it is not sure about what has been said, while these methods keep track of the uncertainty all along the dialogue, and only if at some point the uncertainty about what has been said becomes too high will they ask for an explicit or an implicit confirmation; they will choose between all these natural ways to confirm, but only because the estimate of what has been said is getting more and more uncertain. So they actually adapt to each dialogue, while a rule-based system would not adapt: most of the time it will ask for confirmation at the same point in the dialogue. So it is more adaptive; from one session to another you will probably get different dialogues from an MDP-based system. Yes, more natural, more efficient. Human-like? It is never human-like, for the moment.
It is not human-like because it is task-oriented, goal-oriented: you are only talking about this task, and if you go outside the domain it goes badly. So it is not human-like from that point of view, but in terms of efficiency, and if you stay within the domain, it is much more natural to talk to such a statistical system. It is more natural, but still not natural in the sense of human-human dialogue, because a human would not make speech recognition errors. It is a more convenient way to talk to machines, but it is not human-like. And this is actually the good thing with reinforcement learning: since we learn by interaction, if in the reward you say that dialogues should be short, then it will accept more uncertainty to make them shorter; if you tolerate longer dialogues, then it will probably confirm at some point, to make sure it got the right information; but it learns this from the rewards. That is the good thing, because in a rule-based system, if you want to take uncertainty into account, you need to define in advance what the maximum acceptable uncertainty is: above 70 percent it is acceptable, below 70 percent it is not; you have to define it in advance, while here the system learns from the rewards when to confirm. Yes, of course, and that is actually kind of an issue. For the moment, well, that is not entirely true, there is some work on identifying the expertise of the user during the dialogue, but most of the work considers that there is an average user and that you learn for this average user, which is actually not true: there is no average user. So there are two different ways of handling this: either you learn different policies for different types of users and you try to find out, at the beginning of the dialogue, which category the user belongs to, so that you use the right policy; or you use fast-adaptation algorithms that can track the optimal solution for you, but then that works for longer dialogues. For example, for tutoring dialogues, if you have a tutoring system that tries to teach you mathematics or physics or whatever, you will probably interact with it for several days, and it will have time to track your personal abilities to learn, because it is a long-term dialogue. But in short dialogues like these, you probably have to identify at the beginning which type of user is in front of the system.