Well, before we start our presentation, may I know how many of you have a basic understanding of reinforcement learning? Anyone in the room? Oh, a couple of hands. So, remember these people. If you have any question on RL, either come to us, any one of us, or reach out to your nearest friends. I feel many of you have no background in reinforcement learning, but that is absolutely fine. Whatever prerequisites are required for this talk, we will cover all of them. But my request to all of you: pay close attention to each and every detail, because if you lose the flow in the middle, it will be very difficult to come back on track. And towards the end of this presentation, we will have a question-and-answer session for just 5 minutes. Fine. With no more delay, let us begin today's session on deep reinforcement learning. Now, somebody will continue from here.

Hello, everyone. Thank you for coming. I am also a senior lead data scientist at University of Turkey. Let us jump right into the problem. You have this 3D simulated robot in your computer, and you can control it by applying a small force or torque at each of its joints. Theoretically, this robot was built in such a way that it can perform a wide variety of very complicated physical motions. But initially, the robot does not know how to do anything. It just acts randomly, and the only thing it manages to do is fall spectacularly from all of these positions. The idea is this: you show the robot a video of a person doing a motion called a cartwheel, and you tell the robot, "I am going to reward you every time you follow the motion of the human in the video." Now, this is a very selfish robot, so it wants to do all it can to maximize its rewards. You go away somewhere, you come back, and this is the picture you see: it is trying to do something that looks like a cartwheel, but it is not able to nail the landing. So you give it some more time to improve itself, and after a while, this is the picture you see. It has successfully learned how to do a cartwheel. The next question you will ask is: if I show it another video, will it be able to learn from that too? The answer is yes. In fact, you can show it any arbitrary video and it will learn how to do that even faster. This work was done by the AI research lab at UC Berkeley, and the goal of our presentation is to break down the algorithms that can do this kind of learning.

You must have guessed that we are talking about something called reinforcement learning, or RL. RL is typically used in settings where you have to make a sequence of decisions. You have some kind of agent (this agent can be a robot, a self-driving car, or even a chess-playing bot), and it sits in some environment. The agent senses the state of the environment and does some thinking. Based on that thinking, it takes an action. The action can be "move your arm up", or "go to some place", or "make this move in chess". The key thing to note about this action is that it influences the state of the environment the agent is in, and it also influences the future actions the agent might take. Many real-life problems have the same structure. Think about driving a car: you are constantly reacting to a dynamic environment, and if you bump into someone, you are definitely changing the state of that environment.
Similarly, if you take a wrong action and go slightly off the road, you need to devote some future actions to getting back on track. So we have this sense, think, and act cycle. What do we do with it? We want the agent to learn some kind of task, and the way to do that in reinforcement learning is to reward the agent for good behavior. You can also penalize the agent for bad behavior, and then let the agent figure out the best actions to take in order to maximize these rewards. Think of how a baby learns to walk. It will get up and try something. Maybe at some point it sways a little too much to the left and falls down. Right? The pain from falling serves as negative reinforcement, telling the baby never to take that action again. And similarly, if the baby's parents are laughing and urging him on when he takes a few steps, that serves as positive reinforcement. The key thing to notice here is that you are not explicitly programming the agent to do anything, and you are not telling the agent what to do. You define some kind of reward, and the agent automatically figures out the best actions to take.

So, a quick exercise: how would you model stock trading as a reinforcement learning problem? To model anything as a reinforcement learning problem, you first have to decide what the state, the actions, and the reward are. For stock trading, the state can simply be the raw prices of the stocks you want to buy or sell. You can also, and we did this, incorporate news into the state: we ran sentiment analysis on all the news articles that mentioned the stock and added that to the state. What kind of actions can you take? Yes, exactly: buy some amount, hold, or sell some amount. And what kind of reward do you want to set? You want this agent to make a profit, so you reward the agent whenever it does; you set the net profit as the reward.

RL is being used in many domains. This is not an exhaustive list, and I am not going to cover all of them; if you want to find out how RL is used in these domains, a simple Google search will get you plenty of articles and research papers. I will just cover two of them. Google has a lot of data centers with a lot of computers, and it needs to handle very variable loads, so it needs a sophisticated mechanism to direct power to each of these computers. Google uses RL to optimize its cooling systems, and by doing that they managed to bring the cooling bill down by about 40%. Similarly, RL is used in healthcare to treat patients with a condition called epilepsy: using RL, you can deliver a controlled sequence of electrical stimulations to the patient's brain in order to treat them. So it is not just theory; it is currently being used to solve complex real-life problems.
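Coming back to the stock-trading exercise for a moment, here is a minimal sketch of how that state-action-reward framing could be wrapped as an environment with a gym-style reset/step interface. The price series, sentiment scores, and trading rules below are random placeholders for illustration, not the actual system described in the talk.

```python
import numpy as np

# Minimal sketch of the stock-trading formulation described above.
# Prices and news-sentiment scores are random placeholders; in a real system
# they would come from market data and a sentiment model run on news articles.
class StockTradingEnv:
    BUY, HOLD, SELL = 0, 1, 2            # the discrete action space

    def __init__(self, n_steps=250, window=5, seed=0):
        rng = np.random.default_rng(seed)
        self.prices = 100 + np.cumsum(rng.normal(0, 1, size=n_steps))
        self.sentiment = rng.uniform(-1, 1, size=n_steps)   # per-day news score
        self.window = window

    def reset(self):
        self.t = self.window
        self.shares = 0
        self.cash = 1000.0
        return self._state()

    def _state(self):
        # State = recent raw prices plus the latest news-sentiment score.
        return np.append(self.prices[self.t - self.window:self.t],
                         self.sentiment[self.t])

    def step(self, action):
        price = self.prices[self.t]
        worth_before = self.cash + self.shares * price
        if action == self.BUY and self.cash >= price:
            self.shares += 1
            self.cash -= price
        elif action == self.SELL and self.shares > 0:
            self.shares -= 1
            self.cash += price
        self.t += 1
        worth_after = self.cash + self.shares * self.prices[self.t]
        reward = worth_after - worth_before    # reward = change in net worth
        done = self.t >= len(self.prices) - 1
        return self._state(), reward, done

env = StockTradingEnv()
state, done, total = env.reset(), False, 0.0
while not done:
    action = np.random.randint(3)              # random policy as a placeholder
    state, reward, done = env.step(action)
    total += reward
print("profit of a random policy:", round(total, 2))
```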
So, very quickly, let us jump into the mathematics behind reinforcement learning. RL is modeled as something called a Markov Decision Process, or MDP. An MDP has a set of states; these are the observations the agent can make about the environment. Here I have highlighted just three states. An MDP also has some actions, which allow the agent to move between these states. Here I have taken two actions: you can see the blue action and the red action.

If you look at the blue action from state S1, it can either take the agent back to state S1 or move it to state S2. This models the stochasticity, or randomness, that happens in real life. For example, you go and ask someone out: they might say yes, or they might refuse you. The state you end up in is not fixed in advance; rather, it is a probability distribution over states, and this is encoded by something called the transition probability function. So you can read this MDP as: from state S1, if you take the blue action, with probability 0.1 you go back to state S1, and with probability 0.9 you go to state S2. Finally, you have rewards associated with each state. As you can see, state S0 has a reward of 1, state S1 a reward of 0, and state S2 a reward of minus 1. Ideally, you want to take actions that let the agent stay in state S0 as much as possible.

Now we have this MDP formulation. What do we do with it? The goal is to learn something called a policy. A policy basically says: if you are in this state, perform this action. And this policy should satisfy a criterion for optimal behavior, which is to maximize the expected long-term reward. Please don't worry about that equation right now; all you need to know is that we want to maximize the rewards we can get in the future. For example, turn your attention to state S2. If you take the blue action from state S2, with probability 1 you go back to state S0 and get that reward of 1. If you take the red action, you might stay in state S2 with probability 0.8 and keep collecting a negative reward. So the action that maximizes the expected long-term reward here is the blue action. We want to find a policy that gets us better rewards in the future. The good news is that for an MDP, an optimal policy, that is, a policy which maximizes this quantity, always exists.

In the real world, though, the agent sitting in this environment does not have access to this MDP. I have grayed it out: the only thing the agent knows is "I am in this state and I can take these actions." So the goal of the agent is to collect some form of experience from the environment. Typically, the agent says, "I am in this state, and I am taking this action." The environment takes that action, simulates it internally, and returns the next state and the reward back to the agent. This full tuple of state, action, next state, and reward is called the experience gained by the agent. The task of the agent is to collect more and more experience and somehow figure out the optimal policy from it.

There are many ways to do this. I'll very briefly cover one of the most popular algorithms, called Q-learning, and again, please don't worry about the equation. In Q-learning, we estimate something called the action-value function, this Q(s, a) function. It tells you the expected long-term reward, the utility, or the goodness, of being in a certain state and taking a certain action. So this function basically tells you how good it is to take an action from a certain state. In Q-learning, we initialize this function randomly. What this table essentially says right now is that the blue action is slightly better than the red action. The task of the agent is again to collect some experience, but now it uses this table while collecting it.
Suppose the agent is at state S0. It consults the table to check which action is best; right now, from state S0, that is the blue action. So it takes the blue action, observes the next state and the reward, and then updates the table using an equation that you don't need to worry about right now. What that update gives us is a better estimate of the expected utility of being in that state and taking that action. So we initialize with some random values, we keep updating them by collecting experience, and over time we end up with a table like this. After a thousand or so iterations, the table tells you that the blue action at state S0 is good, the red action at state S1 is good, and the blue action at state S2 is good again. So, purely through interaction with the environment, we have computed the optimal policy, and the optimal actions are represented by the dashed lines in this diagram. Another piece of very good news: Q-learning always converges to the optimal policy. So if you have any arbitrary reinforcement learning problem modeled like this, you can apply Q-learning to get the optimal policy. This is subject to some terms and conditions, and one important condition is that you make sure to explore from time to time, meaning that sometimes you deliberately take a suboptimal action. This ensures the agent does not get stuck in a loop and that all the states are properly explored.
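To make this concrete, here is a minimal sketch of the tabular Q-learning loop just described, run on a small three-state, two-action MDP in the spirit of the one on the slides. Only a few of the transition probabilities were actually mentioned in the talk, so the remaining entries, as well as the learning rate, discount factor, and exploration rate, are made-up values for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three states S0, S1, S2 and two actions: 0 = "blue", 1 = "red".
# Rewards are attached to the state you land in, as on the slides:
# S0 -> +1, S1 -> 0, S2 -> -1.
STATE_REWARD = np.array([1.0, 0.0, -1.0])

# P[s][a] = list of (probability, next_state). Only a few of these numbers
# were given in the talk (blue from S1: 0.1 stay / 0.9 to S2; blue from S2:
# 1.0 to S0; red from S2: 0.8 stay); the rest are invented for this example.
P = {
    0: {0: [(0.9, 0), (0.1, 1)], 1: [(0.5, 1), (0.5, 2)]},
    1: {0: [(0.1, 1), (0.9, 2)], 1: [(0.8, 0), (0.2, 1)]},
    2: {0: [(1.0, 0)],           1: [(0.8, 2), (0.2, 1)]},
}

def step(s, a):
    probs, nexts = zip(*P[s][a])
    s_next = nexts[rng.choice(len(nexts), p=probs)]
    return s_next, STATE_REWARD[s_next]

Q = rng.uniform(0, 0.1, size=(3, 2))    # start from random estimates
alpha, gamma, eps = 0.1, 0.9, 0.1       # learning rate, discount, exploration

s = 0
for _ in range(10000):
    # Explore from time to time, otherwise act greedily w.r.t. the Q-table.
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)                        # learned action values
print(np.argmax(Q, axis=1))     # greedy policy: best action in each state
```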
Very quickly, to recap: we have an environment modeled as an MDP, and we have an RL agent interacting with this environment with the goal of finding an optimal policy through trial and error, which can be done with an algorithm called Q-learning. Now, there is a huge, glaring problem with this combination. In the real world, we do not get these nice discrete state and action spaces. Think of a driverless car. It has a lot of sensors embedded in it, and the state of the car at any time t is the set of sensor readings, so that is a continuous state. Similarly, an action for a driverless car is something like "rotate the steering by this amount", so the action space is continuous too. Coming back to our acrobatic agent: its state is actually an n-dimensional vector containing the joint angles, velocities, and positions of the agent, and the action is an n-dimensional vector containing the torque, or force, you need to apply to each of the joints. So we need to adapt the Q-learning algorithm to handle these real-world use cases. The idea is very simple: you represent the Q-table as a neural network. This neural network takes the now-continuous state, a vector of real values rather than a discrete state, as input, and outputs the Q-value for each action. Earlier, to get the best action for a state, we consulted the table; now we consult this neural network instead. How the loss function changes, I am not going to cover here. So that handles continuous state spaces. What do we do if we also need to handle continuous action spaces? Earlier we had one output neuron per action, representing the Q-value for that action. If the action space is continuous, in theory we would need an infinite number of neurons to represent it. So what we do instead is have two neural networks. One neural network takes the state of the agent as input and outputs an action. The other neural network takes that output action together with the state as input and outputs the Q-value. This kind of architecture is commonly known as an actor-critic architecture: you have an actor, which takes the state as input and outputs an action, and a critic, which tells the actor how good its action was. The training is set up in such a way that, over time, using this feedback from the critic, the actor takes better and better actions.

That is a lot of information to take in. The great thing about using a neural network to approximate your Q-function is that you can now harness the full flexibility and power of deep learning architectures in your reinforcement learning. For example, if the state your agent senses is an image or a video, you know there is an architecture called a CNN that works really well with images and video, so you can use a CNN to approximate your Q-function. What I have really done in the last ten minutes is summarize a decade of research into a few slides. In a nutshell, deep RL can be used to solve extremely complex problems like Go and Dota 2; I am not going to go into them, but you can read about that. Deep RL algorithms are very hard to get to converge, they require a lot of hardware, and they rely on a number of tricks to make the neural networks converge, so there are a hundred small things I am not telling you. If you really need to go deep into the technical implementation, you can check out our link. The slides will be uploaded; just give us a couple of days to add some helpful comments to them.
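Before moving on, here is a minimal sketch, assuming PyTorch, of the two-network idea described above: an actor that maps a continuous state to a continuous action, and a critic that maps a state-action pair to a Q-value and provides the feedback used to improve the actor. The dimensions and updates below are illustrative, roughly in the spirit of DDPG-style methods; replay buffers, target networks, and the other tricks mentioned above are omitted.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, MAX_TORQUE = 24, 8, 1.0    # made-up dimensions

class Actor(nn.Module):
    """Maps a continuous state to a continuous action (e.g. joint torques)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 128), nn.ReLU(),
            nn.Linear(128, ACTION_DIM), nn.Tanh(),    # bounded output
        )
    def forward(self, state):
        return MAX_TORQUE * self.net(state)

class Critic(nn.Module):
    """Maps (state, action) to a scalar Q-value: how good was that action?"""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )
    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

actor, critic = Actor(), Critic()
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
gamma = 0.99

# One toy update step on a random batch of experience tuples (s, a, r, s').
state      = torch.randn(32, STATE_DIM)
action     = torch.randn(32, ACTION_DIM)
reward     = torch.randn(32, 1)
next_state = torch.randn(32, STATE_DIM)

# Critic update: regress Q(s, a) toward r + gamma * Q(s', actor(s')).
with torch.no_grad():
    target = reward + gamma * critic(next_state, actor(next_state))
critic_loss = nn.functional.mse_loss(critic(state, action), target)
critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

# Actor update: nudge the actor so the critic scores its actions higher.
actor_loss = -critic(state, actor(state)).mean()
actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```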
Now, switching to a separate topic; why we are doing this will make sense in a while. How do we as humans learn? A major part of our learning is done by imitating others, and we often do this subconsciously; a large part of our attitudes, language, behavior, and skills is shaped this way. So the question is: can our RL agents learn through imitation? The answer is yes. This was a self-driving car that was taught how to drive using imitation learning: it copied the moves of a human driver. By the way, ALVINN stands for Autonomous Land Vehicle In a Neural Network. The cool thing is that this was done in the 1980s, so this research has been going on since then. There is an entire field dedicated to teaching agents this way, and it is called imitation learning. You have some expert, and this expert can be some recorded data, or it can be a human, or, in our case, it can even be a YouTube video, which teaches the RL agent what to do.

So the question is: why would we need someone to teach an RL agent what to do when it is already doing so well on its own? As a motivating example, initially the RL agent is completely random; we initialize it randomly. As soon as it explores and reaches that first reward, it starts learning. But the problem is: what if it never gets to that first reward? Imagine you have to train a self-driving car to go from one place to another, and you give it a plus-one reward every time it reaches there safely. Through random exploration, how many tries do you think it will take? It can be practically infinite. In cases like these, domains with sparse rewards, you can use imitation learning so that the agent is not random in the beginning but starts with some good policy, and then you can use reinforcement learning to improve upon that policy.

One alternative is to modify the reward function. You could give this agent a reward for driving well: for staying in the middle of the road, not hitting anyone, driving under the speed limit. But there is no guarantee that the agent will actually learn to drive by following that reward function; it is just much simpler to demonstrate to the agent what to do. There are again many different ways to do this, and I am going to leave you with two very simple ideas. One: you ask your expert for some supervised samples, which basically say, "if you were in this state, what action would you take?" You take these samples and train your actor network in a fully supervised setting. Initially the actor network was random; now it has some reasonable policy, and you can use RL to improve upon that policy. The second way is a method called auxiliary rewards: along with the normal reward the agent is getting, you give it an extra reward if it is doing exactly what the human is doing, or you can penalize it for not doing what the human is doing. So, that hopefully brings an end to a lot of technical detail. The conclusion of all of this is that you now know our acrobatic agent uses an actor-critic network, which is a kind of deep RL architecture, and it is trained using imitation learning with auxiliary rewards. This brings an end to my portion of the talk. Shiv will take it from here. Thank you very much.

Thanks everyone. Well, so far we have seen what reinforcement learning is, then how to formulate a reinforcement learning problem as a Markov decision process, then how to learn optimal policies using Q-learning. And the last thing he talked about is imitation learning; that is the most important part we are going to carry forward from his slides. So, in imitation learning, as you have seen, the RL agent learns a policy that can mimic a given demonstration, and it works in more of an actor-critic architecture with auxiliary rewards. Now, I don't know if you have noticed what kind of input is given to the agent during training time. It is nothing but a sequence of states and actions. As Samedan already mentioned, here state means the joint angles, velocities, and positions, and accordingly, the actions are the amount of torque you provide to each individual joint.

Now, let's have a look at how to collect this kind of data set. One way to collect such a data set is called teleoperation, where a human being performs some actions and the same are propagated to a robot or a simulated character with the help of some sensory devices, and what you record is the states and actions of the simulated character. That is what you give as input to the RL agent at training time. Another way to collect a similar kind of data set is called motion capture. In that case, a human being performs some actions, and those actions are recorded in an environment with a green screen and a lot of cameras from several viewpoints. The actor wears a lot of those spatial markers, shown there as white dots, and with the help of these markers a 3D reconstruction of what the human is doing is created. That is given as input to your RL agent.
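Whichever way the demonstrations are collected, what reaches the agent is a sequence of (state, action) pairs. Here is a minimal sketch, again assuming PyTorch, of the first idea mentioned earlier: pre-training the actor on expert pairs in a fully supervised way (behavior cloning) before letting RL improve on it. The demonstration tensors and network sizes are placeholders.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 24, 8            # same made-up dimensions as above

# Placeholder "expert" demonstrations: in practice these (state, action)
# pairs come from teleoperation or motion capture; here they are random.
demo_states  = torch.randn(1000, STATE_DIM)
demo_actions = torch.tanh(torch.randn(1000, ACTION_DIM))

actor = nn.Sequential(                    # the policy network to pre-train
    nn.Linear(STATE_DIM, 128), nn.ReLU(),
    nn.Linear(128, ACTION_DIM), nn.Tanh(),
)
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

# Behavior cloning: plain supervised regression from state to expert action.
for _ in range(200):
    loss = nn.functional.mse_loss(actor(demo_states), demo_actions)
    opt.zero_grad(); loss.backward(); opt.step()

# The actor now starts reinforcement learning with a sensible policy instead
# of a random one, and RL (plus auxiliary rewards) can improve on it.
```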
Now, the problem is that it is extremely difficult to build that kind of motion-capture environment. It requires a lot of machinery, it is very expensive, and it consumes a lot of time. Another difficulty is that there are very few experts who can do this kind of setup. And since we are talking about acrobatic tasks, a researcher like me would have to do a backflip; I mean, that is not going to work, right? But if you search for "backflip" on YouTube, you will see plenty of videos where some expert has already done it, recorded it, and uploaded it. You will see videos where an expert teaches you a backflip in 5 steps, then teaching it in 5 minutes, then learning everything in 5 minutes, then some prank where a grandpa is doing a backflip, and a few more variations. Then comes the father of all backflips, the Indian version of doing it, and finally you can even find a robot doing a backflip. The point I am trying to make is that you can search for any random task on YouTube and you will find plenty of videos with all sorts of variety. The idea is: how do you leverage this existing, huge corpus of freely available videos to train your agent?

In general, what we expect has already been discussed by Simiton: given a video as input, you want the agent to learn a policy which performs what is being demonstrated in the video. Now, remember, it is not a one-to-one, frame-by-frame transformation from the video to the RL agent. Even if you throw a lot of blocks at the agent, it will still be able to finish the task; it is that kind of dynamic learning. The behavior can also be retargeted to different kinds of characters with different body shapes and different body weights. As you see here, it does really well on the Boston Dynamics virtual character.

Now, the question is: how do you train this kind of system? What kind of modeling do you use? As Simiton already talked about, imitation learning works really well to mimic a given demonstration, so imitation learning can be used to solve this problem. To do that, all you have to figure out is: what is the state space, what is the action space, and what kind of auxiliary reward do you use? And then you have to take up another problem, called the domain gap. In this framework, on one side you have a video, which is a sequence of RGB images, but on the other side, what the RL agent understands is states and actions, where the state is represented, as we told you earlier, by joint angles, velocities, and positions. So somehow, through some mechanism, you have to convert the state in the video into the kind of state the RL agent understands, and this method is called domain gap alignment. Here is how it works. You have a video, and you give that video to a pose estimator. For each frame, it predicts a 3D body structure of the human being. At this stage it is a frame-by-frame computation, so each frame comes with some amount of inaccuracy, and because of that you see an annoying flickering effect. Then the sequence of 3D structures is given to a module called motion reconstruction, which smooths the motion and provides temporal consistency. All of this is done to solve the problem of the domain gap, and the process is called domain gap alignment.
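As a rough illustration of the shape of this pipeline, here is a sketch with a stub pose estimator and a simple temporal-smoothing step standing in for motion reconstruction. The real system uses a learned 3D pose estimator and solves an optimization for temporal consistency; everything below is placeholder logic showing only how the pieces connect.

```python
import numpy as np

N_JOINTS = 23                      # each pose is described by 23 joints
rng = np.random.default_rng(0)

def estimate_pose(frame):
    """Stub for the per-frame 3D pose estimator.

    The real model maps an RGB frame to joint rotations; this stub ignores
    the frame and returns noisy placeholder values, which is also why the raw
    per-frame output "flickers" from frame to frame."""
    return rng.normal(0.0, 0.05, size=(N_JOINTS, 3))

def reconstruct_motion(poses, window=5):
    """Stand-in for motion reconstruction: smooth the per-frame predictions
    over time so the trajectory becomes temporally consistent."""
    poses = np.asarray(poses)
    smoothed = np.empty_like(poses)
    for t in range(len(poses)):
        lo, hi = max(0, t - window), min(len(poses), t + window + 1)
        smoothed[t] = poses[lo:hi].mean(axis=0)
    return smoothed

# A fake "video": 60 random RGB frames.
video = [np.random.randint(0, 255, size=(224, 224, 3), dtype=np.uint8)
         for _ in range(60)]

raw_poses = [estimate_pose(f) for f in video]      # frame-by-frame, jittery
reference_motion = reconstruct_motion(raw_poses)   # smoothed, consistent
# 'reference_motion' now lives in the joint-angle domain the agent understands.
```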
So, what do we have now? The input, which was a video earlier, is now the same thing in a different domain, one the RL agent understands, and it is ready to be used for imitation learning.

All of you must be wondering how this pose estimator works. It is a supervised model, and the kind of training data used is: on one side you have 2D images of humans in different poses, and for all of these images you have 2D and 3D annotations. Every pose of a human being is represented by 23 joints; each of those white dots is a joint. When I move like this, each joint has a different position, velocity, and angle. So, in general, every pose of a human being can be represented by a d-dimensional vector, and next we will see how to learn a model that predicts this kind of representation. The framework has an encoder and a discriminator. The encoder predicts this d-dimensional representation given a 2D image, and the discriminator, given such a d-dimensional representation, says whether it looks like a human being or not. The way the encoder works: you have a 2D image, it is given to an image encoder based on a ResNet architecture, and this encoder gives you a latent representation of the 2D image. That latent representation is given to a 3D regression model based on SMPL, which predicts the d-dimensional representation of your 3D character. SMPL, which is the Skinned Multi-Person Linear model, I am not going into; that would be too much detail for this talk. But you can consider SMPL as a module: given the latent representation of a 2D image, it lets you predict all of the parameters of the 3D model.

Now, the difficult part is that some of the parameters predicted by this encoder do not look like a human being. As an example, a human being cannot have a 360-degree rotation of the head, but some of the predicted parameters, if you render them, look like the head is flipped to the other side. Given this kind of predicted values, the discriminator should reject them and say it does not look like a human being, it looks like a monster. The discriminator we have is a binary classifier, trained on input data containing poses from many different people, which, given a set of parameters, tests whether they correspond to a human being or not. And this training happens in an adversarial manner; for people who understand how a GAN works, it is almost the same. I am not going into the details of the optimization function, but what you should take away is that there is one cost function minimized with respect to E, where E is the encoder, and another cost function minimized with respect to D. Another thing you should notice is that it is not D but D_i: D_i is the discriminator for the i-th joint. Since there are 23 joints, there are 23 independent discriminators, trained separately.

Cool. Next, I have some results on how this 3D prediction works, and please don't judge the characters by their performance. My good friend Samarit is trying to do the floss, and you can see how the 3D prediction works on that. Then me, trying to do Gangnam Style, with the visualization shown in terms of these joint angles; as you can see, for every pose, each joint has a different position, velocity, and angle. Then some famous steps. Cool, and then some random actions.
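Going back to the encoder-discriminator training for a moment, here is a minimal sketch of that adversarial setup, assuming PyTorch: an encoder predicts pose parameters from an image feature, and one small discriminator per joint learns to separate real human poses from predicted ones, while the encoder tries to fool them on top of its supervised loss. The feature dimension, network sizes, loss weight, and data are all placeholders; only the GAN-style structure of the two objectives is the point.

```python
import torch
import torch.nn as nn

N_JOINTS, JOINT_DIM, FEAT_DIM = 23, 3, 512    # made-up sizes

encoder = nn.Sequential(            # stands in for the ResNet + SMPL regression
    nn.Linear(FEAT_DIM, 256), nn.ReLU(),
    nn.Linear(256, N_JOINTS * JOINT_DIM),
)
# One independent discriminator per joint, as described in the talk.
discriminators = nn.ModuleList([
    nn.Sequential(nn.Linear(JOINT_DIM, 32), nn.ReLU(), nn.Linear(32, 1))
    for _ in range(N_JOINTS)
])

enc_opt  = torch.optim.Adam(encoder.parameters(), lr=1e-4)
disc_opt = torch.optim.Adam(discriminators.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Placeholder batch: image features, annotated poses, and real human poses.
feats     = torch.randn(32, FEAT_DIM)
gt_pose   = torch.randn(32, N_JOINTS, JOINT_DIM)
real_pose = torch.randn(32, N_JOINTS, JOINT_DIM)

# Discriminator step: real poses should score 1, predicted poses 0.
with torch.no_grad():
    fake_pose = encoder(feats).view(-1, N_JOINTS, JOINT_DIM)
d_loss = 0.0
for j, d in enumerate(discriminators):
    d_loss = d_loss + bce(d(real_pose[:, j]), torch.ones(32, 1)) \
                    + bce(d(fake_pose[:, j]), torch.zeros(32, 1))
disc_opt.zero_grad(); d_loss.backward(); disc_opt.step()

# Encoder step: supervised pose loss plus a "look like a human" adversarial
# term that tries to fool every per-joint discriminator.
pred_pose = encoder(feats).view(-1, N_JOINTS, JOINT_DIM)
e_loss = nn.functional.mse_loss(pred_pose, gt_pose)
for j, d in enumerate(discriminators):
    e_loss = e_loss + 0.1 * bce(d(pred_pose[:, j]), torch.ones(32, 1))
enc_opt.zero_grad(); e_loss.backward(); enc_opt.step()
```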
So far, what we have achieved is to solve the problem of domain gap alignment: we have transformed the input from the video domain into a domain that can be understood by the agent. Next, what we have to do is the imitation learning. Initially, the policy you have does not know how to do the particular task, so it just takes some random actions; as you see, it is just falling down on the ground. Then, with the help of imitation learning, the agent takes the reference motion as input, goes through the imitation learning, and the policy learns how to do the task demonstrated in the given video. The most important thing we have to figure out is what the reward looks like. As I told you earlier, the state of the agent is represented by pose, velocity, end effectors, center of mass, all of those. The agent gets a reward only if it follows the given trajectory in the reference motion; if it does not follow it, it does not get any reward, so it tries to copy what is shown in the reference motion. The final reward is a weighted combination of these four individual sub-rewards. Cool.
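Here is a minimal sketch of what such a weighted reward could look like. The four terms follow the description above (pose, velocity, end effectors, center of mass); the exponential form, scales, and weights below are illustrative choices in the spirit of DeepMimic-style rewards, not values given in this talk.

```python
import numpy as np

def imitation_reward(agent, ref,
                     w_pose=0.65, w_vel=0.10, w_end=0.15, w_com=0.10):
    """Weighted combination of four sub-rewards, each comparing the agent's
    state to the reference motion at the same time step. 'agent' and 'ref'
    are dicts with 'pose', 'vel', 'end_eff' and 'com' arrays; the weights
    and exponential scales are illustrative, not values from the talk."""
    r_pose = np.exp(-2.0  * np.sum((agent["pose"]    - ref["pose"])**2))
    r_vel  = np.exp(-0.1  * np.sum((agent["vel"]     - ref["vel"])**2))
    r_end  = np.exp(-40.0 * np.sum((agent["end_eff"] - ref["end_eff"])**2))
    r_com  = np.exp(-10.0 * np.sum((agent["com"]     - ref["com"])**2))
    # The agent is rewarded only to the extent that it tracks the reference.
    return w_pose * r_pose + w_vel * r_vel + w_end * r_end + w_com * r_com

perfect = {k: np.zeros(3) for k in ("pose", "vel", "end_eff", "com")}
print(imitation_reward(perfect, perfect))   # perfect tracking -> reward 1.0
```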
There is one more difficulty. The question is: how will the agent know that doing a full rotation in mid-air will result in a successful backflip? Most RL agents work retrospectively: they do some random exploration, and if they see that a particular state results in a high reward, they go for it. So, to give the RL agent a hint about doing a full rotation in mid-air, we initialize the RL agent at the start of each episode to a random state: sometimes it starts from the ground, and sometimes it starts in the middle of a flip. This process is called adaptive initial state initialization, and it helps the agent learn very complex tasks.

Well, I am switching topics a bit. We said that SMPL is used to predict the three-dimensional parameters. Now, SMPL is trained on a fixed input data set, and the kind of input you have might be completely different, or very skewed, from the kind of input SMPL was trained on; you can have some quite different poses. At a very high level, if you look at the kinds of learning we have today: you can go for traditional supervised learning, where you annotate each and every data point you have, or you can go for active learning, where rather than annotating everything you just select some representative points and annotate those; I am not going into the details of the other ones. But another way to get some kind of supervision is called self-supervision, where you don't tag the data explicitly, but the data itself provides the supervision. To give a very naive example: say you have some images and you want to build a classifier which, given an image, predicts how much the image has been rotated, whether 0 degrees, 90, 180, or 270. You take an image without any rotation, you rotate it by 90 degrees through some mechanism, and that becomes your training data for 90-degree-rotated images, and so on. Now, in our context, how do we introduce self-supervision? You take an image at time t1 and another image at time t2. At one point I am here, and at time t2 I have moved by some distance. For both of these you predict the 3D parameters, those d-dimensional representations. The amount of movement you see between the two 3D predictions should match the amount of movement you see on the segmented foreground images, and to measure the movement in the 2D segmentation, optical flow is used. Through this feedback loop, you can fine-tune the parameters predicted by SMPL.
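Here is a minimal sketch of the shape of that self-supervised consistency signal, assuming PyTorch. The predicted keypoints and the optical-flow displacement are placeholder tensors; in the real system, the 2D displacement comes from optical flow on the segmented foreground and the predictions come from the SMPL-based estimator.

```python
import torch
import torch.nn as nn

N_JOINTS, FEAT_DIM = 23, 512

# Stand-in for the pose predictor being fine-tuned: maps an image feature to
# 2D keypoint locations (in the real system, projections of the predicted 3D
# joints back onto the image).
predictor = nn.Linear(FEAT_DIM, N_JOINTS * 2)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-5)

feat_t1 = torch.randn(1, FEAT_DIM)        # frame at time t1
feat_t2 = torch.randn(1, FEAT_DIM)        # frame at time t2
# Placeholder for the displacement of the segmented foreground between the
# two frames, as measured by optical flow.
flow_displacement = torch.randn(1, N_JOINTS * 2)

kp_t1 = predictor(feat_t1)
kp_t2 = predictor(feat_t2)
pred_displacement = kp_t2 - kp_t1

# Self-supervision: no manual labels, the video itself provides the target.
# The motion implied by the predictions should match the motion seen in 2D.
loss = nn.functional.mse_loss(pred_displacement, flow_displacement)
opt.zero_grad(); loss.backward(); opt.step()
```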
I am done with the technical part of our talk; let me show you a few more results. The agent is so robust that even if you throw a block aimed at its head, it is still able to finish its task successfully. Even if you put the agent in an extremely difficult environment with a lot of ups and downs and uneven surfaces, it can still finish its task. You can also increase the complexity of the environment; here you can see there are disjoint surfaces. Look how it tries to adjust to those disjoint surfaces, and eventually, when there is almost no way to adjust, it falls down. This learning can be done not only on a simulated character; the same can also be transferred to a physical robot, certainly in the future. I am just showing an example of the Boston Dynamics robots, but this kind of learning from YouTube videos could be propagated to a physical robot like this. And as research makes more progress in this domain, what you can expect in the future is something like this, where a robot takes care of your cooking. All you have to decide is what you want to eat on a particular day, and the learning can be done from demonstrations in a lot of videos where a human is cooking a recipe. This was a prototype proposed by Moley Robotics, and the robot not only does the cooking for you, it also does the cleaning afterwards. Cool. Well, we are done with our presentation; we will open it up for any questions. Questions? Any questions?

During the learning, states can get added too, right? Because, say, for a self-driving car, suppose it was first driving in a city and then it went to a desert. As it encounters new scenarios or new situations, it will take those scenarios as part of its states, right?

Yeah. So, what I was saying is that as input you have a video, and each video has states. Now, as you are saying, if you have a demonstration that has not been seen earlier: if it mimics the kind of states you had earlier in your state space, then it will definitely carry over into the kind of learning you have.

So that means, if this particular car is now hitting the desert, then I have to give it a video of the desert too? Let me give you a very common scenario. Suppose my car is travelling; it is a self-driven car, and it starts travelling from Bangalore and is going towards, say, Bengal. Now I suddenly encounter a village, because you have to cross so many villages on the highway. Earlier it was city roads, then it became a highway, but suddenly I reach a village, and I have not prepared that model with pictures of a village road. So what happens?

Suppose it was already driving on the highway; it will be travelling very well. But if it suddenly encounters a desert, it will hit this new state and have little idea what to do. It will have some kind of prior from its driving on the highway, but it will need time to adapt; it will need a few more cycles. It will not do a great job at first, but eventually it will. Okay? Sure. Any other question from this side?

I just have a couple of questions. How are you basically using this model on your work front? (The rest of the question was inaudible.)

Right. So, the first part of your question: how to deploy this model in more of a production system. After doing the learning, what you get is a policy, an optimal policy learned through training. If you have experience training a neural network, you know you get a model and then you use it at inference time. Similarly, the policy is what you have already learned, and it can be used at prediction time. So it is analogous to the way you train a neural network and keep the model for prediction. Now, coming to the question of accuracy: having 99% accuracy, or very high accuracy, I would say is sometimes possible and sometimes not, based on the kind of environment you have at prediction time. If the environment given at prediction time is analogous to the environment seen during training, it will do really well.

I would like to ask a question on the other side of this. 99% accuracy or not, how are these models actually used? Are these models typically only relevant to these kinds of body-motion applications, or can they be applied to other kinds of things? Because there are other applications, like multi-agent models and all that. So what are the other things you can apply these techniques to?

I'll answer that question. In our work, we have used this for two things. One is to predict stock prices, and you can guess how that would be modeled. The second is as an intelligent way to annotate our data automatically, mimicking the process a human follows: if a human is going through some process while tagging the data, we can use reinforcement learning to mimic that process. Apart from that, if you have any kind of control system you want to optimize, like an elevator, or traffic lights on a highway, where the environment is very dynamic, you can use reinforcement learning. And the third most common use case in industry is recommendation engines, especially ad recommendations and layout recommendations. Since you mentioned multi-agent systems, I am not going to cover that here. Does that answer your question? Thank you.