OK, maybe we should start; I guess we've lost a few people. This is the outline that I gave you at the beginning. We've covered, let's say, the first two parts: an overview of supervised and unsupervised learning, and, for unsupervised learning, a bit of PCA in more detail. Now we're going to see some details about policy gradient for reinforcement learning. Then, for the rest of this hour and in the afternoon, I'm going to talk more specifically about artificial neural networks, which can actually be used for all three of the big families of machine learning algorithms that I've introduced. We stick to the original program, which means that the tutorial starts at half past three. And again, interrupt me whenever you want.

Just to remind you, we were in this situation. In reinforcement learning, we aim, essentially, at learning the best sequence of actions that we can exert on what we call the environment, which responds to our actions in ways that we can observe but do not know in advance. Our agent also receives rewards as information, and the goal is to maximize these rewards: to find what is called, in this setting, a policy of actions that maximizes the rewards. Policy gradient is essentially a technique to find this policy of actions. We learn a probabilistic policy; I cannot enter into the details of this, but it's much better to learn a policy that is not deterministic: the training is much easier to do, and the performance is much better. So we want a probabilistic policy that tells us, given a certain state of the environment, which action to perform and with which probability; then, given the subsequent state of the environment, which other action, and so on. In this way we form trajectories of states that we observe and actions that the agent performs, such that the overall trajectory of states and actions maximizes the rewards received. So that's the point.

More specifically, this notion of probabilistic policy means the following. There is a certain state of the environment at a certain time, let's say t. I observe, as an agent, the state of the environment, and I want to decide the action to perform at time t. The policy is a conditional probability: I want to find the conditional probability, pi, that, given a certain state of the environment s_t, an action a_t is performed. This conditional probability is going to be parameterized, just as we saw for supervised learning, and let's again call this set of parameters the weights w. I'm not doing that just for economy of letters: we will see that this problem can be fed into a neural network in essentially the same way that supervised learning problems are, and we're going to see more details of that later on. These are the parameters that we will then optimize, for example using feed-forward neural networks or other methods. So a_t is the action and s_t is the state of the environment, and once a policy is given, a trajectory of states and actions can be determined. So, what is the probability of a certain trajectory, which I call tau, standing for trajectory? What is the probability that a certain trajectory occurs?
Well, it's a chain: the probability of being in a certain state, times the conditional probability of an action in that state, times the probability of being in another state given that that action was performed. Say we start at the initial time t_i and stop our sequence of states and actions at the final time t_f. Then we have the conditional probability that I want to optimize, pi_w(a_t | s_t), of performing a certain action a_t at time t given that the state of the environment is s_t, times the probability that, given that the environment is in state s_t and action a_t is performed, the environment is in state s_{t+1} at the later time. This chain of conditional probabilities gives me the total probability of a certain trajectory. I start at the initial time with a given state; the policy tells me the probability of performing a certain action; then, given that state of the environment and that action performed on it, there is a probability of being in the subsequent state of the environment; and we take the product over all times. The trajectory, as I said, is a sequence of actions and states: actions from the initial time to the final time, and states from the initial time up until the final time.

And then, as I said, the goal is to maximize the total reward. Is this notation more or less clear? So, what is going to be my overall reward? In particular, since this is a probabilistic policy, the expected overall reward. Given a certain trajectory, we know the state of the environment at each time, so we know whether we have been given a reward or not; so to each trajectory we can associate the corresponding reward. The overall expected reward is then given by the sum, over all possible trajectories, of the reward of each trajectory weighted by the probability of that trajectory. And we want to maximize this reward with respect to what? With respect to the parameters of our policy: the probability of a given trajectory depends on the policy chosen, on the actions chosen, so this total reward depends on the parameters w of my policy. Yes?

[A question from the audience.] Very good question. No, this transition probability is totally unknown. The point is that we can set the policy and we can observe how the environment reacts, but we do not know this probability. Thanks. So, this is the reward for a given trajectory, and we sum over all trajectories.
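To fix the notation, the two quantities just described can be written as follows. This is a reconstruction of the blackboard formulas from the verbal description; pi_w, the transition probability p, the reward R, and the times t_i, t_f are as introduced above:

```latex
% Probability of a trajectory tau = (s_{t_i}, a_{t_i}, s_{t_i+1}, \dots, s_{t_f})
P_w(\tau) \;=\; p(s_{t_i}) \prod_{t=t_i}^{t_f-1} \pi_w(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)

% Expected overall reward, to be maximized over the policy parameters w
\mathbb{E}_w[R] \;=\; \sum_{\tau} P_w(\tau)\, R(\tau)
```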
Now, as I said, the point is that we want to maximize this expected overall reward over w. And the tricky part, which I am not going to prove, is that this maximization can be performed even if you do not know the transition probabilities. Why? Because one can prove that the gradient, with respect to the parameters w, of the expected overall reward, which you would get by plugging the trajectory probability P_w(tau) into the formula above, simplifies to a sum over all times, from the initial to the final time, of the expectation value over trajectories of the reward of each trajectory times the derivative with respect to w of a quantity that involves only the policy: the logarithm, actually, of the policy, log pi_w(a_t | s_t), of performing action a_t when the state of the environment is s_t. The average is over all trajectories. I'm sorry, but you have to take this for granted; I cannot derive it now. But you can imagine that you have to play a bit with the fact that the trajectory probability is a product of terms, so that when you take the log you obtain a sum, and this is an average, so you can bring the average inside; there is a bit of massaging to arrive at that formula.

The point is that it's an average over all trajectories, and this is very akin to what one has to do in supervised learning, as I was saying before, in which you have your data with their labels and you want to minimize a cost function: there too, you have to average over all your data. We will see that this can be done efficiently by a technique used in neural networks called stochastic gradient descent. Stochastic gradient descent means calculating this type of gradient, and this type of gradient can be calculated efficiently using an algorithm called the backpropagation algorithm, which we are going to see later. So the point is that the maximization can be performed using a combination of stochastic gradient descent, SGD, plus the backpropagation algorithm, which are the generic tools in the bag of neural networks. OK, that's the general idea. Clear enough?

[A question from the audience.] Yes, there is a link; about the speed of it, I don't know, but there are people who have studied this link quite heavily. We will see a bit of how it works when we get to the backpropagation algorithm; it's a sum over trajectories. The speed of convergence of stochastic gradient descent is a beast of a topic: what the speed of convergence of gradient descent depends on is a question in itself. [Another question.] So, the question is how the noise in the data that we use to estimate the gradient affects the convergence of stochastic gradient descent to the true gradient. The more noise there is, the longer it takes; but, again, that's a beast. [Question: does the noise scale with the number of different paths?] I think the short answer is no; if your data are reasonably under control, it shouldn't. That's at least the empirical evidence: these methods work.
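To make the recipe concrete, here is a minimal sketch in Python (NumPy) of this gradient estimator, grad_w E[R] = sum_t E_tau[ R(tau) * grad_w log pi_w(a_t | s_t) ], estimated by sampling trajectories. The tabular softmax policy, the env object with reset/step methods, and the learning rate are my hypothetical choices for illustration, not anything specified in the lecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                    # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_trajectory(env, w, T):
    """Roll out one trajectory under the softmax policy pi_w(a|s) = softmax(w[s])."""
    s = env.reset()                    # hypothetical environment interface
    states, actions, rewards = [], [], []
    for _ in range(T):
        probs = softmax(w[s])          # pi_w(. | s_t)
        a = np.random.choice(len(probs), p=probs)
        s_next, r = env.step(a)        # environment dynamics: unknown to the agent
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next
    return states, actions, rewards

def reinforce_gradient(env, w, T, n_traj=32):
    """Monte Carlo estimate of grad_w E[R] = sum_t E_tau[ R(tau) grad_w log pi_w(a_t|s_t) ]."""
    grad = np.zeros_like(w)
    for _ in range(n_traj):
        states, actions, rewards = sample_trajectory(env, w, T)
        R = sum(rewards)               # total reward of this trajectory
        for s, a in zip(states, actions):
            probs = softmax(w[s])
            g = -probs                 # grad of log softmax w.r.t. the row w[s]:
            g[a] += 1.0                # one-hot(a) - probs
            grad[s] += R * g
    return grad / n_traj

# One SGD ascent step on the expected reward (hypothetical env, learning rate 0.1):
# w += 0.1 * reinforce_gradient(env, w, T=20)
```

Note the estimator only ever evaluates the policy, never the unknown transition probabilities; that is the whole point of the formula above.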
So, for example, this approach went into the news for beating one of the world's experts at the Chinese game of Go. Let me go step by step. A pretty famous application of this policy gradient method for reinforcement learning is the following. In 2015, a reinforcement learning algorithm was designed to beat a world expert at this game. Go is a very ancient game that has been played for more than 2,000 years, so people have developed a lot of strategies for it. And it is far more difficult, in a sense, than chess, because the number of combinations of legal moves in Go is much, much larger than in chess: of the order of 10^170, something like that, and that's just a lower bound. So it was believed to be essentially impossible for a computer to learn how to beat a human being. This is because Go is played on a board that is 19 by 19. The game itself is relatively simple to explain: there are two players, one with white stones, the other with black stones, and the goal is to place your stones in such a way that you gain more territory than the opponent, meaning that you manage to enclose a portion of the board whose size is larger than your opponent's. There are a few other rules, but that's essentially the point, and you can imagine the variety of combinations of moves that can be played in this game.

So, as I said, in 2015 a private company, DeepMind, based in London, developed a combination of algorithms, the core of which was supervised learning plus reinforcement learning. There is a first part of the algorithm in which a neural network learns from the games of human experts how to play: the inputs are moves and, let's say, the labels are whether those moves turned out to be winning moves, and so on. That's the first part; but then there is a reinforcement learning algorithm, plus other components as well. A paper was published in Nature the year after. Then, after a couple of years, the same team of computer scientists managed to outperform their algorithm, called AlphaGo, with a new algorithm called AlphaGo Zero. "Zero" stands for the fact that there was no prior supervised learning in that algorithm: it started literally from scratch, playing against itself, and you see that as the training time goes up, at some point the threshold given by the dashed line, the performance of the original AlphaGo algorithm, is beaten. And experts say that, when they saw the games played by AlphaGo Zero, they realized that the algorithm was discovering strategies that were totally unknown to them. So this is, in a sense, a superhuman way of training your algorithm. That is an example outside quantum information technologies, of course.

This second example is again taken from research, but it is about quantum information, and the goal of the game here is very similar to the error-correction example that I introduced before. Here we have an agent, a neural network in this case, that can observe what happens to a set of qubits that encode quantum information but are subject to noise. So my environment is the system of qubits over which I want to encode information, the physical qubits I was telling you about before that encode the logical qubits, plus what we physicists call the environment; system plus environment is here called the environment. The agent is the program that performs observations on the system; in this case, observations on one qubit at a time. These observations are what, in this language, I call the state. Of course, they are not fully informative observations, because otherwise we would destroy the quantum information encoded in our system; they are just partial information about the system subject to noise. Then the agent performs certain actions, and since these are qubits, those actions are nothing other than combinations of Pauli operators. The rewards express the fact that this combination of observations and actions preserves the state encoded in my system. And by training with a policy gradient reinforcement learning algorithm, they were able to show that an adaptive strategy can be developed that performs better than non-adaptive strategies.
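Just to make the structure of this agent-environment loop concrete: the following is not the scheme of the work just described (which used several qubits and a neural-network agent), but a toy sketch of the same loop for the three-qubit bit-flip repetition code, where the partial observations are parity measurements, the actions are Pauli X operators, and the reward signals that the encoded state survived. The fixed lookup table standing in for the policy is mine, for illustration; a reinforcement learning agent would learn pi_w(a | observation) instead:

```python
import numpy as np

rng = np.random.default_rng(0)

def step_noise(err, p=0.2):
    """Bit-flip noise: each physical qubit of the 3-qubit code flips with prob p."""
    return err ^ (rng.random(3) < p)

def syndrome(err):
    """Partial observations (the 'state' seen by the agent): parities Z1Z2, Z2Z3."""
    return (int(err[0] ^ err[1]), int(err[1] ^ err[2]))

def act(err, a):
    """Actions: a in {0,1,2} applies Pauli X on qubit a; a = 3 does nothing."""
    if a < 3:
        err = err.copy()
        err[a] ^= True
    return err

# One round of the loop: noise -> observe syndrome -> pick action -> reward.
err = np.zeros(3, dtype=bool)
err = step_noise(err)
obs = syndrome(err)
# A fixed lookup 'policy'; the RL agent would learn pi_w(a | obs) instead.
a = {(0, 0): 3, (1, 0): 0, (1, 1): 1, (0, 1): 2}[obs]
err = act(err, a)
reward = float(not err.any())   # encoded state preserved (no residual error)
print(obs, a, reward)
```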
Applications of reinforcement learning in quantum information include, for example, quantum control; if you know a bit of quantum control, these things should remind you a lot of it. They have also been used for designing experiments, guiding the design of an experiment. And I already mentioned projective simulation, the last of my short list of reinforcement learning methods, which has been implemented both at the classical level and in a quantum version.

So I think that is more or less all I wanted to tell you in this overview. Questions? If not, we can start with the more technical part of this conversation, which is the introduction to artificial neural networks.

The story of artificial neural networks starts formally, let's say, in the '50s, when the concept of the perceptron was introduced. It is inspired by what we think biological neurons do, but at the end of the day this is of course all a model, and we might well end up, at the end of all this, understanding how a specific model learns while understanding absolutely nothing about how our brain learns; the model is so simple that it might actually have nothing to do with our actual brain. This second part is mostly taken from Nielsen's book that I mentioned at the beginning; take that as a reference if you want to go back to these things.

So let's start with the perceptron. As I was saying, this is a very simple model, and it works in the following way. A perceptron is a function that takes binary inputs, 0 or 1, let's say n of them, and delivers one binary output, 0 or 1. We can call our binary inputs x_1 up to x_n, and the output a. The function is defined in the following way: a is 0 if the sum of certain weights w_i times my binary inputs is less than or equal to minus a certain number b. This b is a real number called the bias of the perceptron, and the vector w of real numbers gives the weights. These biases and weights are the parameters of the perceptron, and you can think of them as a single big set of parameters; these are the w parameters I was talking about before. The output a is 1 if the complementary condition holds. In other words, the output a is the step function of the dot product between the vector of parameters w and the vector of inputs x, plus the bias. The b, what is it?
It's one of the parameters of the perceptron; b will also be optimized. So let's see an example. A perceptron is usually sketched as a node, labeled by the parameter b, with a set of incoming arrows, each arrow carrying a weight, and one leg that represents the output; you plug in the inputs and you get your output. That's the idea.

An important property of the perceptron, which we are going to use immediately, is that a perceptron can mimic a logical NAND gate. The proof of this property is simply to construct such a perceptron. A NAND gate is a gate that takes two binary inputs and delivers one output that is the negation of the AND gate, so 1, 1, 1, 0 is its truth table. We are now going to construct a perceptron that has two inputs and gives the same truth table as a NAND gate. We can see that a perceptron with a bias of 3 and weights of minus 2 for the two inputs does the job. Why? Let's call our two inputs x_1 and x_2 and construct the truth table. The inputs x_1 and x_2 can take the values 0,0; 0,1; 1,0; and 1,1. Then we calculate the dot product between the weights and the input vector, plus the bias. For 0,0 this part is 0, plus 3. If one of the inputs is 1, I have minus 2 plus 3, which is equal to plus 1, and the same for the other single-1 case. And if I have 1,1, it is minus 2 minus 2, equal to minus 4, plus 3, which is equal to minus 1. So the step function applied to these values gives: positive, so 1; then 1; then 1; and then 0. This is the truth table of a NAND gate.

And if you recall one of the basic results of classical information, for classical circuits, you can go to the Nielsen and Chuang book (not this Nielsen neural network book) and find this result there: the NAND gate, together with a set of copy gates, is universal, so you can express any logical function, with a set of binary inputs delivering a set of binary outputs, as a circuit of NAND and copy gates. This says that layers of perceptrons are universal: any binary function can be built using layers of perceptrons.

I can give you the example of the classical circuit for the adder of two bits. In the usual notation of classical circuits, this object is the NAND gate; I guess the vast majority of you have seen this before. This classical circuit is the one that computes the sum; I take that for granted. How can I build a network, a set of layers of perceptrons, such that this circuit is mimicked? Simply, wherever there is a NAND gate, I put the perceptron that I constructed above: all these perceptrons have a bias of, what was it, plus 3, and all these arrows have a weight of minus 2. And it delivers the sum and the carry of the two bits. Is this kind of clear?
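As a quick check of the truth table above, here is a minimal sketch in Python of that perceptron and of the two-bit adder built only from copies of it. The wiring follows the standard NAND construction of sum and carry; the function names are mine:

```python
import numpy as np

def perceptron(x, w, b):
    """Output 1 if w.x + b > 0, else 0 (the step-function perceptron)."""
    return 1 if np.dot(w, x) + b > 0 else 0

def nand(x1, x2):
    """NAND as a perceptron: bias 3, weights -2, -2."""
    return perceptron([x1, x2], [-2, -2], 3)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", nand(x1, x2))   # prints 1, 1, 1, 0

def half_adder(x1, x2):
    """Sum and carry of two bits, using NAND perceptrons only."""
    m = nand(x1, x2)
    s = nand(nand(x1, m), nand(x2, m))  # sum = x1 XOR x2
    c = nand(m, m)                      # carry = x1 AND x2
    return s, c
```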
Now, you might think there is a strange thing here: there are several arrows coming out of one perceptron, whereas I said that a perceptron has just one output. That is because these are not, say, four outputs of my perceptron; it is just one output, of which you take four copies. And we can reduce the number of these legs: you see that these two legs are redundant, in the sense that they come from the same perceptron and they end up at the same perceptron, so you can fuse them together. Of course, the weight then changes: each of the arrows was weighted minus 2, so the weight of the fused arrow is now minus 4. We can then rewrite this circuit so that it is made entirely of perceptrons, if we also treat our inputs as special perceptrons that have no input, just a bias and an output. Again, the convention is that when more than one leg comes out of a perceptron, it means there is just one output that is copied n times. So this is the simplified version of the circuit for the adder.

Now, what is the main observation that distinguishes standard logical circuits made of NAND gates from what we are going to talk about, networks of perceptrons and, in particular, networks of neurons? In circuit computation, what you are given is an algorithm, designed of course to solve a certain problem; this algorithm is compiled into gates, which means the circuit is designed: the programmer designs the circuit given the algorithm. In networks of perceptrons, or in neural networks more generally, what is given is the problem. What the programmer does is train the network, which means optimizing the weights and the biases so that the sequence of perceptrons, with proper biases and proper weights, realizes a function that solves the problem. The algorithm is not even there: the equivalent of the algorithm is simply absent, and you might end up with an already compiled circuit, let's say, whose meaning is totally obscure. And this is part of the game.

Now, there is a problem with perceptrons: when you want to train these units, in order to optimize the parameters what you want to do, obviously, is to find how the output slightly changes as a function of a small change in the parameters. However, these are binary inputs and binary outputs, so small changes in these real parameters might produce very large changes in the output, or no change at all. This is really not what you want when you optimize, because you don't know where to go: you cannot follow any gradient. So the problem is that perceptrons are binary, and small changes in weights and biases do not imply small changes in the output; therefore they are essentially impossible to train. But it is relatively easy to solve this problem: one introduces artificial neurons. So, if that was part 1 of this second part, this is part 2: artificial neurons. In particular, I am going to talk about sigmoid neurons. They are very similar to perceptrons, but the difference is that inputs and outputs are no longer binary: inputs and outputs are real numbers in the interval between 0 and 1. Now small changes in the parameters of your neuron end up in small changes in the output, so essentially everything is going to be continuous.
A network of neurons is going to represent a continuous function. So, the definition of a neuron is the same as before, but now we have a function that goes from [0,1]^n to a single output in [0,1], where these are now real numbers. We again use the same notation: the inputs are labeled x, the output is labeled a, and what changes is how a depends on the inputs. Now a is given by the sigmoid function, which I will define in a second, instead of the Heaviside step function: the sigmoid of the dot product between weights and inputs, plus the bias. The sigmoid of a number z is given by sigma(z) = 1 / (1 + e^{-z}). If you plot the sigmoid as a function of z, you get essentially a continuous, smoothed version of the step function: it goes from 0 to 1, passing through 1/2 at z = 0.

Yes, it's protesting, but I still have two minutes, so stay there for two minutes. This sigmoid function is an example of what are called activation functions, and there are many that can be used; it is not necessary to use the sigmoid. The details of the sigmoid function don't really matter so much; what is important are its properties: having an output between 0 and 1, being monotone, resembling the step function. An important point to notice is that, whatever activation you choose, it must be easy to calculate its derivatives, because we are going to need derivatives over and over.

An important observation is that sigmoid neurons can mimic perceptrons. I'm not going to prove it, but it's very easy: you can think of multiplying the weights and the bias by a positive constant, which does not change the sign of w.x + b, and if you let this positive constant go to infinity, you squeeze the sigmoid and it becomes a step function. This is important because we can therefore use the result from before: artificial neurons are in fact universal, so a network of artificial neurons is good enough to fit any possible function, of course an analytic one. That's why neural networks are, in a sense, expressive enough; this is the terminology used: they are expressive enough to mimic any possible function. And they are relevant because these parameters can be adjusted, usually well enough and in an efficient way if the problem is not too difficult, et cetera; this adjustment is the training, and that is the next thing we are going to see after the break. Questions? I know that we are two minutes into lunch.
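To close this part, a minimal sketch in Python of the sigmoid neuron and of the squeezing argument just described; the scaling constant c is my illustration:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + exp(-z)): smooth, monotone, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """Sigmoid neuron: a = sigma(w.x + b), with real-valued inputs in [0, 1]."""
    return sigmoid(np.dot(w, x) + b)

# The derivative is easy, as required of a good activation function:
# sigma'(z) = sigma(z) * (1 - sigma(z)).

# Squeezing: scaling weights and bias by a large c > 0 pushes the sigmoid
# neuron toward the step-function perceptron with the same w and b.
w, b = np.array([-2.0, -2.0]), 3.0          # the NAND perceptron parameters
for c in (1, 10, 100):
    print(c, neuron(np.array([1.0, 1.0]), c * w, c * b))  # tends to 0, as for NAND(1, 1)
```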