Broadly distributed over everybody. Now, one person got five, one person got four, and everybody else was just single votes, and those could have been for themselves. Now, that makes it kind of tough to give an award. Could I ask you to think about this, maybe go around? There's no hurry, except you've got to do it today. Could I just ask you to go back around, think about it, and then make another vote? Really make another vote? OK, thanks. We'll see if we can converge on something. I just need a little more convergence than I have. If I had 10% of the total on at least one person, I'd be happy. OK, well, anyway, think about it, go around, and then make another vote. And we'll do that after the coffee break. Actually, we'll do it after lunch, to give you a chance to do that. OK, so we're going to get going. We've got Alessandro Ferraro here. There's a slight change: we may actually go down into the Adriatico guest house to the Info Lab, because the tutorials are going to be down there, with Python. I think everybody is an expert now, thanks to Mikkel. We were going to go down at 3:30, but probably we'll go down just after lunch. Let me confirm that; I just want to make sure it's ready down there. OK, so I'll just hand it over to Alessandro. We've got a couple of guests back there: Anthony Johnson and Vangu. Hey, why don't you guys stand up and wave? Anthony doesn't want to stand up; he's been traveling too long. Anyway, they're from OSA. And you'll see more people, kind of strange people, wandering in. OK, all right, so Alessandro, let's go. So, I don't know if, yes, I think this works. Perfect. Well, good morning, everyone. Thanks for being here. This set of lectures is probably going to be a bit different from what you did last week, let's say. We will not see so much physics, or not so much photonics at least, this morning. Can you hear me well? Is it fine up there? Is it going like this? Is it better?
OK, I'll try to shout then. I was saying that we're probably not going to see much physics this morning; a bit more in the afternoon during the tutorial, which will be based on some examples taken from photonics. But yeah, you'll have to cope with that for this morning. The topic is machine learning, essentially, with some applications to quantum technologies. But I'm not a computer scientist. By training, I'm a theoretical physicist working mainly on quantum information and quantum optics, especially quantum information with continuous variables. But I'm not going to talk about that today. So before starting, just to have an idea of your background, I'd like to ask a couple of questions. Please raise your hands, just so I can gauge the pace of my lecture, and also to decide a bit better what to do in the afternoon: whether to really skip the last hour and bring the tutorial forward, or keep the program as it is. So, neural networks: how many of you have trained a neural network? OK, so I'm assuming that this is going to be a very introductory course on machine learning and neural networks. And I'm sorry for the four of you, because in that case you will probably get very little out of this. But have you trained other machine learning algorithms, anything? Principal component analysis? Another four for principal component analysis and four for neural networks. OK, so it's difficult, but I was expecting something like this, more or less. By the way, in the morning I will be the one giving these lectures. Then for the tutorials, mainly Luca and Ocente will be in charge of that. We're both based in Belfast, and you can see one of the few sunny days in Belfast in this picture. And that is all I have to say about it now. So this is more or less the outline of these lectures. I divided them into two parts. There is a bonus part that concerns quantum information technologies only.
But given my very short survey, I'm probably going to skip that third part, so I'm going to deliver only these two. You can see that in the first part I will give you an overview, essentially, of various methods in machine learning. There are three big categories into which machine learning methods are divided, and we will see a bit of all three of them, with some applications to quantum information technologies. In the second part, we will focus more on the first of the three, supervised learning; and more than that, in particular, on what an artificial neural network really is and how it can be trained. Probably this is the most important slide that I have here. It's essentially for you, if you want to learn some of these things by yourself. OK, if you're taking pictures, it's fine; otherwise, I can just put it online. There are, of course, a plethora of resources available online about machine learning in general. Given the audience, with a background in quantum optics or quantum information, I'm suggesting a book by one of the authors of Quantum Computation and Quantum Information, Michael Nielsen. He left quantum information several years ago, but some time ago he became interested in machine learning, and of course he wrote a book. This book is available for free on the web. I don't even know what to call it; it's an e-book, maybe. It's available only on the web, essentially. And it's very well done, because it also comes with some pieces of code that can be used, and on which the examples of the book are based. Then, as another possible source, there is the Aurélien Géron book, which is very basic but really hands-on. It's based on Python. Of course, we are not talking anymore about theoretical physics here; this is really computer science. And it's mainly based on Python.
So scikit-learn is essentially the workhorse library for that kind of analysis nowadays, and this book is based on it. Otherwise, at a much more general level, there is a classic book by Russell and Norvig on artificial intelligence. But there are also other types of very useful resources. In particular, there are various freely available video lectures around. The first set of lectures that I suggest is by Florian Marquardt, a theoretical quantum physicist working on quantum optics, quantum information, optomechanics, many-body theory, et cetera. He started being interested in this topic some time ago, and now there are about a dozen lectures that he gives at his own university, and they are freely available. Otherwise, these other three sets of lectures are from computer scientists and data analysts. You can find them either in some repositories, like this summer school on deep learning and reinforcement learning held a couple of years ago in Toronto, or even just on YouTube. But, as I said, there are many, many resources available online nowadays. Regarding, on the other hand, aspects closer to physics, and to quantum information specifically, there is already a set of reviews available. I've listed here the first authors of all of them. You can find them on the arXiv. They focus on different aspects, but you can gauge that essentially from their titles. OK, so I will start with the actual lecture. Are there questions at the moment about the logistics or anything? I think I will take a break at some point before 11, maybe after 45 minutes or something like that, just because two hours is too long. So I'm telling you: if at some point you want to have a break, I will also want to have a break, so don't worry. So let's start with the first of these various techniques that I was mentioning. These machine learning techniques are mostly used to perform data analysis.
Data come in various sorts of ways. We will start with what is called supervised learning, in which the data come in the form of a set of vectors, and associated to each of them there is a label. The data I denote with x, the labels with y, and I have many of them, let's say N of them. Can you read from there? OK, perfect. You can think of this as follows: for example, one of the most popular forms of data are images. So you might think of black-and-white images with various pixels. Each image x_i can be composed of various pixels; we can imagine each pixel to be a real number, and we have, let's say, p pixels in total. And then to each image you can associate a label. For example, you can have various images of cats and various images of dogs, and, given a new image, you want your algorithm to be able to distinguish whether it's a cat or a dog. That's a classic example. If y is a discrete label, say it belongs to a finite set like {cat, dog}, then we're talking about classification. If, on the other hand, it is a continuous index, we're talking about regression. Of course, the tools used are slightly different depending on what you want to do, but what we're going to say is common to both cases. So the goal here, as I said, is to find a label for new data. In fact, you want to find a function that goes from your set of data, the set of numbers that might represent the pixels in your image, to the corresponding label. You want to guess this function; let's call it f. And the point is that we want to use this function not on the data that we have been given, but on new data. So we want to evaluate my function, which will be the output, essentially, of my algorithm, on some new data, and associate to it, possibly, a correct label.
Of course, we have to do things in such a way that this new label associated to the new data is accurate. So we need a sort of metric to decide which function is going to be the right one. How do we do that? How do we identify a correct function? Well, essentially by defining a notion of distance between the output of the function that is going to fit my data and the actual label associated to it. This distance is a sort of cost function that we want to minimize. The way it works is that this function here is going to be parameterized. So we want to find the best set of parameters for my function, in a way that I'm going to explain in a second. These parameters, I'm going to denote the full collection of them with W, and I will call them, in general, weights. So this is essentially fitting: fitting an unknown function with a parameterized function. These weights, in general, are going to be real numbers of a certain dimension. Let me denote by d the distance function between the output of the function that I want to optimize and the corresponding correct label. Of course, I want this to be as small as possible, so I want to minimize this distance. For the moment, I'm not going to specify which type of distance we're going to use; we're going to see that in the second part of this talk. But typically, they are quadratic functions or entropic-type functions. And something that we're going to need here is the concept of average distance with respect to the full set of data that we are given. As I told you, the data are given in this form; the data that we assume are given are called training data. And we want this distance to be as small as possible over all the possible training data. So the actual cost function that we're going to minimize is not the cost function over a single data point, but over all of them.
So my cost function is actually going to be the average of d over all training data. And this we can call the cost function C, which depends on the set of my weights, the set of my parameters; it is simply, of course, the sum of this distance function over all the training data, divided by the number of training data. So the idea, as I was saying, is to find the best set of parameters, which means that we want to find a set of parameters W such that my cost function is minimized. Up to here, this is really just a way to formalize a fitting procedure. However, the main goal, as I was saying, is that this function works on new data. So there is another concept of cost here, beyond this cost function: we also want to minimize the errors that my function can make on a set of new data. And how is that done, typically? Well, what is typically done is that the whole data set you have is divided into two groups, or at least two groups. One is the training set, over which you actually try to minimize this cost function. And another set, which we call the test set, is used in order to estimate the possible error that the function will make when it is used on new data: what is called the generalization error. So the generalization error is the error that I will have when I use my newly found function over another set of test data, which is, again, going to be given in the same form as before, just another sample of that. For that, I have divided, therefore, my original data into what is called the training set and the test set. So let's say a couple of things about this. First of all, you can imagine that when we're talking about images, we're talking about really pretty huge data. And in fact, this set of parameters can be very large; we're talking about millions and millions of parameters.
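To make the notation concrete, here is a minimal sketch of the cost function C(W) = (1/N) Σ_i d(f_W(x_i), y_i) described above, using a quadratic distance and a toy linear model. The function and names (`f`, `quadratic_cost`) are illustrative assumptions of mine, not something from the lecture slides:

```python
# Minimal sketch of the supervised-learning cost function described above.
import numpy as np

def f(x, w):
    """A toy parameterized function f_W(x): here just a linear model."""
    return x @ w

def quadratic_cost(w, xs, ys):
    """Average quadratic distance d(f_W(x_i), y_i) over the training set."""
    preds = np.array([f(x, w) for x in xs])
    return np.mean((preds - ys) ** 2)

# Tiny example: two training points in R^2 with scalar labels.
xs = np.array([[1.0, 0.0], [0.0, 1.0]])
ys = np.array([1.0, 2.0])
w_good = np.array([1.0, 2.0])   # this choice of weights fits both points exactly
print(quadratic_cost(w_good, xs, ys))  # 0.0
```

Training then amounts to searching over w for the value that minimizes this average.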
So these are not functions that depend on just a few parameters, but really millions of them, at times even billions of parameters. But these things, in fact, work pretty well, as you probably know from general knowledge. And this is a quite famous example of how well these methods can work. I'm not going to say anything specific now about how these have been trained. But the idea is that, for example, images are given with their own labels. The labels here, for example, are a mite, a container ship, a motor scooter, a leopard. This example is taken from a training set with 1,000 labels, I think, something like that, for this type of images. And the outcome given here is, for the first four images, the ones that have been classified best; the last four are the ones that were classified worst. These histograms are proportional to the probability given by the algorithm to each label. You can see that in these cases, for example, the probability of associating the correct label is very high, but also the wrong labels are not so wrong. And here, on the other hand, we have wrongly classified images. But even if they are wrongly classified, you see that, let's say, the maximum probability is given to a convertible and not a grille; and I honestly don't even know the difference between a convertible and a grille, but the second guess is the grille. So it pretty much works like we were expecting. And, for example, in these other images, the algorithm was finding a dog, whereas, in principle, the label was a cherry; but this is very arbitrary. So this type of algorithm can work extremely well. And when we say that we train these algorithms, we mean that we try to find the set of parameters that performs this minimization.
And then we can evaluate the performance using various figures of merit; essentially it depends on whether we are using our algorithm to do classification, like in this case, or to do regression. When we do classification, the accuracy over the test set is the figure of merit that we use to quantify how well our training, our fine-tuning of the parameters, has worked. So, if my labels are discrete, I use the accuracy. The accuracy is the fraction of correct labels over the test set. You divide your original data into two sets. You use the training set to train your algorithm, in particular your neural network. Once that is trained, you test it on the test set of data, and the fraction of correctly labeled data is what defines the accuracy. Extremely high levels of accuracy can be reached on this type of task, and we will see some specific examples later on. Otherwise, in the case of a continuous set, we usually use a mean squared error, again over the test set, to evaluate the quality of the training and the quality of the algorithm. I'm now going to give some examples, not related to classical cases, but to quantum ones. But are there questions? Of course, interrupt me whenever you want. How? Sorry. Which image are you referring to? The last four. They're wrong because, you can see, let's say, for the case of the cherry and the dalmatian. The grille one is wrong because this part of the car is called a grille; it's not a convertible. Whatever the difference is between a grille and a convertible, I have no idea. But here you see that I'm saying it's wrong because the probability the algorithm gives to the convertible label is higher than the probability the algorithm gives to the grille label. So this is what I mean by wrong: the algorithm associated an incorrect label.
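The two test-set figures of merit just described can be sketched in a few lines; the function names here are my own, chosen just for illustration:

```python
# Accuracy for classification and mean squared error for regression,
# both evaluated over a held-out test set.
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of correctly labeled test points (discrete labels)."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def mse(y_true, y_pred):
    """Mean squared error on the test set (continuous labels)."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

print(accuracy(["cat", "dog", "cat"], ["cat", "dog", "dog"]))  # 2 out of 3 correct
print(mse([1.0, 2.0], [1.5, 2.0]))                             # 0.125
```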
But now I'm not sure that I'm getting your question; I cannot hear very well. OK, and now I understand what the grille is, good. Of course. Yes, yes, yes, of course. But the point is that you have to imagine that a human first classified these images and associated a label to each of them. And, of course, the label associated to each of these images is subjective. So, as in the case of the dalmatian and the cherry, it's subjective whether to associate the label cherry or dalmatian to that image. But the human who originally labeled the data thought: OK, this is a picture of cherries. The algorithm was trained with all these pictures, some of them probably with this ambiguity in the label. The algorithm is totally oblivious to this ambiguity: it learns what it learns, and then that is the output. Now, the algorithm is trained on the part of the original data that is called the training data. Then it is tested on the part of the original data that is the test set. If in the test set there happens to be that image of the dalmatian, or of the convertible, with a subjective label that we might not all agree on, well, the label has already been given from the very beginning. So when the label given by the algorithm does not correspond to the label given originally by the human who classified them, it's flagged as an error. That's the point. We're not saying that the algorithm that associates cherry to that image is wrong for some higher reason. No, it's wrong because it associates the label cherry, whereas in my original set of data some human being associated to that image the label dalmatian. Wrong in the sense that it doesn't match the label given originally by some human being. It's very hard to hear. That's a very complex question.
There are various ways in which you can use the fact that you have a small amount of data and still obtain something meaningful. I'm not going to touch on them today; they are quite advanced. But yes, there are. It's much more difficult, though, and we're going to see an example of why that is difficult. For example, if you have very few data points, you may end up with a problem called overfitting, in the sense that the function you're going to train, with the set of parameters you're going to find, is going to be perfectly able to fit your very few data points. But then it's a disaster when you try to generalize, exactly because you are perfectly fitting the very few data points that you have. So there are various problems. But yeah, it can be done with scarce data. There is a type of method called reinforcement learning that actually works very well in a setting that is slightly different from this, in which you have scarce rewards; it's a sort of having scarce data. So reinforcement learning might be more useful in that case. So maybe I should go on with some examples, some examples in quantum technologies. I will start with something extremely easy, and then I will give you an example related to some actual research. The simplest unit that you can think of in quantum information theory is a single qubit. So your original set of data can be a set of density matrices ρ_i, each corresponding to one qubit. And the corresponding label can be, for example, whether this qubit is in a pure state or in a mixed state. So it's a classification case. And it's very easy, of course, to solve this problem once you directly have the density matrix of your qubit: you just calculate the trace of ρ². If it's equal to 1, it's pure; otherwise, it's mixed. So it's clearly a silly example, but it's an example that you can immediately use to train your algorithms.
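As a concrete illustration of this toy task, here is a sketch (the sampling scheme and function names are my own assumptions) that produces random single-qubit density matrices and assigns the pure/mixed label via Tr(ρ²):

```python
# Generating labeled training data for the pure-vs-mixed toy classification task.
import numpy as np

def random_qubit_state(rng, pure=False):
    """Sample a random single-qubit density matrix."""
    if pure:
        psi = rng.normal(size=2) + 1j * rng.normal(size=2)
        psi /= np.linalg.norm(psi)
        return np.outer(psi, psi.conj())          # rank-1 projector: pure
    g = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    rho = g @ g.conj().T                          # positive semidefinite
    return rho / np.trace(rho).real               # unit trace

def purity_label(rho, tol=1e-9):
    """Label 'pure' if Tr(rho^2) = 1 (up to tolerance), else 'mixed'."""
    purity = np.trace(rho @ rho).real
    return "pure" if purity > 1 - tol else "mixed"

rng = np.random.default_rng(0)
print(purity_label(random_qubit_state(rng, pure=True)))   # pure
print(purity_label(np.eye(2) / 2))                        # mixed (maximally mixed)
```

Pairs (ρ_i, label_i) generated this way would then play the role of the training set.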
In addition, it gives me the chance to present you with an example in which even how to feed the data you have to the algorithm can be a non-trivial question to answer. Because we can think of a density matrix of one qubit as four matrix elements, each complex. So if I'm totally blind, really blind, I can give eight real parameters to my algorithm. Otherwise, if I think a bit more, I can give just the two real parameters on the diagonal and the two real parameters for one of the off-diagonal elements, because the other is its complex conjugate. Or I can bring normalization into the game, so it's actually only three real parameters. But even in that case, the option might be: OK, am I going to give the initial information to my algorithm as, let's say, the real parameter associated to the element in the first row and first column, then the one in the second row and second column, and then the real and imaginary parts of the off-diagonal element, something like this? So it's a four-element vector with an associated label; I'm essentially vectorizing the matrix. Or maybe we can think that it might be more useful to present the original data directly to the algorithm in the form of the Bloch vector. So instead of giving a vectorized form of my density matrix, I can give the Bloch vector. And in that case, also, of course, the criterion is immediate: if the length of the Bloch vector is equal to 1, then it's pure; otherwise, it's mixed. So again, it's a very easy example, but it already gives an idea of an immediate problem that you might face if you want to apply one of these algorithms to a real problem in quantum information. The data are not given immediately in the form of, for example, numbers associated to the pixels of an image. And in some cases, this can make a bit of a difference. So this is an example of what can happen by changing the size of your training data.
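To make the two encodings concrete, here is a minimal sketch (function names are my own) of the "blind" vectorized encoding versus the Bloch-vector encoding, with the Bloch components computed as r_k = Tr(ρ σ_k):

```python
# Two ways of presenting the same single-qubit density matrix to an algorithm.
import numpy as np

PAULIS = [np.array([[0, 1], [1, 0]]),      # sigma_x
          np.array([[0, -1j], [1j, 0]]),   # sigma_y
          np.array([[1, 0], [0, -1]])]     # sigma_z

def vectorized_features(rho):
    """Blind encoding: 8 real numbers (real and imaginary parts of all entries)."""
    return np.concatenate([rho.real.ravel(), rho.imag.ravel()])

def bloch_features(rho):
    """Physics-informed encoding: the 3 Bloch components r_k = Tr(rho sigma_k)."""
    return np.array([np.trace(rho @ s).real for s in PAULIS])

rho_plus = 0.5 * np.array([[1, 1], [1, 1]])  # |+><+|, Bloch vector (1, 0, 0)
print(vectorized_features(rho_plus))         # 8 numbers
print(bloch_features(rho_plus))              # [1. 0. 0.], length 1 -> pure
```

The purity criterion is then just whether the Euclidean norm of `bloch_features(rho)` equals 1.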
The two colors correspond to whether I gave my algorithm the initial data in the form of Bloch vectors or in the form of a vectorized density matrix. You see that if the training set is composed of very few data points, and this, in a sense, is an example of what you were asking before, it can be very problematic for the algorithm to determine with a high percentage of accuracy whether the state is pure or mixed when the data are given in the form of a vectorized density matrix. If they are given in the form of a Bloch vector, it's much easier for the algorithm. Of course, we are talking about very small sets of data here, because at some point, when you have enough data, the accuracy of the algorithm is essentially oblivious to whether you are giving it the data in an easily digestible form or not. It's complex enough that it can work out the solution on its own. And something that is very interesting is that when the algorithm misclassifies your data, so it associates the label pure when the density matrix was actually mixed, or vice versa, well, if you calculate the purity of all these states (the purity is, of course, the trace of ρ²) and you look at where the incorrectly classified states lie on this axis, you see that the misclassified data lie close to purity one. This is despite the fact that we are not telling our algorithm anything about the geometry of a single qubit or anything like that. The algorithm immediately finds, essentially, that the hard samples are the ones at the border between the mixed and pure states. A second, more exciting example, more exciting because we are now dealing with two qubits, is again a classification: for two qubits, instead of asking the algorithm to recognize whether the state is pure or mixed, we ask it to recognize whether it's entangled or separable. So these are now my two labels. And you can have a bit of fun with that.
Now, for two qubits the problem is again solvable, but for a generic two-qubit state it's not as trivial. I'm going to present here the results for the case of a Werner state, given by this expression here. It's a mixture of the maximally mixed state and one of the Bell states; this ψ⁻ is the state (|01⟩ − |10⟩)/√2. And this is a class that has only one parameter, this α. But then we can imagine rotating my Werner state. We can rotate it with a unitary that acts only on one qubit, and this will add, to a Werner state that has only one parameter, another four parameters; so this is going to be five parameters. Or, on the other hand, we can rotate both qubits, adding yet another four parameters, so we end up with nine parameters here. So for the algorithm it will be more and more difficult, in the sense that it will have to work with a larger family, in principle. But you can see, and these are the red dots, that, again, if you train on a sufficiently large set of states, for the algorithm it's almost the same. And we did something similar with the X states. X states are a mixture, there is a typo here, this should be a ψ⁺, of all four Bell states, where p1, p2, p3, p4 are probabilities. The degrees of freedom before were 1, 5, and 9. Now, originally, they are 3, because of the three parameters p1, p2, and p3, let's say; p4 is just the complementary probability. Or it can get larger and larger. And here is the last set, for example, in which the states are chosen randomly. And you see that, apart from this last set, it works even though this is not a very well-trained algorithm: I'm not using millions of data points or a particularly complex algorithm. By the way, these are neural networks; we're going to see later on what that means.
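The lecture doesn't spell out how the entangled/separable training labels are produced, so as an illustrative assumption of mine, here is a sketch that labels Werner states using the PPT (Peres-Horodecki) criterion, which is exact for two qubits; the Werner state is entangled exactly when α > 1/3:

```python
# Labeling Werner states as entangled or separable via the PPT criterion.
import numpy as np

def werner_state(alpha):
    """rho = alpha |psi-><psi-| + (1 - alpha) I/4, with basis order |00>,|01>,|10>,|11>."""
    psi_minus = np.array([0.0, 1.0, -1.0, 0.0]) / np.sqrt(2)
    return alpha * np.outer(psi_minus, psi_minus) + (1 - alpha) * np.eye(4) / 4

def is_entangled(rho):
    """PPT criterion (exact for two qubits): partial-transpose the second
    qubit and check for a negative eigenvalue."""
    rho_pt = rho.reshape(2, 2, 2, 2).transpose(0, 3, 2, 1).reshape(4, 4)
    return np.min(np.linalg.eigvalsh(rho_pt)) < -1e-12

print(is_entangled(werner_state(0.9)))  # True  (alpha > 1/3)
print(is_entangled(werner_state(0.2)))  # False (alpha <= 1/3)
```

Sampling many α values (or randomly rotated states) and recording these labels would give exactly the kind of training set described above.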
So these are, let's say, easy examples of how these algorithms can be used to solve problems in quantum information theory. As I said, they are silly examples, because we know how to solve these problems essentially with pen and paper. But it's just to give you an idea. Questions? OK. The last example for supervised learning is, on the other hand, related to a research problem. It is about quantum error correction, and it should give you a bit more of an idea of how powerful these tools can be. The problem studied here is the typical problem that we have to face when we want to correct errors arising from a noisy environment. Let's say that, in principle, we have one qubit that we want to encode at the logical level, but we encode it with multiple qubits in order to protect it against losses in a sort of redundant way, just as you do in classical information theory. So we have a larger number of physical systems; this is my encoding procedure. The encoding used in that paper is known as the toric code. I'm not going to explain it, but it's just to give you an idea. Once you have encoded your single qubit in a larger set of physical qubits, those are going to undergo a certain noise process that is outside your control. Therefore, what you want to do is recover the original information, in particular the information contained in your original logical qubit. And you do that with a series of interventions. First, you measure some of your qubits; this measurement is called the syndrome measurement. The syndrome is telling you something about your errors. Concretely, the syndrome measurement might be: measure the spin in a certain direction for your first physical qubit, then in another direction for your third physical qubit, et cetera. So you will have measurement outcomes from your experiments.
This we can call s, with a subscript i to say simply that, in principle, we're going to have many of them. These are going to be my set of data. Then the idea is that once you have measured your syndrome, you can associate to it a corresponding error that has occurred, or at least occurred with a certain probability in this noisy process. And once you associate the error to the syndrome, you can correct for this error. This is, let's say, the decoding, or error correction. So here you want to find a mapping that goes from the syndromes to the errors. And this is a very similar problem to the general framework that we have been talking about for supervised learning. You might think that you have a set of data that gives you your syndromes and the associated errors, the associated labels. Then you train your algorithm in order to find the parameters that give you the best function, the one that associates to possible new syndromes the corresponding errors. And you want to do that with the best accuracy, so that at the end of the day you can tolerate as much noise as possible. In this paper, they use a feed-forward neural network, and you can see that the input is in the form of syndromes; these are represented by these crosses and squares. Here we're seeing a nomenclature that I'm going to introduce later on, but just to give you an idea: this is a way of representing neural networks in which you have certain nodes here that are your inputs, then a set of fully connected layers of other nodes, given here; these are called hidden neurons. And then the output: the outputs are going to be the errors that the algorithm suggests to you, based on the syndrome at the input. But we will see this in much more detail later. And the power of this method can be understood from this plot.
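The syndrome-to-error mapping can be sketched on a much smaller code than the toric code of the paper. Here, as an illustrative stand-in of my own, is a tiny feed-forward network (one hidden layer, trained by plain gradient descent) that learns the decoder for the 3-qubit bit-flip repetition code, whose two syndrome bits come from the stabilizers Z1Z2 and Z2Z3:

```python
# Toy feed-forward decoder: syndromes (s1, s2) -> error class
# (0 = no error, 1/2/3 = bit flip on qubit 1/2/3). This is a stand-in
# for the paper's toric-code network, not a reproduction of it.
import numpy as np

X = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)  # syndromes
y = np.array([0, 1, 2, 3])                                   # error labels
Y = np.eye(4)[y]                                             # one-hot targets

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(2, 8))   # input -> hidden weights
b1 = np.zeros(8)
W2 = rng.normal(scale=0.5, size=(8, 4))   # hidden -> output weights
b2 = np.zeros(4)

def forward(inp):
    h = np.tanh(inp @ W1 + b1)            # hidden layer
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    return h, p / p.sum(axis=1, keepdims=True)  # softmax probabilities

lr = 0.5
for _ in range(3000):                     # full-batch gradient descent
    h, p = forward(X)
    grad_logits = (p - Y) / len(X)        # cross-entropy gradient
    grad_h = (grad_logits @ W2.T) * (1 - h ** 2)
    W2 -= lr * (h.T @ grad_logits)
    b2 -= lr * grad_logits.sum(axis=0)
    W1 -= lr * (X.T @ grad_h)
    b1 -= lr * grad_h.sum(axis=0)

print(forward(X)[1].argmax(axis=1))       # after training, ideally [0 1 2 3]
```

On this toy code the mapping is a simple lookup table, so the network just has to memorize it; the point of the paper is that the same supervised recipe scales to codes where no simple lookup exists.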
Because here you can see that the fraction of corrected errors when they use these neural networks is larger than the fraction of errors that can be corrected using the best known algorithm for this type of code, which is the minimum-weight perfect matching algorithm. This translates into the fact that you can tolerate a larger amount of noise by using the error-syndrome association found by the neural network, with respect to the amount of noise that you can tolerate by using known algorithms. Questions? If not, maybe we can take a five-minute break and start at 10 past 10. But I've been told that you shouldn't move around too much; we will have a proper break at 11.