We can begin. Hi, good morning everyone. I'm Eliška Greplová, a professor at the Kavli Institute of Nanoscience in Delft in the Netherlands, where we have the Quantum Matter and AI lab. These are some pictures from our group life, and my website if you want to check out more about our research. So, as you heard at the beginning, we discussed with the organizers having a somewhat more introductory lecture to kick us off, even though most of you are basically experts already. So I will try to go through things in a way that stays interesting: I will not only talk about how machine learning works in quantum physics, but also concentrate on the question of why, and then connect it to the rest of the school with some experimental examples that we will do tomorrow afternoon, after you have had your first introductory lecture on cold atoms. So let's stop for a moment with this "why machine learning", because it is something we should really stop and think about. It's easy to say: oh, this is a problem I worked on during the first year of my PhD, but I can also do it with a neural network, so it's probably going to be a PRA. And that's true, but it's useful to think about why one should do something like this. Because in physics we often understand nature by designing minimal toy models that we can understand very well, analyzing them, and then deriving some underlying understanding from that. And machine learning, as you have worked with it or read about it, is very different: you have these very data-driven approaches that extract features from data. So if you think about it on a very fundamental level, these two things may seem a little contradictory. It's therefore useful to stop and ask whether we even have big data in physics, or in quantum physics. Can someone give me some examples; do we? Yes, that's a great one: when we measure quantum hardware we get one projective measurement, and we need to take many, many of those if we want to calculate anything. That's a great example and I will come back to it later. Do we have more examples? Is the only big data set we have in quantum physics shots on quantum hardware? That's another good one. Can someone else also make a suggestion, so we wake up in the morning? That's a great one. Anything that is not related to experimental measurement? I cut that off very meaningfully; please, go ahead. I actually think it is, and it's a great segue into the next topic, because the wave functions themselves are super huge objects, so any numerical methods we develop to approximate them will also yield some large-scale data sets. Okay, let's take these two examples and see if we can come up with more during the talk, but thank you for this DFT connection, because it is now a good segue for me to talk about why quantum physics is hard. Do we have someone here who does not consider themselves a quantum physicist?
Great, okay, so this is going to be smooth. In some quantum mechanics 101 someone tells you this: if I have two quantum particles I need four numbers, if I have ten I need about a thousand, and so on. The number of amplitudes of a quantum wave function scales exponentially with the system size, and that's the whole crux of the problem, or, depending on how you see it, a great potential benefit of quantum computing. I prepared this part very carefully in case we had some non-quantum people here, but since we don't, I will go over it very fast. The way you derive this exponential scaling is actually not super deep. If I have two classical states, whether I put this here or here, it's either one or the other. But in a quantum system I have a superposition, where I can be in an arbitrary normalized combination of those two states. If I then add a second system and repeat the same exercise, and start writing down the numbers I need to describe the joint state: if I have the two states separately, one number goes here and the other number goes there, but if I have both of them in a superposition I need to start multiplying things. So one superposition is a plus b, the other is c plus d, and once I multiply everything out I get four numbers like this, for three systems I get eight numbers, and for n systems I get two to the n. So this is just a very simple consequence of writing out superpositions, and it's really important to remember that it has nothing to do with the speedup; we are just writing down the dimension of these complex vectors that we jointly decided, at the beginning of the past century, to describe quantum reality with. So we come back to this scaling, and then somebody tells you, or you can try on your own computer, that once your system sizes get to, depending on how CPUs and GPUs are developing, somewhere around 20 nowadays, you cannot calculate things exactly anymore, simply because you run out of memory on any classical computer because of this exponential scaling. So there is a thing to realize here: the wave function itself is a huge, badly scaling object. Even if we want to write down one wave function of, say, 20 particles, it's already a disaster in terms of the complex numbers I need to store. This is very interesting, and you will come back to it because Philip will give you a whole lecture series about how this works. And then there is the other thing that you were also mentioning when I asked you about the big data sets, and that is the experimental connection. Of course I cannot go to my experiment, measure the complex amplitudes, and celebrate; we can only measure real numbers. So what I measure experimentally are expectation values, projections, things like that, and then someone needs to do the hard work of connecting those experimental measurements to the underlying theory models that live in these complex Hilbert spaces. So we can complain all we want about how these abstract complex vectors scale badly with system size; the experimentalists have much bigger problems, because they don't even get this exponential number of complex amplitudes, they get millions of measurements that they need to map onto this exponential number of complex amplitudes. That's also something we are going to discuss later on.
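As a quick back-of-the-envelope sketch (my own illustration, not something from the lecture slides), you can make this scaling concrete by estimating the memory needed to store a dense wave function of n spins, assuming 16 bytes per complex amplitude:

```python
def state_vector_memory_gb(n_spins, bytes_per_amplitude=16):
    """Memory needed to store all 2**n_spins complex amplitudes of a
    dense wave function, assuming complex128 (16 bytes per amplitude)."""
    return (2 ** n_spins) * bytes_per_amplitude / 1e9

for n in (10, 20, 30, 40):
    print(f"{n} spins: {state_vector_memory_gb(n):.3g} GB")
# 10 spins: ~1.6e-05 GB, 20 spins: ~0.017 GB, 30 spins: ~17 GB, 40 spins: ~1.8e+04 GB
# i.e. you run out of laptop memory somewhere around 30 spins.
```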
Now there is the question of what artificial intelligence has to do with all of this. I laid it on really thick, so this is not going to be any surprise. Even though the way you learn physics in your bachelor and sometimes master program gives you all these pre-configured examples that you can solve exactly, so that these scaling problems do not come up often, in practice, when you do science, they do: when we want to push the boundaries of knowledge we immediately run out of things we can do analytically, and then numerical methods come in. Then you want to analyze experimental data sets, which would go under the umbrella of discovering new physics from data; you want to process the huge measurement sets you get by measuring quantum computers; you also want to automate the control of contemporary quantum devices, because they are getting really big, and I am going to talk about that a little more tomorrow. But you can also do some very fancy things, like discovering and optimizing quantum algorithms, or designing new quantum materials, things like that. Since we have a lot of people in the audience who said they have already worked with machine learning, maybe we can have three or four examples where someone says, in one sentence the rest of us can understand, what they do with machine learning. You have to listen to me for three hours; you gain nothing by being quiet right now. Okay, let's go. Say again? Yeah, that's a great one, and thank you for that, because we will talk about phase transitions first. Do we have another example? Also an amazing one; we will cover that tomorrow, but apparently you already know it. Another example? Great, thank you Tom; that's my own group member, everyone. One more? Oh, that's a very good one; that can also be seen as a Hamiltonian learning problem. Okay, let's leave it there, great.

Now we switch gears. Forget all about this; we have now motivated to ourselves that maybe we should somehow do machine learning in quantum, but let's first do something physics-y, to really get into the cracks of the why: why do we need machine learning? Hands up, who remembers the Ising model from their statistical physics class? That's almost everyone, good. So presumably this is a Hamiltonian you have seen before. It just tells you that on the two-dimensional lattice I showed you on the previous slide, the nearest-neighbor spins sigma_i and sigma_j can be either plus one or minus one, and it calculates the energy of this nearest-neighbor interaction with some parameter J. If you use this Hamiltonian you will realize that if the product of sigma_i and sigma_j is plus one, meaning both of them are up or both of them are down, the energy contribution will be minus J, and if they are not aligned it will be plus J. So if you want to go to the lower-energy state you want to have them aligned. The phase transition you learned about in your statistical physics course has to do with adding temperature to the system. This is the so-called Boltzmann distribution, and it just tells you the probability of each configuration at a given temperature. Basically, what happens is that when you start adding temperature, the difference between this and this becomes more and more subtle, and eventually it stops mattering at all, and that's when you get this thermal phase transition in the Ising model.
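To make the Hamiltonian and the Boltzmann weight concrete, here is a minimal sketch (my own illustration, assuming periodic boundaries and k_B = 1, not code from the lecture) of how you could compute the energy of a 2D Ising configuration:

```python
import numpy as np

def ising_energy(spins, J=1.0):
    """Energy of a 2D Ising configuration: H = -J * sum over nearest-neighbor
    pairs of sigma_i * sigma_j. `spins` is a 2D array of +1/-1 values;
    periodic boundaries are assumed, each bond counted once."""
    right = np.roll(spins, -1, axis=1)   # right neighbor of every site
    down = np.roll(spins, -1, axis=0)    # bottom neighbor of every site
    return -J * np.sum(spins * right + spins * down)

def boltzmann_weight(spins, T, J=1.0):
    """Unnormalized Boltzmann weight exp(-E / T), with k_B = 1."""
    return np.exp(-ising_energy(spins, J) / T)

aligned = np.ones((4, 4), dtype=int)                       # all spins up
checker = (-1) ** np.indices((4, 4)).sum(axis=0)           # fully misaligned
print(ising_energy(aligned), ising_energy(checker))        # -32 vs +32
```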
Let's come back to the temperature in a moment and go back to the Hamiltonian I wrote. Let's do the Ising test: I will ask you which of these three configurations has the lowest energy. Shouldn't it matter whether the boundaries are open? Good question. No, really, there are no trick questions; if you have heard me talk about this before, I always switch the configurations before every lecture. Okay, do we have suggestions? Yes, great, and why do you think it's B? So now we saw not only the correct answer but also a smart way to arrive at it, because I wrote the Hamiltonian right below, so you could just do what I said: use the Hamiltonian, write down all the plus and minus ones for all the pairs in that configuration, and calculate which one has the lowest energy. But that's kind of annoying to do and would take you five minutes. Then there is this other approach, where you say: wait, it's just the misalignment that matters, so I need to find the picture where the boundary is smallest. For example, C looks very blue, but the stripe that flips everything in the middle means the misaligned pairs are actually many, so the one with the least misalignment, if you look at it, is B. Everyone is with me? Okay, great. But now we actually did something funny, because you didn't think about the Hamiltonian when you explained this to me; you thought about a pattern in the data. So we already did some kind of data inference, even on this minimal example. And if we go back to the temperature transition: at different temperatures you can draw these up-down samples, and at some point you calculate the phase transition; maybe someone taught you the Onsager solution or some renormalization-group approach, and you get a formula for the critical temperature that you can derive analytically. In 2016 Lei Wang did something very simple but super brilliant, and this is one of my favorite machine-learning-in-physics papers of all time: he took these Ising configurations that I just showed you and used a super simple clustering algorithm. When you do that, and you plot the configurations on two axes, where those axes are the two components of your clustering algorithm, you get a plot where, if you color the points by temperature, you get one big cluster in the middle and two blue clusters on the sides. Why two, actually? Somebody tell me that. Yeah, that's correct, but can we say it in normal human words? Exactly: there is a C2 (spin-flip) symmetry, which means there are two ground states, either everything is aligned like this or everything is aligned like that, so we get two clusters. I don't want to go too deep here; this is called principal component analysis, I will just describe it at a very high level, and we can come back to it tomorrow if there is time, because I don't want to spend the whole lecture talking about singular value decomposition. But basically, if you have some data points in some dimensions, the algorithm finds the directions along which the variance of the data is largest, equivalently by minimizing the squared distance between the data and its projection, and that defines your new axes; you then project the data onto those axes. On the example I drew here it's meaningless, but for the Ising model you get a plot like this.
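A minimal sketch of this kind of analysis (my own illustration, not the code from the paper; the configurations below are a crude toy stand-in for real Monte Carlo samples of the Ising model) could look like this with scikit-learn's PCA:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy stand-in data: well below the transition the snapshots are mostly aligned
# (with a random overall sign), well above it the spins are essentially random.
rng = np.random.default_rng(0)
L, n_per_temp = 20, 100
temperatures = np.repeat(np.linspace(1.0, 3.5, 26), n_per_temp)
configs = []
for T in temperatures:
    sign = rng.choice([-1, 1])
    p_flip = 0.5 if T > 2.27 else 0.15 * T / 2.27   # crude caricature of disorder
    spins = sign * np.where(rng.random((L, L)) < p_flip, -1, 1)
    configs.append(spins.ravel())
configs = np.array(configs)

# Flattened snapshots as rows of one big matrix, projected onto the two leading
# principal components; coloring the points by `temperatures` reproduces the
# two low-temperature clusters and the one high-temperature blob.
projected = PCA(n_components=2).fit_transform(configs)
print(projected.shape)   # (2600, 2)
```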
Basically, what we do is take all of the square configurations, flatten each one into a column, stack them into one big matrix, and then project that matrix onto the two leading directions found by this least-squares procedure. Now we are at a funny point, because to do Lei Wang's thing I didn't have to do anything; I just needed to know that principal component analysis exists, and you can call it in Python without even knowing how it works. Actually, who here knows principal component analysis? Ah, you are such a great audience; maybe we should just skip this introductory lecture and go straight to the advanced stuff. Nonetheless, even if you know it, it's even better, because then this plot should have its full value for you. To calculate a phase transition in your statistical physics class you spent two or three lectures, which is a lot of writing, and you need to understand partition sums; it's not so easy. And here I can just draw some samples, run the simplest clustering algorithm we know exists, and celebrate, because if I plug the coupling constant equal to 1 into the critical temperature formula I get about 2.27, and if you look at the color of the transitional points in this plot and map them onto the temperature color scale, you will see it sits right in the middle between 2 and 2.5. So just from looking at the color alone I would guess something like 2.25 as the transition temperature between those two clusters, and that's a pretty good guess for someone who has never heard what a Hamiltonian or a phase transition is. That should be very cool: we just don't go to physics class anymore and just cluster everything, right? Can someone think of a problem with this, or are we now done with physics? Yeah, that's a great point: this critical temperature, and where it comes from, is somehow related to the physics we want to study, so actually having the formula would be good, and we are not getting it like this. Anything else? That's another thing: I cluster this, and if I have some weird clustering method that produces a lot of artifacts, I can write a Nature paper with my artifact clusters and then go, oops, that was just a t-SNE artifact and not an actual phase transition. More? Thank you for that segue. What Felix is saying is that I can try to cheat you here with the simplest Ising model we know of, but let's try a harder one. I am not doing anything fancy; I am just going to add a product over a plaquette, so instead of looking at two nearest neighbors being aligned, we look at the product of the spins around a plaquette, and the same argument applies: if the number of minus-one spins on the plaquette is even, it's fine, and by fine I mean lower energy; if it's odd, it's higher energy. So let's repeat the same experiment as before: which one now has the lower energy? That's correct, thank you. And how did you find out? Great, so you literally had to check plaquette by plaquette now. This is becoming harder, because before we could just look at the boundary and probably guess fine, but if there is a single plaquette in this picture that is wrong, you don't know. If I gave you 20 plots at maybe a 40 by 40 lattice size, we would be sitting here for a long time before anyone could tell.
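For concreteness, here is a hedged sketch of such a plaquette energy (an illustration only: it assumes spins sitting on the lattice sites and periodic boundaries, whereas in the gauge-model setting of the lecture the spins may well live on the links instead):

```python
import numpy as np

def plaquette_energy(spins, J=1.0):
    """H = -J * sum over plaquettes of the product of the four spins around
    each plaquette. An even number of -1 spins on a plaquette gives product
    +1 (lower energy); an odd number gives -1 (higher energy)."""
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    diag = np.roll(right, -1, axis=0)
    return -J * np.sum(spins * right * down * diag)

config = np.ones((4, 4), dtype=int)
print(plaquette_energy(config))   # -16: all 16 plaquette products are +1
config[1, 1] = -1                 # one flipped spin violates its 4 plaquettes
print(plaquette_energy(config))   # -8
```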
This is exactly the type of spin model where, if I do this kind of principal component analysis, I am going to be disappointed, because you cannot get anything better than this. And this is where the interpretability that some of you raised before comes back, because if you are a physicist, somebody will teach you that you should map this onto Wilson loops: if they are broken it's the high-temperature phase, if they are not broken it's the low-temperature phase. Just take me at my word at this point; that's how it works. You get the loops, if you look at the bigger picture, by drawing lines between all the orange points and connecting them; then you get the lines you see in the bottom picture. And if I get this kind of picture it becomes very simple to look at: even if I didn't give you the stars marking the breakages, you have this kind of dual image where you look once and see, okay, there is one broken loop, so the gauge constraint was violated. So this is still somewhat of an open question: for now it's not as simple as just taking different configurations and different clusterings and seeing what we can learn from them, and the question, of course, is what to do next with that.

I am going to take a short break from this argument and do a detour: a machine learning primer. Many people here have already seen this, and many people haven't, so I still want to go through it for those of you who are not familiar; those of you who are, I will yell at you in 20 minutes when you can stop looking at your phone. So, neural networks. With the clustering algorithms we just saw, there is a formula: I build a matrix in a certain way, I calculate a singular value decomposition, I compute some projection, I apply it to my data, done. For a lot of problems you don't have this. So there is this idea of using some kind of very efficient function approximator, and people equated it with neurons and the brain and so on, but this terminology is a bit arbitrary, even though it was helpful at the beginning of the field. In the end, neural networks are just parametrized ansätze that can approximate a lot of functions, where the approximation can be done effectively through algorithmic means. A neural network looks like this: you have some input layer, those are your data; the input layer is connected, through connections that are really just a matrix and a vector, the weights and biases, to the next layer, at which you apply some nonlinear function that I will tell you about in a second; and from that layer you go on, and so on and so forth. In the end you get some output, which is a function of your second layer applied to a function of the first layer applied to a function of the data, and all of this depends on these weights and biases, the W's and b's, which are the so-called trainable parameters. This is indeed how the neural analogy started, because you can think of the inputs as something like synapses coming into a cell, whose output then goes to the outside. This picture I took from Fei-Fei Li's Stanford machine learning for computer vision lectures, which are, by the way, amazing if you want a more in-depth introduction to this.
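To write out what I just described in words (my notation, not necessarily the one on the slide), the output of a network with, say, two hidden layers is just a nested composition of affine maps and nonlinearities:

```latex
% Network output as a composition of layers; W_l and b_l are the trainable parameters.
\[
\mathbf{y}
  = f_3\!\Big( W_3\, f_2\big( W_2\, f_1( W_1 \mathbf{x} + \mathbf{b}_1 ) + \mathbf{b}_2 \big) + \mathbf{b}_3 \Big),
\qquad
\theta = \{ W_\ell, \mathbf{b}_\ell \}.
\]
```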
Since we want to stay with physics, I'm going to be fairly superficial today, but I will keep referring you back to these in-depth sources. Okay, so we come in with the weights from the previous layer, and then the only thing the neuron does is calculate a linear combination, something you learned in your linear algebra course: a linear combination with the weights, plus a bias b to make it a more general affine function, and then you apply the so-called activation function, which gives you the nonlinearity. Because if you only did a linear transformation in every neuron, then, actually, what would happen? Can I just get away with linear transformations, or do I need these nonlinear functions for anything? Yes, exactly. If I only had linear transformations, I would in a way not be doing anything better than the principal component analysis we just saw; everything would stay linear. Okay, I could express more complicated functions because I have more variables, but everything would still be linear, and if you want to express arbitrary functions you need nonlinear ones. Luckily, for these f1, f2 and fn functions there are standard choices, already pre-implemented in all the libraries we use for machine learning; they are called ReLU, sigmoid, or hyperbolic tangent, and those are typical examples of the nonlinear functions you can apply. And that's really the whole setup: you have an input layer connecting through weights to the following layers, all the weights and biases can be represented as matrices and vectors on which we just do linear algebra transformations, plus we have these nonlinear functions acting. This is again another example from Fei-Fei Li's course: what happens when you are training the network (I'll tell you in a second how you actually train it) is that the weights you are training should take on some structure. If you have a model that looks like the picture on the left, you basically just have random weights; such models can be interesting to study in themselves, but for a simple neural network application they are probably not good. What happens during training is that your weight matrices start picking up very specific features that you can see if you plot them, and that is useful if you want to interpret what your network is learning. Now, how do we train the network? We need something called a loss function, which measures the difference between what your network outputs and what you wanted it to predict; for now we will call it a function L, and I will give you a concrete example in just two minutes. What we need to do is minimize this function: similarly to how you would minimize the energy of a Hamiltonian, you minimize some cost function or loss function that describes your learning goals, meaning the function that you want the neural network to approximate. And how you do that is that you keep slowly adjusting all your weights and biases according to this formula: every weight gets slightly adjusted by the derivative of the loss function with respect to it, and that way I slowly walk this optimization landscape until my loss function is smallest. The epsilon here is called the learning rate.
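In symbols (my rendering of the update rule described here), every weight and bias is nudged against the gradient of the loss, with step size epsilon:

```latex
% Gradient-descent update for every trainable parameter; epsilon is the learning rate.
\[
w_{ij} \;\leftarrow\; w_{ij} - \epsilon\, \frac{\partial L}{\partial w_{ij}},
\qquad
b_{i} \;\leftarrow\; b_{i} - \epsilon\, \frac{\partial L}{\partial b_{i}}.
\]
```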
To picture it, imagine a multi-dimensional optimization landscape with lots of local minima; the learning rate tells you the size of the step you take on it. If the step is too big you are not going to resolve the fine structure, and if it is too small you will get stuck in a local minimum, so this is a super important hyperparameter that you need to tune for your specific learning problem. Now we need to figure out how to actually calculate these partial derivatives, because it's easy for me to say: okay, you have some neurons connected by weights, whatever comes out goes into the loss function, that's some function I want to minimize, and I minimize it by calculating derivatives. But I just told you that we need derivatives with respect to all of the weights and biases in the network, which may sound like a daunting task. Can someone quickly remind us what the chain rule is? This is something we surely all had before. Thank you; normally this is the question I have to wait the longest to get answered. Everyone has seen this before: if I have a function that is a composition of multiple functions and I want to differentiate it, the derivative of the full function is a product of the derivatives of the composed functions, exactly what we just heard from the audience. But then it's simple, because I just take this formula and substitute it into this picture: we write the derivative of the loss function with respect to, say, the biases in the first layer, and then we decompose it like so. I differentiate basically from the back: I first calculate the derivative with respect to the last layer, then the derivative of the last layer with respect to the previous one, and so on until I reach the weight I wanted to differentiate with respect to in the first place, and we keep multiplying like this. This is a very general way to say it, but this is all backpropagation is: it's just the chain rule. And luckily you never have to think about it, because, as you will see in a second when we do the practical demonstration, it's implemented for you in any package you will be using. There are plenty of scientific reasons why we sometimes need to customize this algorithm, but this basic version is always implemented for you, so you never have to do it yourself; it's still useful to understand how it works, though. So I will just repeat it, and sorry for the slide with a lot of text on it: you take your network, you calculate the forward propagation, which is just a linear combination from the previous layer, a nonlinear function, a linear combination from the previous layer, a nonlinear function; then you calculate the backward phase, meaning you estimate the error in the final layer and apply the chain rule like I just told you; you do this for many examples and combine all the partial gradients into the final gradient; then you update the weights by doing this w_ij minus epsilon times the derivative; and then you go again and again. That's pretty much how it works on a conceptual level. Let's take a super concrete example, because I know I just said a lot of words, so let's do the same thing on a very minimal example. The first really prominent application of machine learning algorithms was image recognition: if you remember, a few years ago there were all these neural networks recognizing dogs from cats on the internet, and everyone was celebrating.
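Before we get to the image example, here is a minimal, self-contained sketch (my own toy example, not the lecture's demonstration) of exactly this loop of forward pass, backward pass via the chain rule, and weight update, for a tiny network with one hidden layer and a squared-error loss:

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny network: 3 inputs -> 4 hidden units (tanh) -> 1 output, squared-error loss.
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)
lr = 0.1                                  # the learning rate (epsilon)

x, y_target = np.array([0.2, -0.5, 1.0]), np.array([0.7])

for step in range(200):
    # Forward pass: linear combination, nonlinearity, linear combination.
    z1 = W1 @ x + b1
    h = np.tanh(z1)
    y = W2 @ h + b2
    loss = 0.5 * np.sum((y - y_target) ** 2)

    # Backward pass: chain rule applied layer by layer, starting from the output.
    dy = y - y_target                     # dL/dy
    dW2, db2_grad = np.outer(dy, h), dy   # gradients of the output layer
    dh = W2.T @ dy                        # error pushed back through W2
    dz1 = dh * (1.0 - h ** 2)             # ...and through the tanh nonlinearity
    dW1, db1_grad = np.outer(dz1, x), dz1

    # Gradient-descent update: w <- w - lr * dL/dw for every weight and bias.
    W1 -= lr * dW1; b1 -= lr * db1_grad
    W2 -= lr * dW2; b2 -= lr * db2_grad

print(loss)   # should be close to zero after training
```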
This is also a really great example of a neural network application, because if we want to distinguish characters from these two movie franchises, there is not one rule I can state to describe the distinction functionally, but this is exactly the kind of problem that is great for a neural network, because I have tons of examples from each class that I want to distinguish. To do this example, let me say two things. I told you before about activation functions and gave you some examples, but I neglected one, and that is that for the last layer you want to use a particular activation function; I didn't write it before because the formula is a bit more involved, but what it does is map your output onto a normalized probability distribution, in this case a two-class distribution: the probability it's Star Wars and the probability it's Star Trek. Then you need some loss function, and here, for example, we have the categorical cross entropy, which measures the difference between the predicted probability distribution (probability Star Wars, probability Star Trek) and the probability distribution that you wanted. So let's take a concrete example now with this categorical cross entropy loss function. Leia, or Star Wars, will be the vector one-zero; that's my label, the probability distribution P: a hundred percent probability on Star Wars, zero on Star Trek, and Spock has the exact opposite. The network output, the probability distribution Q, will be the probabilities the network gives me: probability I am in Star Wars, probability I am in Star Trek. Then we can write out these formulas, just writing out the loss function with P and Q and filling in the numbers, one-zero for Leia and zero-one for Spock, and you will see that you get the smallest number when Q of class A and Q of class B agree with your label. Let's look at the concrete example here for Leia: we have minus one times the logarithm of Q of class A; if Q of class A is one, we get zero, and that's good, but if it's zero or a small number, the logarithm is a very big negative number, which multiplied by minus one becomes a very big positive number, and that's a bad loss, and so on. So what you really do is as simple as this: you have the labels and you keep multiplying them with the probability distribution that comes from the network; if your loss is not small or does not keep decreasing, you repeat this backpropagation step, for example on a new set of images, and you continue. Let's take five minutes so you can digest this; those of you who are new to it can come and ask me questions, and after that we will go back to spin models and we can also do some coding examples.
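For those who want to check the numbers during the break, here is a small sketch of this loss (my own example; I am assuming the unnamed output activation described above is the standard softmax):

```python
import numpy as np

def softmax(logits):
    """Map raw network outputs onto a normalized probability distribution."""
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

def cross_entropy(p_label, q_pred, eps=1e-12):
    """Categorical cross entropy: -sum_i p_i * log(q_i)."""
    return -np.sum(p_label * np.log(q_pred + eps))

# One-hot labels: [P(Star Wars), P(Star Trek)]
leia = np.array([1.0, 0.0])
spock = np.array([0.0, 1.0])

confident_good = softmax(np.array([4.0, -2.0]))   # network fairly sure it's Star Wars
confident_bad = softmax(np.array([-2.0, 4.0]))    # network fairly sure it's Star Trek

print(cross_entropy(leia, confident_good))   # ~0.002: prediction matches the label
print(cross_entropy(leia, confident_bad))    # ~6.0: prediction contradicts the label
```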