Hi everyone, welcome back. I hope you are refreshed after the coffee break. We left off looking at a little bit of code for very simple supervised learning on an Ising model, and I hope it has been useful. If you are completely new, you are not expected to understand everything already, but hopefully you found the point in the code where you can change the layers and modify them a little bit. If that is all you achieve today, that is already a great success, and you keep the code if you want to revisit it later. There will be a notebook two as well, but let's talk about that when the time comes.

What we did so far was take this very simple Ising model and learn, in a supervised way, low temperature versus high temperature. Let's come back to the Ising gauge theory problem. This is a fun example; it was an interesting failure case I used to make my point about PCA. If you go to notebook two, don't do it yet, but when you do, and you load this Ising gauge theory into the same kind of feed-forward network, it will not work. There is a place in the code where you get to try it and see for yourself. And that is an interesting point, right? We just said that neural networks are very general function approximators, yet I am telling you already that this naive architecture is not going to work. Do we already have some early guesses why? Yes? Yes, that's a correct topological answer, and now in human language again. Exactly, I made this point earlier: we look at these big configurations where you really need to spot the one plaquette that violates the condition. First of all, if you flatten the configuration, the plaquette conditions become much harder to see in a fully connected layer. And second, roughly speaking, you would have to show the network all possible combinations of plaquette positions, so your training set would explode, because the network does not know the constraint and you would need to give it data in a format from which it can learn it. So if you just take these specific Ising configurations, flatten them and feed them into the feed-forward neural network, it will not work unless your training set essentially exhausts all the configuration options at that system size. Maybe that is not perfectly clear if I just say it like this; let's walk through the rest of the argument and then revisit it. But I already want to mention it, because this Ising gauge theory becomes a good illustration that architecture actually matters in machine learning. We cannot just take the dummy neural network I gave you in the code, apply it to anything, and expect it to work. This is a somewhat engineered outlier where it fails, but it will happen to you once in a while in a practical application that a feed-forward network is not enough.

What we need for this is something called a convolutional network. As before, I am going to give you a very basic, superficial primer and some references to look at later, or we can look at the code together. Basically, convolutional networks allow you to not flatten your input. The way we discussed things earlier carried an implicit assumption: I said that whatever is in your visible layer, in this one column, is your data.
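To make that flattening assumption concrete, here is a minimal sketch of the kind of flattened, fully connected classifier from the first notebook. The lattice size and layer widths here are illustrative choices, not the notebook's actual values.

```python
import torch
import torch.nn as nn

# Minimal sketch of a dense (feed-forward) phase classifier on flattened configurations.
# L and the layer sizes are illustrative, not the notebook's actual numbers.
L = 16  # linear lattice size, so each configuration has L*L spins

model = nn.Sequential(
    nn.Flatten(),         # (batch, L, L) -> (batch, L*L): the flattening step discussed above
    nn.Linear(L * L, 64),
    nn.ReLU(),
    nn.Linear(64, 2),     # two outputs: low-temperature vs high-temperature phase
)

x = torch.randint(0, 2, (32, L, L)).float() * 2 - 1  # fake batch of +/-1 spin configurations
logits = model(x)                                    # shape (32, 2), feed to nn.CrossEntropyLoss
```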
You may have noticed in the code, or you will when you come back to it, that you take the configuration and put it into one column. You can do the same with this kind of lattice spin data, you can do it with any data; it's common practice in machine learning and oftentimes it works. Except here we have this very particular constraint that is not going to survive the flattening. So, roughly speaking, convolutional neural networks allow you to keep your data in 2D. Instead of connecting everything to everything, they take a small window, called a kernel or sometimes a filter. You connect the weights from your input to this kernel and then keep scanning the kernel across your input. You do an element-wise multiplication: the first element of the input patch times the first element of the kernel, and so on, then you sum everything up and get the first point in the upper left of the output. Then I slide my kernel. These are the same weights; it's a scanning window, and I keep training the same set of weights. I slide it by something called the stride, the size of the step I am making, say by two, do the same thing, multiply, and get the second point. And then we slide around like this. This may look a little abstract, and I didn't even write the formula, but in the end it is simple to use the PyTorch package: you just specify your input, how many outputs or kernels you have, the size of your kernel, and how much you want to shift by, and the package figures out the rest for you. So you will have a line like this in your code later if you want to try.

Now, if you only scan one kernel across your input, you are probably going to pick up one specific feature but no more. So what people do is take many of these kernels or filters, that is a choice you make in your code, and train many of them. Maybe one picks up the color, maybe one picks up the noise dependence, maybe one picks up where the cat is in your picture, and maybe one picks up that you need to look at things plaquette-wise. What you will see in your code is a kernel of size two, which is the plaquette, and a stride of one, and we will shift this plaquette window around. This is exactly the structure that allows the neural network to very quickly distinguish whether a plaquette is even or odd, and it is also exactly the kind of thing that would be nearly impossible to discover if I gave it the whole configuration at once, flattened into a single vector. With convolutions I can make the network recognize the specific spatial structure that the data has. This is why, in machine learning, this is sometimes connected with translational invariance: you don't care at which plaquette the condition is violated, you scan the same filter all around the picture. In non-scientific machine learning this is super useful when you have models like "find the cat in the picture" but your training pictures don't all have the cat sitting in the middle. So that's one thing that is useful.
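In PyTorch the line looks roughly like this. This is only a sketch; the channel counts and lattice size are illustrative choices, not the notebook's exact values.

```python
import torch
import torch.nn as nn

# One convolutional layer: several 2x2 kernels (plaquette-sized) scanned with stride 1.
conv = nn.Conv2d(in_channels=1,    # one input channel: the spin value at each site
                 out_channels=8,   # 8 different kernels/filters, each free to learn its own feature
                 kernel_size=2,    # 2x2 window, matching one plaquette
                 stride=1)         # slide by one site at a time

x = torch.randint(0, 2, (32, 1, 16, 16)).float() * 2 - 1  # batch of +/-1 configurations, kept 2D
out = conv(x)   # shape (32, 8, 15, 15): each filter produces its own feature map
```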
The second useful thing is called padding. It comes back to a question we had earlier about boundary conditions. You don't want to redesign your network from scratch if, for example, you pick a filter size that doesn't tile your input exactly. Then what, are we done, or do we pick another filter size that doesn't suit the problem we are actually trying to solve? For this reason padding is useful: you can pad your image with however many zeros you need for the dimensions of the filter and the stride to fit together correctly. Or you can do the second thing, periodic padding, which for you as physicists is the more relevant one, because for most of these simple spin models you want periodic boundary conditions. It's not nice if your image just ends at the edge, and if I padded it with zeros I would of course violate the gauge constraint I'm looking for there. So we need periodic padding. These are just two options you can choose very easily in the package. For this lattice gauge theory specifically it is a bit awkward because there are two sublattices and it gets complicated, so we just wrote the periodic padding function for you in the notebook; you don't have to worry about it, and if you ever need it in the future you can look it up there.

The final stop on this quick tour of layers that might be useful for these problems is something called dropout. When you train a neural network there is something called overfitting. Maybe someone from the audience can tell us what overfitting is; you all raised your hands that you already work with machine learning. Yes please. Perfect answer, thank you. So the answer is that the neural network over-specializes on the data you are showing it and does not generalize well. A good way to avoid this is to sometimes just randomly kill some neurons, so the network has to relearn them as the data keeps coming. This is a super simple and very powerful tool. The problem it addresses is that if some training picture has a very prominent feature that immediately decreases your loss, it sort of predetermines the weights of your network to be biased towards that specific configuration and not towards the universal thing you want to learn. So you say, okay, I will just kill a specific neuron, and what this function does is zero out all the weights going into it, and then you have to start training them again. It's again super simple; we don't have to think very deeply about how to write it, because PyTorch has a function whose argument is the probability of how many neurons in a specific layer you want to zero out. Here is the example with 50%. If you are just starting out, a dropout layer of 20 to 30% after every layer or every other layer of a deep network is usually a good idea. There are more regularization methods; I am not going to cover them here in detail, but with the references you get you can read up on them. If your network is not generalizing well, this is probably a useful thing to try. Yeah, actually, now I see that I have it on the slide.
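Here is a minimal sketch of both ideas in PyTorch, assuming a plain square spin lattice. The notebook's own periodic padding is different because of the two sublattices of the gauge theory; this is only the generic version, with illustrative sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randint(0, 2, (32, 1, 16, 16)).float() * 2 - 1  # fake batch of +/-1 configurations

# Periodic (circular) padding: wrap one row/column around from the opposite edge before convolving,
# so plaquettes that straddle the boundary are also seen by the kernel.
x_padded = F.pad(x, (0, 1, 0, 1), mode="circular")         # pad right and bottom by one site -> (32, 1, 17, 17)
out = nn.Conv2d(1, 8, kernel_size=2, stride=1)(x_padded)   # back to a 16x16 feature map per filter

# Dropout: during training, randomly zero a fraction of the activations so the network
# cannot over-rely on any single neuron.
drop = nn.Dropout(p=0.5)   # 50% as on the slide; 20-30% is a common starting point
h = drop(torch.relu(out.flatten(1)))
```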
You can also add regularization terms to your loss function, where you add, for example, the absolute values of your weights to prevent the network from drifting into very small or very large numbers, but we are not doing that today. All you are going to need is the dropout function.

Now I think it is useful to go to notebook two, for people who have already finished or are very comfortable with the first one, where you can look at this Ising gauge theory. For those of you who are already pros, there are empty cells there where you can write your own code. So let's take maybe 15 or 20 minutes to also check out the second notebook. If you are new, this is the time to come back to the first notebook and continue at your own pace; in the end, 15 minutes is very little, so I just want you to see the code, get a feeling for it, ask me questions, and basically get started. Those of you who are already ahead can go straight to notebook two. The setup there is exactly the same as before, except that now you load the Monte Carlo samples from this Hamiltonian, and you even have a plotting function. I will first ask you to build a dense neural network that classifies these configurations. You will need to know the size of your input, so do not just scroll through the data-loading part. Once you have done that, you can continue with the evaluation. In the second part there is a setup for the convolutional neural network, where you also have a beautiful animation of how these filters multiply the weights and combine them into the output, if you want to check that out. There is the function that does the padding; it's just a manually written periodic padding for this kind of lattice. And then you can build your convolutional neural network. Here, since we are just starting, I filled in all the numbers for you. Jin, my PhD student, left a question for you here, a question for students: how do we know the number 2312? That is something to think about. Then you can run this code and see what kind of classification you get. What you should see is that with the convolutional neural network you get a beautiful 100% accuracy very quickly, but with the dense network it is not going to be so easy. Let me also repeat that I just care that you learn something; I don't care what you complete or don't complete. Even if you only read through the first few cells of the first notebook and understood everything, that's awesome. So this is the time to learn at your own pace, and if you get stuck at anything, ask me, and if I'm busy, ask one of these guys.

So, did everyone manage to do at least a little bit with the notebook? I see people nodding. I saw a lot of nice results when I was walking around, and it was also nice to see what kind of questions you were asking; that makes me feel like you understood things. I am going to spend the rest of today talking about some pitfalls of what we did so far, and some newer scientific results that can help alleviate them. So far we looked at two examples.
One very simple, one kind of convoluted, that's a pun, to distinguish the phases, and we learned that with a relatively small neural network you can classify the phases with accuracy upwards of 99%. You see this in the notebooks, and you could improve it even further if you worked a bit on the hyperparameters and the architecture. So now the question is, and I keep asking the same question, is everything solved? Let's step back and think: if this works, is my machine learning actually learning the physics I want, so I can just always do supervised learning and be happy with my life, or is there still something we are missing? That's one great point: the whole assumption here was that you have a labeled training set, meaning I understand my physics problem well enough to draw samples that are sufficiently representative, and I also know how to label them. Outside of these labels, is there anything else that could be problematic, maybe swept under the carpet, that you can spot if you think about it? Something I didn't say, yeah, I didn't hear you, say it again. Yes, that's a great point, a very technical one, but nonetheless correct. The training here was simple in the sense that we know it converges; not every optimization landscape over your trainable parameters will have a nice global minimum that you can actually reach with an out-of-the-box numerical method, and then one has to put more work into the optimization itself. Anything else? That's another great point: I just told you there are two phases, so we labeled two phases, and if there were a third phase I didn't tell you about, your model would have no way to find it, would it? And I'm still waiting for one more thing that is a bit different from the labeling. Anything else we can think of, Felix? Yeah, that's true, a hyperparameter issue: just because a simple kernel choice works for this example does not mean it will work for any example, and then one has to work hard and run many different instances to find a kernel that works. Anything else? So beyond the machine learning itself, it's a data thing. Thank you, perfect.

So here I gave you these Monte Carlo samples; the credit also has to go to my colleagues in Zurich who helped us create these training sets a few years ago. You may know that this Ising model is basically the easiest Monte Carlo problem; it's how you learn Markov chain Monte Carlo if you ever had a class about it. The thing is that for the interesting problems the sampling is not so easy. It's not like you want to solve a new problem and, voila, a beautiful training set of 20,000 unbiased, representative configurations is available just like that. So if I boil your points down into two concerns, one is the data and one is the supervision. I need a way to check whether my data has sufficient quality, and if I already knew the labels, what new physics did I actually learn? Those are the two things to keep in mind. In the rest of today I want to address the labels a little bit. There are relatively straightforward ways to generalize the supervised methods we talked about today into something more or less unsupervised that helps you deal with some of this lack of knowledge. Yeah, this is what I just said.
The first method I want to talk about is called learning by confusion. Again, the details are written on the slide so you can reuse it later, but I will walk you through it pictorially in the following slides, no worries. This is a method from ETH by Evert van Nieuwenburg and Sebastian Huber, also from 2017; it was one of those papers that came out side by side with the supervised paper we were looking at in the previous lecture. The idea is that even if you don't know where the phase transition is, you can guess a transition point, train many different classification models, and see how the performance of the model depends on the critical temperature you chose for the labeling.

So we again have this data set, we forget about data problems for the time being, and we have this Ising model that someone gave us, but we don't have the labels; we just removed them. Some samples are possibly low temperature and some possibly high temperature, so I simply guess the critical temperature. I am going to guess it is two. You have the labeling script in your notebook, so you could very easily modify it to label everything below two as one phase and everything above two as the other phase. Once I do that, I can go to the first supervised notebook we did today, rerun everything, and get a classifier trained under the assumption that the transition is at two. Okay, I do this, I am done. Then I save the accuracy of whatever happened and continue the game: I guess a new temperature, for example two plus epsilon, and repeat everything. I go to my labeling script, put in the new critical temperature, relabel everything, rerun, look at the accuracy, then pick yet another critical temperature, and in this way I sweep through a reasonable range of guesses of where the critical temperature could be. So far so good; we didn't do anything crazy, we just trained multiple supervised models with different guesses for the critical temperature. This plot is from the code I made for the data set you have; it is also somewhere on the internet, not in today's notebooks, so you are very welcome to try it for yourself, especially if you are a more advanced practitioner. It is a somewhat ugly version of the result, but it is still useful to look at: accuracy is on the y-axis and the guessed temperature on the x-axis. And what do we see? An interesting shape that has a peak right around 2.25, where you would want it. So that's cool, right? You can do this without knowing where the critical temperature is; we just made a few guesses and trained a few supervised models, and you already know it takes only seconds to train them in Google Colab, and you get something like this.
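To make the procedure concrete, here is a minimal sketch of the confusion loop. The data arrays and the training routine are stand-ins for what the notebook provides (hypothetical names), stubbed out here so the structure of the loop is clear.

```python
import numpy as np

# `configs`, `temps`, and `train_and_evaluate` are stand-ins for the notebook's data
# and training/evaluation routine (hypothetical names); stubbed so the sketch runs.
rng = np.random.default_rng(0)
temps = rng.uniform(1.0, 3.5, size=2000)             # temperature each sample was drawn at
configs = rng.choice([-1, 1], size=(2000, 16, 16))   # stand-in for the Monte Carlo configurations

def train_and_evaluate(x, y):
    # placeholder: in practice, train a fresh classifier on (x, y) and return its test accuracy
    return float(np.mean(y))  # dummy number so the sketch runs

guesses = np.linspace(1.2, 3.3, 11)   # candidate critical temperatures to scan
accuracies = []
for tc_guess in guesses:
    labels = (temps > tc_guess).astype(int)           # relabel everything below/above the guess
    accuracies.append(train_and_evaluate(configs, labels))

# Plotting accuracies against guesses should give the W shape,
# with the middle peak near the true critical temperature.
```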
Do we have guesses why this works and why the curve looks the way it does? What happens when you guess your temperature wrong is basically that the features the network extracts from the data get associated with two different labels. If your guessed critical temperature is too low, the "high temperature" label covers samples where almost nothing is aligned, but it will also cover samples that are strongly aligned, so which feature should the network learn for that label? And vice versa, if I put the guess too high, everything the network associates with being ordered under one label will suddenly also include training samples that are completely disordered but carry the same label. This confusion is exactly what degrades the performance of the network, because it is no longer as straightforward as "this feature means this label, the other feature means the other one". It is only when you guess the critical temperature more or less correctly, I'll be right with you, Mathieu, that the features the network extracts and the labels agree and boost the performance. What is it? Oh yes, thank you, I just wanted to ask you that. So who can answer this question? Yes, that's a great point. We have a follow-up. This is correct. Almost there. That's also kind of true. You are all saying correct things, and the last one was the sentence I wanted to hear: if your guessed critical temperature sits at the edge of your temperature range, then you simply labeled everything as one class and there is nothing to be confused about, because there are no mislabeled samples on the other side. The network then extracts something admittedly useless, but by this accuracy measure it performs well. The same argument applies at the lowest and the highest temperature, which is why the accuracy goes back up at the edges and you get the W shape. So this is very nice, I'll buy it. This particular plot is not super pretty because the data set is small, so you don't have to wait forever for Google Colab to load it, but it very nicely reproduces the same plot from the van Nieuwenburg et al. paper. You can even do it yourselves in your supervised notebook, and you should get pretty much the same transition. So that's one thing.

Then there is another thing, something we were working on with my collaborator. Learning by confusion has a small scaling disadvantage. Can someone tell me what it is? I already used the word scaling. Yeah, exactly, that's exactly correct. I need to train a lot of models, and each of them needs to be trained independently, because of the new labels and so on. And the second thing is that if my granularity is not fine enough and I am unlucky and the peak actually falls between my steps, then I need to repeat this process iteratively, which leads to even more training. So there is this idea, people keep calling it a single-shot version of learning by confusion, which is not how I thought about it originally, but I kind of like it. The idea is that you still train in a supervised way, but on something other than the labels of your phases. For example, when you are drawing Monte Carlo samples, you know the temperature at which you draw each sample. If you are measuring in an experiment, you know at which magnetic field you measured. If you are tuning a quantum gate, you may know the gate voltage, things like that. There is, not always, but in a lot of the use cases you will encounter in your daily problem solving, some continuous parameter that you anyway adjust in your experiment.
So the idea here was just to train a supervised network to predict the parameter that you already know anyway, which here could be the temperature; I wrote beta, which is the inverse temperature up to a constant. If we remember how derivatives work, what we would ideally want from this method is that, if I plot the predicted parameter on the y-axis and the true parameter on the x-axis, I get a linear function, right? Meaning everything I predict is exactly the true value. And if I take the derivative of that, it is a constant function, yes? Okay, let's do the same thing with this Ising data set, and then I will show you some fancier results on a better data set. That is actually not what happens. Specifically, you see that this is definitely not a linear function, and the derivative is definitely not flat, but interestingly, the derivative has a peak just where the phase transition is. Let this settle in a little, because this is an important point. Basically, what happens is that there is a kind of gradient in your data. If you are at the phase transition and you change your parameter, say the temperature, a little bit, the structure of your state changes a lot. Whereas deep inside a phase, over a whole range of parameters the configurations look roughly the same. If I am deep in the aligned phase and I change the temperature a little bit, I just keep drawing more configurations where everything is aligned. And if I have one configuration where everything is aligned, and another at temperature plus epsilon where everything is also aligned, because statistically that can happen, then the network has a hard time assigning the parameter correctly. But at the phase transition it is exactly the opposite: I change my parameter by epsilon and the structure of my state changes completely. This is what creates the gradient structure in the prediction. Deep in a phase, the prediction curve is flatter than the ideal linear function we would want, because it is really hard to see the temperature increase, especially when you don't have enough samples. Then there is a really nice, steep change when you go through the phase transition, and deep in the new phase it flattens out again. And then it is useful to look at the derivative, because that tells you the rate of change. Okay, this is not a great data set; you see that there are two peaks. I put the phase transition point at the bigger one, so technically it would be fair, but the data is a bit noisy. Still, this is yet another way to find out where your phase transition is without actually labeling the data with the phases. Is this clear? Okay, great. And this way you only train one network. For both of these methods there are some subtleties, however, because for this Ising training set, if you go to your own Monte Carlo code and draw ten million samples, you are going to get a straight line here, and in the confusion scheme you will probably also overfit a lot, so you will not see the beautiful W shape.
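To make the single-network idea concrete, here is a minimal sketch of the parameter-regression approach and of taking the derivative of the predicted-versus-true curve. All sizes, names, and the fake data below are illustrative, not the values from the paper or the notebooks.

```python
import numpy as np
import torch
import torch.nn as nn

# Regress the sampling temperature (or beta) instead of a phase label.
L = 16
regressor = nn.Sequential(nn.Flatten(), nn.Linear(L * L, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()  # during training, compare regressor(configs) with the known temperatures

# After training: average the predictions over the samples at each true temperature,
# then differentiate the predicted-vs-true curve.
true_temps = np.linspace(1.0, 3.5, 26)
mean_preds = np.array([
    regressor(torch.randn(100, L, L)).mean().item()  # stand-in for the samples drawn at this temperature
    for _ in true_temps
])
dpred_dT = np.gradient(mean_preds, true_temps)  # a peak in this derivative marks the phase transition
```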
So, for example, for this method it is really crucial that you critically assess your training set and data preparation, try a few things, and see how the prediction curves change, because if you just run it on an arbitrary set of training configurations it is probably not going to work as well as you would like, yeah? Yeah, this is 100% correct. I will show you the paper result in a second; there, for some of the simple examples, you can indeed check that what the network is actually learning is to estimate how many configurations there are at a specific energy, and then of course, if you start differentiating through that, you get the same thing. This is a super good point. This is one of those things where, if you don't want to think about it very hard, and this was in 2020 so it wasn't super well understood, just slapping a feed-forward neural network on it is a solution, but it's true that nowadays we also have other tools for this. Any other questions? Yeah, great question. The way I explained it here, it is done for a fixed system size, and what you see is the phase transition at that system size. However, nothing stops you from getting data for a bigger size, processing it in the same way, and plotting how the estimated transition parameter shifts. So, for example, this is our result from the paper, where we did it on a slightly nicer data set for the Ising gauge theory phase transition. And this is exactly an example of the system-size dependence you are talking about, because there, the phase transition in the infinite system is really at temperature zero; at any finite temperature you are effectively in the high-temperature phase. And of course, at the small sizes we do not see zero, but we do see the systematic shift towards it. So I think in that sense it can address your challenge. Oh, yeah, yeah, they do. I don't have it on the slide, but it's in the paper; you get a nice finite-size scaling law, yeah.

Yeah, actually, maybe, wow, now my presentation got confused. For us this was kind of a groundbreaking moment, because for me this Ising gauge theory is a really good benchmark if you want to test a new method. It turns out that for the supervised learning, I just walked you step by step through how to really engineer your neural network to recognize the phases, specifically choosing kernels of the correct size and choosing the stride correctly, and then you train the network and it works. But someone in the audience was asking me: okay, but what if I don't know what the kernel size should be, or what if I don't know that I need to switch to a particular convolutional architecture? Then you would need to run a huge numerical experiment scanning all your possible hyperparameter and architecture configurations, and that is not very practical. And it turns out that for all the different unsupervised methods, you can go with clustering, you can go with variational autoencoders, there are many different methods.
These particular gauge constraints are super, super hard to learn without hard-coding the constraint into your loss function, because of this weird property that even if just one of your plaquettes is violated, even if you are a single spin away from a valid configuration, the whole thing is broken, and for data-driven models that are used to learning average features this is a very sensitive problem. But this kind of unsupervised learning, where we train on the temperature and completely leave the phases and the gauge constraints out of it, is actually a great way to approach this, because it can capture the phase transition without ever learning what the gauge constraint is. Then maybe you have an interpretability problem, but that is a second issue, and we will come back to it maybe tomorrow. So this is a nice thing: if you have a new idea for an unsupervised phase detection algorithm, I would warmly recommend testing it on the Ising gauge theory, because it is a simple data set and a very vicious one. We were then also deploying the same techniques on, for example, the phase diagram of the toric code. Who knows what a toric code is? Okay, only some people: it is one of these error-correcting codes that you can write as a Hamiltonian, and it has a very complicated phase diagram with all different types of phase transitions. It turns out that even for these weird phase transitions in the ground state of a quantum error-correcting code you can apply these kinds of methods, and we got very nice peaks for all the different ways of crossing the phase transitions there.

Okay, those are the main things I wanted to tell you today. What to remember from today? Hopefully we covered a good, semi-helpful primer on supervised learning. You have your notebooks if you want to see how it looks in code. I really, really tried to make everything as data-set independent as possible; all the training functions are ready to copy-paste into your own applications, so hopefully that will be helpful. If you are new, I would recommend you start from there. And we covered some more advanced layers, like convolutions and dropout, and, convolutions, dropout, and now I forgot. I guess it was just convolutions and dropout, oh my God, sorry. And then there are these unsupervised learning techniques, which hopefully you now understand are just supervised techniques in sheep's clothing: you either guess your phase transition point, or you label with some other parameter that you already know, and both are known to perform rather well. Tomorrow we are going to look at some practical applications in the morning; I think Francesca is going to tell you a bit about ultracold atoms, so we will do some examples of supervised learning with ultracold atoms, and then we will also look at some more advanced machine learning techniques that are used in contemporary quantum experiments nowadays. That's it for me; we have 10 minutes left, so if you want to stay here and get some more advice on working through your notebooks, feel free, and otherwise enjoy the rest of your day and see you tomorrow.