Well, welcome back everybody. We will now go on with the last of the three lectures by Florian, and I already see some announced Jupyter coding examples. So have fun. OK, thank you, Markus. So welcome back. I want to start by pointing you to some example notebooks that we produced for the machine learning course that I've regularly given back in Erlangen. You can find these examples by going to my GitHub page, Florian Marquardt, Machine Learning for Physicists repository. You will see many different Jupyter notebooks, starting with the very basics of machine learning: programming a neural network in NumPy, and so on. But I want to point out a particular type of example, which is applying policy gradient, in this case, to solve a maze. The idea is something like this: you generate a maze, and your little robot has to find its way. And you can do this in many different variations. In particular, you can say it's always the same maze, but maybe my robot starts at random locations and has to find the exit, or has to find some treasure chest. Or you could say, no, my robot should develop a strategy for arbitrary mazes. And that has a big influence on how you can set up the solution. Namely, if it's always the same maze, then the state could be just the location of the robot, because it's always the same maze. So once it knows its location, and somehow it has learned to go around in this maze, it will know where to move. And in that case, you can actually encode the policy, that is, the conditional probability of action given state, in a table. You just make a table that is as big as this image of the labyrinth, and if there are four actions, then there will be four such tables. These are the probabilities, and you can train them. And this is what is shown in this first tutorial. So if you go to this GitHub page and you download "07 tutorial maze policy gradient", that is policy gradient applied to finding your way in one given fixed maze. And the beauty of that tutorial is that the policy gradient algorithm, the full learning of everything, is 124 lines of pure Python code, as it says here. So the advantage is you can really go through every line yourself and understand it; it's really the minimal example I can think of here. And then you can run it, and it's being visualized, and so on. Now, if you want to be able to solve arbitrary mazes, then the input to the robot probably has to be something like an image of the maze. And then it should look at the labyrinth and say, oh yeah, I should go this way and that way. In which case, you need a neural network. And so this is shown in other tutorial examples; let me just quickly go there. So here's "08 tutorial", using deep policy gradient, where "deep" stands for deep neural network. You can go through it; it's of course a little bit longer. But this first attempt didn't yet work that well, so there's another version of that, "09 homework improved maze deep policy gradient". And that is really applying a convolutional neural network to an image of the labyrinth, going through several steps, and then spitting out the action probabilities, similar to what we saw for AlphaGo with its convolutional neural network. And that eventually works fairly well. And then you can play different variants of the game. Here we have variants where there are treasure chests hidden in the labyrinth, and you have to pick one after the other, and stuff like this.
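To make the table-based policy idea concrete, here is a minimal sketch of tabular policy gradient (REINFORCE) for one fixed maze. This is not the tutorial code itself; the environment interface (`env.reset`, `env.step`) and all sizes are illustrative assumptions.

```python
import numpy as np

H, W, N_ACTIONS = 8, 8, 4               # maze size; actions: up/down/left/right
logits = np.zeros((H, W, N_ACTIONS))    # one table entry per (cell, action)

def policy(state):
    """Softmax over the four action logits of this maze cell."""
    p = np.exp(logits[state] - logits[state].max())
    return p / p.sum()

def run_episode(env, T=50):
    """Collect one trajectory; env.step is assumed to return (state, reward)."""
    state, traj = env.reset(), []
    for _ in range(T):
        a = np.random.choice(N_ACTIONS, p=policy(state))
        next_state, reward = env.step(a)
        traj.append((state, a, reward))
        state = next_state
    return traj

def reinforce_update(traj, eta=0.1):
    """REINFORCE: push up the log-probability of taken actions, weighted by return."""
    R = sum(r for _, _, r in traj)      # total return of this episode
    for state, a, _ in traj:
        grad = -policy(state)           # d log pi(a|s) / d logits = onehot(a) - probs
        grad[a] += 1.0
        logits[state] += eta * R * grad
```

The key point is that the whole policy lives in one small table of logits, one entry per maze cell and action, which only works because the maze is fixed.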
So I really invite you to go to this GitHub page, download these Jupyter notebooks, go through them, try to understand the code. But that's something you have to do yourself in little teams, so I will not do it online now. You just do it yourself. Okay, so now I want to spend the last 80 minutes or so of this lecture series on giving you a few examples of applying reinforcement learning, taken from our research, so you really see how this works out when you want to apply it to quantum problems, where it becomes non-trivial. So: can we design better quantum computers, that's the overarching question, by having a classical neural network interact with a quantum computer? And there are many aspects, even beyond reinforcement learning, where classical machine learning can be useful for quantum technologies. Maybe you discussed parts of this already during the rest of the lectures this week, but let me nevertheless summarize it. So here you have your little quantum device; it could be a quantum computer, a quantum simulator, or a single quantum sensor. You want to control it via machine learning, and what are the tasks that you could try to solve? Well, for one thing, maybe you just want to simulate your quantum device, and maybe that's complicated because it consists of many qubits, so even representing the quantum state is difficult. You may have already heard something about this, and you will certainly hear something about it later: neural-network quantum states are one example application of machine learning for simulating these quantum devices. Then you could also just be staring at noisy measurement data, voltage versus time, very noisy, and you want to interpret it. You want to figure out: is this qubit in state one or zero, despite all the noise and despite all the non-idealities in your quantum device, which maybe doesn't behave perfectly like you would assume in simulations? You can train a neural network to do this as a kind of classical classification task. Or maybe you have fabricated your quantum device and you know roughly which Hamiltonian it should obey, but it's not quite there; there are unknown parameters. You want to figure them out in an efficient manner, and again machine learning can help. And then, on the other side, there's the discovery of strategies. You could go down to the hardware level and ask, in a quantum-control setting, what are the best pulse shapes that I have to send down to my quantum device to do even just a single gate, like flipping the qubit from down to up. Or you could say: once I have these gates, can I stitch them together in a quantum circuit, maybe even including some measurements and feedback, to discover a good quantum circuit or a quantum protocol involving feedback? Maybe you even want to discover whole experimental setups. So these are the things that may be tackled with reinforcement learning. And if you want to read a recent, very brief review of the whole field of classical machine learning for quantum technologies, I direct you to the perspective article that we wrote last year. Okay, so now let's go to reinforcement learning. Ah, I still wanted to advertise what we are doing in our group. So yes, we are doing machine learning for quantum technology, but we're also applying machine learning, say, to optimize photonic systems, figuring out how photonic crystals should look, to do what you could in general call artificial scientific discovery.
For example, you are staring at a very complex system with its complicated dynamics: can you discover such things as the best representation of the system? We are also turning things around and looking at how to use physics to produce better machine learning devices; the keyword here is neuromorphic computing. How can you replace digital neural networks with something more analog? Okay, so this picture you've all seen already: reinforcement learning consists of an agent and an environment. But how would we apply this to a quantum device? Here's the setting I'm thinking of. Here, for example, you would have a couple of qubits. You can decide to measure them; that would be something like an observation that your agent takes in. And then, possibly based on a neural network, your agent decides on the suggested actions, and you apply these actions. In such a setting, the actions could be quantum gates: you decide to flip one qubit, or to apply a controlled-NOT between two qubits. And then you would have to define a reward that tells the agent what task it is really trying to achieve. And here's one interesting question already: do we have a model of the environment? I briefly mentioned it in the very beginning. Sometimes in physics we actually have this model, because it's the Schrödinger equation, for example, for quantum physics. But even though that may be the case, applying that model can still be difficult, because in order to really apply the Schrödinger equation, I need to know the exact details of the parameters in the Hamiltonian, so I need to calibrate my experiment. Not only the parameters in the Hamiltonian, but maybe also all the technical details of how the signals travel along the control lines and so on, so it can be fairly complicated. And therefore it is useful, even in this setting where in principle we know the Schrödinger equation, to apply the model-free reinforcement learning techniques that we discussed before, where, from the point of view of the algorithm, your quantum system is treated like a black box, and it reveals its effects only via the interaction with the agent. So the agent says, I do this action, and then the environment spits out another observation for the next round. In this way the agent does learn what's going on, but only implicitly; it does not make use of a model. Okay, and so one of the big tasks that we identified as being amenable to reinforcement learning is quantum error correction. And why quantum error correction? Well, first, it's an important task in quantum computing, because quantum computers are subject to noise; if we don't have quantum error correction, we will never have a useful quantum computer. Second, the strategies for quantum error correction are fairly tricky, so they are not completely obvious to us humans; you could try to discover them with reinforcement learning. Also, quantum error correction strategies may involve some kind of feedback loop, because you measure and then decide: oh, an error must have happened, I should correct it. That's a classical type of feedback. And reinforcement learning is very good at feedback, because reinforcement learning is about an agent observing its environment and then reacting based on the observations. So that's where we started.
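To make this black-box point of view concrete, here is a gym-style interface sketch. Everything in it (class name, gate list, observation format, episode length) is an illustrative assumption, not an actual library API or a real device driver.

```python
import numpy as np

class QuantumDeviceEnv:
    """Sketch: the quantum device as a black box that reveals itself only
    through measurement outcomes, as in model-free RL."""
    ACTIONS = ("flip_q0", "flip_q1", "cnot_01", "measure_q0")  # example gate set

    def reset(self):
        self.t = 0
        # on real hardware: prepare the initial state; here a dummy observation
        return np.zeros(1)

    def step(self, action_index):
        self.t += 1
        # on real hardware: apply the chosen gate or measurement and read out
        obs = np.random.randint(0, 2, size=1)   # stand-in measurement bit
        reward = 0.0                            # the reward comes from the task
        return obs, reward, self.t >= 200       # observation, reward, done
```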
And you can ask things like: what's the best quantum error correction strategy in practice? For a given hardware platform, it will depend on the kind of noise acting on that platform, the available quantum gates and operations that I can apply, and the connectivity between the qubits. And that's already interesting, because even if you know in principle how quantum error correction works, because Peter Shor told us in the 90s, still, on every particular hardware platform there's a lot to discover, and you can use reinforcement learning for that. Okay, and so now I want to go through the first project that we ever did in this domain, which was also the first machine learning project that we ever did as a group. And that was about the following situation. Imagine you have a few qubits, not many, a small quantum module so to speak. They are subject to some noise, so in principle the quantum information would decohere and decay, and you want to protect it against that noise. And you pretend that you know nothing about quantum error correction. So it's the task of the agent to figure things out on its own, which has the advantage that it may also figure out slightly surprising solutions that you would not have come up with otherwise. To make this really clear, I will formulate it like a game, because every reinforcement learning problem is like a game. At the start of the game you initialize one of the qubits in a superposition state, an arbitrary superposition state, and your task is to preserve this superposition state as long as possible, regardless of how it looked in the beginning. So it's not about preserving a known fixed state; that could be easy, you just wait for it to decay and then you recreate it. You have to preserve a state that can be arbitrary, and so you should never lose it, because once you lose it, it's gone. Okay. So what can you do? Well, you can obviously apply different quantum gates, maybe do something on single qubits, or also do something on pairs of qubits, like flipping the second qubit's state depending on the first qubit's state, these kinds of things. And the way we like to represent this is in terms of a quantum circuit; this will occur again and again in this lecture. A quantum circuit has time running to the right. Every qubit is a different line, and one of these four qubits in this example is the qubit where we put in the arbitrary superposition state; maybe the others are in their ground state. And now, if I don't know anything, well, I can just start applying quantum gates. For example, a controlled-NOT between this qubit with the interesting superposition state and a neighboring qubit. So the neighboring qubit will be flipped depending on whether the first qubit is in zero or one; that's the circuit symbol here. I can go on and apply it to other pairs of qubits; I could also apply single-qubit gates, which I'm not showing here. And eventually, maybe I decide to measure. We want the agent to have the possibility of a measurement, because we know that in quantum error correction I sometimes want to do measurements. So now, if you did decide to measure at this point in time, well, if you think it through and you know your quantum physics, then you know that these controlled-NOTs have entangled the original quantum state with all the rest of the circuit, and this measurement will now completely collapse the quantum state. So instead of preserving it, you have done the opposite. That would be an exceptionally bad choice.
But I'm showing this example because this is exactly the situation that a reinforcement learning agent finds itself in at the beginning. It doesn't know anything, it will apply random actions, and it will typically fail in this manner. Okay. So now, how could we solve this problem using reinforcement learning? I will first introduce a naive approach, which is the one that we started with, because we were very enthusiastic. We had read about AlphaGo and we said, okay, that's great, we can solve any complex problem using these techniques, and we just need to press the button. So we said: what's the reward? Well, the reward could be the overlap of the final state of the qubits, or of that particular qubit that I care about, with the initial state. If that overlap is large, I have preserved my quantum information. And since I want to preserve arbitrary initial states, I run trajectories for arbitrary initial states and average this over initial states, but that's a small detail, okay. And the policy network, if I think of policy gradient reinforcement learning, would just have the measurement result as an input and then produce the action probabilities as output. The actions would be, for example, a single-qubit bit flip, a controlled-NOT, any other kind of quantum gate, and also a measurement. And of course here I simplify already, because if there are four qubits, there are many different CNOTs that I could apply, between the first and the second, the second and the third, and so on, and likewise for bit flips. So there will be relatively many actions already. So now you could let this run, and it will surely produce some random sequences, but it will actually stay completely stuck in these random sequences. So we were very disappointed: all these really powerful techniques were not of any use. So what's going on? What's the problem? Well, the first problem, which is of course also present in other reinforcement learning setups, but here apparently particularly bad, is the combinatorial explosion of things you can do. The shortest useful gate sequences that encode the quantum information, detect possible errors that occur, and then decode the quantum information are already fairly long. That could be, I don't know, 30 different time steps at least, something like this, yeah? And then you also have many different gates at each time step. So if you just do the combinatorics, say for 20 possible gates or actions and 10 time steps, you have 20^10 possibilities. Now, the solution may not be unique, so maybe there are several that give a good result, but still it's completely crazy, and in the end we will even be talking about 200 time steps. So there's a large, large space of possible solutions. And the problem is: it would be fine if there were one solution which only needs a few gates and is not so good, but already better than nothing, and then you could build on that and improve. But even the shortest solutions that do something useful and recreate the quantum state are already relatively long, so it's very hard to encounter them by sheer luck. Okay, so that's one problem.
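Coming back to the naive reward described above, as a minimal sketch it might look like the following, where `run_circuit` stands in for a hypothetical simulation of the chosen gate sequence, returning the final state of the logical qubit as a pure-state vector.

```python
import numpy as np

def random_qubit_state():
    """Haar-random single-qubit state cos(t/2)|0> + e^{i p} sin(t/2)|1>."""
    theta = np.arccos(1 - 2 * np.random.rand())   # cos(theta) uniform in (-1, 1]
    phi = 2 * np.pi * np.random.rand()
    return np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])

def naive_reward(run_circuit, n_samples=100):
    """Average overlap |<psi_init|psi_final>|^2 over random initial states."""
    total = 0.0
    for _ in range(n_samples):
        psi0 = random_qubit_state()
        psiT = run_circuit(psi0)       # hypothetical: simulate the gate sequence
        total += abs(np.vdot(psi0, psiT)) ** 2
    return total / n_samples
```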
And the other problem that we then also discovered was local optima. Of course, that's not unique to reinforcement learning, but in this particular case, what it means concretely is that the agent discovers that it's really good to be lazy. Because if you don't do anything, no gate and no measurement, and you just let the qubit sit there, you're already doing much better than if you do something random. If I want to plot this, low should be good, so I'm plotting the negative return, okay, fine. And so this is the idle strategy: doing nothing is already a local minimum. Then there is, in principle, something better, but it's a smart and complex strategy, and between the two there's this long barrier of incomplete strategies, where you are not yet quite there and not yet doing quite the right gate sequence. And it's really tough to overcome this barrier. Okay, so the lesson I want to teach by discussing this old work from 2018 is that in many cases, if you are really trying something difficult in physics, you cannot just import the things that computer scientists have invented, press the button, and hope that it works. You have to put in your knowledge of physics. And so we had two key concepts that we invented for this purpose. The first was to put in as much information as possible, and I'll explain it in a moment; the second was to construct a very smart reward function, much smarter than the naive thing that we started with, okay. So first, about the information that you get. In principle, this error correction strategy should work purely based on the measurement results, because we know this; I mean, Peter Shor taught us already in the 90s how quantum error correction should work. So in principle, these projective measurements of a qubit should be enough to decide, for example, whether an error has happened and I should apply a correction sequence. But if you think about it, these are very rare events. Most of the time you're just applying gates, and once in a while you're doing a measurement, and then you get a one or a zero, and based on that you have to think about what this means. And if I come back to our little robot, this is almost as if the little robot were blindfolded, bumping around the maze, only maybe touching the wall that is nearby and not much more. It's of course much easier for the robot if you give it the full picture of the whole labyrinth, including the positions of all the boxes. Even then it's not completely trivial, because it still has to plan the shortest path to the nearby box. So it's not trivial, but it's obviously much simpler. And so how can we translate this to our setting? Well, we could say that in principle, instead of being fed only these measurement results, the agent is fed the state, the quantum state of the quantum computer, at any given time, and then has to decide what to do. That's certainly more information; it's actually the most information that we can possibly provide. The problem is, it looks a little bit shaky. It almost sounds like cheating. First of all, in an experiment we will not be able to do this, because in an experiment I cannot somehow stop my quantum computer and do full quantum state tomography to see which state I have. That's completely beside the point: in the experiment I should, most of the time, not measure the quantum computer, and only very rarely measure it; otherwise we would be back in the old problem. So we can do this in simulations, but not in an experiment. That's something we have to think about.
Plus, this agent has the task of preserving an unknown quantum state, but now you're giving the quantum state to the agent. So in the worst case, the agent could start to cheat: initially it is given the quantum state and memorizes it, then it lets the quantum state decay, and at the end of the time evolution it just runs a state preparation sequence, recreates the quantum state, and says: ha, I was successful. But again, this is unfair; in a real experiment this is not possible. No one tells you this unknown quantum state. That's not the point of quantum error correction. So what do we do? First, we tackle the problem that we want to preserve an arbitrary quantum state. Instead of feeding the quantum state itself into the agent, what we're feeding into the agent is, so to speak, the time evolution of arbitrary quantum states, or in more technical language, the map that would map the initial state to the state at some time T. Technically, that's a completely positive map that includes all the dissipation that's going on. And we can then think about how to describe this map, and it's actually not that difficult, because we are only considering arbitrary quantum states in this single-qubit subspace, so it's manageable. So we actually feed this completely positive map into the agent, and then it cannot spit out an action that depends on the actual realization of the quantum state; it only knows, in a more abstract sense, what's going on. So that's already one trick. But this still doesn't solve the problem that in an experiment I have access neither to the density matrix nor to this completely positive map. So what we do then is the following, and we called this two-stage learning. First, in simulations, we train what we call the state-aware network, which gets the full information; again, here I pretend I'm feeding in the quantum state ρ, while in reality I'm feeding in this completely positive map. Once this is trained, and because it's so powerful it does train well, I set up a second network that really only takes measurement results as input. But instead of trying to train that network using reinforcement learning, which I already know is very hard, I use my first network, which has become an expert: I run it on trajectories, and then I try to teach my second network to reproduce its actions, but now based only on the measurement results. So the first network, the powerful one, has been trained using reinforcement learning and knows the correct sequence of actions; the second network tries to mimic those actions, but based only on the few measurement results. That part is supervised learning. A powerful network, the teacher, teaches a less powerful network, which however can then be applied in the experiment. So that was something that we did, and it really worked very well. And in principle, it's something that you can apply much more generally. If you have a reinforcement learning setting where you observe only a small part of the world, so it's really difficult to decide what's a good action, but you have a simulation available that can give you a much bigger observation (think of a computer game where you suddenly see the whole map of the playing field instead of only the vicinity), then you can first train an agent on this fully observed environment and afterwards have it teach another network that only takes in these very sparse observations. So that worked really nicely.
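As a minimal sketch of the supervised second stage, here is one gradient step of distilling the teacher's action probabilities into a student. A softmax-linear model stands in for the real measurement-based recurrent network, and all sizes and the stand-in data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N_ACT, N_FEAT = 6, 8                                 # illustrative sizes
W = rng.normal(scale=0.1, size=(N_ACT, N_FEAT))      # the "student" (linear + softmax)

def student_probs(obs):
    z = W @ obs
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_step(teacher_probs, obs, eta=0.05):
    """One supervised step: cross-entropy between student and teacher.
    For a softmax-linear model the gradient is (p_student - p_teacher) x obs."""
    p = student_probs(obs)
    W[:] -= eta * np.outer(p - teacher_probs, obs)

# usage sketch: in the real scheme, teacher_probs come from the trained
# state-aware network, and obs from the measurement record (via an LSTM)
for _ in range(1000):
    obs = rng.normal(size=N_FEAT)                        # stand-in measurement features
    teacher_probs = np.eye(N_ACT)[rng.integers(N_ACT)]   # stand-in expert target
    distill_step(teacher_probs, obs)
```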
Now there's one technical thing here. If you give me the quantum state, that's enough to know what to do. If you give me measurements, you should actually tell me the whole sequence of measurements, so I may need some memory: I may need to memorize, oh, the first measurement five time steps ago gave me a one, the other one was a zero. But once you give me this memory, this is in principle enough information to reconstruct what should have happened. And a network with memory, that's one of those recurrent neural networks. I don't know whether you discussed this. Yes, you did. Okay, so it's an LSTM network, and it worked really nicely. Okay, so that was the issue of having too little information: we solved it by training with a lot of information and then teaching a network that can cope with little information. Any question about this, yes? You're asking about scaling with qubit number? Yes, absolutely. It's not by accident that I showed you four or five qubits. Any technique that tries to figure out strategies for a quantum computer is already limited, to some degree, by what you can simulate on a classical computer. That's why I emphasize a small quantum module that I want to improve, instead of the whole quantum computer. So that's first of all true. Then, sometimes one can get around this a little bit, depending on the gates that you apply; I will come back to that. With Clifford gates, it's easier to simulate. But even in those cases, it's true that the input to this network is a bit of a bottleneck. We already tried to make it easier on the network: what we did was not just naively feed the density matrix into the network, but an eigenvalue decomposition of the density matrix, keeping only the largest eigenvalues and eigenvectors and feeding only those into the network. So that already saves you a little bit, but there's still an exponential part in here. I'm not sure I'm following the exact idea. Ah, you mean, so to speak, a quantum reinforcement learning agent? So, in general, ideas about quantum reinforcement learning exist, but I don't think anyone has done it on this problem, I would say. But we also now have different techniques to tackle the whole problem, so I don't know whether it's relevant. Okay, so the second ingredient is a smart reward scheme. Obviously, things would be easier for this little robot if, instead of just getting a reward when it's already on the box, there were these little cookies around the box and it already got small rewards along the way. It's a little bit like the Q function, only you would set it up by hand, so to speak, to guide the agent a little. Whenever you can do something like that without too much effort, maybe you should do it. So what does it mean for us? The problem with our reward, which initially was just the overlap between the final-time state and the original state, is that it's a very sparse reward: only at the very final time are you told whether you did well or not. And so what happens during training is: you do some gate sequence, unfortunately some of the gates are chosen wrongly, and in the end you get an overlap of zero between your initial state and the final state, so you get reward zero. You run it again, again you make some mistake because it's a long sequence, again you get zero. So: zero, zero, zero, zero, zero. You never actually get a reward signal. You cannot really learn anything from that. So that's a problem.
So what you would want, ideally, is something where even if you do things somewhat right, at least for a few steps, you already get a reward. Ideally we would want to know, in the middle of this complicated gate sequence, how well we are doing: how much of the initial quantum information is still preserved. Now there's a problem with that, because in the middle of this gate sequence I cannot simply take the overlap of the quantum state with the initial state. The reason is that the whole point of quantum error correction is to take the initial state, entangle it in complicated ways, and spread it over all the qubits. So the intermediate states will look completely different, and the overlap with the initial state is basically zero. The overlap is not good enough. But still, it would be so great to have an intermediate reward that tells me how much of the initial quantum information is still preserved. It must be more clever. It must be something like: if at this point in time I did the optimal decoding of this complex entangled state back into the single qubit, how large would the overlap then be? So it's more complicated. And then my student Thomas came up with a really, really nice idea. The point is this: how can I tell how much my quantum information has decayed? Well, initially I only have a single qubit; I'm talking about a single logical qubit. So I can represent it on a Bloch sphere. Now pick two opposite directions on this Bloch sphere, say spin up and spin down, qubit up and down, and evolve them according to this complicated quantum circuit. Each of these states will become a very complicated entangled state distributed over many qubits. But I can still ask: how well can I distinguish these two states? If I can distinguish them perfectly by a well-chosen measurement, then in some sense the initial quantum information is probably still present. If, on the other hand, everything had decayed by then, everything had relaxed to the ground state, then I could not distinguish them anymore; both states have become the same state, and then it's bad, obviously. And there is a mathematical scheme for this. If I give you two quantum states, even mixed states, and ask how well you could distinguish them, allowing an arbitrary, optimally chosen measurement: that's given simply by taking the difference of the density matrices of the two states and applying the one-norm. Basically, you diagonalize this matrix, take the magnitudes of the eigenvalues, and sum them all up. It has been proven that this tells you the probability of correctly distinguishing between the two states in an optimal measurement. And this is what we took, applied to the initial logical states of the logical qubit, to quantify how much quantum information is still preserved. We call it the recoverable quantum information. We take the worst case: depending on the initial state of the logical qubit, sometimes you're doing better and sometimes worse, so we minimize over the initial Bloch vector direction. And that has proven to be super useful. So this is a quantity that we can calculate at any intermediate time.
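A minimal sketch of this quantity: the trace distance between the two evolved states, minimized over Bloch directions (sampled randomly here for simplicity; `evolve` stands for a hypothetical simulation mapping an initial logical state to the resulting many-qubit density matrix).

```python
import numpy as np

def trace_distance(rho1, rho2):
    """(1/2)||rho1 - rho2||_1: diagonalize the (Hermitian) difference and
    sum the magnitudes of the eigenvalues."""
    eigvals = np.linalg.eigvalsh(rho1 - rho2)
    return 0.5 * np.abs(eigvals).sum()

def bloch_pair(rng):
    """A random Bloch direction |n> and the orthogonal state |-n>."""
    theta = np.arccos(1 - 2 * rng.random())
    phi = 2 * np.pi * rng.random()
    up = np.array([np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)])
    down = np.array([-np.conj(up[1]), np.conj(up[0])])
    return up, down

def recoverable_info(evolve, n_directions=50, seed=0):
    """Worst-case distinguishability over initial Bloch directions."""
    rng = np.random.default_rng(seed)
    worst = 1.0
    for _ in range(n_directions):
        up, down = bloch_pair(rng)
        worst = min(worst, trace_distance(evolve(up), evolve(down)))
    return worst
```

The value is 1 when the two branches remain perfectly distinguishable and drops to 0 once they have merged.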
And if there has been no decoherence, then regardless of my complicated quantum circuit I will get one, which says: this is still perfectly recoverable, even though I wouldn't know at this moment how the recovery sequence, the decoding, would work. So I do not need to discover the decoding in order to apply this measure. And with this reward, things finally worked out. Here are a few examples. You have four qubits, you train; initially it's almost random. Eventually it learns how to avoid the catastrophic measurements that collapse everything. In this case it finds a repetition code, so it encodes the state into three qubits. It discovers that there are smart measurements, namely the parity measurements. Here you set aside one of the four qubits as an ancilla, and it learns that on its own: you do a controlled-NOT with each of two qubits and then measure the ancilla. Then you can discover whether the two qubits are still the same or whether they have been flipped with respect to each other, which indicates there may be an error that you have to correct, and so on. So this works nicely. And once it works on one simple example, you can run it on many different topologies, many different connectivities and gate sets. For each of them you can ask how well it is doing, and how much the error correction improves the coherence time versus the bare coherence time; this is shown with these bar plots. And then you can go on and do other things. These are technically all stabilizer codes, and very simple ones at that. But you can also say: let me introduce some noise which acts in the same way on all the qubits, and then it finds very smart adaptive measurement strategies to protect one of the qubits, figuring out the noise by measuring the other qubits, and so on. So that's the good thing about it, but of course it's not very scalable. We do discover things from scratch: we do not build in any knowledge of quantum physics or error correction strategies or how quantum computing works. From the point of view of the agent, it's just being told: you have these 20 actions, you have 200 time steps, you get a reward and an intermediate reward; do your best to improve this reward. So that's what it's doing. The advantage is that it's not tied to any of the standard quantum error correction approaches, but it's not at all scalable. We got to five qubits, then six, and then we stopped; it's really hard. Okay, so now: how do we scale up? And there is a need for scaling up, because we are now in an era where more and more experiments are doing quantum error correction on, say, 17 or 50 qubits, numbers that would be beyond the algorithm I introduced a moment ago. But it's important to have machine learning techniques to discover better quantum error correction. So what do we do? Well, I already told you the trick: inject as much of your knowledge as possible. And that's fair, as long as you don't supply the exact solution itself, in which case of course there would be nothing left to discover; as long as there's still a lot for the reinforcement learning to do, that's okay. So, for example, we use the known structure of quantum error correction. We know it's usually some encoding part, then some error detection, maybe some correction, and then some decoding.
And so we can split this up and say: we concentrate on the encoding, for example, and let the reinforcement learning agent discover only this. Also, smart people thought very hard about quantum error correction many years ago. So if you tell me which typical kinds of errors will occur in the noise, and which logical qubit states you have prepared, then there are conditions, called the Knill-Laflamme conditions, that you can check, which say: aha, yes, for these logical qubit states and these kinds of errors that you said could happen, I will be able to correct the errors. Because the errors acting on these logical qubit states may give you new states, but they will never merge the two states; that would be the catastrophic thing. If everything decays to the ground state, then of course you cannot recover the original quantum information, but if the two logical states just turn into states that are still orthogonal, things are good. Okay, and then there's another point: this slightly scary exponential scaling, even of the simulation. If you want to train on a simulation that runs on a classical computer, and you're trying to simulate a quantum computer, we know already that we will be in trouble at some point; otherwise there would be no need for quantum computers, if they were always classically efficiently simulable. Now, the nice thing is: if I concentrate on the large class of so-called stabilizer codes in quantum error correction, which Peter Shor and others invented in the 90s, then I know in advance that the generation of the logical states initially, and also the error detection and so on, can all be done using so-called Clifford gates, which are the simplest kinds of qubit gates you can think of. For a single qubit, these are just rotations by 90 degrees, and for two qubits it's, in spirit, still the same 90-degree rotations. And these can be simulated classically efficiently; we know this even for many qubits, even hundreds of qubits if you like. So at least on the simulation front, we no longer have a bottleneck. That doesn't completely solve the combinatorial explosion, of course; even viewed as a classical problem, with many possible gates applicable to many possible qubits, there are still very many possibilities, but at least the simulations are fast. Okay, so now I want to show you two of our most recent works, where we have gone to far larger qubit numbers, up to 20 actually. The first setting is described like this: try to find quantum error correction codes, or more precisely, find the encoding circuit together with the code; and "code" here just means the logical qubit states. The setting that we imagine is: you encode your original arbitrary state into some interesting collection of logical qubit states, and afterwards there is noise, because you send it across a noisy quantum channel, for example, and you want it to be quite immune against this noise. What this setting means, in particular, is that during the encoding the noise does not act; the encoding is assumed to be perfect. So it's really like a quantum communication scenario. Later on, we will also worry about what happens if we have noise during the encoding. Okay, so now: how do we efficiently describe the actions of such a quantum circuit? The agent, again, will try to place individual gates, like this Hadamard gate and controlled-NOT gate and so on, and we want to describe what's going on, but in an efficient way.
I don't want to simulate the density matrix or anything like that. And people know about the stabilizer formalism. It's a very smart way of describing a quantum state: instead of storing all the amplitudes of this quantum state, decomposed in some basis, you just ask, are there some operators for which this quantum state is an eigenstate? And in this setting, these operators are of a particularly simple type: they are just products of Pauli operators. For example, sigma-x on the first qubit, sigma-z on the second qubit, maybe an identity on the third qubit. And you demand that the state has eigenvalue plus one, okay? That alone would not yet fix the state. If you think about it more closely, for three qubits you need precisely three such conditions, and then the state is fixed exactly. In the abbreviated language we use, "XZ1" (X, then Z, then identity) would be one of the so-called stabilizers of the state. For three qubits, again, there are three such stabilizers, and then the three-qubit state is entirely fixed. And it's a very simple state; it cannot be an arbitrary superposition, but it's not necessarily just a product state, so it can be more complicated. A GHZ state, this "all up plus all down" state, can be represented in this manner, okay? So this is how stabilizers work. And the efficient way to simulate the action of such a quantum circuit is that whenever you apply a gate like a CNOT or a Hadamard, the only things that change are these stabilizers. For example, an X turns into a Y, or, I don't know, another Z appears here, and so on. There are very simple rules for how these stabilizers change, and for three qubits you only have these three stabilizers to write down, and each of the three changes. So suddenly the effort for doing this is no longer exponential, like in the general case; it scales in a very simple polynomial way with the number of qubits, okay? Now, in our particular setting, we are interested in generating logical qubits, subspaces so to speak. One logical qubit is a two-dimensional subspace; two logical qubits are a four-dimensional subspace. To describe such a subspace, if I have six qubits, instead of writing down six Pauli strings (this is also called a Pauli string), I write down five. So I drop one, and that gives me a two-dimensional subspace, corresponding to one logical qubit. These are then called the code generators, the stabilizers of the code. So I have five of these Pauli strings, they define a two-dimensional subspace, and that defines my quantum code. And I can simulate this super efficiently: every time you apply the next gate, all five of them change, but in a very simple way. And these stabilizers can actually be translated into bit strings. People write down things like: if you have five qubits, you write a bit string that has a one wherever there is an X in this product and a zero where there is an identity, and then another bit string with a one wherever there is a Z. And if there is a one at both corresponding places, then it's actually a Y. This is how people convert this into bit strings, and if you write down each stabilizer as one of these bit strings, you actually get a matrix, one row per stabilizer; this whole binary matrix is called a tableau.
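As a minimal sketch of this binary bookkeeping (phase bits are omitted for brevity, so the signs of the stabilizers are not tracked), here is how a CNOT and a Hadamard update the x/z bit matrices, shown on the preparation of a three-qubit GHZ state:

```python
import numpy as np

n = 3
# start with all qubits in |0>: stabilizers Z1, Z2, Z3
x = np.zeros((n, n), dtype=np.uint8)   # x[i, q] = 1 if stabilizer i has X on qubit q
z = np.eye(n, dtype=np.uint8)          # z[i, q] = 1 if stabilizer i has Z on qubit q

def hadamard(q):
    """H swaps the X and Z bits on qubit q, in every stabilizer row."""
    x[:, q], z[:, q] = z[:, q].copy(), x[:, q].copy()

def cnot(c, t):
    """CNOT: X propagates control -> target, Z propagates target -> control."""
    x[:, t] ^= x[:, c]
    z[:, c] ^= z[:, t]

# prepare a GHZ state (|000> + |111>)/sqrt(2):
hadamard(0); cnot(0, 1); cnot(1, 2)
# the three rows now read XXX, ZZI, IZZ: the GHZ stabilizers (up to signs)
```

Starting from the stabilizers Z1, Z2, Z3 of the all-zero state, the sequence H, CNOT, CNOT turns them into XXX, ZZI, IZZ, with pure bit operations and no exponentially large state vector anywhere.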
So you have this binary matrix, the tableau, that you evolve as you apply gate after gate, and it's super easy. Okay, so this is how we describe the actions. And that also means: if I want to feed the current state, after application of the gates, into the agent, all I'm doing is feeding this binary matrix into my agent. So that's simple enough. Okay, so then, what's the whole setting? How do we put everything together? What you specify for this game is, first, the error operators. You tell me there might be the possibility that qubit number i flips from zero to one; that's an X_i, one of the error operators. Maybe there are also more complicated error operators, where two qubits flip at the same time. So you tell me the set of error operators against which you hope to become immune. Then you also have to tell me which gates are available; maybe it's the controlled-NOT, maybe some other gate. The only constraint is that they have to be Clifford, these kinds of 90-degree rotations, but we know from Peter Shor and others that this is enough for this class of stabilizer codes. And then the qubit connectivity: that determines where you can apply these two-qubit gates, where you can apply controlled-NOTs between different qubits. That's also an important part; if you go to IBM, they will show you the different cloud quantum computers they have available, and each one of them has a different connectivity. Okay, and so the reinforcement learning cycle works like this: I have my agent, the agent suggests applying some gate, I apply the gate, so my quantum circuit starts to grow. I also update my code generators, which started in a very simply initialized state where basically all the qubits were in the ground state, so it's just a Z at each of the qubit positions; and this gets modified because of the gates that are acting. Then I feed a representation, as I said, this binary matrix, into my reinforcement learning agent, and it has to decide on the next action. So it's a cycle that is simple enough. Any question at this point? Yes? At least you understand what the observation is: it's such a binary matrix, representing the quantum state, or rather the subspace that I'm in. And the output is the usual action probabilities. Okay, yes, please. Say that again? Yes, yes. For this type of code, yes, it's absolutely enough to do Clifford gates, both for the syndrome detection, where you want to figure out what has happened, and also, if you need it, for the correction. And this includes a lot; this includes the surface code and so on. It doesn't include everything that you could do. For example, there are the completely different quantum error correction approaches for bosonic codes, where you talk about harmonic oscillator states and so on; that's a completely different thing. It also doesn't necessarily include error mitigation techniques, like active noise cancellation, where maybe you measure the noise, and if this noise acts on multiple qubits and I've measured the fluctuating magnetic field to have this value, then I may want to rotate the qubit I care about a little bit in the opposite direction, to compensate for this magnetic field value. That would be a small rotation, and a small rotation is not a 90-degree rotation, so automatically it's not Clifford anymore.
So there are things that are not covered by this, but a really, really large space of possibilities is covered. Yes. Ah, yes, yes. So the whole point is, this is about quantum error correction in the sense that I will produce logical states and then preserve them by detecting and maybe correcting errors, and so on. Eventually I will also have to apply logical gates, and it's at this point, because eventually, to run a quantum algorithm with quantum advantage, as you realize, you have to have more than Clifford, that people have to do these extra tricks, like preparing magic states, which, so to speak, provide the 45-degree rotation, and so on. So that point is a little bit more tricky. But the bare quantum error correction already works with the Cliffords. And that is not in contradiction with having a quantum advantage; it's just one part of the whole machinery. But you're right: the whole quantum algorithm had better have some non-Clifford gates, otherwise I could just run my little laptop to get the results. Yes, there was a question. Okay, yeah, but all the surface code quantum error correction, and the codes people actually use, are based on Clifford gates. Okay, so now, an important thing in every reinforcement learning problem is, of course, constructing the reward. As I told you, there are certain errors that can happen, and basically I want to make sure that any error that can happen when I send my logical qubits over this noisy quantum channel is one I would be able to correct. And Knill and Laflamme figured out how to make sure that this is the case. The kinds of things you have to test for are: if I have my qubit in the logical state one and in the logical state zero, and I apply an error operator, say E_alpha, onto this state and another error operator, E_beta, onto that state, they should not merge into the same state; they should still be distinguishable. It's the same thing that I already explained for the recoverable quantum information: this overlap was zero initially, before the errors, and it should remain zero after the errors. That's Knill-Laflamme, and then there are a few more of these conditions. So you go through all the error operators, and for each of them you check these Knill-Laflamme conditions; if the condition is fulfilled, we set a variable k to zero for this particular error, and otherwise to one. Then I construct my reward by summing up all these zeros and ones. The best thing I can have is that all the errors can be detected: then all these k will be zero, and the reward will be zero; otherwise I get a negative reward, so that's bad. And I even snuck in some probabilities: you can imagine that we have probabilities for these errors to happen. That's not required if I'm sure that I can correct all the errors, but imagine a situation where I'm technically not able to correct all the errors; then I want to make a compromise and focus on the errors that are most likely, and that's why I also multiply by the probabilities. Okay, so that's my reward, and I can check this reward at any moment in time while I'm constructing my circuit.
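A minimal sketch of this reward; the callable `kl_condition_holds` is a placeholder for the actual Knill-Laflamme check against the current code generators (which, in the stabilizer setting, reduces to simple tests on the tableau):

```python
def kl_reward(error_ops, error_probs, kl_condition_holds):
    """Reward as described above: 0 if every error passes the Knill-Laflamme
    check, negative otherwise, with each failure weighted by its probability."""
    reward = 0.0
    for error, prob in zip(error_ops, error_probs):
        k = 0 if kl_condition_holds(error) else 1
        reward -= prob * k
    return reward
```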
So that was one important ingredient: first the good reward. And now I come to the stabilizer simulator. If you look around on the internet, the go-to stabilizer simulator is called Stim, written by Craig Gidney at Google, and we also used that initially. But then you realize that Stim was made for a situation where you run your big quantum circuit once, on many, many qubits, and figure out the results. In reinforcement learning, and machine learning in general, you often want to do batches: we want to run many agents in parallel, and so on; we want to vectorize and parallelize things. So we did that using JAX, and now we have our own little parallelized, special-purpose JAX stabilizer simulator, which, depending on the numbers, is maybe 50 times faster than Stim. That really helped us a lot to explore much faster, and it's out there on GitHub, so you can also use it. It's not as powerful as Stim in some other regards that we don't use, but it's good for this purpose. Okay, and so, some results. You run it and you find stabilizer codes, and you can now classify them according to the number of physical qubits, the number of logical qubits, and also the so-called code distance, which tells you how many errors you can correct; for example, a code distance of three means you can correct one error. For different numbers of logical qubits you of course need different numbers of physical qubits: if you want more logical qubits, you need more. Also, the circuit size increases if you have more logical or physical qubits. And each of these points is not even just a single code that you discover; it's really a large family of codes, and what's highlighted here are some sub-parts of this family. For example, the nine-qubit Shor code would be sitting here in this little box. But as I said, it also discovers the encoding circuits, and they will depend on the connectivity and the gate set that you give it; here's just one of the encoding circuits that it discovered, for 17 qubits. So we can go up to 20 qubits, and if you only want distance three, it really takes just seconds. If you want a larger distance, then it's more complicated, of course, and we have applied many different optimizations by now. There are different classes of codes; there are the so-called CSS codes, which are very close to classical codes, and those can be handled even more efficiently because they have a certain structure. So you can play many games.
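Coming back to the JAX simulator for a moment: the reason it vectorizes so well is that the tableau updates are pure bit arithmetic, which `jax.vmap` can batch over many parallel environments at once. This is only an illustrative fragment (again without phase bits), not the actual simulator code:

```python
import jax
import jax.numpy as jnp

def cnot_update(tableau, c, t):
    """tableau: (n_stab, 2, n_qubits) array of x/z bits for one circuit."""
    x, z = tableau[:, 0, :], tableau[:, 1, :]
    x = x.at[:, t].set(x[:, t] ^ x[:, c])   # X propagates control -> target
    z = z.at[:, c].set(z[:, c] ^ z[:, t])   # Z propagates target -> control
    return jnp.stack([x, z], axis=1)

# batch over many environments, each with its own control/target choice
batched_cnot = jax.vmap(cnot_update, in_axes=(0, 0, 0))

batch = jnp.zeros((512, 5, 2, 5), dtype=jnp.uint8)   # 512 parallel tableaus
ctrls = jnp.zeros(512, dtype=jnp.int32)
tgts = jnp.ones(512, dtype=jnp.int32)
batch = batched_cnot(batch, ctrls, tgts)
```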
Now, there's one other interesting aspect here, and it's very general for reinforcement learning. In reinforcement learning, you could either set yourself a very particular problem, like solving one particular maze, one particular labyrinth, or you could say: I want to train an agent that can deal with any labyrinth. And in this case we said: maybe I want to train an agent that can deal with all kinds of noise models. In the simplest case, this just means that, for example, the probabilities for an X flip versus a Z flip are different, and I can change these probabilities. Then what I can do is either train a new agent for each choice, which is possible, or I say: I have an agent that also gets as input some specification of the noise model, for example these error probabilities, and it is trained randomly, sometimes on this noise model, sometimes on that one. Eventually it becomes what you could call a meta-agent that knows how to deal with all possible noise models. And why would we do this? Well, the reason is that it may then train a bit more efficiently, because it can reuse things it has learned for one noise model in other situations as well. And that's shown here. The horizontal axis is this noise parameter, basically the ratio of the error probabilities, and the vertical axis is the actual total logical failure probability, given a relatively high physical error probability; of course, if you reduce that, it gets better. What we see is: the orange line is the meta-agent that was trained simultaneously on all the different noise configurations, whereas the purple dots are agents that were trained only on particular choices of the noise. In some areas of the parameter range they are quite similar, but you see these places where the agents trained only on one particular parameter value did not perform as well. And we tried to make a fair comparison: the total amount of training used for the orange curve is the same as the total amount of training used for all the purple dots together. So that's interesting, and it comes up again and again: you can train agents on a specific model, or you can train more powerful agents on many choices of situations. Okay, and if you want to try it out, you can go to the GitHub of Jan, the postdoc who did this. Okay, good. So then let me go on. Now there's the question: what happens if I also get errors during the encoding? And that's really one of the big issues in quantum computing. Because, you know, here maybe I have a single physical-qubit error; I would be able to correct it with this code, let's say, no problem. If I apply a single-qubit gate, maybe this error gets transformed: if it was a kind of Z-flip error initially, it gets transformed into an X flip, but no worries, it's still one error. However, once I apply a controlled-NOT, this error will propagate, because the control qubit now affects the target qubit. And that's bad: suddenly I have two errors, the errors have spread. And that's really bad because, say, in this particular code, I cannot correct two errors anymore. So in my attempt to create a logical qubit I'm failing, completely failing, because a single physical-qubit error translates into two errors that I can no longer correct. And this is a general problem: because I will always have some complicated quantum circuit with two-qubit gates, I will have this error propagation. If I did nothing against it, I would actually be better off just keeping my single physical qubit. So this is the question of fault tolerance: preventing the proliferation of errors. There are several strategies that people have invented, and one of the most recent ones is the so-called flag strategy. What you do is introduce an extra ancilla qubit and let it couple in a smart way to the original qubits, such that if the ancilla qubit is flipped, and you measure it in the end to determine whether it was flipped, then it flags the fact that there is now an unrecoverable error: there are two errors in your circuit, you will not be able to correct them even later, and you should try again. So basically, you run this preparation; if the flag qubit doesn't say anything, you know you are good; if it does say something, you know you are bad, and you should just restart the whole logical qubit preparation. Now, this has been worked out by hand for a few situations, for a few codes, but the question is: how do we do this in general, and
can we use reinforcement learning to discover it for unknown situations? So these are the people working on this; Remmy Zen was the first author, and then there's this whole team, including, in particular, a good collaboration with Markus Müller in Aachen. Okay, so we tried out several approaches, but the one that worked best is the following. You apply your little quantum circuit, whose task it is to prepare the logical quantum state; you target a particular logical state, for example one of those discovered by the reinforcement learning approach I described ten minutes ago, and then you add a few ancillas. What you want is: first, the distance to the target state should be small in the end, okay; but also, all the errors that would be harmful should get flagged; plus an extra condition, namely that the data qubits and the flag qubits end up in a product state, because otherwise that would also be bad. So again you use stabilizer simulators and so on, the same machinery, and you see what happens. And we got very nice results. For example, here are examples of certain qubit connectivities that I think were taken from Google devices. You set aside a few ancillas, the blue ones here, as flag qubits, and a few of the qubits are even unused; and then it optimizes this reward using reinforcement learning. What I'm not even discussing here: we typically use PPO, which is an actor-critic method, so it has both a policy network and a value network, and whatever we feed into the policy network, like these binary matrices representing the stabilizers, we also feed into the value network, and so on; that's the whole machinery behind it. And so here are the results. Depending on what you are targeting, either you are as good as existing human-made solutions, or, if you are targeting interesting states like the [[5,1,3]] code, which is the smallest code that can correct arbitrary single-qubit errors, we actually found significantly better solutions than what exists. What exists is a solution with 22 two-qubit gates; we bring it down to 12 two-qubit gates. They needed six flag qubits; we can do it with two flag qubits. So it really shows that reinforcement learning is helpful. We also discovered that transfer learning helps. For example, you have trained on a fully connected qubit setup, and then you switch off some of the possible actions, so you go to a more restricted connectivity where some of the actions are no longer possible, and you continue training in this new situation. That's what's shown here in orange: continuing training on the modified situation is better than training from scratch on it. So that's helpful: you can train in some situation, go to a new situation, keep the neural network instead of deleting it, and it really works. And here's a very recent result, from a few weeks ago: we can go to this already relatively large Surface-17 code, so 17 qubits in total, ask for logical state preparation, and it is able to do this. As far as we understand, no one had found such a flag-qubit preparation strategy before. So apparently it works, and with this combination of smart rewards and stabilizer simulations we've gone far beyond those four or five qubits that we had. And again there's a GitHub, so you can just play around with it and maybe discover interesting strategies on your own by running this agent.
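As a small sketch of the PPO actor-critic setup just mentioned: both heads see the same flattened tableau. In practice one would use an off-the-shelf PPO implementation with proper networks; here is just a single shared layer with a policy head and a value head, all sizes illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STAB, N_QUBITS, HIDDEN, N_ACTIONS = 16, 17, 64, 40   # illustrative sizes
OBS_DIM = N_STAB * 2 * N_QUBITS                        # flattened x/z tableau

W_shared = rng.normal(scale=0.05, size=(HIDDEN, OBS_DIM))
W_policy = rng.normal(scale=0.05, size=(N_ACTIONS, HIDDEN))  # actor head
W_value = rng.normal(scale=0.05, size=(1, HIDDEN))           # critic head

def actor_critic(tableau_bits):
    """Policy and value estimate computed from the same binary-matrix
    observation, as described above."""
    h = np.tanh(W_shared @ tableau_bits.reshape(-1))
    logits = W_policy @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                 # action probabilities (actor)
    value = (W_value @ h)[0]             # state-value baseline (critic)
    return probs, value

probs, value = actor_critic(np.zeros((N_STAB, 2, N_QUBITS), dtype=np.uint8))
```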
Any questions about this? I don't see any; I think we still have 20 minutes or so. Okay. So far I was talking about training in simulation: this was all done in simulation, and you could then deploy the result in experiments. But what about training directly on experiments? There are big challenges. First, you need to be able to extract the reward from the experiment, so you cannot use this recoverable quantum information or similar fancy quantities, because you do not have access to them in the experiment. Also, if you do have feedback, and these are the interesting cases, you need to provide this feedback quite quickly: your neural network has to take in the measurement results and then, say within a microsecond, answer what the next action should be. So you cannot just run it on your GPU; you really have to be fast.

So here's something where we got together with people at ETH Zurich, I would say the leading superconducting-qubit team in Europe, Andreas Wallraff's team, and especially Christopher Eichler. They worked together with us on doing, for the first time, experimental reinforcement learning with real-time feedback in this context, on a superconducting-qubit system. What had existed before: people had used reinforcement learning, for example, for gate design on superconducting qubits, but that's a setting where you don't need feedback; you just run it, even through the cloud, and try out different gate sequences until you converge to something that works nicely. People had also used it for quantum-dot tune-up: you set certain voltages in your quantum-dot setup, you measure the current, then you go to other voltages, and so on. But that's really slow, and again it doesn't need real-time feedback.

The setting we have here is a single qubit, because we had to start simple: not five, not 17, just a single qubit. You measure that qubit in the usual way these superconducting qubits are measured: there is a microwave cavity coupled to it, you send a microwave signal through, and you get a very noisy measurement trace. Then you have to decide what this means for the state of the qubit. You feed this measurement result into a neural network, and the network has to suggest the next action, which would be a single-qubit gate: for example, the qubit is flipped from up to down, or, since the qubit is really treated as a three-level system, because that's more realistic, maybe you also have other gates available. The task we set was very simple, just ground-state preparation, because if you just let the qubit sit there, it will acquire some thermal occupation.

You want to be fast, and one way to be fast is to implement the network not on a GPU but on a so-called field-programmable gate array, an FPGA, which is almost as if you could program a little chip that then does exactly what you want. This is technology that people were using anyway in these kinds of setups, for interpreting the measurement result, but in a very simple way: you integrate up the noisy measurement trace, then you apply a threshold, and depending on whether you are below or above it, you say the qubit is down or up. We instead implemented the neural network on the FPGA, and there was a lot of thinking about how to optimize all the multiplications and additions, fighting for every nanosecond. That's something the experimental PhD student Kevin Reuer and my PhD student Jonas Landgraf can tell you more about; I wasn't even involved in counting the individual nanoseconds.
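For contrast, here is roughly what that conventional integrate-and-threshold readout amounts to, as a minimal sketch with made-up numbers (the function name and the toy noise model are mine, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def threshold_readout(trace, weights=None, threshold=0.0):
    """Conventional readout: integrate the noisy trace (optionally with a
    weight function) and compare the result to a threshold."""
    if weights is None:
        weights = np.ones_like(trace)
    signal = np.sum(weights * trace)
    return 1 if signal > threshold else 0  # 1 = excited, 0 = ground

# Made-up traces: the excited state shifts the mean voltage a little,
# but each individual sample is dominated by noise.
n = 200
ground  = rng.normal(0.0, 1.0, n)
excited = rng.normal(0.2, 1.0, n)

print(threshold_readout(ground,  threshold=0.1 * n))   # most likely 0
print(threshold_readout(excited, threshold=0.1 * n))   # most likely 1
```

The neural network on the FPGA replaces this single thresholding step, and on top of classifying the state it can also choose actions.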
But then there was also an interesting architectural idea, and that came again mostly from Thomas Fösel. Usually, when you have a neural network, you feed the input into the input layer, then you compute layer by layer, the usual combination of linear and nonlinear functions, until you reach the output layer and announce the output. But that would be slow, because each step from layer to layer takes some time. So what we decided instead was to change things a little. While the measurement data is still coming in, this stream of continuous values representing voltage versus time, we start feeding the first few nanoseconds of the measurement trace into the input layer; then we take one step of computation; then we feed the next few nanoseconds of the trace into the next layer, and so on. So while the input data is still coming in, the network is already doing its computation. There is a price to pay: the last pieces of measurement data are only fed into the last layers, so the computation applied to them cannot be very complex, maybe just a single layer. But it's good enough; the performance is really good. What this whole structure means is that the neural network adds only about 50 nanoseconds to the latency, because all the rest is still the measurement, so to speak: the measurement takes a few hundred nanoseconds, and we are not wasting any time, because while the measurement is going on, the network is already computing. It's a little bit similar to LSTM recurrent networks, but still different: in an LSTM, for each new time step that comes in, you use the same network structure to compute the output, whereas here the computational power shrinks as time goes on. But it's good enough.
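Here is a minimal NumPy sketch of this input-injection idea (all layer sizes, the chunk size, and the initialization are made up; the real FPGA network differs in its details): each layer receives the previous layer's activations concatenated with the chunk of the trace that has just arrived.

```python
import numpy as np

rng = np.random.default_rng(1)
chunk = 8       # trace samples arriving per computation step (made-up size)
hidden = 16     # neurons per hidden layer (made-up size)
n_layers = 4    # one layer per incoming chunk
n_actions = 4   # e.g. two flips, idle, terminate

# Each layer sees the previous activations PLUS a fresh chunk of the trace.
W = [rng.normal(0, 0.1, (hidden, (hidden if l > 0 else 0) + chunk))
     for l in range(n_layers)]
W_out = rng.normal(0, 0.1, (n_actions, hidden))

def policy(trace):
    """Forward pass that consumes the trace chunk by chunk while computing."""
    h = np.zeros(0)
    for l in range(n_layers):
        new_data = trace[l * chunk:(l + 1) * chunk]  # data that just arrived
        h = np.tanh(W[l] @ np.concatenate([h, new_data]))
    logits = W_out @ h
    p = np.exp(logits - logits.max())
    return p / p.sum()                               # action probabilities

print(policy(rng.normal(size=chunk * n_layers)))
```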
And so that works. It has a microsecond cycle time; it's really fast. We believe it's the fastest neural-network agent that currently exists, at least among published ones, as far as we know; the second fastest I know of was for some plasma-physics control, and that's a hundred times slower. Once you run it, you do it in the following way. You let it run for, I don't know, a few hundred shots or so on the real experiment; it collects the actions and the rewards and so on, and sends all of this to a PC. On the PC there is another copy of the neural network; the PC then applies, say, the PPO learning algorithm from Stable Baselines or whatever, and updates the neural-network parameters, because that part does not have to be incredibly fast; what has to be fast is the feedback loop. After the update has happened, you download the neural-network parameters to the FPGA, it runs the next thousand trajectories, uploads them to the PC, the PC updates the parameters again, and so on. After running this for three minutes and many thousands of trajectories, you have a fully trained agent. That was really nice to see.

Then you can analyze the strategy if you like: for a given input, what does the agent suggest in terms of actions? Now, that's a little tricky, because the input, remember, is this very noisy measurement trace, so it's a high-dimensional vector. How do I even plot the action as a function of the input? What I can do, to visualize it, is take this high-dimensional vector and integrate it with some weight function, so that I get moments, if you like, of this vector; then I take two of these moments and plot everything in the two-dimensional space they span. That's already good enough for visualization. So what I'm showing here, in each of these panels, is the probability of a certain action. For example, here is the action of flipping from the second excited state down to the ground state, because we are treating a three-level system to make it more interesting. What's the probability of suggesting this action, depending on the input? The input is visualized in terms of these two coordinates, even though it is of course higher-dimensional. And you see: there is a certain area where the agent says, now is the time to flip from the second excited state down to the ground state, and there are other areas where other actions are preferred. So you can analyze all of this. And again, it's completely model-free: maybe my qubit is not properly calibrated, maybe other funny things are happening, the signal gets distorted, and so on; it will adapt to all of this. Any questions about this? We were very happy when this finally worked.

Okay, so on the most basic level it's just a standard fully connected neural network that takes the measurement trace as input. If this axis is time and this is, say, the voltage that I measure, discretized, I get a very noisy trace, which then turns into the input vector: all of these values go into the input-layer neurons. The only thing it has to calculate is a policy network: it has to produce action probabilities. One of these would be the flip from the second excited level down to e, another would be, I don't know, e to g, another would be the probability of doing nothing, idle, which is also interesting, and maybe there's a termination action that says: okay, I think I have now prepared the ground state, so here it is. So it is a usual policy network, and the only trick, as I mentioned, is that in order to save time, while it is doing its computation layer by layer in the usual manner, I am still injecting input. For example, some of the neurons in the first hidden layer are set aside to receive new input. That's unconventional; it doesn't happen in a normal neural network, where all the input enters through the input layer, then you calculate these values, then the next ones, and so on. Here, a few of these neurons are initialized according to some of these voltages. Yes, it's a little bit funny. And exactly: the input layer may see just the first few voltages in this case, plus we also put in some summary information about earlier measurements, and a few other tricks; so no, it's not the full trace in that case. These are little tricks.

Yes, that was TensorFlow, actually, and I was just about to mention that: on the PC it was developed using TensorFlow, and when it was programmed onto the FPGA, I don't even remember which detailed technical language they used, but it was all implemented by hand, which is not difficult for a fully connected neural network if you only want to execute it rather than train it: it's just multiplications and additions and applying a sigmoid or some other nonlinear activation function. But one of the interesting pieces is that such an FPGA does not do floating-point arithmetic; it does fixed-point arithmetic, where you have to decide how many digits you have before and after the decimal point. And one of the benefits of TensorFlow is that you can tell it: please try this neural network in fixed-point arithmetic. So already on the PC you can check whether the accuracy is good enough, even though you're using fixed-point arithmetic.
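To get a feeling for the fixed-point issue, here is a small hand-rolled emulation in NumPy (my own illustration of the idea, not the TensorFlow feature just mentioned): quantize the weights and inputs of one layer to a fixed-point grid and compare against the floating-point result.

```python
import numpy as np

def to_fixed_point(x, int_bits=3, frac_bits=12):
    """Round to a signed fixed-point grid with `int_bits` bits in front of
    and `frac_bits` bits after the binary point, saturating at the limits."""
    scale = 2.0 ** frac_bits
    limit = 2.0 ** int_bits
    return np.clip(np.round(x * scale) / scale, -limit, limit - 1.0 / scale)

# Compare one network layer in floating point vs. emulated fixed point.
rng = np.random.default_rng(2)
W = rng.normal(0.0, 0.5, (16, 32))
x = rng.normal(0.0, 1.0, 32)

exact = np.tanh(W @ x)
fixed = np.tanh(to_fixed_point(to_fixed_point(W) @ to_fixed_point(x)))
print("max deviation:", np.max(np.abs(exact - fixed)))  # tiny for these settings
```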
So these are the little details there. Okay, maybe in the last few minutes I want to show you something where we went to really many qubits, and that's work again with Thomas Fösel, together with a Google research team. The motivation: these near-term intermediate-scale quantum devices already have pretty large qubit numbers, like 50 or 100. So can we do anything with reinforcement learning there? Well, I told you there are cases where Clifford simulators are available; that's one possibility. But here's another. If we set ourselves the task of optimizing a given quantum circuit, we can also do something smart. Suppose you have compiled a quantum algorithm exactly into a certain circuit. You are sure it does what it should, but maybe it's a little too large, and you want to shrink it, because every additional gate and qubit results in more errors and needs more time. So you want to optimize the circuit. And here's something you can do: there are exact transformation rules. For example, you know that these four CNOTs together are exactly the same as these two CNOTs; you can just prove it. So if I give you a big quantum circuit, you can try to identify little building blocks, like these four CNOTs, wherever they are present, and replace them by the two CNOTs. And the good thing is that you do not need to recalculate what the circuit does: if you guarantee me that the circuit did the right thing in the beginning, then even after applying the transformation it will still do the right thing. That completely avoids the need to simulate.

What we set out to do was to train one agent that can optimize any circuit. Again, this game could be played in many different ways. You could say: here I have one circuit, let me do reinforcement learning to optimize this one particular circuit; in the end I have an agent that is able to optimize only this circuit, and for the next circuit I train again. But here we wanted to train an agent that can take an arbitrary circuit and suggest simplifications.

So this is the pipeline; you know these diagrams by now. The agent suggests an action, which really says: at this point, please apply transformation rule number five. Then the circuit has changed, and the circuit goes back into the agent as an observation. So we also had to figure out how to represent the circuit, and that's what I'm showing here. There are many possible ways; it could, for example, be represented as a graph. But what we did was to represent it like a color picture, a two-dimensional image: one dimension is which qubit I'm looking at, the other dimension is time, and then there's a third, let's say channel, dimension, which I use to encode which gate sits there. So at this location in space and time I have a Hadamard, or a CNOT going up to the next qubit, or a CNOT going down to the previous qubit. This was good enough for our purpose: encode the circuit as an image, feed it into the agent, and the agent has to produce an output. Again there is a value network and a policy network, but let's focus on the policy network. Its output again has this two-dimensional structure, and now the different channels mean different transformation rules: at this spot, with this probability, apply transformation rule number three; at that spot, with that probability, apply another rule. Then you sample from all of these probabilities, most likely picking the rules with the highest probability, and apply those transformations.
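A toy version of these two representations might look as follows; the gate set, the number of rules, and the array layout are invented for illustration and are not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical gate set and number of rewrite rules (sizes made up).
GATES = {"H": 0, "CNOT_up": 1, "CNOT_down": 2, "T": 3}
N_RULES = 8

def circuit_to_image(circuit, n_qubits, depth):
    """Observation: encode (gate, qubit, time) tuples as a multi-channel image."""
    img = np.zeros((n_qubits, depth, len(GATES)))
    for gate, qubit, t in circuit:
        img[qubit, t, GATES[gate]] = 1.0
    return img

def sample_action(policy_logits):
    """Action: one logit per (qubit, time, rule) location; sample where to
    apply which transformation rule."""
    p = np.exp(policy_logits - policy_logits.max())
    p /= p.sum()
    flat = rng.choice(p.size, p=p.ravel())
    return np.unravel_index(flat, policy_logits.shape)

# Example: H on qubit 0 at step 0, then a CNOT pair between qubits 0 and 1.
circuit = [("H", 0, 0), ("CNOT_down", 0, 1), ("CNOT_up", 1, 1)]
obs = circuit_to_image(circuit, n_qubits=2, depth=3)
logits = rng.normal(size=(2, 3, N_RULES))  # stand-in for the network output
print(obs.shape, sample_action(logits))    # (2, 3, 4) and a (qubit, time, rule) triple
```

The useful property is that observation and policy output live on the same (qubit, time) grid, which is what lets a convolutional agent trained on small circuits be applied to larger ones.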
Then there comes the question of how you even train this. In principle, you could give me a lot of compiled quantum algorithms and I try to improve them, but there is only a limited number of those. What we found works better is to use random circuits. What Thomas did in this case: he generated random quantum circuits, then applied random transformations to them, which, when applied randomly, actually have the tendency to blow up the circuit; then he feeds this really long, terribly inefficient quantum circuit into the agent and asks it to apply the transformation rules to make it better again. The agent is able to learn this, and one of the advantages of doing it this way is that you already have a kind of benchmark: the initial circuit, before the blow-up, is shorter, so the agent should at least get back to that, and maybe even do better.

I won't go through the details, but you can apply it: first you train on relatively small quantum circuits, and then you can apply the agent to much larger circuits, because it's a convolutional-network agent, and if it works on a small image, it also works on a large image. That works very nicely. Here, for example, it's shown for 50 qubits: in the background, in gray, you have a randomly chosen circuit that is still unoptimized, and in blue you have the circuit the agent produced after looking at it and applying its transformation rules.

And I still want to say something about efficiency, because you want to compare against something. A very general technique for such optimization tasks is simulated annealing: basically, you pick these transformation rules somewhat at random, but you decrease a temperature over time, so that you preferentially apply the transformation rules that actually shrink the circuit. The point is, if you apply this to a circuit of that size, then to reach this level of performance you need something on the order of a week, and that is comparable to the full training time of the general reinforcement-learning agent, which afterwards is relatively quick at optimizing any particular circuit. So as soon as you optimize more than one circuit, it already pays off. Okay, I think that's enough.

Thank you very much, Florian, for this great lecture series. Are there still some final questions from your side? Or is everyone happy? Everyone is happy, that's good. Now it's time for lunch, and at 2:00 or 2:15 we will meet again for the final lecture by Filippo Vicentini. But for the moment, let's thank Florian again for this fantastic series.