Okay, thank you everyone for joining today. We're very excited to have Jonas Buckley and Federico Felici with us to talk about magnetic control of tokamak plasmas through deep reinforcement learning, where they replaced separate controllers with a single policy, trained with reinforcement learning, that is able to control a diverse set of plasma configurations. A little bit about our speakers today: Jonas is a senior research scientist at DeepMind, and he has been working at the intersection of machine learning and control for most of his career, as well as on a wide range of interdisciplinary topics. And Federico is a research scientist at the Swiss Plasma Center at EPFL, where he leads the research activities in advanced plasma control and future fusion devices. I'm very excited about today's topic. We will have questions and discussion at the end of the talk. So without further ado, let's start the seminar.

Thank you very much for the introduction. It's my great pleasure to give the first part of this talk; my colleague Jonas Buckley will take the second part. First of all, let me say that it's really a pleasure to give this talk on behalf of this big team, consisting of people both from the EPFL Swiss Plasma Center, which is where I work, a center for research on plasma physics, for fusion in particular, and the DeepMind team, where all of the experience and expertise with machine learning, and reinforcement learning in particular, came from.

So without further ado, let's begin the story. First, let me introduce what nuclear fusion is, what we are trying to do, and what tokamaks are. I'll give a very brief introduction, because of course covering the topic exhaustively would take much more time than we have today. Basically, the idea is that we are trying to achieve nuclear fusion on earth, with the goal of using it for electricity production. The problem with nuclear fusion is that while it has many advantages from the energy generation point of view, it's quite complicated to achieve on earth, because you need a high temperature, a high density, and a long confinement time. At these kinds of temperatures in particular, the matter you're trying to fuse is in the so-called plasma state, which means that the ions and the electrons are dissociated and flying around freely, forming this complicated material we call plasma, which behaves in very complicated ways but can also be influenced by electromagnetic fields. So the idea of magnetic confinement fusion is to use magnetic fields to confine this hot, high-temperature plasma and keep it far away from the walls of the fusion reactor that you are trying to build on earth, from which you're trying to generate electricity in the end. Typically we're talking about temperatures of more than 100 million kelvin; relatively low densities compared to, for example, atmospheric densities, of typically 10 to the power 20 particles per cubic meter; and confinement times, meaning the amount of time that you keep the heat and the particles confined inside the plasma, of the order of one second.
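As a rough back-of-the-envelope illustration of those numbers (this is a sketch for the reader; the D-T ignition threshold of roughly 3e21 keV s / m^3 used here is a commonly quoted rule of thumb, not a TCV figure):

```python
# Multiply the quoted density, temperature, and confinement time into
# the fusion "triple product" and compare against a commonly quoted
# D-T ignition threshold (~3e21 keV*s/m^3). Purely illustrative.
K_PER_KEV = 1.16e7          # 1 keV corresponds to ~11.6 million kelvin

n   = 1e20                  # density [particles / m^3]
T_K = 100e6                 # temperature [K]
tau = 1.0                   # energy confinement time [s]

T_keV = T_K / K_PER_KEV
triple_product = n * T_keV * tau     # [keV * s / m^3]

print(f"n*T*tau = {triple_product:.2e} keV*s/m^3")
print(f"fraction of ~3e21 ignition threshold: {triple_product / 3e21:.2f}")
```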
So, on to a particular way to do magnetic confinement. There are many different ways of achieving this, and one particular way is the so-called tokamak configuration, in which the magnetic fields are arranged in an axisymmetric, toroidal shape, which looks kind of like this. This is a picture from what is actually the biggest existing operating tokamak, JET, which is near Oxford, pretty close to where you are; it's a European tokamak. On the left you see an example of what the inside of the tokamak looks like when it is not making plasma, and on the right you see what the actual plasma itself looks like. It's interesting to see that the actual core of the plasma, the hot part at around 100 million degrees, you cannot actually see: it is so hot that it is not emitting any light in the visible part of the spectrum.

That was one example of a tokamak. There's also a smaller tokamak called TCV at the Swiss Plasma Center, on the campus of EPFL in Switzerland. We see some pictures of it here. If you look at what the tokamak looks like on the inside while we're doing a plasma experiment, it looks like this: you can see this beautiful plasma with these beautiful purple colors here. Looking again from the outside, the tokamak looks like this. You see there is all sorts of equipment around it, to take complicated measurements of the plasma, or to heat the plasma to increase the temperature. But if we remove all of that and focus on the magnetic coils used to generate the magnetic fields that confine the plasma, we have something like this: a combination of magnetic field coils. And if we remove the external coils and look in a bit more detail at how the whole magnetic field structure looks, you can see that on the inside you have a representation of the actual plasma. That's surrounded by this gray part, the so-called vacuum vessel, which is used to maintain a vacuum within which the plasma can be created, separating it from the outside atmosphere. Then around that are these red coils, the so-called poloidal field coils, which are used for the actual control of where we want our plasma to be. And then these bigger outside coils, shown in yellow, are the so-called toroidal field coils. I'll explain in a minute how all these magnetic fields interact and how they are used to generate the magnetic field that confines the plasma.

So let's look in a bit more detail at this magnetic field structure. To have successful confinement of this very hot plasma, meaning keeping it away from the outside wall of the reactor, you basically need a superposition of two kinds of magnetic fields. I'm showing in this picture how these are generated. The first one is generated by this big yellow coil, the toroidal field coil. It's called that because the current goes around like this, which means a toroidal magnetic field is generated, going around the plasma in the toroidal direction. Unfortunately, that's not sufficient to actually confine the particles of the plasma; to have sufficient confinement, you also need a field in this direction, which we call the poloidal direction. And that's actually generated by an electrical current running inside the plasma.
So the combination of the current in the external coil, which makes the field in the toroidal direction, and the toroidal current in the plasma, which makes the field in the poloidal direction, produces these helical magnetic field lines, which are able to successfully confine the plasma and keep it at high temperature for some reasonable confinement time. Now, this by itself is not enough: you also need the external coils I was showing earlier, the so-called control coils, to make sure that the plasma is maintained in the position that you actually want. Let me show you this in a bit more detail. Here is the tokamak again, now showing only the poloidal field coils, so the coils that are used for plasma control. In this case we're again showing the TCV tokamak that we have at EPFL in Switzerland, where there are many such control coils, and they allow quite a lot of flexibility in how to actually position and control our tokamak plasma.

The reason we're interested in controlling the position and the shape is that the position and shape of the plasma have a very important effect on the quality of the confinement, so on the kinds of temperatures and pressures you can reach. And this is very important, because the higher the temperature and the higher the pressure, the more fusion reactions we're going to have. Let me clarify at this point that the TCV tokamak is a device which we use for research: we don't generate a significant amount of fusion reactions, and we use it mostly to study and understand how the plasma behaves under the influence of magnetic fields and various heating systems. There are other tokamaks which do attempt to generate fusion energy. For example, the JET tokamak I was showing earlier, near Oxford, quite recently, as you might have heard, established the world fusion energy record, breaking the record for how much energy was actually generated in terms of fusion reactions during one single experiment. And this research is going towards producing electricity in the future: all these tokamaks are stepping stones towards larger and more performant reactors, which we are building now and which will operate in the future, pushing us towards a regime where more fusion energy is created by the fusion reactions than the power you have to put in to drive the magnetic fields and to heat the plasma.

Coming back to a bit more detail on what this plasma shape actually looks like and what we're trying to control: there are a number of things we care about controlling. Again, using the magnetic field coils on the outside, you can influence in detail how the plasma looks inside the vacuum vessel. We care about the actual position; in this case the position is represented by the center of the plasma here, but we might want to move it up or down for whatever reason, to study the plasma in different ways. Another important aspect we might want to control is the location of the so-called last closed flux surface, which is this black surface going around here. It is the surface that defines the boundary between the part which is hot and confined and the part on the outside, where some of the plasma that escapes interacts with the wall material, for example.
And we also want to control how this looks in detail. In particular, we're often interested in controlling the location of the so-called X-point, which you see here; this is a point of so-called magnetic null. This means that the field in the poloidal direction here is equal to zero, and there is purely a field in the toroidal direction. It can be influenced by our poloidal field coils, our control coils over here; these carry an electrical current, and they basically control where this plasma is, what kind of shape this last closed flux surface is going to have, and where the strike points are going to be. Strike points are also important, because they are the locations where the plasma that exits from the last closed flux surface follows these magnetic field lines, ends up over here, and interacts with the material of the wall.

Now, to get the plasma that we desire, we need to control the currents in these poloidal field coils, these control coils, in real time. This is typically done by a feedback controller, actually using a combination of feedforward and feedback control that I'll explain later in more detail. For now, let's just note that this means taking on the order of 100 magnetic measurements, so around 100 measurements of magnetic fields, magnetic fluxes, and the other quantities we need, and sending them to a controller, which in the case of TCV typically operates at 10 kilohertz, so 10,000 times per second. The controller then sends 19 commands to the 19 control coils, namely the voltages that the power supplies are supposed to apply to these coils, to influence their currents and ultimately steer the plasma where we want it. A small sketch of this loop follows at the end of this part.

Now let's think in a bit more detail about what we need to control. There are a number of key parameters. The first one is simply the total value of the toroidal plasma current: if you remember, part of the magnetic field, the poloidal magnetic field, is generated by the plasma current, and this has to be induced through the control coils via a transformer effect. The other things you want to control are, in particular, the radial and the vertical position, so where the plasma sits in our vacuum vessel. The total plasma current and the radial and vertical positions together are the basic quantities you need to control in every tokamak; just to have a well-controlled plasma, you need to control these three. And then, as I mentioned, you might be interested in controlling the actual plasma shape, so the shape of the last closed flux surface and where it is.
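As a concrete illustration of that measurement-to-voltage cycle, here is a minimal sketch; the names (Controller, read_measurements, apply_voltages) are illustrative placeholders, not TCV's actual control system API.

```python
import numpy as np

# Hypothetical sketch of the real-time loop just described: read on
# the order of 100 magnetic measurements, compute 19 coil voltage
# commands, repeat at 10 kHz.
DT = 1.0 / 10_000  # control period: 10 kHz, i.e. 100 microseconds

class Controller:
    def compute_voltages(self, measurements: np.ndarray) -> np.ndarray:
        # Stand-in for the control law (traditional or learned): maps
        # the measurement vector to 19 power-supply voltage commands.
        return np.zeros(19)

def control_loop(controller, read_measurements, apply_voltages, n_steps):
    for _ in range(n_steps):
        m = read_measurements()             # fluxes, fields, coil currents
        v = controller.compute_voltages(m)  # 19 voltage commands
        apply_voltages(v)                   # voltages drive the coil currents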
There's a further complication with the vertical position control in particular: most of the plasmas we're interested in are actually unstable in the vertical direction. Usually, for various reasons I won't go into today, we want to control plasmas which have a so-called elongation higher than one, meaning they're taller in the vertical direction than they are wide in the radial direction, like the one I'm showing here. For these plasmas, it turns out that the vertical position is unstable, meaning that if you only control the electrical currents in the control coils and don't look at what the plasma is doing, the plasma will, as in this illustration here, fly up into the wall and interact with it, which means the plasma will cool down and lose energy, and we're not going to have any fusion reactions anymore. This is of course what we want to avoid, which is why we need to feedback-control the vertical position to suppress this instability in the first place, while at the same time taking care of all the other control problems as well.

So now I'll spend a few minutes describing how these kinds of control problems are solved using traditional engineering techniques. Since all tokamaks need this kind of control, there have been a lot of efforts in the past to achieve magnetic control of tokamaks using existing control engineering techniques. Typically, what we do is solve a set of equations describing the evolution of our system, and we use that to pre-compute the coil currents we want to have in the control coils, and the voltages we need to apply to these coils to achieve those currents. This takes care of the feedforward part, the preparation of how we're going to make the plasma during a given experiment. Then, as I said, that's not enough, because you also need to control the system in feedback, for suppressing the vertical instability, but also to compensate for any small errors you might have made in the feedforward calculations, because the model you use for those is often not 100% perfect.

Then there is a process of designing observers and controllers. Observers means that, from the magnetic measurements you have, you need to find out the values of the quantities you want to control: the radial position, the vertical position, the plasma current. Finding out what the plasma shape actually is, so where the last closed flux surface is, involves solving a relatively complicated nonlinear partial differential equation in real time. So with the observers you have an estimate of the quantities you want to control, and then you need to design controllers for each of these. The first step is to choose a combination of the poloidal field coils, the control coils, labeled here A, B, C, D, E and F: a few coils here and a few coils there to control a given quantity. And then, once you've chosen the actuators you want to use, you design individual, independent, usually single-input single-output controllers for each of these quantities: one controller for the radial position, another for the vertical position, another for the plasma current, and another for the plasma shape. And you need to be careful to make sure there's no interaction, or to minimize the interaction, between all of these.
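To make the traditional scheme concrete, here is a minimal sketch of such independent SISO loops built from textbook PID controllers, each driving a hand-picked combination of the 19 coils. The gains and coil combinations are placeholders, not those of any real tokamak controller.

```python
import numpy as np

# One independent SISO PID loop per controlled quantity, each mapped
# onto the 19 control coils through a fixed "coil mix" vector. In
# practice those mixes are chosen carefully to minimize cross-coupling.
class PID:
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error):
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

dt = 1e-4  # 10 kHz
loops = {
    # quantity: (controller, placeholder coil combination vector)
    "radial_position":   (PID(1.0, 0.5, 0.0, dt), np.random.randn(19)),
    "vertical_position": (PID(5.0, 0.0, 0.1, dt), np.random.randn(19)),
    "plasma_current":    (PID(2.0, 1.0, 0.0, dt), np.random.randn(19)),
}

def feedback_voltages(estimated, targets):
    # Sum the contributions of the independent loops into one
    # 19-dimensional voltage command.
    v = np.zeros(19)
    for name, (pid, coil_mix) in loops.items():
        v += pid.step(targets[name] - estimated[name]) * coil_mix
    return v
```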
I should say this approach has been successful, in the sense that many tokamaks worldwide operate using controllers designed more or less along the lines I have described here, and they're relatively successful: we can make all sorts of varieties of different shapes and run tokamak experiments to study and understand the behavior of the plasmas under the influence of magnetic fields, but also other things, like how we heat and fuel the plasma, and do all sorts of detailed studies to understand how tokamaks work and how plasma confinement in tokamaks works. But still, I hope I've convinced you that these controllers, though they are effective, are also quite complicated to tune, because of the number of manual, or partly manual, decisions the control engineer has to take in this design. This really requires a lot of control engineering expertise and a lot of understanding of how the different control actuators affect how the plasma actually behaves; it is the job of the tokamak control engineers to do this, and depending on what kind of shapes you're trying to control, and how well you want to control them, this can be a relatively complicated job.

What we wanted to do in the work we are describing today is to try a completely different approach to controlling plasmas in a tokamak using the magnetic fields. The idea is to substitute the conventional control entirely, which as I described has all these different components, with a completely different architecture, where we replace all these components with one single control policy, and we try to generate the entire control policy in one go: without designing the separate controllers, solving the problem of controlling all of these quantities at the same time. At that point, we can specify directly the kinds of targets we want, so how we want our plasma to behave and how we want our plasma to look. We obtain one single control policy which does everything we want, without having to design the components separately, without having to do a separate feedforward generation, and without a separate prior estimation step for the variables we want to control; we try to do everything at the same time. So now I will hand it over to Jonas Buckley, who will explain how we solved this problem using reinforcement learning.

Great, thank you, Federico. So let's look a little bit at what reinforcement learning is and how it helped us achieve this goal that Federico has set out. You've probably all heard about various forms of machine learning; there are different flavors. One is supervised learning, where you solve a static mapping problem: you predict an output from given input variables. Then there's unsupervised learning, for example clustering algorithms, where you try to find patterns in a data set. Now, reinforcement learning is quite a bit different. It has to do with dynamic systems and the dynamics of decision making. In a nutshell, it's literally the formalization of trial-and-error learning, so quite a bit like how humans learn; it has actually been inspired quite a bit by psychology. But it has a very strong mathematical underpinning.
So what you basically do is explore and gather experience, and then learn from this experience how to do a certain task better. We often formalize this with this little action-observation loop that you see down here, where an agent interacts with an environment and there's a reward; we will talk more about that through the remainder of the talk, and a minimal code sketch of this loop follows below.

But maybe briefly: what are the advantages of reinforcement learning? Why would you want to put up with the complexity these learning algorithms bring? First of all, it makes very few assumptions: you can learn to control nonlinear, stochastic dynamical systems. In principle, you can handle heterogeneous and very large observation spaces, combining, for example, time series and images on different timescales, and you can have very heterogeneous action spaces. And you have one way to tell the agent, or the algorithm, what you want it to do: the reward function. It's the one and only way to define your control objective, what a good outcome is, and this allows for large flexibility. However, it also means that in these reward functions you really have to define all the specifics of a good solution, and it turns out this can be a tricky thing: humans are actually not very good at transferring what they know to be a good solution into this kind of scalar value which is the reward function. We will talk a little bit more about that.

Many of you might have seen that in the last decade there were very big successes of RL in machine learning. Here at DeepMind there was work on solving many really difficult board games, like shogi and go and chess, with single algorithms by now, or the very complex online game of StarCraft. But there are also successes in the real world: here in the middle is just one example from our friends at OpenAI, where they solved a Rubik's Cube with a real robot hand. But maybe let's quickly look at the big difference between games and the real world when it comes to control and reinforcement learning. In games, you have discrete states and actions; the choice of actions might be very large, but it's always finite, and typically it's actually fairly small, and you can have a very large but always finite state space. Also very importantly, the simulations, which are the game engines, are perfect: they are perfect representations of the physics of the thing you really care about. Whereas in real-world problems, in particular when it comes to things like plasma physics, that's not true. The states and the actions of the system are, depending on how you want to look at it, infinite: there's an infinite choice of actions and an infinite state space. No matter how you slice it, it's going to be even larger than on the left side. And very importantly, the simulations will always be approximations to reality, and fairly crude ones on top of that. There are a lot of physicists in the audience, obviously, and you know what it takes: you slice away a lot of the complexity in order to come up with models of the world, and that's the basis of simulations.
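Here is that agent-environment loop in its simplest form; `env` and `agent` are placeholders with a Gym-like interface, nothing specific to the tokamak setup.

```python
# The canonical agent-environment loop from the slide, in its
# simplest form.
def run_episode(env, agent, max_steps=1000):
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(obs)               # policy chooses an action
        obs, reward, done = env.step(action)  # the world responds
        agent.observe(obs, reward)            # experience to learn from
        total_reward += reward
        if done:
            break
    return total_reward
```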
Here's another tricky aspect of reinforcement learning, best illustrated with this little critter that learns to run over really difficult terrain. Here you see the number of trials and what the outcome is, and this is just a 2D problem, if you will, with a few degrees of freedom. After 200 trials it cannot even stand yet. After 2,000 trials it sort of knows how to get forward, but it doesn't get very far. After 6,000 trials you start to see some sophisticated behavior, but it still doesn't get very far. And you have to get all the way to about 15,000 trials before the critter has really learned to get over that terrain proficiently. This just highlights that reinforcement learning is very data hungry: you usually need a lot of data, a lot of trials, a lot of experience gathered from your environment, in order to find good solutions.

So let's look at what these different components in this agent-environment loop look like for our problem. What is an environment? The environment stands in for your world; control engineers will sometimes call it a plant. Very basically, you put the actions in, then you can observe what is going on, you have some measurements, and there's also a reward being generated somewhere. We can discuss whether that should be exactly part of the environment, but there is an oracle that, based on what happens in the world, gives you a reward attached to it. In our case, the environment is the TCV tokamak. There are 19 control actions, and we use 92 observations, which are mostly magnetic measurements from flux loops and magnetic probes, plus the coil currents in the control coils. And we can compute the reward that we talked about.

Now, this is not an environment you can usefully learn against directly. First of all, it's an unstable system, as Federico pointed out, so you don't want to start from ground zero and mess around with it. More importantly, on this specific tokamak the discharges are short: they last about two seconds, and you only get an experiment at most about every 10 to 15 minutes, and it's a shared facility, so there are actually a lot of people from around the world who want to use it. So realistically you can only get a handful of experiments, and that's just not enough; it's far away from those 15,000, those thousands and thousands of trials that you typically would want to have.

So the usual solution is to train against a simulator: you have a stand-in. As we said before, these are models. What we're using is the FGE simulator, which has been developed at EPFL. It's a free-boundary Grad-Shafranov evolution solver, augmented with circuit equations for the conductors. In addition, we had to augment this basic plasma physics model a little bit more: we had to add a small model for the power supplies, which mostly has to do with delays, but a little bit also with offsets. Just as a take-home message: the simulated environment has to expose the relevant dynamics of your control problem to the reinforcement learning algorithm; it doesn't have to be perfect. But it does have to have the relevant dynamics, and as the control engineers in the audience will understand, delays, for example, are an important one: if you miss an element like that, your solutions will not transfer onto the real plant.
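To illustrate the kind of power-supply augmentation just mentioned, here is a toy wrapper that delays and offsets the actions before they reach a physics simulator. This is purely a sketch of the idea; it is not the FGE interface, and the numbers are made up.

```python
from collections import deque
import numpy as np

# Wrap a physics simulator so that commands take effect only after a
# fixed delay, with a small offset: a crude stand-in for a power
# supply model, showing why missing delays breaks transfer.
class DelayedSupplyEnv:
    def __init__(self, sim_env, delay_steps=3, offset=0.0):
        self.sim = sim_env              # the underlying physics simulator
        self.delay_steps = delay_steps
        self.offset = offset
        self.queue = None

    def reset(self):
        # Pre-fill the queue so the first few steps apply "no action".
        self.queue = deque([np.zeros(19)] * self.delay_steps)
        return self.sim.reset()

    def step(self, action):
        self.queue.append(action + self.offset)  # new command enters the pipe
        applied = self.queue.popleft()            # an older command applies now
        return self.sim.step(applied)
```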
So what did we try to achieve? Roughly in increasing order of difficulty, which is also more or less how we went about it, these are the experimental goals. The first goal is basically just to keep the plasma alive. As Federico said, the plasma is unstable, so if you don't do this well, it goes vertically down and you lose it after only about 24 milliseconds. If you learn how to control it, as on the right side, you can keep it alive for 550 milliseconds, which was at the time the full control window we had. But you see it wobbling around a little bit, so it's not exactly a beautiful solution. The next step is to tell the plasma where it should be, so to stabilize the center of the plasma at a given location. You can teach that to the algorithm by saying: don't worry so much about the shape, just make sure the plasma is at this position. The reinforcement learning algorithm will then actually find a really small, round plasma, which, as we have been told by our friends in fusion, is a nice and stable solution; the algorithm exploits the inherent stability. But this is not necessarily the shape you want: as Federico indicated, there are other shapes you might want to have, for example a bigger volume, so that you have more energy in the plasma.

Ultimately, there are many other things. This is the figure that Federico showed before: you might care about any of these quantities and tell the algorithm, via the reward function, where that flux boundary should be, where the X-points should be, whether they should be active or passive, where the legs should be, because that's important for energy deposition, and so on. And you typically want this with some precision: in our work we set ourselves a requirement of about two centimeters for the shape parameters, and the plasma current Ip to within five kiloamps, which is a realistic quantity, as we have been told. These can all be time-varying quantities, so we specify over time what the different quantities should be.
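A hypothetical sketch of how such time-varying targets and tolerances might be folded into a single scalar reward; the functional form, weights, and field names are illustrative assumptions, not the reward actually used in this work.

```python
import numpy as np

# Weighted, smoothly saturating scores on the errors of the quantities
# we care about. The tolerances echo the stated goals (~2 cm on shape,
# ~5 kA on Ip).
def score(error, tolerance):
    # 1 at zero error, decaying smoothly towards 0 once the error is
    # much larger than the tolerance, so the learner always has a slope.
    return float(np.exp(-(error / tolerance) ** 2))

def reward(state, weights=None):
    weights = weights or {"shape": 1.0, "ip": 1.0, "x_point": 0.5}
    terms = {
        "shape":   score(state["shape_rms_err_m"], 0.02),  # ~2 cm
        "ip":      score(state["ip_err_A"], 5e3),          # ~5 kA
        "x_point": score(state["x_point_err_m"], 0.02),
    }
    return sum(weights[k] * terms[k] for k in terms) / sum(weights.values())
```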
So what is the agent that works against this environment and is supposed to learn the task? It's the mirror of the environment: the thing that takes the observations and the reward and produces the actions that are supposed to solve the task. To make it a bit more formal: we want to find an optimal policy, which in our case, and typically in reinforcement learning, maximizes an expected discounted sum of future rewards. You're getting your reward at every step, and from that you can compute something called the state-action value function, which is the expectation of that discounted sum of future rewards, as a function of the state you're in and the action you take. In a nutshell, it tells you: if I'm in a given state, then for each action I could take, and then following the given policy π afterwards, what reward would I get? Now, in a case like the tokamak, with its really large state space, this is a hugely complex mathematical function; it's not something you can just enumerate. That's exactly where neural networks come in: we represent it with a neural network. For smaller problems you can sometimes get away with table-based or even exact approaches, but that's totally out of the question for the physics problem we're looking at.

Then there are different flavors of RL, and two important ones are, first, the value-based methods, where you literally learn this Q function: you have a candidate policy, use it to gather experience, and use that to improve the learned Q function in an iterative fashion, converging towards an optimal Q function and an optimal policy. And you can reason through why this procedure should converge, so this is not just guesswork. Then there's another important family, which we haven't used here: policy-based methods, where you don't try to approximate the Q function; you work with the policies directly, roll them out, and directly estimate the gradient with respect to the parameters of the policy. They can also be very useful, depending on the circumstances.

What we used is a value-based method, or more specifically an actor-critic method, where the critic learns the Q function from the data generated by the actors interacting with the environment, and the actor learns a policy by optimizing against that learned function. I write it explicitly here: it is Q of state and action, so what the policy needs to do is, in a given state, find the action it should take to get the maximum Q value. If you can do this for any state, then you basically have an optimal policy. And, as I said before, deep reinforcement learning basically means that we're using more or less deep neural networks for the Q function and for the policy; other than that, these are very close to quantities that we also see in optimal control.

The actor-critic method has one particular advantage: it allows for asymmetry. The networks in the critic can be very big, because they learn against the simulator in the data center; they don't have to run on the real plant. The policy, however, has to run in hard real time, at 10 kilohertz, on the real plant. So with this method we can have a very big critic and a small, very fast policy for inference. As the training algorithm we used MPO, which is a trust-region-based formulation of the actor-critic method, which adds stability to the learning. It's a fairly data-efficient method, which is important because you have to be able to deal with varying degrees of off-policyness; this algorithm is very good at that, and it's battle-tested for continuous state-action spaces. As indicated, discretizing physical systems like the tokamak is difficult; it usually gives you a combinatorial explosion, which is not what you want, so you need a method that works with continuous state-action spaces.
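Collecting the standard definitions used in this part (textbook reinforcement learning, nothing specific to this work):

```latex
% Expected discounted return, the state-action value function under a
% policy \pi, and the greedy optimal policy.
J(\pi) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}\right],
\qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t}
\,\middle|\, s_{0} = s,\ a_{0} = a\right],
\qquad
\pi^{*}(s) = \arg\max_{a} Q^{*}(s, a).
```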
When we learn, we have what is called the learning loop. We create typically many, many instances of the environment, which interact with a candidate policy. We store the outcome of this on a data server; the learning algorithm can then pull the data down and use it to improve the learned Q function and fit the policy, which periodically gets deployed back to the actors. The data server is indicated here with the replay buffer, and you just iterate over this procedure; a toy code version of this loop follows at the end of this part. Once this is sufficiently converged, you take an instance of the control policy, and we have a deployment pipeline that generates, from the TensorFlow graph, a hard-real-time-capable binary that we can hand over to the people who run the operations framework and run it in hard real time on the plant.

Okay, maybe just a quick word about the importance of the asymmetry in the method. We looked at this a little, and it turns out that it is critical. You could even make a very big feedforward network, but I think the recurrence actually helps, because it gives memory, which the physical process also has. We also looked at how many actors you need: we typically trained with 5,000 actors, but for many problems you can get away with a bit fewer. There's sort of an asymptote, but it's also very much dependent on the difficulty of the problem you're trying to solve. This is not a finished exploration; we are convinced that in the mid to long term we can bring this down to many fewer actors, to the point where people can run this on a fairly beefy desktop machine and don't need a data center for all of that.

We then transferred our simulation-trained agents to the tokamak, and the deployment, to our positive surprise, mostly just worked. There were a number of small environment variations we had to introduce to robustify things a little: we had to randomize the measurements and some of the plasma parameters a bit, and also the power supply models; some of those elements turned out to be critical, mostly the delays, but a few other things as well, like offsets. And then we iterated with our colleagues at SPC to improve the reward function, to improve the performance, and we also did some simulator upgrades.
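Putting the training-loop pieces above together, here is the promised toy, single-process rendition; in reality this runs distributed with thousands of actors, and all names here are illustrative.

```python
import random
from collections import deque

# Actors generate episodes against the simulator with the current
# policy, transitions land in a replay buffer, and a learner samples
# from it to update the critic and the policy.
replay = deque(maxlen=1_000_000)

def actor_episode(env, policy):
    # One actor rollout: interact with the (simulated) environment and
    # store the transitions for the learner.
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)
        next_obs, reward, done = env.step(action)
        replay.append((obs, action, reward, next_obs, done))
        obs = next_obs

def learner_update(update_critic, update_policy, batch_size=256):
    # One learner step: fit Q to (possibly off-policy) experience, then
    # improve the policy against the learned Q.
    if len(replay) < batch_size:
        return
    batch = random.sample(list(replay), batch_size)
    update_critic(batch)
    update_policy(batch)

# Training alternates the two; once converged, only the (small, fast)
# policy network is exported, e.g. compiled from its TensorFlow graph
# into a hard-real-time-capable binary, as described above.
```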
Good. I'll hand back to Federico, who will talk us through the results that we were able to achieve.

Thank you. Yes, so I'll show you some of the results, beginning with the one you're seeing right here, which is an example of one of the reinforcement learning control experiments we did. This was one we did to demonstrate the ability of the reinforcement learning algorithms to do a number of different things from the plasma control point of view. You see, as a function of time, that at the beginning of this experiment the plasma was created in a certain position, kind of high up in the vessel. Then it was actively moved down to a lower position, and then an X-point was created, with the so-called legs, as I explained earlier. We prescribed what we want the plasma to do, which is shown with the little blue dots there. The controller was trained as Jonas explained and then deployed on the actual tokamak, and what you're seeing here is not a simulation: it is actually the result of a feedback-controlled experiment, with the machine-learning-trained controller acting on our TCV tokamak and driving the feedback control coils.

This is just one example, and I'm showing it here in a bit more detail. You see the time evolution from left to right: at the beginning this relatively small plasma, then the formation of the X-point, and then bringing it back to the original position. At the bottom you see the actual time traces of the references we prescribed, so what we wanted the plasma to do, what we wanted the various parameters of the shape to do, with what it actually did overlaid, and you see the match here is pretty good.

That was just one demonstration discharge, and we then used the same approach to create many different configurations, many different plasma shapes, many different shapes of the last closed flux surface. One case with a relatively high elongation is shown here; another case, again with an X-point and a high elongation, has a shape similar to the one planned for the future tokamak called ITER, which is actually also being studied at TCV. And here is an example of a plasma with a slightly different shape, so-called negative triangularity: with respect to this one, it's as if the D shape were flipped in the opposite direction. These are also a topic of very active study, for a number of reasons, to assess whether they're viable and interesting for future fusion reactors. Here's another example where instead of one X-point we have two X-points, which is interesting from the point of view of how the plasma interacts with the surrounding material, the surrounding wall, when the plasma particles exit from the last closed flux surface. Again, this was all trained using exactly the same reinforcement learning procedure, just by changing the details of the reward function, so how we reward the various aspects of the target shape, then running the same reinforcement learning procedure interacting with the simulator environment, and deploying it onto the actual tokamak. All of these were actual experiments done on the TCV tokamak.

And then we moved on to do something we had never done on TCV before, which is to make and maintain two plasmas completely separately in the vacuum vessel. This is interesting for a number of different reasons; in particular, we're interested at TCV in studying in detail what happens when these two plasmas are slowly merged. In this case we managed for the first time to actually stabilize two completely separate plasmas in the vacuum vessel, as you see in the camera image here as well. What's interesting here is that because these two plasmas carry currents going in the same toroidal direction, they naturally tend to attract each other, so it's naturally an unstable system for that reason as well.
So the magnetic fields have to be controlled to keep these plasmas separated from each other, and the reinforcement-learning-trained agent was able to do this successfully: not only to stabilize them, but also to ramp up the current in each of these droplet plasmas individually, and to maintain both of them at the desired position, as we showed.

Having shown these results, and having demonstrated that this reinforcement learning approach works for magnetic control of tokamak plasmas, we also took a step back and asked: what did we actually learn, and what are the advantages and disadvantages of a reinforcement-learning-based approach to magnetic control of plasmas, with respect to the traditional control engineering techniques, which we can classify as multi-input multi-output, PID-like techniques?

As I said at the beginning, one issue with the traditional methods is that you need an explicit preliminary step where you compute the estimators for the quantities you want to control. With the reinforcement learning implementation, the whole problem of estimating the control variables and of feedback control is handled in one step, all based on the single reward function, and that simplifies that aspect. Also, in the traditional controllers you first of all need these separate state estimators, some of which are very complex, like the equilibrium reconstruction, and you need to tune all of the control loops independently. So it's quite a large number of control parameters, typically on the order of 10 to 20, which you need to tune, and for which you need control engineering expertise to figure out which parameters to change when something doesn't behave the way you expect. In the reinforcement learning solution, again, you have a joint solution to the entire stabilization and control problem in one go.

Again, for the traditional controllers you need control engineering expertise to design the controllers. In the reinforcement learning implementation you still need a lot of domain knowledge, but that knowledge goes mostly into the creation of the simulator and of the environment in which the reinforcement learning agent is to be trained. That also requires a lot of expertise and a lot of work in the physical modeling, in plasma physics, and in the physics understanding of the system you want to control, but once you have that, you are done, and you don't need a separate step of control-engineering-oriented expertise. And as I mentioned, in the traditional controllers you need to tune several different control parameters, having some idea of which parameter changes which aspect of the control. In the reinforcement learning implementation you also have tuning to do, but it's mostly on the side of the reward function engineering: you have to weigh the different things that go into the reward function, for example the accuracy of the X-point position control versus the accuracy of the control of the last closed flux surface shape. So you need to tune things in both cases, but you're tuning slightly different things.

There are some things which I would say are still an advantage of the more traditional ways of doing control.
One of them is indeed the very clear relation between the parameters and the various aspects of the control performance: there is usually a clear relationship between changing one of the parameters in the controllers and the effect this is expected to have on the actual closed-loop dynamics of the controlled system. If something goes wrong, and the system is not behaving the way you expect, you usually know what to change and what to do. With the reinforcement learning implementation, what you get out of the procedure is a kind of black-box controller, which you cannot really change in an easy way. So indeed, one of the avenues for future studies is to have tunable controllers coming out of these kinds of reinforcement learning approaches.

Another issue we encountered in this particular implementation of reinforcement learning control is that there is fundamentally no guarantee that any of the controlled variables will actually reach zero steady-state error: for example, whether the plasma position will be exactly what you ask it to be, within the measurement accuracy. That's fundamentally because the agent that was trained was a feedforward-only agent. In traditional controllers we have something called integral action: you have integral controllers, which integrate the control error, and you can therefore show that when these controllers work the way they should, some control variables go exactly to zero error in steady state. That's quite a big advantage from the control performance point of view, and it's something which would be very interesting to introduce in reinforcement learning approaches as well.
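For the control-minded reader, here is the integral action idea in miniature: a controller with this kind of internal state keeps pushing until the error is exactly zero, which a memoryless policy has no mechanism to do. A textbook sketch, not our implementation.

```python
# An integrator accumulates the control error; in closed loop, the
# output can only settle once the error itself is zero, otherwise the
# integral (and hence the output) keeps growing.
class IntegralController:
    def __init__(self, ki, dt):
        self.ki, self.dt = ki, dt
        self.integral = 0.0

    def step(self, target, measured):
        error = target - measured
        self.integral += error * self.dt   # memory of accumulated error
        return self.ki * self.integral     # nonzero output even at zero error
```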
So, to the outlook of this work. I already mentioned a few ways in which the reinforcement learning control implementation can be improved based on the work we did so far: using recurrent policies, introducing dynamics into the actor network to get zero steady-state errors or other desirable dynamic control behavior, and, as I mentioned, exposing more parameters by which you can affect the behavior of your controller after the training phase. Also, as mentioned, the policies, the agents, that we used in this work relied 100% on training in the simulator and didn't use feedback from experimental data to improve the models or the agents in any way; it would of course be very interesting to combine this approach with improving the model using experimental data. A further step for using reinforcement learning in fusion science in particular is to use it to optimize plasma performance, meaning ultimately fusion power, by controlling aspects other than only the magnetic field. For now we only control the details of the magnetic field, but in general, if you want to optimize plasma performance, you also need a model that evolves the temperature and the density of the core of the plasma, and the physics models you need for that are quite a bit more complex and contain different kinds of physics compared to the magnetic control we showed today.

So there's a lot more work to be done there, both on the side of the physics understanding, of fundamental tokamak plasma science, and on the computational and machine learning side, to accelerate the models we could use for such studies. And finally, one possible future application of this work would be so-called co-design, which means simultaneously optimizing the design of a future tokamak fusion reactor while designing the controller and solving the controllability problem at the same time.

To repeat the conclusions and the main results of this work: we demonstrated, for the first time, the application of a reinforcement learning controller for closed-loop magnetic control of a tokamak plasma, where the controller was trained entirely in simulation and tested on a real device. From the reinforcement learning point of view, this is a big step towards applying reinforcement learning techniques to real-world control engineering problems; it is one of the most complex, probably the most complex, application of reinforcement learning to a real-world engineering problem so far. One of the things we take away from this is that physics models, all the physics understanding you need to make high-quality models of your systems, are required for this kind of approach, to be able to learn from simulations. More generally, this also points towards further interesting future applications of reinforcement learning, both for accelerating fusion science, along the lines I mentioned earlier, improving plasma performance and potentially designing new devices, and for more complex real-world engineering systems that you want to feedback-control using these approaches, in particular where good models exist.

This brings me to the end of our presentation. We want to stress again that this has really been a multidisciplinary collaboration, where we merged physics knowledge, physics understanding, and simulation abilities with machine learning, and reinforcement learning in particular. Not only was this a tight integration of these various technologies; it also required a very tight integration of the two teams to achieve this result and bring these two worlds together. So with that, I thank you, also on behalf of Jonas, and if you have any questions, we will be happy to answer them.

Cool, thank you so much for this very interesting talk. So now we move on to the discussion part of our seminar. Any questions from our audience? Please just unmute and ask, or raise your hand.