And let me welcome our first speaker. Today we have Maurizio Pierini, who obtained his PhD in 2005 from the Sapienza University in Rome. After his PhD, he moved to SLAC, where he worked on the BaBar experiment and became a very important player in the measurement of rare B-meson decays. For this work, the European Physical Society later awarded him its young physicist prize. Since 2007, Maurizio has been part of the CERN staff; he is now a member of the laboratory research staff and has been working on the CMS experiment since that time. His main interests focus on the search for new physical phenomena in the data collected by the LHC experiments, and on how we can use modern tools and methods to increase and improve the physics reach and the capabilities of these collider experiments. And that's precisely the topic he's going to talk about today. So please, Maurizio, take it away.

Thanks a lot for the invitation. It's a pleasure to give this seminar, virtually unfortunately, but still. Everyone can hear me well, I hope. So we'll start. I'm going to show you a set of applications of deep learning for physics at the Large Hadron Collider, with the idea of giving you a sense of what is happening and what we expect to be able to do in the near and long-term future. I'll start by discussing the big picture a little bit. As you might know, the Large Hadron Collider is the most powerful particle accelerator in the world. It's 27 kilometers long, it collides protons at an unprecedented energy of 13 TeV, and it was mainly built to investigate the fundamental constituents of matter under these extreme conditions. Now, the mission of the LHC, when it was designed and put together, was to discover the Higgs boson or exclude its existence (and that's done), but also to characterize the nature of electroweak symmetry breaking and to answer the big questions left in particle physics: in particular, what makes the electroweak scale stable, what is the nature of dark matter, what is the origin of the cosmological difference between matter and antimatter (the fact that we basically don't see antimatter anymore in the universe), and whether there are unexpected phenomena that emerge at the energy frontier. This is basically what we're currently working on. Now, looking back at the Higgs discovery, something that doesn't always get noted is that this was also a machine learning story. In a sense, machine learning was instrumental to the discovery of the Higgs, or at least to anticipating it. We would have discovered the Higgs even without machine learning techniques, but as a matter of fact, at least for CMS, we would not have discovered it in 2012. The boost in sensitivity came from using machine learning techniques, in particular for the identification of photons, one of the golden channels in which we discovered the Higgs boson; it came from boosted-decision-tree classifiers and regressions. If you look at the plots on the right and focus on the bottom one, a cut-based selection is the yellow line, as a function of the Higgs mass. This gives you the factor in statistics that you would have needed to go from the first evidence to the actual discovery, and it's a factor of five. We didn't have such a factor in 2012. Using machine learning, this factor was reduced to 3.5, which is the statistics we had by the following summer.
So this gives you a sense of the fact that without machine learning we'd have been on the yellow line and we'd have missed the Higgs in 2012. We might have discovered it in 2015 or '16, fine, but four years of delay is not nothing. Now, going from where we stand now to the future, I'll show you a little bit what the LHC big data challenge is and how it's getting worse and worse. I have a little animation of what an LHC collision looks like, in a very pictorial way, that gives you a sense of what we're trying to deal with. We have these protons colliding at 40 MHz: every 25 nanoseconds there is a bunch crossing, so we have 40 million collisions per second, and each of them is about a megabyte of data. If you wanted to keep everything, it would be basically impossible. So instead we have what is called a trigger system. It is based on two different stages, what we call the level-one trigger and the high-level trigger. The first trigger, which is what you see now in the video, is the one that cuts the majority of the events: it goes from basically 40 terabytes per second of data down to 100 gigabytes per second, and it's completely implemented on custom electronics. The second stage runs on a small-scale data center hosted next to the detector, and this is basically a traditional HPC-like kind of environment. This is where the last selection happens: you select the last thousand or so events per second and send them on for data analysis. So we go from something like 40 million events per second to about a thousand events per second. You understand that a lot of physics is happening in these two stages of the selection, and in particular, to go from 40 million to a thousand, you really need to cherry-pick what you think is interesting. And this is where the bias from our previous knowledge comes in, which so far worked well, because we knew exactly what we were looking for. But now that we don't really know what we're after (I mean, we have ideas, but there are many of them), this becomes a constraint. Now, as I said, saving everything would result in basically the biggest dataset ever. It's the very dark line at the bottom that you see there, which I put on top of other datasets that were evaluated in an article by Wired; it was 2013, so it's a little bit old, but you get a sense of the fact that when people talk about the LHC they usually think about the light magenta, but actually that is after we dealt with our big data problem, not before. Now, things are getting worse, as I said, because we're trying to do more. After 2025 or '26 we're going to go into what is called the High-Luminosity LHC phase: we will collect more data because we will have more intense beams, the beams will be squeezed, the probability of colliding is going to be increased, so we'll have multiple collisions happening at the same time, up to 200. And this means that in the same amount of time we'll collect 10 times more data than now. Now, this is fantastic, except that you then need to navigate through the thing that you see on the right, which is pretty much a big mess.
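To make these data-flow numbers concrete, here is a back-of-the-envelope sketch in Python using the rates quoted in the talk (40 MHz of roughly 1 MB events, about 100 GB/s out of the level-one trigger, and about a thousand events per second kept at the end). The exact figures vary by experiment and year; this is just the arithmetic.

    # back-of-the-envelope trigger arithmetic with the rates quoted in the talk
    collision_rate_hz = 40e6        # one bunch crossing every 25 ns
    event_size_bytes = 1e6          # roughly 1 MB per event

    raw_rate = collision_rate_hz * event_size_bytes
    print(f"raw rate into level-1: {raw_rate / 1e12:.0f} TB/s")

    l1_output = 100e9               # ~100 GB/s out of the level-one trigger
    hlt_events_per_s = 1e3          # ~1,000 events/s kept for analysis

    print(f"level-1 keeps {l1_output / raw_rate:.2%} of the bandwidth")
    print(f"overall, {hlt_events_per_s / collision_rate_hz:.1e} of events survive")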
And to do so, we are upgrading our detectors to basically disentangle these 200 simultaneous collisions, equipping CMS and ATLAS with more granular detectors, which means a lot of extra sensors, smaller sensors, many more of them, such that we can try to get the kind of rendering that you see on the right, where we're going to try to follow each individual particle at this crazy collision rate. Now, if you have so many sensors, you have the information a priori to do your job; the problem is that you will need more computing power to do so, and even if you scale things to 2027, we're still basically a factor of 10 off. The other end of the problem is the simulation, because we rely strongly, we rely heavily, on simulations in order to understand what's happening, to define our data analysis strategy and the best ways to do measurements, to estimate the backgrounds, to optimize the event selection, and so on. This is done through a workflow that starts from simulating the event collision, then dressing the particles emerging from the collision with the result of the showering due to quantum chromodynamics, then taking all these stable particles, propagating them through the detector, and creating the equivalent electronic signals that the sensors would see, and then, from there, applying the normal event reconstruction that we use with real data, to reconstruct the view of the event as the detector would see it. Now, this full chain is extremely accurate. I would say the kind of error we make is below 3%, so it's a very faithful representation of what is happening, but it's extremely costly, and even here, in terms of CPU, we are hitting our limits. Now, I am among the few people who think that deep learning could come to the rescue, and on both aspects. On one hand, we basically have a very well understood problem: it's not that we don't know how to reconstruct these 200 particle collisions happening at the same time, it's that it just takes too much time. We do know how to do it. We have well-understood rule-based algorithms that allow us to go from the super-granular representation of the incoming particles that you see on the left to the answers to the kinds of questions that we typically ask: which kind of particles were produced, which energy do they have, which direction do they have. And to do so, we employ a lot of CPU. Now, the question is: can we train a neural network to give us the answers to these questions in an approximate way? If we can, then at least for the real-time analysis in the trigger, we will have approximate answers that can be accurate and fast and stay within the resource budget that we have. And the same kind of paradigm applies to the simulation: can we model the detector response through a neural network such that in one shot we jump from the generator view of the event, which basically means what we would see with a perfect detector, to the reconstructed view of the event, which is what we see with the detector we have? If we can model this with a neural network, the execution time would be super fast. Now, I'll show you a few examples of this in the rest of the seminar. But the one message that I want to pass, at least for the reconstruction part (of course not for the simulation), is that there are challenges in using deep learning for this.
We have to deal with the fact that high-energy physics experiments are somewhat unique places. In particular, the focus has to be on what happens at the trigger, because if you're really trying to extend your sensitivity using deep learning to remove, let's say, sociological biases in the way you do searches, then you need to act where the bias is imposed first, which is at the trigger level, when you actually go from 40 million events to 1,000 events. And this means that you have to develop ways to deploy the algorithms in very, very customized computing environments. The other thing is that our data are special. They are not unique, there are similar things out there, but they are not like the traditional images or sound sequences or time sequences that you would process with CNNs and RNNs, the convolutional and recurrent networks, the kinds of things that made deep learning so popular. Our data, and this is true in general for high-energy physics, not just for the LHC, are basically sparse collections of dots. These dots can be characterized with features like the amplitude of the electronic signal that was detected, or other kinds of things that your sensors would give you. So you have this unordered set of dots, and if you want to deal with it, you need to deal with it at the level of the architecture of the network you are using. I can anticipate that this is the reason why we're looking into graph networks a lot. The other thing is that when you run these algorithms, you need to be fast, as fast as less than a microsecond if you want your algorithm to live in the very first layer of the trigger. Now I'll show you a few examples. As I said, we have this sparse set of data. You could say, okay, it's sparse, but I can still apply a convolutional network to a sparse image. That's true. The problem is that our data are not even images, in the sense that the sensors are irregularly arranged in space, so it's very difficult to imagine how you could do this in real life. In my opinion, this is the reason why most of the HEP-related deep learning literature that deals with convolutional networks has so far found little application in real-time problems. People struggle when going from the proof of concept to the real-time problem, because the kind of detector that you have on the right doesn't look like a NumPy array. So several solutions were attempted. People pixelated the detector with some coarse resolution, but then you lose some information. People tried to enforce use-case-specific ordering criteria, which sometimes is fine and sometimes is not. And as I said, we're now looking really intensively at graph networks, because we think it's the architecture that best fits the nature of the data. Now I'll take one example, which is jet tagging. Jet tagging is a typical problem to which deep learning is applied as a showcase. Jets are the most common objects that you get out of LHC collisions: basically every time you have quarks or gluons produced, you get this spray of particles emerging, which is the thing represented in the top-right part of the slide. And sometimes, when you have heavy particles that decay to quarks, like W, Z, or Higgs bosons, or top quarks, you get overlapping jets of this kind that you need to reconstruct as a single jet, and then it looks different from the others.
The idea is: from the distribution of the particles in this spray of particles that you call a jet, can you tell the nature of the particle that started the shower? Again, as I said, you can try to use pixelated representations, recurrent networks, or graphs, and we tried to do so in a comparative way. In particular, we considered one kind of graph network that was developed at DeepMind, which was mainly developed to model n-body systems, to learn body-to-body interactions, and really to learn physics. It was shown to work very well for n-body gravitational simulations, and to model Hooke's law when you have springs; there is a reference to the paper there. Now, how does it work? You have a graph, which is basically a set of objects, labeled O1, O2, and O3 there to make it simple, and you have directional edges that connect them. You build all the possible connections between your vertices, and then you build the binary matrices of objects and edges, which are normally zero, and you fill them with a one when, in the case of the receiver matrix, object number one, for instance, receives edge number four, or, in the sender matrix, when the edge starts from that object. This is basically an N x N(N-1) matrix, so a pretty big matrix. It's a mathematical trick you use to start with N constituents carrying P features each and build the matrix of the edges. You then process it through the thing that you see on the right, which effectively is a pair-by-pair processing through classic dense neural networks that tries to learn, from the input features that you specify for each initial object, an abstract representation of these objects that allows you to accomplish the task. In the case of the DeepMind simulation, this is basically a loop: every iteration is a delta-t in time, and you're trying to learn the positions of the objects after delta-t from the positions of the objects before delta-t. This is not what we try to do: we basically stop at the yellow matrix, the post-interaction object matrix, which is the learned representation of our constituents, which were the jet particles. We try to learn quantities that we aggregate through the graph using a summing function, and then with these new quantities we are basically trying to engineer features, like you typically do in a convolutional network when you have a contiguous array. Then we ask, from these features, which kind of jet this was. If you do so and you compare to the other methods I alluded to before, you see that the best performance typically comes out of the graph nets, and the only really competitive approach is the one with a classic dense neural network that takes as input physics-motivated quantities. The reason why this one is competitive is that there is a lot of domain knowledge that goes into building those features. And remarkably enough, some of this domain knowledge is recovered by the network itself: if you look at the quantities that were built by the network and correlate them with the physics quantities that we know are relevant to the problem, we observe correlations that are pretty high, and in particular across the different classes, which somehow indicates that the network is learning the right physics to accomplish the task.
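To illustrate the receiver/sender bookkeeping described above, here is a minimal NumPy forward-pass sketch of an interaction-network-style block. The layer sizes are toy values and the random matrices stand in for trained dense layers, so this shows the data flow rather than a trained model.

    import numpy as np

    N, P = 4, 6                     # toy sizes: N constituents, P features each
    Ne = N * (N - 1)                # number of directed edges

    # receiver (Rr) and sender (Rs) matrices: one-hot bookkeeping of the edges
    Rr = np.zeros((N, Ne))
    Rs = np.zeros((N, Ne))
    e = 0
    for i in range(N):
        for j in range(N):
            if i != j:
                Rs[i, e] = 1.0      # edge e starts at object i
                Rr[j, e] = 1.0      # edge e is received by object j
                e += 1

    rng = np.random.default_rng(0)

    def mlp(x, d_in, d_hid, d_out):
        # stand-in for a trained dense network, applied column-wise
        W1 = rng.normal(size=(d_hid, d_in))
        W2 = rng.normal(size=(d_out, d_hid))
        return W2 @ np.maximum(W1 @ x, 0.0)

    O = rng.normal(size=(P, N))     # input objects: features x constituents

    # edge block: per edge, concatenate receiver and sender features
    B = np.vstack([O @ Rr, O @ Rs])           # shape (2P, Ne)
    E = mlp(B, 2 * P, 16, 8)                  # learned edge representation

    # aggregate the messages back onto the receiving objects
    Ebar = E @ Rr.T                           # shape (8, N)

    # object block: combine original features with aggregated messages
    Opost = mlp(np.vstack([O, Ebar]), P + 8, 16, 8)

    # permutation-invariant summary; a classifier head would act on this
    summary = Opost.sum(axis=1)
    print(summary.shape)                      # (8,)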
Now, we also tried to apply the very same approach to a problem where the setup is similar but a little bit more complex, and in this case we used the real CMS simulation from the experiment: we didn't use our own mock data, we used the samples that CMS releases. And we managed to improve on the existing result with this technology. This is the blue line you see on the right. It's a ROC curve, the true positive rate versus the false positive rate, and bottom-right means better. The blue one, which is our interaction network, improves on the previously existing result, which was the green one; no, sorry, the orange one. The other lines are the same when you train the network while forcing it not to learn the jet mass, because you want to keep the jet mass on the side as a further discriminating quantity; it's basically your final money plot, if you want. Of course you lose performance, but the hierarchy between the algorithms stays: the interaction networks are still better. The interaction network without learning the mass is only slightly worse than the previous algorithm, which had the extra boost of sensitivity from the mass. Now I'd like to show you something similar that was done on a much more basic process, which is really when you try to process the raw data from the detector. You have a calorimeter, which is a detector that absorbs the particles coming in, and you have a lot of energy deposits left by all the particles crossing it. In this case there are two particles overlapping, and you would like to disentangle the two. You can use basically an EdgeConv, which is a classic (what is called) graph network that does message passing. The way this works is that for each vertex you build a network that learns the communication, if you want, between this vertex, this energy deposit, and the nearby ones, and then you create a new representation of your vertex, which you aggregate across the graph with some function, typically. If you think about it, this is basically what a convolutional network does, except that now you don't have a filter that scans a pixel array; you are just connecting every pixel with every pixel, and it's the training that decides which of these connections are relevant. So it fits this kind of sparse problem very well. Of course, the problem is that with all these connections, your memory blows up. So we tried to come up with some ad hoc architectures that do the same while keeping the memory under control. We did so by using analytical functions, if you want, to weight the connections, using a Euclidean distance in some abstract space that you learn. You take your inputs and you learn two things: new features, F_LR, and coordinates in some abstract space, S. This means that you are taking your graph and deforming it into something which is topologically the same, but lives in this new coordinate system that you try to learn, with coordinates S1 and S2. And there, with these new features, you run something similar to message passing, except that you don't use a neural network to weight the edges; you use an analytical function. Doing so, you keep the memory a little more under control.
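A very small sketch of this distance-weighted aggregation idea, in the spirit of GravNet-like architectures: the random projections below stand in for the trained dense layers that would produce the learned features and coordinates, so this shows the mechanics rather than a trained model.

    import numpy as np

    rng = np.random.default_rng(1)
    N, P = 5, 4                      # toy sizes: hits and raw features per hit
    X = rng.normal(size=(N, P))

    # learned features F and learned coordinates S in a 2D latent space
    # (random matrices here; in the real network these are dense layers)
    F = X @ rng.normal(size=(P, 3))
    S = X @ rng.normal(size=(P, 2))

    # pairwise squared distances between hits in the learned space
    d2 = ((S[:, None, :] - S[None, :, :]) ** 2).sum(axis=-1)

    # analytical edge weights instead of a learned edge network
    w = np.exp(-d2)                  # Gaussian potential of the distance

    # distance-weighted aggregation of neighbour features, one row per hit
    F_agg = (w[:, :, None] * F[None, :, :]).sum(axis=1)
    # F_agg would be concatenated with X and passed to the next dense block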
And in this other variation, we did something similar, except that this time we don't do an N-to-N connection: we connect the vertices of the graph to a set of aggregators, which are typically much smaller in number, order 10 rather than order hundreds or thousands. This way you have N-times-K connections, so the memory footprint is smaller. Despite having fewer connections, this network works pretty well. You see an example of true versus reconstructed: red is one particle and blue is the other, and what you get on the right is pretty faithful to what you have on the left, which is the ground truth. This is also shown by the plot where, as a function of the energy of the incoming particles, we look at the mean and the variance of the response and compare across the different models. We see that on one hand we can recover the performance of the EdgeConv, or pay a little bit of a price (the red line is slightly worse, if you want), but in terms of the resource consumption that you see at the bottom, it's much faster and less memory-demanding, which makes it attractive if you want to run it in a trigger. And now I come exactly to this point: how do we run these networks in the trigger? This is the same scheme I showed before, with the two levels of trigger. I don't think the high-level trigger is an issue, because what we're doing these days is buying GPUs that will be plugged into those machines, so it's yet another HPC site with GPU accelerators, pretty much the environment where the usual industry paradigm applies. I don't think this is challenging. The problem is the first level of the trigger, where you basically have no CPU: there are pieces of custom electronics, boards with FPGAs on them, on which you execute logic circuits. Here we needed to do something ourselves, because it's a sort of unique problem (maybe finance has a problem like this): trying to run inference at hundreds of nanoseconds on such custom electronics. So we came up with this project called hls4ml, which uses high-level synthesis to do this. The library takes a model that you compress (I will talk about compression after) and converts it, generating C++ code written with HLS pragmas that you pass to an HLS compiler. These compilers are tools distributed with the FPGAs by the vendors, and they translate a piece of C++, written in the proper way, into a piece of firmware. So, with the library in the middle creating this C++ code, we can basically go from Python and TensorFlow to the firmware of the card and deploy the network as a logic circuit on the chip. You could even print an ASIC out of that, if you want. Now, let's go back to the very same problem I showed you before, and in particular to the dense neural network solution. We take 16 input features and, as I showed you before, we tag jets depending on whether they are quarks, gluons, Ws, Zs, or tops. This is a pretty small network: three layers with 64, 32, and 32 nodes, 16 inputs, and five outputs. But if you look at what it means in terms of logic, it's this, and this is way too big to fit in an FPGA. So what do you do? You compress the model, first of all by getting rid of all the connections that you don't need.
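As a rough illustration of that Python-to-firmware flow, a conversion script with hls4ml looks something like the sketch below. The model file name and FPGA part string are placeholders, and the exact API details depend on the hls4ml version; this is a sketch, not the project's actual build script.

    import hls4ml
    from tensorflow.keras.models import load_model

    # load a trained (and ideally pruned) Keras model; file name is a placeholder
    model = load_model('jet_tagger.h5')

    # build an HLS configuration: fixed-point precision and parallelism
    config = hls4ml.utils.config_from_keras_model(model, granularity='model')
    config['Model']['Precision'] = 'ap_fixed<16,6>'   # 16 bits, 6 integer bits
    config['Model']['ReuseFactor'] = 1                # fully parallel, lowest latency

    # generate the HLS C++ project for a given FPGA part (placeholder part number)
    hls_model = hls4ml.converters.convert_from_keras_model(
        model,
        hls_config=config,
        output_dir='hls_jet_tagger',
        part='xcvu9p-flga2104-2-e',
    )

    hls_model.compile()        # C simulation: check the fixed-point accuracy
    # hls_model.build()        # run the vendor HLS synthesis (slow)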
So you make an assessment of which connections are important, and you drop the others. Dropping them means you set the weights to zero, and you exploit the fact that the typical HLS compiler is smart enough to understand that if it has to multiply by zero, it just doesn't do anything, and you save resources. The other thing is that you use a quantized numerical representation, which means that you don't use 32 bits or 16 bits, but you go as far down as you can. What we studied here is how fast we can run this network as a function of the numerical representation, after the pruning. What we see is that we can get the network to run within 75 to 200 nanoseconds, depending on which fraction of the chip we allocate. If you look at the 16-bit representation with six integer bits, you have something between, I think it was, 10% and 20% of the chip being used, and the latency stays below 200 nanoseconds. So this was a demonstrator for us, but then we went beyond it: we tried to play with extreme quantization, namely binary and ternary networks, where all the numbers in your network are one bit or two bits. For instance, in the top left you see how an activation function looks in that case: it's basically a sigmoid quantized with two or three bits. The other thing you can exploit is that you can turn multiplications into logic operations that don't require multiplier units, the DSPs on the FPGA. So you turn the multiplications, which are the resource-consuming operations, into something extremely cheap to do, because you basically do some bit shifts. This is implemented for you in the library, and you just need to come with a trained binary network; then you can again reach latencies below the microsecond, with very small utilization of the card in this case. The next step we took was to look at quantization-aware training. For this, we teamed up with Google, which is developing a library called QKeras: Keras with quantization. The idea is that you apply the truncation of your numerical representation already when you train. Once you specify a limited numerical representation, it's no longer clear that the minimum found by your stochastic gradient descent is still the best one, and it's better to see whether any of the other minima that you ignored has a better chance of preserving its performance when truncated. So you train already truncated, and it's basically Keras with some Q here and there, where you specify how many bits you give to each part. In this plot, the dashed magenta line on the left is the same one I showed you before: the performance as a function of the bits. The resource usage is this: you see that after the 16-bit, 6-integer-bit point there is a drop, the vertical drop that you see on the right, where the magenta line goes down and then at eight bits drops vertically. This is why in the previous work we stayed at 16 bits; we didn't try to go below that. On the left you also see the accuracy with respect to the full-precision model, and below 16 bits there was really a drop in performance.
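In code, quantization-aware training with QKeras amounts to swapping Keras layers for their quantized counterparts. Below is a minimal sketch of a jet-tagger-like model; the bit widths and layer sizes are illustrative, not the exact configuration from the talk.

    from tensorflow.keras.layers import Input, Dense, Activation
    from tensorflow.keras.models import Model
    from qkeras import QDense, QActivation
    from qkeras.quantizers import quantized_bits, quantized_relu

    q = quantized_bits(6, 0, alpha=1)       # 6-bit weights and biases

    inputs = Input(shape=(16,))             # 16 input features, as in the talk
    x = QDense(64, kernel_quantizer=q, bias_quantizer=q)(inputs)
    x = QActivation(quantized_relu(6))(x)   # 6-bit activations
    x = QDense(32, kernel_quantizer=q, bias_quantizer=q)(x)
    x = QActivation(quantized_relu(6))(x)
    x = QDense(32, kernel_quantizer=q, bias_quantizer=q)(x)
    x = QActivation(quantized_relu(6))(x)
    outputs = Activation('softmax')(Dense(5)(x))   # 5 jet classes

    model = Model(inputs, outputs)
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    # model.fit(...) then trains with the quantization applied in the forward pass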
What the other lines show you is what QKeras does: it's capable of finding other minima that keep the accuracy constant. On the left, that's the ratio of the AUC of your classifier with and without quantization, and you see that it basically stays at one down to five or six bits. Which means you can now go below the 16 bits we used to stop at, down to as few as six bits, saving a lot more resources: the resource usage drops completely with respect to the magenta line. And you can go even further, because what Google is now developing is a library that decides by itself, based on some power consumption model, where to quantize more and where to quantize less. It does the exercise for you; you don't even have to specify the precision. We think this is the ultimate solution to define a workflow producing good candidates for us to use in the next runs of the LHC. Now, this works well enough that we're now capable of taking the graph nets I showed before, the second one, the one with the aggregators, slightly modified to make it a little simpler, but that's not really essential to the discussion. It's basically the same kind of graph net I showed before, on a similar problem, on the same kind of data, which is a particle entering a calorimeter. This time there is only one particle coming in, and you're trying to identify which kind of particle it is and to measure its energy, with the particle overlaid with noise coming from other particles around it. We took the same architecture I showed a few slides before, we trained the classifier and the regression, and it worked pretty much okay. Then we applied to this graph net the procedure I showed before, because we implemented in hls4ml all the code needed to generate the HLS rendering of this complex architecture for the synthesizer to synthesize. What we see is that we can run this kind of network while keeping the performance pretty high. The latency here is given in clock cycles; you need to multiply the number by five nanoseconds, because we run with a 200 MHz clock. But the idea is that you stay below the microsecond. This is the first attempt to use graph neural networks on FPGAs for high-energy physics, and certainly at this kind of latency, and we think we can have this ready by the time we have that large number of sensors to be processed in real time. Now the package is growing. We are adding support for CNNs and RNNs, on top of the graph net support I showed you, and we are also moving to a model with multiple backends. So far we worked with Xilinx, so with Vivado HLS; now we are extending the library to support Quartus HLS, so that one can use it for Intel FPGAs, and we have a third backend, Catapult, which allows you to design ASICs, and we are now using it for that as well. I'll advertise here: we'll have a workshop on fast machine learning at the end of next month. There is a link there; when you get the PDF, you will be able to click it. And next week there will be a tutorial at CERN, on Zoom, if you want to learn how to use this tool. Now, I don't know how much time I have. Ten more minutes, I think? Good. Okay, so I'll keep going and then you stop me.
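For reference, converting the reported latencies from clock cycles to nanoseconds is just a multiplication by the clock period; the cycle counts below are made-up examples, not the measured ones.

    # cycles-to-latency conversion at the 200 MHz clock quoted in the talk
    f_clock_hz = 200e6
    period_ns = 1e9 / f_clock_hz          # 5 ns per cycle

    for cycles in (40, 100, 180):         # hypothetical cycle counts
        print(f"{cycles} cycles -> {cycles * period_ns:.0f} ns (budget: < 1000 ns)")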
I have another two topics I would like to talk about, but I think I'm not going to be able to do both of them. First, I would like to go back to the simulation problem I was referring to. So, generative models are a very interesting front of research for deep learning researchers. You might know this, but very briefly I will summarize it. There are mainly two big classes of generative models; it's not completely exhaustive, but there are two main ones. One is generative adversarial networks, which means you have two networks, the generator and the discriminator. The generator is trained to generate images starting from noise, and the discriminator takes as input images of this kind, the real ones, the target ones, and is trained to discriminate the real ones from the fake ones. And you train them against each other. That's one approach. The other approach is the autoencoder. An autoencoder is a very interesting object because you can do a lot of things with it, but in particular you can make a generator out of it if you use a variational autoencoder. You compress your image, using an encoder, into basically a multidimensional, multivariate Gaussian PDF, and then, sampling from this multivariate PDF, you reconstruct the image. If you then keep only the right part, from the Gaussian sampling to the decoder, you can use that as a generator. You can do other things with this; maybe I'll get to that later. Now, again, the problem we are facing is that our data are extremely sparse; they are not images. So we tried to adapt this technology to these sparse datasets, and we started with the MNIST dataset. There is a version of the MNIST dataset called superpixel MNIST, where people select a subset of the pixels, and you can represent it as a table, which is sketched on the top right: you have the coordinates of the pixel and the intensity of the pixel. So from an image you went to a set of points, each point represented by three numbers: two coordinates and one intensity. What we did is build a distance between two images of this kind, which basically uses a nearest-neighbor distance. It's the expression D_kNN that you see at the bottom: k is just the order of the distance. If you want to use the squared distance, k equals two; the absolute value, k equals one; and so on. With that, you can define, for instance in an autoencoder, a loss function between input and output. The idea is that this thing goes to zero if you reconstruct exactly the same set of dots. But the nice thing is that it is permutation invariant: you are not requiring that the first pixel of the first list is the first pixel of the second list; you're just checking that for a given pixel in input, there is a corresponding pixel in output, wherever it sits in the list. So you are enforcing permutation invariance, the absence of an ordering, in the loss. We then used this on a convolutional autoencoder; for the moment we're still using convolutions, and the ultimate goal would be to use a graph autoencoder here. You see how this works, and you see that the reconstruction is not that bad, but we need to do some work on the Gaussian sampling. We think that the Gaussian is not the right function.
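A minimal sketch of a nearest-neighbor loss of this type, a Chamfer-style distance between two point sets; the exact form used in the work may differ in normalization and in the choice of k.

    import numpy as np

    def d_knn(A, B, k=2):
        """Permutation-invariant distance between two point sets A and B,
        each of shape (n_points, 3): (x, y, intensity) per point."""
        # all pairwise distances between input and output points
        pair = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1) ** k
        # each input point is matched to its nearest output point and vice versa,
        # so the value does not depend on how either list is ordered
        return pair.min(axis=1).sum() + pair.min(axis=0).sum()

    rng = np.random.default_rng(0)
    points = rng.random((20, 3))
    shuffled = points[rng.permutation(20)]
    print(d_knn(points, shuffled))   # ~0: same dots, different ordering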
And that's why we now want to look at normalizing flows, to morph that Gaussian into a proper prior and improve the quality of the generation. The thing is that if you now go back to the jet problem and try to generate jets, it does a pretty decent job, if you add critics to the problem: penalty terms in the loss, MSE terms, that try to force the network to learn certain physically relevant quantities, like the jet mass or the jet pT. What you see in the top row, again, is the comparison between the input and the output, and it tends to do a pretty good job. It's comparable to a traditional MSE, but our intuition is that when you go to graph neural networks, the MSE will have trouble, while this loss should be able to preserve the performance. Again, when you go to the bottom, which is Gaussian sampling plus decoding, you see that in some aspects, like the pT (the second column), the agreement between input and output is spoiled, both for our loss and for the traditional MSE loss. We interpret this as a sign that the Gaussian prior is not correct, and again we are looking into normalizing flows to try to fix this. For the graphs, same problem, but this time not only do we use a permutation-invariant loss, we also use a permutation-invariant architecture: we basically use the EdgeConv for the generator and a classic message-passing network for the discriminator. The performance here seems to be better, in the sense that on MNIST, qualitatively speaking, you see a certain resemblance between left and right, and if you go to the right and look at the distributions of the particles you get out of the jet, and even the mass-over-pT ratio (we took the ratio because of the way we normalize the target datasets; you cannot look at the mass itself, but the ratio fixes this), you see that there is good agreement between the two. So we think there is a more promising margin here, but again, for the variational autoencoder, we want to see what we can get with normalizing flows. And the last set of slides: the big question is, do we really need all of this? In the sense: do we need to go and train these kinds of objects to give us, pixel by pixel, what the detector is doing? Sometimes you do, but not all the time. If you think of how an LHC analysis is typically done, you start with this one megabyte of data, but at the end of the day you reconstruct the particles, you apply selections on the particles, and so on; in the end, your final task is typically a signal-versus-background maximum-likelihood fit with a handful of quantities at most. So the problem can be rephrased as: can I just learn the effect of this big detector on these few quantities? Can I go from what these quantities look like at generator level to what they look like at reconstruction level, and define an analysis-specific, application-specific fast simulation that allows me to cut out the computing-demanding workflow completely? We think the answer to this is yes, and it's a pretty simple problem.
Sorry, I'll go back, because at the end of the day you don't even need GANs and VAEs. What you need is a kind of density-estimator approach: you want to do Gaussian smearings with moving means and sigmas, where you estimate the mu and the sigma from regressions. In the loss term on the top right, the first term between the square brackets is an absolute error that estimates the mean of your distribution; the other term is a little less obvious, but it's basically a quantile regression of the plus- and minus-one-sigma quantiles. Then, manipulating these, you take the difference, take the absolute value, and estimate the sigma; then you can Gaussian-sample, and you represent your distribution as a sum of little Gaussians, one on top of another and one next to another, that reconstruct the full distribution. And it works pretty well. Of course it works well when the effect of the detector is small, as you can imagine, which is the case for the top-row quantities, but it does a pretty good job also when the effect of the detector is big, which is the bottom row. The orange is the generator-level quantity, what you would see if you had a perfect detector that could see all the neutrinos and all the particles coming out, with no resolution effects; the red is what you actually see, because of the detector you have; and the blue is what this model gives you. You don't care about the disagreement between the blue and the orange; you only care about the agreement between the blue and the red. But the distance to the orange tells you that you are not just learning a local smearing of the generator-level quantity: you are learning big shifts, which means you are really catching most of what the detector resolution is doing. If you do this, you can imagine that, rather than generating all this large statistics of full simulation, which takes forever, you generate just 10% of it, and for the remaining 90% of the events that you need, you stop the simulation at the very first step, which is basically the event generation, and that is super light. Then you use a model like this to process that 90% and give you what the reconstruction would give you for the specific quantities at hand. Of course, this means that every analysis needs to train its own model; fine, the training takes a few hours, but centrally you only run the expensive simulation chain on 10% of the data, and you save 90% of the resources, which is the famous factor of 10 I was talking about.
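A sketch of how such a loss can be written, under the assumption that the two extra terms are pinball (quantile) losses for the plus/minus one-sigma quantiles (roughly the 16% and 84% quantiles of a Gaussian); the exact combination in the actual work may differ.

    import numpy as np

    def pinball(y, y_pred, q):
        # quantile (pinball) loss: minimized when y_pred is the q-th quantile of y|x
        e = y - y_pred
        return np.maximum(q * e, (q - 1.0) * e).mean()

    def fastsim_loss(y, mu_pred, lo_pred, hi_pred):
        # first term: absolute error, regressing the center of the distribution;
        # the other two: quantile regressions of the -1 sigma / +1 sigma points
        return (np.abs(y - mu_pred).mean()
                + pinball(y, lo_pred, 0.16)
                + pinball(y, hi_pred, 0.84))

    rng = np.random.default_rng(1)
    y = rng.normal(10.0, 2.0, size=1000)      # toy "reconstructed" quantity
    print(fastsim_loss(y, np.full_like(y, 10.0),
                       np.full_like(y, 8.0), np.full_like(y, 12.0)))

    # after training, the per-event Gaussian smearing would be sampled as:
    #   sigma = 0.5 * np.abs(hi_pred - lo_pred)
    #   y_reco = np.random.normal(mu_pred, sigma)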
Now I stop here. Unfortunately I cannot go into this part, which was about how we can improve searches at the LHC using autoencoders; you will get it in the PDF, but there is no time to go through it. My conclusions are that particle physics is in a very delicate moment, because we basically completed the Standard Model (not completely, actually, because there are a few things still missing), and we did, and still do, expect new phenomena to answer the many questions that are still open. But unfortunately we don't see any, and it's not clear why. And the challenge ahead is big, because we are basically now searching in a dark room, whereas before we had a very clear beam of light telling us where to go and search. So the paradigm of the LHC as a discovery machine is now a little bit under discussion, because, since we have to apply this very strong selection on the events so early, we might just be saving the wrong ones. This part was unfortunately not really developed in the last section of the seminar, but I showed you at least that we have a way to save resources in the processing, and by saving resources you can do other clever things and enlarge the sensitivity of your searches. I really think that deep learning should play a central role in all of this in the future; otherwise we are basically doomed to keep running the LHC the way we are running it now: we will have fantastic precision measurements of known processes, but we will keep not discovering anything, as we haven't so far. So we think that deep learning will help us operate the detector in the future, and hopefully also help us discover something unexpected. That's it.

Wonderful, thank you so much, Maurizio, that was a great talk. Let me give you some virtual applause, for the lack of a better solution. Thanks a lot. So, questions: does anybody have any questions? If so, please raise your hand. Okay, I see none so far, so let me maybe go ahead. I have a question on the quantization-aware training that you mentioned. Looking at your famous plot of, you know, QKeras doing much better with much fewer resources than normal Keras: doesn't this also mean that the 32-bit floating-point arithmetic that most people are going to be using is just way overkill? And wouldn't this mean that there are lessons to learn not just when it comes to FPGAs, but maybe also when it comes to doing trainings on normal HPC infrastructure?

Certainly, and I think this is also why TensorFlow 2 now supports a limited set of quantizations you can run on CPU and GPU: 16 and 32 bits, and even 8-bit integers; I don't know if they do booleans. But the point is, the kind of scheme we're trying to get to, at least within CMS (well, we're not necessarily going there yet; I'm trying to push to go there with a few others), is that we should enforce a procedure by which a given neural network goes online only after its resources have been optimized. So far it's fine if everyone runs their own deep learning model on their laptop, but the moment you start to put 20 of these models online, in production, they will compete for the resources, and everyone should have an optimized one, as we do with normal algorithms. Normal algorithms are scrutinized very, very intensively, up to line by line; there are validation procedures that were established over a decade. Here we turned the page, we are on a blank page, and we should define the same kind of workflow. I think the way to define it is that you should never allow an algorithm to go into production if it was not trained with a quantization-aware procedure.

Maybe a quick follow-up on that: the huge drop-off that you see in the line you get when you don't train in a quantization-aware way but just truncate the trained model, does the position of this super strong drop depend in some way on the difficulty of the task, or the architecture of the network, or its size, or is it at a more or less universal place?

Well, it certainly does depend, and it's typically a very localized thing. We actually have in the package a nice routine, written by one of our contributors, that allows you to do some post-mortem evaluation of this, where
basically you can go and check, component by component of your network, the typical range, the support in numerical range, that you would get at that component, and compare it to the bits that you allocated. You can really see where you go into overflow, where you go into underflow, and so you can pinpoint which part of the network was given the wrong accuracy, the wrong bit representation, and go there and fix it. Of course this is a very tedious operation; the nice thing about QKeras and AutoQ is that they do it for you, for free. It's not the only library that does this (vendors like Xilinx have their own), but people are so familiar with TensorFlow and Keras that we thought this was such a good thing, and we basically interfaced hls4ml to it, so now it's a completely transparent workflow and the two packages together allow you to do these things.

I see, thanks a lot. So I see there is a question from Will in the chat, asking: are your networks depth-limited for execution speed?

These ones, yes, in the sense that it depends on what you're doing. If you are really targeting the sub-microsecond regime, you cannot go super deep, but you can still do five or six layers. The issue we're facing today, when talking about very deep models, is actually in the HLS synthesizer, which is a black box to us: it's a vendor-distributed package that we don't really control, and sometimes it does weird things; in particular, it can go nuts when you have very, very deep networks. There are solutions to that: you can take a network, break it into chunks, and synthesize one chunk at a time, and then you need to really work on the connections. So there are ways around it, but clearly the direction to go is that the HLS tools should be improved, and people are improving them, because they are becoming more and more popular, not just for this but also for converting generic rule-based algorithms. So we expect that with time the tools will get better and better. But honestly, at some point you run out of resources on the FPGA, and then the only way out is to recycle resources: you extend your computation across more clock cycles, which means you're going to get slower. So there are things that you cannot fit into this scheme. For instance, we're working with a company owned by Volvo, trying to use this technology for self-driving cars, and of course in their case the latency budget is typically a thousand times bigger, so you can do a lot of this resource recycling and slow down the execution. For a level-1 trigger you cannot, but you can certainly do more complex things than what we do today.

Thank you. I see he also has a follow-up question, which is: do you find performance increases with network depth, or is it not that important?

Yes, yes, you do. We never actually studied this intensively; there might be a point where it's not worth going deeper anymore, but we didn't really look into that.

Okay, cool. If there are no more questions, then I would like to thank Maurizio again for taking the time today to talk to us about this very fascinating and very important range of topics. So thanks a lot, and as I said, stay tuned for the recording to appear on YouTube in the near future. Thank you.