So I suggest we get started. Hello, everyone, and welcome to this eighth seminar of the machine learning and physics seminar series, which is actually the last seminar of the term. Today we have the pleasure of welcoming Huilin Qu. Dr. Huilin Qu is a senior research fellow at CERN. He received his bachelor's degree from Peking University in 2014 and his PhD from the University of California, Santa Barbara in 2019. He currently works with the CMS experiment at CERN on the Large Hadron Collider, and his research focuses on searches for new physics and measurements of the Higgs boson properties. He plays a key role in the search for the Higgs boson decaying to a pair of charm quarks, which came out this week, and he also contributes to the search for Higgs boson pair production in the high-momentum regime and to searches for supersymmetric partners of the top quark. Linked to these efforts, Huilin is very active in machine learning research for jet physics. He proposed a series of novel deep learning approaches for jet tagging, which substantially improve the performance and have been widely adopted across the CMS experiment, in particular for the VHcc analysis, as you might have heard this week again. His talk today will focus on this topic, and his title is "Jet tagging in the era of deep learning". Thank you very much for joining us, Huilin, and the floor is yours.

Thanks a lot for the invitation. It's a great pleasure to be here, and today I'll be talking about jet tagging techniques in the era of deep learning. I think many of you are probably already quite familiar, but in case you're not: we do experiments with the Large Hadron Collider at CERN, which is located at the border between Switzerland and France. The LHC is the largest collider and also the highest-energy particle accelerator in the world.
Basically, we accelerate protons to very high energy and make them collide at a center-of-mass energy of up to 14 TeV. To detect such high-energy proton-proton collisions, we use very complicated detector systems. As an example, the experiment that I've been working on is the CMS experiment. This is a sketch of the CMS detector, and on the right you also see a cross-sectional view of a slice of this detector and its various components. Innermost is a silicon tracker, which we use to measure the trajectories of charged particles. After that we have the electromagnetic calorimeter and the hadron calorimeter to absorb the particles and measure their energies. These are all enclosed in a superconducting solenoid, which produces a very powerful magnetic field of 3.8 Tesla. Outside the magnet we also have a muon system, which provides very good identification and momentum measurement of the muons. In CMS we use the so-called particle flow algorithm to reconstruct the full event. The idea is that particle flow combines information from all these detector subsystems: for example, for muons we have information from the tracking system and also from the muon system, and for charged hadrons we have information not only from the silicon tracker but also from the calorimeters. We combine them to make a global event description, and at the end we get a list of particles: electrons, muons, charged hadrons, neutral hadrons, and photons. We can use all of these to reconstruct the full event. This is a display of a real event that we collected at CMS, colliding protons at 13 TeV in this case.
Here you see an outgoing electron, but you also see a collimated spray of particles, which experimentally we reconstruct as a jet. So a jet is basically a collimated spray of particles. Jets are quite common at the LHC because it is a hadron machine: it produces a lot of quarks and gluons, and these all shower, radiate, and hadronize, giving all kinds of jets. On the other hand, jets can also tell us a lot about the underlying event, so we use jets as a handle to probe the hard-scattering process. We can use jets to search for the Higgs boson decaying to a pair of charm quarks, or to search for pair production of Higgs bosons where both Higgs bosons decay to a pair of bottom quarks. We also use this kind of HH to four-b final state to search for new resonances that decay to a pair of Higgs bosons. And this is an example of a search for new physics, in this case supersymmetry, that can lead to multiple jets and also large missing transverse momentum. The key behind all these things is what we call jet tagging. What jet tagging does is use the properties of the outgoing particles in the jet to infer what was actually produced in the high-energy collision that initiated this spray of particles. For example, we may wonder whether a jet is coming from a Higgs boson, a W or Z boson, a top quark, or whether it is just a very commonly produced light-quark or gluon jet from QCD radiation. This is what jet tagging tries to tackle: identifying the origin of a jet, figuring out what kind of particle initiated it. From the machine learning perspective, this is essentially a classification task. And since we have different kinds of jets, we are also dealing with different kinds of jet tagging tasks. The left shows the so-called jet flavor tagging,
where the goal is mainly to distinguish jets from bottom quarks, charm quarks, light-flavor quarks, or gluons. On the other hand, as you see on the right, we can also use jet tagging for the hadronic decays of tau leptons: the tau is a lepton, but its hadronic decay also produces a spray of particles, and jet tagging techniques can be used here as well. But what I'm going to focus on today is the more interesting part, which is boosted jets. A boosted jet arises when, for example, a massive particle like the top quark is produced with very high momentum. As you can imagine, with a high Lorentz boost all the decay products become quite close to each other and merge into a single large-radius jet, instead of being resolved into several different jets. In the land of boosted jet tagging, we try to tag top quarks decaying to b and W and then to three quarks in total, or a Higgs, W, or Z boson decaying to a pair of quarks, and the main background we try to reject is jets from a single quark or gluon, which are ubiquitously produced by QCD interactions. So this is the scope of boosted jet tagging: we try to identify highly Lorentz-boosted heavy particles, mainly the Higgs, W, and Z bosons and the top quark, and to reject the QCD background. For this task there are some quite unique properties that we can exploit. One is that these heavy particles produce quite characteristic signatures, or substructure, in the jets. For example, since the top quark decays to three quarks, you get a three-prong distribution of the particles and energies in the jet; for a W, Z, or Higgs decaying to a pair of quarks you have a two-prong substructure; and for light-flavor quarks or gluons it's more or less a one-prong substructure. Another important aspect is the flavor content, by which I mean whether there is a b quark or a charm quark in the decay.
For example, the top quark always has a b quark in its decay, and for the Higgs the dominant decay mode, H to bb, also gives two bottom quarks. With these heavy-flavor quarks you get B hadrons, such as B mesons, which have relatively long lifetimes, so in the detector you end up with displaced tracks and secondary vertices. And because of the large semileptonic branching fractions of these heavy-flavor hadrons, you also get more charged leptons, which can be another useful handle for identifying the flavor content. So the goal of boosted jet tagging is to simultaneously exploit both the substructure and the flavor information of the jets to maximize the performance. Thanks to the rise of deep learning techniques, we have achieved quite significant improvements in boosted jet tagging performance in recent years. Before getting into the machine learning part, I want to start with something more fundamental, which is how we can represent a jet for machine learning. In the early days of using machine learning for jet tagging, one approach explored was to treat a jet as an image. This is just a regular 2D image in the eta-phi space, where the intensity of each pixel corresponds to the energy deposited at that particular position. This approach works pretty well, because we can then link jet tagging to a computer vision task and use the very powerful convolutional neural networks to perform the tagging. But as you can see from this image, it looks quite different from the regular images we encounter every day, first of all because it is really very sparse: probably less than 1% of the pixels are activated in a jet image.
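To make the jet-image idea concrete, here is a minimal numpy sketch (not the actual analysis code; the particle list and grid size are made-up toy values) of pixelating a jet into an eta-phi grid, with each pixel's intensity given by the summed pT deposited in that cell:

```python
import numpy as np

# Toy jet: each particle is (eta, phi, pt). Values are illustrative only.
particles = np.array([
    [0.11, 0.06, 120.0],
    [-0.22, 0.17, 45.0],
    [0.04, -0.31, 30.0],
    [0.33, 0.26, 10.0],
])

def jet_image(particles, n_pixels=32, half_width=0.8):
    """Pixelate a jet into an n_pixels x n_pixels grid in the eta-phi plane,
    with pixel intensity given by the summed pT deposited in each cell."""
    eta, phi, pt = particles[:, 0], particles[:, 1], particles[:, 2]
    edges = np.linspace(-half_width, half_width, n_pixels + 1)
    image, _, _ = np.histogram2d(eta, phi, bins=(edges, edges), weights=pt)
    return image

img = jet_image(particles)
```

Even with realistic multiplicities of order 50 to 100 particles, only a tiny fraction of the 32 x 32 = 1024 pixels ever fire, which is exactly the sparsity issue of the image representation.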
Also, in real detectors like CMS or ATLAS we have inhomogeneous geometry, meaning it's not so trivial to convert a jet into a regular grid of pixels. Another approach, which draws an analogy from natural language processing, is to treat a jet as a sequence: each particle in the jet becomes a character or a word in a sentence, and we can use recurrent neural networks or one-dimensional CNNs to process the particles in a jet and perform the tagging. One notable example of this approach is the CMS DeepAK8 algorithm, a fairly advanced deep-learning-based algorithm for boosted jet tagging. The "AK8" refers to the type of jet we use in CMS: it's clustered with the anti-kT algorithm with a distance parameter of R = 0.8. The idea is to build a multi-class classifier, so with one algorithm we can do both top quark tagging and W, Z, or Higgs boson tagging. In fact, as you see in this table, we also have sub-divided categories: for the Higgs we further divide into H to bb, H to cc, et cetera. With this approach we can aggregate the output scores and treat them as probabilities, so at the end we get a very versatile tagger that can do many, many things. DeepAK8 features the direct use of the jet constituents, meaning the particles reconstructed in CMS as particle-flow candidates, together with the secondary vertices within the jet cone, which target the displaced decays of the bottom and charm hadrons. In the network architecture, we first use two 1D CNNs to separately process the particle inputs and the secondary vertex inputs, and then we combine them with a fully connected network, which gives the final prediction and classifies jets into the various sub-categories. This slide shows the performance of DeepAK8 in terms of the ROC curve.
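The two-branch structure just described can be sketched minimally in numpy. This is emphatically not the real DeepAK8 implementation: each "1D CNN" here is reduced to a kernel-size-1 convolution (a shared per-particle dense transform plus ReLU), pooling makes the result independent of the sequence length, and all feature counts and random weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d(x, w):
    """A kernel-size-1 '1D convolution': the same dense transform (plus ReLU)
    applied independently at every position of the sequence x: (length, feat)."""
    return np.maximum(x @ w, 0.0)

def branch(x, w1, w2):
    """Two stacked conv layers followed by global average pooling, which
    collapses the variable-length sequence into a fixed-size feature vector."""
    return conv1d(conv1d(x, w1), w2).mean(axis=0)

# Toy inputs: 30 particle candidates with 10 features each, and 3 secondary
# vertices with 6 features each (feature counts are made up for this sketch).
particles = rng.normal(size=(30, 10))
vertices = rng.normal(size=(3, 6))

w_p1, w_p2 = rng.normal(size=(10, 32)), rng.normal(size=(32, 32))
w_v1, w_v2 = rng.normal(size=(6, 32)), rng.normal(size=(32, 32))
w_fc = rng.normal(size=(64, 5))  # fully connected head -> 5 jet categories

features = np.concatenate([branch(particles, w_p1, w_p2),
                           branch(vertices, w_v1, w_v2)])
logits = features @ w_fc
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax over the jet categories
```

The key design point survives even in this toy: the particle branch and the secondary-vertex branch are processed separately, and only their pooled summaries are combined by the fully connected head.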
On the left is top quark tagging. The x axis is the signal efficiency, basically the efficiency to correctly identify a top quark, and the y axis is the background efficiency, which you can think of as the probability of misidentifying a jet, like a QCD jet, as a top quark. The idea is that at a fixed background misidentification rate we want as high a signal efficiency as possible, so moving toward the bottom right is the direction of improvement. If you compare the various algorithms, DeepAK8 achieves the best performance among them all, and compared with traditional approaches based on the groomed mass and N-subjettiness, at the same signal efficiency you get more than an order of magnitude reduction in the background misidentification rate. Similarly, the right shows the performance for X to bb, where DeepAK8 also achieves a pretty significant improvement. Of course this is quite successful, but if you think a bit more deeply about the sequence-based representation of a jet, you realize there is a limitation to this approach. We know that the particles within a jet have a permutation invariance: the response of the tagging algorithm shouldn't depend on the order in which we process the particles, because regardless of how you reorder them, it is still the same jet. This makes jets quite different from typical natural language, because in a sentence or a paragraph the order of the words makes quite a difference.
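The figure of merit used throughout these comparisons, the background misidentification rate (or its inverse, the rejection) at a fixed signal efficiency, can be read off a set of tagger scores as in this small sketch; the Gaussian score distributions are toy stand-ins, not real tagger outputs:

```python
import numpy as np

def rejection_at_efficiency(sig_scores, bkg_scores, target_eff):
    """Background rejection (1 / mistag rate) at the score threshold that
    keeps the requested fraction of signal jets."""
    cut = np.quantile(sig_scores, 1.0 - target_eff)  # keep top target_eff of signal
    mistag = np.mean(bkg_scores >= cut)              # background efficiency
    return 1.0 / mistag if mistag > 0 else np.inf

rng = np.random.default_rng(1)
sig = rng.normal(2.0, 1.0, 100_000)  # toy tagger scores for signal jets
bkg = rng.normal(0.0, 1.0, 100_000)  # toy scores for background (QCD) jets

r50 = rejection_at_efficiency(sig, bkg, 0.5)
r90 = rejection_at_efficiency(sig, bkg, 0.9)
# Tightening the working point (lower signal efficiency) buys more rejection.
```

Scanning `target_eff` over (0, 1) and plotting the two efficiencies against each other is exactly what produces the ROC curves shown on the slides.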
This means that if we use the sequence-based approach, we have to impose an arbitrary order, either by pT, or by distance, or some other metric, and this could potentially limit the performance of the machine learning algorithm. If you try to think beyond the ordered sequence, a very natural analogy, used in the 3D computer vision community, is the point cloud representation. A point cloud is an unordered set of points in space, typically produced by a lidar or a 3D scanner, as you can find on self-driving cars. The points in the set are unordered, but their spatial distribution carries very important information: it encodes the geometric structure of objects. We can use these discrete points to infer whether an object is a car, a cyclist, or a pedestrian, and that can be used to guide the self-driving car. In a very similar way, we can think of a jet as a cloud of particles: a set of particles in space, the difference being that in this case we put the jet in the more familiar 2D eta-phi space instead of the 3D Euclidean space. The coordinates are basically the flight directions of the particles, but we also have more information measured by our detectors, like the energy and momentum, and for charged particles the trajectory, the impact parameters, the displacement, and various other properties. The common feature between a point cloud and a particle cloud is that they are both permutation invariant, with respect to the points and with respect to the particles.
Based on this representation we proposed ParticleNet, which performs jet tagging with this particle cloud representation. As I said, we treat the jet as an unordered set of particles distributed in the eta-phi space, and we use a graph neural network architecture adapted from the Dynamic Graph CNN. The idea is illustrated in this sketch. Our starting point is the particle cloud; we treat each particle as a node of a graph, and we connect each particle with its k nearest neighbors, which forms the edges of the graph. Once we have the edges, we can propagate information from the neighbors to the center using a parameterized edge function, and once we have the edge features, we aggregate them back to the central point in a symmetric way, for example by taking the mean. We repeat this information propagation and aggregation for all the particles in the particle cloud, which essentially gives us a transformation from the input particle cloud to a learned output particle cloud. We can then stack a few layers of this edge convolution operation to build a deep architecture, so we capture not only the very close nearby particles but also information from second- or third-degree neighbors. Essentially, we get the correlations between the particles and also more global information, and this helps to improve the jet tagging performance.
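A minimal numpy sketch of one such edge-convolution step, in the spirit of the Dynamic Graph CNN operation that ParticleNet adapts; the choice of k, the single-layer ReLU edge function, and all shapes are simplifications of the real architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def edge_conv(points, features, w, k=3):
    """One EdgeConv-style update: for each particle, find its k nearest
    neighbors in the coordinate space, apply a shared edge function to
    (x_i, x_j - x_i) for each edge, and aggregate the edge features
    symmetrically (here: the mean)."""
    n = len(points)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # exclude self from the neighbors
    nbrs = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per particle
    out = np.empty((n, w.shape[1]))
    for i in range(n):
        edges = np.concatenate(
            [np.broadcast_to(features[i], (k, features.shape[1])),
             features[nbrs[i]] - features[i]], axis=1)
        out[i] = np.maximum(edges @ w, 0.0).mean(axis=0)  # ReLU edge fn + mean
    return out

points = rng.normal(size=(8, 2))     # toy eta-phi coordinates of 8 particles
features = rng.normal(size=(8, 4))   # toy per-particle input features
w = rng.normal(size=(8, 16))         # shared edge function: (4 + 4) -> 16
updated = edge_conv(points, features, w)
```

Because neighbor finding and the symmetric aggregation don't care about the input ordering, reordering the particles simply reorders the output rows, which is the permutation property discussed above.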
Here is a table summarizing the performance of ParticleNet. The AUC and accuracy come out quite similar between these algorithms, but what is typically more meaningful for real experimental applications is the background rejection, the inverse of the background efficiency at a fixed signal efficiency, and here you see that ParticleNet achieves a pretty significant improvement: compared to the DeepAK8 architecture the improvement is more than a factor of two, and compared with the best alternative approach, the ResNet-based one, you still get more than a 50% improvement. So this is a quite powerful architecture, and we want to use it in actual experiments like CMS, but before that there are some practical aspects to take care of. One feature of these powerful deep learning taggers is the correlation with the jet mass, illustrated in these two sketches showing the mass distribution for the background QCD jets. What you get, if you train such a neural network tagger without any special consideration, is this kind of mass sculpting behavior: without any selection you have a falling spectrum for the QCD jets, but if you apply a tighter and tighter selection with the tagger, you start to create a peak, and the background becomes quite similar to the signal. This is not necessarily a problem, but in many cases we want to avoid this behavior, either because we want to use the mass distribution to further separate signal and background, or because a mass-independent tagger can also be used to tag a signal jet with an unknown mass that could come from new physics scenarios. So what we try to do is perform some mass decorrelation, and here I'm going to talk about two approaches. One is a more traditional approach, which was used by the DeepAK8 algorithm.
This approach is based on so-called adversarial training. The idea is that this part is basically the nominal DeepAK8 tagger: it consists of a feature extractor, the 1D CNNs, followed by a fully connected network that classifies the jets for the final output. On top of that, for the mass decorrelation, we add a mass predictor. This part takes the features extracted by the 1D CNNs and uses them to try to predict the mass of the jet. The idea is that the better we can predict the mass from the features extracted by the 1D CNNs, the more correlated these features are with the jet mass. So we can use the accuracy of this mass prediction as a penalty: we add it to the classification loss with a minus sign and a weight factor. If we use this joint loss function for backpropagation and optimize the network, we can achieve both high classification accuracy and a minimal correlation of the features with the jet mass. As you can see, this more or less works: you don't see a striking peak after selections with the tagger, but you do see some residual sculpting around the mass that we try to tag in this case. For the ParticleNet tagger we developed a new approach to mass decorrelation, which is actually quite straightforward if you go back and think about why we have mass sculpting in the first place. It is because, in the training, the signal has a fixed mass: the signal jet mass distribution has a quite sharp peak, while the background QCD jets have a smoothly falling spectrum. If the training just sees these two, it will start to exploit this striking difference between the mass distributions of signal and background.
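The joint objective just described can be sketched as follows. This is a simplification, not the exact DeepAK8 setup: a single scalar weight `lam` and a plain mean-squared error for the mass adversary, with the alternating training of the adversary itself omitted:

```python
import numpy as np

def joint_loss(cls_probs, cls_labels, mass_pred, mass_true, lam=0.1):
    """Sketch of the adversarial objective: cross-entropy for the classifier
    minus lam times the adversary's mass-regression loss. Minimizing this
    rewards good classification while penalizing features from which the
    jet mass can be predicted accurately."""
    l_cls = -np.mean(np.log(cls_probs[np.arange(len(cls_labels)), cls_labels]))
    l_mass = np.mean((mass_pred - mass_true) ** 2)  # adversary's MSE
    return l_cls - lam * l_mass
```

In the actual adversarial scheme the mass predictor is trained in alternation to minimize its own loss, while the shared features are updated against this combined objective; the sketch only shows the combined loss.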
Even if we don't give it the mass directly, it can still infer this information from the other features we provide. This means that if we want to avoid the mass sculpting, we can make the signal and background have the same mass distribution in the training. To achieve this, we produce a special signal sample that covers the full mass range, and then we reweight both signal and background to a flat mass distribution. In the training, the network then sees the same distribution for signal and background, and there is nothing to exploit purely from the jet mass. As you see, this approach works quite well: we get an even smoother mass distribution for the background jets compared to the adversarial training approach. If we compare the performance of these two mass decorrelation approaches, what you see is that for ParticleNet, with the decorrelation using the flat-mass signal sample, you actually lose very little. This is comparing the solid pink line to the dashed orange line, which is a bit difficult to see because it overlaps with another line, but the bottom line is that there is a very minimal performance gap after the mass decorrelation. On the other hand, for the previous approach using adversarial training, there is a quite large gap: the mass-decorrelated version, shown in the dashed green curve, loses quite a bit of performance compared to the nominal tagger. So with this approach we can further improve the mass-decorrelated ParticleNet tagger performance, and we use it in several analyses. Now that we have a very good mass-decorrelated tagger, a natural next task is to see whether we can further improve the reconstruction of the jet mass itself, because the jet mass is also one of the most powerful observables for jet tagging.
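The reweighting step can be sketched like this: give each jet a training weight inversely proportional to the population of its mass bin, so the weighted mass spectrum comes out flat. The binning and the toy falling spectrum below are illustrative, not the actual training sample:

```python
import numpy as np

def flat_mass_weights(mass, edges):
    """Per-jet training weights inversely proportional to the population of
    the jet's mass bin, so the weighted mass spectrum comes out flat."""
    counts, _ = np.histogram(mass, bins=edges)
    idx = np.clip(np.digitize(mass, edges) - 1, 0, len(counts) - 1)
    w = 1.0 / np.maximum(counts[idx], 1)  # inverse bin population
    return w / w.sum()

rng = np.random.default_rng(2)
# Toy falling, QCD-like background spectrum, restricted to a 30-250 GeV window.
bkg_mass = np.clip(rng.exponential(60.0, 50_000) + 30.0, 30.0, 249.9)
edges = np.linspace(30.0, 250.0, 23)
w = flat_mass_weights(bkg_mass, edges)
reweighted, _ = np.histogram(bkg_mass, bins=edges, weights=w)
# After weighting, every mass bin carries the same total weight.
```

Applying the same procedure to the broad-mass signal sample gives signal and background identical flat mass distributions in training, so the mass itself carries no discriminating information.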
As you see here, this shows a comparison of the soft drop groomed jet mass for the QCD background and for the W, Z, and Higgs bosons. For the signal these give very sharp mass peaks, while the background has a pretty smooth falling spectrum. This gives us pretty good separation between signal and background, and of course we want to further improve the mass resolution to get an even sharper peak. The idea is to develop a mass regression algorithm: we use deep learning to see if we can reconstruct the jet mass with the highest possible resolution. For the technical setup we use something very similar to the ParticleNet tagger, but instead of doing a classification task, we use the jet constituent inputs to directly predict the jet mass. The regression target for the signal jets is the generated mass of the X particle, where X decays to bb, cc, or qq, and note that this X actually has a flat mass spectrum over this range. For the background QCD jets we use the soft drop mass of the generator-level, i.e. particle-level, jet. We feed these into a log-cosh loss function, illustrated by the green curve: in the small-value region it behaves like an MSE loss, and at large values it behaves like an MAE loss, which gives a smoother behavior that is more robust to the tails of the mass distribution. Here is how the mass regression works. On the left you see the mass response for signal jets, in this case X to cc, and the x axis shows the mass response: the reconstructed jet mass divided by the target mass. The green solid curve shows the regression behavior, and you see that not only does it give a much sharper signal peak, it also resolves some tails at low and high values that were previously seen with the soft drop grooming algorithm.
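The log-cosh loss mentioned here is quadratic for small residuals (MSE-like) and linear for large ones (MAE-like). A numerically stable sketch, using the identity log(cosh x) = |x| + log(1 + e^(-2|x|)) - log 2:

```python
import numpy as np

def log_cosh_loss(pred, target):
    """Log-cosh regression loss: ~x^2/2 for small residuals (MSE-like) and
    ~|x| - log 2 for large ones (MAE-like), hence robust to outliers in the
    tails. Written to avoid overflow of cosh for large residuals."""
    x = np.abs(pred - target)
    return np.mean(x + np.log1p(np.exp(-2.0 * x)) - np.log(2.0))
```

Near zero the loss has the smooth curvature of MSE, which helps convergence, while far from zero the gradient saturates like MAE, so a few badly mis-measured jets in the tails cannot dominate the training.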
On the right is the mass distribution for the background QCD jets. Here we apply tight selections with the ParticleNet tagger, and we want to make sure there is no mass sculpting from this regression. As you see, there is actually very little change in the QCD jet mass distribution after selections at various tagging working points. With this mass regression we can get about 20 to even 25% improvement in the final sensitivity for analyses involving X to bb or X to cc decays. Now, with these powerful techniques in hand, we can apply them to real analyses and see what we get. Here I'm going to introduce two examples. One is the recent CMS search for the Higgs boson decaying to a pair of charm quarks, H to cc. So far we have established the Higgs coupling to the vector bosons, W and Z, and to the third-generation fermions: the top and bottom quarks and the tau lepton. More recently we are starting to get first evidence for the Higgs-to-muon coupling. So of course the next milestone would be the Higgs coupling to the second-generation quarks, and the first target is to probe the Higgs coupling to the charm quark. However, the search for H to cc is very challenging at the LHC. Firstly, in the standard model the branching fraction of H to cc is quite small, just 3%, which, to put it in context, is only one twentieth of the dominant decay mode, H to bb. On top of that, at hadron colliders there are enormous backgrounds arising from light-flavor quark or gluon jets, or even bottom jets. In this sense the identification of the charm quarks is the key and the most difficult part of this analysis. For the CMS H to cc analysis we target associated production of the Higgs boson with a W or Z boson, where the Z or W decays to leptons.
These leptons give us handles to trigger the event and also help reduce the otherwise enormous QCD background. To fully cover the different Higgs decay topologies we have two approaches: one is the resolved-jet approach, which tries to reconstruct the two charm quarks separately, and the other is the merged-jet topology, where we target both charm quarks with a single large-radius jet. This is what really benefits from all the boosted jet tagging techniques I just talked about. To put it into context, on the left is the performance of the ParticleNet H to cc tagger. Compared to the previous taggers, DeepAK8 or DeepAK15, shown in the blue curves, the new ParticleNet tagger, in the red curves, improves the background rejection at the same signal efficiency by about a factor of five, both for V plus jets, which is our main background, and for H to bb, which is a very difficult and largely irreducible background. By a simple calculation, this alone already gives more than a factor of two improvement in the final sensitivity of the H to cc search. On the right, we also use the mass regression, and we get about a 50% improvement in the mass resolution, which gives us more than a 20% improvement in the final sensitivity. Powered by these very advanced boosted jet tagging techniques, and combined with the resolved-jet topologies, the CMS search for VH with H to cc using the full Run 2 dataset gives a pretty stringent constraint on H to cc: we get an observed upper limit of 14 times the standard model for this process, with an expected limit of 7.6. If you compare with the ATLAS full Run 2 result, using the same size dataset and a very similar search, our expected result is about a factor of four better.
This is actually getting quite close to the expected sensitivity previously projected for the High-Luminosity LHC with 3000 inverse femtobarns of data, which is about 20 times more data: the expectation there was an upper limit of 6.4 times the standard model, and with a much smaller dataset we already reach 7.6 times the standard model. The right plot summarizes the upper limits from the different channels, including the different leptonic channels. What you see is that the merged-jet topology, which uses these advanced boosted jet tagging techniques, essentially drives the sensitivity: it alone gives an upper limit of about 8.8, while the full combination gives 7.6. As a side product of this H to cc search, we validated the full analysis procedure by measuring a similar process, the Z to cc decay in VZ production, and we measured this process to be very consistent with the standard model expectation, with about 20% uncertainty. This also led to the first observation of Z to cc at a hadron collider, with a significance of 5.7 sigma. Here you see the mass distribution of the large-radius jets in the merged-jet topology: after background subtraction you clearly see a Z peak near 90 GeV, and on top of that some excess from the H to cc events. Okay, another search powered by these boosted jet tagging techniques is the search for Higgs boson pair production. This is another topic of high importance at the LHC, and also at the future High-Luminosity LHC, because it allows us to probe the Higgs potential term in the standard model. Indeed, HH production is a very rare process in the standard model. The dominant production mechanism is gluon-gluon fusion, which mainly comes from these diagrams, and it gives us direct access to the trilinear Higgs coupling, which is critical for probing the Higgs boson self-coupling.
The sub-dominant production mode is vector boson fusion, where the Higgs boson pair is produced by vector bosons radiated from the initial-state quarks. This mode has a pretty unique sensitivity to the VVHH quartic coupling. A more striking fact is that if this VVHH coupling deviates from the standard model value, that is, if the coupling modifier kappa-2V moves away from one, the events become more and more boosted, which gives a special advantage to the merged-jet approach and these boosted jet tagging techniques. So we performed a search targeting VBF production of the Higgs boson pair, with the event selection illustrated in this diagram: in the central region we look for two high-pT Higgs bosons, each decaying to a pair of bottom quarks, and for the VBF topology we additionally require two jets with a large pseudorapidity gap and a large invariant mass, compatible with the VBF topology. We use the ParticleNet tagger to identify the H to bb decays, and the ParticleNet mass regression for the mass reconstruction. This sets a pretty stringent limit on the VVHH quartic coupling, summarized in the bottom-left plot: we constrain kappa-2V to be between 0.6 and 1.4, where one is the standard model value. This also means that, for the first time, we exclude kappa-2V equal to zero, in fact at well above five sigma, which indirectly proves that this VVHH coupling does exist, assuming of course that the other Higgs couplings are at their standard model values. And again, compared with the ATLAS search, this is a pretty strong improvement.
Okay, so then I'm going to talk about some new developments in jet tagging, trying various approaches to improve the jet tagging performance further with deep learning. The focus here is really to incorporate more physics knowledge, to bring domain-specific inductive bias into the deep learning algorithms. So I first start with LundNet, which is based on the Lund jet plane representation of a jet. The idea is that the Lund jet plane can be used to represent the radiation within a jet. If we have a jet with emissions or splittings, we can represent these splittings in the Lund plane. In this case, if the splitting is a primary emission, it is represented in the primary Lund jet plane, and if we have a splitting from a secondary particle, that splitting is represented in the secondary Lund jet plane. So as you can see, the Lund jet plane provides a very efficient description of the full radiation pattern within a jet. Moreover, it has another advantage: different kinematic regimes are well separated in the Lund jet plane. For example, the non-perturbative region is characterized by small kt values, the large-angle regime mostly comes from initial-state radiation, and you also have the large-kt region, which corresponds to hard-collinear splittings, et cetera. So this not only provides a way to represent the radiation pattern within the jet, it also gives us additional handles to control what kind of information we want to use for jet tagging. With this representation, LundNet was developed, which is a graph neural network architecture based on the Lund jet plane. Technically, we actually use the Lund tree, which is basically a binary tree built from the Cambridge/Aachen reclustering of the jet.
And essentially this Lund tree is really equivalent to the full Lund jet plane. From this tree we basically convert to a graph, and each node then corresponds to an emission or splitting. Associated with each splitting, we can define a set of kinematic variables that describe the kinematics of that splitting. For example, we can have the ΔR between the two branches of the splitting, the kt, the invariant mass, the momentum ratio, et cetera. With this, we can input the Lund tree into a graph neural network. Overall, the network architecture is quite similar to ParticleNet, but the main difference is that here the graph structure is fully specified by the Lund tree, so we don't need to do any k-nearest-neighbor finding as we do in ParticleNet. Another big difference is that, since it is a binary tree, each node has only up to three neighbors, while in ParticleNet each particle connects to either 7 or 16 particles. So basically we studied two variants of LundNet. They have the same network architecture; the main difference is the input features. LundNet-5 uses five-dimensional inputs, while LundNet-3 uses only three-dimensional inputs: the kt, the ΔR, and z, which is the momentum ratio. So here we look at the performance of LundNet. On the left you see a ROC curve for top tagging, and basically what you see is that with the three-dimensional inputs, LundNet-3 already gets a performance pretty comparable to ParticleNet, and with the full five-dimensional inputs, LundNet-5 gives a significant improvement in the top tagging performance.
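To make the per-splitting variables concrete, here is a small numpy sketch of the Lund-plane coordinates for one declustering step. The function and the (pt, eta, phi) branch representation are hypothetical simplifications, not the LundNet implementation itself:

```python
import numpy as np

def lund_coordinates(pa, pb):
    """Lund-plane variables for one Cambridge/Aachen declustering step.
    pa and pb are (pt, eta, phi) of the harder and softer branch.
    Returns (ln(1/Delta), ln(kt), z). Illustrative helper only."""
    pt_a, eta_a, phi_a = pa
    pt_b, eta_b, phi_b = pb
    # Wrap the azimuthal difference into (-pi, pi]
    dphi = np.arctan2(np.sin(phi_a - phi_b), np.cos(phi_a - phi_b))
    delta = np.hypot(eta_a - eta_b, dphi)   # angular separation of the branches
    kt = pt_b * delta                       # relative transverse momentum of the softer branch
    z = pt_b / (pt_a + pt_b)                # momentum fraction carried by the softer branch
    return np.log(1.0 / delta), np.log(kt), z
```

Walking the full binary tree and calling this at every node yields the per-node feature vectors that feed the graph network.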
Another advantage is that if you compare the computational costs, you realize that LundNet can lower the training time and the inference time by almost an order of magnitude, because we get rid of the dynamic graph construction and the expensive k-nearest-neighbor search. And another, arguably more important, advantage of LundNet is that it provides a systematic way to control the robustness of the tagger. This is shown here, which illustrates robustness versus performance. The x-axis is the direction that gives more resilience against non-perturbative effects, and on the y-axis, higher is better performance. The idea is that by applying increasingly tight kt cuts, we can remove the non-perturbative region, which is characterized by small kt values of the splittings, and then we can substantially improve the robustness of the tagger against non-perturbative effects. Okay, another approach that we recently explored is to add Lorentz symmetry to the graph neural network. This leads to LorentzNet; you see the network architecture on the right and a description of the graph message-passing process on the left. Basically, we give this graph network two kinds of inputs. One is the coordinate input, which is the four-vector of each particle, and then we also have feature inputs; here, to preserve the symmetry, we only consider scalar inputs for the nodes. From these two inputs we build the message that we propagate along the graph, and we want the message to also be Lorentz invariant. So what we use as inputs are the two scalar inputs from the central node and its neighbor, and we also define two pairwise Lorentz invariants, based on the Minkowski norm and the Minkowski inner product, which are fully Lorentz invariant.
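The pairwise invariants mentioned above can be sketched in a few lines of numpy. This is a minimal illustration of Minkowski inner products as edge features, assuming the (+,-,-,-) metric convention, and is not the LorentzNet code:

```python
import numpy as np

def minkowski_features(p):
    """Pairwise Lorentz invariants for n particles with four-momenta p of
    shape (n, 4) in (E, px, py, pz) convention. Returns the (n, n) matrix
    of Minkowski inner products <p_i, p_j>; the diagonal gives the
    squared Minkowski norms. A minimal sketch of invariant edge features."""
    g = np.diag([1.0, -1.0, -1.0, -1.0])   # Minkowski metric (+,-,-,-)
    return p @ g @ p.T
```

Because every entry of this matrix is frame independent, any message built only from these numbers (and other scalars) is automatically Lorentz invariant.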
And from that, since all the inputs are Lorentz invariant, the message is also Lorentz invariant, and we can use it to update the coordinates and also update the features. Then the full building block of this graph network is fully Lorentz-group equivariant. This approach actually gives quite good performance. On the left, this shows the performance on the top tagging benchmark dataset, and you see that LorentzNet improves quite substantially over ParticleNet, going from about 1600 to more than 2000 in the background rejection. And if you consider model complexity, since LorentzNet explicitly uses the Lorentz symmetry, it actually achieves this high performance with fewer parameters than ParticleNet. It still has a somewhat higher computational cost, mainly because it uses a fully connected graph while ParticleNet uses only k nearest neighbors; this is probably something that could be further improved in the future. Another advantage that comes with this symmetry preservation is stability: if you look at the left plot, we try to boost the whole jet, all the particles in the jet, along, say, the x direction. You see that LorentzNet shows very stable performance under various Lorentz boosts, while the other architectures that are not Lorentz invariant start to degrade substantially as the Lorentz boost increases. And another advantage of incorporating symmetry is that we can now achieve much higher sample efficiency. This means that we can achieve pretty good performance by training on a very small dataset; in this case, using 0.5%, which is roughly just 6000 jets, it already reaches a comparable performance, whereas if you train ParticleNet you need an order of magnitude more data. So incorporating the symmetry also gives us a more sample-efficient algorithm.
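The boost-robustness point can be checked directly: boosting all particles in a jet changes their individual momenta, but leaves Minkowski invariants such as the jet mass untouched, which is exactly why invariant inputs give a stable tagger. A minimal numpy sketch, assuming (E, px, py, pz) four-vectors and a boost along x:

```python
import numpy as np

def boost_x(p, beta):
    """Apply a Lorentz boost with velocity beta along the x axis to
    four-vectors p of shape (n, 4) in (E, px, py, pz) convention."""
    gamma = 1.0 / np.sqrt(1.0 - beta**2)
    L = np.array([[gamma, -gamma * beta, 0.0, 0.0],
                  [-gamma * beta, gamma, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
    return p @ L.T

def inv_mass2(p):
    """Squared invariant mass of the summed four-vector."""
    tot = p.sum(axis=0)
    return tot[0]**2 - tot[1]**2 - tot[2]**2 - tot[3]**2
```

A model fed only invariants sees literally the same inputs before and after the boost, so its output cannot degrade, in contrast to architectures built on lab-frame quantities.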
Okay, so then the last thing I would like to touch on a bit is the particle transformer. This is motivated by the recent rise of self-attention and transformer architectures, not only in language models but also in computer vision, and even in more scientific tasks: AlphaFold, for example, also uses attention mechanisms. What we try to explore here is whether we can benefit from this kind of universal transformer architecture and improve the jet tagging performance further. What we realized from this study is that a pure, plain transformer architecture is actually not efficient enough to improve the performance. What we have to do is introduce some tailored features that are helpful for jet tagging. This is the pairwise interaction term that we put into the transformer. The idea is that we take these kinds of pairwise features, which are also used for example in LundNet, transform them with an embedding, and then inject them into the self-attention as a pairwise bias added before the softmax. With this approach, we also need a very large dataset, because the smaller datasets we had tried so far don't really reveal the full power of this transformer architecture. To tackle that challenge, we also prepared a large-scale dataset that we intend to make public. This dataset consists of 10 classes of different jets, covering H→bb̄, H→cc̄, most of the top decays, but also some new scenarios like H→gg, H→WW→4q, H→WW→ℓνqq̄′, et cetera. So it's a very large dataset, almost two orders of magnitude larger than existing datasets. We tested our new architecture on this dataset, and the performance is like this: as you read from this table, this new architecture, trained on this very large dataset, does give us a performance improvement.
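The "pairwise bias before the softmax" mechanism described above can be sketched compactly. This is a single-head toy version in numpy, with the query/key/value projections left as the identity for brevity; it illustrates how an interaction matrix U enters the attention logits, and is not the particle transformer implementation:

```python
import numpy as np

def biased_attention(x, U):
    """Self-attention over n tokens x of shape (n, d), with an additive
    pairwise bias U of shape (n, n) injected into the attention logits
    before the softmax. Projections are identity here for simplicity."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d) + U                    # logits + pairwise bias
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                    # row-wise softmax
    return w @ x                                         # attention-weighted values
```

In the real architecture, U is produced by embedding physics-motivated pairwise features (such as the Lund-style kinematic variables) of every particle pair, and only that embedding is learned.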
For example, for H→bb̄ we get about 30% improvement in the background rejection, and for H→cc̄ I think we get something like 70% improvement. There are also some cases where we get more than a factor of three, for example for the top quark. So these are quite impressive results. Another thing shown here is the performance of the transformer without the addition of the interaction term, and you see that while its performance is still better than ParticleNet in some cases, it doesn't really outperform ParticleNet that much. And if we look at the model complexity, we see that this transformer architecture has many more parameters. In terms of computation, though, it doesn't add that much, so in the end the inference and training times are still quite comparable. Another nice thing introduced by this large-scale dataset and these transformer architectures is that they now allow us to use a new training paradigm. The idea is that we can perform a supervised pre-training on this very large JetClass dataset, and then do a fine-tuning to tailor it to the downstream tasks. These are the results we get with this approach: on the left is pre-training on JetClass and fine-tuning on the top tagging benchmark dataset, and on the right is fine-tuning on the quark-gluon dataset. You see that this pre-trained and fine-tuned model achieves a pretty significant improvement compared to architectures trained purely end-to-end on these smaller datasets. Here you also see the performance of the particle transformer itself trained end-to-end on these smaller datasets, and as you see, it doesn't show a significant improvement over existing architectures. Okay, so this brings me to my summary.
So we have seen that the rise of deep learning has brought a lot of progress in jet physics; in particular, novel approaches like graph neural networks et cetera have significantly improved the jet tagging performance, and they also lead to a substantial increase in the physics reach at the LHC. Towards the future, I think there are basically two main aspects. One is to push the performance further, possibly with new architectures like graph networks and transformers, et cetera. But I think the key here would really be to incorporate more domain-specific physics knowledge and use our physics knowledge to guide the construction of the architectures. Another important direction I see would be to have more large public datasets. We intend to make public this large JetClass dataset, but we think it would also be beneficial if the experiments can release more open data, et cetera. I think these large datasets would benefit the whole community quite a bit. Another interesting direction is to look at new training strategies. So far we have mostly performed end-to-end training on specific datasets, and with larger datasets we can start to think about supervised pre-training. But I think the future, as already widely adopted in language models and computer vision, is to really look into unsupervised or semi-supervised pre-training. This would open the door to training on real data and then fine-tuning on the specific tasks that we have in collider physics. Another aspect I think is equally important is to increase the robustness of the algorithms and to keep the systematic uncertainties under control. From the machine learning perspective, of course, we should investigate and explore more robust architectures and different training schemes, especially those that train on real data directly.
And on the other aspect, I think we can also work together, jointly between the experiments and the theory community, to improve our Monte Carlo simulations, be it the matrix elements or the parton shower modeling. And lastly, we also need to really improve our calibration methods. This is, I think, the last step, but also a crucial step before we can actually use all these advanced algorithms in real experimental analyses. So as I conclude, I would say that to really push things further, we need deep thinking as physicists, but we also need deep learning, with a very strong synergy with the developments in the computer science and machine learning communities; only with the combination of the two can we really push the performance even further. So, okay, that's all from my side. Thank you. All right, thank you very much. That was a very interesting talk. And yeah, thank you for describing so many different interesting architectures. So let's move on to questions. Does anyone in the audience have any questions for the speaker? Yes, Rob. Yes, hi, thank you very much for the talk. I have a few kind of technical questions. The first one is about your transformer architecture where you add in this interaction term. So I was wondering: I've read other papers that use transformers where they just introduce a bias in the softmax, which sounds kind of similar to what you're doing. Is this significantly different? I think it's very similar; essentially, it's a bias term that is added before the softmax. Right, so because these are not trained, right? They are direct inputs; where is the embedding trained? Yeah, it's only the embedding, yeah. Right, okay. Okay, the other thing I was wondering about is: you incorporate particle flavor, for instance, in this transformer architecture as well, right? Yes. Do you do that by embedding it, or is there some other way of doing that?
So we basically use one-hot encoding to encode the particle identification, the PID inputs, and then they are treated together with the other continuous features in the particle embedding. Right, okay. So it just looks like the four-vector appended with a one-hot vector. Yeah. Yes, okay, great. Thank you very much. Thank you. Yo, hello. Thanks also very much from my side for this great talk. It's super impressive to see all of those various advances that you have put together. And that's also where my question comes from, which unfortunately isn't super well defined, but I was wondering if you had any sense of how close these advanced techniques are actually taking us to the absolute limit of, you know, b identification or flavor identification, given a certain dataset, or maybe more practically, given a certain detector. What, in your sense, is the main bottleneck going to look like for the years ahead? Is it going to be tuning the architecture? Is it going to be retaining calibratability of the result? Is it going to be something totally different? Sort of, what are the cornerstones that we shouldn't lose sight of? Yeah, I think that's a very good question, and also, I think, a difficult one to answer. I don't think we can easily see where the upper limit of the tagging performance is, because so far what we have been doing is just to push it a bit further and see how far we can go. My feeling is that we still haven't really hit the wall; I think there's still some room to gain. This particle transformer, you see, with a large dataset, already improves the performance quite a bit. But if you compare with the really large models used in natural language processing and computer vision, this is probably one to two orders of magnitude smaller in terms of number of parameters.
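The one-hot PID encoding described in the answer can be sketched in a few lines. The class list and helper below are hypothetical, simply mirroring the particle-flow candidate categories mentioned earlier in the talk:

```python
import numpy as np

# Hypothetical PID vocabulary, modeled on the particle-flow categories
# (the actual feature set used in the tagger may differ).
PID_CLASSES = ["charged_hadron", "neutral_hadron", "photon", "electron", "muon"]

def encode_particle(kinematics, pid):
    """Concatenate continuous kinematic features (e.g. a four-vector)
    with a one-hot particle-ID vector, as fed to a shared per-particle
    embedding layer."""
    one_hot = np.zeros(len(PID_CLASSES))
    one_hot[PID_CLASSES.index(pid)] = 1.0
    return np.concatenate([np.asarray(kinematics, dtype=float), one_hot])
```

The embedding layer then learns a dense representation jointly from the continuous and categorical parts, so no separate categorical embedding table is needed.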
So I think, yeah, if we can really scale up to that level, we can see what performance we can get. If we don't see much improvement, then maybe that's an indication that we are hitting some limitation. But then I think another, I would say equally and probably more important, practical aspect is really to understand and control the systematics, and to think of ways to better calibrate these taggers, because in the end, if we only get a 10 or 20% improvement in the efficiency, but we suffer from 20% scale factors or 50% uncertainties, we are still not going to improve the outcome of the analysis. So I think as experimental physicists, we do need to think about both aspects. We need to push the machine learning, we also need to think about how to calibrate, and also, I think with our theory colleagues, to think about how to improve our simulations, because all of these combined will really benefit the actual physics reach of these advanced techniques. Thanks a lot, great answer. Thank you. Yes, Federick. I don't know if the others can hear you. I can't hear you. Oh yeah, I'll try this from the side now. I cannot hear you. Can you hear me? Hello? Yes, now I can, yes. Okay, yeah. So I was just saying, have you looked at direct comparisons between the particle transformer and LundNet? Do you have a sense of whether the particle transformer can get some insights that you would otherwise miss? Okay, I think I missed the last part, but let me just try to answer, and if anything is missing, you can follow up. Sorry, my Wi-Fi is not very stable. No problem. So we haven't really looked at the comparison of the particle transformer with some of the more recent models like LundNet or even LorentzNet. I think one technical challenge in looking at LundNet is to figure out a way to really scale up the training pipeline, because right now it tries to load everything into memory and uses FastJet to convert things.
And now with this very large, 100-million-jet dataset, we have to figure out how to technically adapt the training pipeline. Yes, okay. Thank you very much for your nice talk. Thank you. Thank you indeed. And I was thinking of actually asking a question myself. It's more connected to the H→cc̄ analysis you showed. You use a ParticleNet network in that one, and I was wondering what kind of events you use for training, because you mentioned that you decorrelate the mass. Do you use generic multijet events, so random ones, not necessarily enriched in bb̄ or cc̄ or with a lepton? Or do you also actually use processes that are simulated and connected to the analysis itself, like H→cc̄ and Z→bb̄ in V plus jets? Okay, so you're asking about the training samples for the tagger, right? Exactly, yeah. Okay, so the signal sample is like a bulk graviton, so a new resonance decaying to a pair of Higgs bosons. And basically we have two varying masses: one is the resonance mass, and the other is the Higgs mass, which ranges between 15 and 250 GeV. By varying these two parameters, we get pretty good coverage of both the full pT range and the full mass range, and then we can perform the training of these algorithms in a mass-decorrelated fashion. Okay, and could I ask you for more details on how you then approach, for example, the V plus jets background? Do you split by flavor, like do you distinguish V plus cc̄ from V plus bb̄? It seems to be the case, I don't know. Ah, so we don't specifically train on V plus jets; for the signal it is like that, and for the background we just use QCD multijet, so there we get jets of all kinds of flavors, like b quarks, or gluons splitting to bb̄ and cc̄.
And then when we evaluate on the actual backgrounds of this analysis, like V plus jets and tt̄, of course we also need to calibrate them and get correction factors for the backgrounds. For that, we use a basically data-driven method for the background estimation: we always apply the same cut on this tagger in the signal regions and in the corresponding control regions. This ensures that in the control regions we have the same jet flavor composition, the same V plus heavy-flavor composition, as in the signal regions. Then we can take the normalization difference in the control regions and transfer that to the signal regions to estimate the V plus jets background. So that's basically what we did. Right, thank you very much. That's a great answer. Thank you. Does anyone else in the audience have any questions for our speaker? It doesn't seem so. So let me thank you again for a really great talk, and thank you everyone for joining us. So, this was the last seminar of this term; we'll be back next term with a new programme that we will announce on the usual channels. See you then, and thank you again for your talk. Thanks a lot. Thank you very much for the invitation. Thank you very much. Take care.
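The control-region transfer described in that answer boils down to simple arithmetic. Here is a minimal sketch with purely illustrative numbers; the function name and the inputs are hypothetical, not the analysis code:

```python
def estimate_background(n_data_cr, n_other_cr, n_mc_sr, n_mc_cr):
    """Data-driven background estimate via a control-region transfer
    factor: subtract the other backgrounds from the data yield in the
    control region (CR), then scale by the simulated signal-region /
    control-region (SR/CR) ratio for the target background."""
    n_target_cr = n_data_cr - n_other_cr   # target background observed in the data CR
    transfer = n_mc_sr / n_mc_cr           # SR/CR transfer factor from simulation
    return n_target_cr * transfer
```

Because the same tagger cut is applied in both regions, the heavy-flavor composition (and thus the tagger efficiency) largely cancels in the transfer factor, which is the point of the method.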