Okay, hello everybody. My name is Alejandro de la Puente. I am a particle physicist, but I'm also working at the National Science Foundation. And today I'm really, really, really excited, because today we have our first colloquium ever as part of a new series that we're going to host, the Latin American Webinars in Physics. I'm very excited because I'm going to introduce a friend, a colleague. His name is Jamie Geiner. Jamie is currently at the University of Hawaii at Manoa, where he's a postdoc; he's a lucky, lucky fellow. Jamie received his PhD at Stanford University, and after that he did two postdocs, one at Argonne National Lab and Northwestern University, and the other at the University of Florida, where I also had the pleasure of seeing him and giving a talk. So we're excited, because today we're not going to go away from physics; we're going to stay within this physics world, but we're going to talk about aspects of computer science that help physicists do their jobs better, especially when we have to deal with lots and lots and lots of data. So with that, I'm going to introduce Jamie. The title of his talk is going to be Introduction to Machine Learning in Particle Physics, and I hope you find it really interesting. Like always, you can ask your questions through the YouTube channel, on the right-hand side of the YouTube video, or through Twitter using the handle Love Physics. And with that, I leave you to Jamie. Thank you so much, and we look forward to hearing from you.

Well, thank you very much, Alejandro, for the introduction. Hopefully everybody can hear me and see my first slide, which should be up. It's working very well. It's working very well. Excellent, excellent. So today I'd like to give you an introduction to machine learning, and then I'll say a bit about applications of machine learning to particle physics. Not today, but soon, I'll have the slides for this talk available on my GitHub page, and maybe other places as well. My goal in this talk is first to give a self-contained introduction to machine learning for people in particle physics, astroparticle physics, and so on; that'll be the bulk of the talk. Then at the end I'll also talk about how this stuff is applied, mostly to particle physics. There's only so much time, and that's the subfield in which I work, so that's what I wanted to focus on in terms of applications. So briefly, I'll say some words about what machine learning is, move on to a short history of machine learning, artificial intelligence, and computing, go through some of the main algorithms while introducing some of the basic ideas in machine learning, and then, as noted, talk a bit about the applications of machine learning, really focusing on particle physics.

So what is machine learning? Machine learning is an approach to artificial intelligence where the algorithms learn from the data. Artificial intelligence, I think, is hard to define, because what is intelligence? But I think we all roughly know what intelligence is, and of course artificial intelligence is trying to make computers that are intelligent. Machine learning algorithms come in two big classes. One is supervised algorithms: this is when the algorithms are learning from data where we know the right answer.
In particle physics, this often means that we're training our algorithms on samples of events that we've simulated using Monte Carlo generators, where we'll have a sample of signal events and a sample of background events, and we use the fact that we know which are signal and which are background as part of the training. The alternative to supervised algorithms are unsupervised algorithms. This is when we don't really know the right answer: we're not training on some labeled data set, we're just trying to find patterns in the data. How many clusters are there? What are the clusters? That sort of thing.

Now, especially coming from a physics background, when I say something like machine learning is when we learn from the data, that might seem like a trivial definition, because isn't that how we always learn, especially in the sciences? I think the answer is that machine learning is at least in part defined by being different from another, historically very important, approach to AI, which is the use of expert systems. What are expert systems? These are the things you build when you're doing an older approach to AI called symbolic AI, now sometimes called good old fashioned AI. Symbolic because it's based on knowledge that can be encapsulated in rules and equations and things like that; good old fashioned because this was the dominant approach in the latter part of the 20th century, but is no longer. Specifically, what are we doing when we build expert systems? We're figuring out how to solve problems ourselves and then telling the computer what we know, and then using the computer's ability to perform many calculations and look at many possibilities, but evaluating those possibilities using rules or heuristics that a human has given it.

I think the example that makes this a lot easier to understand is the concrete case of building computers that can play chess. Hopefully most of you are at least somewhat familiar with chess; I'm not very good, but I know the rules. So if you wanted to build an expert system for chess, and of course people did this, what do you do? Well, you want your chess player to select what's called an opening. People have looked at the ways you can start a chess game and analyzed every set of moves you can make; it's called the opening book. You want your expert system to pick some opening and in some way know its pros and cons. You then tell your computer how to rank various moves and various positions. You tell it that it's really good to keep your queen, that it's more important than a rook, that rooks are in general better than bishops or knights, and you can assign point values and so on. It's important to control the center of the board. It's good to protect your king. You probably want your rooks to have clear rows and columns to act along, and you want your bishops to have diagonals they can move along. These are all human-discovered rules that you're giving to a computer, which can then use them to evaluate a tremendous number of positions, looking as far ahead as possible just by brute force. In fact, in the late 90s, IBM built a chess-playing supercomputer, or sort of a supercomputer: they actually added a special chip optimized for evaluating chess positions. And this computer, Deep Blue, defeated the world champion at the time, Garry Kasparov, three and a half games to two and a half games in 1997. Deep Blue is an expert system.
It was using the procedure I outlined on the preceding slide. It was evaluating 20 million positions per move, and there are basically maybe on the order of 10 reasonable moves you can make on a given turn in chess, so this could be seven or eight moves ahead, maybe a little less; six or seven moves, let's say. But again, it's using a program where it knows whether a particular arrangement of the pieces is good or not based on the wisdom of human chess players, who played a role in programming the computer. So it's an expert system.

That example was chess. But if you're thinking about the great classic board games, maybe the one that leaps out at you once you've more or less solved chess is Go. This is a much simpler game in terms of rules. Basically, you have a board and you can put stones on the intersections of the grid lines; there are 19 by 19 points where you can put the stones, and you're essentially trying to capture territory on the board. It's a simpler game in terms of the rules: you don't have to explain castling and how all the pieces move differently and all the wonderful complexities of chess. But it's still harder for a computer, because you have a 19 by 19 board, so depending on how many stones are already on the board, there are 100, 200, 300 possible moves on a given turn, rather than the order 10 that we had for chess. So it's much harder for the computer to evaluate all the possible future positions than it is in chess. For this reason, people thought that maybe we'd be able to build computers that beat humans at Go sometime in the 2020s, that maybe that's when we'd have the computing power to do it.

So a group called DeepMind, originally its own startup that was acquired by Google, so it's now Google DeepMind, built a program to play Go called AlphaGo. This program is not an expert system like Deep Blue, but instead is based on machine learning. Specifically, it was trained on a data set of 100,000 games of strong amateurs, because this was the data that was available online in large quantities. From this it learned how to mimic a competent amateur, a good but not great human player, basically just to learn roughly what someone might want to do when playing Go. With this background, it then trained by playing itself millions of times, getting a huge data set, using what's called reinforcement learning, where we're rewarding strategies that win and penalizing strategies that lose. AlphaGo defeated the European Go champion five to nothing in five games in October 2015, which was the first time a computer had beaten a human professional Go player without a handicap, that is, without basically extra points or extra moves. And this was an amazing achievement, but the center of Go is not in Europe, it's in East Asia, so it was seen as important to have AlphaGo take on one of the top human players. So in March 2016, AlphaGo played Lee Sedol, an 18-time world champion from Korea, in Seoul, and defeated him in four of the five games. This was very widely followed in real time; something like 80 million people watched the first match on TV. There's a wonderful documentary called AlphaGo about all of this that you might find interesting.
Something that's fun, and it's actually true of Deep Blue as well: with both expert systems and machine learning, when you build these computer systems to solve problems, sometimes they do things that we just don't understand, and that's a fun thing, I think, to think about, comparing computer solutions with the way we as humans think about these problems. So that was AlphaGo; that was the state of the art way back in 2016. Since then, the state of the art from DeepMind is a program called AlphaZero, which trains entirely from self-play, and it can do games besides Go. In basically a few hours of training, it was able to beat the state-of-the-art computer programs for Shogi, which is the board I've shown on this slide, a sort of Japanese chess-like game, and for chess itself.

Having given some examples for flavor, and said a little bit about what machine learning is, what expert systems are, and a little about the contrast and competition between those approaches, I'll now go through a brief summary of the history of computing and the history of AI, really with the idea of it being a history of machine learning. But it's important to understand this in the broader context of AI, and to understand AI in the broader context of the history of computers. Along the way I'll highlight a few of the algorithms that we'll look into in more detail when we really drill down on what machine learning is. Just for the fun pictures I have on this slide: the one on the left is the blueprints for one of Charles Babbage's early mechanical computers (I just love these old-timey blueprints), and the one on the right is one of the early attempts at self-driving cars. Of course, the kind of intelligence you need to drive a car is something that people now put a lot of work into developing through machine learning.

So, to start with the history of AI, I'm only going to go back to the 19th century, though you can imagine going back much further. Of course, to do AI you need something artificial that's capable of doing calculations, which broadly we should call computers. Computers, as calculating machines on some level, go back to antiquity. But probably the first thing you could really call a modern computer was developed in the 1820s by Charles Babbage, who was the Lucasian Professor at Cambridge. This is the same job that Isaac Newton had, that Stephen Hawking had, that Dirac had; a good gig. It's a modern computer because it has a memory that is distinct from the processing part of the machine, and another part of the machine for dealing with input and output, what we call IO. Some of his later designs used punch cards, which of course were used in computers well into the 20th century. These machines were primitive by modern standards, and although he made blueprints and designed things that we know today would have worked, he wasn't actually able to get them built in his lifetime. He was a theorist; he couldn't always talk the people who could actually build these things into doing it. Anyway, his best blueprint, the Analytical Engine, had a memory of 675 bytes and could do about seven operations per second. My very generic couple-year-old laptop is larger by a factor of about ten to the nine in memory and by a similar factor in speed. But still, it's a modern computer and a tremendously important step forward.
An interesting note concerns his assistant, Ada Lovelace, born Ada Byron because she was the daughter of the famous English poet Lord Byron. There are notes in her handwriting about how to actually calculate mathematically important numbers with this machine. We don't totally know the whole story, whether this was a collaborative effort or not, but she's really the only person who has a good claim to be the world's first computer programmer, not in any language, just in the settings you would dial in on an Analytical Engine. As I noted, none of these machines were actually built in his lifetime, though people have gone back, because of the great historical interest, and shown that they would have worked. In fact, there are some fun things online where they have emulators and you can run your programs on these machines. It's fun, though maybe not so practical, but indeed these machines would have worked if they had been constructed.

In a sense, machine learning also begins in the 19th century, with the idea of using things like least squares to fit curves and so on, to perform regression. This goes back, as many things do, to Gauss, at the very end of the 18th century, though he was a little casual about publishing, so the method was first published by Legendre early in the 19th century. I'll talk about how regression and curve fitting are machine learning after I finish my historical introduction; it's a very familiar thing to do that turns out to contain many of the ideas we ultimately want to look at in machine learning.

In the 20th century, of course, we get electronic computers. The early models were not amazing in terms of memory or power, but they were a huge step forward; these are the big room-sized computers with vacuum tubes and all of that. When I learned this I was kind of surprised at the date: neural networks actually go all the way back to the 1940s, to a guy named Walter Pitts, who had an interesting but rather sad life, really an off-scale genius who just didn't do the human thing very well. A little later, and there was nowhere near the kind of computational power to do sophisticated things with neural networks, came one of the first really important algorithms, something you can think of as a simple neural network, called the perceptron. It was originally implemented not as an algorithm in software but as a machine, the room-sized thing at the bottom of my three pictures on this slide. I'll say more about neural networks and perceptrons when we go through the algorithms later.

So already in the 40s, 50s, and 60s, some of these machine learning algorithms were being developed. But machine learning did not end up being the dominant approach in AI from, say, the late 60s through the end of the century. Part of this period is sometimes called the AI winter; it wasn't looking good for AI at all. So the whole period is a machine learning winter, and the end of it is maybe an AI winter; the field was not doing well. There are some different reasons for this. In terms of machine learning, perceptrons were probably overhyped when they first came out in the late 50s and early 60s.
The New York Times in 1958 described the perceptron machine as the embryo of an electronic computer that the Navy, the American Navy, I should say, who paid for it to be built, expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence, which of course is very dramatic. Conscious of its existence is not something I think we even have today; I do sort of hope that we can do a lot of these other things, but certainly in 1958 this was way too optimistic. Various experts, Minsky, a famous computer scientist, among them, pointed out some of the limitations of perceptrons in the late 60s. An important funding agency for American research, at the time the Advanced Research Projects Agency (it's now the Defense Advanced Research Projects Agency, DARPA), became less willing to fund pure computational research in the late 60s and early 70s, so that slowed research in these areas. Add to that the limitations of early machine learning, when there just wasn't a lot of computing power, and this led to more focus on expert systems rather than on machine learning.

In the early 80s there was a whole private industry building machines like the one I've shown in my picture on this slide, called Lisp machines, which were workstations optimized for running expert-system AI programs using a language that's very good for that, called Lisp. But this market was disrupted by general-purpose workstations, and ultimately by good desktop computers from people like IBM and Apple. That basically killed the industry, and with it a lot of the commercial interest in AI. But there were other problems too. One of them is that expert systems in general are hard to scale. It's fine to build an expert system for chess; chess is not getting more complicated, it's still going to be an eight-by-eight board in a thousand years. But with various human-scale systems, you're going to want to scale things up as they develop, and it's actually hard for expert systems to learn. The field was maybe further hamstrung by funding cuts in the 80s and 90s as well.

But now we're in the 2000s, the 21st century. There are two big images on this slide that I want to explain. The one on the left shows, in a very heuristic way, how in the 80s and 90s and before, when there wasn't as much computing power available and when the scale of systems was smaller, approaches other than machine learning (this graphic actually focuses on a particular machine learning algorithm, neural networks) were generally the best way to do things. But as we go up to higher and higher scale, which we can do with more and more computing power, neural nets become the optimal approach. The second image on this slide is an expression of what's called Moore's law, the idea that computational power increases exponentially. This plot is actually in a different metric, calculations per second per thousand dollars, and the increase is actually faster than exponential. And you can see that we're reaching the point where, before long, it's going to be pretty easy to have machines with computing power comparable to brains.
And that is also something that drives our ability to think about doing very ambitious things in machine learning, when our computers are as powerful as the things that we already know do learning, namely animal and particularly human brains. Driven by this dramatic increase in processor power, in the 2010s we get dramatic growth in machine learning within AI, dramatic growth in interest in neural networks within machine learning, and then ultimately a lot more interest, within neural networks, in relatively complicated neural networks, which have, and I'll make clear what this means, many different hidden layers. These have become more prominent in recent years. I've shown Google Trends data for searches on artificial intelligence, machine learning, and deep learning. We see that early in the period, machine learning wasn't necessarily the dominant way of thinking about AI, even as late as 2004. But today people search for machine learning more often than they do for artificial intelligence, even though machine learning is a subset of artificial intelligence, and an appreciable fraction of these queries are for deep learning.

So now I'm going to switch gears a bit and talk about some of the algorithms I've mentioned in going through the history. One that I really wanted to talk about, because I think it makes clear that machine learning is not some different, mysterious other thing that we don't know about, but really a very straightforward extension of things we already know and already do as physicists, is regression. So what is regression? Well, we have some data, maybe simulated, maybe real, which we can think of as n input vectors, called x_i. Each x is a vector of m real numbers, so a vector in R^m. For each of these vectors we have some known function value, which we'll call y_i, where y_i is given by some true function, which we may never know, evaluated at x_i, plus a noise term; there's always experimental error, that sort of thing. Just to introduce some machine learning terminology: we would call these vectors samples, the components of x (the dimensions) are called features, and the vector space itself is called feature space. There are two very important cases. One is where the y_i, the true values, take on a handful of integer values, maybe just two, zero and one say, or maybe zero, one, two, whatever. These correspond to class labels: maybe zero means background and one means signal. Then we have what's called a classification problem. The other situation is where these y values can take on continuous real values, and then it's regression. As you can see, classification is really just a special case of regression, though you do things a little differently with classification.

So how are we going to do this analysis? I want to go through a little more detail to introduce more of the big ideas in machine learning. Your goal in regression, or curve fitting if you want to call it that, is to find an appropriate function of our x, which is also a function of some set of parameters, which here I'm calling beta. An example of this might be linear regression.
So we're going to have some vector beta that we dot into x, plus some overall constant alpha; here alpha and beta are parameters. There's also a choice of what kind of function we use: is it linear, quadratic, exponential, logarithmic, something else? The choice of the model, how many terms it has, things like that, are what are called hyperparameters. They're not parameters, they're sort of meta-parameters, but people say hyperparameter rather than meta-parameter. To actually perform the fitting, once we've chosen our function, our hyperparameters and so on, we're going to minimize some function with respect to the parameters. It's often called an objective function; some people call it a loss function or loss metric, or a cost function. The classic example of an objective function, and why I'm calling this least-squares regression, is least squares: your objective function is the sum, over your data set, of the squared differences between what your fit function gives with some set of parameters and the true values. For each of these differences you square it, and you add them all up, and you're trying to minimize this sum. As we know, this can be challenging, because when you look at the value of this sum of squares as a function of your parameters, you can get local minima. In general, you'd like to be in a situation where your function is such that it's easy to calculate things like gradients, and maybe even more advanced information about the function like the Hessian, second derivatives, that sort of thing, so we can use them in optimization algorithms; an important example is steepest-descent minimization, though of course there are many, many others.

If you're doing classification, it's only slightly different. Basically you're doing regression, not on zero and one, but on what's called the log odds; you're actually trying to find the right probability for each point, and you generally use an objective function related to maximum-likelihood ideas, which, in an interesting way, ends up being related to entropy.

The big problem, of course, in doing anything like this is what's called overfitting. I've shown an example of overfitting on this slide, where we have some points, the little blue dots, and the correct best-fit line through them; the points are actually generated from that line plus a relatively large noise term. If we try to actually fit all the points, we get something like that orange curve that hits many of the points in the left plot. But of course this is nuts: you look on the right, where we use basically the same underlying function with statistically equivalent data, just different noise terms and different random x values, and the orange curve doesn't fit the data well at all. This is always a problem. We all know you can fit anything if you have enough parameters, but you don't want to do that, and it's especially a problem with some of the powerful machine learning techniques I'll talk about, like neural networks, because you have so many parameters that it's actually very easy to overfit.
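Before getting to the procedure for avoiding overfitting, here is a minimal code sketch of the least-squares fitting just described, not from the talk and using invented synthetic data: a linear model f(x) = beta . x + alpha, fit by steepest descent on a squared-error objective, with a held-out set to show how overfitting would be noticed.

```python
# Minimal sketch (not from the talk): least-squares linear regression fit by
# steepest descent, with a held-out set to check how well the fit generalizes.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - x2 + 0.5 plus noise (invented for illustration)
X = rng.normal(size=(200, 2))                    # 200 samples, 2 features
y = 2.0 * X[:, 0] - X[:, 1] + 0.5 + 0.3 * rng.normal(size=200)

# Hold some data out so we can check the fit on points we did not train on
X_train, y_train = X[:150], y[:150]
X_val,   y_val   = X[150:], y[150:]

def predict(X, beta, alpha):
    return X @ beta + alpha                      # f(x) = beta . x + alpha

def loss(X, y, beta, alpha):
    r = predict(X, beta, alpha) - y
    return np.mean(r ** 2)                       # mean squared error (the objective)

beta, alpha = np.zeros(2), 0.0
lr = 0.1                                         # step size, a hyperparameter
for step in range(500):
    r = predict(X_train, beta, alpha) - y_train
    grad_beta  = 2.0 * X_train.T @ r / len(y_train)   # d(loss)/d(beta)
    grad_alpha = 2.0 * np.mean(r)                     # d(loss)/d(alpha)
    beta  -= lr * grad_beta                      # steepest-descent update
    alpha -= lr * grad_alpha

print("fit parameters:", beta, alpha)
print("training loss:  ", loss(X_train, y_train, beta, alpha))
print("held-out loss:  ", loss(X_val, y_val, beta, alpha))
```

If the held-out loss were much larger than the training loss, that would be the symptom of overfitting that the three-way data split described next is designed to catch.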
To avoid overfitting, the biggest thing you do is divide the data set into three parts. The first part is what's called the training set: as the name implies, you train your algorithm on this set, and this is where you actually obtain your parameters by minimizing your objective function. Then you have another set of data that you haven't fit on, where you test various choices of, say, what the function is, and you make sure that you haven't overfit. And finally there's a third set: once you've made your decisions about what function to use and what hyperparameters to have, and found the parameters through optimization, you see how well you do on this third set of data that you haven't used previously. That gives you an unbiased idea of how well you've fit things, and whether you've overfit or not. So that was what I wanted to say about regression, introducing these ideas of objective functions, parameters and hyperparameters, training sets, validation sets, test sets, and other important ideas in machine learning.

Now I want to say some words about maybe the first machine learning algorithm to be really exciting to the community, even though it was developed after neural networks: the perceptron. The use case for the perceptron is that we have data, shown for instance in this somewhat cartoonish plot as blue and red circles, and we want to classify it, with the assumption that it is linearly separable: that you can draw some line in two dimensions, a hyperplane in general dimensions, that separates the data into two classes. This dotted gray line, the separating hyperplane, is defined by some vector w and, since it doesn't go through the origin, a bias b. The correct choice of w and b will have the dot product of w and x, plus b, less than zero for, say, the red points, and greater than zero for the blue points. To actually find this w algorithmically, you take some initial guess and evaluate w dot x plus b for each point. If it has the right sign for a given point x, you don't do anything; but if it's wrong, you add some multiple of that data point to the vector, the idea being that you're pushing the vector to line up perpendicular to the separating plane, because it's basically in the direction of the average of the points in that class. A little technical aside: we can think of that bias b as just another component of the vector, and then the algorithm treats the bias the same way as the other weights.

So this is a cool algorithm and it works; it always converges if a solution exists, and so on. But a drawback is that it just stops at some separating hyperplane, and if you think about it, there are actually many. Here, for instance, I have slightly different but qualitatively similar data, and we see a hyperplane, the dotted gray line, that's very, very close to one of the data points. Of course this is dangerous, right? We all know there are statistical fluctuations in our data set. Probably the real boundary between red and blue is not where this line is; it's very likely that if we had a larger data set, we'd have some red point on the other side of it. So you don't actually want your hyperplane to be very close to your data points; it's much more robust to get it as far from any of the points as possible.
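Going back to the perceptron update rule described a moment ago, here is a minimal sketch of it, not from the talk, on invented linearly separable toy data, with the bias folded in as an extra component as mentioned in the technical aside.

```python
# Minimal sketch (not from the talk) of the perceptron update rule.
# Labels are +1 and -1; the bias b is folded in as an extra "always 1" feature.
import numpy as np

rng = np.random.default_rng(1)

# Linearly separable toy data in 2D (separable by construction)
n = 100
X = rng.normal(size=(n, 2))
labels = np.where(X[:, 0] + 2.0 * X[:, 1] - 0.5 > 0, 1, -1)

Xb = np.hstack([X, np.ones((n, 1))])     # append the bias component
w = np.zeros(3)                          # weights; last entry plays the role of b
eta = 1.0                                # learning rate

for epoch in range(100):
    mistakes = 0
    for x, label in zip(Xb, labels):
        if label * (w @ x) <= 0:         # wrong sign: point is misclassified
            w += eta * label * x         # push w toward (or away from) this point
            mistakes += 1
    if mistakes == 0:                    # converged: every point classified correctly
        break

print("separating hyperplane (with bias):", w, "after", epoch + 1, "passes")
```

As noted in the talk, the algorithm stops at the first hyperplane that classifies everything correctly, which is exactly the limitation that motivates the margin idea discussed next.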
So here is the same data again, and now we've shown a better hyperplane separating our points. The distance from the nearest points to this plane is called the margin, and these nearest points are called support vectors. As a result of that somewhat funny name for the points nearest the separating hyperplane, this algorithm is called support vector machines. We find the best hyperplane in support vector machines by using Lagrange multipliers and so on to optimize the margin subject to constraints. Only these nearest points, only the support vectors, actually affect that margin, via dot products. And that's physically reasonable: only the points close to the boundary affect where the boundary should be; some red point very far to the right or some blue point very far to the left has no effect on the best way of dividing the space.

Now, of course, not all data is linearly separable, so you might think, well, gee, this is a real limitation of support vector machines, or of perceptrons in general. However, it turns out that's not the case, because data that is not linearly separable can be handled by some transformation of the data. An example is shown in the plot here. Suppose we just had a unit circle, with blue points inside the circle and red points outside. It's very easy to draw a boundary between the red and blue points, it's just not a line. But if we transformed our coordinates to new coordinates involving x squared and y squared, then it would be very easy to draw a line in this new space, where the data is linearly separable. It turns out you can do this in general. As I mentioned before, the boundary turns out to depend only on scalar products involving the support vectors, and if you replace the scalar product with what's called a kernel function, you can basically get the dot products in the new space without doing the transformation explicitly. You don't want to have to think of the new space where your data actually is linearly separable; in this example it's easy, but in general it's not. You want to pick some kind of kernel function that gives you the right answer, and if you have a sufficiently complicated kernel function, it will work for many different possible transformations. The example here: if your kernel were, instead of the dot product between two vectors, the square of the dot product, and you wrote that out component by component, you would see that it actually is a dot product in the transformed space I showed above. You can also convince yourself that if the kernel were some long polynomial, it would work for a boundary that's an arbitrary polynomial function; an exponential, for instance, would get you a long way. So this kernel trick really lets you use support vector machines very generally, which is an incredibly powerful result. Unfortunately, when you start using these nonlinear kernels, actually doing the calculations gets to be very computationally intensive, which is part of why SVMs probably had their heyday in the early 2000s but are now less popular than, say, neural networks.
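As a small numerical check of the kernel identity just mentioned (not from the talk, with invented vectors): for two-dimensional inputs, the kernel K(x, z) = (x . z)^2 really is an ordinary dot product in the transformed space (x1^2, sqrt(2) x1 x2, x2^2), which is why a circular boundary in the original coordinates becomes a linear one there.

```python
# Quick check (not from the talk) that the squared-dot-product kernel equals a
# dot product in the explicitly transformed space, so the transform is never needed.
import numpy as np

def phi(v):
    x1, x2 = v
    return np.array([x1**2, np.sqrt(2.0) * x1 * x2, x2**2])   # explicit feature map

def kernel(x, z):
    return (x @ z) ** 2                                       # kernel on the original space

rng = np.random.default_rng(2)
x = rng.normal(size=2)
z = rng.normal(size=2)

print("K(x, z)         =", kernel(x, z))
print("phi(x) . phi(z) =", phi(x) @ phi(z))   # same number, no explicit transform used
```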
My next example is neural networks, so I'm going to explain how a neural network works, basically working my way through this diagram of a particular neural network. Though first, I want to go back to what I was saying about regression, where you'll remember that our goal was to find some function f with parameters theta, a function of x, which is some vector in our input space, and that we find the best values of theta by minimizing some objective function. Well, that's all still true for neural networks, so this is actually very familiar. The only thing that's changing is that instead of our fit function being some kind of linear or quadratic function that we may be more familiar with, or a Gaussian, say, our function is this neural network with some particular architecture. The architecture is specified by hyperparameters, and the parameters that we obtain by optimizing some loss function, which might still be something like least squares, are the weights of the neural network; we'll see what those are in a slide or two. It's the same workflow as for other classification and regression problems, the same business of taking our data and having a training set, a validation set, and a testing set. It's just that now we have a more complicated function.

So what is this function? Well, this function involves different layers. The first layer is what's called the input layer, and it consists of input neurons. The circles are neurons, and neurons are little machines that do something very simple. Here, each input neuron looks at one component of our input vector, and that's its output: the x1 neuron tells us the x1 component of our input vector x, the x2 neuron the x2 component, and so on. We also have a bias neuron, so that we can have some kind of offset; it need not be true that the vector x equal to zero maps to zero. The next step involves both weights and a hidden layer of neurons. These weights we can describe as w_ij, where i and j tell us which neurons are connected: w_ij is a weight connecting some neuron j in the input layer to some neuron i in the hidden layer. I think you can stare at this for a second and see that this means the inputs to the neurons in the hidden layer are given by matrix multiplication. Look at a particular hidden-layer neuron, say h1: its input is w_10 times the bias neuron output, which is one, plus w_11 times the output of neuron one, which is x1, and so on. Since this is really just matrix multiplication, matrices, or you could say tensors, this is why, for instance, one of the most popular frameworks for working with neural networks is called TensorFlow, and why Google, to try to speed up neural network calculations, has developed what are called tensor processing units, dedicated hardware for doing these sorts of manipulations. Now that we have inputs to the neurons in the hidden layer, each neuron takes its input and applies some function y to it. It's important for that function to be easy to evaluate computationally, to have relatively compact output, to be differentiable (we'll see why that's a big one), and to be monotonic. Normally this is something like a sigmoid function or the hyperbolic tangent, and this function y is called the activation function, or activation, of the neuron. In this network we then have another set of weights, which multiply the outputs of the hidden-layer neurons to give us the input to one output neuron.
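Here is a minimal sketch of the forward pass described so far, not from the talk, with invented layer sizes and random weights: an input layer plus a bias, a weight matrix into one hidden layer with a sigmoid activation, and a second set of weights giving the input to a single output neuron (the output-neuron activation is the next step, described below).

```python
# Minimal sketch (not from the talk) of the forward pass just described.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))          # a typical activation function

rng = np.random.default_rng(3)

n_inputs, n_hidden = 3, 4
W1 = rng.normal(size=(n_hidden, n_inputs + 1))   # weights w_ij; column 0 multiplies the bias
W2 = rng.normal(size=(n_hidden + 1,))            # weights from hidden layer (plus bias) to output

x = rng.normal(size=n_inputs)                    # one input vector

x_with_bias = np.concatenate([[1.0], x])         # bias neuron always outputs 1
hidden_in   = W1 @ x_with_bias                   # matrix multiplication gives the hidden inputs
hidden_out  = sigmoid(hidden_in)                 # each hidden neuron applies its activation

h_with_bias = np.concatenate([[1.0], hidden_out])
output_in   = W2 @ h_with_bias                   # input to the single output neuron
print("input to the output neuron:", output_in)  # its activation is applied in the next step
```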
We then apply an activation function to the output of this output neuron, and that gives us the final value of our function f, which is a function of the parameters, really the weights w, and of x. Since we chose our activation functions to be analytically differentiable, and since the input to a given neuron is found from the previous layer by matrix multiplication, it's actually relatively straightforward to calculate the gradient of our objective function, something like a sum of squares, with respect to our weights, using the chain rule; it's just composition of functions. Doing that in a specific, efficient way is called backpropagation, which is a word you'll see a lot. There's also other information we can obtain, things like second derivatives, Hessians, that sort of thing. And so, using function information, gradient information, maybe Hessian information, we then use some optimization algorithm to minimize the objective function, the cost function, with respect to the weights of the neural net. That's what we mean by training our neural net. Generally there are many local minima in the space of weights, so this can be a nontrivial problem, and people think a lot about the best way to do it.

In our example we had a single hidden layer of neurons, and data moves only in one direction, from the input towards the output. A single hidden layer means it's what's called a shallow neural network; having multiple hidden layers, and you can easily see what the extension would be, gives what are called deep neural networks, and using them is called deep learning, which of course is a very important term nowadays. The fact that data moves only in one direction means it's called a feedforward neural network; the alternative, where we take some value from a given layer and feed it back to a previous layer, leads to what are called recurrent neural networks, a very interesting area of study. I'm mentioning the next point mostly because it explains a lot of terminology you'll see, though I don't always think it's the most helpful way to think about things: consider what we do in the perceptron algorithm, where we evaluate the product of a weight vector with an input vector. If you think about it, this is itself a neural network. There's our same input layer, with weights as before, but the weights map only to a single output neuron, where the activation function is basically just checking whether the sign of w dot x plus b is the same as the sign of the label of that point. So you can think of a perceptron as an extremely shallow neural network, even shallower than a shallow network, with no hidden layer at all. You'll see this terminology where other neural networks, especially, are referred to as multilayer perceptrons. I don't know how useful that is, but I wanted to explain the connection in case it's something you encounter when you're looking at the literature.
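Tying the last two steps together, here is a sketch, not from the talk and with invented toy data, of training the small shallow network from the previous snippet: a squared-error objective, gradients obtained with the chain rule (backpropagation), and steepest-descent weight updates.

```python
# Sketch (not from the talk): training a one-hidden-layer network by backpropagation
# and steepest descent on a squared-error objective. Toy binary-classification data.
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

rng = np.random.default_rng(4)

# Invented data: label 1 if x1 + x2 > 0, else 0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

n_hidden = 4
W1 = 0.5 * rng.normal(size=(n_hidden, 3))    # hidden weights (column 0: bias)
W2 = 0.5 * rng.normal(size=(n_hidden + 1,))  # output weights (entry 0: bias)
lr = 0.5

for epoch in range(200):
    for x, target in zip(X, y):
        # forward pass
        xb = np.concatenate([[1.0], x])
        h  = sigmoid(W1 @ xb)                 # hidden activations
        hb = np.concatenate([[1.0], h])
        f  = sigmoid(W2 @ hb)                 # network output

        # backward pass: chain rule, layer by layer
        delta_out = 2.0 * (f - target) * f * (1.0 - f)   # dL/d(output-neuron input)
        grad_W2   = delta_out * hb                        # dL/dW2
        delta_hid = delta_out * W2[1:] * h * (1.0 - h)    # dL/d(hidden-neuron inputs)
        grad_W1   = np.outer(delta_hid, xb)               # dL/dW1

        # steepest-descent update of the weights
        W2 -= lr * grad_W2
        W1 -= lr * grad_W1

correct = 0
for x, target in zip(X, y):
    xb = np.concatenate([[1.0], x])
    hb = np.concatenate([[1.0], sigmoid(W1 @ xb)])
    correct += (sigmoid(W2 @ hb) > 0.5) == bool(target)
print("training accuracy:", correct / len(y))
```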
So my fourth example, which I'll go through relatively quickly, is what are called boosted decision trees. This is an important method, more so these days in particle physics, I'd say, than in the broader machine learning community, though it is an important algorithm. Boosted decision trees obviously involve decision trees, and a decision tree is a series of yes-or-no questions that identify bins of data with a given signal purity, if we're thinking about signal versus background. I have an example here where we have three signal points, for which the first variable is zero and the second variable is one for the first point, and so on, and three background points. The best question we can ask is about the value of the first of these variables. If we cut at 1.5, the events that satisfy the cut go into a bin with four events, three signal and one background, and the events that fail it go into a bin with two background events. Now there's no need to subdivide the "no" bin; it's already pure background. But we can then apply a cut on the other variable, the other feature, to divide the "yes" bin into a bin with three signal events and a bin with one background event. In this example, then, this tree is a very good classifier for our sample data, because we've correctly classified every point, in our training set at least. You can always do this with a sufficiently long tree: you just ask enough yes-or-no questions that you separate every point, unless of course there are two points with the same inputs but different labels, in which case you can't. You can't do this with a short tree; you need some number of branches to pull it off. The problem is that if you're growing a really long tree to fit your training data really well, you're probably overfitting; you're probably choosing some of those yes-or-no questions based on data that's noise, and you don't want to do that. So you want to use short trees; those are much more robust with respect to noise, peculiarities in the training set, and so on. So you really want what's called a weak classifier, a short tree. But in that situation, the bins at the base of the tree will not in general be pure signal or pure background. There's a procedure for building strong classifiers out of weak classifiers, called boosting, which involves things like reweighting the events which are misclassified, rerunning the algorithm, and so on. The resulting algorithm is called boosted decision trees, and it's a very important algorithm in particle physics.
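As a concrete illustration of building a strong classifier from weak ones by reweighting misclassified events, here is a sketch, not from the talk, using the standard AdaBoost recipe with depth-one decision stumps as the weak classifiers and an invented two-feature signal/background toy sample. This is one common boosting variant, not necessarily the exact one used in particle-physics tools.

```python
# Sketch (not from the talk): boosting depth-1 decision stumps with AdaBoost.
# Labels are +1 (signal) and -1 (background).
import numpy as np

rng = np.random.default_rng(5)

# Toy data: signal tends to have larger values of both features
n = 400
X = np.vstack([rng.normal(0.8, 1.0, size=(n // 2, 2)),     # signal
               rng.normal(-0.8, 1.0, size=(n // 2, 2))])   # background
y = np.concatenate([np.ones(n // 2), -np.ones(n // 2)])

def best_stump(X, y, w):
    """Find the single cut (feature, threshold, sign) with lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] > thr, sign, -sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

w = np.full(n, 1.0 / n)          # event weights, updated each boosting round
stumps = []
for _ in range(20):              # 20 boosting rounds
    err, j, thr, sign = best_stump(X, y, w)
    alpha = 0.5 * np.log((1.0 - err) / max(err, 1e-12))    # vote strength of this stump
    pred = np.where(X[:, j] > thr, sign, -sign)
    w *= np.exp(-alpha * y * pred)                          # boost weight of misclassified events
    w /= w.sum()
    stumps.append((alpha, j, thr, sign))

# The boosted classifier is the weighted vote of all the weak stumps
score = sum(alpha * np.where(X[:, j] > thr, sign, -sign)
            for alpha, j, thr, sign in stumps)
print("training accuracy of the boosted classifier:", np.mean(np.sign(score) == y))
```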
Which brings me to my final section: machine learning and particle physics. Of course, machine learning is an important part of data analysis in experimental particle physics and elsewhere, but I'm focusing on particle physics, really for two major use cases. The first is identifying physics objects: is this jet a b-jet, that sort of thing. The other, of course, is determining what the underlying process is for some event: is it signal or background. And of course we always do curve fitting; we almost never talk about that as machine learning, though it's an important point that it actually is. I would say the main algorithms are things like neural networks, mostly shallow at this point though there's a move towards deep learning, boosted decision trees, and things like likelihoods derived from the data. Historically, a lot of the discussion in particle physics has been not about machine learning versus expert systems, but about multivariate analyses versus single-variable analyses. So I've drawn this Venn diagram here to illustrate when something is machine learning, when it's multivariate, and when it's neither.

Basically the idea is that most machine learning algorithms are algorithms that use multiple variables, but technically you can have machine learning with a single variable. And most multivariate analyses, most analyses that use many different variables, are machine learning techniques, though some of them aren't. The main toolkit in ROOT, the very popular framework for doing experimental analyses, is called TMVA, because again the emphasis historically was on multivariate analyses versus single-variable analyses, not so much on machine learning per se. Now, it's clear why you'd want a multivariate analysis: you want to use more information if you have it. But as I've said, multivariate analyses aren't necessarily machine learning algorithms. Two examples of important multivariate analyses that are not machine learning are, first, what's called the matrix element method, something I've worked on a great deal in my own research, which is an approach where we determine the likelihood for signal versus background directly from theoretical information, from the differential cross sections. The other is something we're all familiar with: if you do an analysis where you cut on many different variables, and you didn't find your cuts by looking at your data but just from what's physically reasonable, that's also a multivariate analysis that is not machine learning.

So why is machine learning important here? Well, there are a lot of things in particle physics that are hard to model analytically, where you really do want to look at simulated events. That's because things like hadronization, showering, and especially detector response are a lot easier to understand with Monte Carlo simulation, using tools like Pythia and Herwig for hadronization and showering, and things like GEANT to model detectors, to generate realistic events. Machine learning can then tell you how to classify signal versus background events once you know what realistic events look like, and in practice, some machine learning is almost always necessary for this reason. I'm running a little low on time, so I don't want to dive too far into this, but I'll note that, excitingly, the matrix element method was part of the discovery of the Higgs on CMS. Even there, though, even though the matrix element method itself is not a machine learning technique, what they actually used in their analysis was a 2D likelihood found from the data, so machine learning, where one variable was a matrix-element-method variable and the other was the invariant mass of the four leptons, which is smeared somewhat by detector resolution. Historically, a lot of the neural networks used in analyses have been shallow networks, with one hidden layer, but people are transitioning to deep learning because it does a bit better. This is a plot from a paper of, I guess, a few years ago now, showing that for a particular Higgs-to-tau-tau analysis you could gain almost a sigma, half a sigma to a sigma depending on exactly what you're comparing, by using deep networks rather than shallow networks.

So, broadly, why are people in physics, and in particle physics especially, interested in the same things the tech sector is interested in? Well, basically, in both particle physics and Silicon Valley you deal with a ton of data.
The LHC produces tens of petabytes a year. Terabytes you can have in a hard drive; a petabyte is a thousand times bigger than that. But at Google, say, they were already handling something like 10 to 15 exabytes, another factor of ten to the three, back in 2010, so it has to be a huge number today. And people talk about the future, especially with things like the internet of things, in terms of tens or even hundreds of zettabytes, yet another factor of a thousand beyond that. So a tremendous amount of data, and you want to use your data in a smart way, which generally means machine learning. However, one challenge is that the goals are sometimes different between physics and the tech sector. A lot of that is because in physics we're very concerned with establishing true statements about the world with a high degree of confidence. We don't call something a discovery unless it's five sigma, which is a tremendously low p-value, so you're really interested in knowing about very extreme, very strange background events, whereas a private company isn't so interested in what one in a million of their customers do, I guess unless they're very rich customers; they're more interested in what the typical user does. So there's some difference in focus. In both sectors I think it's important to understand why analyses work, though because we in physics need to understand these tails of distributions, it's probably even more important for us.

So, to wrap up, I would say that machine learning, in its broadest sense, has always been part of experimental particle physics. There are many algorithms for doing machine learning, since it's a very broad idea, learning from the data. There's been great excitement in recent years about deep neural networks, about deep learning, which will play a larger and larger role in particle physics and elsewhere, I think ultimately displacing shallow neural networks and boosted decision trees in particle physics. But this is a very fast-developing area. So stay tuned, and thank you.

Thank you, Jamie. Thank you so much. Let me turn this to me and my camera. Well, thanks for the talk, it was really interesting. I have lots of questions, but before I ask mine, I'd like to open the floor for questions and also look at some of the questions that were asked during the transmission. So does anyone here have a question they would like to ask? Oh, hold on, I'm still presenting to everyone. Sorry. Now I've stopped it. Okay. All right, sorry, there are some things that have changed with Google Hangouts. Let me just ask the first question. A lot of people say AI is going to make a lot of progress, but it seems like we still need lots and lots of data for things to learn. And in particle physics we still depend a lot on the models we use, and on the simulated part of the data, in order for things to learn. Is there a way we're ever going to be able to just use the events we observe to make decisions about what events we're observing, and then sort of translate that into a model? It could be a phenomenological model, but in that sense we lose the dependence on the Standard Model for how we differentiate signal and background; instead we work with what the machine learns and then try to construct a phenomenological model that could lead to new physics, and develop a model from that. I don't know. I think the basic answer is that it's hard, right?
The more straightforward thing to do is supervised learning, where we're training on data where we know the answer. But of course, as you point out, we're then very sensitive to how we have simulated the data. So ultimately what you would want, the thing you're talking about, would be very powerful unsupervised learning procedures that could just look at the data and know what to do with it. But that's hard. That would probably involve having more powerful, sort of general-purpose AI, something that people talk about. Ideally, you can imagine that if you had a sufficiently powerful computer, it could do everything we do; our brains are just computers, so it could do it, it could even be more powerful, and then it could do anything we do at all. But that's not right around the corner; that's some years in the future. I'll note briefly that for addressing the specific challenge of whether we're learning from some feature of our simulation rather than something that's really there in the data, there are techniques like adversarial learning, where basically you have an automated way of seeing what's fooling your neural network, and then you can improve on that iteratively. That can help you understand whether, say, Pythia gets something a little wrong, or whether it's something else.

Because you could imagine, just talking about LHC physics for now, you could imagine just having a machine learn, supervised or not, from just calorimeter information, right? Yeah, absolutely. Something I think is exciting is that we often think about algorithms running on physics objects, but of course physics objects themselves were constructed, maybe even through machine learning algorithms, from underlying data, and in principle in the future we can go straight from the underlying data. Actually, and this may be a bit of an aside, but I think an interesting one, if you look at very future colliders and extrapolate developments in electronics and computing power, you could really read in everything, not have a trigger, and have it all processed in some very intelligent way. Yeah, that's the future.

And then the last thing I have to ask: when you talked about the Go program, it learned from amateur data online, and then it was able to beat a professional, one of the very best players at Go. Well, there were two steps. The first step was learning from amateur players online, and that's just because when you can make 300 moves, you want to at least start by keeping things reasonable. The big thing was that the program played itself millions of times, and that's where it went from being, say, a competent amateur, which I'm not, I don't really even play Go, but think of someone like me at chess, to the equivalent of a world champion. That came from playing itself, learning from its mistakes, and trying every different thing. Yeah, okay, so that's important, right? You just give it a set of rules, and then let the machine optimize its own playing, and it can get better by itself. And whether or not it gets better than the best person, that's right.
It still got better, right? Right, that's just a threshold. It is interesting that even that version was better than the best person. And actually the more recent one, AlphaZero, is trained entirely from self-play: it knows the rules, and then it plays and plays and plays, and in a few hours it can beat anybody at chess. That's amazing. Anyway, thank you for answering my questions. Does anyone here have a question, two or three or four?

I have a question here. Well, first, thank you very much for this very nice talk. My question is, well, according to your conclusions, the community in particle physics has been using machine learning since, well, the beginning, and basically what it's doing is changing the techniques, and now it's changing to deep learning, let's say. Right, right. The problem that I see is that, at least in deep learning, or well, also in these other machine learning methods, we physicists always want to have control over what is happening. Right, right. And in deep learning that's not really the case, so I don't know what your feeling is about the community now changing to deep learning, in which we are not completely sure what is happening. Right, right. No, it's an incredibly important question. When you use more sophisticated algorithms, when you use neural networks, when you use deep neural networks, you have a lot of power, in principle, to learn from information in your data. But it comes at the cost of learning things that a human can't see and hence maybe doesn't really understand. And there's a lot of fear, especially in physics, that maybe we've learned from a systematic and it's not a real difference. That is a problem. As this becomes more and more important, and I think it probably will, there's going to be more and more research into learning how to understand what we're learning from the data. Yeah, it's a huge issue.

Is anyone else? Yeah, I have a question for Jamie. Also, the talk you gave was very nice, I liked it a lot, especially because you taught us all the important keywords in machine learning; for someone who is not doing machine learning, the only phrase you hear is machine learning itself, not the important parts. So in the case of experimental particle physics, how long does such a program usually run to get results? Because other, more expert-system-like methods tend to be more efficient in terms of computational power, but they lack the ability to learn from the data, to extract the hidden physical meaning that is in it, more or less. That's a good question, and I don't really know quantitatively; I'm a theorist rather than an experimentalist. I would certainly say that I think the reason there hasn't been more deep learning at this point is in part the time issue, but not the time to train once with a deep neural net. Rather, and this goes back to the last question, because there's so much concern about understanding systematics and so on, you're not going to do a given analysis just once; you're going to do it many, many, many times on different parts of your data, varying parameters, maybe modifying different variables to see what effect they're having on your data.
So if having that sort of deep understanding of where your sensitivity is coming from means you're also going to pay some factor of ten or twenty, or whatever, in how long it takes you to run each step, that's very painful. I mean, as you know, analyses already take a lot of work and a lot of time. Of course, the nice thing is that computers get better and better, so that's not going to be a roadblock looking forward, though I think it's still a consideration right now. Yeah, just another question, also a kind of curiosity for the people watching the YouTube stream: all these machine learning tools — how did you learn about them? Because it's not like there's a common course that people follow; for older generations, let's say, this is something new. I think right now, unfortunately, it's not standardized. I know personally, for me, it's been a combination of actually going to a course at the university where I am now, the University of Hawaii, plus online courses, and then trying to read about things. There are some great online textbooks, but they're generally a little more heavy-duty on the statistical side than someone who just wants to apply things might want. I'm thinking in particular of The Elements of Statistical Learning, which is one of the major texts — you can Google for it and download it — but it's from people who are really, ultimately, statisticians, who are interested in the mathematics of it. So I think there's still a gap in terms of: is there one simple reference to give to, say, a beginning experimental graduate student so they can really understand these methods? I think a lot of it is passed down person to person, and so on. So, yeah, I don't have a great answer for that, unfortunately, other than that, personally, for me, it's been from a variety of sources. Yeah, like most of us — we learn by experience. By doing this stuff, we learn to use the tools. Thank you. And another question that's interesting: there's another one of those emerging technologies, based on a lot of years of theoretical work — quantum computing. Yes, yes, yes. Do you think, as the things we're trying to address get harder, we're going to need algorithms that are, at least in part, quantum mechanical? Well, that is a really great question, and it's very much an open question. People are starting to look at quantum computing for machine learning at the research level — this is the cutting edge. Not a lot of people, because this is really the very edge, or maybe a little beyond the edge. There are reasons to be hopeful that it would be helpful: in particular, with quantum computing, because you're looking at many, many states in superposition, you might be less likely to end up with your optimization algorithm dominated by some local minimum. Hopefully you're going to be somehow sensitive to the global minimum. Of course, because it's quantum computing, you're working with all these states in superposition, but in the end you really only get one answer, and it's somewhat probabilistic.
And it's not obvious to me that you're always going to be able to optimize as well as you'd like — maybe those local minima are dominant, maybe there are many more of them, or something like that. But it's absolutely an exciting research-level question whether quantum computing is going to be a big part of the machine learning story ultimately. Well, I have a question here. My question is — well, it's not a secret that the big player in this machine learning thing is industry rather than academia. Are you aware of contacts between industry — I don't know, Google or Amazon, the big players — and academia? Well, there are two levels to that, because when we say academia, we can mean two things: we can mean any academia, including computer science academia, or we can mean physics academia. And actually, in the research arms of the big tech companies there are a lot of people who are still in academia, in computer science departments. Yann LeCun is a huge figure in this area who's at Facebook but also at NYU; there's Geoff Hinton; and there are other big people in machine learning who have both academic affiliations and affiliations with major tech companies and their research groups. So on that level, there's a lot of interaction. There are important questions about, for instance, whether too many people who get PhDs in machine learning end up in tech rather than in academia — are there concerns about who's going to teach the next generation? And then, of course, I've had conversations with people at Google Brain and places like that, and certainly people there are interested in solving problems of academic science — of physics, or chemistry, or medicine, which is a big one, of course. Now, is it the right amount? I don't know; I don't think I've really developed an opinion on that yet. But what I hear a lot about is using this to address highly complex human challenges, right? And when physicists know how to do this, when other scientists know how to do this, they can definitely contribute to social problems — people talk about IoT, right? How is that going to help us transition into the next generation of cities and communities, or whatever? So I see it more as an opportunity for people in the physics community who know how to do this to either transition out of academia if they want to, or just be part of something else if they want to as well. I think of this as a skill that we could have if we wanted to work on things like this. And I'll just make one point — since I know you have a lot of interest in education — I think it's important that we make sure, going forward, that our undergraduates and graduate students are good enough at this that it's a relatively easy transition into the private sector. Oh, definitely, I agree. And this is what we have now. Yeah. I think your colloquium does this, right? We're trying to motivate people that learning this could contribute to your science, but also to other things that you want to do. So I agree, we should definitely be addressing this in the curriculum somehow, though I don't know how yet. Yes, it's a hard question, because you can't cut, you know, electromagnetism so students can learn more computer stuff, but it is a useful skill.
But we could definitely take some of this into the curriculum and see if we can implement certain examples that use machine learning, or at least the more primitive aspects of machine learning, which then lead into the more advanced ones anyway. Does anybody else have anything they'd like to ask? This is going to be available on GitHub, right — your talk? Yeah, yeah, soon — I realize there are a few little typos to fix. But take your time, take your time. And we will share that. And thanks for being our first. Thank you very much, I really appreciate it. I have a question over here. Sorry, Joe — go ahead. So you mentioned overfitting at some point, right? And you showed it with some very extreme examples. But I was wondering if there's a procedure to figure out whether you are really overfitting or not, because of course in your example it was obvious, but I can imagine more subtle cases. Sure, sure. No, it's tricky, it's tricky. And that's why you have the test set, where you see how well your fit does on data that you haven't used yet. There are also more algorithmic ways of trying to avoid overfitting. The big one is called regularization: basically, instead of minimizing your objective function, you minimize your objective function plus an additional term that penalizes you for using a very complicated function. An example might be that you have some vector of weights; now, instead of just minimizing your function, you minimize your function plus some number times the length of your weight vector. The idea there is that you're encouraging yourself to have as few effective parameters in your function as possible. I think answering that question in great generality — beyond just looking at your test set and making sure that you still fit well with whatever function you have — is a real challenge. And actually, especially when you look outside of physics to other areas of interest, there's a tremendous amount of overfitting. I mean, if you watch sports like I do, you'll hear someone say, well, this team has never come back when down by more than two goals in the second half on a Tuesday — it doesn't mean anything; it's overfitting, but that's what people do. So it's a really big challenge and a really big question, and I think it just takes vigilance and common sense, and those things are hard to quantify. Okay, I see. Thank you very much. And thanks for the colloquium — it was fantastic. Oh, thank you. Yeah, with that, I'd like to thank our speaker, and I'd like to thank everyone for being here. I look forward to our second one, but keep staying in touch, because our seventh season has just begun, and we look forward to a lot more talks about physics and a lot more talks about general aspects of science that we physicists are doing and that can be applied anywhere else. So thank you very much and have a great day, guys.
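To spell out the regularization idea from that last answer, here is a minimal sketch in Python/NumPy: an ordinary least-squares fit compared with the same fit plus an alpha times the squared length of the weight vector (a common variant of the penalty described above, i.e. ridge regression), on made-up data; alpha is just an illustrative knob, not a recommended value.

```python
# Hypothetical illustration of regularization: minimize loss + alpha * ||w||^2.
# The data and the value of alpha are made up for demonstration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                                  # toy features
y = X @ np.array([1.0, -2.0, 0.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

alpha = 1.0   # penalty strength; alpha = 0 recovers the ordinary fit

# Closed-form solutions: w = (X^T X)^(-1) X^T y  vs.  w = (X^T X + alpha I)^(-1) X^T y
w_plain = np.linalg.solve(X.T @ X, X.T @ y)
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print("unregularized weights:", w_plain)
print("regularized weights:  ", w_ridge)   # pulled toward zero, i.e. a 'simpler' fit
```

Larger alpha pulls the weights toward zero, trading a little fit quality on the training data for a function that is less likely to be overfitting; the test-set check mentioned in the answer is still what tells you whether the trade-off paid off.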