Someone wrote, let me see if I can, I'm not going to read out the whole thing, but it goes something like: don't get me wrong, I like all the stuff that we've done so far, so much math, a toolbox, and it seems cool to have these thought-out, engineered solutions, which are great because they achieve similar results more efficiently and faster, through expert-guided simplifications of the models. But at the same time, it seems like these simplifications prevent the emergent behavior that we're now observing in AI systems, these artificial general intelligence systems, which seems to be the thing to do in AI now. So I guess what you didn't say, but what's sort of between the lines, is: why are we learning all of this stuff? Is it maybe just time to stop worrying and love the AI, and just build really, really, really, really big networks, and then hope that somehow, by building them really big, that's going to solve all the problems? Why do we still have to do all this math, pretty much? And to answer this question, I would like to use today's lecture. I should say it's of course a very big question that no one in the world, not even the people at the very heart of the AI community, can answer meaningfully. So instead, I'm going to try and answer it with a bit of a story, and then you can take your own messages from that story if you like. So what we've done in this course, not just so far but over the entire course, has been to build this toolbox, this way of thinking about machine learning as inference, as extracting information from data. And the core mathematical foundation that this is all built on is the notion of probabilities, a way of distributing truth values across sets of possible hypotheses. To do this right, we have to make sure that the amount of truth stays constant while we manipulate those measures of probability, and for that we saw that these, well, really two rules need to be employed: one to get rid of variables you don't want to be part of your statements, and one to condition on variables that you have actually observed, so you can reason about the ones you want to know. And if you combine those two, they give rise to Bayes' theorem, which is the fundamental mechanism for extracting information from data; I'll write the rules out in symbols again below. This math, though, is very abstract. It's just a statement about what you're supposed to do. And in practice it's in general intractable, because it requires keeping track of all of the possible hypotheses at once, at the same time. And that's exponentially hard in the number of variables we're keeping track of, and it grows as a power law in the number of states that these variables can take. So if you have continuous variables, it's completely intractable. So the entire lecture course consisted of coming up with tools for dealing with this complexity. And it's fundamentally a question of complexity, of how to build algorithms, on real computers and on paper, to deal with these problems. And I introduced various kinds of models, classes of probability distributions in which these operations become tractable. We spoke about conditional independence, which can be represented through directed graphical models, or investigated through these graphical tools. About exponential families, with Gaussian distributions as a special case, which map this complicated, abstract mathematical statement onto tractable, finite computations, or even linear algebra computations in the case of Gaussians.
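Here are those two rules, and the Bayes' theorem they combine into, written out in symbols. This is just my shorthand for discrete variables, restating what was said above rather than anything new:

$$p(x) = \sum_{z} p(x, z) \qquad \text{(sum rule: marginalize out the variables you do not want in your statement)}$$

$$p(x, z) = p(x \mid z)\, p(z) \qquad \text{(product rule: condition on the variables you have observed)}$$

$$p(z \mid x) = \frac{p(x \mid z)\, p(z)}{\sum_{z'} p(x \mid z')\, p(z')} \qquad \text{(Bayes' theorem, combining the two)}$$

The sum in the denominator runs over every possible hypothesis $z'$, which is exactly the bookkeeping that becomes exponentially expensive as the number of variables grows.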
And then we actually used this Gaussian mechanism to build quite powerful models: mechanisms to learn functions, to reason about infinite-dimensional spaces of functions, called Gaussian processes, and to keep track of states through time, through Markov chains. And all of this required us to use non-trivial, complicated algorithms. We needed to learn how to use gradient and curvature information to find point estimates and construct uncertainty around them, using large-scale numerical linear algebra, which is actually so important that I spent two lectures, or rather Marvin and I did, talking about these algorithms, so that you really understand what they do, because they turn out to be very smart ways of looking at the data, loading parts of the data and then keeping them in relationship to each other. And then, over the last lecture, actually the one that you found too difficult, I sort of branched out and said there are more tools you could use, and if you had two or three or four semesters, I could talk a lot more about them. For example, variational inference, a general mechanism to construct probability distributions that approximate other probability distributions, and to optimize entire distributions rather than point estimates. What I didn't talk about is sampling methods, Monte Carlo methods, methods that use random numbers designed such that, in density, they approach the distribution we're trying to reason about. So all of these tools that we talked about, they come from history. They've been around for sometimes a few decades, sometimes centuries. And I showed you some of the people who were involved in inventing them, but typically it was the really old, historical figures, hundreds of years ago, people like Laplace, who came up with these ideas for the first time. So today I want to maybe help you think about what the development of these methods means for us today by, as I said, telling a bit of a story that's a little bit closer to us. Closer in time, at least; it's still relatively far away, but maybe closer to us than we might think. And that story starts here. So this is Central Europe. This is Germany over here; we're down here in Tübingen. But the story doesn't actually take place here. The story starts here, in a place whose name you can't read. Does anyone know what the city is called? So this is modern-day Ukraine that you see here. And the place, well, it has had lots of names. It used to be called Lemberg when it was part of Austria-Hungary. Then, after the First World War, it was called Lwów, because it was in Poland. And then, after the Second World War, it became part of what was then the Ukrainian Soviet Socialist Republic, and now it's called Lviv. At least that's my way of pronouncing it; I'm probably pronouncing it wrong. And it's in Ukraine. It's a pretty nice old town. Maybe it's a little bit like Tübingen. Just a little bit. Not so much, but a little bit. And in the center of town there is the square of the academics, called that because it has an old university. And on that square there's this building, and I'm not going to try to pronounce the Polish name: it's called the Scottish Café. It actually still exists today; you can go there. There's still a café in there, now called Atlas for some reason. And in that café, the local clique of mathematicians would meet. There was a school of mathematicians, the Lemberg school, well, the Lwów school of mathematicians. Here they are.
At least, that's a picture from the late 1930s. And there's a bunch of people here whose names you might recognize; many of them you wouldn't. There's someone sitting here in the front who's called Stefan Banach. We've heard of Banach spaces. So that's him. There are a few more people you might recognize. There's Kaczmarz here in the front. And in the back is a young guy called Stanisław Ulam, number 10, and various others. But the story doesn't actually start with them. They would meet in this café and they would discuss mathematics. This is what the café looks like today, actually, on the inside. I have not been there, but you can find it on Google these days; you can find any place in the world on Google. They would sit there and they would talk about math, as you do, because what else is there to do? There were no phones. So in the 1930s, they would sit and discuss abstract, complicated mathematical questions. There were no computers; there was no other way to spend the time. And they would write down the most interesting problems that they could find in the Scottish Book. Here are some excerpts from it, photocopies actually. The people who were there in the school, but also visitors who came by, would pose questions, and then people would try to solve them. And there was a rule that if you solved one of these questions, you would get a live goose. Because it was a bit of a nerdy thing, I guess; it's the kind of game you play. I'm sure they had some alcohol during this whole process as well. So here are some excerpts. One interesting thing about this is that you can tell some of these are written in Polish, some of them in German. So here is an entry, actually, from a guest who came by, a young Hungarian mathematician called John von Neumann, in 1937. He writes down a problem. This is in German, by the way: "Gegeben eine unbeschränkte additive, multiplikative Algebra...", given an unbounded additive, multiplicative algebra, some problem. And then, above him, there is an entry by a Polish mathematician called Hugo Steinhaus, who writes here in 1937. Actually, I'm not sure; I think this might be Polish. It's difficult to read, actually. But it says "problemat" up there, so I'm guessing it's not German. Here is Hugo Steinhaus. He was one of the elders of this group, one of the founders of this community. He was born in what was then Austria-Hungary and died in Wrocław, in a part of Poland that back then wasn't actually Polish; well, it was by the time he died. He studied in Göttingen with David Hilbert, where he got his PhD. And then, like many mathematicians, he had to somehow vanish during the so-called Third Reich. But before that time, he was an important figure in this town, where he taught lots of young mathematicians, among them Banach, for whom he was the PhD advisor, and Mark Kac, whom some of you may have heard about if you've done anything with physics. And also this guy, Stanisław Ulam, whom I've already pointed out in the picture, although I'm not actually sure that Steinhaus was his PhD advisor; it might be that he was just some kind of more distant mentor, and Banach was the direct PhD advisor. And he kept this book, apparently, for a while at least. He would take it home with him after coffee, or after the evenings, or whatever.
And one of the problems he was thinking about was how to structure material objects into sub-elements. So if you have a rigid body of material and you'd like to describe it in terms of parts, because then you can think of the dynamics of these parts as individual bodies, you can try to reduce the complexity of describing a complex-shaped object by reducing it to individual components. So he came up with an algorithm for this that he published in the Bulletin of the Polish Academy of Sciences in 1957, in French: "Sur la division des corps matériels en parties", on the division of material bodies into parts. And it's this algorithm. It works as follows. Given a bunch of point masses, effectively, you decide that you're going to use K parts to separate them into. You initialize these K components to K different means at random. And then you iteratively do two things. In the first step, you assign each of the x_i to its nearest mean, the one that is closest in Euclidean distance. Another way of writing this is through a binary variable that is set to zero or one: it is one if datum number i is assigned to the k-th component, and zero otherwise. And then, after you've made this assignment, where each point now belongs to one of these bodies, you compute the center of mass, effectively, of each collection of point masses, which is just the mean of all the point masses assigned to that center. And then you just keep doing this over and over again until it doesn't change anymore, until it stops. This algorithm is called K-means, and you've probably heard about it in some earlier statistics or machine learning lectures. It's an elementary form of finding centers of gravity. You could also use it to separate groups of people into individual locations. It's a very ad hoc kind of algorithm. It doesn't provide a description of the distribution of the points; it just separates them into individual parts. And it also has all sorts of weaknesses. For example, it assumes that all of the objects are roughly of the same shape. So this was, well, okay: the paper was published in the 1950s, but Steinhaus' productive time was in the early 1930s and maybe even a bit before, because he was born in 1887, so actually probably more like around 1900 or so. And this is where our story turns a bit dark. Well, not just a bit, actually. Lviv was very much in the center of Europe. It was a multicultural place; there were people speaking different languages, a third of the population of Lviv was Jewish, and it was right in between the complicated power dynamics of the 1920s and 30s, and even before that, of the First World War. Which country it belonged to shifted over the years, over and over again. And then, when the Second World War started, it very quickly found itself again in the center of violence. The Russians moved in, took over the town, and imprisoned various political prisoners, people who were fighting for Ukrainian independence at that point in particular; several thousand of them were imprisoned by the NKVD, the interior ministry. And then the Germans moved in. In late June 1941, the Wehrmacht approached from the west towards this part of Europe.
At that point, it wasn't really clear whom it belonged to. And as they approached Lviv, the Russians realized that they had to leave town, and they started killing the political prisoners, 5,000 of them, in the city's prisons, probably on the 24th, 25th and 26th of June; these were largely Ukrainian nationalists. And then, on the 30th of June, the Germans reached Lviv, and they dug out, or rather found, the bodies of the killed political prisoners and used them for propaganda purposes. They forced Jewish residents of the town to carry them out on their knees. And then, starting from the 30th of June, they did something that was typical of the German strategy in Poland in particular. It was called an AB-Aktion, an Außerordentliche Befriedungsaktion, an "extraordinary pacification action". The idea was basically: if you kill off all the leading people in a population, it's like lobotomizing an entire populace, and afterwards the normal people just follow along much more easily. So, with the help of Ukrainian nationalists, the ones who had survived, they identified the intelligentsia in town, in particular the professors of the university. The Ukrainian nationalist student network, which was part of the OUN, the Organization of Ukrainian Nationalists, helped them find 22 professors, who were rounded up within three days or so of the Germans arriving in Lviv, and they dragged them away and killed them really brutally. Some of them were shot; the professors typically were shot, but their families, their children, their assistants were sometimes just killed with hammers. It's the kind of story that we hear from Ukraine these days, sometimes, again. And they were supported by this Organization of Ukrainian Nationalists, which had a faction that was outright fascist and antisemitic. It was led by a guy called Stepan Bandera, a name that you may have heard again in 2022, when the ambassador of Ukraine to Germany, Andrij Melnyk, in an interview didn't want to denounce him as a fascist and an antisemite. And that was maybe part of the reason why Germany was initially very slow to react to the war in Ukraine. So if you wonder where all of this came from in 2022, here's a connection. This is a memorial to the so-called Lemberg Professorenmord, the murder of the Lviv professors. It's actually not in Lviv; it's in Poland. Can anyone read Polish? Okay. I can't pronounce it either, but it means "let our destiny be a warning". It was only the start of a lot of killings. It's not like they only killed 22 professors: hundreds of thousands of people died in this region during the occupation, mostly Jews, but this was the start of it. They also killed three of the mathematicians from this school of mathematicians, but they couldn't catch all of them, because many had already left before. They'd left because they saw what was approaching, much like some people saw what was approaching in 2022 again. And a lot of mathematicians, especially Jewish mathematicians, had by that point left not just Lviv, they had left Europe. Two of them are these. So here is John von Neumann again, whose entry in the Scottish Book you saw. He was actually just a visitor to Lviv; he didn't study there. He's from Hungary, born in Budapest. He studied in Budapest, but also mostly in Germany, in Berlin and in Göttingen with David Hilbert. He was also at ETH in Zurich for quite some time.
And he actually wrote his Habil, his Habilitation, in Berlin, which was evaluated by Issai Schur, whom you might remember from the Schur complement, the guy with the algorithm that is connected to the Cholesky decomposition. And he left in 1929 or so. That was before Hitler took power, but even then it was already difficult for a mathematician of Jewish descent to get a professorship in Germany. It was just difficult to get these kinds of senior academic positions for someone like him, even though he was a polymath and an absolute genius. So he moved to Princeton, because there was more money to be made in the US. It maybe didn't seem as attractive back then, in 1929, but the academic landscape was a bit more fluid. This guy, Stanisław Ulam, actually was a member of this mathematics school in Lwów; he was actually born there. And he left in 1938, right before the Second World War started, also to the US, where he started working on more physical problems that we'll get to later. But what does all of this have to do with K-means and machine learning? Well, to understand more about this algorithm called K-means, we have to start thinking about what it actually does. The algorithm that I showed you seems very ad hoc. It's just an iterative procedure that steps back and forth between doing two things. So to understand what it does, we have to give it a mathematical structure. We need to understand what kind of algorithm it actually is. And it turns out that it's an optimization algorithm. To understand why, we need to show that it actually optimizes something, and the way to do that is to find a function that it minimizes. This function we can find thanks to a beautiful theory by a Russian mathematician called Aleksandr Lyapunov. He's from a previous generation; he died at the end of the First World War. He was born in modern-day Russia, but he lived for most of his life in modern-day Ukraine. He taught in Kharkiv and died in Odessa, in the end. Odessa is again in the news, of course, these weeks. And maybe the most prominent thing you find whenever you try to read up about Lyapunov is that he was very handsome. I mean, look at this guy. Basically every report about him, every story about him, starts with how good he looked. But he was also a very good mathematician, and he came up with this idea that you can think of any procedure as an optimization algorithm if you can find a function that it decreases in every step. These functions are called Lyapunov functions. So in the context of iterative algorithms, such a function J is a positive function of the algorithm's state variables that decreases in each step of the algorithm. If you can find such a function, that means you can think about the algorithm as an optimization routine. So that's a second concept we can use to describe algorithms. The first one is this kind of iterative thing, the for loop. The second way to think about it is that it optimizes something, some objective, a function called J. What is this function in K-means? So here's the algorithm again, now as pseudo-code. K-means consists of taking a data set and a number of clusters you want to build. You randomly initialize the means of the clusters, the K means, and then, in each iteration, you find for each datum the nearest cluster, assign it to that cluster, and then recompute each mean as the empirical mean of the data assigned to it.
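Since the pseudo-code itself lives on the slides, here is a minimal NumPy sketch of that procedure, just to make the two alternating steps concrete. The function name, the initialization from randomly chosen data points, and the convergence check are my own illustrative choices, not part of Steinhaus' note or the lecture:

```python
import numpy as np

def k_means(X, K, max_iters=100, seed=0):
    """Plain K-means: alternate hard assignments and mean updates.

    X: (N, D) array of point masses; K: number of parts / clusters.
    Returns the K means and the final assignment of each datum.
    """
    rng = np.random.default_rng(seed)
    # Initialize the K means at random (here: K distinct data points).
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()

    for _ in range(max_iters):
        # Step 1: assign each datum to its nearest mean (hard assignments r_ik).
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=-1)  # (N, K)
        z = np.argmin(dists, axis=1)                                      # (N,)

        # Step 2: move each mean to the center of mass of the points assigned to it.
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])

        # Stop once the means no longer change.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu

    return mu, z
```

The update in step 2 is exactly the stationary point of the function J discussed next, which is what makes the whole loop an optimization algorithm.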
So a function to consider is this one, which you may have seen in a previous lecture. This J is our Lyapunov function. It's the sum over the clusters and the data of the assignments, these are often called responsibilities, r_ik, times half the squared distance to the mean. Why is this a Lyapunov function? Because it is decreased in every single step. The very first step evidently decreases this function, because we are finding exactly the r's that make the terms of this sum small. So it's clearly a reduction of this function. And the second step, line five, also decreases this function. Why? Because we can think about the gradient of this function with respect to the means; the means are being reassigned in this line. If you write down this gradient, it looks like this, and you set it to zero, you can see that the update that line five performs, and this is meant to be an element-wise division by the sum over the r's, I should have written this in Python notation, is exactly the update you would get if you wanted to minimize this function as a function of the means from one step to the next. And so that means we can think about K-means, this algorithm that was published in the 50s but existed much earlier, as a minimization procedure for this function. And one other name you could give to this function is an energy: this process minimizes some energy. Energies are a physical concept for things that, from a modern perspective, actually describe probability distributions, or more precisely negative log probability distributions. So a third concept to use to describe what's going on here is that we're not just minimizing some energy, we're actually maximizing a log probability. And I already told you that this is what this algorithm actually does. It's a process for constructing a generative estimate, a generative model for the data, in a relatively crude fashion. The assumption is that each datum is drawn from a Gaussian distribution. Why Gaussian? Because there is a squared distance in K-means, and if we're minimizing a squared distance, that's like maximizing a negative squared distance, or maximizing the exponential of a negative squared distance, which is a Gaussian distribution, where each datum is hard-assigned to a mixture component. And this algorithm clearly iterates between these two steps: you compute the assignments, and then, given the assignments, you compute the most likely value for the unknown means of the Gaussians. So over the last two lectures, I tried to convince you that there is a more general way of thinking about what this algorithm does, one that allows us to apply this connection between energy minimization, probability maximization and iterative algorithms in a much more general fashion, to a very broad class of problems. It is connected to the idea of the EM algorithm, or more generally, actually, to the framework of variational inference. These algorithms iterate between two steps. The assignment part is effectively computing a posterior distribution over the assignments to clusters, over the z variables, or the responsibilities, which in the hard-assignment case are the same thing, really; they're like maximum-likelihood-type assignments. By effectively computing a posterior distribution, a most likely assignment or a posterior assignment, depending on whether you want to use a probability distribution or a hard assignment of the random variable, this step effectively minimizes a KL divergence.
It seems too general to describe this as the minimization of a KL divergence, but that's really what it is. For this simple algorithm, it's maybe too big a hammer to use to describe what it does, but it's still correct. And then, in the second step, we maximize what we call the ELBO, the evidence lower bound, because the sum of those two functions, L plus the KL divergence to the posterior, happens to be equal to the log evidence, the log marginal probability distribution over x given the parameters of this model, which are mu, if you integrate out z, or in this case sum out the possible assignments of all the z's. Because, for fixed mu, this log evidence is a constant, setting this q to this p makes the KL term zero, and therefore means that, locally, this function is equal to that function at this particular choice of mu. And then, as we maximize over mu, we have to increase this function as well, which is why we find a maximum likelihood assignment for mu. I already showed you some theorems, some statements about this; here they are again, really just copied over. But this kind of algorithm, this class of algorithms, didn't at first come from statistics or machine learning, the way we think about them now. It comes from a time long before computers were even a thing. It is associated historically, long before our story in Central Europe starts, with what in physics is called the variational free energy. And that's why we call it variational inference, because it's associated with the notion of free energy. Why free energy? Well, because, as I said, physicists like to think of probability distributions as the exponential of minus an energy function. Something like this is called a Gibbs measure. It's a way of writing a probability distribution, and it turns out that this language for describing probability distributions is very general. Almost all meaningful probability distributions can be written in this form. You can also see that it's in some sense very close to an exponential family; it's a bit more general than an exponential family. You can actually write any Gibbs measure in the form of an exponential family if you really want to, but it doesn't help much, because the log normalization constant usually becomes very complicated. So, when we compute our ELBO, this L of q, then we're computing, if you look up here, an expected value under q of the log of the joint, minus the integral of q log q. The log of the joint is minus the energy, and minus the integral of q log q is just the entropy. So if we put a minus in front of our ELBO, then the object we get looks like, or rather is, the expected energy of the system minus the entropy, up to a scale. Actually, in physics there is often a one-over-T in front here, for a temperature that somehow shows up. So in physics this notion is connected to the idea of statistical physics, which is used to describe the dynamics of large-scale systems with lots and lots of particles. And the typical notation that physicists use is to write something like this equation, in which U means the energy, the expected energy of the system, sometimes called the potential energy.
The entropy is traditionally called S in statistical physics, and the T arises from multiplying through with that one-over-T. And this thing is called the variational free energy. Why? Because physicists like to make the statement that systems minimize their free energy. This goes back to Helmholtz, one of the names associated with the idea that you can describe the world in terms of minimizing energy: that systems somehow find static or stationary points in which they minimize their free energy. So now, from our perspective, with a bit of hindsight, we would talk about it completely differently, right? We would say: there is this system out there, which we don't understand, so we're going to model it by writing down some generative model for the process, some p of x and z, over all the variables that we come up with. And then we try to construct an approximate distribution that, well, actually no, we don't try to construct an approximate distribution; the world constructs an approximate distribution which maximizes the ELBO, right? It gets really close to this model. That seems a bit backward, right? Because we like to think of the real world as having the p, and we construct the approximation q; but in physics it works the other way around. You get to write the laws of nature, you say, ah, there is potential energy and chemical energy, and then there is entropy, and they're just defined as variables, and then the world just makes sure that the free energy is minimized. So that's a backward way of saying you're building a model that is trying to get as close as possible to how the real world works. And then everything else, every deviation of our theory from empirical data, is just the system trying to minimize its free energy; it's just some additional free energy in the system, because it's not exactly described by our posterior, and that free energy is just the world being complicated. So a lot of people in the development of statistical physics, in, I would say, happier times, although I'm not sure they were actually happier times, it was also a very revolutionary time that these people lived in, thought about these kinds of descriptions of the world, and they came up with different models. In 2023 you could think of them just like people inventing different architectures of deep neural networks: different descriptions of the world, different state spaces, ways to think about the physical world in terms of pressure and volume and temperature and free energy and enthalpy, these different words that describe how a thermodynamic system of particles that interact in particular ways, chemical and mechanical, might evolve over time. This is connected to these people: Hermann Helmholtz, Josiah Gibbs and Ludwig Boltzmann, roughly German, American, Austrian, though maybe from a time when many of these countries weren't so well defined. And of course these ideas survive to today; we still use them to build our algorithms. I introduced you to David Blei, who, to my knowledge, came up with the name ELBO, the evidence lower bound. So he came up with a new name, pretty much, for this L whose negative used to be called the free energy, or enthalpy, and which has something to do with the idea of entropy that Boltzmann may have introduced. And now we call it the ELBO, but it's the same thing.
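To make that identification explicit in symbols, here is the bookkeeping from the last few minutes written in one place, assuming unit temperature (T = 1) and writing $Z$ for the normalization constant of the Gibbs measure:

$$\log p(x \mid \theta) \;=\; \mathcal{L}(q) \;+\; \operatorname{KL}\!\big(q(z) \,\|\, p(z \mid x, \theta)\big),$$

$$\mathcal{L}(q) \;=\; \mathbb{E}_{q}\big[\log p(x, z \mid \theta)\big] \;+\; H(q)
\;=\; -\big(\underbrace{\langle E \rangle_{q} - H(q)}_{\text{variational free energy } F[q]}\big) \;-\; \log Z,$$

using the Gibbs form $p(x, z \mid \theta) = \exp(-E(x, z))/Z$. So, up to the constant $\log Z$, maximizing the ELBO is exactly minimizing the variational free energy $F = U - TS$ at $T = 1$.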
We're still using these tools people used to describe the world, by trying to come up with reasonable descriptions and then fitting them to what we actually see in the world. So physicists have been using this mechanism for fitting such distributions for quite some time. And, well, this is the name of the object we're talking about; but reasoning about it, how to fit it, how to construct on paper the correct approximation, the bit that stressed you so much on Monday when I talked about constructing factorizing variational bounds, this is connected to a mathematical idea of optimizing functionals in a space of distributions, called the calculus of variations. It's basically a generalization of the idea of optimizing a bunch of numbers by computing gradients: instead, you compute gradients in spaces of functions and consider what happens to a functional if you disturb the function, if you perturb it by an infinitesimal function. These ideas were discussed early on by Leonhard Euler, the Swiss mathematician, and by the French mathematician Lagrange, whom you probably associate with constrained optimization. That's not an accident; it's the same kind of connection. And they were really industrialized, sped up drastically, or empowered drastically, by Richard Feynman, who received his Nobel Prize in 1965 for a theoretical formalism for quantum field theory that is fundamentally based on variational inference, in very abstract, complicated spaces that we as normal people don't tend to think in, weird, partly discretized spaces. But he very much came up with a variational bound for the description of quantum mechanical systems, and that is connected to these Feynman graphs that you may have seen, these beautiful pictures that some people have tattooed on their forearms. Feynman graphs are essentially a way of writing down the terms in a variational expansion of a probability distribution: these are systems that are partially discrete, so you can write down a series expansion in terms of the possible interactions between particles, which can then enter into such a variational bound. So why am I showing you all of these people? And why are we talking about this mathematical way of describing algorithms that assign probability distributions to systems? Well, because our story isn't finished yet. I said that not everyone in Lviv got killed. Some of them left early, and they made it across the pond to some other place. They arrived in the US in the late 30s, sometimes even in the 40s, through detours, sometimes via the UK, sometimes via Denmark, Northern Europe or Italy, and then somehow made it out. And they arrived in the US, scarred by their experiences in Europe, and they felt that they needed to do something, needed to help. They stopped doing nerdy things in cafés in the evenings, and they started to think about how they could help the war effort. One of the things that they, not all of them, but many people like them, were really worried about was that the Nazis might actually get a nuclear bomb. Why were they scared? Well, because they knew of Otto Hahn and Lise Meitner, who had discovered nuclear fission, and they were well aware that you could use it to build very powerful devices. So they joined forces and, having secretly convinced the politicians, as you do, the first thing you do is convince the politicians that what you're working on is very important, they started working on building a bomb. So this is a later picture; it's from 1946.
This is long after the war had already ended, at least in Europe; it then ended in Japan as well, thanks to the work of some of these people. You may know that the Germans actually never quite planned or managed to build a nuclear bomb, because they didn't have enough fissile material. German industry was already too weak at the end of the war to actually do this. There wasn't enough heavy water around; it had to be produced in Norway, and there was just never enough of it. At the end of the war, what was left of the German nuclear program was hiding in Haigerloch, down the road from here, in a little bunker underneath the castle. You can cycle there on the weekend if you like; it's about 30 kilometers away. Heisenberg was there with a bunch of cubes of uranium, trying to build a nuclear reactor, not a bomb, because it would have been bad if it exploded, just a reactor. They wanted to get it critical, but there wasn't enough uranium; it didn't work. It was roughly 75% of the size it needed to be, but they couldn't get more, and then the Americans moved in and captured them all. But by then the Americans had already built their nuclear bomb, and Enrico Fermi, who is sitting here in the front row, another European, had been one of the leading figures of the Manhattan Project, which actually built the first bomb, together with this guy, who these days is of course suddenly very well known, for the last two or three weeks: J. Robert Oppenheimer, because of the movie that just came out. Here, right next to Oppenheimer, is a young Richard Feynman, who back then worked on the practical side of physics, because it seemed important to do that before moving back into theory. Here is Enrico Fermi, another European, and then there's another Hungarian scientist of Jewish descent who had left Germany earlier, called Edward Teller, in the back; in Hungarian his name is Teller Ede. I have a slide about him later, so I'll tell you more then. This picture was taken in 1946, during a colloquium: a bunch of physicists, scientists, mathematicians sitting together, listening to a talk about the design of an even more powerful nuclear device that Teller was involved with. So this was 1946; they were still thinking about how to build it. It took them a few more years to get it to work. In 1952 it started working; it's called the hydrogen bomb. This was the very first one, the Ivy Mike shot, the first that worked. As a side note, you may have heard that Oppenheimer was against this, and because he was against it, he was sidelined and moved out of the political circles; it was just a little bit too big for his liking. So who was involved in building these weapons? Here they are again. I already mentioned John von Neumann, and Stanisław Ulam, the member of the Lwów school of mathematicians. He had left to work on something more important than math, to work on physics, in the US, and he had to solve one of the main problems when they were thinking about building the hydrogen bomb. When it became clear that you could maybe create fusion reactions as well, the biggest question was how to design such bombs so that the density of neutrons in the material would be high enough, and that required really complicated statistical computations, which they couldn't really do, because they didn't have the kind of computers necessary to do them.
So one of the things that Ulam definitely contributed to was the design of computer algorithms to do efficient simulations of probability distributions. And he did that by coming up with randomized algorithms, together with von Neumann. That's one of the origin stories of these sampling methods called Monte Carlo methods: apparently Ulam was sick, he was in hospital, and von Neumann visited him. By the way, how did they know each other? Because they had overlapped in Lwów; they were just old friends. So when Ulam arrived in the US, von Neumann, who was already there, helped him get a job. And they were talking about some math to pass the time, and about randomized algorithms. Later on, based on this kind of conversation and of course further developments, Ulam, that's him again by the way, a little bit older by then, built one of the first analog computers of this kind, called the FERMIAC, the Fermi analog computer. He's holding it here, this thing. It's a device that uses random numbers to produce physical simulations. Here's another picture of it; I've shown it to some of you in an earlier lecture. You put it on a piece of paper, or actually, on this picture you can see that this is maybe a nuclear bomb design: you can see the layers of the material laid out. You put a pen in the back, so the pen goes here, and then you roll some dice, draw some random numbers, to sample what kind of physical reactions are taking place in the device. Every scattering event of a neutron changes the direction of the particle, and there are two different kinds of events: you could get a fast neutron or a slow neutron. That's why this thing has two different wheels, one with a larger and one with a smaller diameter, so that you can count effective path lengths through the material; and then, depending on the density of the material, you take a step of a particular length before you draw another random number. So Ulam used computations like this to try out different geometries for nuclear weapons, to create critical states, by effectively doing statistical simulation. I wouldn't call it inference, just statistical simulation of complicated thermodynamic probability distributions. And he worked together with Teller on how to use these to build fusion bombs. So here is Teller again, on the right; that's a picture of him in younger days. Teller is another Hungarian, so you can maybe tell by now that there were a lot of Hungarians involved in this whole process. They were sometimes called "the Martians" in the US, because they seemed so weird: they spoke a language no one could understand, and they behaved really nerdily. But Teller actually studied in Germany. He did his degree in chemistry at a place that we now call the Karlsruhe Institute of Technology, which was then the University of Karlsruhe. And then he moved to Leipzig, where he worked with Werner Heisenberg to get his PhD. He also had an accident at some point: trying to get off a tram, his foot got caught, went under the tram, and had to be amputated afterwards. So he lost his left foot and limped for his entire life. Apparently he didn't take painkillers during that time, because he didn't want to become addicted to them; they had these really strong painkillers back then, basically morphine.
Heisenberg said that he was such a tough, concentrated guy that he could bite through the pain and didn't have to take painkillers. And he also had to emigrate, because he too had a Jewish background, as a lot of scientists did. So he first moved to Copenhagen, where he actually met his later wife, Augusta Maria Teller, who called herself Mici. And this is a picture of her. Actually, there are pictures like this of pretty much all of the people that I've shown you, with these little numbers in front; those are from their badge photos in Los Alamos, when they worked there. But there are really not that many other pictures of her. Interestingly, in all of these stories you tend not to hear of the women, because of course back then people didn't write about the women so much, but they were just as involved. So Mici Teller actually worked in Los Alamos with Ulam, because of course they knew each other, on Markov chain Monte Carlo methods. She actually helped write the first Markov chain Monte Carlo method, the Metropolis algorithm, what we now usually call Metropolis-Hastings. It's just that her name didn't really get attached to it, you know, because it's 1940-something. But there is a picture of when Teller got some big prize, I forget which one, the Fermi Award or something, from JFK, and she's standing right next to him in the picture. So the story I've tried to tell you so far is about people who originally studied seemingly arcane, abstract mathematical algorithms that they maybe initially just used for fun. It was just a way to pass the time in the 1920s, in a café somewhere, in a seemingly long-forgotten town in Central Europe. But they were all destined for a much more complicated story afterwards. When the world took a turn for the worse, they shifted their attention from little games on paper to trying to apply what they had learned to real-world problems, to physics and to statistics, and then ended up contributing to some of the most horrible technology that mankind has ever constructed. Of course, they tried to do that for a good purpose, because they were really worried that if they didn't do it, someone else would, even if the other side, in hindsight, wasn't actually capable of doing so, maybe partly because all of the smart people had left for the US to do this kind of work. So most of this work was done with extremely simple hardware. It was done on paper, with pens, on blackboards, with chalk, maybe with devices like this, and with dice to produce random numbers; to construct closed-form approximate probability distributions, or stochastic simulations that in expectation produce the right kind of distribution. And they had to do all of this because it was the 1940s, and computers, well, they looked the way computers looked back then: they were these room-filling devices that could deal with a few bits. After the war, everything changed, because computers suddenly started allowing us, I say us, I mean the people back then, to do computations that were previously completely impossible. And maybe a representative metaphor for this is this guy, Frank Rosenblatt, whom you've probably heard about if you're studying AI and computer science. He was maybe from a luckier generation: he started his studies in the US in the 50s and then worked on this thing that you see in the background. It's actually nice in this picture that you hardly see him, because the device is really what it's all about.
This is one of the first perceptrons, effectively a neural network. And you can tell that maybe even then it wasn't so easy to train these devices, because you had to actually fiddle with wires. But it's a symbol for the change brought about by the hardware, by the computers that allowed people to simulate, or to compute, things that they couldn't possibly have computed before. And over the course of the last half century, maybe almost a century by now, these advancements in hardware have occasionally overtaken the advancements in math and understanding, speeding up the process rapidly not through better understanding but just by making the computations much easier. These advancements were led by people whom I haven't told you about; I could have told a similar story about people like Hollerith. They were people who built hardware to make computers run faster, and even today, in computer science, we tend to think of these people as completely separate. They kind of hide away somewhere, they build the next chip generation, and we don't talk about the companies that build these chips, but we somehow buy all of their products and then do something cool with them, because the cool stuff tends to happen in the software, somehow. So AI as a field evolved by iterating, in some sense, at least that's sort of a high-level picture, maybe it's an oversimplification, between advancing rapidly because people understood things better, and advancing rapidly because, with the algorithms that they had come up with, you could suddenly do much more as the computers became faster. Here are a few examples. Of course, some of the people, some of the communities that I showed you, even though you may connect them with physics and nuclear physics and so on, were very much involved in AI developments as well. Turing didn't work on nuclear bombs, but he worked on cryptography, on breaking the Enigma code during the Second World War, and so on. Of course, he's also one of the fathers of AI, and he's maybe connected to the very abstract side, the idea of formal languages, of Turing machines, of computability and uncomputability, and so on. Then, with people like Rosenblatt, we sort of turn towards hardware developments: let's just build machines that somehow adapt and find the right connections, connectionism. Then, in the 70s, and I'm of course picking a few favorites, you get people in math again, thinking very hard about how to build good algorithms: operations research people, the people who came up with the fast optimization algorithms of the 70s, like Broyden and Fletcher and Goldfarb and Shanno, and all the others connected with them. Then, in the 90s, we have people in what was then already beginning to be called the machine learning community, and in computational statistics, focusing on probabilistic models, on graphical models, on non-parametric models like kernel machines. This is connected with names like Judea Pearl and Vladimir Vapnik, people who really wanted to understand what was going on and build a theory of computational statistics, of machine learning, of learning; they used the word "learning" maybe for the first time really in earnest. And then, since, I would say, the 2010s, because I actually feel like I've consciously noticed this development myself, a new wave of hardware, in particular accelerators, graphical processing units and various other forms of accelerators, came in to rapidly speed up certain kinds of computations again:
effectively linear operations, array-centric computation. And that kind of pushed forward the unstructured models again, the deep learning world that we currently live in. And then we'll see what comes next. So what's the connection between contemporary deep learning, this unstructured world, and all of this math that I showed you over the course of this course? Is it completely separate? Should we not have learned the math at all? So maybe one way to think about this problem is to make a direct connection between variational inference, the statistical physics tools that I just told you a story about, and the deep learning world. And that brings us to our own millennium. I'm going to pick one representative for this story, and it's Max Welling. You can see it's a color picture; he's still very much with us. He's a fellow of ELLIS, working in Amsterdam, a Dutch mathematician, physicist, everything. And he actually wrote a paper in 2013, together with his PhD student Durk Kingma, on how to build an unstructured model that maximizes a variational bound. So I'm going to take ten minutes to tell you how it works; it's not going to be in the exam. It's called a variational autoencoder, and it's based on the following idea. I've told you this complicated story about graphical models and Gaussian mixture models and factorizing variational bounds that you can optimize on paper, but it centers around this object called the ELBO, the evidence lower bound, which happens to be the difference between the log evidence and the KL divergence between some approximating distribution over latent variables and the posterior over those latent variables under the model. So we assume that there's a world, that there's a model for the real world that is a joint probability distribution over some data that we can observe and some latent variables theta, sorry, some latent variables z, and it's parameterized by some parameters theta. And now we want to approximate this with some distribution q that is supposed to be close to the associated posterior. So this thing here you can think of as a prior times a likelihood, which is an unnormalized posterior; if you normalize it, you get a posterior, and for that thing you would like to have an approximation. So now, instead of saying, oh, I'm going to assume that q is a Gaussian and it factorizes somehow, and then writing down the full p and trying to do the derivation on paper and seeing if it works, you could also say: ah, who cares, we've got GPUs, so let's just model both p and q using a deep neural network. How are we going to do this? We're going to give them two names. Notice how q is a probability distribution that is a function of x; it's a posterior over the latent variable z. So we can think of it, using another kind of language, as an encoding: you take in the data x and then you find some latent representation z, an encoder. And p of x and z we can think of as a generative model for the data, a probability distribution over x that we could draw x from. So we can think of it as a decoder: if you have the latent variable z, you can generate data. Now what we're trying to do is to learn a pair of encoder and decoder that can reproduce the data from the latent representation z as well as possible. So, as a first step, we're going to use this language and just say we're going to parameterize both of these distributions, p and q, by a bunch of parameters. Let's call the parameters
of q phi, and the parameters of p, let's call them theta, and they will be the weights of a neural network. Somehow you get to pick your neural network, and we just want to maximize the ELBO and minimize the KL divergence as a function of theta and phi. So how do we do that? Well, first of all we have to say how we actually parameterize these distributions p and q. For p, for the decoder, we're just going to say that it's a Gaussian distribution on x, centered on a deep neural network that takes the latent representation z. So f is some deep neural network with parameters theta, and we'll just say the Gaussian is standard, because if it's not a standard Gaussian, we can do some tricks to transform both x and f and somehow make it standard; that makes the exposition easy. So if we want to train this part of our model, then we can take a gradient of the ELBO with respect to theta to maximize it, and that actually works quite well, because theta only shows up in p, and p appears in the ELBO only here in the back, in the logarithm of the joint. So we can take the gradient inside of the expectation. If we can somehow take the expectation over this distribution q, and we'll have to deal with that later, because we haven't yet talked about what q is, then we just take the gradient of a log Gaussian. Well, okay, we can take gradients of a log Gaussian: think about what a Gaussian is, the logarithm of it is minus one half times the squared norm of x minus f of z, plus constants. So we take the gradient of that. Theta does not show up anywhere in the covariance, so there is no log-determinant term here; it's all fine, and we just need to take the gradient of a square loss. Okay, we know how to do that; that's just autodiff, right? So very simple, it just involves a gradient of the deep neural network. Okay, so that part is going to be easy; we can use Adam, or actually SGD, for that. Actually, Durk Kingma is also the author of the Adam paper, so maybe he used Adam to train this thing. And now comes the actual hard part: we also need to optimize for q, for the encoder, and q shows up in two annoying ways. One, and maybe the most annoying thing, is that it's the thing we're integrating against, so we kind of have to be able to do this integral, but we also have to be careful that when we change phi, we actually change the distribution q. So how do we drag a gradient through the integral, and through the q-log-q term that we also need to optimize? And here comes an ingenious idea that, as far as I know, is due to Max and Durk. It's called the reparameterization trick, and it says: remember how Gaussians actually work, remember how we spent so many lectures on Gaussian distributions? Gaussians are parameterized by a mean and a covariance, and we're going to write both the mean and the covariance as outputs of a neural network. For that we have to parameterize them well. So we're going to say mu is a function of x, it's just a neural network, but for sigma we have to be careful, because a covariance matrix has to be positive definite. So we're going to parameterize it not in terms of sigma itself, but in terms of the Cholesky decomposition of sigma that we spoke about for two lectures. I'm going to write sigma as L L-transpose, where L is a Cholesky factor of the covariance matrix that depends on the inputs x, and also depends on the weights of the network, phi. So now imagine you wanted to draw from this Gaussian distribution. We learned how to draw from Gaussian
distributions in our gaussians.py piece of code: you take a matrix square root of the covariance, for example the Cholesky decomposition (in our code we actually used the SVD, but it's almost the same thing), then you draw standard Gaussian random variables u, you multiply them with the square root of the covariance matrix, and then you add the mean, and that gives you a z. So why is this reparameterization useful? It's useful for two reasons. Do I have a slide for this? Yeah, in a moment actually; I'll just take this math and move it up on the slide. So here it is again. If we write z with this reparameterization, then we can write our expected value under the distribution over z as an expected value under the standard Gaussian random variables u. The only things we have to be careful about is that we correctly transform the probability measure; that's a change of measure. We are applying a linear map to u, so we have to multiply with the Jacobian of the transformation, and the determinant of that Jacobian is literally just the determinant of L, which we can compute and invert. And then we basically compute an expected value of our ELBO by replacing all the z's with these reparameterized forms. That allows us, first of all, to take the gradient with respect to the parameters of our neural network inside of the integral, because the integral is now not over a random variable that depends on phi anymore; otherwise we'd have to do some complicated trickery, right? But we've basically done that trickery now: we've translated into a parameterization that doesn't depend on phi, so we can take the gradient inside. And the other nice thing we can now do is, well, what is this expectation? It's an expectation over the random numbers u. Well, we can use Monte Carlo methods, we can use Ulam's and Mici Teller's ideas, to approximate a complicated integral over a probability distribution with a bunch of random numbers, just some pre-chosen standard Gaussian random numbers that we don't even redraw in every training step of the algorithm; we just keep them constant, because that creates a smooth optimization problem. So notice what we've done here: we've used probability theory, Monte Carlo methods, variational free energy, all these formalisms that come from these nerds somewhere in western Ukraine in 1920, and turned them into a deep learning algorithm. They're still there. We're still doing Monte Carlo integration, we're still doing variational inference, and you can't do it without understanding these mathematical tools. If you don't know how to transform probability measures, you just get the wrong answer; it just doesn't work. And if you don't know how to compute Monte Carlo estimates, or that they even exist, you can't possibly do this integral. And if you don't know what a variational bound is, well, then you can't even start thinking about this. So these models, variational autoencoders, VAEs, for a few years they were the hot shit, the generative models of the day, the way to create images and all sorts of other cool data. Now they are a bit out of fashion again. Right after them came generative adversarial networks, GANs, which are actually more based on game theory, two players competing with each other. But we've stopped using GANs again, and now we're talking about score-based diffusion models, which are also probabilistic models; they are models that are constructed by solving a stochastic differential equation. Remember how you learned about stochastic differential
These models, variational autoencoders, VAEs, were for a few years the hot thing, the generative models of the day, the way to create images and all sorts of other cool data. Now they are a bit out of fashion again. Right after them came generative adversarial networks, GANs, which are based more on game theory, two players competing with each other; but we have stopped using GANs again, and now we're talking about score-based diffusion models. Those are also probabilistic models: they are constructed by solving a stochastic differential equation. Remember how you learned about stochastic differential equations from Natanael in lecture 22; they are effectively a form of Gaussian process, parameterized by a neural network, and that's what is used to build these score-based diffusion models. So all of these ideas are still there, and you can't build Stable Diffusion, you can't build variational autoencoders, and you also can't build general-purpose transformers without understanding the math behind them; otherwise you're just someone who downloads a piece of code and hopes that stochastic gradient descent converges.

These ideas, mathematical models to describe the world, to extract information from data, have been around for a long time. They are maybe the mathematical condensate of the scientific method: the theory of extracting information about the world from the world, by building models for what the world might mean. That fundamentally has to come from an initial idea, typically from a human brain, and then you use the laws of probability to construct posterior distributions; sometimes you don't construct a full posterior but only a point estimate, but whatever. These algorithms were used by physicists and by chemists, to talk about the thermodynamics of solid-state materials, to build nuclear bombs, to infer the meaning of the genes that run in our cells over generations, and they're always the same ideas: it's always about extracting information. So we need notions like probability, probability measures, measurable spaces; think of Stefan Banach again, there in Lviv. It has taken generations of people thinking about these objects to give them structure, to the point where we can now take something that used to be the potential energy of free molecules interacting in a box, a free gas, and abstract it enough that we can talk about ELBOs, about general probability distributions Q that are associated with a negative variational free energy, the ELBO, which we now maximize rather than minimize.

The one thing that has changed over the past 10, 20, maybe 70 years, I don't know, is that we now also have machines that allow us to make these computations on a much grander scale. And sometimes, for a while, because some smart engineer somewhere comes up with a cool new piece of hardware, those machines are really enticing; they are so cheaply available that they allow us to get away with doing things not so smartly, because we can just use them and do whatever we want really quickly. Machine learning has maybe been the field that has taken these ideas of physics and mathematics and translated them onto these tools, onto computers, and this has made the field extremely powerful, because it allowed us to work on types of data and problems that were impossible to solve otherwise, and to apply them to pretty much everything, in particular also to this artificial world that we live in now, this space that doesn't really look like the natural world anymore.

But during the phases when computers take over, when the hardware makes it easier to do certain computations, it can feel very tempting not to care so much about the math, because you can just run gradient descent. And maybe at the moment we're in a phase like this, where it seems like you don't really have to understand the math, because you can just copy-paste some code from Andrej Karpathy for a mini little GPT and run Adam on it, and somehow it'll work. But even for you it's hard to do that, because you don't have access to the hardware, right?
You're lucky if, here in Tübingen, you get to work with one GPU, maybe two; if you're doing a PhD, maybe you get eight or sixteen GPUs to work with. But there are companies out there that have much, much more computing power, like Google and Facebook and Microsoft, and therefore OpenAI, and Amazon and so on. And because these companies are so powerful at the moment, because they have so much money, which actually is a little bit unrelated to AI, right, it comes from the dot-com boom, the people who work there have the luxury of not having to think about computation so much. They all have access to their hundreds or thousands of GPUs, and they even advertise it: come and work for us, we have GPUs, a lot of them, so you can do cool things. Instead of saying come to us, we have the smartest people, they say come to us, we have a lot of hardware, because you can do lots of stupid but cool things: just run a lot of things in parallel and somehow something cool will happen. And if you think about it, a lot of the advancement at the moment in what you might call artificial general intelligence, or the emergent behavior of these big models, it's not that the people who build them actually understand any more than the other people watching from the sidelines. It just so happens that they sit a bit closer to the hardware, so they can claim that they're somehow associated with it; but they're using the same algorithms, the same architectures, the same hardware even, just more of it.

So if you want to do something cool, if you want to get a sense of how this field might evolve in the future, it's maybe useful to think about how much you want to rely on computing power. It allows us to be lazy at times when computing power is cheap, but I think we're already entering a time, even now, even today, when we realize that it's maybe time again to think about how these algorithms actually work and what we can do to make them work better.

Here are a few data points, or a few arguments, for why you might not want to rely on computation alone. The first is that, at the moment, large-scale AI systems are already bound by the constraints of our hardware. Running an instance of GPT-4 takes something like an 8xA100 machine, which is insanely expensive; it's so expensive that these companies actually have to make you pay if you want to use their tools, because they can't possibly afford to give them away for free. It's also why you have to be on the internet to interact with them: you couldn't possibly put them on a phone, because they need far too much powerful hardware. And they need an insane amount of training data to work. The large language models are being trained on datasets that are, without stretching the metaphor too much, effectively the entire internet: extremely large collections of text scraped from the web. This works reasonably well for text, because text is available in digital form, in particular to the companies that have access to the internet through their crawlers, Microsoft and Google, so you can train on very large datasets. But we're already resource-limited by those amounts of data. Sam Altman has said that scaling up the training data by more than a factor of 10 or so is pretty much impossible, because there is just nothing else to train on; it's just not there anymore. All of human knowledge available in digital form is already being used as a
training dataset. So maybe the next step is not just to train on more data, because there isn't more to train on. Can we make the models even larger? Hmm, not until NVIDIA comes up with the next generation of GPUs, and that's going to be a while, maybe, who knows, and then not everyone is going to have them; and if you build a tool that's this expensive, it's difficult to make money with it. So if you want to look for a way to actually make massive progress in the way that AI works, you have to find a way to use less energy, to train on less data with a smaller model, and to train more reliably, in a single go rather than in 50 different runs where you stop and start and freeze and thaw. Yes, David, I think that's my point, actually. If you want to contribute to fields that also really matter, like materials science or climate modeling or biomedical and healthcare applications of AI, we need models that can work on small data, that can incorporate physical laws of nature, so symmetries in the model, conserved quantities, partial differential equations, lots of complicated math that you somehow have to understand to be able to put into your models, and then make all of these kinds of information work in those models. I'm sure that process will somehow involve neural networks, in the form of parameterized functions that get fit with some optimization method, but I'm not sure it's going to be stochastic gradient descent, I'm not sure it's just going to be ResNets, and I'm not sure they're just going to be trained on text data. And changing any of this will require understanding the math, having this toolbox that I started the lecture with.

I'm maybe from a generation, well, I am from a generation before you, at least in terms of academic development, so maybe it's easy for me to say this, because I lived through my academic coming of age during a time when the field was more theoretical. But maybe it's also my task to tell you that it's useful to spend some time understanding the math, because that turns you into someone who can actually contribute to the conversation, rather than having to, I don't know, simp on Twitter and just hope that you can somehow train your model as well, because someone gives you enough GPUs.

I've been very fortunate to interact with a lot of people over the course of my career, not just interact but also watch them from the sidelines sometimes, and I've met a lot of people who are very passionate about probabilistic modeling. Here are a few of them that I just picked at random yesterday. The people on these slides have had absolutely amazing careers, built on understanding the math well and on being able to explain to other people how to build better AI systems. So if you're worried about spending too much time on math and computer science and complicated code structures and complicated models and probabilistic processes and so on, then let me tell you that it's very much possible to build cool careers, in most of these cases academic careers, out of understanding these concepts. But actually many of these people haven't just had academic careers; they've also had extremely successful industrial careers, and in some cases they're currently running the research groups that drive the AGI revolution, if you like, at Google and at other companies.

In my group I tend to offer people both the opportunity to work on deep learning but also on structured probabilistic
models: on large-scale Gaussian processes, for example, on simulation methods for solving partial differential equations and ordinary differential equations, but also, you know, on training deep neural networks. And I notice how there's this up and down across the generations of students. At the moment I get a lot of people who just want to work on deep neural networks, and then invariably, when we start to work on these projects, it becomes very frustrating, it becomes difficult, because deep learning is just really fiddly: you have to tune lots of parameters, and sometimes I notice that even the smartest students get really, really stressed about the unpredictability of this process. At the same time I have people who work on very structured problems, complicated Gaussian process models for partial differential equations, who know exactly what they're working on, and they actually find beautiful structure; I can see it sort of clearing up their heads as they think about it. But at the same time they are worried that they are not connected enough to this AGI thing that somehow everyone's talking about. And I'm more and more thinking that maybe this is a psychological problem, that we are all getting riled up by what we see on social media, by this sense of urgency, a little bit like how those people in Los Alamos felt that there was an urgent need to do something because someone else might do something dangerous, and therefore you can't think about the stuff you might actually want to think about anymore, simply because you think someone else is going to do it. If we take a step back and think about how that world actually evolved over the last few decades, and how advances were actually made, then I hope, and I'm really convinced actually, that the next big advances will come from understanding things better rather than from just scaling things up. Or maybe I'm wrong; but even then, it's still a more fun thing to do, a more enjoyable way to do your degree or your PhD or the career you might want to work on afterwards.

This is my final slide. There is a QR code on it for feedback, and I very much hope that you're going to use the opportunity to tell me about this lecture and all the other ones, maybe the entire course, but also that you feel interested in joining the Bayesian club. All of these people, I think, identify as Bayesians, sometimes very aggressively so, sometimes not quite so openly. Think about how you want to build your career in this field, how you might enjoy it most, and how it might be most fruitful, of course for yourself as well, either here or somewhere else, although of course I hope that many of you are going to stay here with us and help build something here in Europe. Thanks.