Hello and welcome back to the third lecture on probabilistic machine learning this term. Today we are going to begin moving towards real applications of probabilistic reasoning, which you might even call a very basic and elementary form of machine learning. Here in the course we've done lectures one and two, in which we found out that probabilities are a way to distribute truth over a space of hypotheses — not in a binary way, but in a continuous way. We learned that this way of distributing truth recovers certain common-sense aspects of our reasoning in our daily lives. We also saw that doing so structurally complicates the computational processes of inference beyond those of propositional logic, because we now have to keep track of a potentially combinatorially large space of hypotheses. And we saw in lecture two that one way to simplify these reasoning processes is that probabilities over variables can be independent or conditionally independent of each other, and when we find such conditional independence structure, inference can become drastically cheaper than combinatorially hard.

Today we will do two things. We will first complete our mathematical machinery, so that we finally have a license to talk about probabilities on all sorts of interesting variables, and then we'll do an extended example to really see how inference works in vaguely realistic applications. That first part, fixing our machinery, consists of two things that I've so far left out and haven't talked about at all in the previous lectures: the way I've introduced probabilities makes it hard to talk about certain kinds of variables, namely continuous variables and derived variables. We will start with the derived variables.

So, just to remind you of our definition of probabilities: they rest essentially on the notion of sets. To define probabilities, we take a space of elementary events and then define a collection of so-called measurable sets on them. We call this collection the sigma algebra. This is a collection of sets which includes both the entire space and the empty set, and which is closed under intersections, unions and differences of sets. This just makes sure that we talk about a collection of sets in which we are allowed all the operations we would like to do: we can think about joining, intersecting and separating from each other the sets we use to define probabilities.
We take a probability to be a function P which maps from the elements of the sigma algebra onto the real numbers, such that the empty set is assigned probability zero, the entire set is assigned probability one, and everything in between naturally has to have a probability between zero and one because of the third axiom, which is really the master idea of probabilities: so-called sigma additivity. It says that a union of disjoint sets is assigned a probability equal to the sum of the individual probabilities of these disjoint sets. This axiom ensures that we are not inventing truth out of thin air when we combine sets with each other, and also that we do not lose truth into thin air when we take intersections or unions of sets.

Now, this is all nice and well, and in all the examples we've had so far, the process of inference I showed you was this: I define a set of elementary events, this measurable space or Borel space that consists of the elementary set and the sigma algebra, then I assign probabilities to every element of Omega, and then we can do some reasoning on them. For example, we did this with the example of earthquakes and alarms in last week's lecture, which essentially had a set Omega consisting of binary strings over the four binary variables in the graphical model we looked at. But sometimes — actually almost all the time — there are reasoning processes in which we want to talk about so-called derived variables: variables that aren't directly the ones on which we define the sigma algebra. To give you a trivial example, or at least an easy one to follow, consider the following situation. We have one coin. We're going to throw this coin n times, and every time we throw it, it has a probability f of coming up heads. Let's assume the throws are independent of each other, so every throw creates a new binary variable; we will call these binary variables x1 to xn if we throw the coin n times. A natural question you might want to ask is: what's the probability for these n coin tosses to produce exactly r heads and n minus r tails? Notice that here our definition of probabilities breaks down. The set of atomic events is the space Omega, which consists of all binary strings of length n, if we encode heads and tails as binary values zero and one. But the variable r that we talk about isn't actually an element of this elementary space. It's a real number — sorry, it's a natural number — that lies between zero and n, and that's a different space than Omega. So we haven't actually defined a rule yet that allows us to talk about variables such as these, and of course these are important — these are actually the variables we will typically talk about.
Variables that are formed by taking sums or other kinds of algebraic operations on these elementary events will be called random variables, and we will construct them in a formal way that does exactly what you want it to do. If you just look at this example for a moment — you can stop the video here if you like and think about what the probability of this derived variable should be — you'll probably come up with exactly the right rule. What I'll do now is give you the formal definition of what this rule should be. To do so, I'll define two concepts, measurable functions and random variables, and then use them to define the probability of this derived variable, which is called a distribution measure, or the law of the random variable.

First of all, we need to define a measurable function. Consider two measurable spaces, that is, two spaces which both have an atomic set and a sigma algebra on it. In our example, the first space is the space of all binary strings and the second is the space of natural numbers between zero and n; both have sigma algebras, and in both cases these sigma algebras are trivial — they're just the power sets of these sets. Now consider a function X — in our example that was the function R mapping between these two spaces. Such a function is called measurable if, for any element G of the sigma algebra of the output space, the pre-image of this element is in the sigma algebra of the input space. A random variable is simply such a measurable function on a probability space. In our example this means that the pre-image of any subset of this finite segment of the natural numbers corresponds to a measurable set of binary strings; in particular, any value for the number of heads is reached from the set of coin configurations whose number of heads is exactly that value.

So why do we need such functions? Because if a function is measurable, we can use it to push forward, if you like, the notion of probability from the atomic set of events — the coin tosses — onto this derived space of numbers of heads, and we need to make sure that measurable sets of this derived space inherit the good properties of the original space. If there is a probability on the original space, as we've just defined in our example, then we want to use these measurable functions to construct a new probability on the derived space, and we will call this derived probability a distribution measure, or the law of the random variable X. This distribution measure is defined in exactly the way you would imagine it to be. I'm just going to tell you what the definition is, and then we'll go back to the example and check whether this definition actually makes sense. So consider a random variable X. Its distribution measure, or law, which we denote by P_X, is defined for any element of the elementary events of the output space — and therefore, via sigma additivity, also for every element of the sigma algebra — as the following function.
So P_X of g — the probability of the set g under the law of X — is obtained by taking the pre-image of g, which is an element of the sigma algebra of the input space, because we've assumed that X is a random variable, that is, a measurable function, so this pre-image is a measurable set, and then just looking up what the probability of that measurable set was in the original space. That's the probability of the set of all elementary events which lie in the pre-image of g under X. Okay, that was a complicated sentence. What does it actually mean? Let's go back to our example with coin tosses. Here the original space is the space of all binary strings of length n, and the sigma algebra on it is the power set. The random variable is called R, and it takes values little r from zero to n. And our law — I've already shortened the notation a little; I should have written P indexed by R, of R equal to little r — is what? Well, we take the set of all configurations of heads and tails such that the total number of heads is r, and then just check what its probability is. Under the generative process I defined up here, this probability is just a product of the individual probabilities, because I've assumed that the coin tosses are independent of each other. In a more general setting it might be something else; you just have to look in the original probability space to see what the probability is. Here, it's just the probability to get r heads and n minus r tails. What is that? Well, it's the probability to get heads, r times, times the probability to get tails, which is 1 minus f, n minus r times. That's our probability for r. We could actually also write this as a conditional probability with some variables f and n, but we've assumed that we know what f and n are, so maybe it's not so necessary to write this, and we will often just drop these kinds of variables from the notation. This is the law of this random variable. Just to repeat: the original space is the set of all binary strings, the sigma algebra on that original space is the power set of these binary strings, the random variable is defined by the function R, which is a measurable function, and we can use it to construct the distribution measure. We're almost done — we just haven't actually done the sum yet, so we haven't constructed what the actual law is. To do that, we have to do a little bit of combinatorics, which is a bit tedious; all of you have done it in high school, so I'm not going to dwell on it much. We just need to work out the count. We need to compute the number of ways there are to choose r heads from the n coin tosses we've done in total. How many binary strings of length n are there such that they contain exactly r ones? That is just given by — and to get to this, you basically need to think about what you learned in high school — n factorial divided by (n minus r) factorial times r factorial, which is the "n choose r" function that you can just compute with your computer. This gives rise to the law of this random variable, which looks like this. So this is a probability distribution that, as the name suggests, distributes truth over the values from zero to n.
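To make the pushforward construction concrete, here is a small sketch in code — my own illustration, not from the lecture; the variable names are mine. For small n we can literally enumerate the elementary space Omega of binary strings, assign each string its product probability, sum over the pre-image of each value r, and check that this agrees with the closed form P(R = r) = (n choose r) · f^r · (1 − f)^(n − r) that we just derived:

```python
# Sketch (not from the lecture): the law of R as a pushforward measure.
# For small n we can enumerate the whole elementary space Omega of binary
# strings, assign each string its product probability, and sum the
# probabilities of all strings in the pre-image R^{-1}({r}).
from itertools import product
from math import comb

n, f = 10, 1/3  # number of tosses and heads probability (values from the slide)

def p_omega(string):
    # probability of one elementary event (a single binary string)
    heads = sum(string)
    return f**heads * (1 - f)**(n - heads)

# law of R via the pushforward: P_R(r) = P(R^{-1}({r}))
law_pushforward = [
    sum(p_omega(s) for s in product((0, 1), repeat=n) if sum(s) == r)
    for r in range(n + 1)
]

# closed form from the combinatorial argument
law_closed_form = [comb(n, r) * f**r * (1 - f)**(n - r) for r in range(n + 1)]

for r, (a, b) in enumerate(zip(law_pushforward, law_closed_form)):
    assert abs(a - b) < 1e-12
    print(f"P(R = {r:2d}) = {b:.4f}")
```

The brute-force enumeration and the closed form agree, which is exactly the statement that the binomial law is the pushforward of the product measure on coin tosses under the "count the heads" map.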
On the slide I've chosen n to be 10 and f to be one third, and this gives this kind of distribution. What we will actually typically do — I've written this here as a note as well — is abuse notation. This is a very common thing to do. First of all, we will drop the subscript capital R, because typically we know which function we're talking about, and I won't write "capital R is equal to the value little r" — so, the random variable R takes the value little r — instead I'll just write little r. I'm allowed to do this because we make the assumption that probabilities know the name of their input variable. A while ago this seemed complicated, or maybe a weird notational convention, but for your generation it should be easy to think about, because you use programming languages that do the same thing. For example, Python knows the names of individual variables: you have the concept of a named argument that can be passed to a function, and there are also Python functions that just assume the first input you plug in can be interpreted as a certain kind of variable. This is exactly what's happening here: we could either tell the function that we are talking about the random variable R, or we could just say "you probably know what I mean when I write little r". Okay, this allows us to define this distribution. By the way, there's a name for it: it's called the binomial distribution, which is the law of the random variable given by the number of successes among so-called Bernoulli experiments; Bernoulli experiments are coin tosses.

Okay, that was problem number one. We are now allowed to talk about derived variables, which is obviously a very powerful concept. There's one more problem that we briefly have to talk about, and that's continuous input spaces. In all the examples I've shown you so far, the set of atomic events Omega was a countable — actually typically a finite, discrete — set. On such sets it's trivial to define the sigma algebra: you just take the power set, the set of all subsets of Omega, and you can show for yourself that this power set is a sigma algebra; it fulfils all of the axioms. However, in continuous spaces, which we of course want to talk about as well, not all sets — and in particular not the whole power set — are measurable in general. Of course we want to talk in our applications about real-valued variables: rates and velocities and positions in time and space are all real-valued, so they come from an uncountable space, and there is a weird problem that in such spaces not all sets are measurable. If you haven't heard about this before, I'm not going to explain how it works — actually constructing a non-measurable set takes about ten minutes by itself, and it's really confusing; these are typically really odd sets. If you want to know more, go on Wikipedia and look up "non-measurable set" and you can find out for yourself; they are typically constructed in a very deliberate way precisely so that they end up being non-measurable. It suffices to say that this problem exists, and therefore we can't simply define sigma algebras by using the power set. Early in the 20th century, by the way, these non-measurable sets were a real problem and caused all sorts of big discussions. Today, we just know that they exist and we have to deal with them.
So, to be clear, this doesn't mean that we can't define sigma algebras on continuous spaces. It just means that we can't take what would be the canonical sigma algebra on discrete spaces, which is the power set. Instead we have to find some other way of constructing sigma algebras, and one canonical way to do so — actually the canonical way — works on continuous spaces that are topological spaces. Topological spaces are spaces that allow the definition of open sets. That's actually a circular sentence, so here is what a topology actually is. Consider a space Omega and consider a collection of sets on that space. Such a collection is called a topology on Omega if it contains both the entire space and the empty set, and for all its elements it holds that potentially infinite unions of these sets, and finite intersections of these sets, are also elements of the topology. That's an abstract definition — that's one of the advantages of topologies, that they are very abstract. But for typical applications, for real vector spaces, you can think of the topology as the canonical neighbourhoods. Think of the topology on R to the d as the collection of all sets U which have the property that if x is in U, then there exists a positive number epsilon such that all points which are strictly closer to x than epsilon are also elements of U. Intuitively, this means that for any real number, the neighbourhoods are the sets around that number which include all points at distance less than some epsilon. We can do this on the real line because of the existence of a norm, or a metric: we can compute the distance between two points in the natural way. If d is one, the univariate case, we just take the difference between two real numbers, take its absolute value, and call that the distance; in the d-dimensional multivariate extension we take the sum of squared differences and then the square root.

Now, you might have noticed already, staring at this definition — even if you haven't seen the definition of a topology before — that it sounds quite a lot like the definition of a sigma algebra. It's almost a sigma algebra; it's just missing a few key pieces. So here I have the two definitions next to each other: the definition of a sigma algebra, literally copied from the previous slides, and the definition of a topology. A sigma algebra is a collection of sets — a subset of the set of all sets — just like a topology. A sigma algebra contains the entire set, just like a topology. Sigma algebras also contain all potentially infinite unions of their elements, just like a topology, and sigma algebras also contain all potentially infinite intersections of their sets. This in particular also implies that they contain the empty set, just like a topology. Why does it imply that? Do you know?
It implies it because you can take two sets that don't intersect; their intersection is the empty set, and that has to be in the sigma algebra as well. However, the sigma algebra also requires additional things. First of all, differences of two sets have to be in the sigma algebra; that's not true for a topology. By the way, what is the difference of two sets, just in case you haven't heard: if this is the set A and this is the set B, the difference of A and B is the part of A that excludes the intersection — it's the complement, within A, of the intersection of A and B. To define that, we need complements of sets. And there's a second, very subtle difference which you might have noticed: topologies allow finite intersections, while sigma algebras allow infinite intersections. If you think about this for a little bit, you might come to the conclusion that to turn a topology into a sigma algebra, we only have to add certain sets. So what you can do to construct a sigma algebra is take a topology and then check which sets have to be added to allow, in particular, all infinite intersections of elements of the topology and all differences of elements of the topology. It turns out that those sets are always available, and we can always add them to get a sigma algebra. The resulting sigma algebra is called the Borel sigma algebra. Here's the definition: consider a topological space; the Borel sigma algebra is the sigma algebra generated by the topology, which means you arrive at it by taking the topology and including all infinite intersections of elements of the topology and all complements, in Omega, of elements of the topology. This then ensures that you can build differences between sets, and therefore you have a sigma algebra. Now this sounds like a lot of abstract nonsense, and maybe it is, but it is simply our permission to talk about probability measures on continuous spaces. I haven't told you yet how to actually construct these in practice — we'll do that in a moment — but what this definition up there says is that on continuous spaces which are topological spaces, there exists a sigma algebra, and it's called the Borel sigma algebra. By the way, the Borel sigma algebra is by definition the smallest sigma algebra which includes all the open sets of the topology. In this lecture we will just assume that we have such a space, which means that we will only ever define random variables that map from discrete — typically even finite — spaces or from Euclidean spaces; Euclidean spaces, typically R to the d, are topological spaces. So: binary strings, essentially, or discrete finite spaces, and Euclidean spaces. On both of these we now know that there are sigma algebras — actually, we have identified which sigma algebras.
We're going to use either the power set, for discrete spaces, or the Borel sigma algebra, and therefore we are allowed to define probabilities. We now know that the axioms of probability theory hold, and the mechanisms — Bayes' theorem in particular, and the sum and product rules — are allowed to be used, and we can be sure that all derived properties of probabilistic inference will actually hold. By the way, on spaces that allow the construction of Borel sigma algebras there is a nice result, which I'm not going to prove: if you have two spaces which have Borel sigma algebras, then any continuous function X between them is measurable and can thus be used to define a random variable. This actually follows almost by definition, because the topological way of defining continuity is that pre-images of open sets are open sets. So this gives us a set of situations which we are now allowed to consider, and it's a very rich set: situations that can be described by variables which are either elements of a discrete space or elements of a topological — actually, a Euclidean — space. And if you allow functions that are continuous — and, as you know, the set of continuous functions is very large — then all such continuous functions can be used to define random variables, and therefore laws of derived variables. The only alarm bell we have to have in our heads is when we start talking about derived variables that arise from non-continuous functions.

Good. Now that we can define these Borel sigma algebras, we can have one grey slide that summarizes what we just did. I introduced two notions: the first is that of a random variable, the second is that of a Borel sigma algebra. A random variable is our way of constructing probabilities on derived variables. Such derived probabilities are called the law of the random variable, also known as the distribution measure of that random variable, and quite often I'm just going to say the distribution of this variable — I'm not even going to say it's a random variable, and instead of saying it's the law or the distribution measure, I'll just say the distribution or the measure and use these almost interchangeably, because in all the applications we will talk about, these technicalities really don't matter. It was just important to do it once properly. And Borel sigma algebras are our license, our permission, to talk about probabilities on Euclidean or, more generally, topological spaces: they allow us to talk about probabilities that are defined on continuous spaces. I'm specifically saying they allow us to talk about them, because they don't actually tell us how to talk about them. To do that, we have to move to the next slide — but maybe this is your chance to take a quick break. So this thought process has now shown us that we can define probabilities on continuous spaces; it just hasn't really told us how to do so in practice.
Now, it turns out that all reasonable probability distributions, if you like, allow a representation that is particularly convenient, and this representation is known as a probability density function. A probability density function is a function that satisfies the following definition. Consider a Borel sigma algebra and a probability measure on it. Some probability measures have the property that there exists a function p — a non-negative, measurable function on the space on which we have defined our probability — which satisfies the property that for all elements of the sigma algebra, the probability can be written as an integral over this particular function. Such functions are then called probability density functions. What that means is: if P has such a density, then all probabilities you might be interested in — all probabilities of arbitrary elements of the sigma algebra — can be written as integrals. Now, it turns out that not all probability measures have densities; in particular, there are measures with point masses, and those can't be written with densities, because they can't be expressed as such integrals. However, it works the other way round: all non-negative measurable functions which integrate to one over the entire domain are probability density functions of some probability measure, which is then defined in exactly this way. So if we have a density, we can use it to define a probability measure, and it turns out that many interesting measures can indeed be represented by a probability density function. Therefore, for almost the entirety of the course, we will talk a lot about PDFs, probability density functions.

These are connected to another notion called the cumulative distribution function, the CDF. The cumulative distribution function is actually a more general notion that exists for all probability measures on continuous domains. So consider a probability measure on our Borel probability space. Then you can define a cumulative distribution function — did I just say cumulative density function? That's wrong; a cumulative distribution function, CDF — by defining a function F that takes as its input a point in the space and then computes the probability assigned to the set containing all values that are smaller than x in all elements. In the univariate case it's easier to think about: it's the probability assigned to the set of all values to the left-hand side of x — I hope this is the right way around for you; yes, left-hand side of x. Now, if F is sufficiently differentiable — which it isn't always, but if it is — then P actually has a density, and that density is given by the derivative of F. That this is true follows from the fundamental theorem of calculus. So this sounds like CDFs are the more fundamental object and PDFs are derived from them if they exist: CDFs always exist, but PDFs only exist if the CDF is sufficiently differentiable.
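As a small numerical sketch of that last relationship — my own illustration, not from the slides — we can take a distribution whose CDF we can write down, differentiate the CDF numerically by finite differences, and check that we recover the density. Here I use a standard Gaussian, whose CDF can be written with the error function:

```python
# Sketch (my own illustration, not from the lecture): the PDF as the
# derivative of the CDF, checked numerically for a standard Gaussian.
from math import erf, exp, pi, sqrt

def cdf(x):
    # standard normal CDF, F(x) = P((-inf, x])
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def pdf(x):
    # standard normal density
    return exp(-0.5 * x * x) / sqrt(2.0 * pi)

h = 1e-5
for x in (-2.0, -0.5, 0.0, 1.0, 2.5):
    finite_difference = (cdf(x + h) - cdf(x - h)) / (2 * h)  # dF/dx
    print(f"x = {x:+.1f}:  dF/dx ≈ {finite_difference:.6f},  p(x) = {pdf(x):.6f}")
```

The finite-difference derivative of F matches the density p at every test point, which is the fundamental theorem of calculus at work.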
However, in practice the PDF, the probability density function, is actually the object we really care about. Why? Because the PDF essentially transfers the rules of probability, without change, directly to the continuous domain. What I mean by that is the following. PDFs have the property that their integral over the entire domain is equal to one; that's true because of Kolmogorov's axiom which says that the probability of the entire elementary domain, the atomic set, has to be one. Secondly, if we have a bivariate space of random variables x1 and x2 with a density, a PDF, then the marginal density — that is, the probability density function associated with the measure on one of them, say x1 — is given by what you expect: the integral over the other variable of the bivariate PDF, and of course the same holds for the other variable x2. So this is the sum rule; the only thing we've done is replace the capital P's with lowercase p's and the sums with integrals — very natural to do, right? And conditional densities have the same form: if we have a joint PDF over a bivariate random variable, then the conditional density of one given the other — assuming that the marginal density of the other is non-zero — can be written in this exact way. I'm not showing that this is the case; it just is, and I'll leave it to you to think about why. So PDFs essentially fulfil the sum rule and the product rule, at least intuitively: you can just replace all capital P's with lowercase p's and all sums with integrals, and you get back the rules we've already seen for probabilities, now for probability densities. That's not true for cumulative distribution functions, because they are like integrals from a certain direction up through the space. And because these two rules hold, and Bayes' theorem is a direct corollary of them, Bayes' theorem also applies to PDFs — and that's actually what we are going to do all the time. When we talk about probabilities, we will not actually talk about the probabilities themselves; we will talk about the densities and operate on densities by applying Bayes' theorem to them. Why does this work? Intuitively, the reason is that these really are densities — densities of masses. You can think of probability as mass, as truth distributed across a space, so that each part of the space contains a certain mass, and the densities are the infinitesimal version of that. If you use your physical intuition, densities transform like masses: the only thing that separates a mass from a density is the amount of space over which you integrate. And that also defines the one difference between a density and a probability: if we change the space over which we integrate to get our unit of mass, then we have to think carefully about how the densities change. We'll do that in a moment.
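Here is a small grid-based sketch of the sum rule and the product rule for densities — again my own illustration, with an assumed correlated bivariate Gaussian standing in for "some joint density": marginalizing by numerical integration gives a density that integrates to one, and dividing the joint by the marginal gives a conditional that also integrates to one.

```python
# Sketch (my own illustration): the sum rule and product rule for densities,
# checked numerically on a grid for a correlated bivariate Gaussian density.
import numpy as np

xs = np.linspace(-6, 6, 601)
dx = xs[1] - xs[0]
X1, X2 = np.meshgrid(xs, xs, indexing="ij")

# joint density p(x1, x2): zero-mean, unit-variance Gaussian with correlation 0.7
rho = 0.7
joint = np.exp(-(X1**2 - 2 * rho * X1 * X2 + X2**2) / (2 * (1 - rho**2)))
joint /= 2 * np.pi * np.sqrt(1 - rho**2)

# sum rule: p(x1) = integral of p(x1, x2) over x2
marginal_x1 = joint.sum(axis=1) * dx
print("integral of p(x1) dx1 ≈", marginal_x1.sum() * dx)          # ≈ 1

# product rule: p(x2 | x1) = p(x1, x2) / p(x1), a density in x2 for fixed x1
i = np.searchsorted(xs, 1.0)                                      # condition on x1 ≈ 1
conditional = joint[i, :] / marginal_x1[i]
print("integral of p(x2 | x1=1) dx2 ≈", conditional.sum() * dx)   # ≈ 1
```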
Before we do that, I want to show you a bit of an intuitive picture. Here is a probability density function, in red — I've just chosen one. A probability density function is a potentially multivariate function: it takes a potentially multivariate input, in this case two variables x and y, and assigns to every point in this space a non-negative real number, such that the integral over the entire space is one. By the way, this doesn't necessarily mean that the PDF never exceeds the value one: there might be locations with density higher than one, because you might concentrate all of the mass, or a large part of the mass of the entire probability, into a sub-part of the space that has measure less than one. Once we have such a bivariate distribution — such a density — we can talk about both conditional and marginal distributions. This colourful thing in the middle is the joint distribution. The marginal distribution over the variable y is what we get if we take this multivariate distribution and project it onto y, so we integrate out x. The conditional distribution, on the other hand, is what we get if we cut through the distribution at a particular point: for example, if you take a cut through here, then you get this line, and that's a conditional distribution for x — a function of x, given that we know that y is at a particular value. So this is an intuitive picture of what a probability density is. Probability density functions are the objects we are going to use almost exclusively when we actually operate with probabilities, because most of the time the variables we care about are not discrete but continuous, so we have to use them, and thankfully the sum rule, the product rule and Bayes' rule all apply to PDFs, so they're quite natural to use.

There is, however, one caveat with probability density functions: because they are constructed essentially indirectly, by taking the derivative of the cumulative distribution function, they transform non-trivially. This is something that is best just written down. I will give you a quick, almost pictorial, proof to give you a feeling for how this comes about, and then we'll actually do it in an exercise later this term. So consider a probability density function over a random variable, capital X, which might take values little x, and which is defined over some domain with a left-hand and a right-hand end. Then we can construct a new random variable by taking a function u, which we require to be monotonic and differentiable, because then it has an inverse. This monotonic, differentiable function defines a new quantity y — itself a random variable — and this new quantity y also has a probability density function if the transformation is monotonic and differentiable. This new PDF is given by the object p_y of y — there are two y's in here: the index is supposed to say that this is the function that takes as its input the variable we call y, evaluated at the value y. It is given by the old PDF that we already know, evaluated at the inverse of u at y, multiplied by the absolute value of the derivative of the inverse of u — which, for these monotonic, differentiable, invertible functions, is given by the inverse of the derivative of u. This is like an infinitesimal version of the definition of a random variable that we did earlier today. Why does it have exactly this form? It has this form because PDFs are defined through cumulative distribution functions, through the CDFs, and the CDFs are really the fundamental objects, because they define the probabilities.
We've defined the entire theory around probabilities, not around densities, so we have to make sure that the densities conform with what we've done to probabilities. Why is this exactly the right form? Well, here's a simple proof below. Consider the derivative of the function u, which exists because we've assumed so, and which is larger than zero because u is a monotonically increasing function. Now consider two points c1 and c2 in x — this is the situation drawn here. Then y, which is u of x, lies between d1 and d2, where d1 is less than d2 because u is a monotonically increasing function. We can now think about the cumulative distribution function of y. This cumulative distribution function is, by definition, given by this expression. Now we want to replace the y with x. We do that by first thinking about what this actually means in terms of u — but u is invertible, so we can also directly talk about x. This is basically what we previously did for the definition of random variables: we are essentially using the definition of a random variable to write the probability of y in terms of a probability on x. Then we use the fact that x has a density to write this cumulative distribution function of x as an integral. This is a way of writing the cumulative distribution function of y in terms of an integral over the PDF of x. That's convenient, because we can now construct from here the PDF of y, which is just defined to be the derivative of this CDF of y. Well, what is that? We know it's this function, so we just take this function and take its derivative with respect to y. That's simple calculus: using the chain rule, it tells us that we have to take p_x of v of y — where v is the inverse of u — and multiply by the derivative of this function v with respect to y. Now notice that because u is a monotonic function, its inverse v is also monotonic, and here it is also monotonically increasing. If you take a look at this picture — this is u of x — you can mentally flip it around and notice that it's still a monotonically increasing function. So for monotonically increasing functions, this works. How about monotonically decreasing functions?
For these the situation is ever so slightly more complicated, but it's basically analogous. Imagine that we have a function that is monotonically decreasing — I've drawn a little picture here; we have a decreasing function u of x. The first thing that changes is that our integration domains are exchanged: if we look at a region in which x goes from c1 to c2, then y, the image of x, lies between d1 and d2, but now d2 is less than d1. The cumulative distribution function of y is still, by definition, given by the probability measure assigned to the region to the left of y. We plug in the definition of y in terms of x, use the definition of a random variable, and now there is a "greater than" in here rather than a "less than", so we need to do the integral the other way round. We know what the integral over the entire domain, from minus infinity to plus infinity, is: it's one. So we can write this function — it's not really a cumulative distribution function, but this probability of the region to the right of v of y — as one minus the probability of the region to the left of v of y, and plug in the definition in terms of the density, which works because P is assumed to have a density. And then we do everything as before: the density we're looking for is given by the derivative of this cumulative distribution function. If you compute it, there is now a minus sign showing up; other than that, everything is the same, because the derivative of 1 with respect to y is 0. What changes is that we here have the derivative of v of y, where v is the inverse of u, and if you look again at this picture and mentally flip the axes around, the inverse of a monotonically decreasing function is also monotonically decreasing, so it has a negative derivative. Minus times minus is plus, so we can write this expression with the absolute value of the derivative — and that's exactly what we were looking for.

So this was the univariate case. For the multivariate case the situation is a bit more complicated; I'm just going to show you what the answer is, because this isn't a calculus class. If you have a multivariate joint density over multiple random variables, and you consider a continuously differentiable, injective function g with non-vanishing Jacobian — that's the concept corresponding to a monotonic function in the one-dimensional case — then the derived variable y, which is g of x, has a density that is given by the density of the original variable at the pre-image of y under g, times the determinant of the Jacobian matrix of the inverse of g. If you don't know what a Jacobian is, by the way, please look it up: it's the matrix of partial derivatives of the components of a multivariate function. So this sounds really complicated: there's a determinant, there's a Jacobian matrix.
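To make the one-dimensional rule concrete before we worry about Jacobians, here is a small sketch — my own example, not from the lecture. Take X standard normal and the monotonically increasing map y = exp(x); the rule from the slide then gives the density of Y, which we can sanity-check against a Monte Carlo estimate:

```python
# Sketch (my own example): change of variables in one dimension.
# Take X ~ N(0, 1) and the monotonically increasing map y = u(x) = exp(x).
# The rule from the slide gives p_Y(y) = p_X(u^{-1}(y)) * |d u^{-1}(y) / dy|
# with u^{-1}(y) = log(y) and d u^{-1}/dy = 1/y.
import random
from math import exp, log, pi, sqrt

def p_x(x):
    return exp(-0.5 * x * x) / sqrt(2 * pi)      # density of X

def p_y(y):
    return p_x(log(y)) * abs(1.0 / y)            # transformed density of Y

# quick Monte Carlo check: fraction of samples y = exp(x) in a small bin
samples = [exp(random.gauss(0.0, 1.0)) for _ in range(200_000)]
lo, hi = 0.5, 0.7
frac = sum(lo <= s < hi for s in samples) / len(samples)
print("Monte Carlo P(0.5 <= Y < 0.7) ≈", frac)
print("density formula              ≈", p_y(0.6) * (hi - lo))  # midpoint rule
```

Note the 1/y factor: the density gets rescaled exactly by how much the transformation stretches or compresses the space locally, which is the intuition behind the Jacobian determinant in the multivariate case.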
So this sounds complicated, but it's not going to be that complicated in actual applications. Later this term we will get exercises where you have to do this transformation, and if you do it mechanically — if I give you a concrete function — then it's usually quite straightforward to write down the Jacobian, compute the determinant (at least numerically), construct the inverse and so on, and then you'll find that this isn't so bad after all. The only thing you really have to keep in mind — and this is our next grey slide — is, first of all, that probability density functions are an important concept we're going to use all over the place: they distribute probability across continuous domains. Not every measure has a density, but every probability density function defines a measure. Probability density functions are non-negative, real-valued, integrable functions whose integral over the entire domain is one, and, when we interpret them as defining a probability measure, they satisfy these three rules, which are continuous analogues of the sum rule, the product rule and Bayes' theorem — and they actually look a lot like the sum rule, the product rule and Bayes' theorem. Therefore we can really do probabilistic inference using these density functions rather than the probabilities, which are actually the fundamental object. The only thing you have to be really careful about with densities is when you change the variable, particularly in a non-linear fashion — also in a linear fashion, but particularly in a non-linear one. If you construct derived variables from underlying variables, then the densities transform in this non-trivial way that involves the Jacobian of the inverse of your transform and its determinant — at least if the transformation is actually invertible.

With that, let's take a quick break, and then we'll do something much more fun: we'll do an experiment. I know that these very theoretical derivations are tedious, and they can be boring, especially if you're not someone who's particularly excited about mathematics. So let's end this lecture with a real example, to finally start doing some — ever so simplistic — machine learning. Let's say I want to know what proportion of people in the population wear glasses. That's maybe a bit of a silly question to ask, but of course it's a template for a fundamental kind of question you might want to ask about a population or the world at large. So how do you do this, using a probabilistic or, if you like, Bayesian approach? Now, this is one of those points where normally I would ask you questions in the lecture hall and we would have a discussion and slowly go through it, and hopefully that would help your mind to follow along. Unfortunately, because of corona, we can't do this, so I implore you: slow down the video, stop it here and there, and think for yourself; otherwise you're not going to be able to follow this example. So how do we do this?
Well, we're going to define a probabilistic generative model which allows us to do Bayesian inference, and here is how this works. We begin by introducing a random variable for the quantity we care about. Let's call it pi, the probability to wear glasses. Pi is maybe a bit of a weird symbol, but obviously I can't use p, because that's already used for probability distributions. So we're going to use pi and say: I want to know this unknown probability pi. That's obviously a real number between 0 and 1. For some of you it might already be confusing that we're trying to learn a probability with probabilities, but probabilities are just numbers, right? Real numbers between 0 and 1. There's nothing that forbids us from trying to learn a probability with probabilities. So there's this number, it lies between 0 and 1. If it's 0, it means no one in the entire population wears glasses; if it's 1, everyone is wearing glasses; and if it's 0.5, half the population is wearing glasses. And now we're going to assume that we can do experiments. In any other year I would do this by actually doing the experiment in the classroom, going around and looking at people — that would be one opportunity for us to finally get to know each other in lecture 3. We can't do that today, so please imagine me walking around, looking at individual people, and asking: is this person wearing glasses? Are they not wearing glasses? Are they wearing glasses? Is this one not wearing glasses? Every single time we make such an observation, we're essentially collecting the value of one more random variable x — let's call them x_i: x1, x2, x3 and so on. That's all we need; those are all the random variables we care about. And now we just have to define probability measures — because pi is real-valued, we'll typically define a probability density function over it — and afterwards everything is just Bayesian inference. So how do we do that? Well, we're just going to use Bayes' theorem, right? That's the mechanism we have agreed to use; we're never going to question it. That's just how probability theory works, because that's what our axioms told us to do. The only question is: what are the terms in Bayes' theorem? What's the prior and what's the likelihood? We need to assign actual values to these. Let's start with the prior. For simplicity I'll do something very simple — one of the arguments I want to make is that the prior doesn't matter so much — maybe we just assign a uniform prior. The prior is over a number between 0 and 1, so it ranges from 0 to 1, and we could just say every single number in that domain, including 0 and 1, is equally probable if you don't know anything yet. So it's just a flat distribution. This works because it's a bounded domain, and the integral from 0 to 1 over the function 1 is just 1. Great, that's our prior. Now the tricky bit: what is the likelihood? What's the probability to observe someone wearing glasses if the true probability of wearing glasses is pi?
Now, I know that this weird, seemingly recursive kind of object is confusing to many people: there's this probability, which we call pi, that we want to know about but don't know, and that is the probability which tells us what the probability is to observe someone wearing glasses. So this is one of those points where please stop the video and think about it for yourself, so that you can have the insight yourself. But now that you've maybe restarted it, I can tell you: the likelihood to observe someone wearing glasses, if the true probability to observe someone wearing glasses is pi, is just pi, right? And what's the probability of observing someone who is not wearing glasses, if the probability to observe someone wearing glasses is pi? It's 1 minus pi, of course. Great, so that's it — we're done, now we can do Bayesian inference. Oh, actually, almost done: there's this annoying normalization constant, the evidence, that we have to compute — the probability to observe someone wearing glasses. For that we need to integrate this probability density function, because it's just a function now, right? So we just have to do it, and it turns out that it's a function that can be integrated — the integral is called the beta function — and we'll just let our computer do that. Now, instead of doing this, I'll show you a demo — let's see if this works; it's a little bit of a trick to do, different from my usual setup. Okay, imagine me walking around the lecture hall now. Let's assume the people in the lecture hall are a sample from the population, and we are going to collect individual samples. What you're currently seeing is our prior: a function that maps from the interval from 0 to 1 onto the reals and assigns a probability density of 1 to every single number between 0 and 1. That's a probability density function, because it's non-negative, it's integrable, and it integrates to 1. Now let's say I see the very first person in the front row and they are wearing glasses — how lucky are we? What I need to do, to do Bayesian inference, is multiply this prior with the likelihood of observing this person wearing glasses, and that, as we just saw, is the function pi. This is typically the most confusing part of this experiment for most people, so let's go slowly. If the true probability to wear glasses is 0 and I see someone wearing glasses, the probability of that happening is 0, right? That's what the likelihood tells us. If the true probability to observe someone wearing glasses is 30% and I observe someone wearing glasses, the probability of that happening is 30%. If the true probability of wearing glasses is 90%, then observing someone wearing glasses happens with probability — oops, I'm sorry, I'm pointing at the wrong line; that's the right line, the dashed one — 90%, right? So notice that the likelihood is a function of the latent variable, not of the observable. The observed variable is the thing the likelihood assigns a probability to: the likelihood is a probability over the data, but a likelihood for the latent quantity, so it's a function of the thing we don't know. And now we can reason about what the true probability might be, and to do that we apply Bayes' theorem, which means: multiply this dashed line with the straight line — prior times likelihood — and then normalize by the evidence. So what's the evidence?
The evidence is the integral over the product of these two functions. The prior is the constant-one function, so it doesn't do anything; the evidence is just the integral over this dashed red line from 0 to 1. Well, what's that integral? That's again something you can think about for yourself — maybe stop the video if you want. It's obviously one half, right? Of course, because there is a rectangle here of size 1 and we've just drawn a straight line through it, dividing it in half. So of course it's one half. So we get to multiply by a factor of 2 — that's 1 over one half — and we're left with the black line. That's our posterior density function. Great, okay, that's our posterior. That was Bayesian inference — done. Now, of course, there's not just one person in the audience. Let's say I go and meet another person, sitting right next to them, and they are not wearing glasses. Okay, so now how do I do Bayesian inference? Well, actually, I already have a posterior from the previous observation; that posterior is this probability density function. The likelihood of the next person wearing or not wearing glasses has nothing to do with the first person, other than through the probability to observe someone wearing glasses — you can already think about conditional independence; we'll talk about that in a moment. So that means I can just take that prior — my previous posterior; the posterior from just before is now my prior — and multiply it with the likelihood for this observation, of observing someone who is not wearing glasses. We realized before that that likelihood is 1 minus that probability. So what I've done here is multiply this prior density function with this likelihood, which is just 1 minus pi, and then divide by the normalization constant. The normalization constant is the integral over the product of these two functions. This product is some kind of parabola, like this — it's actually a parabola; the unnormalized one is a little bit further down, I've just renormalized it so that it's a probability distribution. Computing the area under this quadratic function is, of course, an ever so slightly non-trivial problem, but I'm sure you can do that for yourself, maybe even in your head: integrating a quadratic function is not that hard. Now, let's say I make one more observation and the next person is also not wearing glasses. What do I do now? I hope that by now you've understood what's going on: this is now our prior after two observations, I get a third observation, and I just multiply with the likelihood for that third observation again. That likelihood is again 1 minus pi, because I've assumed I see someone who is not wearing glasses, and this is now the posterior distribution. The integral now gets a little bit harder, because we're now integrating a more complicated polynomial, but let's just say we can do that — I mean, we have a computer, right? Computers can do cool things. So let's do that. And now we're beginning to do machine learning, right? We let our computer do the integration for us, and it constructs a posterior distribution for us.
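Here is a small sketch of what such a demo computes — my own reconstruction on a grid, with a made-up observation sequence: start from the uniform prior, multiply in one likelihood term per observation (pi for "wears glasses", 1 minus pi for "doesn't"), and renormalize after each step:

```python
# Sketch (my own reconstruction of what the demo computes, on a grid).
# Uniform prior over pi, multiply in one likelihood term per observation,
# renormalize each time so we always hold a proper posterior density.
import numpy as np

grid = np.linspace(0.0, 1.0, 1001)          # values of pi
dpi = grid[1] - grid[0]
posterior = np.ones_like(grid)              # uniform prior density

observations = [1, 0, 0, 0, 1, 1, 0, 1]     # 1 = glasses, 0 = no glasses (made up)
for x in observations:
    likelihood = grid if x == 1 else 1.0 - grid
    posterior = posterior * likelihood      # prior times likelihood ...
    posterior /= posterior.sum() * dpi      # ... divided by the evidence

print("posterior mean of pi ≈", (grid * posterior).sum() * dpi)
print("posterior mass in [0.2, 0.7] ≈",
      posterior[(grid >= 0.2) & (grid <= 0.7)].sum() * dpi)
```

The normalization in each step is exactly the evidence integral from the slides, just computed numerically; whether you normalize after every observation or once at the end makes no difference to the final posterior.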
Now, let's say the next person is also not wearing glasses — I just keep multiplying in likelihoods. Then there's a person who is wearing glasses, and another one wearing glasses, and one not wearing glasses, and another one wearing glasses; here's a person wearing glasses, and here's someone not wearing glasses, and another one not wearing glasses, and another one not wearing glasses, and someone wearing glasses, and so on and so on. Usually I actually do this by going through a significant part of the lecture hall, until we have seen something like 20 people or so, and then the distribution looks something like this. In what you're currently seeing, the most recent observation is someone who is wearing glasses; that was the prior before the observation, and this is the posterior after it. What this now tells us is that we've learned something. We know that the probability to wear glasses is definitely not zero and definitely not one, and it's also quite unlikely to be 90 or 80 percent, or 10 or 20 percent, just because of the ratio of people we've seen wearing glasses or not. But after 21 people we're still not sure, right? We really don't know what it is, because it might be any number between, let's say, 20 percent and 70 percent. I mean, of course it could be any number between zero and one, other than zero and one themselves — it can't be zero or one, because otherwise we couldn't have seen both positive and negative cases — but those values are now very unlikely, and if you keep doing this, then over time this distribution will concentrate around a value which actually tells us what the probability to wear glasses is. Now, I know people will have many questions about this — they always have, and that's great; let's talk about it in the flipped classroom. Please, if you're confused by this, write down your questions and we will talk about it. I have to tell you where this experiment comes from: it's one of the oldest questions of statistics, of course, and it was actually discussed long before statistics was a thing. It goes back — and for this I need to switch back to the slides — to this wonderful man, whom you've seen before: Pierre-Simon, Marquis de Laplace. I'm actually not sure whether he was already a Marquis by the time he wrote this text, but he wrote about this experiment in his famous book, the Théorie analytique des probabilités. If you can read French, here is the original citation. Normally I would make a silly joke in the lecture hall and ask whether someone can translate this for us, and then, while you're laboriously trying to drag out your high-school French to understand what he's saying, I'd say: don't worry about it — by now we have cool machine learning technologies that can translate texts for us, so we don't need you anymore. And here is the English translation, as created by a deep network.
"The probability of most simple events is unknown," Laplace writes; "considering it a priori" — which is what we just did — "it seems susceptible to all values between zero and unity" — right, we just put a uniform prior between zero and one — "but if one has observed a result composed of several of these events, the way they enter them makes some of these values more probable than others" — the posterior starts contracting, until we get a concentrated a posteriori distribution — "thus, as the observed results are composed by the development of simple events, their real possibility becomes more and more known, and it becomes more and more probable that it falls within limits that constantly tighten and would end up coinciding if the number of simple events became infinite." Now, obviously, Laplace, writing in 1814, didn't yet have access to the wonderful compact mathematical notation we use today, but he was thinking along the exact same lines.

So what we're going to do now is go through this exercise again, more slowly, to think about how it fits into our cooking recipe for Bayesian inference. For that, let me just briefly switch something. All right, let's go through this experiment once more, but a bit more mathematically — not staring at a demo, but trying to think about what just happened. We've actually followed the cooking recipe that I outlined in the last lecture; remember, I quoted David MacKay: always write down the probability of everything. So here's our cooking recipe again. How do we build a probabilistic machine learning method? You could call it a probabilistic inference scheme, but that's what it is: a machine learning method. All good learning machines are probabilistic inference schemes; it's just sometimes hard to notice that they are. We're going to start by defining our probability space, and now that we've talked about random variables, we can actually talk about random variables rather than this complicated sigma algebra notion. You might notice that in many texts people just define random variables, and that's because random variables are functions, and because even the most basic variables can be thought of as functions of an even more basic space — maybe even the same space mapping onto itself — it's enough to just talk about random variables as the objects of interest. Here we have two kinds of random variables. One of them is the probability to wear glasses; that's a real number lying between 0 and 1 — sorry, including 0 and 1. And we have the observations; let's say there are five of them, or n of them, and then n is actually another parameter of this model. The x_i are the individual observations, the random variables — oh yes, I just told you this: these are binary variables taking values 0 or 1. And now we can draw a graphical model, if you like. This is actually the beginning of step number two: the first step is to write down — you could say the sigma algebra, or define the probability space, or just say what the random variables are — and now we can start thinking about what the situation looks like in terms of conditional independence. So we draw a picture like this, which says: there is this unknown variable, the probability to observe someone wearing glasses, and it's generating all the other ones, so we can draw from this probability over and over again, independently for each of these observations,
We can draw from this probability over and over again, independently for each of these observations, and identically, because each of these variables is drawn with the same probability. So these variables down here, x1 to x5, are said to be independently and identically distributed, i.i.d. And now comes a little bit of simplification in the notation. Formally we would have to say there is a random variable (a random variable being a function) which takes values pi between 0 and 1, but we're just going to talk about pi, because that's the value we actually care about; what matters is not the function so much as the value of that function. And the same for the capital X_i, the random variables: we'll just write the lowercase x_i. This is actually the last time we're going to be so formal; in the future I will always just write real numbers for the values, and I'm often not even going to be talking about random variables at all, just values to which we assign probability density functions. For our generative model we need a prior and a likelihood; together they make a joint distribution. So p of pi, times p of x_i given pi, gives the joint p of x_i and pi, and that's all we need: once we have a joint probability distribution, everything else is just mechanical Bayesian inference. To do Bayesian inference we start with a prior (you could think of this as the probability for pi given that you know nothing, if you find priors weird) and then compute the probability for someone to wear glasses after the first observation. For that, you multiply the prior by the likelihood and divide by the evidence: Bayes' theorem. To drive home the structure of this equation, I'm going to call the integral down here Z_1; that's a normalization constant. I'm doing this to point out that, if we're thinking about pi (which we are), then this thing down here is just a number, just a real number: it's an integral over pi, so it doesn't depend on pi anymore once we've integrated it out. It's just a number, so we might as well write it in front; the interesting bits, all the structure, are in the stuff behind it, the likelihood times the prior. So that's our posterior. And now imagine we see a second observation. In the demo I said: now the posterior becomes the prior, I just multiply in one more likelihood, and we get a posterior again. Why am I allowed to do this? You might have wondered about this yourself; it's maybe one of the many questions that come up. Well, I'm allowed to do this because of the structure of this generative model, because the samples are i.i.d. So let's say I see a second observation; now we have seen two numbers, x1 and x2. The proper mechanics of probabilistic inference tell us that to get this posterior we need to use Bayes' theorem and write down the joint distribution of all three variables, pi, x1 and x2, and then the normalization constant Z_2 (you can think for yourself about what this is, but it's an integral over pi of the expression down here, or up here maybe). What I've done here is I've already factorized this distribution: of course I could have written p of pi comma x1 comma x2, and then, because of the product rule, I'm allowed to rewrite this most generally as p of x2 given x1 and pi, times p of x1 given pi, times p of pi. Of course I could permute all the variables in here if I wanted to, because the product rule is true for all permutations of the variables; however, this particular factorization is useful because I can now use the conditional independence assumptions.
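Before we use those assumptions, it may help to see the generative model itself as code. Here is a minimal sketch of ancestral sampling from the joint p(pi, x_1, ..., x_n) = p(pi) times the product of p(x_i given pi), assuming a uniform prior; the sample size and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility

def sample_from_generative_model(n):
    """Ancestral sampling from p(pi, x_1, ..., x_n) = p(pi) * prod_i p(x_i | pi)."""
    pi = rng.uniform(0.0, 1.0)       # draw pi from the uniform prior p(pi)
    x = rng.binomial(1, pi, size=n)  # draw n i.i.d. Bernoulli observations given pi
    return pi, x

pi, x = sample_from_generative_model(n=5)
print("latent pi:", pi)
print("observations x_1..x_5:", x)
```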
The graphical model on the previous slide encoded a conditional independence structure (it's one of these fan-out elementary structures) which says that, conditioned on pi, x2 is actually independent of x1. So I can get rid of this x1 in the expression and just write it like this. And now notice that this here, up to the normalization, which I've absorbed into this real number, is actually just the previous posterior times a new likelihood. The normalization doesn't matter, because after the inference step we will always have a probability distribution: even if the thing we integrate over here isn't itself a joint probability distribution, as long as it's a non-negative function that integrates to a finite number, afterwards we're going to have a normalized probability distribution. So that's why we get to play our "yesterday's posterior is today's prior" kind of game. Once I've done that with more variables, I will get more and more of these terms in here, all of these individual observations showing up in a product; and of course, if I had n of these observations, then there would just be n of these terms in the product, multiplied by a prior, and normalized. And you can do the normalization in every single step, as I did in the demo, or you can do it at the end; it doesn't really matter, and as long as we only start talking about probability distributions once they are normalized, everything's fine. So that was step number two: step number one was to define our generative model, and step number two is to think about the structure of this model a little bit, to understand how expensive the inference is going to be. Now, the third step. So far, all of this is just abstract nonsense, right? I've just written down some symbols, but these functions don't actually have a shape yet. So then I asked you: what is the probability for someone to wear glasses, given that you've observed someone wear glasses? And we started thinking about the structure of the prior and the likelihood and actually assigned values to them. Notice that what's happening there is that we are now imposing our domain knowledge, or you could also say our assumptions, onto the model. This is the point where the philosophical debate can start; up until now we've just done mathematics and everything follows from the axioms. Well, okay, you could of course debate the conditional independence as well, and I'm sure we will in the flipped classroom, so maybe that's part of the philosophy too; but at some point our everyday knowledge, or our mathematical assumptions about the process, come in, and those are the things that can be debated. Probability theory cannot be debated, it's just a set of axioms; the assumptions we make in a concrete experiment can be questioned. So I said: the probability to observe someone wearing glasses, given that the probability to wear glasses is pi, is pi, and the probability to observe someone not wearing glasses is one minus pi. That is actually an assumption, and you can think for yourself about whether you want to question it or not; maybe we will talk about it in the flipped classroom.
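Coming back to the remark that you can renormalize after every observation or only once at the end: here is a small sanity check of that claim, a sketch on a discretized grid with a made-up observation sequence and a uniform prior.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 2001)
dx = grid[1] - grid[0]
observations = [1, 0, 1, 1, 0]   # made-up data: 1 = glasses, 0 = no glasses

def likelihood(x):
    return grid if x == 1 else (1.0 - grid)   # p(x | pi)

# Variant A: renormalize after every observation (as in the demo).
post_a = np.ones_like(grid)
for x in observations:
    post_a = post_a * likelihood(x)
    post_a = post_a / (post_a.sum() * dx)

# Variant B: multiply all likelihoods into the prior, normalize once at the end.
post_b = np.ones_like(grid)
for x in observations:
    post_b = post_b * likelihood(x)
post_b = post_b / (post_b.sum() * dx)

print("max difference between the two posteriors:", np.abs(post_a - post_b).max())
```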
Now, for convenience, let me introduce two new variables. You could call them random variables, but they are things that we are going to know, so it doesn't matter so much that they are random: we will actually know how many observations we've made. Let's say capital N for the number of people with glasses we've seen, and capital M for the number of people without glasses we've seen, and they take values little n and little m. We're going to use them to simplify our notation, because after we've seen lots of observations we don't want to plug in all the individual terms, we want to somehow condense them; and we're allowed to do this because we've defined the notion of a random variable. So after having made our observations (say five of them), N of which were positive and M of which were negative, our posterior is going to have the following structure. There will be the prior, which we haven't talked about yet, so we'll have to do that in a moment; in the demo I did that first, but let's say we leave it to the end. And the likelihood will now take the form that there are only two kinds of terms in here: either it was a positive observation, then there's a pi in here, or it was a negative observation, then there's a one minus pi in here. So we can get rid of the product symbol and instead write pi to the power N, times one minus pi to the power M, times the prior; that's what our posterior is going to be. And now we're left with only two problems: we have to define what the prior is, and we have to get this normalization constant. This is where people who criticize probability theory, or Bayesianism, might come in and say: but now the prior is the big problem, right, how are you going to deal with the prior? So here is how Laplace deals with the prior; he makes a very interesting observation. What we need to do for Bayesian inference is to compute this normalization constant, that is, we need to integrate over this function. In the experiment that is still a simple thing: I'd said, let's just say the prior is uniform, it's just one; then in the end we need to solve an integral that looks like the integral over pi to the power N times one minus pi to the power M. Now here's a super smart observation due to Laplace. To solve this problem, I need to be able to solve integrals of this kind anyway. And notice that the constant function one is actually also of this form: it's just pi to the power zero times one minus pi to the power zero, which is one times one, which is one. So since I have to solve this kind of integral anyway (which was actually a huge problem for Laplace, let's talk about that in the flipped classroom), I could just as well consider any prior that is of this form, any prior that can be written like this, where I've introduced a minus one in the exponents for convenience; if you set a and b to one, you get back our uniform prior. Because if you have a prior of this form, then when you multiply it with the likelihood, you still only have to solve an integral of this same general form. So we can be more general in our definition of the prior than just using a uniform one: we can choose priors of this general form, because then, both before and after the update, we only ever have to integrate this kind of function.
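To make this normalization integral tangible, here is a small numerical check with made-up counts under the uniform prior; it compares scipy's quadrature against scipy.special.beta, which computes the closed form we are about to name.

```python
from scipy import integrate, special

N, M = 7, 4   # made-up counts: 7 people with glasses, 4 without

# Normalization constant under a uniform prior: integral of pi^N * (1 - pi)^M over [0, 1].
numeric, _ = integrate.quad(lambda pi: pi**N * (1.0 - pi)**M, 0.0, 1.0)

# Closed form via Euler's Beta function: B(N + 1, M + 1).
closed_form = special.beta(N + 1, M + 1)

print(numeric, closed_form)   # the two agree to numerical precision
```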
And this integral has a name: it's called the Beta function. Laplace actually couldn't do this integral, so when he did his computations he used an approximation; I think that's a fun thing to discuss in the flipped classroom, so we'll talk about it then. The integral itself was solved by Leonhard Euler, and it is called the Beta integral. What this means is that we can think of a prior as arising from past observations, and if you use a uniform prior, that's a little bit like assuming we have no prior observations; at least that was Laplace's argument, which is maybe also where this minus one comes from. He in fact writes exactly that. So here he is again (it's French again, sorry, so I'll give you the English translation), and keep in mind it's 1814: this poor guy doesn't have access to proper mathematical notation yet, nor to any of the concepts we now use to simplify things, he doesn't even know what a random variable is, so he has to write this very complicated sentence. "When the values of x, considered independently of the observed result, are not equally possible, if we name z the function of x which expresses their probability" (that's our prior) "it is easy to see, by what has been said in the first chapter of this book" (so by what we've just done, basically) "that by changing, in formula (1)" (which is what I had on the previous slide) "y into y times z, we will have the probability that the value of x lies within the limits pi and pi prime; this amounts to assuming all the values of x equally possible a priori, and to considering the observed result as being formed by two independent results whose probabilities are y and z." So he's saying we can consider a more complicated prior together with our data, and think of the resulting posterior as arising from two different data sets: some a-priori observations, multiplied with the likelihood for the new observations that we collected. These a-priori observations are today often called pseudo-observations. "We can thus reduce," Laplace writes, "all the cases to the one where we assume, a priori, before the event, an equal possibility for the different values of x, and for this reason we will adopt this hypothesis in what follows." So he says: if you already have some other prior knowledge, then you can include it as if it were a data set, rather than actually calling it a prior. This algebraic structure we've used here has a name, and I'm just going to tell you about it now; we will come back to it much later in the course, it's a really beautiful concept. What we've just done here is construct something that's called a conjugate prior. Why have we done that? This is often something people ask in the lecture: because it makes the computation easy.
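To see just how easy it becomes, here is a minimal sketch of the conjugate update, assuming a Beta(a, b) prior and made-up counts; the prior parameters play exactly the role of Laplace's pseudo-observations, and a = b = 1 recovers the uniform prior.

```python
from scipy.stats import beta

a, b = 1.0, 1.0        # prior pseudo-counts; a = b = 1 is the uniform prior
N, M = 7, 4            # made-up data: 7 people with glasses, 4 without

# Conjugacy: multiplying pi^N * (1 - pi)^M into a Beta(a, b) prior gives a
# Beta(a + N, b + M) posterior, so inference reduces to adding counts.
posterior = beta(a + N, b + M)

print("posterior mean of pi:", posterior.mean())
print("95% credible interval:", posterior.interval(0.95))
# posterior.pdf can be evaluated on a grid of pi values to plot its shape.
```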
Of course, this might seem really fishy to you: if the point of probabilistic reasoning is to express everything we know, then do mechanical inference, and never question the mechanical process again, it seems a bit dodgy that we're fiddling with the prior to simplify the computation. And maybe it is; we should talk about that in the flipped classroom, please come with your questions. However, the simple answer is: think for yourself about, or maybe try out on your machine, what kinds of prior shapes you can express with this kind of prior distribution. So basically change a and b, make a plot of this function, set a to anything from just above zero to a large number, similarly for b, and see what kinds of shapes you can create. If you think that's an interesting language in which to encode prior information, then you've already bought the argument; and if you think there are some kinds of priors that you can't encode with this, then think for yourself about what you can do today, with your cool computer, which Laplace couldn't do in 1814, to replace this prior with something else.

Okay, with that we're at the end; let me briefly summarize. Random variables allow us to define derived quantities from atomic events. Borel sigma algebras can be defined on all topological spaces, which lets us define probabilities even if the elementary space is continuous, and probability density functions distribute probability across continuous domains. PDFs are the objects of interest for anything real-valued or continuous, and because they satisfy the rules of probability, we can basically treat them as if they were probabilities. The only problem is that when we transform variables, in particular non-linearly, they transform non-trivially, and we have to be careful about these transformations; but as long as we don't transform our variables, everything is fine. That was the tedious mathematical part of this lecture. We had to do it, but now we have all of our tools available, and we can start doing actual computations with real numbers on actual experiments, and think about how we do this with computers; we don't have to follow in the footsteps of 1814 anymore. I'm hoping we'll see each other again in lecture four. Thanks for your time.