Welcome to Dealing with Materials Data. In this course we learn about the collection, analysis and interpretation of data for materials science and engineering. We are in module 3, looking at probability distributions, specifically discrete probability distributions so far. We have looked at Bernoulli trials and the binomial and negative binomial distributions, and we are going to continue with more discrete distributions. We are using a practical example where these distributions are important: atom probe, a microscopic characterization technique for finding the composition of samples at very small length scales. We are looking at the process of detecting atoms in the atom probe technique and how we can calculate the error bars, that is, the uncertainties that come about. Knowing what happens in the atom probe technique in terms of the selection and detection of atoms, can we say something about the actual composition of the sample based on what we measure, and how large the error is? There is error in both the selection and the detection process; if we know these, can we say how much error there is in the composition we report for the actual sample? That is the question we are trying to answer. So we will continue with atom probe; we looked at the selection process and came to the conclusion that it is binomial, specifically negative binomial.
So suppose you have to detect 100 atoms: how many failures will happen before you reach the target of 100? That is what the negative binomial distribution describes, and we saw that for a detector efficiency of 0.6 you will fail roughly 50 to 70 times before you actually reach 100. We also calculated the cumulative distribution function, the quantile function and so on, and we learned how to pick random variates from the distribution. If you want to simulate this process, it is important to be able to draw random variates from the negative binomial distribution for this first stage. Now we are going to look at the hypergeometric distribution and find out why it is relevant for the atom probe technique. Just to remind you of the schematic from Dano et al.: we have a specimen with some proportion of A atoms, and we probe a volume V of this specimen. M atoms are pulled out of the volume V, of which J are of type A, so the proportion P of A atoms in the probed volume is J/M. These atoms then fall on the detector, and of the atoms that fall on the detector, N get detected, of which I are A atoms; from this we calculate the measured proportion P0 = I/N. This is the actual measurement: you might say detection happened for 100 atoms and 33 of them are of type A, so 33/100 is the composition you would measure at this point, and it has its own error. What we are asking is how much error there is in the actual sample composition we report based on this value.
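To make the selection-stage picture above concrete, here is a small simulation sketch. The lecture works in R, but this is an equivalent in Python using only the standard library, and the function name failures_before_r_successes is our own. It draws negative binomial variates directly by running Bernoulli trials with detector efficiency 0.6 until 100 detections succeed:

```python
import random

def failures_before_r_successes(r, q, rng=random.Random(42)):
    """Run Bernoulli trials with success probability q until r successes;
    return the number of failures seen along the way (a negative
    binomial draw)."""
    successes = failures = 0
    while successes < r:
        if rng.random() < q:
            successes += 1
        else:
            failures += 1
    return failures

r, q = 100, 0.6                      # target detections, detector efficiency
mean_failures = r * (1 - q) / q      # theoretical mean number of failures
draws = [failures_before_r_successes(r, q) for _ in range(2000)]
print(mean_failures, sum(draws) / len(draws))
```

The simulated average should sit near the theoretical mean r(1 − q)/q ≈ 66.7 failures, consistent with the 50-to-70 range quoted above.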
We looked at this process of pulling out M atoms, and we realized that when these atoms fall on the detector they are either detected or not detected. Given the detector efficiency, this binary process leads to the negative binomial distribution for the number of failures before a given number of successful detections. Now we are going to see how the hypergeometric distribution becomes relevant. Let N be the number of detected atoms, so the measured proportion is P0 = I/N, while in the probed volume the proportion is P = J/M. What can we say about P0 and P? We can say that P0 is an unbiased estimate of P. Why? Because detection is independent of whether an atom is of type A or not: whatever the type, the detector detects atoms with the same efficiency. It does not detect all the atoms that fall on it, only a fraction of them, but that fraction is not biased; it is not the case that if 20 A atoms fall it detects 18 while if 20 B atoms fall it detects only 10. If that happened, the proportion we measure would tell us nothing about the probed volume, unless we knew the size of the bias. However, because the detector is indifferent to whether an A or a non-A atom falls on it, and its efficiency is the same for both, we can treat P0 as an unbiased estimate of P.
Our interest is to estimate the variance in P0, because that tells us the variance in the composition we report for the specimen. Two sources of variance are involved: we have already seen the negative binomial stage and the variance associated with it, and because M is itself an estimate rather than a single fixed number, there is uncertainty there as well; on top of that there is the variance of P0 itself. So the problem is: you measured I out of N, and these N have been arbitrarily selected from M. Arbitrarily selected, because the detector has no favourites; it detects both A and non-A atoms with the same efficiency. Given M atoms containing J A atoms, in how many different ways can I A atoms be chosen in a group of N? That is the question we are asking. We know there were M atoms, of which N got detected; in that N, what are the different ways in which I of them happen to be A atoms? If i is a realization of the random variable I (capital I), then I follows a distribution called the hypergeometric distribution, with parameters J, M − J and N. The hypergeometric distribution answers exactly this question, and it is described as sampling without replacement from a finite population.
If you have an urn with J A atoms and M − J non-A atoms, and you draw N atoms from this urn, then the number I of them that are A is given by the hypergeometric distribution with parameters J, M − J and N. It is sampling without replacement from a finite population: because the population is finite, every time you pull out an atom, whether it turns out to be A or non-A changes the probability of picking an A atom on the next draw. The probability mass function for the random variable I to take the realization i is

P(I = i) = C(J, i) × C(M − J, N − i) / C(M, N),

where C(a, b) = a! / (b! (a − b)!) is the binomial coefficient. It looks involved, but it is simple: M is the total number of atoms, of which we detect N; J of the population are A, of which we detect i; and M − J are non-A, of which we detect N − i. The expectation of the hypergeometric distribution is E[I] = N P, where P = J/M, and the variance is

Var(I) = ((M − N)/(M − 1)) N P (1 − P).

If you calculate E[I/N], that is P, and Var(I/N) picks up a factor of 1/N² dividing this quantity, so

Var(I/N) = ((M − N)/(M − 1)) P (1 − P)/N.

We are going to take P ≈ P0, because we assume P0 is an unbiased estimate of P, and if M is large and is a constant given by M = N/Q, then one can show that E[I/N] = P0 and

Var(I/N) ≈ (1 − Q) P0 (1 − P0)/N.
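As a numerical check on the formulas above, here is a short Python sketch (the lecture itself works in R); hypergeom_pmf is a name we introduce, built from binomial coefficients via math.comb, using the lecture's numbers J = 85, M = 170, N = 100:

```python
from math import comb

def hypergeom_pmf(i, J, M, N):
    """P(I = i): probability that i of the N detected atoms are type A,
    sampling without replacement from M atoms of which J are type A.
    math.comb returns 0 when the second argument exceeds the first,
    which zeroes out impossible values of i automatically."""
    return comb(J, i) * comb(M - J, N - i) / comb(M, N)

J, M, N = 85, 170, 100
p = J / M  # true proportion of A atoms in the probed volume, here 0.5
pmf = [hypergeom_pmf(i, J, M, N) for i in range(N + 1)]
mean = sum(i * q for i, q in enumerate(pmf))
var = sum((i - mean) ** 2 * q for i, q in enumerate(pmf))
print(mean, N * p)                               # E[I] = N*P
print(var, (M - N) / (M - 1) * N * p * (1 - p))  # Var(I) formula
```

The brute-force mean and variance agree with the closed-form expressions N P and ((M − N)/(M − 1)) N P (1 − P).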
Remember, we are making several assumptions and approximations. First, we assume that M is a constant, which is not strictly true; it is only an estimate, not a fixed number. We assume M is large, which might be okay, since you can do an experiment pulling out a large number of atoms. And we assume P is equal to P0, which is also a reasonable approximation. If so, the variance is (1 − Q) P0 (1 − P0)/N. So this is the hypergeometric distribution; let us take a look at how to deal with it in R. It is handled by the hyper family of commands: dhyper, phyper, qhyper and rhyper, which give the probability mass function, the cumulative distribution function, the quantile function and random variates, respectively. The hypergeometric distribution has three parameters, as we saw. Let us say that we pulled out 170 atoms, of which 85 are actually of type A, and we detect 100 of them; what we want to know is I, the number of A atoms among these 100. That is what the hypergeometric distribution will give us. As usual we will use R; the version is 3.6.1. We get the working directory to check that we are in the right directory. This is more or less a repetition of the earlier plot we made for the negative binomial distribution: N = 100, M = 170, J = 85 are given. We are again going to make a column of plots, three rows, so three plots. For x the sequence is 0 to 100 in steps of 1, and for the quantile plot the probabilities run from 0 to 1 in steps of 0.01. We plot the probability mass function, the cumulative distribution function and the quantile function, remembering to give dhyper its parameters, and all three of them are plotted. If you want help in any of these cases, for example for the negative binomial, you can say ?dnbinom, and it tells you what you have to give to this function: x is a vector of non-negative integer quantiles (which is why we give this x), q is a vector of quantiles, and p is the vector of probabilities used by qnbinom for the quantile plot, which obviously has to be between 0 and 1, which is why we chose that range. n is the number of observations, something we have not used yet; we use n when we want random variates, to say how many we need. You also have to give the size and the probability, and there is an alternative characterization in terms of the mean, but we did not use that for dnbinom.
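It is also worth seeing numerically how good the large-M approximation Var(I/N) ≈ (1 − Q) P0 (1 − P0)/N is. The sketch below is in Python rather than the lecture's R, and the helper name var_ratio is our own; it compares the exact hypergeometric variance of I/N to the approximation at fixed Q = N/M as M grows:

```python
def var_ratio(M, Q, p):
    """Ratio of the exact hypergeometric Var(I/N) to the large-M
    approximation (1 - Q) * p * (1 - p) / N, holding Q = N/M fixed."""
    N = int(Q * M)  # number of detected atoms at this M
    exact = (M - N) / (M - 1) * p * (1 - p) / N
    approx = (1 - Q) * p * (1 - p) / N
    return exact / approx

for M in (200, 2000, 20000):
    print(M, var_ratio(M, Q=0.5, p=0.5))  # ratio = M/(M-1), tends to 1
```

Since 1 − Q = (M − N)/M, the ratio is just M/(M − 1): the approximation only drops the −1 in the denominator, and it is already excellent once M is in the thousands.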
Similarly for the hypergeometric functions, say phyper: we have to give x (or q, or p) along with m, n and k. What are these m, n and k? In the R help they are described in terms of white and black balls in an urn; you can think of them as A and non-A. m is the number of white balls in the urn and n is the number of black balls, which is why we give J and M − J, and k is the number of balls drawn from the urn, which is our N. So white is A, black is non-A, and out of the urn we draw k = N balls; the distribution then tells you how many of them are going to be atoms of type A. The hypergeometric distribution is used for sampling without replacement, these are its parameters, and the help page tells you what each function returns. So this is what we found: the plots tell you that if you draw 100 from a sample that had 85 A and 85 non-A atoms, you are going to measure approximately 50, with a distribution around it; we also have the cumulative distribution function and the quantiles. We can of course also get random variates from the hypergeometric distribution, so let us do that. Let us generate 20 random variates from the distribution with these parameters: 85 A out of 170 total, so 85 non-A, and we take 100 of them out of this sample. The random variates you get are numbers like 52, 48, 52, 51, 50, 45, 46, 47: distributed around 50, as we saw, with occasional values further out in the forties. You can generate more random variates and see what happens.
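The rhyper call described above can also be mimicked by literally sampling from an urn, which is a nice sanity check on the sampling-without-replacement interpretation. This is a sketch in Python rather than the lecture's R, and rhyper_like is our own name:

```python
import random

def rhyper_like(n_draws, J, M, N, rng=random.Random(0)):
    """Draw n_draws hypergeometric variates by literally sampling N atoms
    without replacement from an urn of J type-A and M - J non-A atoms,
    counting how many type-A atoms appear in each sample."""
    urn = ['A'] * J + ['B'] * (M - J)
    return [sum(1 for atom in rng.sample(urn, N) if atom == 'A')
            for _ in range(n_draws)]

draws = rhyper_like(20, J=85, M=170, N=100)
print(draws)  # variates scattered around 50
```

With J = 85 A atoms out of M = 170 and N = 100 drawn, the variates cluster around 50, just like the 52, 48, 51, ... values seen in the R session.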
If you generate, say, 40 of them, you will see the distribution more clearly, with occasional values like 56 on one side or 44 on the other; the values are distributed about 50. This is expected: remember that the composition we set is 50 percent, so the distribution returns values around 50, that is, proportions around 0.5. That is what we expect and that is what we see. So we have now looked at the atom probe technique and its two stages, the selection stage and the detection stage, and the statistics at each stage: one is negative binomial, the other is hypergeometric. Once you know which distributions they are, you know their variances, expressed in terms of the parameters of those distributions. Knowing these variances, and having learnt error propagation in the previous module, we can use this information to propagate the errors. We did discuss during error propagation, last time, how to deal with variables that are independent and variables that are not, and we will continue with a similar discussion for the atom probe technique, analyzing how the errors, the random variations you see in your measurement, contribute to the error in the composition you report for the sample. That is what we will do.
So we have given two examples of discrete distributions, the negative binomial and the hypergeometric; both are based on Bernoulli processes and the binomial distribution. In the next session we will try to calculate the variance in P, the proportion of A atoms in the sample, knowing the proportion of A atoms among the detected atoms: if you detected N atoms and found the proportion to be P0, what can you say about the error in the composition you report for this sample? That is the question we will answer in the following sessions. Thank you.