Welcome to Dealing with Materials Data, a course on the collection, analysis and interpretation of materials data. We have done two modules so far: an introduction to R, and descriptive statistics using R. We are now in the third module, on random variables, and we are looking at some special random variables. Random variables are of two types, discrete and continuous, and some, like the uniform distribution, can be either discrete or continuous. We are looking at some discrete random variables; we have covered Bernoulli trials and the binomial distribution, and we are going to continue with discrete random variables and also see some practical applications. In the last session, when we discussed Bernoulli trials and the binomial distribution, we were talking about a random alloy: a binary A-B alloy from which we randomly pick atoms and decide whether each one is B or not, and we discussed how that can give you information about the alloy composition. There are actually microscopy techniques which do similar things, and that is what we are going to discuss in this session. So it is a continuation of discrete random variables, Bernoulli trials and binomial distributions. We are going to talk about a technique known as the atom probe technique, and how it leads to the negative binomial distribution. The atom probe is a technique to measure compositions at sub-nanometric length scales, so it is a really fine measurement of composition, and the composition measurement depends on the detector efficiency, which is less than 1. That means that if you pull, say, 10 atoms out of your sample, even if all 10 of them fall on the detector, the detector does not recognize all 10 of them; it has its own efficiency.
So it detects only some fraction of the atoms that actually reach it, because of which the compositions we measure are only estimates. If you pulled out, say, 10 atoms and the detector actually detected all 10, and you knew whether each was of type A or type B, then you could give the exact composition for that set of atoms; because that does not happen, the measured composition is an estimate. We are also going to revisit something we discussed under descriptive statistics, namely error analysis. It is very difficult to do any experiment without errors, so it is a given that there will be errors and standard deviations; as long as we have control over the standard deviation, we are okay. In the last session we also discussed how accuracy can be improved, and knowing, for example, that doing more experiments gives you better accuracy is a good thing to know. In a similar fashion, it is okay even if your measurements are only estimates, as long as you have an idea of how large the error is; then the measurement is more useful to us. So in this context we want to calculate the variance, or the standard deviation, in the estimation of composition from the atom probe experiment. That is our interest. This session, and probably one more session following it, is based on the papers by Danoix et al. in Ultramicroscopy in 2007. There are two papers; the first one is what I am going to discuss in greater detail, but the second one is equally interesting, makes some interesting points, and has all the statistical analysis done very nicely. So I strongly recommend these papers for you to take a look at, and we are going to discuss some aspects of them.
The paper also shows how the simple ideas we are learning are really of great use in doing actual analysis of this type. This is the measurement of composition at sub-nanometric length scales: how accurately you can measure, how large the error is, what you can say about the error, how to improve your experiment, and so on. So it is a very practical example which shows why we need to understand distributions. We started this discussion of probability distributions by saying that they are very important for understanding experimental data, because our understanding is that every experiment is actually probing a distribution, and every result is actually a random variate from that distribution. Knowing the distribution is therefore very important for understanding what the error is, what the data is telling us, and so on. Here is a nice example which uses probability distributions, uses the ideas of error analysis, and shows in a very practical scenario how these things matter; it also has a surprise ending, so I really like this example, and I strongly recommend that you take a look at this paper and try to read it on your own. We are going to discuss the first paper, which deals with what is known as the 1D conventional atom probe. It describes the process by which, in this experiment, atoms are selected and detected; we are going to translate this understanding into statistical language, understand the statistics that comes out of it, and based on that make some calculations about the variance in these experiments. So what is the atom probe experiment? Here is a schematic based on Danoix et al.; they have a very nice schematic. You can see that there are three regions I have marked. This is the sample.
From the sample we pull out a portion, and those atoms are expected to fall on the detector, out of which some are actually detected. The specimen from which you are pulling out the atoms has a proportion p of A atoms. We are interested in knowing the composition of A atoms in the specimen, so for each atom there are only two cases, A and not-A, and if it is A we keep counting: how many A atoms are there out of the total number of atoms pulled from the sample. So this is the specimen, and the proportion of A atoms in the specimen is p. The probed volume consists of m atoms; from the sample we pull out m atoms, out of which j atoms are of type A, so the proportion of A atoms among these m atoms is j/m. These m atoms taken from the specimen are expected to fall on the detector, but only n of them are detected; i is the number of A atoms among the detected atoms, and the proportion of A atoms at the detection stage is p0. I have marked the three stages in different shades of green to show that the proportion of A atoms need not be the same in all three stages. That is important, but we will later see in which cases it is the same and in which cases it is not, where we are making the approximation or assumption that it is the same, and how reliable that is. The detector is shown schematically; this is from the paper of Danoix et al. So we have a proportion p of A atoms in the specimen, out of which we pull out m atoms, and j of them are supposed to be A atoms, so the proportion at this stage is j/m.
That is the proportion of A atoms at the selection stage; out of these, n atoms get detected and i of them are A atoms, so i/n is p0, the proportion of A atoms after detection. Our interest is obviously in knowing the composition of the specimen, and two things are happening. One is that we are pulling atoms out; this is called the selection process, and we are going to assume that the material is a random solid solution. That is very important: if it is not, the statistics has to be different, as we realized last time. You have to assume Bernoulli trials, which means you have to assume independence of events, that the probability does not change, and so on; those assumptions are not valid if it is not a random solid solution. We also have to assume that the volume we are looking at is actually a representative volume of the material; if not, you will also get wrong results. Suppose you pulled out a region which is extremely rich in A atoms and did this experiment; you would reach wrong conclusions about the proportion of A atoms in the material. So we are assuming both that this is a random solid solution and that the volume we have pulled out is representative of the specimen. From that we take out m atoms; that is the selection process. Then there is a detection process, and detection depends on the detector, its efficiency, and so on.
Our interest is this: all you get at the end of the measurement is that you detected n atoms, out of which i of them happen to be A atoms, so you can calculate the proportion of A atoms at the detection stage. Based on that, can we say something about the composition of the specimen, and if we give the composition of the specimen, what is the error? That is the more important question we want to answer. So let us do the analysis. Like I said, we assume the specimen is a random solid solution consisting of A atoms and non-A atoms. We also assume that the number of atoms in the specimen is infinite. This assumption of infinity will keep coming up; what we mean is that, compared to the number of atoms we are taking out, the total number of atoms in the specimen is very large, or equivalently, that removing m atoms from the sample is not going to change the composition of the specimen appreciably. That will hold if the specimen has a very large number of atoms and you are pulling out a small number of them. This is an important assumption, or approximation; as long as the specimen really has many more atoms than the number you take out for analysis, it will still be a good experiment and you will get reliable results. So: the proportion of A atoms in the specimen is p; the probed volume is V, the volume of the m atoms pulled out; j atoms out of these m belong to species A, so with proportion p of A atoms, j = m times p. Out of the m atoms, n are detected by the detector and i of them belong to species A, so the proportion of A atoms is p0 = i/n, or i = n times p0, just as j = m times p.
The detector efficiency is q = n/m: if m atoms fall on the detector and only n of them are detected, the efficiency of the detector is n/m. The detector efficiency is approximately 60 percent, and it is not known exactly, because we have no idea how many atoms actually fall on the detector. Detection is also a binary process: the detector either detects an atom that falls on it or it does not. If it does not, we do not know whether the atom actually fell on the detector and the detector failed to detect it, or whether it never fell on the detector at all. We do not have this information, and that is why the detector efficiency is approximate; if you knew both numbers exactly, n/m would be the exact efficiency, but we do not know m. That is one of the problems, one of the sources of the uncertainty and the variance; we want to understand it and calculate exactly what it is, and that is the analysis of the 1D atom probe experiment done in this paper. You can think of the result of a conventional atom probe as a time-ordered sequence of detected ions: atoms are pulled out of the sample in sequence, there is a selection process and a detection process, and we keep track of how many get detected and, of those, how many are of type A. The process is to evaporate an atom, which flies towards the detector; if it is detected, it is counted; if not, we do not know whether the atom hit the detector or not. What we do know is that the detector detects approximately 60% of the atoms that fall on it; that is the number we have. The composition of the sampled volume is then decided based on detecting n atoms, out of which i of them are of type A.
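The two stages described above, selection from a random solid solution and binary detection, can be sketched as a small simulation. This is a hypothetical illustration, not code from the paper; the specimen composition p = 0.3, probed volume m = 1000 and efficiency q = 0.6 are assumed values for demonstration.

```r
# Simulate one 1D atom probe measurement of a random solid solution.
# Assumed values (hypothetical): specimen composition p, probed atoms m,
# detector efficiency q.
set.seed(42)
p <- 0.3    # proportion of A atoms in the specimen
m <- 1000   # atoms pulled out of the specimen (probed volume)
q <- 0.6    # detector efficiency

# Selection: each probed atom is A with probability p (a Bernoulli trial).
is_A <- rbinom(m, size = 1, prob = p)

# Detection: each atom is registered with probability q, independently.
detected <- rbinom(m, size = 1, prob = q)

n  <- sum(detected)            # atoms actually detected
i  <- sum(is_A & detected)     # detected atoms of type A
p0 <- i / n                    # measured composition at the detection stage

cat("n =", n, " i =", i, " p0 =", round(p0, 3), "\n")
# p0 fluctuates around p from run to run: the measurement is an estimate.
```

Re-running without the fixed seed shows p0 scattering around the true p, which is exactly the variance we want to quantify.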
Given the detector efficiency, an estimate for m is n/q, because n/m = q, so m = n/q. But this m is only an estimate, not the exact number, because, as I mentioned, the detection process is binary: there is a finite probability of detecting all incident atoms, and there is also a finite probability of detecting no incident atom. How do we understand the idea that it is an estimate? Think of an analogy. Say we pick one random atom from a random binary alloy; it is either B or not. If you pick a large number of them, the fraction of B atoms you pick will correspond to the actual composition, but there will always be an error; it is never going to be exact. Suppose your alloy composition is 0.5; you can never be assured that by picking n atoms you will get exactly n/2 of type B. If you pick 10, maybe 3 or 4 of them will be of type B; if you pick 100, maybe about 45 or 54 of them are B; and if you pick 1000, you might get some 492 or 512. As you sample larger and larger numbers, the value you calculate gets closer to the value it is supposed to have, but there is always going to be an error in the process, so it is only an estimate, not the exact number, and that is what is being said here too. If you knew that exactly m atoms fell on the detector and n of them were detected, you could calculate the efficiency exactly; but if you do not know how many atoms actually fell and you only know that n were detected, it is very difficult to say exactly what m is. It will follow a distribution, and what that distribution is, is what we are going to find out. So m is a random variate, because it is not a single number, it has a variation; and it is integer valued, because it is after all a number of atoms.
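The analogy above, picking 10, 100 and 1000 atoms from a 50:50 alloy, can be checked directly in R. A quick sketch, assuming a true composition of 0.5:

```r
# Estimating the composition of a 50:50 random binary alloy by picking
# atoms at random: the estimate converges, but never becomes exact.
set.seed(1)
p_true <- 0.5
for (n_pick in c(10, 100, 1000, 100000)) {
  n_B <- rbinom(1, size = n_pick, prob = p_true)  # number of B atoms picked
  cat(sprintf("picked %6d atoms: estimate = %.3f  (std. dev. ~ %.3f)\n",
              n_pick, n_B / n_pick,
              sqrt(p_true * (1 - p_true) / n_pick)))
}
# The standard deviation of the estimate falls as 1/sqrt(n_pick).
```

The printed standard deviation is the binomial result sqrt(p(1-p)/n), which is why larger samples give estimates closer to the true value.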
Let us say that the random variable describing this, the distribution from which m is supposed to be a random variate, is called capital M. Remember, our idea is that experiments give you samples of a probability distribution; here the random variable is M, and we are looking at a realization in which it takes the value m. It is a binary situation: atoms are either detected or not detected, and there is a probability of success, quote unquote, where in this case success means that the atom is detected. Now you can ask: how many atoms should impinge on the detector for n atoms to be detected? The answer is that it follows a negative binomial distribution with parameters n, the number of successes required, and q, the probability of success. The negative binomial counts the number of failures in a sequence of Bernoulli trials; and these are Bernoulli trials, remember, because each atom is either detected or not, the probability of detection is always q, it does not change, and different atoms being detected or not are independent events. Under the assumption that the detector keeps detecting with the same efficiency as the process goes on, we are asking: given that the number of successes is fixed, that is, n atoms are detected, how many failures happened before this number of successes was achieved? That is the negative binomial distribution. So we say that the random variable M follows a negative binomial distribution whose parameters are n, the number of successes, in this case the number of atoms detected by the detector, and q, the probability of success, in this case the efficiency of the detector. That is the idea behind the negative binomial.
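To see that "atoms impinging until n are detected" really does follow a negative binomial, one can simulate the Bernoulli detection process directly and compare with the theoretical mean. A sketch, with assumed values n = 100 and q = 0.6:

```r
# How many atoms must impinge on the detector before n are detected?
# Simulate the Bernoulli detection process directly and compare the mean
# with the negative binomial prediction E[M] = n / q.
set.seed(7)
n_detect <- 100   # required number of detections (successes)
q <- 0.6          # detection probability per impinging atom

impinge_until <- function(n, q) {
  total <- 0; hits <- 0
  while (hits < n) {                  # keep sending atoms to the detector...
    total <- total + 1
    hits <- hits + rbinom(1, 1, q)    # ...each detected with probability q
  }
  total                               # total atoms that had to impinge
}

m_sim <- replicate(2000, impinge_until(n_detect, q))
cat("simulated mean:", mean(m_sim), " theory n/q:", n_detect / q, "\n")

# Note: R's rnbinom counts the FAILURES only, so total trials = n + rnbinom:
m_nb <- n_detect + rnbinom(2000, size = n_detect, prob = q)
cat("rnbinom mean  :", mean(m_nb), "\n")
```

The two means agree (around n/q = 166.7 here), confirming that the direct simulation and the negative binomial description are the same process.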
Remember, we discussed Bernoulli and binomial; now we move to the negative binomial distribution. The probability of the random variable taking a value m is given by P(M = m) = [(m − 1)! / ((m − n)! (n − 1)!)] q^n (1 − q)^(m − n). This is the probability mass function (PMF). From it you can calculate the expectation, by multiplying each value of the variable by its probability and summing, and you get E[M] = n/q, which is what we expected; the variance is Var[M] = n(1 − q)/q². So these are the properties of the negative binomial distribution: the probability mass function, the expectation, and the variance. As we discussed last time, the R name was binom for the binomial distribution, and it is nbinom for the negative binomial distribution. So dnbinom, pnbinom, qnbinom and rnbinom are the commands, and we know what they stand for: dnbinom is the probability mass function, pnbinom the cumulative distribution function, qnbinom the quantile function, and rnbinom generates random variates from this distribution. That is why the next question is: can we plot the probability mass function, cumulative distribution function and quantile function for the negative binomial distribution? Remember that pnbinom and qnbinom are sort of inverses of each other, qnbinom being the inverse of pnbinom, and these inverses are important for working out confidence intervals.
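One point worth noting when using R here: dnbinom is parameterized by the number of failures, while the PMF above is written in terms of the total number of trials m; the two are related by failures = m − n. A small check of the PMF and of the mean and variance formulas (assumed small values n = 5, q = 0.6 so the factorials stay manageable):

```r
# The PMF above, P(M = m) = (m-1)! / ((m-n)! (n-1)!) * q^n * (1-q)^(m-n),
# is in terms of the TOTAL trials m; R's dnbinom counts failures k = m - n.
n <- 5
q <- 0.6
m <- 12   # some total number of trials, m >= n

pmf_manual <- factorial(m - 1) / (factorial(m - n) * factorial(n - 1)) *
              q^n * (1 - q)^(m - n)
pmf_R <- dnbinom(m - n, size = n, prob = q)   # failures = m - n
stopifnot(isTRUE(all.equal(pmf_manual, pmf_R)))

# Mean and variance of the total trials M: E[M] = n/q, Var[M] = n(1-q)/q^2.
k  <- 0:2000                              # failure counts (truncated support)
pk <- dnbinom(k, size = n, prob = q)
EM <- sum((n + k) * pk)                   # expectation of total trials
VM <- sum(((n + k) - EM)^2 * pk)          # variance of total trials
cat("E[M] =", EM, "vs n/q =", n / q, "\n")
cat("Var[M] =", VM, "vs n(1-q)/q^2 =", n * (1 - q) / q^2, "\n")
```

Adding the constant n to the failure count shifts the mean but leaves the variance unchanged, which is why both parameterizations give the same Var[M].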
We will come back to confidence intervals and discuss them later, but these are the things we are interested in for any distribution we take up. Can we also generate 20 random variates from the negative binomial distribution? That is the question. Obviously, to calculate these values we need the parameters, and for the negative binomial distribution the parameters are n and q. Assume n = 100 and q = 0.6, because we know the probability of detection, the success rate, is 0.6: there is a 60% probability of successfully detecting an atom. So success occurs with probability 0.6, and let us say we want 100 atoms to be detected. The question we are asking is: how many failures happen before you detect 100? That is given by the negative binomial distribution, so let us do this calculation using R. As usual, we first check that we have the right version of R, and it is a good idea to know the working directory. Now let us go through the script line by line. n is 100, because that is the value for which we are calculating this negative binomial distribution, and q, the probability, is 0.6. By now you know that the 3-by-1 setting means we are going to have a set of 3 plots arranged in 3 rows. x runs from 0 to 100, and y from 0 to 1 in steps of 0.01. The first plot is the probability mass function, the second is the cumulative distribution function, and the third is of course the quantile function. So let us do that, and you can see the result.
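The script walked through above might look something like the following sketch; the exact course script is not reproduced here, but the parameters, ranges and panel layout are as described, and R's nbinom functions work with the number of failures.

```r
# Plot the PMF, CDF and quantile function of the negative binomial
# distribution with n = 100 required detections and efficiency q = 0.6.
n <- 100
q <- 0.6

par(mfrow = c(3, 1))              # three plots, stacked in one column

x <- 0:100                        # failure counts to evaluate
y <- seq(0, 1, by = 0.01)         # probabilities for the quantile function

plot(x, dnbinom(x, size = n, prob = q), type = "h",
     xlab = "failures", ylab = "PMF", main = "dnbinom")
plot(x, pnbinom(x, size = n, prob = q), type = "s",
     xlab = "failures", ylab = "CDF", main = "pnbinom")
plot(y, qnbinom(y, size = n, prob = q), type = "s",
     xlab = "probability", ylab = "quantile", main = "qnbinom")
```

Because pnbinom and qnbinom are inverses of each other, the third panel is essentially the second one with its axes swapped.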
So this is the probability mass function: if you want 100 atoms to be detected with detection probability 0.6, the peak is at around 65 failures, that is, around 65 undetected atoms before the 100th detection. This is the cumulative distribution function, which gives you the accumulated probability: at any x value, the probability of not exceeding x; the survival function is 1 minus this. And this is the quantile function. How do we get the random variates? That is also a simple command: we ask for random variates from the negative binomial distribution, 20 of them, with n = 100 and q = 0.6, and you can see the results: 58, 63, 67, 76 and so on, up to 63. So this is the negative binomial distribution; it is relevant for the atom probe, and we can work with it using the nbinom family of functions. We will come back and continue with the atom probe and the analysis of variance. Thank you.
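As a tie-in to the variance analysis coming up next: given n detected atoms and efficiency q, the estimate of the number of impinging atoms is m = n/q, and the variance formula above, Var[M] = n(1 − q)/q², gives its spread. A sketch with the same assumed n = 100 and q = 0.6:

```r
# Given n detected atoms and detector efficiency q, estimate how many
# atoms actually impinged on the detector, and the spread of that estimate.
n <- 100
q <- 0.6

m_hat <- n / q                       # point estimate of impinging atoms
m_sd  <- sqrt(n * (1 - q)) / q       # std. dev. from Var[M] = n(1-q)/q^2

cat(sprintf("estimated m = %.1f +/- %.1f atoms\n", m_hat, m_sd))

# Check against random variates: total trials = n + failures.
set.seed(3)
m_draws <- n + rnbinom(10000, size = n, prob = q)
cat("empirical mean:", mean(m_draws), " empirical sd:", sd(m_draws), "\n")
```

So a measurement that registers 100 atoms corresponds to roughly 167 ± 11 atoms having actually left the specimen, which is the kind of uncertainty the variance analysis will propagate into the composition.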