Then I will build up some tools: Bayes' theorem, which we will use, and we will show an R example to draw an analogy to the posterior distribution; then we will introduce hierarchical Bayes; and then we will introduce MCMC and two special tools within MCMC, so that we can do the data analysis for our data.

Camera trapping is a particular type of sampling where data are obtained in the form of images. Cameras are often situated along trails or at fixed stations in nature reserves. Whenever an individual comes into the vicinity of a camera, it gets photographed. Researchers then collect those photographs and, either manually or with the help of pattern-matching software, identify all the unique, distinct individuals. Camera-trapping methods are widely used for species that have unique stripe or spot patterns, like tigers, spotted leopards and some other cat species.

This is the map of the Nagarahole reserve, where the black dots are the camera-trap stations. There are 120 such camera-trap stations, and each of them was active for 12 consecutive nights. The quantities of interest are: how many individuals are there, what are their capture probabilities (that is, the detection probabilities), and how much do they move (a kind of measure of dispersion). The data we have are an array of ones and zeros. Only those individuals that get captured get identified; we get those capture histories and arrange them in some way so that we can get our estimates. The data are Bernoulli variables where the probability parameter is individual specific, that is, for every individual we have a particular probability of capture. We also use other data that can serve as covariates, like the locations of the different camera traps. These actually help in identifying the activity area of each of these individuals.

Now I will talk about the classical Bayes theorem, which is used everywhere nowadays. The probability of A given B is obtained from three different probabilities: the probability of B given A times the probability of A, divided by the probability of B; and the probability of B can be calculated as the sum of two different probabilities. The first term is B occurring together with A; the second term is the event B occurring while the event A does not occur.

I will use an example to explain the theorem. Suppose we have an urn with four red balls and two green balls, and the experiment is that two balls are drawn at random, one after the other, without replacement. So there are two events: A, the first ball drawn is red, and B, the second ball drawn is red. We want the probability that the first ball is red given that the second ball is red, that is, the probability of A given B. The probability of A is the probability that the first ball is red. We already know there are four red balls and two green balls, so the total number of balls is six, and the probability of A is four by six. The probability of A complement, that the first ball is green, is two by six. We also compute the probability of B given A: knowing the first ball is red, what is the probability that the second ball is also red? Since we have already drawn one red ball, only three red balls remain and the total number of balls has reduced to five, so it is three by five. And the probability of B given A complement, that is, given that the first ball is green, the probability that the second ball is red, is four by five.
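Written out in R, with the numbers from the urn example above, the calculation is just the total probability rule followed by Bayes' theorem:

```r
# Urn example: 4 red, 2 green; two draws without replacement.
# A = first ball is red, B = second ball is red.
p_A    <- 4 / 6   # P(A)
p_Ac   <- 2 / 6   # P(A complement)
p_B_A  <- 3 / 5   # P(B | A): one red ball already removed
p_B_Ac <- 4 / 5   # P(B | A complement)

# Total probability: P(B) = P(B|A) P(A) + P(B|A^c) P(A^c)
p_B <- p_B_A * p_A + p_B_Ac * p_Ac

# Bayes' theorem: P(A | B) = P(B | A) P(A) / P(B)
p_A_B <- p_B_A * p_A / p_B
p_A_B   # 0.6
```

Observing that the second draw was red makes it slightly less likely that the first was red (0.6 against the unconditional 4/6), because a red second draw is a little more probable when the first draw removed a green ball.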
Now, we have already said that B can happen in two ways: B can happen together with A, or B can happen when A has not occurred. So, using the three quantities we have computed (the probability of A, the probability of B, and the probability of B given A), we compute the probability of A given B.

How do we formulate this for a real-world problem? I will draw an analogy: the urn is the world, and it is unknown. Somehow we got to know that there are four red balls and two green balls, but we do not have much belief in that. It is a kind of information that has been given to us, and we want to improve that knowledge using the observed data, namely that the second ball was red. So we want to improve our knowledge about obtaining a red ball on the first draw, given the data. This is just a different version of the same theorem. We have data on n points; the model for the data is given as the conditional distribution of the data given the parameter. The parameter is the true state of nature, about which we have only a prior belief about theta, expressed as a probability statement. What we want to do is improve our prior knowledge based on the given observations. That is the posterior distribution: the probability of theta given the value y, which is the data. Again we need to compute these three quantities to get to the posterior distribution.

The aspect of the Bayesian approach that people like is that probability calculations are used for everything: estimation of the mean, estimation of the variance, estimation of different quantiles and other moments; the probability approach is used throughout. The thing people do not like, and the reason Bayes is not popular among frequentists, is: how do you get that knowledge? What is the prior probability? How can you expect a non-technical user to formulate such a probability statement? The prior represents your subjective beliefs, expressed as a probability statement about a set of values: you can say which values are more likely to occur than others, or you can give a likely range for the parameters. But the inference is then subjective: once you fix a prior, all your inferences are conditional on that prior. Basically, your prior is your prior; you and another person may observe the same data, but using different priors will lead to different conclusions. So, if you have prior information you should use it; if you do not, there is no unique way to specify the prior distribution. This is a vast area of research; people have been devising different methodologies for fixing priors when you have no information, and there are informative priors, non-informative priors, and so on. It is still a very lively area. And this flexibility makes the Bayesian approach more applicable to real-world problems: if you have some information, you can actually use it to improve your knowledge of the parameter.

Now I will talk about hierarchical modelling. At the topmost level there is a collection of truths, which we do not know. Then there are some layers, each following from the one above, and at the bottom we observe the data. So between the data and the truths there are layers of processes, and we work our way from the data up to the top.
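In the spirit of the R example mentioned at the start, here is a minimal sketch (not the example from the talk) of the prior-to-posterior calculation just described, for a hypothetical detection probability theta; the counts y and n below are made up:

```r
# Hypothetical numbers: y detections out of n occasions for one individual,
# unknown detection probability theta with a flat prior on (0, 1).
y <- 7; n <- 12

theta <- seq(0.001, 0.999, length.out = 1000)  # grid over the parameter
prior <- dunif(theta)                          # prior p(theta)
lik   <- dbinom(y, size = n, prob = theta)     # likelihood p(y | theta)

post <- prior * lik                            # unnormalised posterior
post <- post / sum(post)                       # normalise over the grid

post_mean <- sum(theta * post)                 # approx (y + 1) / (n + 2)
post_var  <- sum((theta - post_mean)^2 * post) # posterior variance
post_mean; post_var
```

The posterior here is just the prior reweighted by the likelihood and renormalised; the MCMC tools discussed later are for the cases where this brute-force normalisation is not feasible.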
So, the data we observe are actually the outcome of a process model. The process itself may depend on other process models, and those in turn depend on certain parameters. So this is a hierarchy of different populations and their distributions. We do not know the process or the parameters; we want to learn about them given that we observe only the data. So it is a hierarchy: the data given the process and the parameters, then the model for the process, then the parameters.

Now I want to introduce the Markov chain. It is a random process: every transition from every state happens with some particular probability, and if the chain satisfies the Markov property we call it a Markov chain. Here you can see there are two states, A and B. From A the chain can either go to B or remain in A, but time keeps moving, so what we get is a series of A's and B's in some combination. Each move from state A has a certain probability attached to it: it can remain in state A with probability 0.6, or it can go to B with probability 0.4, and the total probability has to be 1. For B it is the same: it can stay in B with probability 0.3 or go to A with probability 0.7. This is the simplest possible situation, with only two states; in real-world problems you have many states.

The motivation for Markov chain Monte Carlo is to compute certain quantities that give you insight into the posterior distribution. The problem is that the posterior may not be standard: it may not be Poisson, it may not be a common distribution like the normal, gamma or beta; it may only have a functional form attached to it. How do we extract knowledge from that functional form? If you have a long enough sample from it, you can approximate quantities of the distribution. Of the quantities shown, the second is the posterior mean and the third is the posterior variance; the first is the marginal distribution of the data, which is itself a very hard quantity to compute.

Ultimately we do this in three steps. We generate a long chain that satisfies certain conditions: irreducibility, aperiodicity and positive recurrence. Irreducibility means that from any state the chain can reach any other state, that is, the probability of going from any state to any other state is non-zero. Aperiodicity means the chain can return to a state at irregular times, not only in multiples of some fixed period. Positive recurrence means the chain keeps returning to each state and the mean return time is finite. Then our quantity of interest, a function of theta, can be approximated by a simple arithmetic mean over the chain. Finally, if we want a valid confidence statement for that estimate, we can compute the Monte Carlo standard error using the bootstrap, the jackknife or some other non-parametric method.
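A minimal R sketch of this two-state chain, assuming the transition probabilities above; it shows the long-run (ergodic) average and a crude Monte Carlo standard error computed by batch means, one simple alternative to the bootstrap or jackknife just mentioned:

```r
# Two-state Markov chain from the example: states A and B.
P <- matrix(c(0.6, 0.4,
              0.7, 0.3), nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("A", "B")))

set.seed(1)
n_iter <- 10000
chain <- character(n_iter)
chain[1] <- "A"
for (t in 2:n_iter) {
  chain[t] <- sample(c("A", "B"), size = 1, prob = P[chain[t - 1], ])
}

# Ergodic average: fraction of time spent in state A,
# close to the stationary probability 7/11 for this matrix.
mean(chain == "A")

# Crude Monte Carlo standard error by batch means (100 batches of 100)
batches <- tapply(chain == "A", rep(1:100, each = 100), mean)
sd(batches) / sqrt(length(batches))
```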
Here we show that if we want to use such chains, they have to be valid for the posterior distribution in question. In the plot there are three chains with three different starting points: one purple, one red, one blue. At the start the chains give what look like several different posterior densities, meaning they have not yet converged; but after some iterations they cluster together, and by around 10,000 iterations you see a single posterior density. All the chains have reached the same place, which means they have all converged.

Now, there are specific tools within MCMC for actually sampling from such a distribution; having started from "take a long chain", the problem is how to get the chain. I will talk about only two. The first is Gibbs sampling; the name is quite famous. What we do is break the parameter set into blocks, here theta_1 up to theta_3. We then factor out of the joint posterior all the terms that depend on, say, theta_1, and sample from that full conditional. If the full conditionals are themselves non-standard, again just a functional form, we use Metropolis-Hastings: we draw a candidate from some known proposal distribution, and if it satisfies a particular condition (the ratio R on the slide gives the condition) we accept it; otherwise we reject it and the chain does not move, staying in the same state.

Now I will show the data analysis using these tools. A quick recap: we had 120 camera traps, all active for 12 consecutive nights. The area of the minimum rectangle containing those camera traps is 679.40 square kilometres, as Arjun just mentioned. Our model for the 1-0 data is a binomial model made out of Bernoullis. The capture probability is a function of three parameters: p0, sigma and s_i. What is s_i? s_i is a latent variable attached to each individual. Each individual moves around the reserve; if we assign a single point to represent its activity area, we call that point the activity centre. So for each individual we assign such an s_i, its activity centre. Now p_i depends on s_i, because if an individual's activity centre is far from the camera traps, its capture probability will be low, and if it is very near the camera traps, the capture probability will be high, since the individual will be around the cameras much of the time. In this way we captured 45 distinct individuals.

This is the capture probability shown on the slide. The latter part means that if the distance of an individual from the camera traps (the x_k are the camera-trap locations) is large, the probability will be very low. The interpretation of p0 is that it is the probability of capture for the most exposed individual: an individual whose activity centre lies around the centre of the whole camera-trap array has the largest probability of getting captured. Now to the modelling part.
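The exact expression is the one on the slide; a common choice for this kind of capture probability in spatial capture-recapture, and the form assumed in the sketch below, is the half-normal, in which the probability falls off with the squared distance between the activity centre s_i and the trap location x_k. The locations and parameter values here are made up:

```r
# Half-normal detection function (assumed form for illustration):
#   p_ik = p0 * exp(-||s_i - x_k||^2 / (2 * sigma^2))
detect_prob <- function(s, x, p0, sigma) {
  d2 <- (s[1] - x[1])^2 + (s[2] - x[2])^2  # squared distance, centre to trap
  p0 * exp(-d2 / (2 * sigma^2))
}

# Hypothetical activity centre and two trap locations (in km)
s  <- c(2.0, 3.0)
x1 <- c(2.1, 3.2)   # trap close to the activity centre
x2 <- c(7.0, 8.0)   # trap far away

detect_prob(s, x1, p0 = 0.3, sigma = 1.5)  # relatively high
detect_prob(s, x2, p0 = 0.3, sigma = 1.5)  # essentially zero
```

Under this form, p0 is the capture probability at distance zero, which matches the idea that the most exposed individual has the highest chance of being photographed.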
So, we assign prior distributions to each of these parameters. For the activity centres, since we have no information, we use a uniform prior. Sigma is the dispersal parameter; we use a uniform prior again, but we fix its range to 0 to 5, meaning it cannot go beyond 5 units. The capture probability of the most exposed individual, p0, is uniform from 0 to 1, because it is a probability.

Data augmentation is part of the hierarchical model: we imagine a superpopulation of which our population is a subset, and of that population the captured individuals are a further subset. So there are three layers in that assumption. From the superpopulation, some individuals are members of the population and some are not. For the ith member of the superpopulation there is an indicator z_i of whether it belongs to the population, a Bernoulli variable. The thing is, we never get to observe these; they are latent variables, part of the process, and we estimate them. The model for the capture data is again the binomial I described before, but if the ith individual is part of the population then y_ij can be 1 or 0, whereas if z_i is 0, meaning the ith individual is not part of the population, then the detections are always 0. This is what we use in the actual setup.

From the joint posterior density, we take all the factors that depend on each parameter in order to do the Gibbs sampling. The function f is a product of Bernoullis, and the prior for p0 is uniform, which is a constant, so their product will not give a standard distribution for p0; the same holds for sigma and for the activity centres. But for the z's we do get a standard distribution: for a non-detected individual, the full conditional of z_i is a Bernoulli, and for a detected individual the detection itself tells us the individual is in the population, so z_i is always 1. The z_i's depend on the parameter psi, and psi in turn gets a standard posterior of this kind.
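As an illustration of that standard full conditional for z_i, here is a minimal R sketch for a single undetected, augmented individual. The detection probabilities, psi and J below are made-up numbers, and the real analysis also updates p0, sigma, psi and the activity centres at each iteration:

```r
# Data-augmentation step: update z_i for an individual never detected.
# With J occasions and trap-specific detection probabilities p_ik,
#   P(z_i = 1 | rest) = psi * prod_k (1 - p_ik)^J /
#                       (psi * prod_k (1 - p_ik)^J + (1 - psi))
update_z_undetected <- function(psi, p_ik, J) {
  pr_no_detection <- prod((1 - p_ik)^J)     # chance of an all-zero history
  prob <- psi * pr_no_detection / (psi * pr_no_detection + (1 - psi))
  rbinom(1, size = 1, prob = prob)          # Bernoulli full-conditional draw
}

# Detected individuals are certainly in the population, so their z_i = 1.
set.seed(2)
p_ik <- c(0.02, 0.005, 0.0001)  # detection probabilities at three traps
update_z_undetected(psi = 0.1, p_ik = p_ik, J = 12)
```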
Using this, we assumed the superpopulation consists of 845 individuals, of which we captured only 45; the remaining ones all have all-zero capture histories. Using these 845 capture histories and discretizing the state space S into 10,000 potential activity centres, we can run the MCMC and get the results. The number of individuals in the minimum rectangle is 91, with a standard deviation of 15, and the density, that is, the number of individuals per 100 square kilometres, is 13.5 based on this study. Using the same data, several other studies arrive at somewhat different results; the point is that if you do the modelling right, more often than not you will get a sensible answer. This study is described in Arjun's paper and also in the book by Royle and Dorazio, and that is what I have just explained. Thank you.

So this was the theory part, where you explained how it works. I just wanted to understand what tools you used and how much time it took to analyse the data.

This particular analysis was done in WinBUGS, where if you just fix the priors, the data and the initial values of the chain, you get the posterior estimates within some 4 to 5 hours. But these data are not that big, because there are only 12 consecutive nights, 120 camera traps and 45 captured individuals. If you do it in JAGS it will take somewhat longer; if you do it in R you can do it much faster. R has the flexibility of customizing things, meaning it can go outside the standard distributions, and you can tailor the sampler by doing the coding yourself. I am not a natural coder, but WinBUGS is a very good way to do it.

Hello, my name is Naveen. I had a doubt from the previous session that continues into this session as well. One of the things I observed is that you make a generalization of the Bernoulli distribution.

A generalization of what kind?

The beta distribution is a generalization of the binomial.

The beta distribution? No, the beta is a distribution for the parameter, the probability parameter.

Yes. So what you do, basically, is add an extra parameter and then make your estimation fit that parameter, so that it fits the entire data set. You are basically doing an estimation that finds the best set of parameters for the beta distribution. Am I clear, or should I repeat it?

Please repeat it.

Alright. Basically, we are doing an estimation, with a cost function on the data set, which gives an estimate for the alpha and beta parameters of the beta distribution. You do something similar in this talk also. So my question is: if I derive a generalized distribution over any one particular GLM that I have, fit an additional parameter and give it an extra degree of freedom, do you think it will always be a better fit to the data, or is it only a better fit in specific cases? I observed this in the previous talk, which is why I have this very specific doubt.

See, if you do a generalization like this on a particular problem, it should fit all the assumptions you have made; otherwise it is not a generalization. Let me hand over, since it probably relates to my colleague's session also.

So what are we doing here? We are actually trying to characterize the reality on the ground, which is filtered by the observation process. As I said, it has to become a joint model; we are viewing the two things together, not in parts, and the survey design is made such that all the parameters are estimable, so in that sense there is no problem of identifiability. For example, in the index calibration experiment I showed you, where the beta distribution came in, we just saw a set of data. The point you are trying to make is: let me go on adding parameters to, say, a simple regression and see which fits the data best. The point we are making here is that we cannot add parameters like that to fit these models; we call that overfitting. What we instead recommend is breaking it into a hierarchy, such that the two processes are entirely separate. The capability you gain is this: suppose the observation process is driven by something completely different. For example, as I said, the observation process for the tiger example was based on substrate type, so I put that in as what we call a covariate. Now, what do I attach that covariate to? I cannot attach it to the ecological process, because substrate has got nothing to do with tiger density, but it had a lot
to do with how I see things and how the data come to me. So when you break it into two parameters like that, we form the hierarchy; that is the central point here.

So there you assume conditional independence?

It depends whether they are meant to be independent parameters. But you will have situations, as in the tiger example I gave, where they are actually related, not by choice but just by chance: tigers are at very high densities in parks, and it also so happens that you can see tracks better there. But if you have broken them into two parameters, I can attach the covariate separately to either one without a problem in this hierarchical approach. Otherwise, the problem you will face is that you just keep adding, and when you add parameters to a regression, which is what is normally done, we start violating assumptions and we still have this problem of identifiability that comes in. I hope that sort of clarifies it.

I have a question. Most probably we will start with the wrong prior. So as we keep adding more and more data, will the effect of choosing a wrong prior reduce?

Actually, if your data are large enough, or if your data carry good information on the particular parameter, then whatever prior you take should not have much influence. But again, if your prior is too strong relative to the data, that is not advisable.

But suppose we keep adding data for years, and we keep deriving posteriors, and each posterior acts as the prior for the next posterior?

Yes, such methods do exist. The way you are describing it, it is a time series: whatever estimates you get at one time point, you can use them at the next time point. Such models are there.

But isn't that how it should be done? You get some data, apply a prior, get a posterior; then add some more data, and from that posterior and the new data you get the next posterior.

No, see, that is obviously one way. There is another way: you take all the accumulated past data and use it together, and you can also come up with a new prior based on that data. There are different customizable ways.

I thought I could clarify the point he was trying to make, maybe to answer you better. I think you are referring to what in Bayesian statistics is called stepwise updating. But you need to be careful about what you are updating. Suppose your process actually undergoes a change; take stocks, for example. You see fluctuations, but you might also be seeing a trend, and typically that is how these approaches are used, not for the kind of problems we do. In the kind of problems we do, we do more detailed modelling: we look at individuals, see which ones leave and which ones enter, and we estimate different sets of parameters. But the kind of problem you are talking about is whether a stock is rising or declining; there are big fluctuations and there is year-to-year autocorrelation. Now, if I blindly do stepwise updating and keep shrinking the posterior, you will go wrong, because you are neither taking the autocorrelation into account nor allowing for the fact that last year's price of a certain share need not hold tomorrow, and we have to build in that uncertainty. So, to answer you more precisely, I think it will be a mixture of two processes: there will be some updating that happens through this Bayesian updating, but the bigger issue in the whole thing is that you need to get the process right, whether you use ARIMA models, ARMA models, or various other time series models.
Whatever you take to be the right model, you should not keep changing it; you cannot update that. I hope that sort of answers it.

This is more related to the problem you addressed here. When you were defining the binary variable y_ij, it was the ith individual on the jth occasion, right? So why aren't we considering a k factor as well, for which camera made the capture?

Yes, actually, this is one particular study. There are also studies where we use that extra index, which increases the dimension of the whole array. Sometimes it gives better answers; sometimes it is a more cumbersome and more time-consuming process, and sometimes it leads to a worse result. Here it is simply easier to do it this way; that is why I presented it like this.

But say a particular tiger is captured by multiple cameras on a given occasion; then you take that to be only one capture, right?

Yes.

This will be totally orthogonal to the discussion so far, but along similar lines: the subject of your study was tigers. If instead we had camera images from traffic signals in the city, has any modelling of traffic prediction been attempted with that kind of data, something that could predict the traffic in Bangalore?

I do not actually work on these kinds of things, but if the data are more or less individual specific, meaning you get to know the different vehicles individually and can obtain proper data and proper covariates, then I think it could be modelled. But I think that problem is more or less based on signal processing.

I thought that from all the images available, the historical images, we would be able to predict something, or do some kind of traffic study, that sort of thing.

I am not able to pin down the actual problem there, but I think that if you are able to distinguish all of them and model them properly, then you should be able to do it. But I am not an expert on that.

Even a cluttered signal might do, actually, since most of the traffic signals in Bangalore are covered; the Outer Ring Road traffic, say, could be studied just from the observation points along one single stretch in north Bangalore. Maybe it is a wild thought.

See, the idea of capture-recapture originally came from P. S.
Laplace in the 1780s. He looked at birth registers in churches in France to estimate the population of France. The next application was to fish in Denmark: you mark some fish, release them, recapture, and from the proportion of marked fish in the recaptured sample you estimate, in some sense, the detection probability. Then came hunted ducks and banded ducks. So in these original ideas you actually mark the animal; with the photographs we are using natural marks. It is very interesting you mention this: one of the early capture-recapture models was developed in Edinburgh doing exactly the thing you are describing, people noting down taxi numbers in different parts of Edinburgh and estimating the total number of taxis in Edinburgh using capture-recapture.

One issue about continuing the data is this. A fundamental assumption in capture-recapture is that you are sampling a population over a very short time, in a snapshot, so that during that snapshot the numbers are not changing. So when you use the next year's data, the previous estimate may enter as a low-quality prior, but you have to repeat the experiment and take another snapshot. And there was the question of what to do if you have no prior information at all: he talked about non-informative priors, in which case this method converges to pure likelihood. Thank you.

Sorry, a very simple question. I have seen people acquire priors in questionable ways. For example, in a previous setting, the goal was to estimate the probability of success of a coin toss. Say you have 10 different coins, each with a different p_i, and you only have two tosses for each coin, and the way they constructed the prior was to take the 20 coin tosses as one population and estimate a beta distribution from that. That seems obviously wrong, because the data for the particular coin you are interested in, say the first coin, are also in there.

The coins are heterogeneous?

I mean, you do believe they have some similarity even if they are heterogeneous. My problem with it was: you want to get a better estimate for p_1, the parameter of the first coin, but you build your prior by putting those two coin tosses into the data for the beta distribution.

So you want to improve your prior distribution as you go on, that is what you are saying, so that you get a better estimate?

You have a coin, say C1, and from the two tosses of C1 you can get an estimate, a p_1 hat, using just those two tosses as data. But you want to do better than that by using a beta-binomial model. And where do you get your beta distribution from? You take all ten coins together, you treat them as if they were one homogeneous population related to this particular coin, and you get the beta distribution. Does that make sense?

Well, then you need to know the parameters of that beta distribution. There are different ways to estimate those parameters; they are actually called hyperparameters.

That's right, alpha and beta.

Yes, alpha and beta. One estimate is empirical Bayes, where you integrate over p, obtain the marginal as a function of alpha and beta, maximize that likelihood and get estimates for them. The other option is to give another model, a prior, for those two parameters.

So this was going via the empirical route. My problem was that they were including the data for the coin in question, the coin of interest C1; they were including that data in the
estimation of the beta distribution for the prior, if that makes sense.

They are using that data for the estimation of those parameters?

Yes, they are using it for the estimation of alpha and beta, and they are also using it again in the likelihood. It is a double use of the data.

Yes, exactly. So what would have been a better way to go about this? Because it seems to me this is wrong.

See, empirical Bayes is useful mostly in high-dimensional settings: suppose the number of parameters is very large, so that putting a full population model on all those parameters is very time-consuming and very complex; then using the data to sharpen your prior knowledge of those parameters makes sense, because you save a great deal of effort. But in this case, because you have very little data, it is not advisable.

So you are saying that if it had been, say, 100 coins, you would have been okay with this procedure?

With 100 it becomes much easier, and from 100 you could go to, say, 1000.

Okay, okay, thanks. Any other questions? Thank you so much.
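For completeness, here is a minimal R sketch of the empirical-Bayes step mentioned in that exchange: integrating out each coin's p, maximizing the resulting beta-binomial likelihood over alpha and beta, and then shrinking an individual coin's estimate towards the fitted prior. The coin data below are made up:

```r
# Hypothetical data: number of heads out of n tosses for each of 10 coins
heads <- c(0, 2, 0, 2, 0, 2, 1, 2, 0, 1)
n     <- 2

# Marginal (beta-binomial) negative log-likelihood, integrating out each p:
#   P(y) = choose(n, y) * B(y + a, n - y + b) / B(a, b)
negloglik <- function(par) {
  a <- exp(par[1]); b <- exp(par[2])   # keep alpha, beta positive
  -sum(lchoose(n, heads) + lbeta(heads + a, n - heads + b) - lbeta(a, b))
}

fit <- optim(c(0, 0), negloglik)       # maximize the marginal likelihood
alpha_hat <- exp(fit$par[1]); beta_hat <- exp(fit$par[2])

# Posterior mean for coin 1, shrunk towards the estimated prior
(heads[1] + alpha_hat) / (n + alpha_hat + beta_hat)
```

With only two tosses per coin the shrinkage is heavy, which is exactly the small-data situation the answer above cautions against.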