Welcome everyone. Balapriya will be starting her talk now.

Hello everyone. Can you all let me know if you can see my screen? Can somebody confirm? Then I will start. So hello everyone. I'm here to deliver my talk on discrete probability distributions: learning and testing using tools from Fourier analysis. Over the next 20 minutes, this is the rough agenda I hope to cover. First the motivation for the problem, followed by a little detail on what exactly distribution membership testing is. Then I'll go over what exactly an (n,k)-SIIRV is, and then the discrete Fourier transform and its sparsity properties, which are the key ideas we'll be using for this problem. Then we'll go over the pseudocode of the algorithm with a few code snippets, and I'll leave you with references, some open problems, and a few papers you can read if you're interested in exploring this area further.

So why do we even have to talk about probability distributions? In today's world we hear about data quite a bit: data analytics, data science, everything involves data. And with such huge amounts of data, how do we analyze it mathematically? Whenever we have data, there is also an underlying probability distribution: the data points essentially come from that underlying distribution, and how exactly do we make inferences about it? This is a very fundamental problem in statistical analysis, so let's proceed to learn more about it.

These are some of the widely studied problems. One is distribution learning, where you learn an approximation to the distribution in its entirety. Ideally you want your estimate to be as close as possible to the original distribution, which means the error between your estimate and the original distribution should be minimized. Then there's the well-known problem of parameter estimation, in which you seek to estimate certain parameters of the distribution.

Let's consider the hello world of probability examples: tossing a coin. Consider a simple Bernoulli trial. As we know, a Bernoulli random variable takes value 1 with probability p and value 0 with probability 1 − p. Suppose you're given the problem of estimating the probability of success p. What would you do? Say I give you a coin, and let the success event be getting heads: how exactly would you go about estimating the probability of success, which is essentially the parameter of the Bernoulli random variable? The Bernoulli random variable is parameterized by the success probability p. A very natural thing to do is to take the coin, toss it n times, count the number of times you got heads, add them all up, and divide by the total number of tosses. That is an estimate of the probability of success. It's also the maximum likelihood estimate, if you've done a course on estimation theory, but it's simply a very natural thing to do. Similarly, you might estimate the parameters of more complex distributions.
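To make that concrete, here is a minimal NumPy sketch of the estimator just described; the true p, the seed, and the number of tosses are illustrative values, not anything from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded for reproducibility

p_true = 0.6      # the unknown success probability (illustrative)
n_tosses = 1000   # number of coin tosses

# Simulate n Bernoulli trials: 1 = heads, 0 = tails.
tosses = rng.binomial(1, p_true, size=n_tosses)

# The natural (and maximum likelihood) estimate: count heads, divide by n.
p_hat = tosses.sum() / n_tosses
print(p_hat)      # close to 0.6 for large n
```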
Here we will look at the problem of distribution membership testing. Let's say there is some family of distributions. You can think of it as a very big set, and you have some underlying distribution, and an oracle gives you access to samples from that distribution. You can sample from the distribution, collect as many samples as you want, and do whatever you want with them. In the end, what is your goal? You should check whether the distribution these data points come from belongs to a particular family or class. The class, as I said, you can just think of as a very big set with lots of probability distributions residing in it. You have access to samples, and you would like to decide whether this distribution belongs to that set or not. This is called a membership testing problem. Don't worry if things are not very clear at this point; we will break it up into smaller chunks in the coming slides.

As I said earlier, in statistical learning this is a fundamental problem, and you have to decide membership: whether or not the distribution belongs. How many samples do you need in order to decide that? Do you need very few samples? Do you need too many? Drawing too many samples would not be sample-optimal. The key question is how many samples you need to see before deciding, because the distribution could potentially come from a very large alphabet, meaning it could take values in a very large set. In that case, how many samples do you need? These are questions that immediately arise, because we are always dealing with computational complexity and we want the procedure to be as efficient as possible.

The key idea here is in two steps. In step one, you learn the distribution assuming that it belongs to the specified class or family, and in step two, you check whether your assumption was correct. The first step is distribution learning, which I spoke about in one of the earlier slides: you learn assuming the distribution belongs to the specified class. Then, when you check in step two, either you get evidence that your assumption was correct, or you run into a contradiction and decide that your assumption was wrong.

The class of distributions considered in this particular talk is the (n,k)-SIIRV, a sum of n independent integer random variables, each supported on 0 through k − 1. If you can see this figure here (I hope you can see my pointer as well): this is a discrete uniform distribution, probably the simplest of discrete distributions, which is uniform on the interval a through b, with each point having an equal probability of appearing. If you replace a by 0 and b by k − 1, you get a random variable supported on 0 through k − 1; add up n such independent copies, and the resulting distribution is an (n,k)-SIIRV. SIIRV stands for sum of independent integer random variables, if that makes the name simpler to parse. A toy example: if each summand is uniform on 0 through 2, the support size is 3, and each of the points 0, 1, and 2 has probability 1/3 of occurring. This is exactly what I described a moment ago.
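As a minimal sketch of that definition: one (n,k)-SIIRV sample is just the sum of n independent draws. Here each summand is uniform, matching the talk's toy example (a general SIIRV allows any distribution on {0, ..., k − 1}), and the n = 10, k = 3 values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def siirv_sample(n, k):
    """One draw from an (n,k)-SIIRV with uniform summands: the sum of
    n independent random variables, each uniform on {0, 1, ..., k - 1}."""
    return int(rng.integers(0, k, size=n).sum())

# Toy example from the talk: k = 3, so each summand takes the values
# 0, 1, 2 with probability 1/3 each; the sum lies in 0 .. n*(k - 1).
print(siirv_sample(n=10, k=3))
```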
Each (n,k)-SIIRV has support from 0 through n(k − 1). I hope this is clear: each of the n random variables has support 0 through k − 1, and you are adding up n independent copies, so even though the extreme values are less probable, every summand may still come up k − 1 in a single draw. So the support runs from 0 through n(k − 1).

Now, the key concept relating the effective Fourier support and the variance: any (n,k)-SIIRV with sufficiently large variance has small effective Fourier support. What exactly is meant by small effective Fourier support, we will get to in a couple of slides.

This is just a quick recap of the discrete Fourier transform. You have a signal in the time domain which you want to convert to the frequency domain: you break the signal into frequency components and analyze how much of each frequency component is present in your time-domain signal. That's the gist of the discrete Fourier transform. Don't worry if the math here is too much to take in, because when you code it in Python it's just one line of code; ignore the details if you don't like the math. And this is the inverse DFT: essentially you map back from the frequency domain to the time domain. This slide sums up what the DFT and IDFT are, and now we move to the algorithm that learns the (n,k)-SIIRV.

The input is sample access to an (n,k)-SIIRV, let's call it P, together with some ε > 0. So you have oracle access, meaning sample access, to this distribution. In step 1 you draw some constant number of samples from the distribution and estimate the mean and the variance.

Let's go over this code snippet. I have imported the NumPy library under its usual alias np, I want the initial sample size to be 100, and let's say I fix n = 10 and k = 4. This is how I generate samples: np.random.randint(k, size=n) returns n values, each anywhere in 0 through k − 1. I haven't passed a separate probability list, so by default all values are equally probable, which is exactly the case we're looking at. So size=n means you get n such values, and you add them up; that's why there is a sum inside the comprehension, for x in range(initial_sample_size), because you want initial_sample_size many samples. That's exactly what this snippet does. You could also use the SciPy stats module, where you have random variable objects: you can freeze a random variable and call methods on it, but NumPy is a little simpler.

This, as you can see, is the empirical distribution that is obtained. As in the previous slide, after estimating the mean and variance (these can just be the sample mean and sample variance), you draw additional samples to build the empirical distribution. This is one such empirical distribution, obtained for k = 4, which means each summand's support is 0, 1, 2, 3, and n = 10. Let's quickly verify: n(k − 1) = 30, and here the support indeed goes up to 30, although the tail values are much less probable and in the empirical distribution those values rarely occur. You can see there is a nice bell-shaped curve, similar to a Gaussian.
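Here is a hedged reconstruction of the snippet being described; the variable names, and the use of np.bincount to build the empirical PMF, are my choices rather than the exact code on the slide.

```python
import numpy as np

initial_sample_size = 100
n, k = 10, 4

# Each (n,k)-SIIRV sample is the sum of n uniform draws on {0, ..., k-1}.
samples = np.array([
    np.random.randint(k, size=n).sum()
    for _ in range(initial_sample_size)
])

mu_hat = samples.mean()      # sample mean (step 1)
sigma2_hat = samples.var()   # sample variance (step 1)

# Empirical PMF over the full support 0 .. n*(k-1); here n*(k-1) = 30.
empirical_pmf = np.bincount(samples, minlength=n * (k - 1) + 1) / len(samples)
```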
That is why sums of random variables are of such interest: many results, like the law of large numbers and the central limit theorem, rely on sums of independent random variables. Let's move to the next slide.

Now that we have covered steps 1 and 2, this is step 3. Remember the key idea I showed you a few slides ago: any (n,k)-SIIRV with sufficiently large variance has approximately sparse Fourier support. So, to check whether the variance is large enough, you do a threshold check. If it is not greater than the threshold, you just output the empirical distribution and you're done, that's it. Else you go on to compute this M and find its Fourier transform. So, from the estimated variance there is a threshold check: if it passes, you compute M and move to step 4; else you output Q and stop. That's it.

Let's see what is different about the DFT modulo M. As you can see, the DFT modulo M requires our signal to have M samples: the length of the distribution we're considering should be M as well. What would happen if you applied the FFT routine directly? It would only take the first M points and ignore the rest, which means we would not be accounting for the whole distribution. So, to account for the whole distribution, we have to fold the distribution. Say you have points 0 through M − 1 of the distribution; these occupy M positions. The M-th point you fold back and add to the 0th position, and you keep doing this until the entire input array is taken care of. The folded empirical PMF will then have length M. Here is one very simple code snippet that does this: I only need indices up to M, and I can identify the number of folds from the length of the empirical PMF; whenever an index exceeds M, I wrap it around and make it coincide with the pre-existing positions. That's what this snippet does. I hope it is clear thus far.

The next step is computing the effective support. The effective support set L, this calligraphic L here, essentially collects all ζ in 0 through M − 1 such that this condition holds. Let's not worry too much about the formula; let's go over the code, which is a little simpler to understand. Essentially you have i and j between 0 and k, and δ between 0 and 1/2. Accounting for all of those, you collect every ζ in 0 through M − 1 for which the condition is met: the absolute value of ζ/M − i/j is less than this particular quantity over here. You loop over all possible i and j values, collect all ζ for which the condition is true, and that gives you the effective support. There may be disjoint sets, but you can take the union and collect them all into a single list called the effective support.
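Here is a minimal sketch of the folding step and the effective-support scan just described. The folding follows directly from the explanation above; for the support condition I only know its shape from the talk, |ζ/M − i/j| < (some quantity), so the bound is left as a caller-supplied function and should be treated as an assumption rather than the paper's exact expression.

```python
import numpy as np

def fold_pmf(pmf, M):
    """Fold a PMF of arbitrary length down to length M by wrapping:
    the mass at index i is added to position i % M, so the total
    probability mass is preserved."""
    folded = np.zeros(M)
    for i, mass in enumerate(pmf):
        folded[i % M] += mass
    return folded

def effective_support(M, k, bound):
    """Collect every zeta in {0, ..., M-1} with |zeta/M - i/j| < bound(i, j)
    for some 0 <= i <= j < k. `bound` stands in for the exact quantity
    from the paper (the talk mentions a parameter delta in (0, 1/2))."""
    support = set()
    for j in range(1, k):
        for i in range(j + 1):
            for zeta in range(M):
                if abs(zeta / M - i / j) < bound(i, j):
                    support.add(zeta)
    return sorted(support)
```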
Now that you have found the effective support, there is another thresholding step, which checks whether the Fourier transform at a particular ζ is greater than a specific constant. It's just another threshold check. Then you compute the inverse DFT. H hat, as you can see, is no longer a probability distribution: you started with a probability distribution, applied an FFT, and discarded small coefficients, so the result doesn't have to be a probability distribution per se, but it's very close to being one; that is what step 7 gives you. Then, if this particular condition is met, if this squared L2 distance you can see is less than ε²/5, you accept, else you reject. Accepting essentially means the hypothesis is accepted: your assumption that the distribution was an (n,k)-SIIRV was correct, and since we started with an (n,k)-SIIRV here, the assumption is indeed true. This is essentially the last step: for every ζ in the support set, I loop through and check against this ε²/5 bound. If you accept, the underlying distribution is an (n,k)-SIIRV; if you reject, it is not, which means your initial assumption was wrong. (A consolidated sketch of these final steps appears after the simulation results below.)

These are some of the results I obtained from simulation. I took all the k values to be prime powers, powers of 2 to be specific, drew the samples, computed all of this, and logged both the effective support set and the actual support for δ = 0.4. The actual support you can just get by Boolean indexing of the array; essentially, you want the Fourier mass outside the effective support to be arbitrarily small. You can see that in certain cases σ is less than the threshold, meaning those random variables do not have large enough variance, so you just output the empirical distribution and stop; you need not go ahead and compute the effective support. If you look at this case, the effective support is comparable: the support you obtain is just one larger than the actual support. Whereas if you consider the case k = 8, the actual support has size just 3, while the support the algorithm gives is huge.
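Pulling steps 5 through 8 together, here is a hedged end-to-end sketch. The ε²/5 acceptance bound is as quoted in the talk; the coefficient threshold, and my reading of the check as a squared L2 distance between the folded empirical PMF and the reconstructed hypothesis, are assumptions.

```python
import numpy as np

def test_siirv(folded_pmf, support_set, eps, coeff_threshold):
    """Keep only the DFT coefficients inside the effective support that
    are large enough, invert, and accept iff the leftover mass is small.
    `coeff_threshold` stands in for the specific constant in the paper."""
    M = len(folded_pmf)
    p_hat = np.fft.fft(folded_pmf)        # DFT of the folded empirical PMF

    h_hat = np.zeros(M, dtype=complex)
    for zeta in support_set:
        if abs(p_hat[zeta]) > coeff_threshold:
            h_hat[zeta] = p_hat[zeta]     # retain only large coefficients

    # Inverse DFT: close to, but not exactly, a probability distribution.
    h = np.fft.ifft(h_hat).real

    residual = np.sum((folded_pmf - h) ** 2)   # squared L2 distance
    return residual < eps ** 2 / 5             # True = accept the hypothesis
```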
So this is one avenue for improvement, and it's an open problem that is quite evident from the table. This method of using Fourier analysis has been used as a general testing method, meaning it is not restricted to the (n,k)-SIIRVs I spoke about just now: it covers a broad variety of distributions. If you look it up, Poisson multinomial distributions and Poisson binomial distributions can all be covered under a general testing framework using this Fourier analysis. But as I said, there is scope to tighten the support: if you tighten the support, can you possibly get an improvement in sample complexity? That is the question we seek to address.

Go all the way back to the algorithm where it all started: in step 1 you are drawing some number of samples, and this ε, as you can see towards the end, is the distance between your distribution and the hypothesis; it determines whether or not you accept the hypothesis. If the hypothesis should be very close to your distribution, then ε should be very small. But as ε becomes small, because you have an ε² in the denominator, the number of samples grows like 1/ε², quadratically in 1/ε. If you pick ε = 0.1, that factor of 10 improvement in accuracy costs you roughly 100 times more samples (see the short numerical sketch after the Q&A below). I hope you get how it scales. So, if you tighten the support, can there be a possible improvement in this n; can you draw fewer samples to decide? That is exactly the question we will be pursuing. And, going back to this slide, there is another open question: this uses the Fourier transform, so are there other transform techniques you could use to achieve the same general testing framework?

These are the references. The first paper has been the key paper I started reading about three months back, on testing for families of distributions via the Fourier transform; this is the conference paper, presented at NeurIPS 2018, and there is also an arXiv version if you would like detailed and formal proofs. Then there is a paper on optimal learning via the Fourier transform for sums of independent integer random variables, and another related paper on learning sums of independent integer random variables. You can also find this slide on PyCon's website, so feel free to look at these references when you're free. Thank you for attending my talk. If there are any questions, I'll take them up at this point.

Thank you, Priya, for an enlightening talk and for giving us such an involved analysis. Let me check if there are any questions. The first one we have is: can you list some practical example applications where these concepts are used?

Let's say you have data of this form: there are people in categories, and you assign a number to each of them. Say there are 10 people and you index them 0 through 9, and then you look at sums over groups of them; you can try to model your data as an (n,k)-SIIRV if that is possible. Wherever you have integer-valued data, which I believe is quite common, you can try to formulate the problem as an (n,k)-SIIRV and then apply all of this. So practical applications would depend on the data set you are considering; I would say you have to look at it from that angle.

Thank you, Priya, for answering the questions.
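As a quick numerical illustration of that 1/ε² scaling mentioned above (the constant factor is set to 1 purely for illustration):

```python
# Sample complexity grows like 1/eps^2: shrinking eps by a factor of 10
# multiplies the number of samples needed by a factor of 100.
for eps in [1.0, 0.1, 0.01]:
    print(f"eps = {eps:<5} -> ~{1 / eps ** 2:,.0f} samples")
```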
In case you have more questions, Priya will be available at the Hyderabad stage on Zulip chat, so you can always post your questions there. Thank you, Priya, for joining us today.