So welcome everyone to Session 1B, the Mathematical/Statistical Methods session. My name is Marcela Alfaro-Córdoba, I'm a teaching assistant professor at UCSC. I would like to start by thanking our sponsor for this session, Roche. We are going to have three talks in the next hour, 15 minutes for each talk, and we will have a couple of questions before switching to the next talk. Please note that this session is a bit shorter than the parallel session, so we'll have plenty of time for questions after the third talk, if needed. Please feel free to use the chat, the Q&A, or the Slack channel #talk_math_stats to send your questions or comments during the talk. We'll relay them to the speakers during the Q&A part of this session. Our first talk is "maars: tidy inference under misspecified statistical models in R". It will be presented by two people: Riccardo Fogliato and Shamindra Shrotriya, from Carnegie Mellon University. A little bit about our speakers. Riccardo is a fourth-year PhD student in statistics at Carnegie Mellon University, advised by Alexandra Chouldechova. His experience includes being a research fellow at the Partnership on AI, and this summer he will be an intern at Microsoft Research, working with Besmira Nushi and Kori Inkpen. He holds a master's degree in statistics from the University of Torino. He is interested in machine learning methods and their applications to the social sciences. Shamindra Shrotriya is also a fourth-year PhD student in statistics and data science at Carnegie Mellon University, advised by Professor Matey Neykov. His research is focused on shape-constrained estimation. He holds an MA in statistics from UC Berkeley and enjoys learning and sharing interesting things in the statistics and machine learning space on his blog. I will share the link in the channel. So please join me in welcoming their talk. So, hi, everybody.
This is a presentation of the maars package; the name maars stands for "models as approximations in R". We are Riccardo and Shamindra, both of us are PhD students at Carnegie Mellon, CMU, and this is joint work with Arun Kumar Kuchibhotla, who is a faculty member in statistics at CMU, and we are very excited to be presenting this work today. So let's get started. In order to understand the maars package better, we first need to revisit ordinary least squares regression, also known as OLS. Moving on to slide three: OLS is a handy tool that is in every data scientist's toolkit. When we take our first statistics classes on modeling, this is generally the first method that we encounter, and it's simply great. We can model so many phenomena by running OLS regression. Here, for example, we sample 1,000 points from a simple linear model without intercept. And we might think of employing a linear regression model to describe the population-level dependence of Y on X for the points represented in the figure. Moving on to slide four: as with any good thing, however, there is a catch. We often forget that OLS is built on a series of assumptions; the usual inference for OLS is based on a well-specified linear model. In this figure, the blue line represents the fitted regression line, whereas the red lines represent a 95% confidence band. And this data is in violation of one of the OLS assumptions, namely homoskedasticity: indeed, the variability is not constant. This is something that really happens in practice. So moving on to slide five, the real question now is: can we still make inference on the coefficient of interest, even if one of the OLS assumptions is not met? After having fitted a model without intercept on the data using lm(), we can call confint() to get the 95% confidence interval for this coefficient estimate.
This confidence interval assumes homoskedasticity, that is, constant conditional variance of the errors, which in this case we do not have. Therefore, once we actually check for coverage, based on multiple replications, we find that it is below 90%. This means that if we generate the data and fit the model 100 times, this interval will contain the true parameter in fewer than 90 cases in expectation. This behavior is absolutely undesirable. So moving on to slide six: how can we actually make inference when we are dealing with models that do not satisfy all of the OLS assumptions? Luckily, there are some excellent packages such as car, clubSandwich, lmtest, sandwich, and others for performing OLS inference under model misspecification. And now there is maars, the package that we are introducing, as well. So I'll pass it on to Shamindra now. Thanks, Riccardo. Moving to slide seven: maars stands for "models as approximations in R", and it has been inspired by the discussion papers written by a group of statisticians at Wharton. These papers have recently been published in Statistical Science, and we highly recommend reading them. Going back to the packages Riccardo just mentioned: given that these existing R packages are so great, why create a new package? Let's turn our attention to the maars package and its features. Moving on to slide eight. The maars package comes batteries-included with a rich set of inferential tools for OLS under model misspecification. These inferential tools fall into three main categories. First, on the left, we have the closed-form variance estimators, which are computed by default, that is, the lm() and sandwich-based standard errors. Second, in the middle, we have the resampling-based estimators, which also help you diagnose convergence to normality. These include the empirical, multiplier, and residual bootstrap and subsampling-based methods.
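The coverage failure described above can be reproduced in a few lines of base R. The sketch below is our own illustration (not maars code; all variable names are made up): it simulates heteroskedastic data repeatedly and compares the coverage of the classical lm() interval with a hand-rolled HC0 sandwich interval for the slope.

```r
set.seed(1)
n <- 1000; beta <- 2; reps <- 500
z <- qnorm(0.975)
cover_lm <- cover_hc <- 0
for (r in seq_len(reps)) {
  x <- runif(n, 0, 10)
  y <- beta * x + rnorm(n, sd = x)     # error sd grows with x: heteroskedastic
  fit <- lm(y ~ x - 1)                 # no-intercept model, as in the talk
  b <- coef(fit)[["x"]]
  e <- resid(fit)
  se_lm <- summary(fit)$coefficients["x", "Std. Error"]
  # HC0 sandwich for a single regressor: sqrt(sum(x^2 * e^2)) / sum(x^2)
  se_hc <- sqrt(sum(x^2 * e^2)) / sum(x^2)
  cover_lm <- cover_lm + (abs(b - beta) <= z * se_lm)
  cover_hc <- cover_hc + (abs(b - beta) <= z * se_hc)
}
cover_lm / reps   # noticeably below the nominal 0.95
cover_hc / reps   # close to 0.95
```

The classical interval undercovers because it pretends the error variance is constant; the sandwich interval stays valid without that assumption.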
Third, on the right, we provide by default valid hypothesis testing under model misspecification, that is, chi-squared tests rather than F-tests are reported, and also some experimental model misspecification diagnostic tools, including non-linearity detection. We note, importantly, that the lm() variance and the residual bootstrap, highlighted in yellow, are only valid under the full OLS assumptions. The remaining variance types perform valid inference under complete model misspecification. Most importantly, from a user perspective, all of these tools are accessible through a single comp_var() function, which is set up to accommodate more such tools in the future. Moreover, the output of the comp_var() function readily allows for a quick comparison of all of these estimators, thus enabling misspecification diagnostics. Moving on to slide nine. Although having such inferential tools is helpful to the data scientist, what really distinguishes maars from other packages? The key idea which makes maars unique is its strong emphasis on pedagogy, not just on research. We make this pedagogical emphasis clear to new users in three key ways. First, on the left, maars explicitly allows the user to print the inferential assumptions for the different variance estimators. This is to minimize the time overhead for data scientists, specifically so that they don't have to keep spending time looking up the research papers which discuss the assumptions behind the different estimators and their validity for inference. By making these assumptions explicit in our output, we hope to make these tools less intimidating and more accessible to new users. Second, in the middle, we explicitly emphasize teaching these deep concepts to new users by example. We do this mainly through writing very detailed vignettes. These come in two main forms.
The first is where we take research papers on similar topics and reproduce tables and plots from them, again increasing the accessibility of the research literature to data scientists. Second, we provide lesson plans through vignettes. For example, we know the empirical and multiplier bootstrap are valid under model misspecification, but how different are they in practice? We can use maars to conduct systematic simulations and provide a detailed comparison between these estimators via our lesson plan vignettes. Third, on the right, maars follows tidy principles in its function naming and in its focus on returning tidy tables for easy downstream analysis. Let's see what we mean specifically by this in the next slide. So moving on to slide 10: before seeing maars demoed in action, it is worth quickly describing what we mean by our tidy maars grammar. The main idea here is that maars code for performing OLS inference under model misspecification should ideally be written and read as prose, left to right, top to bottom. We note that this idea is not new, and we were inspired by the tidyverse and similar packages. In our maars grammar, on the left, we see that the main nouns are the lm and maars_lm objects. In the middle are our verbs, which are the maars functions that act on these noun objects. These include generic methods such as summary, print, confint, et cetera. We also have a tidy analog for each of these methods, following the get_ prefix. This reminds the user that the object returned is a tidy table, similar to the broom package. We name maars functions using this consistent convention so that both reading and writing maars code aligns with the communication of these ideas. Lastly, on the right, we see that code composition in maars becomes easy using the magrittr pipe operator, so that these deep statistical and econometric ideas can be communicated easily as prose.
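As a taste of what such a lesson-plan comparison might look like, here is a hand-rolled base-R sketch (ours, not the maars implementation) of two bootstrap standard errors for the slope of a no-intercept fit: the empirical (pairs) bootstrap resamples rows, while the multiplier bootstrap reweights the residual scores with Rademacher weights.

```r
set.seed(2)
n <- 500; B <- 1000
x <- runif(n, 0, 10)
y <- 2 * x + rnorm(n, sd = x)              # heteroskedastic errors
fit <- lm(y ~ x - 1)
b_hat <- coef(fit)[["x"]]; e <- resid(fit); sxx <- sum(x^2)

# empirical (pairs) bootstrap: refit the slope on resampled (x, y) pairs
boot_emp <- replicate(B, {
  i <- sample.int(n, n, replace = TRUE)
  sum(x[i] * y[i]) / sum(x[i]^2)
})
# multiplier bootstrap: perturb the score sum(x * e) with Rademacher weights
boot_mult <- replicate(B, {
  w <- sample(c(-1, 1), n, replace = TRUE)
  b_hat + sum(w * x * e) / sxx
})
sd(boot_emp)    # the two SEs come out close to each other
sd(boot_mult)   # (and to the sandwich SE), even under heteroskedasticity
```

In a simulation like this the two estimators agree closely, and systematic comparisons of exactly this kind are what the lesson-plan vignettes walk through.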
Moving on to slide 11: we have discussed the key principles upon which maars is based. It's now time to get hands-on and see briefly what a maars workflow looks like. So moving on to slide 12. To begin our demo, we will work in a simulated setting. Recall the simple toy simulated example that Riccardo presented earlier. Here we generate data from a linear model with heteroskedastic noise, where the variance is not constant. On the left is what the R code used to generate this type of plot looks like. So how would we go about modeling this type of relationship? In the next slide, 13: since we know and believe that the data is generated from a linear model, we can start by running the lm() workflow that we know, regressing Y on X without intercept. By running the usual summary command on our fitted lm object, we see that the OLS estimate of our coefficient is indeed correct. But of course, as shown earlier, the standard errors don't provide the appropriate coverage of the true parameter anymore. So how can we now use maars to perform inference in this situation? We'll see this in the next slide. We note that every maars workflow begins with fitting lm, as we just did. Next, we take our lm object and pipe it into comp_var(), the most important function in maars. This creates a new object of class maars_lm, which in this case we name maars_fit. The comp_var() design is very simple, and in this case we ask for the empirical bootstrap, multiplier bootstrap, and residual bootstrap standard errors to be computed, with a thousand replications each. In this case, we created our maars object implicitly, using many default inferential assumptions. How can we explicitly check what these assumptions are? We just print the maars fitted object. And on the right, we see that the print method explicitly conveys the inferential assumptions used and even reveals what default assumptions were run.
In this case, for example, the multiplier bootstrap was run using default Rademacher weights. We built this feature for ourselves during the development process, and found that many users like that the object makes these inferential assumptions explicit. We hope that this catches on in the R community, and that more assumptions behind statistical objects are made explicit through the generic print methods. I'll now pass back to Riccardo to describe more maars methods. Well, thank you, Shamindra. So moving on to slide 15. Most of the generic methods that are currently implemented in maars also have tidy analogs. So, for example, we can call the typical summary method on a maars object, and the summary method does exactly what we would normally expect from calling summary on an lm object. It just looks a bit nicer, or at least we think it's nicer: our summary displays the assumptions together with the coefficient estimates, the standard error estimates, p-values, and so on. However, we can also obtain similar information in the form of a tibble by calling the get_summary function. We say that get_summary is the tidy analog of summary. And the idea of these analogs extends well beyond summary. So moving on to slide 16. Again, we try to embrace the tidy workflow, and here is another example, where we call the get_plot function on a maars object to obtain eight model diagnostics. Six of these diagnostics are the same ones that you will get from calling plot on an lm object, plus two additional diagnostics. The first of these two is shown here: plot A displays the confidence intervals for the coefficient estimates, where each interval is based on a different method for computing the variance, including all the bootstrap methods. And as we might expect, we see that the intervals differ. So moving on to slide 17. This was a very quick dive into maars, and we have by no means covered all of its functionality.
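To illustrate the get_ naming convention, here is a toy analog in base R; this is a hypothetical helper of our own, not a maars function. Instead of printing, a get_-style verb returns a plain tidy table that flows into downstream analysis.

```r
# Hypothetical get_-style verb: tidy confidence intervals from a fit
get_confint <- function(fit, level = 0.95) {
  ci <- confint(fit, level = level)
  data.frame(term      = rownames(ci),
             estimate  = unname(coef(fit)),
             lower     = ci[, 1],
             upper     = ci[, 2],
             row.names = NULL)
}

fit <- lm(mpg ~ wt, data = mtcars)
tidy_ci <- get_confint(fit)
tidy_ci$term   # "(Intercept)" "wt"
```

Because the result is an ordinary data frame rather than printed text, it can be filtered, joined, or plotted directly, which is the point of the get_ prefix.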
But what are some of the key upcoming features that we have in our pipeline? First, we plan to polish documentation and features, which includes making our code faster by parallelizing it, among other things. We also plan to spend some time developing user-friendly vignettes. Then we are going to spend some time on new functionality, such as ANOVA, which we are currently working on, and using conformal inference for predictions. Lastly, one very important element is that we want to extend the package beyond just OLS, in order to also include GLMs and other models. So moving on to slide 18. These are some of the references to the papers that we cited in the presentation and that have also inspired us. And moving on to slide 19, the last slide. To conclude, you can install our package maars via pak or through devtools, and we will also soon push the package onto CRAN. We would truly love your feedback, so please try our package and let us know what you think. Feel free to open an issue, don't hesitate at all, and also feel free to make a contribution. And thank you very much for listening. Thank you so much, Riccardo and Shamindra. So we have time for one question. I still do not see questions in the Q&A, but I will invite you to pose your questions either in the Q&A or in the Slack channel. I do have one question, and it is whether you have tried maars for teaching, and if so, in what kind of classes and how? Yeah, so basically we haven't yet tried it for teaching. However, next semester we're planning to actually teach it in a course with Professor Arun Kumar Kuchibhotla. maars grew out of a course, the same course that Riccardo and I took last year, and we developed maars to teach ourselves the concepts from that course. And in the fall, we're planning to teach a couple of lessons in that class and use it as a pilot case. That sounds great. And another question is: as busy grad students, how do you find time to write a package?
And you have already explained your motivation, but I know from what I remember of grad school, you have a lot of things to do. So how did you find time to write a package? Yes, it's definitely a great question, and the answer is: very slowly, taking it very slowly. It's a side project for both of us, and we're doing it mostly for fun. We started in October, and we have moments in which we work a lot on it; and then, like now, I'm on an internship balancing this work and research, so for about a month it has been stalled. So I guess, yeah, very slowly. That's it. I'll just add to that: if you do want to help us build the package, please feel free to open an issue. We're really friendly and we're happy to take on contributions. Sounds great, and congrats on the good work. Thank you so much, both of you. Thank you. Moving on to our second talk, please join me in welcoming Martin Binder from LMU Munich, Germany, who will present mixed integer evolution strategies with miesmuschel. Martin is a third-year PhD student in the Statistical Learning and Data Science department at the Ludwig Maximilian University in Munich. He is interested in automatic feature preprocessing, feature extraction, and model selection. He is the main author of mlrCPO and mlr3pipelines, both software packages for data preprocessing within machine learning pipelines. Before finding his way into computational statistics and machine learning, Martin obtained a degree in theoretical physics and a Master of Science degree in biostatistics. Martin, the mic is yours. Hi, thank you very much, and welcome, everybody, to my presentation of my R package, miesmuschel, which contains the letters M-I-E-S, which stand for mixed integer evolution strategies. In the beginning, first: you can find my package on GitHub as miesmuschel. It has documentation, it is very well tested, and you can actually install it and run it.
And if you're watching this video later on YouTube or somewhere, you can actually download this and follow along with some of the code. So first, what is a Miesmuschel? Besides containing the abbreviation, it is also the German word for a mollusc, which Wikipedia tells me is the blue mussel in English. They look like this; usually when you see them in the wild, they look very similar. And this is the kind of word I always think about when I read the M-I-E-S abbreviation, so I named the package like this. So how is the presentation going to proceed? First, I'm going to tell you a little bit about evolution strategies and about optimization in general. Then I'm going to show how miesmuschel works: first its building blocks, the operators, then how to use it. Then there's a very interesting use case for miesmuschel, where we can actually combine operators to work on more interesting optimization spaces than many other optimization methods allow. And finally, I'm going to give some outlook on what I plan to do with the package and where the work is heading. So first, optimization. We do optimization in lots of parts of engineering and science, data science in particular, where we have some kind of function, and "function" is a very general word here: something where you have some kind of input that produces some kind of output, and you desire the output to be as big as possible or as small as possible. These particular functions are just examples; you don't need to remember the formulas, but remember that we have some kind of process that gives us a scalar output, and on the input we have some kind of boundaries, some kind of constraints. So in our example, we have two dimensions, X and Y, and we have a function that produces some kind of output. And here we're trying to maximize this output, so we're basically trying to climb this hill.
Evolution strategies are a population-based optimization method that tries to in some way imitate natural evolution. In natural evolution, we have a population of individuals that produce children in some way, and these children are similar to the parents, but not completely similar. And of these children, some are more likely to survive than others. So the more fit individuals survive, and over time you get individuals that are more adapted to the environment that they live in. Our method, which is an abstraction of that, or which tries to imitate it in some way, works as follows. At first we have some kind of initial population, and this population consists of individuals. So here this is a table with some rows, and you can imagine these are just points in your search space; these are individuals that in some way live, and they have some kind of fitness. The fitness we get from evaluating the function on these X values. So this is our first step: we initialize a population by randomly sampling and evaluating. Now we do the optimization procedure. We iterate some steps over and over, where we generate offspring in some way, and this offspring then generates offspring again, and so on. First we have to look at which are the lucky individuals that get children. We might select these completely randomly, or we might go by how well they perform. And these parent individuals are now going to produce children. So this is our first step, selecting parents. The second step would be to have some kind of crossover. The parent individuals have some interesting times together, and they in some way mix their genetic material, and this might, for example, look like crossing over some of these values here. But this is a very general thing, so it could mean anything that in some way mixes this information, or it might not even mix the information. And so this is our second operation.
Finally, we might have some mutation. So we in some way perturb the genetic information a little bit, so that we get individuals that are not just an average of the two parent individuals but are in some way a little bit different, because we actually want to explore the search space. All of this has now produced child individuals that we add to our population. And we record which generation each individual belongs to, because it might be interesting later. Of these, we now evaluate the performance. So we have to apply the function in some way; we might just run our R function, but it could actually mean that someone somewhere runs an experiment. And finally, we have to choose which individuals survive. So maybe we kill the individuals with the worst performance; we might have some randomness in this process. We also record when individuals died. And this is our final step. Now we repeat this process: we generate new individuals and kill old ones, and do this again and again. And we hope that this process is in some way going to find regions of our search space where the performance is quite good. So how are we going to do this in R? In R we can use some packages, and I hope you consider using miesmuschel here. miesmuschel is a new package that is built on the mlr3 ecosystem, which you can Google, but you can also visit this link. mlr3 is mainly, or originally was, a machine learning ecosystem, but it also contains lots of optimization tooling, because optimization is very relevant for machine learning. It gives us some building blocks on which miesmuschel is built. miesmuschel is inspired by the ecr package by Jakob Bossek (on CRAN it's called ecr; on GitHub it's called ecr2). I've worked with this package before.
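The loop just described (initialize, select parents, mutate, evaluate, survive, repeat) can be sketched in a few lines of base R. This is a toy illustration of the idea, not miesmuschel code; the fitness function and all parameter values are made up.

```r
set.seed(3)
f <- function(p) -sum((p - c(1, -2))^2)        # toy fitness, maximum at (1, -2)
mu <- 20; lambda <- 40; sigma <- 0.5; gens <- 50

pop <- matrix(runif(mu * 2, -5, 5), ncol = 2)  # initialize population
fit <- apply(pop, 1, f)                        # evaluate fitness

for (g in seq_len(gens)) {
  idx       <- sample.int(mu, lambda, replace = TRUE)          # select parents
  offspring <- pop[idx, , drop = FALSE] +
               matrix(rnorm(lambda * 2, sd = sigma), ncol = 2) # Gaussian mutation
  all_pop   <- rbind(pop, offspring)                           # (mu + lambda) pool
  all_fit   <- c(fit, apply(offspring, 1, f))                  # evaluate offspring
  keep      <- order(all_fit, decreasing = TRUE)[seq_len(mu)]  # survival of fittest
  pop <- all_pop[keep, , drop = FALSE]
  fit <- all_fit[keep]
}
pop[1, ]   # best individual, which should land near the optimum (1, -2)
```

This is a (mu + lambda) strategy: parents compete with their offspring for survival, so the best solution found is never lost.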
I mean, it's a very good package, but miesmuschel is trying to expand on its possibilities, trying to go beyond what that package offers. Most interestingly, miesmuschel is based on R6, so most things you see in miesmuschel are R6 objects, which, if you've worked with R6 before, you know has some nice benefits. And miesmuschel gives you two ways of using it. You can use some pre-specified algorithms that are given by miesmuschel; these you just run and you get the result, which is nice. But you can also use the building blocks of miesmuschel themselves and basically build your own optimization algorithm. The main part of miesmuschel that you're going to be using are the operators, which I'm going to show now. Operators represent the individual steps that I showed before while presenting the optimization process. Operators are R6 objects which you get by calling accessor functions. These are quick-access functions that give you operators that you can use, and these operators work on data frames of individuals. So the tables you saw earlier, these are collections of individuals; these are just tables that you feed into operators, and they return either new individuals, which are just the modified ones, or, for the selection operators, the indices of the selected individuals. So if you wanted to use an operator, you could just get one using a quick-access function. You would have to tell this operator what search space it's actually working on, and then you could just call the operate function. For example, if you have a data frame with very round numbers and you do Gaussian mutation, you get a data frame that has slightly modified values, which is what mutation is supposed to do. And this is a nice picture of that. Most of the time you would be using functions that abbreviate all of this.
One quick word on some things that are given by the mlr3 ecosystem. bbotk is the name of the package which gives us some objects we need to operate on, for example the objective function, or the information about which individuals have been evaluated before and what their performance was. So how do we run miesmuschel? As I said, there are two ways. First, the complicated way, which means we basically do all these operations ourselves and build our own algorithm, but even that is quite easy. We select some operators, and all the steps that I showed before have a corresponding function which you can call. We prime the operators on the search space; we initialize the population; we generate offspring, which we get in an offspring variable; we evaluate the offspring, so we get the objective values; and we do survival, which kills off some of our individuals. And this we can just repeat over and over in a repeat loop until we get a termination error, which is just a signal that tells us we've optimized to the end and we're done. But there's also the quick and easy way of calling miesmuschel. If you have your operators defined, then you can just get an optimizer that uses all these operators, and this optimizer does the optimization loop for you. And we have a short form for that, which gives us this optimizer very quickly. We can even modify this optimizer in some ways or set hyperparameters. After we've got our bbotk optimization instance, we can just call optimize, and it does all of this for us, and we're done. One thing I'm going to skip over a bit is that these operators come in various shapes and forms and do different things, but there are also operators that combine other operators. So we could combine operators that then work on different parts of our search space.
So for example, if we have an operator that is supposed to be doing numeric mutation, by Gaussian perturbation, on the numeric parameters, but flipping bits on the logical parts of our search space, then we can use mutator operators that are built together from other components. And this is going to look very daunting the first time you see it, but actually what is being built is very logical and almost tangible: you have the feeling that you actually have things in your hand that you can work with. So I think it's very intuitive to work with this. And if we do this, we can actually put it into our optimizer, and it does the optimization for us even on search spaces that are not purely numeric. So where is all of this going, and what are the things that I'm happy about but by far don't have enough time to show? One thing that is already implemented is self-adaptation: the hyperparameters of the operators can change themselves during the optimization process. There's a very nice paper that actually introduces these mixed integer evolution strategies, where they use this self-adaptation, and you can actually do this in miesmuschel. Something else is that miesmuschel can do multi-fidelity optimization: we can selectively evaluate our performance measure, or fitness function, more closely for some individuals than for others. Finally, we also have multi-objective optimization. This is not on the main branch yet, but it's definitely possible, and it's being used in experiments where we have multiple performance measures that we try to optimize, and where we try to find individuals that in some way are somewhere on this front of possible performance values, but where we don't say a priori which performance measure is more important than the other. So as I already said, miesmuschel is very well documented.
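The combined-operator idea can be illustrated with a toy base-R function; this is our own sketch, not miesmuschel code. It dispatches by column type, applying Gaussian noise to numeric columns and bit flips to logical columns of a data frame of individuals.

```r
set.seed(4)
mutate_mixed <- function(pop, sigma = 0.1, p_flip = 0.2) {
  for (col in names(pop)) {
    if (is.numeric(pop[[col]])) {
      # Gaussian mutation on numeric parameters
      pop[[col]] <- pop[[col]] + rnorm(nrow(pop), sd = sigma)
    } else if (is.logical(pop[[col]])) {
      # bit-flip mutation on logical parameters, each flipped with prob p_flip
      pop[[col]] <- xor(pop[[col]], runif(nrow(pop)) < p_flip)
    }
  }
  pop
}

pop <- data.frame(x = c(0.5, 1.5), use_feature = c(TRUE, FALSE))
mutate_mixed(pop)   # numeric column jittered, logical column possibly flipped
```

In miesmuschel the same effect is achieved by composing operator objects per search-space component rather than by hand-written dispatch, but the principle is the same.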
I think it's very well tested, and it should be on CRAN soon. Currently I'm still working on the multi-objective part, and some parts of it are more likely to change than others, to be honest. The core MIES part, I think, works very well. The multi-fidelity part of it, for example, might still change a little bit, but I definitely think you can already use this in your experiments, and soon you should be able to just download it from CRAN. Currently you have to install it from GitHub using this command, which I hope I've inspired you to use now. I thank you very much for listening to my talk. Maybe if you're listening to this on YouTube later, you might actually already be installing it from CRAN. And I'm looking forward to getting lots of visitors on my GitHub page and lots of feedback from users who might want their own wishes fulfilled by miesmuschel. So thank you very much. Thank you, Martin. Just as a reminder, before we continue, you can pose your questions in our channel in Slack or using the Q&A function in this webinar. So, one of the questions: you said that it was inspired by biological evolution, but are there any applications outside of biology in which you have tried the package? So, we're currently using this package in our group's research. The nice thing about these evolution strategies is that, in our experience, they work quite well with search spaces that are complicated, that have a large number of dimensions, for example. I've used a very similar method, which is not implemented in miesmuschel yet but which actually prompted me to develop miesmuschel: I've used it for automated feature selection combined with hyperparameter optimization in machine learning.
This tended to work well when I used the selected features of the dataset that I'm doing machine learning on as the genotype, with the machine learning performance as the fitness that I'm trying to optimize. Very interesting. So we're going to give people a bit more time to pose questions, because we already have one question for the previous presenter, and so we'll leave the rest of your questions for the end of this session, if that's all right. Thank you so much again. And now we welcome Sevvandi Kandanaarachchi from RMIT University, who will be presenting on hunting anomalies down. About our speaker: Sevvandi is a lecturer and applied mathematician in the Mathematical Sciences Department at RMIT University. She has a PhD in mathematics from Monash University, Australia. Sevvandi uses statistics, mathematics, and machine learning to find unusual patterns in data. She also likes working on real-world problems, especially ones that are motivated by industry. From 2016 to 2019, she worked with an industry partner on intrusion detection. Please join me in welcoming Sevvandi. Hello, everyone, I'm Sevvandi, and I'm going to talk about anomalies. This is joint work with Rob Hyndman. So why are we interested in anomalies? We're interested because they tell us a different story. Think, for example, of the fraudulent credit card transactions among billions of legitimate transactions, or computer network intrusions, or astronomical anomalies like solar flares, or weather anomalies like tsunamis, or stock market anomalies: are they heralding a stock market crash? So there are all types of anomalies, and anomalies arise in different applications, right? So why are we interested in anomaly detection? Take credit card fraud or network intrusion as an example.
So suppose we train a model on certain types of fraud: these are the telltale signatures of this fraud, or these are the telltale signatures of cyber attacks, which are really anomalies, because there are millions or billions of legitimate transactions going on. But there can always be a new attack, a new fraud, and your model would not know it, because your model is looking for certain types of things, right? So what you want is to detect when really different things happen, when really weird things happen. You want to detect anomalous behavior, because this anomalous behavior might have some real meaning in it; it's telling us a different story, such as fraudulent credit card transactions or network intrusion. So anomaly detection is used a lot in these applications. So is everything done, is everything rosy? Well, hardly. There are some big challenges in this field, and one of the main ones is the high dimensionality of data. This is a challenge for lots of tasks in machine learning. When data is really high dimensional, finding anomalies is hard. Why? Anomalies look like normal points: you can't really distinguish them, because the distances between anomalies and other points are similar, the clustering is similar, the density is similar. So high-dimensional data is a problem, and not only here; it's a problem for many machine learning tasks. And there are other problems as well, such as high false positives. With anomalies, say, for example, credit card transactions or intrusions, you don't want your application or your underlying algorithm to be an alarm factory, right?
Think of a situation where you've got a camera outside your home, and it's taking video and sounds an alarm if a burglar comes in or something like that, right? But if the alarm goes off every night at 2 a.m. because there's a possum, or every time the wind blows, then it's just noise, right? You don't want that to happen. It just becomes an alarm factory, confidence in the system goes down, you're going to switch it off, and you're not going to take it seriously. You don't want that to happen, and that's why high false positives are a real problem. The other thing is that in many algorithms, parameters need to be defined by the user. That is again a problem, because some people are devising the algorithms, and the user doesn't really know what these parameters are doing inside the algorithm, but the user has to set parameters that are suitable for them. That is another problem. So, these are some of the challenges, and in today's talk, I'm going to talk about two R packages that we've built. One is dobin. dobin is a dimension reduction method suitable for outlier detection, right? So dobin addresses the high-dimensionality challenge. It's a little bit like PCA: it gives you a different set of basis vectors, like i, j, k or e1, e2, e3, but a set such that the outliers or the anomalies are highlighted. So, that's dobin, and the other one is lookout. lookout is an anomaly detection method that has low false positives, and the user doesn't need to specify parameters. Both packages are on CRAN.
To start off with dobin: this is a paper by Rob and me, published in JCGS, and that's its hex sticker. dobin is a pre-processing technique; it's not an anomaly detection method itself, right? What we're making sure is that the anomalies in the original space are still anomalies in the reduced-dimensional space. That's the key. So what does it do? It finds a set of new axes, or basis vectors, which preserves anomalies. In that sense, it's a little bit like PCA, right? You have your standard axes, and then it finds new ones: if you orient the axes this way, you're putting a spotlight on the anomalies, and by doing that, you can use fewer basis vectors for anomaly detection. The first basis vector is in the direction of most anomalousness; that is, we take the largest kNN distances and find the first basis vector from those. The second basis vector is in the direction of the second largest kNN distances, and so on. Just to give a little example, consider a uniform distribution in 20 dimensions, and say there's one point at 0.9 in all 20 dimensions, (0.9, 0.9, ..., 0.9). This is the outlier, because this point is far away from all other points. Now, if you do traditional PCA, that's where that point comes out: the point at (0.9, 0.9, ..., 0.9) just blends into the uniform distribution. But if you do dobin, that's where the point comes out: it lies really far away, using only two axes. So this is reducing the dimension from 20 to two. In R, you just use dobin like this: dobin(x), and that's the call. The next one is lookout. This is leave-one-out: we're using leave-one-out kernel density estimates for outlier detection, or anomaly detection. This is a preprint, again with Rob, and that's its hex sticker. lookout is an outlier detection method.
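To make the kNN-distance idea concrete, here is a loose Python sketch. It is not the published dobin algorithm, which constructs its basis more carefully from kNN pairs; it only shows how the largest kNN distances single out the (0.9, ..., 0.9) point even though no single coordinate is extreme, and how an orthonormal basis aimed at such points lets two axes expose the outlier:

```python
import numpy as np

rng = np.random.default_rng(7)

# 300 points uniform on [0, 1] in 20 dimensions, plus one outlier at
# (0.9, ..., 0.9), mirroring the example from the talk.
X = np.vstack([rng.uniform(0, 1, size=(300, 20)), np.full((1, 20), 0.9)])

def knn_distance(X, k=5):
    # Distance from every point to its k-th nearest neighbour.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d.sort(axis=1)
    return d[:, k]  # column 0 is the zero distance to itself

kdist = knn_distance(X)

# The appended outlier has an unusually large kNN distance.
print(kdist[-1] > np.median(kdist))

# Loose stand-in for the dobin construction: point raw directions at the
# most anomalous points, orthonormalise them, and project to 2-D.
center = X.mean(axis=0)
top = np.argsort(kdist)[::-1][:5]
basis, _ = np.linalg.qr((X[top] - center).T)  # 20 x 5 orthonormal basis
Z = (X - center) @ basis[:, :2]               # 20 dims down to 2
```

The actual package works from the ranked kNN difference vectors rather than from a center, but the effect is the same: the leading axes "look at" the anomalies, so you can keep only a few of them.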
The main thing about this method is that the false positives are low, because we use extreme value theory. That is the reason the false positives are low. Extreme value theory is used to model things like hundred-year floods, really extreme events, and in lookout we use a generalized Pareto distribution. So the plus point is that it's not an alarm factory. The other thing is that the user does not need to specify parameters. lookout uses kernel density estimates, and to use kernel density estimates, we need a bandwidth parameter. But the usual bandwidth, the one chosen to represent the data well, is not really good for anomaly detection. So what we do is select the bandwidth using topological data analysis methods, persistent homology in particular. That's what happens inside lookout. First, we find a bandwidth using TDA, topological data analysis, and then, using this bandwidth, we find the kernel density estimates. Using the kernel density estimates, we apply extreme value theory and then find the outliers; we model them using extreme value theory. That's what lookout does, and more details are available in the preprint. We also introduce something called anomaly persistence. Anomaly persistence asks which anomalies are consistently identified when we change the bandwidth, because the bandwidth is the key parameter. If you change the bandwidth, the anomalies that are consistently identified are the persisting anomalies. That gives an overall picture of the data set, one that is somewhat independent of the parameters you've chosen. So, let's look at a couple of examples. Here we have a two-dimensional normal distribution with a bunch of outliers, or anomalies, at the far end, and lookout identifies these anomalies.
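Before the examples, here is an illustrative Python sketch of that pipeline: leave-one-out kernel density estimates, then a generalized Pareto tail fit on the negative log-densities via peaks-over-threshold. It captures only the flavor of lookout, not the package itself; in particular, a hand-picked bandwidth stands in for the persistent-homology selection Sevvandi describes, and the threshold and cutoff values are illustrative:

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# 500 points from a 2-D normal plus five planted anomalies far away,
# roughly like the first example in the talk.
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(8.0, 0.3, size=(5, 2))])
n = len(X)

# Leave-one-out Gaussian KDE (the real package chooses h via
# topological data analysis; h = 1 is a hand-picked stand-in).
h = 1.0
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / (2.0 * h * h))
loo = (K.sum(axis=1) - 1.0) / (n - 1)  # drop each point's self-term

# Low density = extreme, so score by negative log-density and fit a
# generalized Pareto distribution to exceedances over a high threshold
# (peaks-over-threshold from extreme value theory).
score = -np.log(loo)
u = np.quantile(score, 0.90)
xi, _, sigma = genpareto.fit(score[score > u] - u, floc=0)

# Flag points whose exceedance probability under the fitted tail is tiny.
tail_prob = np.where(score > u,
                     genpareto.sf(score - u, xi, loc=0, scale=sigma),
                     1.0)
print(np.where(tail_prob < 0.05)[0])
```

Because the flagging is a tail probability rather than a raw density cutoff, the false-positive rate is controlled by the extreme value model, which is the point Sevvandi makes about not building an alarm factory.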
In the figure, they're shown in red, and this is the strength of the anomalies. So it identifies these anomalies with very high strength, but it also identifies the point here, which is slightly yellow, with low strength. The indices of the anomalies are 501 to 505; that is to say, in the data set, I've placed the anomalies at the very end. And this is the anomaly persistence diagram, right? In the anomaly persistence diagram, for low bandwidth values, there are many points that get identified as outliers, but as the bandwidth increases, only these points remain identified as outliers. And this is the bandwidth chosen by lookout, the one with the dashed line. Going on to the next example: here we have a bimodal distribution with some outliers in the middle, and again, lookout identifies them with high strength, but also identifies this point and that point as outliers with low strength; that is, the anomaly scores are low for those points. And again, the outliers are placed from 1,001 to 1,005, at the end of the data set, because this is a data set with 1,005 points, and these are the indices of the observations. We see again that these ones stay identified as anomalies as we keep increasing the bandwidth. For very small bandwidths, lots of anomalies are identified, but as we increase it, they drop off, and that's the bandwidth chosen by lookout. Another example: here we have three clusters, each of them normally distributed, with some outliers. lookout identifies those, and also identifies that point and that point as outliers. So the anomalies have these indices, and again, as always, I've placed the anomalies at the very end of the data set.
They're identified over a very long range of bandwidth values, and for small bandwidth values, there are lots. But you see these other points, like that point, which corresponds to this one: it is also identified as an outlier at the chosen bandwidth, but if you keep increasing the bandwidth past that point, it ceases to be an anomaly. And in this example, points are placed in an annulus, with 10 anomalies placed in the middle, which lookout identifies; that's why they're in red, and that's the strength. The dashed line is the bandwidth that lookout picks, but if you keep increasing the bandwidth, those points remain identified as anomalies over a large range of bandwidth values. We have this point as well, which corresponds to either this point here or that point there, identified with a low score. So there are some other points that are also identified, but with low strength. In summary, I've talked about two things. One is dobin, a pre-processing step, a dimension reduction method specially catered for outlier detection. And lookout is an extreme-value-theory-based method to find anomalies. Both the paper and the preprint are available, both packages are on CRAN, and thank you very much for listening. Thank you so much, Sevvandi. So another reminder that we have both the channel on Slack and the Q&A. We have one question in the Q&A from Andi: in many of the examples, the anomalies show up with high bandwidth. Why not just use a high bandwidth to detect them? Yeah, thanks, Andi. True, it's generally a high bandwidth, but there is kind of a Goldilocks range.
We start with very small bandwidths, where everything gets detected as an anomaly; then there's a range of nice bandwidth values where the actual anomalies get detected as anomalies; and after that value, nothing gets detected. So we need to pick that Goldilocks range. That's the reason lookout looks at the particular data and runs this topological data analysis procedure underneath, to find a generally decent bandwidth value, because if you just pick a massive one, you're not going to get anything. Does that answer the question? He said thanks. We have another question on the Slack channel from Geoffrey Hilton: what sorts of analysis are left to us following the anomaly detection, given the small sample size we are left with? Are there any quantitative methods available to us, or is this a case where a closer, more qualitative analysis is more practical? I'm sorry, Marcela, I can't see it; this is in the chat, right? Because it's really good if I can see the question. Oh, let me see. This is in the chat on Slack? It's on the channel in Slack, but I pasted it in the chat here. I see, okay. What sorts of analysis are left for us, given the small sample size we are left with? Right, so the thing is, anomaly detection is used to flag things. In a real scenario, say you are monitoring traffic in a computer network and you see a node behaving really anomalously. That might be because it's trying to hack, or it's been hacked; it's sending things everywhere, or it's doing something very different from what the others do. So in a real-world scenario, it's a flag and a warning, and it needs to be investigated, right?
Unless you know more about the anomalies, that is, not just from the anomaly detection itself but from background knowledge, contextual knowledge of the application, so that you can say, oh, that is an attacker, that's what it's doing, and bang, you go and stop it or block it really quickly. But generally you'd be in a scenario where you say, hmm, that's different, we should investigate that. That is the general scenario from the anomaly detection framework alone; with domain knowledge you can do more. For example, take a water quality sensor deployed in a river. If the anomalies are of a certain type, you might think, ooh, that is because the batteries are dying, that's the telltale signature of that kind of anomaly; or maybe the sensor, which was supposed to be beneath the water, has come out of the water because the water level has gone down, and then it's behaving really weirdly. So that's coming from the domain knowledge, but without any domain knowledge, it's difficult to go straight to an action plan. These methods are flagging which cases to look at next, because when there's so much data coming in, you need a short list: these are the weird ones, things like that. Is that okay, Geoffrey? I think, yes, thank you. So thank you so much, Sevvandi. I have one more question for you, but I'm going to save it for the panel. So can we have all the speakers from before? Ricardo, Shamindra, Martin. Great, thank you. So I'm going to give a couple more minutes for people to post their questions on the Slack channel or in the Q&A. I do have one question from the Slack channel for the first presenters.
I think they already answered on Slack, but I would like to share that question with the audience on YouTube. Jeremy Zella is asking if maars is able to tell him if a wrong model is used, like the RESET test and Rainbow test from the lmtest package. And I believe you have an answer, Ricardo or Shamindra? All right, thank you. So the answer is that at the moment, no, we don't have those tests, but we have some other tests for misspecification that have been presented in the two papers that we refer to. Those are still in an experimental phase, I guess, but we have three different tests for that purpose. But yeah, again, we do not offer the Rainbow and the other tests. Nice, great. And I would also like to repeat your invitation to contribute to your package on GitHub. Thank you. All right, thank you. There's another question for Sevvandi, going back to her presentation. Grace Ryan would like to know if lookout works well for disease surveillance, for example, daily, weekly, or monthly frequencies of a reportable disease: can you use the package to find excess counts? I haven't used it on disease surveillance at this point; both Rob and I haven't. Generally, we did it on IID data, general rectangular data. If it is a time series, then we need to take into account the autocorrelation, which we have not done at this point, but we are interested in doing that application, because that would be a nice next step. So we haven't done the time series aspect; it's just the IID case so far. Yeah, but that would be a really good application. Thanks. Great. Thank you. I do have one more question for Martin. Is the package part of your dissertation work, or is this a project parallel to your PhD work? And if it is, I'm going to ask the same question that I asked the first presenters.
How, as a PhD student, do you find the time to produce the package, and how do you manage that? Well, interestingly, the package wasn't planned to be part of my dissertation, but it's going to be now, because, as you already said, how do I find the time? This package grew and grew in scope: you work on some software and you're like, well, I want to make this more general, so it's generally applicable, and then you add this and add that, and suddenly you have worked on it for months. So this was supposed to be just a little tool, but now it's going to be in the dissertation in some way. I'm going to publish it, so it's going to be published software in that sense, and it's going to be part of it, yeah. And the way I find the time is by making it part of the other research things that I'm doing. So it's basically using the time that I'm supposed to work on something else by making this package part of it, and then I solve my other problem by making the package nicer. I guess that is a way to do it in a PhD. No, but it's great that you have the chance to get recognized for your work, because one big issue in software development, at least in academia, is that you sometimes don't get proper credit. Yeah, I think this is generally an important thing to make possible, and I'm very lucky to be in a research group where this is encouraged. And I think it's something good, because in the end, everybody wins when there's good software, right? Correct. Thank you, Martin. I think Sevvandi has a question for you. Yeah, I'm really interested in the work. So the package, miesmuschel: how is it different from existing EA packages? Because there are packages for genetic algorithms and optimization. Yeah, so I think the additional benefit here is that it's more flexible, in that it gives you these objects; it's built with R6.
And with these objects, you can specify very precisely what your operations are supposed to be: how to mutate, how to recombine. Each of these is given to you as a building block in an R6 object, so you can build your own algorithm very specifically around what you want to do. And in particular, this way of combining operators, which I had to jump over a bit, unfortunately, is very powerful, because you can have a search space where some of the parameters are categorical, others are numeric or maybe integer-valued, and you do different kinds of operations on the different parts of your search space. Obviously you could also write this in other packages, but they don't have it as a primary goal. We're trying to optimize problems with search spaces that you would have difficulties with in other packages, I think. Great. Thank you, Martin. I believe we don't have any other questions. But to all the presenters, thank you so much again for your participation, and please feel free to keep interacting with people in the Slack channel. The Slack channel will continue to be open for other sessions on this same topic today and tomorrow, and I've added your information as presenters and your packages. So hopefully there will be more interaction after people watch this video on YouTube. Thank you so much to all, and have a nice conference.