Hi there, my name is Clara and I'm here to talk to you about functional programming, parallelization, and spatial point pattern analysis. My project is a geospatial analysis of Airbnb in Singapore, and the data is taken from Inside Airbnb, from September 2020. For this project we're trying to determine whether Airbnb listings in Singapore are clustered, and where they occur, using K-function tests, split by room type and by region. Those are the packages that we use, and I really want to talk a little more about how we parallelized the K-test for larger data sets. So how does the K-test work? spatstat provides the function Kest to compute an unbiased estimate of K(r), and then it runs a series of Monte Carlo simulations using the envelope function and compares the observed estimate to the simulated envelope and the theoretical value. If the points are completely spatially random, meaning they have no relation to each other, the observed K-function will fall within the envelope. We reject that null hypothesis when the observed values fall above the upper bound of the envelope, which means the points are clustered; if they are competitive (inhibitory) processes, the observed values will fall below the lower bound. And this is what it looks like: the two graphs shown here show signs of clustering, while the graph on the bottom right shows complete spatial randomness, because the observed values stay within the envelope. Now, the issues: this becomes computationally intensive for large data sets, or for data with non-uniform observation windows. Additionally, there are no built-in parallelization methods in spatstat, which means you have to bring your own parallelization, and that is OS-dependent: for example, mclapply from the base parallel package works on macOS and Linux but not on Windows.
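As an illustration of the envelope test described above, here is a minimal sketch in R using spatstat's built-in redwood dataset (a classic clustered pattern) rather than the Airbnb data:

```r
library(spatstat)

# Monte Carlo envelope for the K-function under complete spatial randomness.
# nsim = 99 simulations gives a standard pointwise envelope test.
env <- envelope(redwood, fun = Kest, nsim = 99)

# For a clustered pattern like redwood, the observed K-curve tends to rise
# above the upper envelope, which is the signature of clustering.
plot(env)
```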
So what we did was split the analysis to make it less computationally intensive, and also create functions that benefit from functional programming, such as an envelope function and a plotting function. I wanted to go deeper into the parallelization, because I realized there were not many resources around this area. What we do in this case is parallelize the Monte Carlo simulation: split the number of simulations across different cores, and then combine the results back together. So first of all we set up the clusters; the code shown below does that, loading the doParallel and foreach libraries. Next, we need to distribute objects to each of those clusters so that they can read them and run the functions. We use 100 simulations, exporting a store of ppp objects, the number of clusters, and the number of simulations required, and then loading spatstat in its entirety on each of the six clusters (only three are shown here). Then we set up the helper functions for the parallelization: first, a function to replicate the ppp object for each core to use, and second, a function that runs an envelope simulation for one ppp object, which results in a list of envelopes.
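A minimal sketch of the cluster setup described above (object names like `ppp_list` are illustrative placeholders, not the talk's actual variable names):

```r
library(doParallel)
library(foreach)

ncores <- 6
nsim   <- 100

cl <- makeCluster(ncores)   # spin up the worker processes
registerDoParallel(cl)      # register them as the foreach backend

# Make the data and settings visible on every worker;
# ppp_list stands in for the store of ppp objects mentioned above.
clusterExport(cl, c("ppp_list", "nsim", "ncores"))

# Load spatstat on every worker so envelope() and Kest() are available there.
clusterEvalQ(cl, library(spatstat))
```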
So basically it says: replicate the ppp object across all the clusters, and for each of those objects, do the work in parallel. We set the seed so that we can replicate the run, then run the envelope function on each of them using spatstat's Kest, with the number of simulations per cluster being the total number of simulations divided by the number of clusters, rounded up to the nearest integer, and we make sure the simulated functions are saved into the envelope objects. Finally, there's a function to pool the envelope simulation results into one envelope, which we call with do.call on the pool function over the list x. We put them all together into one main function: if we just pass in the ppp object with the region name, split by the different room types, it will do all of that for each of them. So once we're done, run the function, and then stop the clusters once you're completely finished with them. What we find is that parallelization is about four times faster than running the simulations serially. We used microbenchmark and ran the simulation 10 times for each different region and expression, and these are the results that you can see here. An alternative to the K-test is the Kest.fft function, which analyzes large point patterns by discretizing the point pattern onto a rectangular raster and applying fast Fourier transform techniques to estimate the K-function. In essence, it does things in parallel by splitting the pairwise computations and then aggregating them again. Of course, this only applies to the K-function test, so hopefully the last few slides were not a complete waste of your time: if you need to run this on other functions, that is roughly what you would need to do to set up the parallelization yourself.
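Putting the steps above together, the parallel envelope run plus pooling might look like the following sketch (`parallel_envelope` is a hypothetical name, not the talk's actual function):

```r
library(spatstat)
library(doParallel)
library(foreach)

parallel_envelope <- function(ppp_obj, nsim = 100, ncores = 6) {
  cl <- makeCluster(ncores)
  registerDoParallel(cl)
  on.exit(stopCluster(cl))  # always stop the clusters when done

  # Split the total number of simulations across the cores, rounding up.
  nsim_per_core <- ceiling(nsim / ncores)

  envelopes <- foreach(i = seq_len(ncores), .packages = "spatstat") %dopar% {
    set.seed(i)  # seed each worker so the run is replicable
    # savefuns = TRUE keeps the simulated functions so envelopes can be pooled.
    envelope(ppp_obj, fun = Kest, nsim = nsim_per_core, savefuns = TRUE)
  }

  # Pool the per-core envelopes into a single Monte Carlo envelope.
  do.call(pool, envelopes)
}
```

For the FFT-based alternative mentioned above, the call is along the lines of `Kest.fft(ppp_obj, sigma)`, where `sigma` is a smoothing bandwidth.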
So this chart just shows the difference between running the fast Fourier transform version and a normal K-test on private rooms in the central region. The main takeaways from here are: 1. Learn parallelization so you can apply it to your problems and make your computation a lot faster. 2. Learn how to set up a VM with better hardware specs; that's what I had to do for the second half of my project, on geographically weighted regression. And 3. RTFM: there might actually be a function written specifically for your problem, and you don't know until you've gone through it. So that's basically it. Thank you very much, and if you want to see the code for some of those things, you can find it on RPubs, or go to the project page, where I've put most of the data wrangling and all the other parts of my project. Thank you.