Thank you. It's nice to see you here, and I want to show you something I think is important. It's more than only a business case, because most of the examples came first from NGOs and social problems, and then they were applied to business problems in e-commerce. So the topic of my presentation is spatial data: how to make it better and how to use it as a data scientist.

A few words about me. I love open source and open science, and I feel that we must share every piece of scientific research with everyone, and the same with software, up to some point; I understand that there are some business cases. I work as a data scientist and data engineer. Most of my time I spend cleaning data and wondering how to make it better to enhance my models, so that's why I think that data engineering is 90% of my job. I am also a member of a hackerspace. I like NGOs, I spend a lot of time in non-governmental organizations and I dedicate myself to society; I think it's very important. I also like hiking and b-boying, so when I stop coding I go somewhere to dance. And I love parks: if you have a park, you are my friend.

What will it be about? First I will talk about spatial data, because I'm not sure how many of you know something about it, and for data scientists it could be useful from a general perspective. Then we will gradually go to harder topics. The first one is spatial similarity: how similar is our neighborhood to our place?
Then we will go to interpolation, which is something more than filling missing values with the average or median. And at the end we will talk about public data enhancement. It's an important topic, because we have a lot of public data, a lot of data from governments, but its quality is very weak and it's hard to merge it with our data from other sources.

Spatial data. We have three main types of spatial data. One is the point: points are coordinates in space, usually described by longitude and latitude, or X and Y. Then we have lines. Lines can be physical things like roads and railways, or lines can represent some movement in space, so some temporal events. Then we have polygons, and we have a lot of data that is aggregated over polygons. We know that every country has some median income, and we can look at it on nice visualizations, but how can we use it? That's a harder problem. I think that spatial data also includes addresses: if we want to describe that something is at this street, it's also a kind of spatial data. We also work a lot with time series, with networks and network analysis, and with images, images that come from satellite observations.

There are two main packages that you must know if you work with spatial data in Python. Which one will you choose? It depends on your use case. If you are working with coordinates and data that is described by points, then you will use GeoPandas. It's the same as pandas, but it adds an additional column with geometry. So if you know time series from pandas, where the index is a time series, in this case you have an additional dimension, and it's the geometry. But the geometry, as you recall, can be points, lines or polygons.
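To illustrate the idea of a table with one extra geometry entry per row, here is a minimal pure-Python sketch using GeoJSON-style dictionaries. It does not use GeoPandas itself; the row names, the coordinates, and the helper functions `positions` and `bounding_box` are all hypothetical and chosen only for illustration.

```python
# Each row has ordinary attributes plus one "geometry" entry (GeoJSON-style).
# All names and coordinates below are made up for illustration.
rows = [
    {"name": "air quality station",
     "geometry": {"type": "Point", "coordinates": [21.01, 52.23]}},
    {"name": "railway section",
     "geometry": {"type": "LineString",
                  "coordinates": [[21.00, 52.22], [21.05, 52.25]]}},
    {"name": "district",
     "geometry": {"type": "Polygon",
                  # One outer ring; the first and last positions are identical.
                  "coordinates": [[[20.9, 52.1], [21.2, 52.1], [21.2, 52.4],
                                   [20.9, 52.4], [20.9, 52.1]]]}},
]


def positions(coordinates):
    """Yield every (x, y) position, whatever the nesting depth of the geometry."""
    if isinstance(coordinates[0], (int, float)):
        yield tuple(coordinates)
    else:
        for part in coordinates:
            yield from positions(part)


def bounding_box(geometry):
    """Return (min_x, min_y, max_x, max_y) for a point, line or polygon."""
    xs, ys = zip(*positions(geometry["coordinates"]))
    return (min(xs), min(ys), max(xs), max(ys))


for row in rows:
    print(row["name"], row["geometry"]["type"], bounding_box(row["geometry"]))
```

The same function has to cope with three different nesting depths, which is one reason a geometry column is harder to work with than a plain numeric or time index.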
So it's a rather complex thing. The second package is rasterio, and I personally started my journey with geospatial data from this package. It is used for image processing, for satellite image processing, because images from satellites are not the same as those taken by an iPhone. Those images have additional metadata and coordinates that you must process and have control over.

The first concept: spatial similarity. There is a theory, or a law, that everything is related to everything else, but near things are more related than distant things. Nowadays it shouldn't bother you too much, because it's not true for every case. But in the case of spatial data we are looking for those things that are related to their closest neighbors, so it's very important for us.

Here we have an example from industry. It's the mining industry, also a topic from today, because it's about deep-sea mining. It's very hard to get a lot of samples from the ocean bottom. We can only get a few hundred, maybe a thousand, and that's all. If we apply a machine learning model, it won't work, because there are too few samples to work with. So we must interpolate how some process or some materials may be placed over the surface. That's why we use spatial interpolation. And first we must check how similar or dissimilar our objects are to their neighbors. This is a very simple graph, a very simple plot, but it has a lot of information about some process.
It could be the density of gold samples at some place, or it could be customer spending in some district, and we analyze how similar our neighbors are to each other. If we see this curve, it's very nice, because we can interpolate missing values, missing places, and know what the spending is in the neighborhoods that we didn't sample before.

On the x-axis we have distance. The y-axis is a parameter named semivariance, and it's a dissimilarity. It's the reverse of similarity, but if you start working with it you will understand it. At the beginning it could be tricky, because usually we don't measure negative things. But what you see is that at the beginning, at short distances, this dissimilarity is very low. That means that neighbors are similar, neighbors' spending is similar. And at some point, here it's about 25 kilometers, similarity disappears and dissimilarity is maximized. So from this point our model is not able to interpolate values. I will show you a real-world case scenario; it will be easier to understand.

Okay, so here is the use case, and it is a use case from last year, when I was developing a dynamic pricing model. We were interested in how the proximity of hotels affects the prices for this dynamic pricing model, and we wanted to know up to how many kilometers we should include neighbors to control our price. That's why we performed this spatial similarity analysis. We built this semivariogram and we checked that, okay, up to a distance of 10 kilometers we can assume that renting prices are similar. So that's why we included it in our optimization model.

But everything depends on the scale. If I change the scale, then I can see a completely different graph. And if I take the wrong scale, in this case too big, there is no similarity at all: it starts from a high dissimilarity and then it's lower and lower. So what does that mean?
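To make the distance-versus-semivariance curve concrete, here is a minimal pure-Python sketch of an experimental semivariogram. It is not the code from the talk: the function name `experimental_semivariogram`, the parameters `step_size` and `max_range`, and the four sample points are assumptions made for illustration, and the coordinates are assumed to be planar (for example kilometers), not longitude and latitude.

```python
import math
from collections import defaultdict


def experimental_semivariogram(points, step_size, max_range):
    """Compute semivariance per distance bin.

    points: list of (x, y, value) tuples in planar coordinates.
    Returns a list of (bin_center, semivariance, pair_count), sorted by distance.
    """
    squared_diff_sums = defaultdict(float)
    pair_counts = defaultdict(int)

    # Visit every unordered pair of points once.
    for i in range(len(points)):
        xi, yi, vi = points[i]
        for j in range(i + 1, len(points)):
            xj, yj, vj = points[j]
            distance = math.hypot(xi - xj, yi - yj)
            if distance >= max_range:
                continue  # pairs beyond the analysed scale are ignored
            lag = int(distance // step_size)  # index of the distance bin
            squared_diff_sums[lag] += (vi - vj) ** 2
            pair_counts[lag] += 1

    # Semivariance is half of the mean squared difference within each bin.
    return [
        ((lag + 0.5) * step_size,
         squared_diff_sums[lag] / (2 * pair_counts[lag]),
         pair_counts[lag])
        for lag in sorted(pair_counts)
    ]


# Hypothetical samples on a line, where the value grows with position.
samples = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (2.0, 0.0, 2.0), (3.0, 0.0, 3.0)]
for center, semivariance, count in experimental_semivariogram(samples, 1.0, 3.0):
    print(center, semivariance, count)
```

In this sketch, changing `step_size` and `max_range` is what changing the scale amounts to: the same points can give a very different curve depending on how the distances are binned and where the analysis is cut off.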
If we get those results, we should move away from spatial analysis, because it doesn't explain what's going on. And to get this is very easy, because it's only a few lines of code. But you must know what you are doing. So you build the experimental semivariogram, and you want to check the distance at which this similarity occurs between neighbors.

The next step: if you have built this semivariogram, you can go further and interpolate results. And this is another case. In Poland you have only a few hundred stations that measure air quality, and if you want to present a map like this on the web, it wouldn't be very pleasing for viewers. So you can interpolate values. This is one task. The second is that if you have a limited number of requests to an API, you could use this interpolation to build a map for the whole region, but with only a few points for the requests. So it's also savings.

One should ask: why don't we use inverse distance weighting? It's built into many packages, also in cloud services, and personally I use it a lot with Google Cloud. But for some cases it's not good, because you don't know how big the power should be. There's only one parameter, the power; you use it and you don't know how big it should be. Usually the power is set to two, but that is only true for small regions. For a country-sized region it's not true, and you should change the power locally. That's why you use kriging: it changes the power locally. It's like local histograms, and it checks how big this parameter should be there. Also, if you use kriging, this spatial interpolation technique, you get a kind of uncertainty map, and you can check where you should take more samples to have better results.

My first encounter with this technique was many years ago. Today I heard something about government, and I can tell a different story: many years ago, the first time I met kriging, I heard that some people were just drawing lines that link the points, and that's it.
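The inverse distance weighting discussed above, with its single power parameter, can be sketched in a few lines. This is an illustrative pure-Python version, not the implementation of any particular package or cloud service; the function name `idw` and the two station readings are hypothetical.

```python
import math


def idw(known, x, y, power=2.0):
    """Inverse distance weighting estimate at (x, y).

    known: list of (x, y, value) tuples in planar coordinates.
    power: the single global parameter; larger values give near points more weight.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for kx, ky, value in known:
        distance = math.hypot(x - kx, y - ky)
        if distance == 0.0:
            return value  # exactly on a sampled point: return its reading
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Two hypothetical stations 10 km apart with readings 10 and 20.
stations = [(0.0, 0.0, 10.0), (10.0, 0.0, 20.0)]
for power in (1.0, 2.0, 4.0):
    print(power, idw(stations, 2.0, 0.0, power=power))
```

The same location gets a different estimate for every power, and nothing in the method itself says which power is right for a given region. That is the limitation that motivates fitting a semivariogram and using kriging instead.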
That was the interpolation for the public tenders. Nowadays it's very easy to just use Python packages to do it, and it will be really much better, and you can tell where those lines come from.

There are a few steps. You just load the data set and check the experimental semivariogram to see if there is any spatial structure, spatial similarity. Then you build a model. You load the canvas, the map that you want to interpolate, and then you may check error distributions, or do some cross-validation, or go with another model, or link it to some machine learning model.

This is the second use case. As I said, you can interpolate missing points to reduce the number of API requests, and I use it for weather readings from OpenWeather APIs in small projects. The code is a little bit longer. And what's very important here: it should be checked by a person. It's not an automatic procedure, it is semi-supervised. The point where people should check it is when you create the experimental semivariogram: you must check if there is spatial correlation.

The last use case, and it's the most complex, is to transform public data sets that are aggregated over big areas. Here we have, as I recall, breast cancer rates in the United States. There are a few problems with those maps, with this map on the left. If you look at it, you automatically think that the most important areas are those that are the biggest, and you should get rid of this visual bias, because it can affect public spending. It's very important to get rid of it. The other thing is that not every area here is populated. You see it on the map on the right: there are some white points, and that means that there are no people there who could get this cancer.

And here is another case of transformation. We have some data, also from public health, about Lyme borreliosis infections in Poland, and it was aggregated over counties. It wasn't possible to get the exact points where people live.
That's for people's privacy; privacy is the most important concern here. But we wanted to know which areas are especially risky for people. We could first build a model to assess tick occurrence, and we used satellite imagery, where one pixel was one kilometer by one kilometer. The smallest county in Poland has a few kilometers, but the biggest one has more than a few thousand kilometers. So it doesn't fit: you couldn't just put counties into satellite pixels and build a model on it. That's why it needs a transformation. It's more complex, but it's fully automated in the packages that you can use.

What was the reason for building it like that? Because we assumed that this functionality would be used mostly by social scientists, and they don't check this experimental semivariogram. They are not interested in things that are going on under the hood. If someone knows the topic better, then okay, it can be checked, but for regular people from the social sciences it could be very, very difficult.

As before, we build semivariograms, but it is a long operation; it takes a lot of time to do this. It's divided into two parts, because it's an iterative process, and I think that the package needs help here from someone who is good with parallel processing, because with those skills it could be sped up many, many times.

But what is done here? We take the spatial similarity of the centroids of the specific areas where we have aggregated values. Then we check the population, how the population is distributed over the country, and check the semivariogram of this population. After some mathematical computations we try to fit this population semivariogram to the semivariogram of those areal aggregates, and then we have something completely different: we can go deeper, up to the specific points where the population is grouped.

And here is the crime rates derivation.
I've cleaned this only to show the most volatile places in Poland, but you get many, many of those points. It only depends on how big the area of the population units is. And the code is much longer if you want to do it by hand.

Okay, so for me that's all. I shared this presentation with you on this course, and if you have any questions, please feel free.