Thank you, and thank you for the introduction. Today I'm going to talk about a very recent project which we called xvec — you will soon understand why. It is mostly a talk about a fairly unusual data structure for geospatial data and the way we implemented it in Python on top of xarray. The data structure is called a vector data cube, and we see it as something that lives in between geospatial raster data and geospatial vector data; I will show examples of both and explain what I mean by that.

So what are spatial data? Let me try to zoom in. With vector data, we are essentially dealing with points, with lines, and with areas. You have specific coordinates, specific vertices, each encoded as, let's say, latitude and longitude. A point is easy — it's just one spot on the ground. A line is something like a street, or could be a flight path. An area would be, for example, the area of a country or some landmark. These are very precisely defined as vector data, and each geometry is uniquely defined, independent of every other geometry in your data structure.
The other side of the geospatial world is raster data, which is essentially a numpy array with values. A point would be just a single cell highlighted with a value, a line is represented as a contiguous series of cells, and an area is an extension of that.

In Python, raster data are very often handled by a library called xarray, which is mostly built on top of numpy. With one example from the xarray tutorial I can show you how that looks. When you load a Dataset — that's one of the two core structures, the second one being a DataArray; the Dataset is the bigger part, composed of DataArrays — we have dimensions: longitude, latitude, some level of height of the measurement in this case, and a time dimension. For every single one of them we have an array. This one is geopotential; you have some information about how it's measured, and it's a numpy array with plenty of values. Under the hood the data structure is essentially a set of numpy ndarrays aligned with pandas indices: you have your numpy array of values and your pandas index, and xarray very smartly combines them together. This is how one of the months at one of the levels looks on this specific dataset — we see data covering the whole globe, and you can faintly sense the continents here. This is geopotential; I don't know exactly what that means, that's far from my expertise — I'm talking about the data structures here, not about the measurements.

Vector data are very different. We use the GeoPandas library — we usually import geopandas as gpd — which is an extension of pandas. What you get when you read a vector dataset is your traditional pandas DataFrame with a special column with geometries; in this case we have polygons here. We can look at how that looks on a map: here we have the counties of the United States. Each polygon is represented as one row, and it has all its columns.
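The raster side described above can be sketched with plain xarray. Here a small synthetic Dataset stands in for the geopotential example (the talk used an xarray tutorial dataset; the variable and dimension names below are stand-ins chosen to mirror it):

```python
import numpy as np
import xarray as xr

# A tiny stand-in for the geopotential example: one variable ("z")
# indexed by month, level, latitude and longitude.
ds = xr.Dataset(
    {"z": (("month", "level", "latitude", "longitude"),
           np.random.default_rng(0).random((2, 3, 4, 8)))},
    coords={
        "month": [1, 7],
        "level": [200, 500, 850],
        "latitude": np.linspace(-90, 90, 4),
        "longitude": np.linspace(-180, 180, 8, endpoint=False),
    },
)

# Under the hood: numpy arrays of values aligned with pandas indices.
print(ds["z"].dims)            # ('month', 'level', 'latitude', 'longitude')
print(type(ds.indexes["latitude"]))  # a pandas Index
```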
So you always have data linked to individual polygons. However, you're limited to the pandas data structure here. In case you need multi-dimensional datasets that are actually linked to geometries, and not to raster cells as we normally handle with xarray, you're in trouble, because there was no smart way of doing that — until xvec came along, and until some changes in xarray itself happened.

A typical data cube is a raster data cube: an xarray DataArray which, in the geospatial world, is usually indexed by x and y, or latitude and longitude. By contrast, a vector data cube is an n-dimensional array where at least one dimension is indexed by an array of vector geometries. So instead of coordinates, we have an array of vector geometries as an index. That is again slightly different from what GeoPandas is doing, where the index is a traditional pandas index and the geometries are just one column — there you can switch it and use geometries as an index, but it is not the default.

We are currently at the state of xvec 0.1, so this all started as a proof of concept: whether we can actually do this with the current state of xarray and shapely, which are the two libraries combined together here. We were essentially trying to mimic what the R package stars is already doing in the R spatial world. So let me walk you through the data structure itself.
The simplest DataArray that can be considered a vector data cube — a vector data array — would be just an array of values indexed by an array of geometries, as simple as that. We have already loaded our counties — the GeoPandas GeoDataFrame of US counties with all those columns — and we can create a very simple DataArray where we take one of the columns, the population per county from the year 1960, and use the geometry column as coordinates. In this case it looks like this: we have our values here, and we have our counties here as polygons. Right now the index is a plain pandas index, and it doesn't give us any special support. However, when we import xvec, we get access to an xarray accessor, and we can use it to take that county index and convert it to a geometry index — the set_geom_indexes method — where we also specify the CRS, the coordinate reference system. The CRS tells us what the actual numbers in the coordinates of the individual points of those polygons mean on the ground, because it's not always latitude and longitude; it can be some local projection system. When we look at the index, we see that the county index is no longer a pandas index but a GeometryIndex with a specific projection in this case.
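A minimal sketch of this step, using a handful of stand-in point geometries instead of real county polygons (the xvec accessor call is shown as a comment, assuming xvec's documented `set_geom_indexes` API):

```python
import numpy as np
import shapely
import xarray as xr

# Stand-in geometries (the real example uses county polygons from GeoPandas).
counties = shapely.points([[0, 0], [1, 1], [2, 2]])
pop1960 = np.array([12_000, 45_000, 9_000])

# A 1D "vector data array": values indexed by an array of geometries.
da = xr.DataArray(pop1960, coords={"county": counties}, dims="county")

# At this point "county" is a plain pandas index. With xvec installed,
# the accessor upgrades it to a CRS-aware GeometryIndex:
#   import xvec
#   da = da.xvec.set_geom_indexes("county", crs=4326)

# Shapely 2.x geometries are hashable, so label-based lookup already works:
print(da.sel(county=shapely.Point(1, 1)).item())
```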
It is actually latitude and longitude here. But such a simple DataArray doesn't really make sense to use — you could just as easily use a GeoPandas GeoDataFrame for the same thing, because you are limited to two dimensions, and it would be much easier and much more comfortable for you. It becomes a bit more meaningful when we include, for example, a time dimension. We still use our counties example, and we take the population for four different periods, so we want to index the DataArray by geometry as before, but also by time. We can see that we have two dimensions here, and we have already set the geometry index directly. The data structure now looks like this: we have a 2D numpy array holding all the values, and that numpy array is indexed by county and by year, where the counties are our geometries and the years are integer values representing the different periods.

This is still fairly simple; it gets more interesting when we get more dimensions into a dataset. We can go directly from the DataArray to an xarray Dataset — sorry, making noise here — which means we include other variables. So far we had only population, so it was one numpy array, but we can also include other information we had in our original data frame: we can do the same thing as before with population, but also with unemployment data, divorce data, and age data. Then we have a Dataset indexed by two dimensions which itself has four different variables inside. This is roughly how it looks, and if you look at the indexes, one of them is a pandas index, which is the xarray default, and the second one is a GeometryIndex. I can skip this.

So why are we doing this? What is the purpose of indexing by something like geometry?
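The multi-variable cube can be sketched with the same stand-in geometries (the variable names are assumptions mirroring the county demographics mentioned in the talk):

```python
import numpy as np
import shapely
import xarray as xr

counties = shapely.points([[0, 0], [1, 1], [2, 2]])  # stand-in polygons
years = [1960, 1970, 1980, 1990]
rng = np.random.default_rng(0)

# A Dataset with four variables sharing the (county, year) dimensions —
# a 2D vector data cube with multiple variables inside.
cube = xr.Dataset(
    {name: (("county", "year"), rng.random((3, 4)))
     for name in ["population", "unemployment", "divorce", "age"]},
    coords={"county": counties, "year": years},
)
print(list(cube.data_vars))
```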
It allows us to work with the data using spatial operations — spatially enabled indexing and selecting, and more advanced spatial operations directly on the index of an xarray Dataset. Let's take another example. We will use the yellow taxi trip data as a data cube: a dataset representing individual trips in New York for, I think, one month or forty days — some arbitrary time series. For every trip we know the payment type, the date, the hour, and the origin and destination, where origins and destinations are polygons. Our origins and destinations look like this — individual zones in New York — and each trip starts at one of these and ends at one of these. That's a lot of dimensions to work with, and it would be very inefficient to create a data frame like this, because a pandas MultiIndex can only get you so far. It's useful to create a multi-dimensional array like xarray's, but if you are starting and ending at geometries, you would normally have to use, let's say, an integer index for each geometry, store the geometries somewhere else, and keep the two in sync, which can be troublesome.

An easy example is indexing by geometries, using a geometry as a label. We specify that we want only two destinations, using the two specific geometries — exactly the same way you would select with normal xarray. Now we see that we have only two destinations, so our dimensionality is reduced. But the interesting bits start right now: we can do a nearest join. You may know that xarray can do nearest selection, for example based on time; with xvec and data cubes indexed by geometry, you can do nearest selection in space.
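With a geometry-backed index, label selection is just the ordinary `.sel`. A sketch using stand-in zone points (with xvec's GeometryIndex the same call also accepts `method="nearest"` for spatial nearest-neighbour lookup, which a plain pandas object index cannot do):

```python
import numpy as np
import shapely
import xarray as xr

zones = shapely.points([[0, 0], [1, 0], [0, 1], [1, 1]])  # stand-in taxi zones
trips = xr.DataArray(
    np.arange(16).reshape(4, 4),
    coords={"origin": zones, "destination": zones},
    dims=("origin", "destination"),
)

# Select two destinations by geometry label, as with any xarray label.
subset = trips.sel(destination=[shapely.Point(0, 0), shapely.Point(1, 1)])
print(dict(subset.sizes))  # {'origin': 4, 'destination': 2}
```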
So it's the nearest point in space. We have two points for the origin, two points for the destination, some subset of date and time, and we're interested in the nearest locations. This is the final dataset — it was very simple to get a nearest join like this.

We can also extend the spatial queries from nearest to something else. If you use a scalar geometry — a single geometry — we can find intersections: we define a box, a rectangle, on top of a part of New York, and ask what intersects it. Since we were looking at origins only, we get all trips that start within one of the origin locations intersecting our specific location of interest. That could be a neighbourhood, or something you're interested in from a business perspective, like the vicinity of a shop you're trying to analyze. This is how the subset would look: you can see that we created a box, which was something like this, around here.

You can also do that with an array of geometries, not only a scalar geometry like that polygon. And you can use the spatial indexing capabilities of shapely, because shapely is the library taking care of the geometries we are using here, and we use all of its capabilities. We can use it to do all these spatial predicates, which means we can also specify that we want only origins that are within our geometries, so we would get only two of these based on the geometries from before. You can also specify whether you want all hits, because in some cases you may get the same origin location as an answer to the predicate — the check whether something is within something else — for the first geometry as well as for the second one. In some cases that's important; in other cases
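These predicate queries boil down to shapely's vectorized predicates applied to the geometry index (xvec wraps this in a query method on its accessor; here is the underlying idea, sketched with plain shapely and stand-in zones):

```python
import numpy as np
import shapely
import xarray as xr

zones = shapely.points([[0, 0], [1, 0], [0, 1], [3, 3]])  # stand-in zones
trips = xr.DataArray(np.arange(4.0), coords={"origin": zones}, dims="origin")

# A box of interest; keep only the origins intersecting it.
box = shapely.box(-0.5, -0.5, 1.5, 1.5)
mask = shapely.intersects(np.asarray(trips["origin"]), box)
subset = trips.isel(origin=np.flatnonzero(mask))
print(subset.sizes["origin"])  # 3 of the 4 origins fall inside the box
```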
you may want to filter that out, and you can deduplicate directly by specifying the unique keyword here. Or you can specify everything within a certain distance around points — let's say five thousand; I think this is feet, because I think the dataset is actually in feet. Again, we get a certain subset of the dataset, but what's interesting and important is that no matter which operation we do, we always keep the full dimensionality and all the indices the original data structure had. We are not reducing anything on the data structure side.

As I mentioned, when you create the index you assign a projection. This is probably way too technical, but we are borrowing what GeoPandas uses for its full projection management and applying exactly the same thing in xvec. You can see that there is a specific projection: this is the code which encodes how the values should be interpreted, and this is the detail of the actual projection of the dataset — we see that it is indeed in feet. I can skip this, and this, and this as well.

The issue with vector data cubes is that they are not a standardized data structure, which means it's currently pretty hard to serialize them — to save them to disk. There is some work ongoing across different languages; we are trying to find a standard that can do this without limiting how many dimensions can actually be indexed. But you will probably find yourself needing to convert the data cube, once you've analyzed it, to something you can save — something like a GeoPandas GeoDataFrame again.

Let's take another dataset as an example: traffic counts from Chicago. We again have origins and destinations.
We have trips between these polygons, with information on the mode of transport and on the date and time, so we have a couple of dimensions here. We start and end at these neighbourhoods — I think they're officially called communities for some reason.

xvec has two different ways of converting things to GeoPandas, mimicking what xarray does with pandas directly. If you select a subset — a slice of the DataArray — we can convert it to a GeoDataFrame, which in this case gives us the origin-destination data and the subset of origins we were interested in before. If you try to do the same with to_geopandas instead of to_geodataframe, it currently raises a ValueError, because following the xarray logic we would have to index the array on both sides — both the row index and the columns — using geometry, which is not supported on the pandas side and probably never will be. So we need to take some smarter subsets for that.

I'm going to skip this and talk about how we can actually represent a multi-dimensional cube within two dimensions, because there are two different options: one is called long form and the other is called wide form. If we use the demographic data of US counties from before, it looks something like this. Long form means that in the end you get a very, very long data frame, and you will see quite a lot of repetition: we have the year, and for every year we have all our data per individual county. The long-form data frame is indexed by years, but you will see that this polygon, this polygon, this polygon and this polygon are actually equal — it's the same polygon; there's duplication. We cannot really avoid it when we want the long form, because if you have more than one dimension in your data cube, something needs to repeat. In our case the years repeat again and again for every county, and within each year the counties repeat.
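The long form is essentially what xarray's own to_dataframe produces — xvec's to_geodataframe is the geometry-aware analogue. A sketch of the repetition it causes, with stand-in geometries:

```python
import numpy as np
import shapely
import xarray as xr

counties = shapely.points([[0, 0], [1, 1], [2, 2]])  # stand-in polygons
cube = xr.DataArray(
    np.arange(12).reshape(3, 4),
    coords={"county": counties, "year": [1960, 1970, 1980, 1990]},
    dims=("county", "year"),
    name="population",
)

# Long form: one row per (county, year) pair — every geometry is
# repeated once per year, values are never repeated.
long = cube.to_dataframe().reset_index()
print(len(long))  # 12 rows for 3 counties x 4 years
```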
There is a lot of repetition. We don't have any repetition on the value side, but we do have a lot of repetition in what used to be the index. The wide form partially eliminates this, because it ensures we have every polygon only once in the data frame; but you will see that the columns are now indexed by a pandas MultiIndex, and you have the repetition there instead. This data frame can become really, really wide, and if you have more than, let's say, two or maybe three dimensions, it can get really messy — you really don't want to work with something like that. So yes, there are still limitations in converting from one form to the other, especially if you want to save the data and read it, say, with a different language or in a different location afterwards.

Just a small disclaimer: even though we are trying to mimic what xarray currently does as much as we can, there are some differences, because geometry just starts limiting what we can do in certain cases.

The final example I will show is probably the most useful one currently, and that's a way of extracting point data from a spatial raster into a vector data cube that preserves the complete multi-dimensionality of the original raster. We had that example with the world data of geopotential I showed before — I think it was geopotential, yes. So you have a map of the world which looks something like this, and we are interested in the values at specific locations. What I'm showing here is a slice for only one month and only one level, but I would like to know everything this data array has for a specific location.
Let's say Prague. We can use the extract_points method, where we specify shapely points — shapely is the library dealing with our geometries. For our point locations I have created just random points — well, not random, but points spread across the world — and we can easily extract them, specifying that for the x coordinates of our points the linked dimension in the original dataset is longitude, and for y it's latitude. In the end we get the extracted Dataset, which is still a Dataset exactly like the original one, but now it's indexed by geometry on top of whatever it was indexed by before, and it contains only the subset based on the join. You can skip this and get to a map like this, where we can see exactly what the result is: you're able to get the information for every point, and for every point we have all the values you may be interested in — even here I'm showing only a slice. This is a typical use case where you would use a vector data cube in real life: you have your data as a multi-dimensional raster array and you want to bring it to points — or to countries, to cities, to continents, whatever aggregation you're really interested in.

You can also use shapely directly for geometric operations. If you want, for example, the area of every one of the polygons, you can directly call shapely on this, because shapely works on top of numpy arrays and xarray supports everything numpy-like, so this is very straightforward to work with.

We depend on fairly recent versions of both xarray and shapely, because older versions of xarray just didn't support custom indexes, and older versions of shapely cannot be used as an index because their geometries are not hashable. So all of this depends on very recent versions of everything. If you're interested in more, go to the documentation on Read the Docs. Most of
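At its core, extract_points is xarray's pointwise selection with `method="nearest"`, where the point coordinates share a new geometry dimension. A sketch of that underlying mechanism on a synthetic raster (the xvec call itself is shown as a comment, assuming its documented signature; the coordinates are illustrative):

```python
import numpy as np
import shapely
import xarray as xr

# Synthetic raster: one variable over (latitude, longitude).
ds = xr.Dataset(
    {"z": (("latitude", "longitude"), np.arange(32.0).reshape(4, 8))},
    coords={"latitude": np.linspace(-90, 90, 4),
            "longitude": np.linspace(-180, 180, 8, endpoint=False)},
)

# Two points of interest, as (lon, lat) — e.g. roughly Prague and New York.
points = shapely.points([[14.4, 50.1], [-74.0, 40.7]])
xs = xr.DataArray(shapely.get_x(points), dims="geometry")
ys = xr.DataArray(shapely.get_y(points), dims="geometry")

# Nearest-cell extraction; the result gains a "geometry" dimension while
# keeping every other dimension of the original dataset.
extracted = ds.sel(longitude=xs, latitude=ys, method="nearest")
# With xvec this is roughly:
#   ds.xvec.extract_points(points, x_coords="longitude", y_coords="latitude")
print(extracted["z"].dims)  # ('geometry',)
```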
the code I was showing is actually taken directly from the documentation, so you will find everything there with a bit more explanation than I gave you today. That's me — thank you.

Moderator: Thank you very much, Martin. Do we have any questions? We have a microphone there in case someone wants to ask something — don't be shy. No questions so far... and there is one. Yes?

Audience: Hello. How does it work with data larger than memory? For example, when you have millions of polygons and you want to load into memory only those that are actually needed for your spatial operation — do you support chunked data, or could it be supported, so that polygons that are close together in the spatial dimensions are also close together on disk?

Martin: At the moment it is not supported. It's actually a really complicated problem to do these kinds of things within the geospatial world. We are trying to implement out-of-core processing — just basic processing — using dask-geopandas, which is kind of a sibling of this project, and it's just not straightforward, because you want to make sure that everything is nicely organized in space, in memory, and on disk, and that's usually a very painful experience. So right now we do not support this. I assume that if you have a dask array within your xarray DataArray — so if the out-of-core part is the actual values, not the index, not the geometries — you should be able to work with it. I haven't tested that, but I don't see a reason why it shouldn't work. Equally, you should be able to work with values backed by sparse arrays, because the sparse part is the data, not the index. So to a degree you may find a way around it, but native support for everything you would expect from a traditional xarray backed by dask is not necessarily there right now. It's on the roadmap.

Moderator: Any other question? If not, let's thank Martin again. And remember, now we have a coffee break.