Hello everyone. My name is Neil Vaytet and I'm a scientific software developer at the European Spallation Source in Denmark and Sweden. I do Python for scientific data analysis and visualization, and I'm going to talk to you today about our project, Scipp. We pronounce it "skip"; you can come and ask me why after the talk. It's about multi-dimensional arrays with labeled dimensions and physical units. Just a shout-out to my awesome team: Simon, Jan-Lukas and Sunyoung.

I'm aware that some people in the audience can't really see the bottom part of the screen, so I'm going to try and keep what I do towards the top, but just let me know if you can't see. I can also make it bigger. I'm basically going to do this as a demo in a Jupyter notebook. I just have a bunch of imports and I'm defining myself a few useful plotting functions, but that's for later.

So, labeled dimensions: why do we need them? Say I have a rectangular NumPy array which has a shape of 10 by 20, and it might look something like this, and I would like to slice out row number four. I look at the shape of my array and I know that this is the dimension that has only ten elements, so I have to slice the first index. Which is fine, it gives me what I want. However, you can't always deduce this from the shape. Say now I have something that's square; it looks like this. Now do I remember which one it was? Was it the first index or was it the second index?
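The ambiguity described here can be reproduced in a few lines of plain NumPy. This is only a sketch with made-up array contents, not the notebook shown in the talk; the variable names are chosen for illustration.

```python
import numpy as np

# Rectangular case: the shape tells us which axis has 10 entries.
rect = np.arange(10 * 20).reshape(10, 20)
row4 = rect[4]            # index the axis of length 10
print(row4.shape)         # (20,)

# Square case: both axes have length 10, so the shape no longer helps.
square = np.arange(10 * 10).reshape(10, 10)
by_first = square[4]      # index 4 along the first axis
by_second = square[:, 4]  # index 4 along the second axis

# Same shape, different values: nothing warns us if we picked the wrong axis.
print(by_first.shape == by_second.shape)    # True
print(np.array_equal(by_first, by_second))  # False
```

Both results are valid arrays of the same shape, so a mistake here does not raise an error; it just produces the wrong numbers.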
And obviously, you're going to get very different answers if you get it wrong. It gets even worse when you have more dimensions. Say I have four dimensions, x, y, z and time, in that order, and I want to get the first z slice. Which one is it? Do you remember? Is it colon, colon, zero, or is it zero...? Hands up who has never struggled with this while using NumPy. If you put your hand up, I would say you were lying.

So, labeled dimensions. There's this really cool project called xarray; if you haven't heard of it, go and check it out. They introduced labeled dimensions to multi-dimensional NumPy arrays, and from their documentation: "Real-world datasets are usually more than just raw numbers; they have labels which encode information about how the array values map to locations in space, time, etc." What we have done in the Scipp project is embrace, and to a large extent copy, the xarray mechanism.

How this works: sc is for Scipp. We create a Scipp array by giving it the NumPy array we had above, but now we also give it a list of dimensions, some strings that label each dimension that we have. We have some fancy HTML representations for Jupyter notebooks, and you can see that the labels for the dimensions and the sizes are here, and then we have the values. Now when I want to get the z slice, all I need to do is give it the z label and then the index. Compared to colon, colon, colon and zero, this is really nice and easy. But most importantly, and that's a point that is often forgotten, it makes your code extremely readable. If I go back to my code a month or two later and I look at this, I can see:
"Oh yeah, I was trying to slice the z dimension." Or if somebody else looks at your code. I think that's really important.

This is also something xarray and Scipp both have: you can add coordinates. You can have coordinates on each of the dimensions of your array, and they basically describe the extent of each axis, or how far every data point is from its neighbors. I have some visual representations for this. Say I have a two-dimensional array, maybe representing the air temperature above a city at different altitudes and as a function of year. So that's your dense two-dimensional array. In Scipp and xarray, coordinates are added in a structure called a data array. You feed it your data variable, and then you give it a dictionary of coordinates saying the years are from 2015 to 2023 and the altitude is from 0 to 8,000 meters. So effectively what you're doing is this: you're adding coordinates to your data. You can also look at the HTML representation: you have the original data that we had, and then you have a list of coordinates, like altitude, in here. Good.

Now I want to talk about what we've added on top of this in the Scipp project, and the first one is physical units. Every data variable and coordinate in Scipp has physical units, and it was very important for us to have this embedded from the start. There are other Python projects that do this: there's Pint and astropy.units, which are just for the units, and there's a pint-xarray project that tries to incorporate this in xarray. But we needed to have this baked in from the beginning. I'll just give you an example, and maybe I'll also plot this.

When you look at the representation, you can see that my x and y coordinates both have units of centimeters. Think of it as a detector panel where I'm imaging some counts coming in. My data has units of counts. We can just plot this and it
automatically labels the axes. Now say I also have an integration time: I know for how long I was counting when I was recording, say 300 seconds. So I divide my image by the integration time, and now the unit is counts per second, automatically. The library just does this for you. We can do pretty much any combination of units. You also see that the values have changed, so my image has been normalized. This is really useful if you're dealing with physics and you can't remember if your energy was per unit volume or something like that. You can see it just by looking at the unit of your variables.

There's an added bonus, which is that the units also provide protection. Say now I have a background image, like a dark frame, which I want to subtract from the signal image above, but I forgot to first normalize it by integration time. So I have my background, which has units of counts, my image above had units of counts per second, and now the library is telling me: you can't do this. I first have to divide my background by its integration time, and then I can do the subtraction here.

So units are extremely useful for catching difficult-to-spot bugs early. If you have a very long Python script, normally you arrive at the end and you don't really understand why the units don't match or what went wrong. This will catch it really early, so they save hours, and I mean hours, of debugging time. I think it's also very important that they free up a lot of mental capacity for the user. You don't have to remember "did I divide by area or volume?" or something like that. It just lets you focus on the important thing, which is doing the science that you want to do. Just as a side note:
We can also use units for what you'd call label-based indexing, if you know xarray. Say I want the slice at 0.5 centimeters and I don't know the number of the index. I can just say slice x at 0.5 centimeters, and it'll find the correct slice. That's also nice. This is something you can do with xarray, but it's a really nice way to do it with the units.

The second thing we have in addition is what we call bin-edge coordinates. It's sometimes necessary to have coordinates that represent a range for each data value. Say the temperature was 310 kelvin between 10 and 20 seconds. It's not a given point in time; you have a range over which that data value was valid. This is also what you have every time you histogram data. Just like in my image above, where we did some histogramming of counts: the counts are this much between 0.1 and 0.2 centimeters, or something like that. Scipp supports this by having bin-edge coordinates, which is a coordinate that has a length of one more than the dimension of your data. In my little representation here, I have an 8 by 8 image and my coordinates have a length of nine on each side. You can see in the representation that these are marked as bin-edge, so you can tell when you have bin-edge coordinates. This is like the image I had above, but I've histogrammed it into 8 by 8 bins. You've probably used histogramming with NumPy or Matplotlib, and they return the edges and the data separately, in a tuple. We have everything inside a single data structure.

Now, this addition has actually allowed us to create something which is, I think, one of the most powerful features of Scipp, and this is the third part of my talk. We call this binned data.
Scipp distinguishes between histogrammed data and binned data. Histogrammed data is the regular dense array you get when you've collected all your counts and then done the sum, so you have a value of three between zero and one, seven between one and two, and so on. Binned data refers to the precursor of histogrammed data: you have a list of bins, and each one of them contains a list of records. You can of course convert from one to the other by summing all the data inside each bin, but there is a loss of information there, and you can actually do some cool things if you keep this structure. If you know a little about Awkward Array, it's conceptually similar to a multi-dimensional Awkward Array.

To best illustrate this, I'll do a little example of data analysis, and for that I'm going to use the New York yellow taxi dataset. If you haven't heard of it, it's quite a famous dataset in data analysis. It's basically a really long table of data on New York taxi trips: you have a pickup datetime, drop-off datetime, how many passengers, the distance, pickup latitude and longitude.
So this, for example, is an image made from the histogram of pickup latitude and longitude. You can see Manhattan, and here you have JFK airport. I've got my dataset from the Vaex documentation, which is quite a nice project for data analysis as well; go and check it out. I'm only going to load a subset of this, otherwise my laptop is going to cry or scream at me. I have loaded the latitude and longitude of drop-offs, so where people were dropped off by the taxi, the trip distance, the hour of the day, and how much they paid for the trip. I have 71 million rows in my table, which is about 3.2 gigabytes in memory. If we have a quick look at this data, I'm plotting one in a thousand points, because 71 million scatter points in Matplotlib is still quite difficult. But you can see Manhattan, and if you zoom in here you start seeing individual streets. So there's a lot of data in here.

Now I'm going to show you what you can do if you bin the data into records. Working with binned data is most efficient when you keep the number of bins relatively low. You can have a lot of bins, but it's most efficient when you keep the number low. Binning is essentially like overlaying a grid of bin edges onto our data. So this is kind of what we're doing: we're keeping the underlying data, but we're overlaying a grid onto it. You can do this with any kind of data which is scattered. For example, there was a talk yesterday about cosmological simulations that use particles, and you could apply this to that, grouping your particles. The way to do this is very simple in Scipp. I've got my original data array, I do da.bin, and I say I want eight bins in latitude and eight bins in longitude. It takes about a second, and now I have my binned data structure.
I have eight bins in latitude and longitude, and my data has kind of a weird type: it's a DataArrayView. What it's telling me is that it's like a view onto my original array, split into different bins. The first bin has 65,000 records in it, the second one has 50,000 records, and so on. If I naively just histogram this, I'm going to get a very pixelated 8 by 8 image of Manhattan. It's just not very useful. But binning only groups the data; it actually just reorders it. You don't lose any information, it's simply reordered. So the bins can then be used for very efficient slicing or filtering.

For example, I want to select a bin in Manhattan. I take the first one in longitude and the fourth in latitude, so I'm going to be up here, which is probably this one. You just use the same slicing we did before with the z dimension: longitude, the first one; latitude, the fourth. Now I have something like this, where I have 770 megabytes out of 3.2 gigabytes, and about 17 million records in that bin. Because I haven't lost any information, I can histogram it at a much higher resolution, and now you can see that you have all the data in there. So it's really useful for working on subsets of your data. I'm going to select another bin, which contains JFK airport, and you see there are some hotspots here. If you look at the map, you can see these are the different terminals at the airport. I'm not sure why people are being dropped off on the highway, but I'm guessing that's probably inaccuracies in the GPS positions recorded by the taxis.

Now, I've selected a single bin, but once I've done this, what you can do after that is bin this into a new dimension. So let's go back to my Manhattan bin.
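The idea that binning only reorders the records, and that this makes selecting a bin cheap, can be illustrated in plain NumPy. This is a conceptual toy in one dimension with made-up data; it is not how Scipp implements binning internally, and the names `offsets` and `select_bin` are hypothetical helpers introduced only for this sketch.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 1000
x = rng.uniform(0.0, 8.0, n)      # column used for binning
fare = rng.uniform(0.0, 50.0, n)  # another column carried along

edges = np.linspace(0.0, 8.0, 9)  # 8 bins need 9 edges
# Bin index of every record; the clip keeps any value equal to the last edge in the last bin.
bin_index = np.clip(np.searchsorted(edges, x, side='right') - 1, 0, len(edges) - 2)

# Reorder all columns so that records of the same bin are contiguous.
order = np.argsort(bin_index, kind='stable')
x_sorted, fare_sorted = x[order], fare[order]

# offsets[i]:offsets[i + 1] is the slice of records that belong to bin i.
counts = np.bincount(bin_index, minlength=len(edges) - 1)
offsets = np.concatenate(([0], np.cumsum(counts)))

def select_bin(i):
    """Return views of the records in bin i, without searching or copying."""
    sl = slice(offsets[i], offsets[i + 1])
    return x_sorted[sl], fare_sorted[sl]

# Select one coarse bin, then histogram its records at a much finer resolution.
bx, bfare = select_bin(3)
fine_hist, fine_edges = np.histogram(bx, bins=50, range=(edges[3], edges[4]))
```

Nothing is summed or discarded during the reordering, which is why the selected bin can still be histogrammed at a finer resolution or binned again along another column.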
I have a single bin which has 17 million records, but if I look inside it I can see that I still have all the information on fare amount, trip distance, latitude, longitude and all this. So if I want to look at the trip distances inside the Manhattan and JFK bins I've selected above, I take the bin that I've sliced out and I make a hundred trip-distance bins. Now I have a dimension of length 100 in trip distance, and I can plot this. I can see that most of the trips in Manhattan are short-distance trips, less than five or ten miles. If you do the same with JFK, you can see that people who go to the airport usually do a longer trip.

I'm not saying this is the only way to do it. You can do this with pandas, you can do this with xarray. But the ways I've found to do it with pandas are usually not as simple; I think the syntax we have is really nice. They also tend to consume more memory, especially if you then bin a second time into something. Reordering the data is what makes this really efficient.

You can also do other things with bins. It's a little bit like using something like groupby: you don't always have to just sum the things you have in your bins. You can also do other reductions like min, max or mean. So I have a little question.
I would like to know what the fare amount is as a function of distance. I'm going to go back to the data I had from Manhattan, which has a hundred bins in trip distance. Once again, if I look inside it, I know that I still have all the information on the fare amount or the hour of the day. To get the minimum and maximum fares for all trips that are inside our Manhattan area (this is my data array here), you do .bins.coords and then the min and the max, and this will give you the min and the max of the fare amounts for all the trips inside that bin. The first thing you see is that the minimum is minus 242 dollars, which is a bit weird, and the maximum is 7,000 dollars, which seems a bit excessive. These values are a bit strange, maybe indicative of bad data in the table, so I'm going to restrict the range to between 0 and 200 dollars.

You don't have to specify bins with a number of bins. Just like in NumPy, you can directly specify the bin edges that you want, so I'm doing a linspace between 0 and 200. This had one dimension here, and now I'm making a new dimension with a hundred points, so I'll get something that's two-dimensional, and it looks like this. I have the fare amount on the y-axis as a function of trip distance, and there are a few things we can say about the data. The first one is that you have this diagonal line, which you kind of expect: the further you go, the more you pay. Makes sense. The other thing you see is that people mostly pay above the line but not really below it, apart from maybe here at the bottom; some people seem quite good at negotiating. The last one is that you have this magical number of 52 dollars, which will take you anywhere from 0 to 60 miles, which is kind of interesting. So is it bad data?
Maybe there's a default value, and if it doesn't get overridden it's always 52 dollars. I don't know. Well, actually I think I do know, because in the last few minutes that I have I want to talk about the things we build around Scipp. So that was mostly the core features of Scipp, but we've also developed this library called Plopp, which is what we use for all the visualization we do in Scipp. The name sounds a little funny. The first time I was working on the logo, my wife looked at it and she was like, "You made something called Plopp? Is that what they're paying you for?" Everybody sort of laughed, but everybody remembered the name, so we stuck with it. It's supposed to stand for plotting plus plus.

Anyway, we've got a lot of tools, but I just want to quickly show you one of them. I'm going to go back to my original data and histogram it in three dimensions: latitude, longitude and the fare amount. So I have a three-dimensional cube, and then I have this thing that we call the inspector plot. On the left is the map, so latitude and longitude are two of my dimensions. On the right is where this is going to be useful: I've got this little tool here and I can add these dots, and they probe the third dimension, giving you the profile. You can move these dots around and the profiles will update. So I put one down here, and then the last thing I want to do is go back to my airport and add another dot here. Now all of a sudden you see that you've got this spike at 52 dollars. So I think what it is, is that it's all the airport shuttles. They've got a fixed fare, which is sort of pre-arranged, and then they'll just take you anywhere.

That was about it. Thank you for listening. There are a few links for you here; go and check them out.
I would also like to say that we are hiring. We have a permanent position as a software engineer developing tools for science, so if you're interested, come and talk to me. Thank you very much.

I have a question about the units. Is there a predefined list of units you accept, and can they be converted into each other? Like if you have grams, can you say "I want them in kilograms or tons", those kinds of things?

Yes. There's a long list of units. Scipp has a C++ core with Python bindings on top, and the units library is a runtime C++ library with a very long list of lots of different units. You can definitely do things like convert from meters to kilometers. Something like this: I'll just take 200 centimeters and convert to feet. You can also convert to feet or miles or something.

So not only SI but also imperial?

Yeah.

Hi, my name is Mark. Thank you for a great and amazing talk. My question is about the plotting, actually, because I can see that you love making things interactive and you need to visualize big amounts of data. In the xarray ecosystem there's the integration with HoloViews and hvPlot and so on. Have you considered building on that instead? Because they already have all the tools for what you've been showing.

We took a deep look at, well, probably not all of them because there are a lot, but many different visualization packages. I think the issue we had with HoloViews is that your data needs to be either a pandas DataFrame or an xarray DataArray, so we would probably have to convert to that.

Let's talk about it, because what the system really does is implement backends. So you could implement a Scipp backend and then everything would just work.

Yeah, thank you. Sounds good.

Thanks for your talk.
I really liked the feature where you could slice your data by units. Can you control the interpolation when you do that in Scipp?

The way it works is that it doesn't do any interpolation; that's something you could probably add on top. If you have a bin-edge coordinate and you give it a value that's inside a given bin, it will just return the bin that contains the value you gave it. If you don't have bin edges, so if your coordinates are marking exact points, then you have to have an exact match when you request data with a unit. If not, it will tell you "I can't find anything".

Yeah, I'll make a request on that.

Thanks for a nice talk. My question is maybe a philosophical one. What were the reasons why you decided to create your own library and not, for example, extend xarray, mount Pint on that, and somehow make a better combination of these?

We looked at xarray for a long time and we got a lot of our inspiration from it before we started building, and we have two reasons. The first one is historical: to start with, we thought that Scipp was going to need to interact with a lot of other C++ code at our facility, so we needed to have a C++ core, and then we added Python bindings on top. This may not be so true anymore, but it was true when we started the project in 2019 or 2018. The second one is that we considered contributing to xarray, but we thought that if we wanted to add something as fundamental as either units or bin edges, it was going to take a really long time to get it right and to get it adopted in xarray, and we actually needed things to move quite fast. So those are the two reasons. We're not trying to replace xarray. We just needed something at our facility.
Thanks a lot.

Very nice talk and awesome results. I think saving the units is very important for sharing data in the scientific community. My first question is: where do you store the units? Is it like an attribute or a column, or is it part of the data structure in the file?

It's in the C++ data structure, so inside the variable. We have different data structures. You have the variable, which is the lowest-level thing, and this is in the C++. The unit is stored in there next to the buffer: we have the dimensions, the unit, and we can also store variances, like uncertainties, alongside the values as well.

And the second question, regarding the binning: can you use some custom grids instead of these regular grids, like [inaudible] for example?

The bins do not need to be all of the same size, so your grid can have any size you want. But they do have to be rectangular at the moment.

Thanks.

Thank you so much. It was such a delight to listen to such an interesting topic. And now we're in the lunch break.