Alright, so today I'm going to talk about big, labeled, multidimensional arrays. For context: I work in industry at a company called the Climate Corporation, where we help farmers use weather information to make better decisions. That means I work with a lot of different types of scientific data: physical simulations, large collections of measurements, complex data with a lot going on, and the tools we had for working with it weren't that great. That's the impetus for the work I've done.

Let me be a little more specific about the sort of data I work with day to day; maybe you'll see similarities with the data you're interested in, though I think it's a little different from what people usually work with in the data science industry. The first big difference is that our data is primarily numeric, and it has a multidimensional tensor structure. For example, I work with a lot of weather forecasts. Weather forecasts are made on a grid: a model might predict, say, surface air temperature over North America. We make predictions in time, for today's weather, tomorrow's, and the day after that. We also add extra dimensions for things like keeping track of forecast runs: I make a forecast today, then run the model again tomorrow and get a new forecast, so every day there's a new initialization. And we don't make just one forecast; we run the model maybe 10 or 50 times to get a range of possibilities for what the weather could be. For next week's weather in Berkeley, well, it's probably going to be sunny, but weather forecasts can't say anything with complete certainty when you look at next week or next month. So we look at ensembles as well, different scenarios.

So we end up with, say, big five-dimensional tensors of data. A bare five-dimensional tensor is awkward to work with, but the good news is that there's a lot of metadata and a lot of labels associated with these datasets. We have names associated with the axes: we know that one axis of the array is latitude, or longitude, or time. Each point is also identified by coordinate labels: a date associated with every point along the time axis, a floating-point number for every latitude and longitude. So there's a lot of metadata that could help our tools work with this data. And lastly, the amount of data we work with ranges from small datasets that easily fit in memory to gigabytes or terabytes of data that require new tools.
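To make that concrete, here's a minimal sketch of what one of those labeled five-dimensional forecast cubes might look like, using the xray library I'll introduce shortly (the library was later renamed `xarray`). The dimension names, sizes, and coordinate values here are hypothetical, just for illustration:

```python
import numpy as np
import pandas as pd
import xray  # the library discussed in this talk; renamed 'xarray' in later releases

# hypothetical sizes: 2 forecast initializations, 3 lead days,
# 4 ensemble members, and a tiny 5 x 6 lat/lon grid
temperature = xray.DataArray(
    np.random.rand(2, 3, 4, 5, 6),
    dims=['init_time', 'lead_day', 'ensemble', 'latitude', 'longitude'],
    coords={
        'init_time': pd.date_range('2015-07-01', periods=2),
        'lead_day': [0, 1, 2],
        'ensemble': np.arange(4),
        'latitude': np.linspace(32.0, 42.0, 5),
        'longitude': np.linspace(-124.0, -114.0, 6),
    },
    name='surface_air_temperature',
)
```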
So this is the motivation, and I would claim it's actually not that different from a lot of scientific datasets. A lot of the data we work with in the physical sciences comes on grids, across very different domains: neuroscience, climate modeling, physics, experimental data. A lot of what we learn about how the world works, we collect or we simulate, and we do that on grids, with arrays. It's part of why we use multidimensional arrays so much in scientific computing. And the attributes that are tricky about our datasets are common to these other datasets as well. Almost every time someone writes data down on a grid, each point in that grid has a lot of metadata associated with it, and there's a lot of implicit information in how we set up the problem that our tools really should be able to make use of. So the framing problem for me is to think a little more generally. I don't just want to make a tool that works well for weather forecasting; I could write a library that specifically crunches weather forecast data, but I think there's a common thread here, and a need for tools that span disciplines.

[Audience question about whether regular grids are required.] Sure, good question. No, the regularly spaced aspect of the grids is not necessary. What matters is that you can represent the data as some sort of grid, regularly or irregularly spaced: something I could write down as a matrix of numbers, or a multidimensional tensor of numbers, with some consistent set of indices. It's true that things get really hard if you have, say, a bunch of point measurements, and there are a lot of datasets like that. There are tools we could build for working with those, and they're often interesting problems, but that has not been the focus here. I would actually claim that the unstructured problems have, to some extent, been solved: the tools used in traditional data science, databases, and the tools built for processing big data in the commercial world, deal with very unstructured data. The structured aspect is one of the big differences between the data of physical science and, say, click data or advertising data. So: the work I've done here has been in Python.
It tries to fit into the ecosystem of libraries that people use for scientific programming in Python. In my view, one of the really awesome things about Python is that there's not just one way to do it: there's a vibrant ecosystem of different tools. That's a strength and, to some extent, a weakness; we don't necessarily know the right tool for every job. But there's a core stack of tools that people from many different disciplines come together to work on, and that list is growing and shrinking. I'm going to suggest a couple of tools I've worked on as possible additions to that SciPy stack, each addressing one of the problems I've been talking about. xray is a library, of which I'm the main author, for labeled multidimensional arrays. The other library I want to talk about today is called Dask. Matthew Rocklin, who's here in the audience today, is the main author of Dask, though I've also made some contributions. Dask is a tool for working with larger datasets that don't fit into memory. I think both of these tools could be valuable additions to the core stack we have for scientific computing in Python.

Okay, let me get started and talk more concretely about labeled arrays, beginning with data frames. The genesis of xray comes from seeing how people work with data frames in the Python library pandas. A data frame is really a mashup, somewhere halfway between an array and a relational database: a tabular data structure, like what you'd see in a spreadsheet, but with an API that's very usable for data analysis and for crunching numbers. Data frames work really well for data without an obvious deeper structure: a bunch of records with some number of variables and a whole bunch of measurements, human data, say. And there's a right way and a wrong way to use data frames. There's a notion of tidy data, which comes from Hadley Wickham, who has contributed a lot to the R world and R data frames. The right way to organize a data frame, to make analysis easy, is to follow two simple rules. The first rule is that each variable in the dataset is a column; the second is that each observation is a row.
If we follow these two simple rules, which correspond to a standard normal form for databases as well (I'm forgetting the name of it), we organize data in a way that makes it really easy to use standard data analysis libraries, and we can build a whole tool suite for working with tidy data that works really well. Here's one concrete example, some survey data taken from Hadley Wickham's tidy data paper. Here is one version of it, and here's another, reshaped version of the same data. The wide version might be easier to read or print out if I had to look at it on a sheet of paper, but for data analysis I want the long version, where every observation is extended out into its own row. The wide version is not tidy data, and the clue is the numbers in the column names: when you see that, it's a sign you're not working with well-organized data.

So what is xray, and how does it build on these notions of data frames and tidy data? xray is like a generalization of data frames where each column can be a multidimensional array. These arrays might not all have the same dimensions: for example, I might have a pressure variable that's a 3D array, an elevation variable that's a 2D array, and a latitude variable that's a 1D array, but they all live in the same coordinate system. Because they share the same coordinates, we can think about operations on the shared set of arrays that act on that coordinate system. It's also a form of tidy data; I like to think of it as multidimensional tidy data, because again each column name is a single variable, and each observation, which can now have multidimensional structure, is logically distinct from the column names. And we have the shared coordinate system and metadata implicit in it: this is the latitude axis, or the longitude axis, or the time axis. So we have labels corresponding to the names of the dimensions, and we also have labels corresponding to the points along the dimensions, both types of labels. If you print one of these datasets out, you get a preview: a list of the names of the variables, their types, the first few values, and the coordinates along which they exist.
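Here's a hedged sketch of such a dataset: variables of different dimensionality sharing one coordinate system. The variable names, sizes, and coordinate values are made up for illustration:

```python
import numpy as np
import pandas as pd
import xray  # renamed 'xarray' in later releases

# hypothetical sizes: 4 times, 3 latitudes, 5 longitudes
ds = xray.Dataset(
    {
        # a 3D variable defined on (time, latitude, longitude)
        'pressure': (['time', 'latitude', 'longitude'],
                     np.random.rand(4, 3, 5)),
        # a 2D variable on the same (latitude, longitude) grid
        'elevation': (['latitude', 'longitude'],
                      np.random.rand(3, 5)),
    },
    coords={
        'time': pd.date_range('2015-01-01', periods=4),
        'latitude': [30.0, 40.0, 50.0],
        'longitude': [-120.0, -110.0, -100.0, -90.0, -80.0],
    },
)
print(ds)  # the repr previews names, dtypes, first values, and coordinates
```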
All the variables share the same sizes along the time, latitude, and longitude coordinates: each coordinate has a size that is fixed across the entire dataset, and the entire dataset is referenced against the shared values of the time dimension or the latitude dimension. That's what the basic xray data structure looks like.

So what can we do with this, and why is it useful? The first thing is that we can start describing operations on our dataset in terms of these labels, instead of the arbitrary numbers we'd use with traditional unlabeled arrays in tools like NumPy or Matlab, really all of these scientific computing libraries and languages. Say we want to select out the data for January 1 and take the maximum over stations, the warmest station on a particular day. With xray we can describe that using the metadata: I index along the time dimension, providing a date that selects one particular value, or some set of values, along that dimension. I can write code that directly reflects my intent and needs no explanation. You might have to guess that `sel` is short for "select", but the intent of this code is very obvious. The intent of the version with just the numbers is much less clear, because it's full of arbitrary numbers like "axis 2" or "0 through 3". Hopefully you wouldn't really write code like that, except maybe for a quick little one-off analysis. To do it right, you'd expand it: probably three lines with some comments, or some variable names, or maybe a separate data structure that keeps track of the map from dimension names to axis numbers. If you want to be able to come back to positional code and know it works properly, it's a lot more work. So it's great to have tools whose data structures directly carry the metadata. That's the biggest reason I think xray is a useful tool.
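Here's a small sketch of that contrast, with made-up station data (the sizes and dates are hypothetical):

```python
import numpy as np
import pandas as pd
import xray

arr = xray.DataArray(
    np.random.rand(365, 10),
    dims=['time', 'station'],
    coords={'time': pd.date_range('2015-01-01', periods=365)},
    name='temperature',
)

# label-based: the intent reads directly off the code
warmest = arr.sel(time='2015-01-01').max(dim='station')

# positional equivalent: which axis is which? which index is January 1?
warmest_too = arr.values[0].max(axis=0)
```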
There are also some nice tricks we can do with this metadata, additional incentives for carrying it around. One example is vectorization. I think most of you are familiar with how, in a tool like NumPy, I can add a scalar and a vector and get back a vector, or add a vector and a matrix (if I do it right) and get back another matrix: operations with an implicit loop based on the dimensions of the arguments. When we have dimension names, we can do even better than that: we can match things up based on names, not just their order. Let me show you an example. Suppose I have two one-dimensional arrays, one along time and one along space. If I add these together, I don't want to create some sort of hybrid space-time dimension; that would not be very useful. One option would be to just give an error message. But there's actually something useful we can do: when I add time and space, maybe I really want a result that has both dimensions, aligned along the full set of dimensions. NumPy calls this broadcasting, and we can make broadcasting work with the names of dimensions. That's useful for things like this, and also for things like adding a matrix to its own transpose: I can still add them together and get back something with sensible dimensions.

[Audience question: what are the values in these arrays?] That's a great question; I should probably annotate this example. If the inputs are a1, a2, a3 and b1, b2, b3, then the first element of the result is a1 plus b1, the next in that row is a1 plus b2, and so on. Each cell of the result corresponds to one time and one space, and to find the value that goes in that cell, you refer back to the corresponding elements of the original arrays, by their labels. Does that answer your question? And it depends on the operation: if I'm literally doing a plus, they're added together; if I were doing a times, they'd be multiplied. Here, one plus three would give you four. Yes, I'm recycling the points. One neat trick about how tools like NumPy are designed is that you can do all this recycling basically for free: I only have to create the full-size 2D array once, and I don't have to copy the input arrays in memory. That's a property of the strided n-dimensional array model, so this is actually a pretty cheap operation; we can do this vectorization almost for free. And the case where this comes in really handy is when I'm working with something that has more than two dimensions.
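Before moving on, here's a hedged sketch of that broadcasting-by-name behavior, annotated the way the question asked for (toy values):

```python
import xray

a = xray.DataArray([1, 2, 3], dims=['time'])   # a1, a2, a3
b = xray.DataArray([10, 20], dims=['space'])   # b1, b2

c = a + b
# dimensions are matched by *name*, not position, so the result has both:
# c.dims == ('time', 'space'), c.shape == (3, 2)
# c[i, j] == a[i] + b[j], e.g. c[0, 1] == 1 + 20 == 21
```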
If my array has three or four or five dimensions, there's a zero percent chance that I will correctly guess the order of those dimensions. It's just some mix of dimensions, and I have to think really carefully and look through my code to figure out how it works. I've written a lot of code without dimension names where you end up with a mess: okay, let's try transposing that array, maybe then they'll add together; let's try adding in a new axis; let's try... You spend time on that, and even more than the time, it's a source of annoyance, and of noise in your code that makes things really hard to keep track of. The labels make that better.

[Audience question about the order of the resulting dimensions.] The resulting dimensions are in order of appearance in the arguments, from left to right: there's an order for the dimensions in the first argument and then an order in the second argument, so here time plus space gives time, space. If I were adding in another array with some other dimension beyond time and space (I'm not even sure what goes beyond time and space; let's say we're going into some fractal dimension), that would be an extra one at the end.

So that's vectorizing by dimension names, one trick we can do with labels. Another trick uses a different aspect of the labels: once we know what the dimensions are, we also have labels along each dimension. Every cell is given some label; here I'm using letters. And I might have two arrays whose labels differ: one has the labels A through G, another has a different size, with labels D through I. When we get into a situation like this, there are a couple of things we could do. One is to say: if the labels don't exactly match up, give an error and stop. That's a perfectly sensible thing to do. But another neat thing we can do is make this still work in a useful way: if the labels don't quite match up, line up the arguments based on the labels, and return a result defined on the intersection of the labels from the arguments. This is something we borrowed from the pandas library. It's another operation we can do pretty efficiently, because pandas can very efficiently calculate the intersection of two sets of labels using hash tables; with not too much overhead, we can make this alignment work as well.
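A small sketch of that alignment behavior, with toy labels:

```python
import xray

x = xray.DataArray([1, 2, 3], dims=['k'], coords={'k': ['a', 'b', 'c']})
y = xray.DataArray([10, 20, 30], dims=['k'], coords={'k': ['b', 'c', 'd']})

z = x + y
# the arguments are lined up on their labels first, and the result is
# defined on the intersection: k == ['b', 'c'], values [2 + 10, 3 + 20]
```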
Those are a couple of things we can do with labels. Let me mention a few other operations that generalize from the world of data frames and databases to arrays. One of the most common analysis tricks people do is grouped operations: say I have some dataset defined on a big grid over the United States, with 50 states, and I want to calculate the average elevation by state. These sorts of operations also make sense on multidimensional arrays, and we can define them with xray (there's a small sketch of this below). And because we have the labels, you can just plot something; you don't have to fill the labels in later, which is really convenient from a data analysis perspective.

I think this is actually pretty important, even if it seems a little trivial. The metadata makes some of this stuff easy to compute, and that's a nice incentive for doing the analysis this way, but there's another aspect, and it's a really important point from the perspective of reproducibility in science. We know that metadata is good, and that our analyses should preserve the metadata, use it, and check that it's all there. But unless you're being super careful about it, unless you have the right data structures, it's such a pain to actually keep the metadata around that you won't bother, or you'll only bother for something really important. That means in many cases you lose the metadata, and that makes work harder to reproduce and harder to understand. Plotting is a great example: half the reason for label-aware plotting is not just that it's nice to make a plot easily, though it is. Just as importantly, the best way to get someone to fill in the name of a variable, "longitude" instead of "dimension 0", is to show them a plot labeled "dimension 0". They're going to immediately go back, fix their data structure, and keep the metadata maintained so it looks right. Having data structures that use the metadata and do things with it is a great way to encourage people to record it: the carrot, as opposed to the stick, of metadata.
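Here's the sketch I promised of a grouped operation: mean elevation by state, with a handful of made-up stations (the names and numbers are hypothetical, and this assumes grouping over a one-dimensional coordinate, which is what xray supported at the time):

```python
import xray

elevation = xray.DataArray(
    [120.0, 80.0, 1500.0, 2100.0],
    dims=['station'],
    coords={'state': ('station', ['CA', 'CA', 'NV', 'NV'])},
    name='elevation',
)

# split on the 'state' labels, then reduce within each group:
# CA -> 100.0, NV -> 1800.0
mean_by_state = elevation.groupby('state').mean()
```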
Okay, so that's my intro to xray. Now I'll talk a little bit about Dask, another approach to working with arrays. I'll describe the ideas behind Dask and then show you a quick demo of what it can do. Dask is a tool that tries to extend the scientific computing tools in Python, like NumPy and pandas, to medium-sized datasets. I guess I kind of lied when I said big data, but big data is sexy, so hopefully you'll forgive me. The reality is that most of the world doesn't have big data in the sense of really needing a supercomputer or a giant cluster of machines. A lot of data falls in the range of, say, 10 gigabytes to 10 terabytes, which is pretty awkward: you can't fit it into memory on your machine, but you can fit it on your hard drive, and you don't necessarily need a truly gigantic supercomputing solution. A single machine, or a small cluster of machines, can get you a huge amount of bang for your buck. Dask tries to solve this problem first, before trying to solve the really-big-data problem.

It does two things. First, it lets you use the multiple cores on your machine: these days even my laptop has four cores, but many of the traditional scientific Python libraries can only use one core at a time. Second, it lets you do things that are bigger than memory: it can do streaming computation even when I can't load the entire dataset into memory at once. The other part that really impressed me about Dask is that it has a very simple data model (hopefully I'll show you in a few minutes what that looks like) and a simple approach. It starts very naturally where tools like NumPy leave off, on a single machine, and generalizes in a way that isn't immediately trying to solve the really hard problems of massive distributed systems: it's slightly distributed, and then maybe, hopefully, a little more distributed. Starting simple, on one machine, with a transparent model you can understand, I think that's great. And again, my disclaimer: Dask is not my project. All the good ideas here are really Matthew's, sitting over here in the audience; he's the author of Dask, although I've made some contributions. I want to talk about it because we've integrated it into xray, and it helps us solve problems we couldn't solve otherwise.

So, the bigger picture for Dask. Dask has a clean separation between collections, the data structures someone actually works with, like an array or a data frame; an abstract representation of computations as graphs of tasks; and schedulers, which determine how those graphs actually get evaluated. The focus so far has mostly been on the multi-threaded and multi-process schedulers, which run on one machine and work very well; they let you do a lot of neat stuff. There's also the hope that eventually we'll have a distributed scheduler; that's what Matthew has been working on over the past few months, so that Dask can scale up to clusters and start solving problems more similar to what people do with big-data tools like Spark, for example.

Let me talk about dask.array, the collection I'm most familiar with and the one that works really well with xray. A dask array is a data structure built out of tasks defined on chunks. Like many distributed array systems: I have a big block, I divide it up, I define how operations work on individual pieces of the array, and I figure out how to fit them all together later. At any given point in time, a dask array is defined by knowledge of what its chunks are, and knowledge of how to create a new chunk out of an old chunk. That's represented in plain Python dictionaries of tasks that say things like: element (0, 0) of array x is produced by pulling out the corresponding chunk of array y and adding one. And these operations are defined on top of NumPy arrays.
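To show how simple that data model is, here's a tiny task graph written by hand. A Dask graph is just a dictionary mapping keys to values or to tuples of (function, arguments), and any of Dask's schedulers can execute it:

```python
from operator import add
from dask.threaded import get  # the multi-threaded scheduler

# a hand-written graph: 'z' depends on 'y', which depends on 'x'
dsk = {
    'x': 1,
    'y': (add, 'x', 10),   # y = x + 10
    'z': (add, 'y', 'y'),  # z = y + y
}

print(get(dsk, 'z'))  # walks the graph and evaluates it: 22
```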
So Dask builds on top of tools like NumPy, which handle a block of data in memory very well. If NumPy can handle, say, a thousand-by-thousand matrix, a million elements, pretty well, then I'll just define my computation as a whole bunch of little operations on arrays of about a million elements each, and then it's just a matter of putting them all together and figuring out how to actually execute the tasks.

With that, let me show you a little demo of what you can do with Dask; this is again inspired by an existing example I'm borrowing from. Let's make this full screen; is that a readable size on the screen? Awesome. Let me just restart this notebook. Dask has an API that looks almost exactly like NumPy. The only difference is that when I create an array, say a five-by-five array of ones, there's also this `chunks` argument, which says I'm going to divide it into blocks with at most three elements per side; that gives two-by-two chunks. And nothing actually exists until I compute it: if I compute the array, I see it's a five-by-five array of ones, but until then it's just a bunch of chunks, an abstract computation that has only been defined. Because we have this representation of the array in terms of tasks, we can inspect the tasks and do things like visualize the graph. This very simple example doesn't look very interesting yet: right now I just have four blocks, and they're just sitting there. But let me show you what we can build up out of these smaller pieces with Dask. I'll start simple: I add one to the array, and you see the four blocks of the array each get one task. Now I'll take the average of the array, and you can see (the scrolling is not working great here, but there we go) that we're putting all these tasks together: we take the average of each little chunk, and then we aggregate them all together. We can even do some really complex stuff, which works remarkably well. For example: add one to x, transpose (which doesn't really do anything here), then maybe do a dot product, and oh my, that gets a lot more complicated. Because of Dask's flexible model of creating tasks in dictionaries, it can represent pretty complex operations that would be a lot harder to express in, say, the MapReduce paradigm. And we can keep putting tasks together: add x, take some averages, build whole towers of tasks. Eventually, now that I have this representation of a few simple array manipulations, a scheduler can come through and execute each of these little blocks.
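Here's roughly what that demo looks like as code, a sketch along the lines of what was on screen (`visualize` needs graphviz installed to render the graph):

```python
import dask.array as da

x = da.ones((5, 5), chunks=(3, 3))  # split into a 2 x 2 grid of blocks

x.compute()                  # nothing runs until we ask; now a real 5x5 array
(x + 1).mean().compute()     # per-chunk means, then an aggregation step

# a more complex expression: the task graph gets much more interesting
y = (x + 1).T.dot(x)
y.visualize('graph.png')     # draw the task graph (requires graphviz)
```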
As an example of the complex things you can do with this: one of the contributors to Dask, a machine learning researcher (I'm forgetting his name), wrote some linear algebra routines built on top of Dask. They let you do things like this: here's the graph representing a QR decomposition, which we can visualize. It also gives a sense of how the data flows through the algorithm, even in these fairly complex situations. And again, because we have the task graphs, we can plug into the execution engine. For example, we can profile the tasks: this computation takes a minute to complete, and you can see how much time was spent on each individual chunk of that giant computation, an interactive graph showing the time spent on different types of computation. So that's a bit of a preview of the architecture of Dask.

I'm close to running out of time, so I'll just say that I like Dask because it has a simple, understandable abstraction, and it's something we were able to really easily hook into xray, letting us do really awesome things with the scientific data we work with. For example, and I don't really have time to go into it, here I open up a whole bunch of weather data in the form of netCDF files. In this case it's only about five and a half gigabytes on disk, but loaded into memory it would be something like 23 gigabytes, which is more memory than I have in my laptop. I can still easily set up computations on it. Sure, they're a little slower; it might take me 30 seconds to go through the array and compute the mean, but that's just the price of loading data from disk, and it's awesome that we have tools that can do this at all. The integration is really nice: computation in xray builds these task graphs, and because everything is defined on little pieces of the array, I can compute small things really quickly. If I'm doing some scalar arithmetic to convert from Kelvin to degrees Fahrenheit, I can see a little preview of what the result looks like without having to compute the entire array, because Dask only needs to compute one little block. And we can hook into the Dask execution engine to do things like add profiling, or a progress bar that shows how long it takes to execute each task, or caching so we don't have to calculate the same values twice. When we compute things, we can really see what goes into our computation and how it's going to be executed.
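In code, the integration looks something like this sketch (the file name and variable name are hypothetical, and this assumes a version of xray with the Dask-backed `chunks` option):

```python
import xray

# lazily open a netCDF file, backed by dask arrays split along time
ds = xray.open_dataset('forecast.nc', chunks={'time': 100})

# building the expression is instant: no data is read yet
fahrenheit = (ds['temperature'] - 273.15) * 9.0 / 5.0 + 32.0

# previewing a corner only pulls the one block it needs from disk
print(fahrenheit[0, :5, :5].values)

# the full reduction streams through the file chunk by chunk
print(fahrenheit.mean().values)
```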
So that's a bit of a preview of xray and Dask. In the last few minutes, let me go back and talk about some of the opportunities and bigger-picture areas that I'm excited about for the future of tools like xray and Dask and the broader scientific computing array ecosystem.

Dask works very well for what it does, but there are some things it struggles to optimize because of the data model that it has. One example is figuring out the right way to divide up your data into chunks. That's something a tool like Dask is just not suited for: you really need something that keeps track of what your entire computation is, and can then go back and say, given your computer's architecture, given how much data you can store in memory, given how you're going to use it, here's what the right chunk size should be. In practice that can make a large difference. Another example is avoiding unnecessary reads of data from disk: maybe I don't want to read an entire block of data; when I compute, say, a sine function on the first 10 elements, I'd rather do the indexing operation first and then compute the sine. These are things we can do in principle if you know what the computation is mathematically, and some of them we can do in Dask, but some are hard, especially things like adding two arrays together, where again I want to pull out just the needed data before the addition. These can be hard to make work with the tools we have. A third example: if you have an abstract representation of the computation, you can often optimize it and speed it up. In a tool like NumPy, when I evaluate a compound expression, I create intermediate results holding a lot of extra data; if you know the whole expression ahead of time, you can do the computation much faster without any temporary results. That's fusing operations, and it's possible especially with just-in-time compilers; there are examples of tools that can do things like this, like Theano or Numba. So maybe there's room for another project for lazy array optimizations; there are a few projects going in that direction that I'd like to see improved upon, ideally building a layer that lives on top of Dask.

[Audience question about repeated subexpressions.] Yeah, that's a great question: if an operation does x plus one in two different places, the same operation on the same variable, I think, and Matthew can correct me on this, that currently Dask would probably execute both of those separately; if we're lucky, they'll hash to the same task. You don't want to compute it twice, and that's another thing you might imagine adding in some higher layer on top, to notice that kind of thing. Dask does a reasonable job with some of these tricks, but I think there's some room here for building on top of it, something that other projects could use as well.
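Here's a sketch of what fusing buys you, using Numba as the just-in-time compiler (a toy stand-in for the kind of optimization a lazy array layer could do automatically):

```python
import numpy as np
from numba import jit

def with_temporaries(x):
    # NumPy evaluates eagerly: (x + 1) and (x + 1) ** 2 are both
    # allocated as full-size intermediate arrays before the sum
    return ((x + 1.0) ** 2).sum()

@jit(nopython=True)
def fused(x):
    # the whole expression runs in one pass, with no temporaries
    total = 0.0
    for i in range(x.shape[0]):
        total += (x[i] + 1.0) ** 2
    return total

x = np.random.rand(10 ** 7)
assert np.allclose(with_temporaries(x), fused(x))
```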
The other thing, coming back to xray: there's a lot going on in xray. It's really powerful, almost like an in-memory array database, with labels and multiple arrays all in the same coordinate system. For a lot of use cases, that's really overkill, more than what you want or need to deal with. In particular, for a tool like scikit-learn, the machine learning library in Python, they would really love to be able to keep track of just whether a matrix is samples by features or features by samples; there's a lot of bookkeeping to keep track of that stuff, and something simpler might meet those needs without being as complicated as xray. So maybe something simpler than xray; maybe there's a way we could put these labels closer into a tool like NumPy itself, a simpler version of this. It's something people have talked about for some time.

[Audience question about whether xray handles out-of-core computation.] "Handles" is maybe not quite the right word: xray absolutely can execute computations on top of Dask, and you could in principle even put xray objects inside Dask graphs, which in some cases might make sense. So it does do out-of-memory work as well, just with this more complex data model.

And that's the last thing; I'm probably going to end here without going into too many details. There are a lot of really cool tools for working with arrays, and I told you today about a couple of them that I've worked on, xray and Dask, which build upon tools like NumPy. But there's a whole bunch of different tools that fit together to meet all these needs: in-memory arrays, parallel arrays, different in-memory layouts like sparse arrays or compressed arrays, different types of labeled arrays, arrays with units, different parallel and distributed backends. Maybe I want the Dask approach; maybe I want to scale up on a tool like Spark, which is what Bolt does. And then there are a million domain-specific libraries as well. How can we make it possible to mix and match these different tools, so scientists can pick the right one and aren't locked in, so we don't need every astronomer to standardize on the exact same set of underlying libraries? How can we make it easier to compose these operations? I think it's a really important problem. I have some ideas about how we could fix that in NumPy, which I'm not going to go through today. And there are a lot of different types of data I don't really have time to go into either. So maybe I'll leave it here: I think we can do really awesome science if we have better array data structures. We can do it faster, more reliably, more reproducibly. So let's build some awesome new data structures.

[Audience question on rank statistics.] Yes, rank statistics are hard to do with distributed, out-of-core data.