Hello. So I'm Donald Stevens. I'm a senior software developer at Anaconda. Just a little bit about myself: I've been at Anaconda since 2015, and I spend a lot of my time working in open source, particularly in the HoloViz ecosystem, which is something I'll be talking a bit about. So this is my talk, Seeing the Needle and the Haystack: single data point selection for billion-point datasets. First, a tiny bit about HoloViz. HoloViz is a collection of libraries. They're all open source, and they're all community open source, where everybody who contributes uses the libraries before they actually start developing on them. It's a collection of different visualization libraries that each have a particular focus, and our goal is to make all these different things work together. In this talk I'll be talking mostly about DataShader, but all the plots you'll see are based on Bokeh, with HoloViews and the rest of HoloViz in the background, and DataShader itself is powered by Numba, which isn't a HoloViz project, but is open source and makes everything possible. So the first thing to say is that nowadays a gigabyte of data is really not that much. We have huge amounts of data floating around, and the thing to understand is that when you have huge amounts of data, it becomes harder and harder to understand the structure in that data. What you want is a way to get the insights you actually need out of the datasets that you have. From the beginning of computing, we used simple visualizations to view our data: simple plots, scatters, whatever. But as the data gets bigger and bigger, we hit this problem of how to visualize large datasets in a way that actually gives you insight into them. So data is cheap and plentiful, but insight is expensive.
And a lot of the plotting tools that exist nowadays don't deal with large data that well, even with WebGL, and I'll talk a little about the differences between WebGL and our approach. What's really interesting about large data is that the same approaches don't work at all, because of issues such as overplotting, which I'll describe and show you. So DataShader is where I'm starting. My whole talk is going to be a little bit about the history of DataShader, and as I go through it, I'll motivate the features that we have in DataShader and show you how it works. So it's going to be a little bit historical as well as giving you insight into our philosophy. DataShader.org has been around since 2016, as has DataShader itself, the library it talks about, and it's been able to make really nice images and pretty plots for a long time. This, for example, is the U.S. census: 300 million samples or so. You can see the United States quite clearly just from every single person that lives there. This is a really interesting plot: every single person in the United States is contributing to this image, maybe a tiny amount. And you can see, of course, your densely populated areas like New York, and your sparsely populated areas. What's fascinating is you can really see the spatial structure of the country and where people live. Part of what makes this possible is something called histogram equalization. If you try to use a normal color map to make an image like this, with either log or linear scaling, you will normally fail. The reason is you'll have these bright spots like New York City, which have huge populations compared to the sparse points where no one is living, and that will blow out your color map.
Your max will be so high that everything else gets washed out; even if you use log scaling you still have this problem. So one of the earliest things that DataShader did is histogram equalization, whereby it takes the distribution of your data and warps the color map so you can actually see the spatial structure of the data you're looking at. Here's another example of a DataShader image. This one is categorical: it's using the racial categories in the census dataset, and each color is a different racial category. This is Manhattan; you can probably make it out if you know the area. [Adjusting the projector.] Okay, I think we'll be okay; you'll see some of the other plots. I'm doing a lot of geo examples in this talk, but I want to emphasize that DataShader is more than just geo plots. For instance, we have attractors: mathematical objects that are really pretty to look at. So there are two problems. Overplotting is one, which I'll show you: if you use a normal plotting library that draws a little glyph for each point in a scatter, you're just going to get a big mass of points and you won't be able to make out what it is. And then there's the dynamic range problem, which is what histogram equalization in DataShader addresses. So those are the two parts: the dynamic range issue and the overplotting issue. Now I'll give you an example. Here is a dataset with 2.4 million samples, which for DataShader is actually really small. One of the things I should say about DataShader is that it really scales to as much data as you have. Because it's server-side, you can throw Dask at it and use a cluster.
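To make that dynamic-range point concrete, here is a minimal NumPy sketch of the rank-based idea behind histogram equalization. This is an illustration of the concept only, not DataShader's actual eq_hist implementation.

```python
import numpy as np

def eq_hist(agg):
    """Rank-based histogram equalization of a 2D count aggregate.

    Each nonzero count is replaced by its quantile among the nonzero
    counts, so the output fills the whole [0, 1] colour range no
    matter how extreme the outliers are (a sketch of the idea, not
    DataShader's exact algorithm).
    """
    flat = agg.ravel().astype(float)
    mask = flat > 0                      # leave empty pixels at 0
    vals = flat[mask]
    ranks = vals.argsort().argsort()     # rank of each value, 0..n-1
    out = np.zeros_like(flat)
    out[mask] = (ranks + 1) / len(vals)  # quantile in (0, 1]
    return out.reshape(agg.shape)

# One "New York City" pixel next to sparsely populated ones:
agg = np.array([[10_000, 1, 2],
                [3,      4, 5]])
# Linear scaling would map 1..5 to nearly zero; eq_hist spreads
# the small counts evenly across the colour range instead.
img = eq_hist(agg)
```

With linear scaling the five small pixels all land within 0.05% of zero; after equalization they occupy evenly spaced colour levels, which is why the sparse parts of the census map stay visible next to Manhattan.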
A huge number of compute machines, whatever you have: as much data as you have, if you throw compute and resources at it, you can get it to work for any size dataset. And I mean terabytes and up; it really comes down to how much compute you can afford to put behind Dask. So this dataset has these columns; I'll just show you an example: longitude, latitude, temperature, heading, and height above mean sea level. This is information from gulls. It's taken from Belgium, the north coast of Belgium, where they were tracking gulls. The gulls had little bracelets with GPS and other sensors on them, and this is basically the dataset of all these birds and where they went. So let's see what DataShader could do from the beginning. This is from 2016. You can make a pretty picture, and you can see that there's some spatial structure, but the problem is, number one, it's a static image. It's just a static image, and you have no context. You don't know where this is located. You have no way of seeing inside this data. So this is what we had with the first releases of DataShader. Now we can compare it to interactive plotting. Here we have an example of a Bokeh plot, which is classic plotting in the browser. But we have a very small subset of the samples, because if we actually pushed all 2.4 million samples, we'd probably crash the browser tab, and you wouldn't be very happy. So here it's what, 0.5% or something? Here are the samples, and now we have all sorts of things we can do which we couldn't do before: we can zoom, we can pan, we can hover; we can get an idea of what's going on by interacting with our data. Now, what we really want to do, and our big goal for HoloViz, is to make this kind of easy interaction possible with the biggest datasets you can imagine, again, if you have the resources.
And everything you see is running on my laptop, so I'm not using any fantastic resources here. Now, the problem with this is it's not the whole dataset. And, as I said, there's the overplotting issue I mentioned: it's hard to see what's going on here. You just have all these circles overlapping each other. You don't see the structure anymore, and you don't have the dynamic range; you don't see anything about how the data is distributed, in terms of where it's concentrated and where it's not. And also, you can't really query this. By query, I mean compute statistics over all the samples under the cursor. There might be hundreds of gulls there, and you won't be able to tell, because the hover tooltip goes off the screen; it's just a really long list of data. But it does have inspection: if you have a single gull, you can easily see what its data is. Okay, so this is really something that we want. In 2017, we added the functionality in Bokeh to make an event system, hooked it up with DataShader and HoloViz, and we ended up with this ability: we can now zoom in, put tiles behind it, get the context, see the coast, see the structures that the gulls are aligning themselves with, whether they're piers, or coasts, or whatever it is. And now you can zoom in and see the structure of the data using eq_hist. This is server-side histogram equalization: the server is doing it. So this is a big improvement, but it still doesn't do all the things that we had before. There's no color bar, for instance. This is an image generated by DataShader: DataShader goes through the whole dataset and builds a 2D histogram, and that's what it uses to color map things with histogram equalization. With that, you can actually have a color bar, but this one doesn't. And again, you don't have query, you don't have inspection, but now we do have panning and zooming, and it's the whole dataset.
So a big improvement, but still not where we want to be. The next step is to ask what we could do if we did it slightly differently. Well, we could use something called rasterize. This time, there's no color mapping happening on the server side; it's in the browser. This is a log color mapper. We've got a color bar, which is really nice. We can hover and get the counts: the number of gull samples that contributed to the pixel you're hovering over. So now we have some hover, which is good; not all the information, but some hover. We can still zoom and do all these other things, but it's not as easy to see what's going on, because we don't have histogram equalization: it's just log, and linear is even worse, you just can't read it. So now we have some more things, like the color bar, but we've lost the full dynamic range. Then in Bokeh 2.2, I think this is 2021, we actually got histogram equalization working in Bokeh on the client side. So now it looks like what we had before, except we have our hover and we have a color bar. So now we're getting very, very close to the sort of data exploration experience that you want. And note that this is something that could probably just about be handled by WebGL at the limit, 2 million points, 2.5 million, but it's really at the limit, and DataShader really excels beyond that. Okay, so now we have a whole lot of stuff. The thing that we don't have is information about the individual gulls. It's just this count, which is the number of gull samples that actually contribute to that pixel: 441 gull samples in the pixel that I'm hovering over, for instance. So what we want to do now is to be able to query the data samples. And by querying, I mean getting all the samples with some simple spatial filtering: around the cursor, there's some delta in X and delta in Y, and in that area, can I get all the samples, and can I get the information out of those samples?
So this is something that we have in HoloViews, where you can set up an inspect operation, and this is all API that we're going to put into hvPlot to make it easier. So it's not much code, but we're going to simplify this further. What it does is essentially look at the samples in the original data, get that data, push it to the browser, and update the plot. For now, all I've done is make a very simple version without that: just a tile source centered on the coast of Belgium, and the normal rasterize plot, which I showed you earlier, with the color bar. And as you can see, the color bar updates due to histogram equalization; you can see these strange numbers here. Now, with this operation that I've added, you can hover anywhere. What's happening is, effectively, I get the x, y position of my cursor, send it to the server, and apply a delta, so I do a spatial filter based on some x and y delta on the original data frame. HoloViews keeps a pipeline of all the operations all the way back to the raw data, so it can figure out which samples have x and y in that range, find, in this case, the closest sample, the closest gull to the cursor, and then send it back to this little white circle that follows my mouse, giving me this hover information. So this is really great, and it really works for this size of dataset. We were pretty happy with this, but what we found is that spatial querying becomes quite slow for the sort of datasets that we really want DataShader to work with: 300 million, a billion points. At those sizes, doing all those filtering queries becomes expensive, even if you do tricks like spatial indexing. We've tried that and it helps, but spatial indexing did not get us the performance that we wanted. So here's the stuff that's new this year.
Well, actually, first I'll say, I kind of said this already, but what's interesting about this approach is that you get all the samples. So if I were hovering on New York City, it would be all those millions of people in that tiny little area; I could actually compute some statistics, some analysis, some means, whatever it is, and that's really powerful. But it's slow, and it's always going to be slow if you have code that has to do those kinds of aggregations. What we decided is that we want something we call instant hover inspection. And the thinking is that when you're hovering over the data, you're never going to be able to see the whole data anyway, because, as I said, even when Bokeh gives you hover, the tooltip isn't readable, especially if you have hundreds and hundreds of entries; it's just a giant list. It's not really that helpful. What you often want is just an exemplar: a single sample, a single statistic, something very simple that gives you an idea of what's in that area, instead of trying to see all the data in that area. And that's what we mean by inspect. So, coming to how it works: before, HoloViews was doing everything, but now we're adding support in DataShader itself to do these aggregates, whereby we actually get the index of the pandas DataFrame row at that position. And we do it in one pass. That means DataShader goes through the entire dataset, builds the 2D histogram for what you see, but also keeps track of the indices in the DataFrame, so that you can easily look samples up without that expensive x-y spatial filtering step. Okay, I might actually demo it before I talk about how it works; actually, this is part of how it works. So here's an actual example of this happening now. We had our gulls, and if we go back to our table, you can see that one of the columns was height above mean sea level.
So this is the height of the gull at the point when that GPS sample was made. What we can do now is say: don't show me all the gulls, just show me the highest gull. In a way, you can think of it as looking down on the earth at the top gull, the one flying highest. So what we have is something we call a selector, where we can take the max of the height above mean sea level and use it together with the aggregation pass that makes the image that you see; and this is basically a visualization of that. So now when I hover, I'm getting an index into a pandas DataFrame for that sample, which is the highest gull at that point, right? And that basically removes the problem of having to do a spatial selection: it's just been done by DataShader as it went through the data the first time. So using this information, this is some of the new stuff we have now. This is what we call instant inspection. It looks a little bit like what I showed before, but there the little white circle, which is hard to see, I guess, kind of lags my cursor, and that's because every time I move my mouse, I talk to the server, get a response, and it has to do the spatial selection, which is fast for this size of dataset but, as I said, becomes a problem when you have huge amounts of data. This, though, is instant. I can move my mouse around, and I can do it with much, much bigger datasets, as I'll show you in a second. So this is kind of where we are now. You have your whole dataset. You can visualize everything you've got, no matter how much it is, far more than you could even manage with WebGL. There's no overplotting, because you can do histogram equalization for your color map, so you've got your full dynamic range. You've got all your interactivity: the panning, the zooming, all the stuff that we like, plus color bars. We've got the tiles underneath; we can see the context.
We've still got the query, available as an operation if we need all those samples, but we also have this quick, easy, fast inspection, which is really great: it gives us kind of all the interactivity that we had with a standard Bokeh plot, without the large-data problems. So I'm going to give a little demo. As I said, there's a lot of geo, and I know not everyone cares about geo, so I'm going to quickly switch to this tab. [Finding the terminal; the app is still running.] Okay, so this is a really cool little dataset. This is dimensionality reduction on language. This is not my data; it comes from Christopher Akiki, whose dataset is on Hugging Face. What it does is basically dimensionality reduction on a corpus of text across different languages, which effectively gives you a kind of clustering. Now we can look at all this data, we can see all the nice colors, we can see nice structure here, and of course we can zoom in and see all the different languages, right? Tamil, yes; there's all the Portuguese there; and so on. All the different languages that are in each cluster. If you imagine doing this without the hover, you'd have to have a giant legend with all the languages and all the colors, and it would be kind of hard to see what you're looking at. But this just makes it super easy. This seems to be Catalan, Portuguese, and so on and so forth. So this is a really pretty, really nice little example of a non-geo dataset, more of a machine-learning type task, showing you the instant inspection. Okay, so back to the talk.
So far, these have not been DataShader-sized datasets. The next one is 300 million points, and I mentioned a billion in the title of the talk; we do have DataShader examples with 1 billion points of OpenStreetMap data. Those are available, and that's the sort of thing that we want this all to work with. But I will show you something which is not quite a billion, but 300 million, which I think should be good enough. I'm actually starting it up on my laptop right now; it takes a second or two to come up. This is ship traffic data: all the ship traffic around Vancouver, and each ship has a transponder that gives you GPS coordinates for that ship, as well as things like the heading and the ID of the ship. And of course, when you have the ID of the ship, you can look up that particular ship: the name of the ship, the length of the ship, the weight of the ship, and so on. So here it is. This is 200 million AIS pings, which are these GPS pings. With our instant hover, you can see the name of the ship and the type of the ship as we look around. That's our instant inspection, but it also has the querying. This little square here is doing the querying, whereby I can actually get all the ships in that pixel. I can even look up the vessel by its ID and find a photo of that ship, as well as the name of the ship and everything else. So it's got a drill-down aspect to it as well. I can click around; if I click somewhere else, that will update. So these are new values; let's click somewhere else again. It takes a second. This is what I was telling you about the inspection: with querying being slow, you can see it takes a few seconds, so it's based on tap instead of hover, and that's because of the size of the dataset that we're working with.
But the actual hover, based on instant inspection, is just right there, right? And of course, I can zoom in and pan and so on and so forth. So the red ones are, I think, cargo ships. You can see just by looking around that they're mostly cargo, and these are towing ships. You can see they actually take different paths around the coast, right? So this is the sort of insight into the data that you can get with these kinds of tools, which is quite difficult without having something like DataShader running on the server for, again, huge datasets, which go well beyond what you could do in a browser tab with WebGL or something like that. So the last thing I'll show you is that this notebook is itself a dashboard. Everything I talked about builds up to a final dashboard. It's only a small amount of code. It uses Panel, which is another HoloViz tool that lets you make dashboards really easily: it lets you take all your pieces, lay them out in columns and rows, put them in a template, and so on. So for my last example, I can take this notebook and run it with panel serve; it hits the servable call in the last bit of code, and it turns into a dashboard. So this is the final example of the gulls, now presented as a dashboard that you could run on a server and share via a URL, to share your insights with someone else. So this is it with a little bit more text, a little bit more information, a nice title. You can switch to dark mode if you want. The color bar with eq_hist we're still working on a little; that's why you get NaNs where there's missing data, which we'll tidy up. But you can see that you get all the information about the highest seagull that you're hovering over, right? Again, all the stuff: the heading, the height above mean sea level, and so on and so forth.
So we've achieved a lot. We're very happy with where we're going, and we're going to have this all released very, very soon. Everything up to the very last step has been released, and we're getting that instant inspection out soon; in fact, it's in DataShader already, and we're just integrating it into HoloViews and hvPlot. But there's still other work going on. One of the things that has happened in Bokeh is that categorical color mapping can now be done client-side. So here you can see server-side color mixing when you've got different categories, and here it's done client-side, which lets you do things like hover, again, but with categoricals, and also have a color bar. So there are definite benefits to doing client-side color mapping. We're also focusing on other things, like time series. We want DataShader to have very good time series support. We already have anti-aliased lines; I don't have a time series example here, but DataShader does work pretty well with time series with anti-aliasing, though there's another step, which is getting these index operations to work with anti-aliasing as well, and that's a tricky job that Ian Thomas is working on. So yes, there's actually a sprint, a Panel sprint. It's not about DataShader or HoloViews particularly, but Panel is used by HoloViews and is an important part of our ecosystem, at the base of a lot of our packages. If you want to help out, there's a Panel sprint at EuroPython, so please do check that out. To wrap up with acknowledgments: exciting news this year for HoloViz is that we're now NumFOCUS sponsored, which is very exciting for us, and I'd also like to thank NumFOCUS for enabling me to attend this conference today. And lastly, of course, this is the work of lots and lots of people.
Here are just some of the people involved in the last releases of HoloViews and DataShader. There are many more, but this is what GitHub throws up when you look at the release notes. So that's HoloViews, and that's DataShader. Okay, it looks like we've got five minutes for questions. Please go ahead. [Question about geospatial support.] Yes, that's a very good point. We are not using the geospatial stack in this example, because you need GDAL and lots of other heavy geo dependencies, but we have a project called GeoViews, which you can find on the HoloViz website. So that is something I recommend everyone looks at: if you go to holoviz.org, you get an overview of all our different packages, including Panel, and there's also GeoViews, which, if you click on it, you can see does projections and does support all this stuff. You need to start thinking about the curvature of the Earth for really big areas, and that's where you need GeoViews, because you need the full Cartopy stack and all that geo stuff. So we do support that, yeah. 3D we don't do, but we do projections in 2D. We can do a little bit of 3D with Plotly, but not for geo, I don't think. [Question about the language dataset.] Okay, so I am not an expert on this data. It's an example that I found that is very pretty and has some nice plots, and I describe it to the best of my ability. If you want to learn more, you can get it from the real source, Christopher Akiki; it's his dataset and it's on Hugging Face. Any more questions? I think we'll wrap it up here. Congratulations on the NumFOCUS funding, and thank you very much for contributing to the open source community. Thank you.