Hello, everyone. Welcome back to our second keynote of the day. Our speaker is Simon Willison. He's best known as a co-creator of the Django web framework, but also as the creator of Datasette, which you will be hearing about today. I love Simon's talks. I've heard other talks by him before, and the reason why I'm super excited is because his talk is very much in the spirit of what csv,conf is all about, which is highlighting tools and practices that make it possible to democratize data. So he'll be telling us today about Datasette, how it works, and how it makes it possible for people to both publish data and allow other people to explore that data in different ways. So Simon, take it away.

Hey, good afternoon, csv,conf. It's really exciting to be here. I was at csv,conf in person a couple of years ago; I have a demo relating to that a little bit later on. Today I'm going to be talking about a couple of projects I've spent the last three years working on, called Datasette and Dogsheep. And I'm just going to fire up screen sharing. Well, hang on. I thought I had this set up. There we go. So there's screen sharing.

I'm going to start out with a demo. This is a database of all of my Swarm check-ins. I use Foursquare Swarm on my phone: I check into different places as I go to them, and I've exported that data back out again and loaded it into Datasette. So this right here is a relatively unexciting looking table of data that I pulled from Swarm. But if I join that data, this is check-ins, if I join that against the venues, I can pull out things like the latitude and longitude. And once I've got latitudes and longitudes, I can draw stuff on a map, which means I can have a very exciting looking map of places I've checked into over the past, I think, five or six years. Or I can turn on pandemic mode and filter for everything that was created since that terrible day last year when Tom Hanks came down with coronavirus and the entire world shut down, and see that in the last year and a half I've been a lot more limited in the places I've been to.

The map here is interesting because this is not a feature of Datasette itself. This is a plugin. Datasette has a plugin system where plugins can add all sorts of additional functionality. In this case, I've got a plugin which looks at the table on screen, tries to find latitude and longitude columns, and if they exist, it attempts to draw all of those points on a map.

But let's try to do something really useful with this data. Every time I check in with Cleo, my dog, I use the wolf emoji in my check-in message. And that means I can run a query where I filter for check-ins where the shout message contains that wolf emoji. And if I do that, I get a map of places that Cleo likes to go. So this is my dog's own personal view onto that Swarm check-in data. I can zoom in and see different things there.

A crucial feature of Datasette is something called faceting. The way that works is I can select one of these columns and say facet by this, and I'll see a count of each value in that column across the entire table. So this right now is showing me my dog's favorite categories of places that she likes to go. Unsurprisingly, she loves parks and dog runs, but it turns out she also loves coffee shops. So if I click into Coffee Shop, I can see the 30 check-ins that Cleo has made at coffee shops and start getting an idea of her coffee preferences as a San Francisco, now Half Moon Bay, dog.
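To give you an idea of what that dog's-eye query looks like, here is a rough sketch. The table and column names are simplified from what swarm-to-sqlite actually creates, so treat them as illustrative:

```
-- Illustrative schema: the real swarm-to-sqlite tables and columns may be named differently
select
  checkins.createdAt,
  checkins.shout,
  venues.name,
  venues.latitude,
  venues.longitude
from checkins
  join venues on checkins.venue = venues.id
where checkins.shout like '%🐺%'  -- the wolf emoji marks check-ins with Cleo
order by checkins.createdAt desc;
```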
And I can even do a further facet by venue name to see the coffee shops that Cleo enjoys the most. So she loves Blue Bottle Coffee, but she has been to Starbucks four times as well.

Now, everything that you can see in Datasette, you can also export back out again. So you can get that data out as JSON. Here's a JSON feed of coffee shops that my dog likes going to. Or you can pull it out as CSV. This is a CSV file I could then load into other tools to let me do an even deeper analysis of my dog's coffee habits.

And this, essentially, is what Datasette is. Datasette is a web interface on top of a database. The database that it's using under the hood is SQLite. And the reason I'm using SQLite for this is, well, there are a whole bunch of reasons, but fundamentally SQLite is a really great way of packaging up data into a single binary file: the SQLite database is a something.db file on disk. And because it's a single file, and you don't have to run a PostgreSQL or MySQL server, these things are really cheap to create, and they're really cheap to back up and send around and publish and so on as well. So a big part of the Datasette project has been exploring the limits of SQLite and figuring out how good SQLite is as a format for publishing and sharing data. And honestly, the limits are pretty much endless. I keep on finding new things that SQLite can do. I think the theoretical limit on the size of a SQLite database used to be 140 terabytes, and then it roughly doubled to 280 terabytes in a recent release based on feedback from one of the users. So there's an awful lot that you can get done with a simple little SQLite database.

So datasette.io, then, is the overall project. Datasette is the software that I just showed you, which lets you explore these database tables, visualize them, run plugins against them and export the data back out again. And then Datasette is also a growing ecosystem of tools that I'm building around this core idea. I mentioned plugins: I've now got 59 Datasette plugins (there are actually a few more than that that I need to get into this index) that add all sorts of additional features, like password authentication, or GraphQL APIs, or the ability to serve map tiles for use with JavaScript libraries like Leaflet, all sorts of different bits and pieces. And then I also have these tools. Tools are different from plugins: you don't install tools into Datasette. These are mainly things for building SQLite databases themselves. Datasette is great if you've got your data in SQLite, but in order to use it, you need to get that data from other sources into that format.

So what I'm going to do next is show you a demo of how this overall set of tools works. Hang on, let's do... Here we go. The London Fire Brigade publish an amazing CSV file called Animal Rescue incidents attended by LFB. It's 2.6 megabytes. It's every time they've rescued an animal since January of 2009. This is delightful. I found out about this a couple of days ago and I downloaded a copy of that CSV file. So now if we look at... Here we go. If we look at that CSV file, it is, unsurprisingly, a CSV file that's got a bunch of columns, like incident numbers and type of incident and all of this different stuff. What I'm going to do now is load this into a SQLite database. And the way I'm going to do that is with a tool I wrote called sqlite-utils, which is a whole bunch of command line tools for manipulating SQLite databases in different ways.
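The whole pipeline I'm about to run boils down to just a couple of commands. Roughly, assuming the downloaded file is saved as animal-rescues.csv (a hypothetical file name, since it isn't shown on screen):

```
# Load the CSV into a new SQLite database (creates incidents.db and an incidents table)
sqlite-utils insert incidents.db incidents animal-rescues.csv --csv

# Serve it with Datasette; -o opens a browser pointing at the local server
datasette incidents.db -o
```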
So I'm going to run sqlite-utils... I'm actually going to... Okay. So I'm going to run sqlite-utils insert. I'm going to create a database file called incidents.db. I'll create a table called incidents. And I'm going to feed it that CSV file. Oh, it threw an error here because, by default, it expects you to give it JSON, but you can use the --csv option to tell it it's a CSV file. There we go. That has now created me a... Let's see. Oh, ls -lh. It has created me a 2.8 megabyte SQLite file. So now I'm going to run Datasette against it. Type datasette, give it the file name, and it starts up a server. I could click this link, or I can add -o and have Datasette open my browser for me.

So here we are. This is that CSV file now running inside Datasette. What have we got here? We've got 7,370 rows. I have that map plugin installed, so it can show me them all on the map, including one that has a 0,0 latitude and longitude and shows up in the wrong place. But the really fun stuff is when we start faceting this. So I'm looking through these different columns. There is a column here for animal group parent. If I facet by that, I can see that cats get in trouble most often with the London Fire Brigade: 3,500 incidents. Then, surprisingly, it's birds before dogs. Foxes, 338. Deer, cats, hamsters. I like ferrets. And it turns out there have been 8 ferret incidents attended by the London Fire Brigade in the past decade. If you scroll down here, you get the description and see it's ferret trapped behind kitchen unit, ferret stuck in cage, ferret stuck in lift shaft, ferret on roof stuck in guttering area. This is a fantastically rich set of data that we have here about ferrets, and about all other kinds of animals as well.

But there are a couple of problems with the data that we've got here. In particular, we've got a column here called incident notional cost, which is an estimate of how much it cost the Fire Brigade to rescue that ferret who was trapped behind a kitchen unit. But if we sort by that, you can't quite see it on this... oh, actually it's still filtered to ferrets, so I'm going to get rid of the ferret search. The problem we have here is that because it came from a CSV file, this field is being treated as text. So first you've got the word null that shows up first, and then you've got the 999 pounds sorting before the 1,999 pounds, because we're sorting a numeric column alphabetically.

So the first piece of cleanup I want to do is to convert the types of these columns into numbers. And I've actually got a Datasette plugin I can use for that. It looks like I haven't got it installed, so I can do datasette install datasette-edit-schema, and this will install that plugin. And now when I open up Datasette again, I actually need to log in to the Datasette instance so I can start making edits. So I'll add --root to log in as the root user. And now I'm signed in and I get a new option on this incidents table: I get this little cog and an option for edit table schema. So in here I can say, you know, I want pump count (that's the number of fire engines), pump hours total, hourly notional cost and incident notional cost, I want to treat all of those as integer columns. Apply changes, and now when I go back here, these are now integers. You can tell that they're numbers because they're in a slightly different shade of color. If I sort by this, let's sort descending, unfortunately I've still got those null values.
I can filter those out, so I can say where... what was that? Incident notional cost does not equal the string 'NULL'. And now I can see the most expensive of those events, which was a cat stuck within a wall space, where the RSPCA, the animal welfare charity, were in attendance. And then there was a foal in the river, and a horse fell into a swimming pool. These are the more expensive incidents. It kind of amazes me that the London Fire Brigade only bill 4,000 pounds for that, or only consider it a 4,000 pound cost. I think the San Francisco fire department would charge quite a lot more money than that.

But yeah, this sort of illustrates part of the expanding scope of the Datasette project. Initially Datasette was just about read-only queries, but I've started building out plugins that can write to the database as well and do things like, in this case, fixing up schemas and converting column types.

I'm going to make a couple more changes to this. I'm really frustrated by those 'NULL' strings. So what I can do there is just run SQL directly against that database file to clean those up. I'm going to run a SQL query against incidents.db to update incidents, setting incident notional cost to null where incident notional cost is the string 'NULL'. And if I do that, it tells me it affected 50 rows. I'll paste in these other three as well. So now, if I open up Datasette again, those null rows don't need to be filtered out anymore.

The one last thing I want to do is set up full-text search across this data so that we can actually run search queries against it. So I'm going to sign in as root again, and then there's another plugin I built called datasette-configure-fts. I can use that and say, you know what? I want to be able to search this data by the animal group parent and by that final description. So now I'll configure search across those columns. This is using SQLite's built-in full-text search feature, which is actually pretty powerful. It can do stemming, it can do relevance scoring. It's a really neat piece of the overall SQLite package. And Datasette is smart enough to spot if you've configured full-text search for one of these tables, and it adds a search box at the top. So I'm going to search for, well, search for ferret again, and there are our ferret ones. I'm going to search for ditch, because I happen to know that a lot of creatures end up in trouble when they fall into a ditch. If we facet by animal group parent, we can see there have been 29 horses that have fallen into ditches around London and needed to be rescued by the fire department. Two deer, one bird. How does a bird get stuck in a ditch? A duckling was trapped in the drainage ditch. There you go.

And this is really where this stuff gets fun: when you start doing this kind of ad hoc data analysis with it. If we go back to that facet by animal group parent and look at just these horse incidents, this map right here suddenly starts telling us a story. It shows us that the most treacherous place to be a horse is in this area of London, and it looks like it's around some lakes. And there's a whole bunch of incidents where horses get stuck in the water, rescued from water or mud. What's this one? They get into waterways. There's a whole bunch of horse casualties here. I presume there's also quite a popular riding stable near there as well.
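By the way, you don't need the plugins for any of that cleanup; sqlite-utils can do it from the command line too. A rough sketch, assuming the column headers from the LFB CSV (the exact names, with their spaces and £ signs, may differ from what I've typed here):

```
# Null out the literal 'NULL' strings in the cost column
sqlite-utils query incidents.db \
  "update incidents set [IncidentNotionalCost(£)] = null
   where [IncidentNotionalCost(£)] = 'NULL'"

# Enable SQLite full-text search on the columns we want to search
sqlite-utils enable-fts incidents.db incidents AnimalGroupParent FinalDescription
```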
A lot of the original ideas for Datasette came from the work I was doing at the Guardian newspaper in London. At the Guardian, we decided that we wanted to start publishing the data that we were collecting behind some of our news stories. Newspapers have data journalists. Data journalists go out there and fetch numbers and facts and figures about the world, and they use those to help reporters tell stories. What would happen if you published the data behind the stories as well? So this was back in 2009, 2010, and the way we ended up doing this, after discussing different options, was we started a blog. We started the Guardian Data Blog, where the idea was to publish the numbers behind the different stories that were going out in the newspaper. And the way we published that data was using Google Sheets. It was freely available, it was nice and easy to get started with, and we didn't have to think about servers and hosting and so forth. So we would put up stories and then we'd publish these fabulously detailed spreadsheets that helped support the news and helped explain the work we were doing in the stories that we were telling.

But I always felt frustrated by this. I felt like there should be a better way to publish this kind of data, something more open source, more flexible, that lets people do more interesting things with that data. And that was the idea that drove the creation of Datasette three years ago. I was looking at this new trend of serverless hosting providers, things like Vercel and Google Cloud Run and AWS Lambda functions. And all of these have the restriction that you're not allowed to have a database, because a database is a much more expensive thing to run. You have to have backups, you've got to handle reads and writes, all of those kinds of things. And I realized that in my case, for newspaper data, none of it requires any writes at all. It's read-only blobs of data. So the trick is that you can put them in a SQLite database and then deploy that database file as an asset, as part of the application that you're sticking up online, which means that you can use incredibly cheap hosting providers for that.

So let's jump back to the Fire Brigade demo, and I'm going to demonstrate that by publishing this data to the internet. I'm going to start by installing a plugin called datasette-publish-vercel. There we go. What this plugin lets me do is type datasette publish vercel, Vercel being one of these serverless hosting providers, and give it a file name, incidents.db. And it wants me to give it a project name as well. I'm going to say, just checking my notes for what I was going to call this... let's do lfb-animal-rescues. And what this is going to do is take that database file that I created earlier, bundle it up with the Datasette web application, and push that combination up to Vercel to be deployed. They actually use AWS Lambda functions under the hood, so this whole thing is going to be wrapped up and turned into a Lambda function, and it's going to stick it up online and assign a URL to it that I can use to refer to that project. And here we go. It gives me a link, which is the sort of progress report on what it's doing. It's been running for four seconds and it's already installing runtimes and installing various dependencies and so forth. Normally this takes about 30 seconds.
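While that churns away, the whole publishing step was just these two commands (project name as above; the URL it assigns will be based on that name):

```
# Install the publishing plugin, then push the database plus Datasette up to Vercel
datasette install datasette-publish-vercel
datasette publish vercel incidents.db --project lfb-animal-rescues
```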
At which point, I will have an online version of the thing that I've just been showing you. That's uploading, deploying build outputs. And here we go. So this is that exact demo that I just showed you, but it's running on the internet now. If we go and look at the incidents page, you'll see that the map is missing. I can run it again and tell it that I want to install that map plugin as well, and that will churn away and deploy a new version of this with the plugin installed. And actually, if you're in the... Hang on. If you're in the Google doc, I'll drop a link to this in as well if you want to try it out. Where's that Google doc gone? Here we go. There we go. So that's a link to the demo that I just deployed.

So this is pretty exciting, because what we've got now is a workflow that lets me take some random data off the internet, analyze it, make some minor tweaks to it, fix it up, do a little bit of data cleaning, and then publish it online so that other people can interact with it. And oh, here we go, this is the new version that I just published, which has the map.

But crucially, now that it's online, it's got a JSON API. So if you want to build your own application against this data, you can do that right now. The JSON API is available, it's ready for you to start using. And if we go back to that ditch example (not Dutch, ditch), the API is a search API as well. So we now have a JSON API that will let you enter search terms in the URL bar, get that data back and present it in different ways. And in fact, if you click view and edit SQL here, it will show you the SQL query that Datasette used to return the results of that search. And this right here is actually editable as well. So I can say, you know, just give me the incident number, the calendar year, the pump count and the final description, but drop out all of these other columns. So I've now run a custom SQL query. And I can even get that out as CSV. So this is now a CSV file returning the results of a SQL query that is embedded in that URL up at the top of the page.

And this is kind of fascinating, right? This is an API for returning JSON or CSV by executing SQL queries against, in this case, this arbitrary set of data. This also seems like it should be a big problem. SQL injection is one of the most damaging of security vulnerabilities, and I've built something where SQL injection is a documented feature of the entire package. I've got a couple of things in place to account for that. Firstly, all of these queries run against a read-only database. You could try updating or inserting or deleting data, but it won't let you. And secondly, I put a one-second time limit on these queries. So if you run a query and it takes longer than a second, you'll get back an error message instead. Which, combined with the fact that these are running on serverless hosting providers that can restart something if it breaks, means that this is a very robust way of serving up these arbitrary SQL queries.
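To give a flavour of that API: every table, search and custom SQL query in the published instance has a JSON and a CSV representation you can hit directly. These URLs are illustrative only; the real hostname depends on what Vercel assigned to the project, and the column names are the ones from the LFB CSV:

```
# Full-text search results as JSON
curl 'https://lfb-animal-rescues.vercel.app/incidents/incidents.json?_search=ditch'

# An arbitrary (read-only, one-second-limited) SQL query, returned as CSV
curl 'https://lfb-animal-rescues.vercel.app/incidents.csv?sql=select+AnimalGroupParent,+count(*)+as+n+from+incidents+group+by+1+order+by+n+desc'
```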
So that's enough of the London Fire Brigade demo. I want to jump back and start talking a little bit more about Dogsheep, the idea that I started to demonstrate earlier by showing you my Swarm check-ins. And there's a story behind the name here. Dogsheep is, oh, I've lost my window. Ah, here we go.

So the inspiration for this project was an essay written by Stephen Wolfram back in February of 2019, entitled Seeking the Productive Life: Some Details of My Personal Infrastructure. And it's kind of incredible, right? This is an essay about his approach to productivity, and the principal feature here is the length of the scroll bar as you go down it. He mapped his heart rate to see if a walking desk was good for it, decided it wasn't good enough, and switched to going for strolls outside with his own custom laptop rig that he set up. He got people to scan every document he'd ever written since, I think, the age of nine, and ran OCR against those so he can search everything he's ever done. He's got maps of places he's gone. He's got a green screen room in his basement to help him give remote talks. It's a lot. There is an astounding amount of stuff in here.

And I was reading through this thinking, this is way too much, this is not something I want in my life... until I got to the bit where he started talking about archiving and search. He described something called a metasearcher, which is a search engine he built that searches across all of these different sources of data about him and all of these different archives he'd created. And that I really liked. I thought, you know, I'd love to have a single search that covers all of the different aspects of things that I've built and things that I've saved and so forth.

So I decided to build that. And because Stephen Wolfram had built this, and I was building something inspired by it but not really as good, I decided to call it Dogsheep: he's Wolfram, so my thing was going to be called Dogsheep. And the thing that clinched it for me was that he built a search engine called Wolfram Alpha, so I'd build a search engine called Dogsheep Beta. That joke stuck itself in my head and basically forced me into twelve months of on-and-off development work to make that vision a reality. This is very much pun-driven development.

So I will show you inside my personal Dogsheep. Dogsheep is a Datasette instance with data that I've pulled from all sorts of aspects of my life. My Swarm check-ins I showed you earlier, but a whole bunch of other data sources too. For example, I have all of my tweets, but I also have all of the tweets that I've favorited. So if you've ever wanted to search your favorited tweets, because I've got them in a database, I can do that. This is a search of tweets that I've favorited that mention CSV, which apparently is 110 tweets over the past decade. I can also search my followers. And the interesting thing about people who follow you is you can send them direct messages. So if I want to talk to an investigative journalist, and I haven't tried this search before, I can have a search and see if there's anyone who follows me who describes themselves as an investigative journalist who I could drop a direct message to.

My dog, Cleo, has a Twitter account, and she tweets about all sorts of things she does. Every time she goes to the vet, she tweets a selfie and she tweets how much she weighs. She says, I weigh 52.8 pounds. My vet assured me I'm not the only pet to gain weight during COVID.
But this is kind of fun, because I can then run a SQL query against that which uses a regular expression to pull out her weight and turns it into a table. So here's 49.7 pounds. And then I can use another Datasette plugin, this one's called datasette-vega, to chart Cleo's weight over time. So this is Cleo's weight over time, plotted from her self-reported vet selfies, which I think is even more useful than using this stuff to figure out what her favorite coffee shop is.

I have all of my GitHub activity. I built a set of tools called github-to-sqlite. All of my tools are called something-to-sqlite. So github-to-sqlite lets me fetch repositories I've starred, commits to my repositories, issues, organizations I'm a member of. And that, it turns out, is a crazy useful thing to have. Here's a couple of examples. This is a plot of the number of commits I've done per day since I started tracking. Is this per day? Yeah, this is commits per day. So I get a chart over time of what I've been up to. One of the things this can do is pull in repositories that are dependent on my repositories. And so now I can run a SQL query that shows me recent new GitHub repos that are using software that I've written. I haven't actually looked at this in a while; there's a whole bunch of interesting stuff in here that I might want to go and take a look at.

I use the Apple Watch. A fascinating thing about the Apple Watch is that Apple have a really responsible approach to this stuff. This watch stores a ridiculous amount of data about me, and it keeps it on my watch and on my phone. It doesn't upload it to a creepy cloud server of any sort. And I found out last year there's actually an export option in the Apple Health app. You can go into the Apple Health app, and I think I've got a screenshot of this, and literally click Export Health Data, and I get a 145 megabyte zip file of all of my health data that I can then AirDrop over to my laptop. So I wrote a tool called healthkit-to-sqlite that takes that zip file and turns it into database tables. And the amount of stuff in here is absolutely incredible. I've got my body fat percentage over time, the distance swimming I've done, my headphone audio exposure showed up recently, the number of mindful sessions I've done (I am not a very mindful person), sleep analysis, all kinds of crazy stuff.

What's really fun is that any time you track a workout on the Apple Watch, you say, I'm going for a run or I'm going for a walk, it tracks your latitude and longitude every few seconds for the duration of that workout. So I ran the San Francisco half marathon a few years ago, and this is my exact route along that half marathon. And when I realized this was going on, I started deliberately starting a walking workout any time I was anywhere interesting at all, on a hike or going for a walk around town. Over the past few years, this has resulted in 2.2 million latitude/longitude points that I've collected through my watch, and they're available just to me. So this is a map of all of the places where I've recorded one of these workout sessions. I have done very little with this data so far, but it's kind of incredible that that richness of data is literally just sitting in a little SQLite database on my wrist, because the Apple Watch uses SQLite, waiting for me to pull it out.
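All of these importers follow the same pattern: authenticate once, then run a command that writes tables into a SQLite file. A rough sketch of the GitHub and HealthKit ones, with the authentication steps and my actual database names omitted or simplified:

```
pip install github-to-sqlite healthkit-to-sqlite

# GitHub: repositories for a user, then the commit history of one repo
github-to-sqlite repos github.db simonw
github-to-sqlite commits github.db simonw/datasette

# HealthKit: export.zip is the file produced by the Apple Health "Export Health Data" button
healthkit-to-sqlite export.zip healthkit.db

# Browse everything together
datasette github.db healthkit.db
```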
A couple more demos and then I'll move to questions. I did 23andMe. It turns out you can export your genome as a CSV file, so I did. And that means I can run SQL queries that tell me what color my eyes are based on my genome. So this query here, which took an entire weekend to figure out, tells me that my eyes are blue 99% of the time. And I tried this against my wife's genome and her eyes are brown, so it works for a sample size of two at least.

And then the really fun one: I use the Apple Photos app. It turns out Apple Photos uses a SQLite database under the hood. Once you know this, you can suck your metadata out of that database and build your own interface onto it. So here are 46,000 photographs I've taken, in Apple Photos. I can see the recent photographs, but I can also run SQL queries against them. A particularly cool thing about this is that it turns out Apple run machine learning models on your phone and on your laptop to figure out what's in your pictures, and that data is there in SQLite as well. So this photograph was labeled bird, pelican, blue sky, animal. It knows what a pelican is, and I can run this SQL query here to show me photographs that I have taken of pelicans, which are utterly delightful. They also have these fascinating scoring metrics that they calculate, things like overall aesthetic score and harmonious color score. So this right here is the most aesthetically pleasing photograph I've taken of a pelican, according to the Apple Photos machine learning algorithms, which is amazing and super fun to dig around in.

I mentioned earlier that the inspiration for all of this was Wolfram Alpha and wanting to build Dogsheep Beta, and that's something I finally got working a few months ago. So this right here is a unified search engine across all of this different stuff. It's got my git commits and my tweets and my photographs and things I've quoted on my blog and stuff I've posted on Hacker News, all of this stuff. And as an example, I can search for llama portland and see a photograph from the last time I went to a csv,conf, where I got to hang out with a lovely llama in Portland.

Again, I'm really just beginning to scratch the surface of what you can do with this. There are so many other sources of data that you can pull into it. But the lesson I've learned, the most important thing I've got out of this, is that if you can get stuff into SQLite, you can do basically anything with it. The problem with the Facebook export, the Twitter export and so on is that they're all zip files of XML or whatever. But if you do the work to convert those, suddenly you can start combining them and visualizing and viewing them in all kinds of really exciting ways.

I've got a couple of extra demos which I'm going to leave in the document, because I want to leave time for questions. But yeah, my next steps for this project are, firstly, I want to get a 1.0 release out, where 1.0 means stability for plugin authors. At the moment the APIs are pretty stable, but I don't guarantee that I'm not going to break your plugin with the next release of the software. Once I hit 1.0, the idea is that plugins and custom templates and so forth will all be guaranteed to keep working up until Datasette 2.0, which hopefully I'll never get as far as releasing. And then for Dogsheep, I really want to make this stuff easier for people to run. A lot of people have asked me about running a hosted version of this, but that makes me incredibly uncomfortable, because a lot of this data is extremely private, personal data.
And I would much rather encourage people to run this stuff on machines they control, as opposed to trusting some cloud service with it. I actually just got a Raspberry Pi a few weeks ago, and I'm hoping I can figure out a good recipe for running this stuff on a Raspberry Pi, so you can literally have a $30 computer that sits on your network and does all of this for you.

So, yeah, if this has piqued your interest, datasette.io is the website which ties all of this stuff together. You can also book office hours conversations with me. I think I've got a link here to sign up for office hours. If you want a one-on-one conversation about these projects, I'm always keen to talk to people who are either using this stuff or considering using it. And with that, I'm going to jump over to the Q&A document, which is here. Here we go.

Okay, so, fantastic. How many llamas were rescued in London is now one of my burning questions. I didn't actually check if there were any llamas in there. Let's have a quick look. Where did I put that thing? Hold on. Here we go. LFB animal rescues. I'm just going to search llama and see what comes up. I will be disappointed if there were no llamas. Well, actually, it's good for the llamas. Unfortunately, well, fortunately, no llamas have had to be rescued by the London Fire Brigade.

Any chance there's an R-friendly version in the works? So, my feeling on interacting with other language ecosystems is that it's all about the APIs. Datasette can produce JSON and it can produce CSV, which means that if you're using R, provided you can handle both formats, you can load that stuff into your environment. Likewise, I'm really keen on Observable notebooks. This is the JavaScript equivalent of Jupyter notebooks. And actually, just this morning, Mike Bostock, who created D3 and Observable, released a new charting library called Observable Plot. So I knocked out this quick demo, which pulls in a CSV file from a Datasette instance and then uses Observable Plot to render it. So I think there's an enormous amount of value in admitting that Python is not the language to use for everything; if you've got these clean API boundaries between Datasette and other languages, then anything that R is capable of, anything that JavaScript is capable of, can pull data from a Datasette instance underneath it.

How big can a SQLite file be before you run into performance issues? That's a great question. My rule of thumb is: up to a gigabyte, it just doesn't matter, everything's going to be great. Up to 10 gigabytes, you're probably going to want to make sure you've got indexes on the things that you're querying against. Above 10 gigabytes, it can get a bit trickier. Like I said, the theoretical maximum is 280 terabytes, but I find that once I've gone above about 10 gigabytes, things have felt a little bit creaky. I have a hunch that there are optimizations I could do; there are SQLite cache settings and things that would make that work better, but I've not done that piece of work yet. But yeah, if you're below a gigabyte, SQLite will just be absolutely flawless.
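For that middle range between one and ten gigabytes, adding indexes is mostly a one-liner; a sketch with sqlite-utils, using made-up database, table and column names:

```
# Index the column(s) you filter or sort on most often
sqlite-utils create-index mydata.db mytable my_column

# Check what schema and indexes you ended up with
sqlite-utils schema mydata.db
```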
Oh, and this amazing link. If people haven't seen this, you should absolutely check it out; it's an incredible piece of work. This chap figured out how to have JavaScript run a version of SQLite compiled to WebAssembly, which has been done before, so you can have SQLite in your browser. But he also figured out how to have it use HTTP range queries to fetch just the little portions of the overall database file that it needs. So he's got a 670 megabyte statically hosted SQLite database, and I can run queries against it. I can do order by rowid desc limit 100, and when I run this, it fetches... Here we go, it just ran 49 HTTP requests and fetched 54 kilobytes of data in little chunks from this giant database, and that was enough to answer the query. I was blown away by this. I assumed that SQLite in the browser would always be limited by the fact that you had to fit the database file in the browser's memory. Turns out you don't. This isn't great for everything: if you do a full table scan, it's going to have to pull down the entire database, so it's only good for carefully designed queries. But it's a very exciting new development in this world of SQLite as a sort of unified format for publishing data.

Do I have examples of people using Datasette for real-world applications? So my target audience for Datasette initially is data journalists. I love the idea of supporting journalists who are trying to, firstly, find stories in the data, and secondly, publish that data online. I know a bunch of publications are using Datasette internally; ProPublica, for example, have used Datasette to report quite a few impactful stories. In terms of publishing, the best example I have is the Baltimore Sun, who used Datasette to publish salary records of public employees in the state of Maryland. What's interesting here is that this right here is just Datasette, but they're using Datasette's custom templates and theming to add their own color scheme and put a custom interface on top of it. And that's an important part of Datasette: custom templates and custom CSS mean that you can use it to build entire websites. In fact, datasette.io, the official Datasette website, is actually just Datasette. If you go to datasette.io/content, you can see the databases that it's running off. There's a news database, which is the news that shows up on the homepage, all of these different bits and pieces. And I'm increasingly exploring this theme of using Datasette as a way to build sites. Another website I built is called niche-museums.com, which lets you dial in your location and shows you niche museums near you. Unfortunately, the Burlingame Museum of Pez Memorabilia closed last year. But again, this is just Datasette. If you add /browse to the URL in this case, you can see that the underlying data is a bunch of rows in a museums table, and then the custom templates on the homepage and so on load that data in and display it in different ways.

So, does anyone know what time the session ends? If you could add that to the document at the bottom, that would be useful, because I forgot to check before I started. You've got 15 more minutes, Simon. I've got 15 minutes. Brilliant. In that case, I've got some really exciting additional demos I can throw in.

Do you have any tips for getting data off your Mac? So most of the Dogsheep demo that I showed you earlier is data that comes from web APIs: the Swarm API and the Twitter API and the GitHub API. And so I'm actually running a $10 a month DigitalOcean server with a bunch of cron jobs.
Every 10 minutes I have a cron script that runs and fetches my latest tweets and fetches things I've starred on GitHub and all of that sort of stuff. And actually, I think I've got a copy of the crontab that I will drop into that document. The harder part is the stuff on my Mac. I showed you that the HealthKit data goes via my Mac, and so do the photos. Right now, I've not automated those. Every now and then, I remember to manually export my HealthKit data, run a script on my Mac and then scp the healthkit.db file back up to my server. And I've got a script that does the same thing with my photos. I'd love to do better at that. Honestly, the biggest problem is that I use a laptop, so I can't have a cron job that runs at three o'clock in the morning, because the lid might be closed.

But yeah, there's also a really fun trick you can do where you can scan your Mac. Let's see if this works. Find... There is a way to... Now, I can't find it. There we go. There's a command you can run on your Mac that will find the largest SQLite database files, and it turns out you're going to have a lot. So many Mac and iOS applications use SQLite as their storage format. And this is really fun, because you can start digging in and saying, okay, what are these apps storing about me? What information do they have? How can I process it?

Is there a Dogsheep tool available to unravel a Git history with dates? So I have this technique I've been developing called Git scraping, and I've been promoting it quite a bit over the last few years. The idea with Git scraping is that you use GitHub Actions or CircleCI or some kind of scheduled cron to run a scrape against something, and then commit the results of that scrape to a GitHub repository. The reason you do this is that it gives you a free commit history of changes. So CAL FIRE, the California fire department, run this website with currently ongoing fires. I've got a scraper which runs, where are we, it runs three times an hour, and it grabs the latest copy of their data and commits it. So this right here is a copy of the most recent version of that page, showing me what fires are live right now. But if I go to the history of this file, I've got, how many commits have I got, 445 snapshots of the different incidents. So I've got a unique set of data here, because they don't publish that history, but I've captured it, and theoretically I can use it to analyze fires over time. What I'm actually doing is just leaving it sat there in a Git repository.

So the question was: do I have any Dogsheep tools for unraveling that Git history with dates? I have not built any reusable tools for that yet. What I have been doing is trying to collect patterns for handling it.
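One pattern that works is to replay the Git history with a shell loop and pipe each old version of the file into sqlite-utils, tagging every row with its commit date. This is only a sketch, with made-up file and database names, and it assumes the scraped file is a JSON array; my actual scripts are a bit more involved:

```
# For each commit that touched the scraped file, oldest first...
for commit in $(git rev-list --reverse HEAD -- incidents.json); do
  when=$(git show -s --format=%cI "$commit")          # commit date, ISO 8601
  git show "$commit:incidents.json" \
    | jq -c --arg when "$when" '.[] | . + {snapshot_at: $when}' \
    | sqlite-utils insert history.db snapshots - --nl  # newline-delimited JSON from stdin
done
```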
Oh, here's the most recent one. The CDC have a website that tells you how different counties are doing at getting vaccinated; it's their COVID data tracker. They have an API for it, an undocumented API that powers the pages on their website. What I've been doing is grabbing a copy of that API, I think once a day, and I've got 111 snapshots of it. And then I wrote a script called build_database.py which loops through that Git history using a Python Git module and turns it into a SQLite database. So I'm experimenting with other patterns for doing this, but it's a good example of how you can take one of these Git histories and turn it into a database file. And then once I've done that, I publish it online. So this, published by the GitHub Actions workflow, is a Datasette instance showing me the daily reports of the percentage of the population who've been vaccinated in different counties around the States. There are 125,000 rows in there. It's not a small database. Well, it's small, but it's not tiny. And then what I did with that is I built an Observable notebook which takes that data and uses it to plot visualizations. I think it's this one. Here we go. So this right here is a choropleth, like a heat map, of all of the counties in the US, which can animate over time to show you how vaccination is going in those different counties. What this is actually doing is hitting that Datasette JSON export for a custom SQL query, pulling that data back into Observable and then looping through it. But it's a good example, I think, of how you can use Git scraping to get the data, SQLite to store it, Datasette to put it online, and then an Observable notebook to turn it into, in this case, an animation of the percentage of, I think it's over-65s, who've been vaccinated in those US counties. So I will drop that into the document too.

I have not yet engaged with the Open Humans community directly. There are definitely a lot of projects in this space, the space of helping people process their personal data. My philosophy on this is: I don't care where the data comes from, I don't care what shape it is, all I care about is that I can get it into that SQLite database file. And one of the beautiful things about having it in SQL is that even if you import data in different shapes from different sources, you can reshape it with a SQL query. You can say: select these columns from this table, renamed to this, union select other columns from another table, and use that to knock different formats of data into shape. So I've been skipping out on the "let's figure out a standard format for this stuff" conversation, because as long as I've got it in SQLite, the standard format can come later.

But I'm very keen on... I mean, all of this stuff is open source. I'm super interested in working with other people on taking these projects forward, especially with regards to making this stuff more accessible to people. Right now, if you want to run Datasette and Dogsheep, you've got a lot of work on your hands. Firstly, you have to know how to install Python and how to install tools written in Python. You're probably going to want a server somewhere, so you have to know how to secure that. You're going to want to run cron jobs and deal with APIs. It's a lot. And while I want to put together better documentation and tutorials, I think if we could take the ability to pull in your personal data from different sources and then analyze it yourself, and turn that into an installable application that you can install on your phone or on your computer, that would open up the floodgates and make this available to so many more people. I don't want to do all of that work myself, because that's a year or so of work just on building out those user-facing tools. But I'm super keen on working with anyone who does want to put in that effort to help democratize these things.

Is there a plugin for modeling tools? I'm not sure what you mean by modeling tools, so probably not.
One thing I can show you, though, is an Observable notebook which draws entity relationship diagrams. Maybe if I search entity... I'm going to go straight to Observable for this one. So, Datasette exposes metadata about your databases as well, which means, here we go, I can give this notebook the URL to a Datasette instance, and it pulls in the foreign key relationships between the tables and draws you a diagram of how those tables fit together. The GitHub one is particularly fun for this, because it has a lot of relationships going on. And the joy of Observable notebooks is, once again, that anyone can go and see this, view the source, fork it and do new things with it. So it's another example of why giving something a JSON API and putting it online opens up so many opportunities for visualizations and collaborations and things like that.

Is there a list of services you can recommend for those of us who want to deploy small datasets without figuring out the options from scratch? I think Vercel is the best one of these at the moment, because it's got the best developer experience. You install the software, you run vercel login to create an account with an email address, then datasette publish vercel, and off you go. At the same time, I'm always interested in new options for hosting these things. I've got a section in the documentation about publishing data which talks about Google Cloud Run, Heroku, Vercel and Fly; those are the four that I have dedicated support for. Google Cloud Run is my favorite option for larger datasets. Vercel, I think, has a 50 megabyte cap on how much data you can put up there. Google Cloud Run is effectively unlimited, although I find that once you get above about one and a half gigabytes of data, it starts to time out on you and you get various errors. But for data up to a gigabyte, Google Cloud Run works amazingly well, and because all of these things scale to zero, they only charge you for when your application is being used. So most of my Cloud Run deployments cost me 50 cents a month. The biggest one, I think, costs $15 a month because it was being hammered by search engine crawlers. But for the most part, this is a really inexpensive way of publishing data. Vercel is effectively free for these small datasets, so it's definitely worth taking a look at Vercel for this. There we go.

Do I need help with documentation? I need help with so many things. As you can see, there's a lot of this: there are almost 100 different projects under the Datasette umbrella now. And that's fine, I'm enjoying it. One of the reasons I'm so engaged with this project is that it has so many different areas I can explore. But I think when it comes to contributing to open source, there are a whole bunch of ways you can contribute before you even get to the point of sending pull requests with code. The things I'm particularly interested in people doing: firstly, using Datasette and showing people what they've built. Writing things up and saying, hey, I used this tool to solve this problem. Especially if you write up a few notes on what worked and what didn't, give me a little bit of feedback on how the tooling works, that's super useful. The Datasette ecosystem is missing tutorials right now. I've written 150 pages of documentation, and it's got great reference documentation, I think, but I don't have those
"here's how to get started with Datasette" tutorials. So tutorials are a big gap. Video tutorials are super interesting as well. Writing tutorials would be fantastic. Then, a step up from that: absolutely, if you look through the Datasette documentation and find there are things that aren't clear, or things that are missing, or things that could be worded better, I'm really keen on feedback and pull requests and contributions around that.

And then the step above that: the other thing that's really fun about Datasette is that you can write plugins for it. The thing that I love about plugins is that plugins are features my software gets overnight without me even having to review a pull request. I can literally wake up in the morning, and if somebody has released a new plugin, the software has new features. The grand vision I have for Datasette is inspired by WordPress, where WordPress is a perfectly decent content management system with 7,000-odd plugins that mean it can solve any publishing problem you can think of. So my grand dream for Datasette is to have Datasette as this core utility for publishing data, making it available as APIs and giving you basic exploration, and then plugins that add every kind of visualization you might want, every kind of data modification routine, data cleanup, all sorts of things. If plugins can pick up on those, I think that would be an amazingly exciting way for Datasette to grow, and it could solve so many more problems with that thriving plugin ecosystem. And then beyond that, obviously, if people want to work on Datasette and the Datasette projects themselves, it's all open source. I'm very eager to have conversations in issues, accept pull requests, review code. I think it's had about 30 contributors so far, but I'm super interested in growing the field of people that are actively contributing to the software.

Have you thought about running this in the browser? For example, using Pyodide. I've very briefly thought about this, but I've not done any deeper investigation into it. So Pyodide, if you haven't seen it, is absolutely phenomenal. It was, I think, a research team at Mozilla: they got Python working in WebAssembly, and then they got the whole Jupyter notebook stack working with that as well. So you can run the full Jupyter notebook stack entirely in your browser via WebAssembly, which is just an astonishing thing to do. I'm happy enough running servers. One of the unique selling points of Datasette, I thought, was that you can have a gigabyte database and only return the bits that people need, and then of course yesterday this chap demonstrated that's not entirely true, because you can pull ranges of data out of a static SQLite file over HTTP. But generally, I really like the idea of publishing data along with APIs, such that you're not limited by what fits in the browser, you've got a server helping you out, and then the browser itself can do the visualizations and all of that other stuff. So if somebody gets Datasette working on Pyodide I'd be fascinated to see it, but I don't think that's something I'm going to work on myself.

I'm going to quickly show one more demo. I wanted to get to this earlier, but I thought I should cut things off. So SQLite has an extension called SpatiaLite, which is kind of like PostGIS for Postgres: it adds geometric functions and so on to SQLite. And it works with Datasette. You can publish datasets with the SpatiaLite extension available. So I was playing around with this.
This is a plugin I wrote called datasette-leaflet-freedraw, which lets you draw shapes on a map and feed those shapes into a SQL query. So what I've got here is a SQL query I wrote which looks for mini parks in California, using this amazing dataset from the GreenInfo Network. What this lets me do is draw a circle around an area of the map, I'll draw a circle like that, run the query, and it will show me all of the parks within that shape that I drew that have the word mini in their name. So this is a result set of all of the mini parks in San Francisco. Or I can draw a shape around Oakland and run it again, and now I'm getting back mini parks in Oakland.

Really, I built this as a demonstration that you can have plugins which provide user input. In this case, drawing a circle inputs a giant blob of GeoJSON, which gets fed into a SQL query, which is used to return those results. But I feel like this model of plugin, this idea of building these richer interface elements, has a lot of potential. It magically adds this interface if you have a parameter to your SQL called freedraw. So if I change this to freedraw1, or if I change it to, I don't know, something else, that map widget goes away. Oh, I think I've got freedraw in there twice, that's why. But yeah, without :freedraw as a parameter you don't get the widget; with it, you do. And so there's a whole area of Datasette around these spatial queries and building up GIS tooling as well, which again I've only really just started investigating, but it feels like there's a lot of potential there too.

Popping back to the questions, I think we are just about out of... I think, yes. Simon, we could just keep going forever, because I'm having a great time and I think everyone else is too, but I think this might be a good place to stop. Thank you for a very exciting keynote. For those of you still adding questions to the doc, please do so; Simon will look at it and add more answers even after the session is done.