Thanks for the opportunity to come and talk today about something I've been really excited about for the last six months or so — it's been a great chance to play around with some stuff. My slides are online, and there's a link there. If you follow that link, you'll find that you're actually loading up a live computing environment, so if you want to play along, you can actually do everything as we go today. So let's get stuck into it. Oops — it's on the screen, okay.

So this is a live computing environment, and we are actually going to run some live code during this presentation. If you want to play along, it's pretty simple: when you get to one of these cells, just click on the little play button so that it runs the code.

[Audience: Did you send them a link to the location?] No, not yet — I was only on the first slide.

What I'm using is Jupyter notebooks, and you might all be familiar with them — you've probably seen them before, so this might be a bit old hat to you. In fact, I won't talk about what a Jupyter notebook is; I'm just going to show you what a Jupyter notebook can do, in the context of cultural heritage data.

So let's start by getting some data from Trove — and I'm assuming you all know what Trove is. I just click on that little play button; the cell sets some parameters, goes off, makes a request to the Trove API, and brings us back some data. In this case, we're getting facets showing us the total number of newspaper articles published in each state — out of Trove's roughly 200 million digitised newspaper articles.

Okay, so we've got some data. What can we do with it? Well, let's make a map. Again, I just run the cell, and what we get — let's go to the next slide so we can actually see it — is a choropleth map. There it is. It shows us the number of newspaper results per state. This is a really simple example of the power of Jupyter notebooks for this sort of exploration of cultural heritage sources. I can make those requests, get the data back, do something with it, and we can play with it and start to analyse it, all live within the comfort of your own browser. And it's able to do that because we're actually sitting on top of a live computing environment.

One of the nice things about this is that we're not just limited to what I put into this notebook. If I go backwards — I'm hitting Shift-Space to step back through the cells — you'll see these are all editable. So instead of the blank query, which gave us everything, I can type in a search term, run it again, make the map again, and get a different map. So not only are these live computing environments which interact with real data sources — you can edit them and change them, and use them as a tool for your own exploration. That's really the theme of what I'm talking about today.

Okay, so what's Jupyter? Well, Jupyter is this presentation — this presentation is a Jupyter notebook. It's using a particular plugin which enables it to be presented as a series of slides, but underneath it's just a Jupyter notebook. As you've seen, it's editable.
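For reference, the request that first cell makes looks something like this — a minimal sketch using version 2 of the Trove API, with 'YOUR_API_KEY' as a placeholder and the JSON paths assumed from a single-facet response:

    # Ask the Trove API for state facets across all digitised newspaper articles
    import requests

    params = {
        "q": "",              # a blank query matches everything
        "zone": "newspaper",
        "facet": "state",
        "n": 0,               # we only want the facets, not the articles themselves
        "encoding": "json",
        "key": "YOUR_API_KEY",
    }
    response = requests.get("https://api.trove.nla.gov.au/v2/result", params=params)
    data = response.json()

    # With a single facet requested, the terms sit at this path
    terms = data["response"]["zone"][0]["facets"]["facet"]["term"]
    for term in terms:
        print(term["display"], term["count"])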
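And the mapping step might look something like this, continuing from the sketch above. It's a sketch only: it assumes you have a GeoJSON file of Australian state boundaries ('aus_states.geojson' is hypothetical) whose feature names match Trove's facet labels:

    # Turn the facet terms into a table and draw a choropleth with Plotly
    import json
    import pandas as pd
    import plotly.express as px

    df = pd.DataFrame(
        [(t["display"], int(t["count"])) for t in terms],
        columns=["state", "articles"],
    )
    with open("aus_states.geojson") as f:    # hypothetical boundaries file
        geojson = json.load(f)

    fig = px.choropleth(
        df,
        geojson=geojson,
        locations="state",
        featureidkey="properties.name",      # depends on your GeoJSON's schema
        color="articles",
    )
    fig.update_geos(fitbounds="locations", visible=False)
    fig.show()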
You can click on any of these cells and change any of the values. It's shareable, obviously. And indeed, this live version is running on a service called MyBinder: you send it a link to your notebook on GitHub, it spins up the computing environment you need to run that notebook in a Docker instance, and provides that back to you. It's really handy for things like workshops — you create a notebook, people just log on, and they can start playing around with it in that live, customised computing environment.

Why Jupyter? Why am I interested in Jupyter in the context of cultural heritage collections? I've been playing around with various ways of working with cultural heritage data for a long time — really seriously hacking collections for the past ten years, sharing a lot of code and examples, creating various tools and applications, presenting lots of workshops and so on. But I've always been a bit frustrated by our ability to really encourage and support people's own exploration. You can present tools, and they take people a certain amount of the way, but providing an environment where they're encouraged to actually start poking around inside the code, go a bit further, and see where it takes them has been a lot more difficult. And that's what really interests me about Jupyter.

In the sort of examples I create, I've been focused on two particular issues in relation to GLAM collections — that's GLAM: galleries, libraries, archives and museums. They are the challenge of abundance and the illusion of completeness. What I'm going to do today is explore those two issues through Jupyter, try a variety of experiments, and see where they go.

So, first of all, the challenge of abundance. We can say to Trove: tell me how many newspaper articles you have about influenza. There's a little bit of code which does just that, and if we run it, it tells me there are 1,614,300 digitised newspaper articles about influenza. And that's pretty common when you type something into those digitised newspapers. It is, of course, a huge, incredibly rich collection for many types of research, but it's also a bit overwhelming. How do we make sense of that volume of material? What does it mean that there are 1.6 million results?

So let's start thinking about how we can drill down through that result set. We could, for example, run a quick cell that breaks the results down by category. Mostly advertising, actually — so presumably remedies relating to influenza. Let's keep playing with the possibilities. How could we look at this as change over time — the number of articles per year? Actually, I've got a tool specifically for that: a thing called QueryPic, which has been around for a long time now. I created the first version back in 2011 — in fact, the first version predated the Trove API and just screen-scraped data out of Trove. What it does is simply show you the number of results matching your query for each year. Instead of seeing a list of the first 20 results, you see the whole of your result set displayed over time.
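The 'how many articles about influenza?' query from earlier boils down to something like this — a sketch that asks for zero records and just reads the total:

    # Count matching articles without retrieving any of them
    import requests

    params = {
        "q": "influenza",
        "zone": "newspaper",
        "n": 0,               # no records, just the summary
        "encoding": "json",
        "key": "YOUR_API_KEY",
    }
    data = requests.get("https://api.trove.nla.gov.au/v2/result", params=params).json()
    total = int(data["response"]["zone"][0]["records"]["total"])
    print(f"{total:,} digitised newspaper articles match 'influenza'")

Adding "facet": "category" to the same request gives you the breakdown by article category mentioned above.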
With QueryPic you can start to explore that display, drilling down by clicking on a point. And QueryPic is quite well used — it's been cited in a number of published articles and books where people have actually used it in their research. Again, though, the frustration is that it takes you to a certain point and then it's hard to know where to go. How do you follow that through? How do you continue your explorations?

What we can do, again using this notebook, is start to build our own version of QueryPic — a transparent version, which actually exposes its workings. This is basically the same sort of code that sits behind QueryPic. In this case, we're going to look at influenza from 1880 through to 1940. Again, I just run this code, and it makes a request for each decade in that period — six API requests — bringing back the facets by year. We run the next cell to make our chart, and then display the full chart here: the number of articles over time. So it's just like QueryPic, but in a notebook form that we can play around with and edit.

Now, okay, that's interesting. But looking at this sort of chart, we might say: how do we know there just weren't more newspapers published in 1919? How do we interpret that peak? Well, one thing we could do is try dividing the number of results by the total number of articles published each year. That's just a matter of making another API request: we get the total number of results for each year, and divide the number of results for our influenza query by that total. Okay, we've got another chart, and it's slightly different: the peak in the 1890s is more significant than it appeared in the first chart. And, as we all know, the influenza pandemic of 1918–1919 was a significant event, and it remains a real feature of this chart. So it's not just a matter of there being a few more articles — there is something real that we're looking at.

So let's focus in on that 1917 to 1919 period. In this case, what we're going to harvest — we'll make a number of requests, basically one per month, across 1918 and 1919, and this time we're asking for facets showing the titles of the newspapers that published these articles. Once we've got those titles, we'll match them up with another dataset which has geolocated them — it records the places where those newspapers were published — and put all our results on the map. And all the code is here; I'm not hiding anything. This is the code that's munging all that data together: it brings in the location data and uses pandas — a library heavily used for manipulating tabular data — to link the tables together. So I make the map, and finally I show the map: an animated heat map, which takes us through that period, 1918 to 1919.
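The notebook version of QueryPic reduces to a loop like this — a sketch, assuming the v2 API's convention that year facets for newspapers are returned one decade at a time:

    # Harvest results-per-year for a query, decade by decade
    import requests

    API_URL = "https://api.trove.nla.gov.au/v2/result"

    def year_counts(query, start_decade, end_decade, key="YOUR_API_KEY"):
        counts = {}
        for decade in range(start_decade, end_decade + 1):
            params = {
                "q": query,
                "zone": "newspaper",
                "facet": "year",
                "l-decade": decade,   # e.g. 188 means the 1880s
                "n": 0,
                "encoding": "json",
                "key": key,
            }
            data = requests.get(API_URL, params=params).json()
            for term in data["response"]["zone"][0]["facets"]["facet"]["term"]:
                counts[int(term["display"])] = int(term["count"])
        return counts

    influenza = year_counts("influenza", 188, 193)   # six requests, 1880s to 1930s
    everything = year_counts("", 188, 193)           # total articles per year

    # Normalise: matches as a proportion of everything published that year
    proportions = {y: influenza[y] / everything[y]
                   for y in influenza if y in everything}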
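And the mapping step is essentially a pandas merge — a sketch in which the locations file and its columns ('title_id', 'latitude', 'longitude') are assumptions standing in for the real geolocated-titles dataset:

    # Join facet counts per newspaper title to the places those titles were published
    import pandas as pd

    # title_counts: one row per newspaper title for a given month,
    # built from a title-facet request like the ones above
    title_counts = pd.DataFrame(
        [{"title_id": t["display"], "count": int(t["count"])} for t in facet_terms]
    )
    locations = pd.read_csv("newspaper_locations.csv")   # hypothetical file

    merged = pd.merge(title_counts, locations, on="title_id", how="left")
    points = merged.dropna(subset=["latitude", "longitude"])
    # 'points' can now feed an animated heatmap layer,
    # e.g. folium's HeatMapWithTime plugin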
So we started with a particular question around influenza. We saw the full-scale result — 1.6 million articles — and then started to drill down to see what we could find out about those results. And that's all just within this notebook, just those bits of code I've been showing. There's no magic behind the scenes here.

Okay, so that takes us to a certain point. We've been working with a small number of API queries, just getting faceted data out, which gives us summaries of the material. But at some point we're going to want to dig a bit deeper — to actually pull out the data relating to those newspaper articles and start to explore it in depth. And we can do that as well. This is a full notebook now, not in slideshow form. A number of years ago — again, back around 2011 — I created a tool to do just that: to harvest metadata from newspaper articles in Trove into big datasets, so that you could get 5,000, 10,000, 20,000, 50,000 newspaper articles which you could then analyse in the tool of your choice. It's been through a number of revisions over the years, and at the moment it's a Python command-line tool. It's pretty easy to use, but it still has barriers: you have to get a Python environment set up, you have to install the tool, and you have to use the command line, which can be a big barrier for people.

Once again, what's cool about notebooks is that I can run that command-line tool from within a Jupyter notebook — again, within the comfort of your browser. This is another live notebook, and I can show you the Trove Harvester sitting behind it — that's the command-line tool there. I've set it up with an API key and a basic query, which in this case is 'cyclone AND wragge', limited to the decade of the 1910s. All I'm going to do is run the harvester, and if the Trove API behaves, it will harvest about 300 newspaper articles. Obviously I've kept this to a fairly small sample, but using this notebook I have harvested tens of thousands of newspaper articles. Now, we're harvesting the metadata — the basic publication information for all those articles — but we're also harvesting the full text. So once the harvester's finished, we can start to do things that explore both the top-level metadata and dig down into the text files themselves. And that's just done, I think — you know a cell is done when that little asterisk turns into a number. There it goes.

Okay, so now, since we're running (as in this case) on MyBinder, which is a cloud-hosted service, you'll want to download the results. If you run this cell, it zips up all the results into a zip file and gives me a nice download link; I just click on that and it downloads the results. So I can use this page as a way of harvesting thousands of newspaper articles from Trove and downloading the results as a spreadsheet, along with all the little text files. And then we can open up another notebook which gives you some hints about how you might start to explore that data. In that notebook, we're going to quickly open the last harvest.
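For reference, running the command-line harvester from inside a notebook cell looks something like this. The '!' shells out from the notebook; the query URL matches the example above, but the exact arguments and flags are assumptions — check the harvester's own documentation:

    # Run the Trove Harvester CLI from a notebook cell
    query_url = "https://trove.nla.gov.au/newspaper/result?q=cyclone%20AND%20wragge&l-decade=191"
    api_key = "YOUR_API_KEY"

    !troveharvester start "$query_url" $api_key --text   # --text saves the OCR'd full text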
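The zip-and-download step only needs the standard library plus IPython — a sketch, assuming the harvester wrote its results into a 'data' directory:

    # Bundle the harvest into a zip file and show a clickable download link
    import shutil
    from IPython.display import FileLink, display

    shutil.make_archive("harvest", "zip", "data")   # 'data' dir name is an assumption
    display(FileLink("harvest.zip"))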
We can do things like show which newspapers are represented most often within that set — this is just working on the spreadsheet. We can look at the articles over time within the set, so we can drill down. We can make a simple word cloud — again, just using the titles of the articles. These are just hints for exploration. The idea of putting this notebook together is really to give people some idea of how they can start to ask questions about that data.

And you can actually go further. There's another notebook which enables you to work on the text of the individual files. You can go through all those little text files, which have the OCR content from the newspaper articles, and start to feed them through text analysis programs to look for patterns and frequencies. And there's a little section which enables you to do some TF-IDF analysis, to find the most significant words within each of those articles. So there are all sorts of ways you can start to explore them.
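That TF-IDF step might look something like this — a minimal sketch using scikit-learn, where the 'text' directory of OCR files is an assumption:

    # Find the most distinctive words in each harvested article with TF-IDF
    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer

    files = sorted(Path("text").glob("*.txt"))
    docs = [f.read_text(encoding="utf-8") for f in files]

    vectorizer = TfidfVectorizer(stop_words="english", max_features=10000)
    tfidf = vectorizer.fit_transform(docs)
    terms = vectorizer.get_feature_names_out()

    # The ten highest-scoring words in the first article
    row = tfidf[0].toarray().ravel()
    for i in row.argsort()[::-1][:10]:
        print(terms[i], round(row[i], 3))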
There are other, similar things I've been putting together. There's a harvester for RecordSearch, the National Archives of Australia's database, which pulls the metadata out for a whole series, and I've created some sample datasets using it — for example, a series in the National Archives relating to the White Australia Policy. You can load up a notebook to get a summary of each of these series and then just download the CSV file which has the data in it: you get a little summary, a little chart showing the date range, and a link to download the CSV file. So that's using a notebook as a form of delivering data.

Okay, just quickly: why do all this? Well, I think it's really cool that you've got everything you need in the browser, and that makes it great for workshops — anybody who's had to do this sort of thing in university computer labs knows how difficult that is. It enables you to ask questions of GLAM data and follow where they go — to start big and then zoom in. And of course, you can rinse and repeat: you can start with a notebook, go back, edit it, change it. People often ask me, how do I learn to code? And my most common answer is: you get somebody else's code, you copy and paste it, and you fiddle with it until it breaks. Then you try to figure out why it broke, you fix it, and you go through that process again. That's really facilitated by notebooks, because you can edit, change and retry without the worry of breaking anything seriously.

Okay, so the other issue I mentioned is the illusion of completeness. I'm going to run another quick Trove query here, which shows us all of the newspaper articles over time. If I could take answers, I'd ask you what that peak there in 1915 represents — and normally when I ask, people say: it's the war. But were there more newspaper articles published during the First World War? The answer is no. That peak represents funding. In the lead-up to the centenary of World War I, money was invested in the digitisation of World War I-era newspapers. So that peak doesn't actually represent anything about history — it's about Trove, about the policies behind the digitisation.

And that's really important for people to understand when they start working with these sorts of collections: they are constructed. Through accidents of selection, through the implementation of policy, through funding — there are all sorts of ways in which these collections get created. I think it's a basic thing that we should be subjecting these sorts of things — APIs and CSV files and collection data — to the same sort of critical analysis that we would apply to a collection of primary sources in print form. The thing is, it's harder to do. But again, Jupyter notebooks give us the opportunity to start playing around with these questions, in a form that we can easily share and that other people can learn from, so we can all start to understand what's going on behind the interface.

I won't go through it now, because I'm at the end of my time, but there's a notebook there from some time I spent last week looking at the new collection API from Te Papa. It's really great — they've got really rich data there — but there's also some unexpected stuff which you only find out about once you start digging through, making a few requests, and going down through the facets.

So, why Jupyter? It's not just about working with the content of the data itself; it's the ability to ask questions about the systems and the technologies and the policies that construct the things we're using. And the real fun part is the ability not just to tell, but to show — to give people the opportunity to learn from these sorts of examples and to do real work while they're learning. You can go back through this notebook, plug in your own research topics, and see where they take you.

Okay, so the work I'm doing is sitting in a repository on GitHub. I'm in the process of reorganising everything at the moment, so it's a bit of a mess, but feel free to jump in and try any of the notebooks. Most of them have links which enable you to open them up in Binder, so you can play around with them live, as with this one. And feel free to add any requests or suggestions in the issues on any of the repositories on GitHub. And thanks very much.