All right, so my name is Mike Trizna. I'm a data scientist with the Smithsonian, in a group called the Smithsonian OCIO Data Science Lab; the OCIO is our IT office. And I'm going to be talking today about the Smithsonian Open Access project that launched last year, about making the data that was released with it more accessible using Python and Dask, and then showing a few examples of that in practice.

Okay, so to start off: the Smithsonian. A lot of people don't realize that, yes, there are 19 museums that are part of the Smithsonian Institution, mostly in Washington, D.C. We actually have two in New York, so they're not just in Washington. We also have 21 libraries and archives, nine research centers, and a zoo. So there are lots of different units at the Smithsonian working on the craziest range of different topics.

Now, the Smithsonian's mission. I'm not sure if people know the origin story, if you will, of the Smithsonian, but it was founded in 1846 from the bequest of an Englishman named James Smithson. He had never visited the U.S., but he decided to leave his estate to the United States government, with the idea that the government would start an organization known as the Smithsonian Institution (so he thought very highly of himself, just grabbing that name). And I think this is so cool: the mission would be "the increase and diffusion of knowledge," which is such a bold statement.

This is a pretty cool comic that I came across online a few years ago; I printed it out and stuck it on my desk, back when I used to work at a desk in an office. It shows the progression from data, where you're just collecting raw data out in the field or something like that; it becomes information as you connect different pieces; that becomes knowledge, then insight, and finally wisdom. I've also seen this comic extended recently to show connections that aren't there, labeled "conspiracy theories," but that's another story.

So the Smithsonian has been increasing and diffusing knowledge, which is the central panel here. But what about all of that data and information, the first two panels that feed into knowledge, insight, and wisdom? It's all being catalogued and stored in drawers across the Smithsonian. This is a cool shot from the entomology department at the Natural History Museum, to show you that there are floors and floors of these racks of drawers, which are closed most of the time. You don't get to see this on a daily basis, so it's kind of cool to see the contents of all these drawers, and it starts to hurt your brain a little when you try to comprehend the amount of data stored at the Smithsonian, both in the drawers and in the growing share of it that we're digitizing.

Okay, so that was the previous state: a lot of things in drawers. Then the Smithsonian made the decision, ramping up throughout 2019 and culminating in a launch event in February of 2020, to create the Open Access policy and share our data under a CC0 license. This is a photo I took at the Open Access launch event. You may note that late February 2020 is known as "the before times" now; it was kind of weird that we were all packed into the American History Museum, but it was a pretty neat event. So, yeah: what is in the Open Access release?
So, across the Smithsonian, and this is just an estimate because we can't really count it all, there are 155 million objects, 2.1 million library volumes, and then thousands of feet of archival collections, which are essentially boxes of people's notes. But within the Open Access release itself, there are 2.8 million images, mostly 2D, plus a lot of 3D scans as well. And then, most useful to a person like me, we've made the metadata records for all of the digitized objects in our collections available, and there are over 17 million of those at last count.

Okay, so what does "open access" mean? I've been throwing this term around. Before our Open Access launch in February 2020, all the Smithsonian museums and research units made their data available and searchable, but each under its own policies: they set up their own search engines or data dumps, and it varied by unit. There was no consistency in formatting or anything like that, and pretty much every unit had its own usage agreement. I worked at the Natural History Museum then, and I know its terms of use said you could use any of the data for an educational purpose. What Open Access did was make all of the media that fell under the policy completely open, under CC0, which stands for Creative Commons Zero. That means you can take any of the images distributed through this and do whatever you want with them: you can take scans of paintings, reproduce them, sell them, make money off of them. The idea is to put all these things out there (they're funded by the US taxpayers anyway, so they kind of belong in the open realm) and see what people do with them.

So, how can all of this data be accessed? I said that we released this data, but what does that mean in actuality? I'm going to cover three different ways that you can access all of the Smithsonian Open Access data. All three of them share the metadata records in the exact same consistent JSON structure, and I'll note here that it's deeply nested; I'll show you an example of that later.

The first way of accessing the open access data (sorry, there's a lot of "access" here) is through a web API. I have a link there, and I'll share the link to these slides afterwards so you can click on all of these links. What that link gives you is self-contained documentation and kind of a playground where you can test out the different endpoints. It's free to sign up for an API key; I believe they only ask for your email address, and the turnaround is quick and automated. I recommend it for getting a feel for how the data is organized and what the record structure looks like, because it's a little peculiar.

The API is the most marketed way of getting to the data. However, if you're interested in doing big things with the whole data set, you're going to run into some limitations pretty quickly. One is that, although the records are pretty extensively indexed for searching (if you want to search for a person, for example, all of the names are indexed), you're quickly going to find examples of things that you know are in the records but that you can't search on directly.
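To give you a feel for it, here's a minimal sketch of a search call in Python. The endpoint and response fields are what I remember from the documentation, so treat the details as assumptions and check the docs linked in the slides.

```python
# Minimal sketch of a Smithsonian Open Access API search call.
# Endpoint and response structure are from memory; verify against the
# official documentation. The API key is a free, automated signup.
import requests

API_KEY = "YOUR_API_KEY"
SEARCH_URL = "https://api.si.edu/openaccess/api/v1.0/search"

params = {
    "api_key": API_KEY,
    "q": "orchid",  # full-text query against the indexed fields
    "start": 0,     # paging offset
    "rows": 10,     # page size, capped per call
}
resp = requests.get(SEARCH_URL, params=params)
resp.raise_for_status()

for record in resp.json()["response"]["rows"]:
    print(record["id"], record["title"])
```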
Another limitation is that you're limited to 1,000 records per API call. If you're looking to do some sort of analysis across millions of records, across multiple units, that's just not going to be feasible.

I want to show you an example where I used the web API and then the other sources, and I want to set that up here. In 2017, the data science group that I'm now a part of (this was actually before I joined) did a little pilot project where they went into the holdings of the botany department at the Natural History Museum. Botanists traditionally store their plant specimens on flat sheets, and those are kept in drawers at the Natural History Museum. They're great for digitizing because they're flat pieces of paper, and the team actually set up a conveyor belt (I completely forgot to include a picture of it here) that takes photos one by one as the sheets come through. So we have a ton of digitized herbarium sheets.

One example project we tried was to detect herbarium sheets that had been stained with mercury. This was a practice in, I think, the 1800s and early 1900s: as plant specimens came in from the field, a lot of times they had bugs or mold or things like that. Curators wanted to kill the bugs and mold but preserve what the plant specimen looked like, so they would dip the sheet in mercury, which at the time nobody realized is actually pretty dangerous and something we should stop doing. They did eventually stop, but we now have these sheets that are soaked in mercury. People who work with the sheets on a daily basis are able to spot them, but this was a great chance to try out machine learning: take a set of expert-identified mercury-stained sheets, as well as sheets that we knew were not stained, and build a machine learning model to differentiate between the two across the entire collection. There's a link to that paper in there.

That paper was completed and published in 2017. If you work in machine learning, or are familiar with the field, you know 2017 is kind of ancient history by now; a lot has changed since then. I was looking to duplicate the same experiment using some of the more modern techniques to build a model, so I wanted to get all of those training images and do it again. All the images were shared on figshare as part of the paper, but I wanted to get the original images from the database, because the ones shared to figshare were resized. I also wanted all of the metadata around them, like when and by whom the specimens were collected, to see if we could use that to build a more accurate model.

Okay, so here is an example of the record format that these records are published in. This, believe it or not, is a single botany record, and all of these different pieces of information are variously nested inside it, as you can see if you look through. All I had from the figshare repository was the barcode, and barcode was not one of those indexed terms, so I couldn't search on it. Even though I had a list of several thousand barcodes, I couldn't use the API to grab the images of the herbarium sheets I needed; I had to download the entire data set and work through that. This shows you how nested a record can be.
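To make that concrete, here's a heavily abbreviated sketch of what one of these records looks like as a Python dictionary. The exact field names are from memory and may not match the real schema, so take this as an illustration of the nesting rather than documentation.

```python
# Illustrative, heavily abbreviated shape of a single open access record.
# Field names are recalled from memory, not guaranteed to match the schema;
# the point is how deep the nesting goes.
record = {
    "id": "edanmdm-nmnhbotany_1234567",
    "title": "Example herbarium sheet",
    "unitCode": "NMNHBOTANY",
    "content": {
        "descriptiveNonRepeating": {
            "record_ID": "edanmdm-nmnhbotany_1234567",
            "online_media": {
                "media": [{"type": "Images", "content": "https://..."}]
            },
        },
        "freetext": {
            "identifier": [{"label": "Barcode", "content": "00123456"}],
            "name": [{"label": "Collector", "content": "A. Botanist"}],
        },
        "indexedStructured": {"scientific_name": ["Genus species"]},
    },
}

# Pulling out something like the barcode means walking several levels
# down and scanning a list of label/content pairs:
barcode = next(
    item["content"]
    for item in record["content"]["freetext"]["identifier"]
    if item["label"] == "Barcode"
)
print(barcode)  # 00123456
```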
This picture is kind of funny to me; it reminds me of those Mars rover pictures where they stitch multiple images together into a mosaic, and I kind of had to do the same thing to capture the entire JSON record.

Okay, so I mentioned the full data set, and there are two different sources for that; these are options two and three for accessing Smithsonian Open Access data. The first is AWS S3, which stores static data. That source has all of the metadata records, and all of the open access images can be accessed from there too, so it's pretty quick to download. GitHub, meanwhile, has only the metadata; however, it's versioned, and I'll show you a few other advantages to using GitHub.

The files are packaged as line-delimited JSON, so slightly different from the normal JSON you may be used to. They're compressed using bzip2 to save space and make them easier to download. The directories are organized by the unit the records came from, and then the files are split up according to how the records are hashed; I'll show you an example of what that means.

You probably can't squint and see this, so I'll describe it: this is a screenshot of the holdings from the American History Museum, split into all of these different files. The first file here is 00.txt.bz2, and following the hashing structure, there are almost always 256 of these files, of varying sizes, for each of the units. That's a little intimidating if you download, say, all the American History records and this is what you get. It would be a bummer if you had to go through every single file, one at a time, to do something with the data; that would take a lot of time, since there are almost 10,000 files across all the units to process. But believe it or not, there's actually a benefit to having so many files. I'm not sure if it was planned that way, but my eyes lit up when I saw how many files there were, because we can use that to multitask.

I'm going to just jump in here and say there's five minutes left.

Yep, got it. So this is where Dask comes in. Dask is a Python library that's built exactly for multitasking processes. Here's a little GIF from the Dask website showing how you can set up a miniature cluster on your own computer; most of what I'm going to show you here was done on my laptop. Or you can scale, using the exact same code, to real compute clusters, either in the cloud or on high-performance computing clusters. This is a screen grab I took yesterday, actually, while counting the records for that 17-million figure, and it shows how Dask splits tasks across different processors. You can watch all of this in a really cool dashboard, which I love; I could watch it all day.

Dask may be better known for its parallel processing of data frames, that is, tabular data, but it's also really great at parsing text data like the JSON we have. These are the lines right here that it takes to process every single one of the records, either from S3 or from GitHub; this is the exact code you would need to run to process all of them at the same time. And the way it does that: Dask's computation is delayed, or lazy, so it first identifies all of the different file locations and then sets things up so you can compute against them when you're ready.
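Those lines look roughly like the sketch below: spin up a local cluster, point a Dask bag at the files, and map a JSON parser over every line. The glob path is an assumption about where you put the downloaded GitHub files; the Dask calls themselves are standard.

```python
# Sketch: parse every metadata file in parallel with Dask.
# Assumes the GitHub metadata files have been downloaded locally;
# adjust the glob pattern to wherever your unit folders live.
import json
import dask.bag as db
from dask.distributed import Client

client = Client()  # a miniature cluster on your own machine, one worker per core

# read_text yields one element per line and infers bzip2 compression
# from the .bz2 extension. Each file becomes its own partition, which
# is why having ~10,000 files is a feature rather than a bug.
records = db.read_text("metadata/nmah/*.txt.bz2").map(json.loads)

# Nothing has been computed yet (Dask is lazy); .compute() kicks it off.
print(records.count().compute())
```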
They also have a built-in S3 connector; the Dask developers know that people store a lot of data on S3, and the connector is ready to plug in.

Here's an example, going back to that mercury sheet project. I downloaded the entire Natural History botany collection, which is gigantic in terms of the number of records, and I wrote a little function to pull out the IDs. These are the lines it takes to grab useful information from that crazy nested JSON record, and this is what it looks like once you've pulled it out and can actually do some processing on it. (There's a rough sketch of that kind of code at the end of this transcript.)

Okay, I wanted to point out another cool example. I've been working with an intern from George Mason University, Patrick McManus, who may or may not be in the audience. He and I worked together; he focuses on American history. We looked at the holdings of the American History Museum to examine the different dates in there, and how different topics and places show up across those dates. This is a screenshot of a Streamlit app we built together. And then another example, there's a link here, is a tutorial we built as part of our launch with AWS that shows how to pull down all of the painting images from the American Art Museum and run them through a machine learning model to cluster them by content. You can't really tell at this scale, but you can zoom in on a larger cluster image we have on the GitHub linked here and see why the paintings showed up where they did in this big cluster. Okay, and that's it.

All right, well, we have time for one question. Megan O'Donnell from Iowa State University asks: dare I ask what metadata schema, or schemas, the metadata is in?

I don't actually know; there's no quick answer to that. There is a documented JSON schema for, I think, all of the different units, and I believe it's on GitHub. I can share that in the question-and-answer afterwards. There is a name for the schema, one we came up with ourselves; I think it may have been created organically to fit our needs.
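As promised above, here's a rough sketch of the mercury-example workflow: read the botany records straight from the public S3 bucket and flatten a few useful fields. The bucket path, unit folder, and field names are all assumptions from memory rather than verified values, so treat this as a starting point.

```python
# Rough sketch: read botany metadata from S3 and flatten nested fields.
# Requires s3fs for the s3:// URLs. Bucket layout and field names are
# assumptions; check the open access repo for the real paths.
import json
import dask.bag as db

records = db.read_text(
    "s3://smithsonian-open-access/metadata/edan/nmnhbotany/*.txt",
    storage_options={"anon": True},  # public bucket, no credentials needed
).map(json.loads)

def extract_fields(record):
    """Pull a few flat fields out of a deeply nested record."""
    freetext = record.get("content", {}).get("freetext", {})
    identifiers = freetext.get("identifier", [])
    barcode = next(
        (i.get("content") for i in identifiers if i.get("label") == "Barcode"),
        None,
    )
    return {
        "id": record.get("id"),
        "title": record.get("title"),
        "barcode": barcode,
    }

# Flatten every record in parallel, then hand off to a dataframe for
# joins against the list of known mercury-stained barcodes.
df = records.map(extract_fields).to_dataframe()
print(df.head())
```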