Hi, welcome. Good morning. Thanks for coming. We're Greg Jansen and Richard Marciano from the University of Maryland, from the Digital Curation Innovation Center at the iSchool, and we're going to talk about a big project we've been working on and some of the areas where it touches the community. This talk is called "Drastic Measures." DRAS-TIC is the name of our repository project, and we're applying it to big archival problems: exploring how we can create workflows, storage, and compute around big collections. Here's what we're going to cover: our approach to a billion files, the DRAS-TIC repository, NCSA's Brown Dog service, which is a key component here, and our source of funding. This is a National Science Foundation grant project, and Brown Dog is an NSF DIBBs project, a Data Infrastructure Building Blocks project. We'll go into our workflow for automatic feature extraction and curation, dig into collections with Elasticsearch, and then talk about further projects and opportunities. To set the stage, I want to make a brief conceptual digression about format debt. This is a theoretical number: the amount of money, time, and resources it would take to completely unpack, decrypt, decompress, and understand what's in all the files in all your collections. As your collections grow or become more diverse, that number keeps growing. It's a number you can theoretically pay off, but probably not practically, and if you don't pay it, somebody pays it later when they want to access that data. If I want to access a file in the future, I may have to use a legacy tool or do some reverse engineering. So this debt keeps accumulating across media types and across hundreds of formats, and there's an investment required to unlock things.
You can pay it now, you can pay it later, or your users can pay it. This is our collection: half a rack at the University of Maryland, four compute servers and a whole lot of hard drives. We have 72 terabytes of data and 100 million files, so we're at one tenth of a billion. That's why we say we're approaching a billion. DRAS-TIC is an acronym: Digital Repository At Scale That Invites Computation, comma, To Improve Collections. It's the product of two years of startup funding for a company called Archival Analytics out of Liverpool in the UK. They were partners with us on this NSF grant and we worked together, but now we are the inheritors of the project. It's been given to us, and we're now its stewards. It's a community-based project now, and we're looking for partners. It's unique in that it's a repository built on a distributed database, Cassandra, and it scales horizontally: you can keep adding nodes and get stable performance. There's not really an upper limit, because Cassandra is used by many Fortune 500 companies to store data and build applications for millions of people. It has a web UI and a command-line client. It implements an industry-standard REST API called CDMI, the Cloud Data Management Interface, which is an SNIA standard. It stores key-value metadata, at least right now. And it does eventing over MQTT, so it has a messaging system you can use to build workflows around it. The Python source is on GitHub, and like I said, it's Apache Cassandra underneath. Here's a component diagram: the DRAS-TIC API server sits in front of the database. The database cluster can grow much larger than what's shown; it can span multiple data centers, and it can be configured to replicate the content across however many replicas you'd like, with different numbers of replicas in different data centers. You can also run more than one front-end DRAS-TIC API server.
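As a sketch of how that MQTT eventing can drive a workflow, the handler below turns a repository event message into a task for a work queue. The topic layout and payload fields here are our own illustrative assumptions, not the actual DRAS-TIC message schema.

```python
import json

def parse_drastic_event(topic: str, payload: bytes):
    """Turn a repository event message into a workflow task.

    Assumed (hypothetical) topic layout:
        drastic/resource/<operation>/<path...>
    and a JSON payload carrying basic object info such as "size".
    """
    event = json.loads(payload)
    _, _resource, operation, *path_parts = topic.split("/")
    if operation not in ("create", "update"):
        return None  # ignore deletes, reads, etc. in this sketch
    return {
        "action": "extract_metadata",
        "path": "/" + "/".join(path_parts),
        "size": event.get("size"),
    }
```

With a client library such as paho-mqtt, you would subscribe to the event topics and call a function like this from the message callback, pushing the resulting task onto the queue.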
And the consistency of your information in the database is eventual, that is, eventually consistent, which is a little different from traditional database applications. This is just a screen grab of the web UI. We're using it to store about 150 different federal record groups; those are our test collections. For next steps with DRAS-TIC, we're looking to integrate with the Fedora repository API, very exciting work which we're talking about just today, and at distributed computing to analyze the extracted metadata, the content, and the full text. The Cassandra database offers a parallel computing platform that you can apply to all of this. And we're looking to operationalize DRAS-TIC. It's been an R&D project, and now that it's a community project, we'd like to make it more sustainable on a technical level, with quick-start installs, backup and recovery scripts, a multi-data-center proof of concept, cloud deployments, things like that. Which brings us to NCSA's Brown Dog project: a public, web-scale API for data transformation, feature extraction, and format migration. It's a big NSF project at NCSA, at the University of Illinois at Urbana-Champaign, with many partners, mostly in the sciences, but we're there to represent the archival use case. Brown Dog is really just a few REST APIs that you can send your files to and ask for extractions of metadata, in which case it will fire a range of tools off against your data and extract everything it can from that file. It's a facade in front of many, many different virtual machines doing many, many different things with your data. It extracts the metadata irrespective of format, and it reuses existing tools packaged inside virtual environments to do this kind of work. They've all been macroed and AutoHotkey-ed, if you're familiar with that, and in many cases scripted, to do this work on behalf of the API. The community, all of us, can contribute more extractors and more converters.
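To give a feel for that facade and fan-out, here is a minimal sketch: given a file's MIME type, pick every registered extractor whose accepted types cover it. The extractor names and the registry format are invented for illustration; Brown Dog's real registry and dispatch work differently.

```python
def matching_extractors(mime_type: str, registry: dict) -> list:
    """Select every extractor whose accepted MIME patterns cover a file.

    Patterns may be exact ("image/tiff"), major-type wildcards
    ("image/*"), or catch-all ("*/*"). Returns names sorted for
    a deterministic dispatch order.
    """
    major = mime_type.split("/")[0]
    hits = []
    for name, accepts in registry.items():
        if mime_type in accepts or f"{major}/*" in accepts or "*/*" in accepts:
            hits.append(name)
    return sorted(hits)

# Hypothetical registry of extractors, keyed by accepted MIME patterns.
registry = {
    "ocr": ["image/*"],
    "faces": ["image/jpeg", "image/png"],
    "siegfried-id": ["*/*"],
    "pdf-tables": ["application/pdf"],
}
```

A TIFF would fan out to the OCR and format-identification extractors here, while a PDF would go to the table extractor and format identification.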
So it has an architecture that invites contributions and that kind of growth. They've engineered it for web scaling: when more and more requests come in in an hour, they've done the testing and elasticity work to bring more virtual machines online to do more extraction, so they can compensate for larger loads as they come in. It has to do that to be public and web-scale. And it's easy to use: not only is it a pretty simple API, it's also available as a command-line client and as a browser plugin. Pretty neat stuff. So how does it work? File format conversions happen in virtual environments using legacy software, perhaps the original software that produced the file, or other software that can read it. An example would be using an old copy of Photoshop to open an old Photoshop file and convert it to TIFF, and then a second step using a Linux Docker container with ImageMagick to convert that to JPEG 2000. That's the sort of orchestration the API does behind the scenes: it links those virtual machines together, passing the data between them, and it knows how to find the chain of processing to get you to your destination format. You can request a particular output format, and it will find a path to that format. For extraction it does the same thing. It has legacy tools, all kinds of different tools, the kitchen sink. It takes your file in, and any tool that can do something with that file, according to its MIME type, will request the content, process it, and give back linked data in the form of JSON-LD. There are many different kinds of extractors available already. Lots of them are image-based; the NCSA group that originally worked on this is their imaging group, so they have a lot of computer vision built in. Some examples are OCR, facial recognition, data tables from PDFs, and river paths in satellite images or historical maps. I'll talk a little more about that.
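That chaining behavior is essentially a path search over a graph of converters. A minimal sketch, with an invented converter graph mirroring the Photoshop-to-TIFF-to-JPEG-2000 example:

```python
from collections import deque

def conversion_path(graph, src, dst):
    """Find a chain of converters from src to dst format.

    Breadth-first search, so the fewest conversion hops win.
    `graph` maps a format to the formats its converters can emit;
    Brown Dog discovers its own graph from registered converters.
    """
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of converters reaches the target format

# Illustrative edges: a legacy-Photoshop VM, then an ImageMagick container.
converters = {
    "psd": ["tiff"],
    "tiff": ["jp2", "jpeg"],
}
```

Requesting JPEG 2000 output for a PSD file would resolve to the two-step chain psd, tiff, jp2, with the orchestrator passing intermediate data between the virtual machines.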
[Audience question: do the river paths come back as a series of longitude and latitude points at certain intervals, or does it turn into a data array?] I think it doesn't do the georeferencing, because that's not necessarily there in the image. You pass it an image file, and it will identify paths that are likely to be a river. [Follow-up: and then if you have reference points with geocoordinates, you could map them?] Yeah, I believe so. I think that's right. Actually, that might be a good application for something I'll talk about later; the parallel computation could do the second half of that. Anyway, the extraction of vegetation patterns from LiDAR is another interesting aerial extractor: they can see how much vegetation there is in an area and how high the trees are. So this takes a science use case, builds an extractor from it, and then we all have access to it afterwards. It's interesting; most of their use cases are science. They have a tools catalog, which is the way you contribute more tools, get credit for your tools, and make them available to others. And this is some of the technology stack behind Brown Dog, from their website. So we've got these two big components, Brown Dog and DRAS-TIC, and we're going to build a workflow out of them, adding Elasticsearch to the mix. Here's our workflow overview. We have a work queue and workers that process the workflow, we have our repository, and we have Brown Dog remotely in the cloud. Elasticsearch is a search index like Solr, but it's clusterable, so it's distributed. Solr is too, now, actually. These workers listen to events from DRAS-TIC, add tasks to the queue, respond to them, talk to Brown Dog, get the results back, put them in the repository, and trigger indexing of those results. It's a big engine, and I'll take you through one example: the workflow for a PDF object.
The first thing that happens is we index the data we already have into Elasticsearch: the file name, the directory path, and the file size. Then we extract text: we request a PDF-to-text conversion from Brown Dog's conversion API, get back some full text, add it to our metadata in the repository, and then put it in Elasticsearch. Then we also get a PNG rendering of the PDF and run OCR on the PNG. That's because lots of PDFs don't really have full text, they don't have embedded text, and also because lots of them contain images that contain text. So with this workflow we'll get essentially all the text, and that goes into Elasticsearch as well. Then we run an extractor that does format identification. This is Siegfried under the hood, but it's built into Brown Dog, so when you fire off all the extractors in Brown Dog, you get this back. We put that in the index as well. Brown Dog says we have six faces in the image, so we put that in Elasticsearch. It says we have a close-up. It says we have three people in profile. It says we have twelve eyeballs. I don't know what kind of photograph would have a close-up and three people in profile, but it's an example. That goes into Elasticsearch as well. So we've got an object, we've enhanced it, and we've indexed it. We have lots of different formats, and we get varying levels of metadata from all of them. Whatever the Brown Dog extractors come back with, we put into DRAS-TIC, and then we can look at it and index parts of it. This slide is more representative of the formats, zooming out to our big-data level, approaching a billion files. So now we've reached a point where, let's say, we can do extraction over approaching a billion files, or 100 million files, and we have all this metadata. Now we have a new problem: we have to figure out what we have.
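Conceptually, each step of that PDF workflow contributes fields to one growing Elasticsearch document. A minimal sketch, with field names and sample values of our own choosing rather than our actual index mapping:

```python
def build_es_document(path: str, size: int, steps: list) -> dict:
    """Accumulate one search-index document for a file as workflow
    steps complete. Starts from what we already know (name, path,
    size), then merges in each step's partial metadata dict.
    """
    doc = {"path": path, "name": path.rsplit("/", 1)[-1], "size": size}
    for step in steps:
        doc.update(step)
    return doc

# Hypothetical results from the four workflow steps described above.
steps = [
    {"fulltext": "Annual budget report ..."},        # PDF-to-text conversion
    {"ocr_text": "FIGURE 1: spending by year"},      # OCR on the PNG rendering
    {"puid": "fmt/276", "mime": "application/pdf"},  # Siegfried format ID
    {"faces": 6},                                    # computer-vision extractor
]
```

The finished dict is what gets posted to Elasticsearch for indexing, so every enrichment step makes the object more findable.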
So the promise of Elasticsearch is to give us a view into the data: to let us characterize what different folders have inside them, generate a sort of finding aid, and do appraisal, if you like, or other things. Elasticsearch has a companion visualization tool called Kibana, and we've been using that. I think we'll continue to use it and also explore other kinds of visualization, but Kibana is really good; it gets you a lot for little investment. Here's an example of an image-based chart showing pixel counts on a logarithmic scale. That's a nice way to overview your data. Here's another one, a pie chart of formats: the inner ring is the MIME type, the outer ring is the particular PUID, the format or subformat. You can combine the charts into dashboards. Full disclosure, this is not a dashboard we have, but it's an example. And the dashboards can be faceted, so you can drill down into folders, and folders within folders, and redraw the whole dashboard, all the charts, to see what's happening throughout the hierarchy in the repository. There are a lot of neat text features in Elasticsearch, but this is my favorite: it can find significant terms between neighbors. If you have your data divided up into buckets, it can say, well, what's different about this bucket is that it uses the words "denied" and "v.", what's different about this bucket is "science" and "budget", and this other one, "specification" and "diagram". These are federal record groups, so if you want, you can guess which agencies' material we're talking about. It's an interesting full-text application. "Denied" and "v." is the Supreme Court, as in Jansen v. North Carolina or something like that. "Science" and "budget" is the Office of Science and Technology. And "specification" and "diagram" is the National Institute of Standards and Technology. So even when your folder names are cryptic, it's neat that there's some generated help there.
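For intuition, here is a toy re-implementation of the significant-terms idea: score each word by how much more frequent it is inside one bucket than across all buckets combined. Elasticsearch's actual significant_terms aggregation uses more sophisticated scoring heuristics, and the sample documents below are invented.

```python
from collections import Counter

def significant_terms(buckets, top_n=2):
    """Rank the terms that distinguish each bucket from the corpus.

    Score = foreground rate minus background rate (a simplified
    stand-in for Elasticsearch's scoring). `buckets` maps a bucket
    name to a list of document strings.
    """
    per_bucket = {name: Counter(w for doc in docs for w in doc.split())
                  for name, docs in buckets.items()}
    background = Counter()
    for counts in per_bucket.values():
        background.update(counts)
    bg_total = sum(background.values())
    out = {}
    for name, counts in per_bucket.items():
        fg_total = sum(counts.values())
        score = {t: c / fg_total - background[t] / bg_total
                 for t, c in counts.items()}
        out[name] = [t for t, _ in sorted(score.items(),
                                          key=lambda kv: -kv[1])[:top_n]]
    return out
```

Run over buckets of court and science-agency text, a word like "denied" surfaces as distinctive for the court bucket, much as in the record-group example above.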
So that's our workflow; it's a combination of these two projects plus our own efforts. This slide is an invitation to join us, help us, or work with us. For DRAS-TIC, we're looking for institutional partners. We're also looking for use cases for the parallel computation that can be done in Cassandra: what would your repository do with its collections if you had that kind of computational environment? And lastly for DRAS-TIC, we're looking for Fedora sprinters, because we're setting up a Fedora sprint to prototype DRAS-TIC as a back end for Fedora. For Brown Dog, you can try the APIs and become an early adopter. You can fire it off against whatever data you have, but I'd encourage people to try it on their research data, their science data. You can contribute to the project. Their public API soft launch is at the end of the year, I believe, so it's timely; you can sign up now. And lastly, we're the UMD iSchool. Do you want to talk at all about the UMD iSchool? Well, we both moved there two years ago, and we started a big-records and records-analytics group centered on the DCIC. There's a ton of support at the University of Maryland: the university invested resources, lab space, hardware, and machines, so it's been a great place to incubate these projects and relaunch them. Another feature I just wanted to mention: it's not part of the open-source stack, but if you install the DataStax Enterprise version of Cassandra, there's a graph database engine that comes with it. Some of our projects are looking at taking library collections and archival content, refactoring them, looking for relationships, anything that can be graph-networked, and doing graph-based analytics. We have a number of prototypes working with students; there are about 50 students working in a lab on a series of projects. And we're very intrigued by this kind of architecture.
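As a tiny illustration of that kind of relationship analysis, the sketch below stores person/place/event relationships as triples and finds the shortest chain connecting two people. The sample records and names are invented; in practice this traversal would run in the graph engine, alongside the archive.

```python
from collections import deque

def connection_path(edges, start, goal):
    """Shortest chain of relationships between two nodes.

    `edges` is a list of (subject, relationship, object) triples.
    Relationship labels are ignored in this sketch, and edges are
    treated as undirected; it's a plain breadth-first search.
    """
    graph = {}
    for a, _rel, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, []).append(a)
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

# Invented sample records in the spirit of displacement-history data.
edges = [
    ("anna", "sailed_on", "ship1"),
    ("ship1", "arrived_at", "havana"),
    ("bela", "interned_at", "camp1"),
    ("camp1", "located_in", "havana"),
]
```

Here two people who never appear in the same record are linked through a shared place, which is exactly the kind of indirect connection these event models are meant to surface.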
We're going to be playing with that, so if any of you are interested in network analytics, visualization, or scalable graph databases, we'd love to partner, compare notes, and work with folks. We're looking at Holocaust Museum data and National Archives data, specific examples of Holocaust data, anything where you can track people, relationships between people, and the movement of people. Most of our projects look at historical forced and voluntary displacement, so you're looking at event models with people, locations, places, and events, and you're trying to link all of these together into very large graphs. The really intriguing thing about this architecture is that all of this data can reside in the archive, in DRAS-TIC, and you can perform these kinds of analytics, network computations, and analyses directly on the archive without moving the data. It's not going to work for everything, but there are some intriguing possibilities for doing in-situ graph analysis on networked data, and we have three or four projects exploring that. It's all about giving meaning to collections that would otherwise be locked in place and kind of opaque. I should say DataStax is the company that contributes most to Cassandra. What we did recently is install the enterprise version of Cassandra from DataStax, which requires a license. They're going to come out with a research and educational license shortly, I think; we snuck in under the startup license, which is fine. So one of our other next steps is to use the graph database in tandem with the repository. We still have five minutes or a little more for questions. Then we're probably out of time. Thank you so much for coming.