We're going to go over how we manage data at BDRC. I'll do a very short introduction to our organization and what we do. I was talking to Nathan about this: it's an interesting angle for this conference, because it lets us talk at a somewhat higher level about how we actually manage all the data — what we mean by data, the types of data, and so forth. This is an 18-year project, so we've created a lot of data and run into all the issues of scale and workflow. We've made all the mistakes and struggled for many years, especially back in 1999, in the dark ages of the internet. So we want to lay out the methods we've used to manage the information, and what kinds of information we're talking about.

The project was founded by Gene Smith in 1999 as the Tibetan Buddhist Resource Center. In 2015 we changed our name to the Buddhist Digital Resource Center, to expand our impact and the kinds of material we work on. We focus on textual resources and metadata; we don't get into audio or video. Basically, we digitize source material through partnerships or through our own digitization programs — we run scanning centers and do a lot of fieldwork. Then we curate the source material using scholarly validated metadata. That's an important part, and was a major interest of Gene's: not just standard library cataloging, but scholarly validated documentation. And then we preserve everything, make it publicly accessible, and back it up on a long-term preservation platform.

Our current status — this chart runs to 2016, in millions of images — is about 13.6 million images, primarily Tibetan but increasingly also Pali, Sanskrit, and Chinese. Our website is a global platform with about 200,000 online sessions a year, and we've just released apps on the App Store and for Android. Everything we do is preserved in Harvard's DRS, and I'll go through a little of what that means.

What kinds of data are we talking about? Digital text resources — you could say digital assets, the assets we're managing — then metadata about those assets, and then source code. Ellie's going to talk about metadata and source code. The source code is where a lot of what are called business rules live: rules implemented in the software, not in the metadata and not in the resource you're managing. You have to keep track of those as well, and of where those business rules are. The digital text resources are images of texts, using open-source, standard formats as much as possible, but also page transcripts; I'll go into how we use TEI, the issues with TEI, and our multi-layer architecture. Then there's the metadata: documents that describe the digital assets. The metadata itself becomes so rich that even its definition as metadata starts to blur — it's data worthy in its own right of curation, sorting, and storage. Metadata is critical for accessibility, management, and preservation.
And then there's the source code that manages the digital assets: a whole apparatus of software that organizes and archives them, connects the metadata to the digital assets, and makes them discoverable.

On scale — I've heard some discussion of scale here. Scale, for us, is critical. It impacts basically every decision you make: what metadata format you use, how much metadata you actually capture, your workflow systems, how you visualize and present information. Take faceted browse: if you have 40 texts, there's a limit to how much information a user can take in on screen; if you have 450,000 texts, it's a very different scenario. Scale really defines how you implement what you're trying to do. Our scale is those 13.6 million images — actually 31.2 million total digital assets, counting all the web-delivery images plus the archival images — and, for page transcripts, about 17,000 text documents. The scale of the metadata: 160,000 records — items, persons, places, topics — which, and Ellie will talk a bit about RDF, comes to about 10 million triples, covering about 450,000 individual works among the items in the library. And the source code: 220,000 lines of code in our current platform, and about 25,000 lines in the open-source platform we're now developing on GitHub.

Now a little on data management of the digital resources — getting into the weeds, because this stuff matters at scale: every single thing you do in terms of how you name your files and what the directory structure is. We've had four major migrations of our whole digital archive. We now have a mirrored architecture: local storage, with web delivery from Amazon Web Services (AWS) in S3 buckets, plus backups. Local storage is RAID 5 and mirrored, so effectively four copies. It's also mirrored at the Internet Archive — we're building that platform right now — and our long-term preservation backup is at Harvard. I'm rifling through this just to give you a sketch of what we're doing. This is the architecture: local storage mirrored to an S3 bucket on AWS, which will be made available through an IIIF service to image clients directly, or to a whole series of applications. We spend a lot of time on software development — we have five software developers now.

Directory structure: these are item IDs. When we started we had a couple thousand images, then 10,000, then 100,000, then a million, and each time things started to break and we had to keep track. Now everything's in S3 and it's quite nice. We had to break up one long flat directory listing into buckets, so this is very stable now, and every path is unique, down to the individual TIFF or image number. Who did the presentation on file naming? These paths are rock solid: take the whole path and it's unique on the internet.
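As a rough sketch of what a scheme like that can look like — the prefix scheme, item IDs, and file naming below are hypothetical illustrations, not BDRC's actual convention — each image gets a stable key derived from the item ID, with a short hash prefix that breaks the flat listing into buckets:

```python
# Sketch of bucketed, globally unique image paths. The prefix scheme,
# item IDs, and file naming are hypothetical, not BDRC's actual layout.
import hashlib

def image_key(item_id: str, volume: int, image_num: int) -> str:
    """Build a stable storage key for one archival image.

    A short hash prefix spreads items across buckets, so no single
    directory listing grows unboundedly as the archive scales."""
    prefix = hashlib.md5(item_id.encode("utf-8")).hexdigest()[:2]
    return f"{prefix}/{item_id}/{item_id}-{volume:03d}/{item_id}{image_num:04d}.tif"

# e.g. image_key("W12345", 1, 7) ->
#   "<xx>/W12345/W12345-001/W123450007.tif"
# where <xx> is the two-hex-digit hash prefix for that item ID.
```

The point is that the key is a pure function of the IDs: no matter how many millions of images you add, every path stays stable and unique.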
So, policies: how do we make things accessible? A big thing we worked on is differentiating license and access. We worked with a copyright activist at the Harvard library — someone with much more of an interest in making things accessible than in restricting them, and concerned about cultural heritage projects that use copyright in ways that aren't really appropriate. For instance, an image of a public domain text is not copyrightable. About 75% of our library is in the public domain, so we had to deal with this. What we were doing before, since we had no mechanism to restrict access, was to restrict things by calling them copyrighted. So when we went to Harvard and said we wanted to give them all of our scans for long-term preservation, they asked: what's the license on this? We said: it's copyrighted. But you do a copyright analysis and, well, no, it's not — it's public domain material. We said: but the people who gave us the texts don't want them publicly accessible. So we needed a way to differentiate between the license — the legal designation of the asset — and its access. What we came up with separates license and access: public domain versus copyright on one side, and on the other a set of cultural restrictions based on what the owners of the traditions want restricted or not. This is managed by us, basically, and it works to a certain extent. A lot of people want these texts, and we don't have a very strict way of providing access beyond: you have to contact us. On the accessibility side, we have a whole web application framework for serving up images, with authentication for access control.

Actually, how much time do I have? Okay — Ellie still has to talk. So, beyond scans: transcribed texts, produced from OCR, transcription, and digital editions. Our current e-text repository is in TEI. I am not a proponent of TEI, and the reason is this: TEI is very strong semantically, but it's expressed as XML, and XML hierarchies don't overlap. The simplest semantic breakdown is the page, and if you have an outline section that runs across a page boundary, that's very difficult to express in XML — you end up with workarounds, non-breaking tags, milestone tags. In our case, with 17,000 documents, it's very difficult to manage text documents that each have different TEI customizations. So it's the interoperability problem, the lack of overlap in XML. Our problem with TEI is not the semantics of TEI; it's the XML of TEI. A way forward is to use the TEI header as metadata pointing to the TEI body, which is the asset, and have the tags encode character positions in a text stream.

That's our multi-layer architecture, which looks like this: the metadata lives in a linked-data Git repository, and there's a base layer which is just an ID and a stream of text. We chunk that up and serve it to our search engine, and this is the search application written on top of that. It scales to an effectively unlimited number of tags. Under the hood you do the same thing with language analyzers for the different languages, using the Apache Lucene framework. The multi-layer structure looks like this: you have basically a table structure here, and then what looks like a JSON file here.
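Before getting to the actual format, the core idea here is stand-off markup: each layer refers to character positions in an immutable base text, so any number of layers — pagination, outline, and so on — can coexist and overlap freely. Here's a minimal sketch of the idea; the layer names, labels, and offsets are hypothetical illustrations, not our actual schema:

```python
# Minimal sketch of stand-off layers over a base text. Layer names,
# labels, and offsets are hypothetical, not BDRC's actual schema.
from dataclasses import dataclass

@dataclass
class Slice:
    layer: str   # e.g. "pagination" or "outline"
    label: str   # e.g. a page number or a chapter title
    start: int   # start character offset into the base text
    end: int     # end character offset (exclusive)

# The base layer is just an ID plus a raw stream of text.
base_text = "…"  # imagine the full text stream of one e-text here

layers = [
    Slice("pagination", "page 1a", 0, 120),
    Slice("outline", "chapter 1", 80, 450),  # freely crosses the page boundary
]

def resolve(s: Slice, text: str) -> str:
    """Recover the span of base text a slice points at."""
    return text[s.start:s.end]
```

Because no layer touches the text itself, adding a new layer can never collide with an existing one — which is exactly what overlapping XML hierarchies can't give you.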
Actually, that file is Turtle. You can't read it on the slide, but it basically says: this slice is on line one, the start character is here, and the end character is here. These are all layers that point into your raw text. So where we're going with this: you have text files which are raw text, and your layers point into that raw text. Done this way, you can have an unlimited number of layers pointing at the same base text, so there's no collision, and the semantics of each layer are expressed in the RDF file. That's our multi-layer architecture.

Okay, so, metadata — Ellie is going to talk about metadata.

What I'm going to talk about is how we design the metadata in our new platform. It's not finished yet, and it's not what you will see on the tbrc.org website today. First, due to the scale of the data and our idea of a semantic network within our dataset — all the entities should be linked together, and also linked to other datasets — we think of our data as linked data. What that means, as core requirements, is that every resource has a persistent ID in the form of a URL. The URL you see here, for instance, is the ID of a Tibetan person in our database, and it will stay there forever. The data also needs to be served in standard formats — JSON-LD and so on — with a standard vocabulary. The technology for meeting these kinds of requirements is RDF; RDF is a format that was designed for exactly this. So we use RDF for the persistent IDs and standard formats. For the standard vocabulary, we have designed an ontology, which we call the Buddhist Digital Ontology. It aims at being a standard for the Buddhist field, so that other projects can use it — or at least convert their data to it — and link to our data in a properly semantic way. That's work in progress: we're documenting the vocabulary, and we hope to release a public, if maybe not final, version with documentation around January or February.

Now, the data themselves. We have one document per person, per work, per place, et cetera, each in RDF, and we serialize them in the Turtle format — a standard RDF representation that is human-readable, plain text, and uncompressed. Thanks to those properties, we can keep the files in Git repositories. That gives us versioning of every individual resource, and also versioning at the type level: we have one Git repository for persons, for instance, so there's effectively a version for the whole dataset of persons, which is quite convenient. It also solves the backup problem: you just clone the repository onto different servers and you have every resource, with its complete history, backed up very easily. All our Git repositories will be available on GitLab.org — a GitHub-like platform — so everything will be publicly accessible, for free. And one aspect of Git that's also interesting is human-readable diffs: if you want to see what changed in the last two days, or the last month, you can see it very easily, because the Turtle format is very human-readable.
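To make that concrete, here's a small sketch of what the one-resource-per-file pattern could look like using Python's rdflib; the namespaces, the ID "P0001", and the property names are placeholders, not the actual Buddhist Digital Ontology:

```python
# Sketch of the one-resource-per-file pattern, serialized as Turtle.
# Namespaces, the ID "P0001", and property names are hypothetical
# placeholders, not the actual Buddhist Digital Ontology.
from rdflib import Graph, Literal, Namespace, RDF, RDFS

BDR = Namespace("http://example.org/resource/")  # placeholder resource namespace
BDO = Namespace("http://example.org/ontology/")  # placeholder ontology namespace

g = Graph()
g.bind("bdr", BDR)
g.bind("bdo", BDO)

person = BDR["P0001"]  # the persistent ID is the URL itself
g.add((person, RDF.type, BDO.Person))
g.add((person, RDFS.label, Literal("some person name", lang="bo")))

# One plain-text, human-readable file per resource: ideal for Git,
# whose diffs then show exactly which triples changed.
g.serialize(destination="P0001.ttl", format="turtle")
```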
So when you ask Git to show you a diff of the files, you can see very easily: okay, this property changed, this name changed, this name was added, et cetera. It's quite convenient for seeing what's going on.

In terms of access: for read access, all our metadata will be under a CC0 license, so basically public domain. There will of course be Git access, if you know how to use Git, but also a public interface, which you can think of as an enhanced version of tbrc.org with more features and more semantic search. For write access, only our staff can write to the data. That's the current system, but we want to make it more flexible through a mechanism called annotations. The general idea of annotations in our metadata is that every piece of information — even "this person has this name" — is something we want to be able to annotate, and likewise a region of an image, a set of images, a region of a text, et cetera: we want to be able to link and annotate all of that. For instance, you could use annotations for provenance: this person has this name according to this text, or there's a controversy about the name that you can read about in this article — all sorts of data about the data. Annotations can also be used for discussions, because you can annotate annotations: if someone makes an annotation, someone else can comment on it, and so on. And the way we want to use them for access is to let users propose changes to the data: a user could annotate a property of a person to say there's a spelling mistake in the name, or this is the wrong name, or a name should be added, and BDRC staff would have an interface where they validate or discard the proposed changes. It's a way to let a larger user base interact with the data.

In terms of source code, one important aspect is that, because of the semantic nature of our data, we want the source code to read the semantics — the schema — of the data, and build the UI and the business rules from that schema. This applies in several domains. Localization: in the ontology you can have the localization of the different properties — this property should be displayed as this English string, this Chinese string, et cetera — so all the display names of the properties live in the data (there's a small sketch of this below). Inference: the data being semantic, it can state rules like "if you have this and this information, you can deduce that information"; there's a whole inference machinery around RDF. The rule is stated in the data, and implemented once in source code that reads the data. And access control too: if you have conventions in your ontology and data, the software just reads how to handle them from the data and applies the business rules.
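For instance — assuming, hypothetically, that the display names are stored as language-tagged rdfs:label annotations on each property — a UI can look its labels up in the ontology instead of hard-coding them:

```python
# Sketch: build UI labels from the ontology rather than hard-coding them.
# The ontology file name and the property URI below are hypothetical.
from rdflib import Graph, Namespace, RDFS, URIRef

BDO = Namespace("http://example.org/ontology/")  # placeholder namespace

g = Graph()
g.parse("ontology.ttl", format="turtle")  # the ontology ships with the data

def display_label(prop: URIRef, lang: str = "en") -> str:
    """Return the rdfs:label of a property in the requested language,
    falling back to any available label, then to the raw URI."""
    fallback = None
    for label in g.objects(prop, RDFS.label):
        if getattr(label, "language", None) == lang:
            return str(label)
        fallback = str(label)
    return fallback or str(prop)

# e.g. display_label(BDO.personName, "zh") would return the Chinese string,
# if the ontology carries a zh-tagged label for that property.
```

The same pattern extends to the other domains: the software stays generic, and the ontology decides what gets displayed and how.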
In terms of the code: what we're building is open source, under the Apache 2 license, and we only use open-source libraries and frameworks that run on all the different operating systems — macOS, Windows, Linux, et cetera. We also contribute as much as possible to the libraries and frameworks we use: if there's a missing feature or an issue, we try to implement the fix upstream in the framework itself, so that we don't have to write a lot of code ourselves and it can be maintained outside our codebase. And we're really trying to build a platform that's production-ready: we document the code, write automated tests, make it available on GitHub, keep the dependency management correct, et cetera, so that it can be used by — so that it speaks to — other software developers. It's not just a platform we're developing for ourselves. A nice feature we've added is build automation: using a tool called Vagrant, you can download a script that automatically fetches a Linux virtual machine, installs all the dependencies of our platform, all the source code, and all the data, and serves the whole thing on your local computer. That's how we ourselves set up our development platforms, on our laptops and on our servers. So you can very easily get a complete platform running on your laptop and try searches and so on — it's just a matter of launching a script.

I think we have maybe two minutes, so: long-term preservation. All three of these levels — these types of data — we preserve long-term. Next slide, maybe. Harvard's Digital Repository Service is our partner for long-term preservation. Basically they have a series of content models that you publish into, and then they handle obsolescence, digital degradation — bit rot — backups, migrations, and so forth. That covers all the page images, archival as well as web-delivery, all the metadata, and all the source code. I think this is important for a lot of digital humanities projects: to have some plan for long-term preservation. A lot of funding agencies now require data management plans in grant applications. So that's it — we have one minute for questions.

Question from the audience: step by step, when did this scale up to such a large dimension? Well, we've had maybe six major releases of our public digital library. We started in 1999. In 2006 we made a major shift in the architecture, with new metadata and a whole web platform that delivered images — that was one major release. In 2009 we implemented full localization of everything, using the same metadata format. The current project started basically this year: all our data before was more or less RDF-ready, but not actually RDF. The full move to the semantic web and linked data is happening now — that's our project now.

Another question: when do you choose to develop something yourself rather than build on an existing project, and why? It's simple: if it doesn't exist anywhere, we implement it; if it exists somewhere, we really want to use standard libraries. But ontologies do exist — yes, though not at the level of detail we need, because our core data is in RDF and we need a way to describe all the details in our technology. Current ontologies tend to be quite small and more focused on interoperability: on schema.org, for instance, you have very few properties. We could export our data to the schema.org ontology,
but we couldn't export all the details — everything would be lost in translation. So we need something that can really handle all the details of our data. It's a good question, though, and it's actually at the heart of semantic web and linked data: the decision of when to use a standard, common vocabulary versus when to implement your own. At least there's a framework for that consideration.

One more question, about the Lucene search we mentioned: did we have to take a very different approach to handle such different languages? No — what we're doing is just plug-ins for Lucene; they simply didn't exist before. There was no Tibetan analyzer for Lucene, so we're developing a Tibetan plug-in for Lucene, and one for Sanskrit, one for Pali, et cetera, but in a fairly straightforward way. It just didn't exist. That makes sense, yes. We have to move on — let's continue the conversation afterwards. Sorry, we have to go on.