Thank you very much for that very kind introduction, and thanks to all of you for coming. I should apologise in advance about my voice; sorry for my cold, I'll try to get through all of this. I'm going to talk about data management in the Chinese Text Project, which is a project I've been running for about 12 years now, at least since it's been online. This is basically a digital library of pre-modern Chinese textual material: primarily transmitted texts from throughout Chinese history up until the early Republican period. It's an online project, with a web address here, ctext.org. I'm going to start by introducing some of the most important types of data in the system, and then I'm going to make some comments about preparation and curation of data: where our data actually comes from and how it's maintained. Because we don't have a lot of time, this will primarily involve a quick discussion of optical character recognition for these types of materials and how we get better results with it, and of crowdsourcing, because this is really very important to the way this database is maintained at the moment, and it has a lot of consequences for how we run various aspects of the project and maintain the data. Lastly, another important thing I want to cover is dissemination and sharing of the data, particularly how we use an application programming interface to export this data in real time to allow for integration with other projects and to allow use of the data in text mining, research and teaching. So there's a lot of data in this project and in the system, but the most fundamentally important type is images of historical editions of historical texts: sequences of page images like this one, which is a page image from a book in the Harvard-Yenching Library. The whole of the Harvard-Yenching rare books collection is included in this project, along with material contributed by other libraries.
The second fundamental kind of data that we have is transcriptions of this type of material. In most cases, like in this example, these two representations are linked together within the system, so that for example you can have full-text search using the transcription, or you can work with the transcription as your primary way of interacting with the text, but you can also refer back to the authoritative scanned image data to confirm exactly how this looked in this particular edition. Within the ctext interface this looks something like this. You can interact with things in the full-text view, but you can also click a button and immediately view the same material in the side-by-side image and transcription view here, with the search terms highlighted and so on. It's important to emphasize at the start that these two things are really just different visualizations of the same underlying data. So this image and transcription view, which is very closely tied to the original page images, is just one representation of the underlying data that's also used for the textual representation. These two things are linked internally using an XML representation, which I'll come back to and give an example of a bit later on. But these are not two different copies of the text. These are the same copy of the text visualized in two different ways. Also, for materials which have been contributed by libraries, we try to preserve direct links back to the library catalog. So for example, for the Harvard-Yenching materials it's possible to click on a link and open the catalog entry for the particular copy of the particular edition of the particular text which was scanned to produce the images that we're showing here. This allows us, to some extent, to offload a lot of metadata concerns, which are much more naturally handled in the library, to librarians who are specialists in this.
So if people want to find the original metadata or the up-to-date metadata for one of these texts we basically point them to the library to show exactly where to find it. And the system has a lot of other functionality that's sort of secondary to this. Built on top of these representations we have other features to provide a reading environment particularly for the classical Chinese materials. So we have things like dictionary links. This is actually an interactive visualization if you view it on the site. You can move over any particular part of the Chinese text and it'll give you dictionary information. It's also aligned with an English translation. We have things like parallel passage information which is also layered on top of this and references to historical commentaries for this particular paragraph of text. So I'm not going to go into this in too much detail but we also have these kinds of functionality. So this is now a fairly large project. It's been online in something like its present form since 2005 although it's evolved significantly since then. We have around 25 to 30 thousand users every day so a lot of people make use of this system. And in total we have around five billion characters in 30,000 texts. 25 million pages of primary source historical scanned materials are in the system. And we have over 10,000 registered users who collectively contribute to the database in very concrete ways which I'm going to get to in a moment. We get something like 100, something on the order of 100 contributions every day to this project through crowdsourcing. And it's also important to have some idea of where our users come from. The vast majority as you might expect come from Greater China. So over 50% from China and over 80% from China, Taiwan and Hong Kong taken together. But we do also have significant numbers of users in the US and Europe. And the entire project has two, has several different interfaces for different languages. 
So basically all of our introductory and explanatory material and all of the user interface is available in Chinese as well as English, in simplified as well as traditional characters. I'm not going to say too much about OCR, because it takes quite a while to get into some of the details of this, which are quite complex. Basically, the way that we get adequate OCR for these historical materials is by adapting to the specific domain, adapting to the types of images that we're expecting to deal with here. One of the first things that we do is training data assembly. We have automated methods of extracting training data for Chinese, which consists of labeled character images for all of the characters that we want to recognize. This is particularly tricky for Chinese because there are so many characters, and it's not practical to create this data manually as it would be for many other languages. Then there's character segmentation: there are various things that we can do to improve character segmentation accuracy on these particular types of texts. And language modeling, which is a fairly standard kind of technique in OCR, basically takes advantage of contextual information in the way that human readers would naturally do subconsciously when they read these texts. I'm happy to answer questions about these, but I'm not going to go into any detail on that, other than to say that whenever I talk about OCR, people always ask me about accuracy. And I don't want to just give a number, because I don't think it's meaningful to give a number without any sort of context and without any sort of comparison. So here I'm going to give a very, very short comparison on one particular page image. Here we have a page image from Google Books at the top and the Google Books transcription at the bottom, and I've highlighted all of the inaccuracies in red. On the top I've highlighted the things which have not been transcribed, and on the bottom are things which have been mistranscribed.
So the error rate in this transcription is the combination of all the red marks on these two pages. If you put this into other OCR systems, you will get a corresponding result. This is ABBYY FineReader's result. This is Tesseract's result. You can see both of them are clearly better than Google Books. And this is the ctext procedure's result for the same page. So again, we don't get a perfect result, but we've significantly reduced the error rate. On this particular example, which is fairly representative, the accuracy goes from something like 40% in Google Books to something like 70% for off-the-shelf OCR and 94% for ctext. That's a significant reduction if you look at this in terms of error rate, because the error rate for Google Books here is 60% and the error rate for ctext is 6%, roughly a tenfold reduction in OCR errors. Of course, 94% is not 100%, so there are clearly errors in our OCR output. And this is where the crowdsourcing comes in in the ctext project. We have this side-by-side view, which is how many scholars interact with our data, because they want to refer directly to the primary source materials. The idea is that users, when they identify mistakes like this one highlighted in red here, can click a button and immediately a simple textual representation of the content on this particular page is shown to them. It's worth mentioning that this is not the underlying representation of the data. This is a transformation of the underlying representation of the data that's easy for users to edit. So they can just directly modify this in their web browser, and when they submit it, this is immediately stored and committed to the database. It's also important to mention that we're very open about who contributors are. This is basically the Wikipedia model, where anyone can create an account and anyone can modify our data. So of course we have other ideas that are borrowed from or modeled on Wikipedia, like an edit log.
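To make the arithmetic behind the accuracy comparison quoted above explicit, here is a minimal sketch that converts the accuracy figures into error rates and a reduction factor. The function names are purely illustrative, and the percentages are the approximate ones given for this single example page, not a general benchmark.

```python
def error_rate(accuracy_pct):
    """Error rate (in percent) implied by a character accuracy percentage."""
    return 100 - accuracy_pct


def reduction_factor(baseline_accuracy_pct, improved_accuracy_pct):
    """How many times smaller the improved error rate is than the baseline one."""
    return error_rate(baseline_accuracy_pct) / error_rate(improved_accuracy_pct)


# Approximate accuracies quoted in the talk for one example page.
accuracies = {"Google Books": 40, "off-the-shelf OCR": 70, "ctext": 94}

for system, accuracy in accuracies.items():
    print(f"{system}: {accuracy}% accurate, {error_rate(accuracy)}% errors")

print("Reduction factor, Google Books to ctext:", reduction_factor(40, 94))
```

Looking at error rate rather than accuracy is what makes the difference visible: each remaining error is something a contributor has to find and correct by hand.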
All of this is, of course, versioned. Every editing operation that's made is recorded in the edit log, and we have a link to a diff-style visualization showing what was changed. In this example, a 1 was deleted and a 3 was inserted. We can also link this directly to the side-by-side representation, which makes it very easy for anyone to determine whether the actions of another editor are genuinely helpful edits or whether this is a mistake that should be reverted. So this is quite easy to police because of this direct link back to the primary source data. Behind this, there is a serialization of the data that our users are editing. It is also possible for users to edit this directly, but usually the easiest way is through the visual editing interface that I just showed you. The representation itself is basically in terms of XML fragments, which serialize chunks of text into a form that contains all of the data that we have about that particular unit of text, including, most importantly in this context, the links to the specific locations on each of the scanned pages. We also use the same approach of visual editing tools to allow contributors to contribute more complex types of data. This is one way in which we deal with the problem of characters which don't exist in Unicode, which frequently occur in this type of corpus. The way this is handled in the visual editing tool is by users literally drawing on the image data the region corresponding to the character which they have no way of inputting because it doesn't exist in Unicode. This gives them a rare character or variant character input box, which asks them to provide certain metadata about this particular character instance, specifically things like composition, radical, number of strokes and so on. The system then uses their submission to do two things.
The first of these is to look at whether anyone else has input the same character in any other location. They're then presented with a list of possible candidates. This also uses OCR to see whether the region they've selected actually corresponds to a character that we already have. The idea is that if it does, the user selects it, and this allows us to link it automatically to the other instances of the same character. So this isn't simply saying we're going to use images instead of characters. This is saying we're using images, but behind them a more complex representation which links together images of the same abstract character. The result of this is then an XML representation which can be pasted directly into the editing window. So our users don't actually have to understand XML. They're not expected to understand XML, but what they're editing is fundamentally XML. Once this is submitted, this basically creates a character, and this can then be used to represent that character throughout the site. This also makes it possible to have full-text search including characters which can't be input in Unicode. It also makes it possible to aggregate data about rare and variant characters, because we have the standard representation and the individual instances are linked. This all happens transparently when users input characters using this method. The same approach can be used for inputting illustrations, which as a percentage occur very infrequently in our data, but because we have so much data, significant numbers of them do occur overall. We use a similar interface to allow users to mark up regions of particular pages which correspond to illustrations and provide some very simple annotations, for example whether there is a caption for this particular image in this particular context. And again, the result of this is an XML representation of all of the information that the user submitted.
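As a rough illustration of what such an annotation might serialize to, the sketch below builds a small XML fragment for one marked-up illustration region. The element and attribute names (`illustration`, `page`, `x`, `y`, `width`, `height`, `caption`) are hypothetical, chosen only for this example; they are not the actual ctext schema, which is not shown in the talk.

```python
import xml.etree.ElementTree as ET


def illustration_fragment(page, x, y, width, height, caption=None):
    """Serialize one illustration region as an XML string (hypothetical schema)."""
    attributes = {
        "page": str(page),      # which scanned page the region is on
        "x": str(x),            # top-left corner of the region, in pixels
        "y": str(y),
        "width": str(width),    # size of the region, in pixels
        "height": str(height),
    }
    element = ET.Element("illustration", attributes)
    if caption is not None:
        # Only record a caption element when the contributor supplied one.
        ET.SubElement(element, "caption").text = caption
    return ET.tostring(element, encoding="unicode")


fragment = illustration_fragment(
    page=12, x=140, y=200, width=620, height=480, caption="example caption"
)
print(fragment)
```

The point of the sketch is only that the contributor draws a region and answers a few questions, while the tool produces the markup; the contributor never has to write XML by hand.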
And so this can then be used by the system to visualize and represent these illustrations in the textual view of the same text. The image data of course is not necessary if you're dealing directly with the page images, but many ctext users prefer to deal with the textual representation of these texts once they've been transcribed and only refer to the images when they actually want to check or cite something. Once we have this data, of course there are other things we can do with it. We can implement image search based on the captions, and we can also use this in the future to improve the accuracy of OCR and potentially to identify illustrations automatically within our data, because we now have a significant amount of training data: specific examples on specific pages in specific locations. So I'm going to move on to another topic now, which is replication and failover. One of the issues we have here is that we have a very distributed user base. We have many users in East Asia, but also in the US and Europe. Many of the international links on the internet between these locations perform sub-optimally at various times of the day. There's significant latency just because of the distance, but at peak traffic times the connectivity between these locations is not very good, and this is a very bad thing for us, because we're relying on our users to actually contribute information and participate, not just to be passive consumers of this content. So we want them to have as good a user experience as possible, and one of the ways that we make this possible is by having the entire system replicated in two completely different locations at the moment, one in the US and one in East Asia, and both of these have complete up-to-date copies of all of the information in the site.
So if you connect from London, for example, you will normally be connected to the US data center and all of your edits will be committed there, whereas if you connected from China you would connect to a data center in Hong Kong. Of course this is where the crowdsourcing element also has to be considered, because our data changes continually. So we have instantaneous replication between these two sites. If you make an edit at the US data center, this will be transmitted to and committed at the East Asian one, and vice versa. This is quite separate from the question of long-term backup, which is also something we do. We also have RAID volumes and multiple copies of all of our data offline, but we also have an instantaneous copy made online in a different location. And this contributes to the website having a relatively fast, I would say very fast, experience for our users, regardless of where they are connecting from. This also allows us to do something else which is very helpful, which is to automatically cope with any types of technical issues that cause problems at either of these data centers. We have continuous monitoring of both of these sites in real time, and if one of them, for example the North American data center, stops responding in the correct way due to any kind of external factor, which could be power failure, network failure, hardware failure or a software issue, we can automatically fail over to have all of our users connected to the Asia Pacific data center, and vice versa. So this is very convenient when things go wrong, because things, of course, do go wrong. This happens transparently and doesn't require any intervention on our part. So the last of the things I'm going to introduce is how we actually make this data available to other people. And this is primarily done through an application programming interface designed for this project.
The motivation behind this is partly that this is the largest database of this type of material for pre-modern Chinese, but also that it's not a fixed target. It is literally changing every day; every hour of every day, edits are made and things change. Most people typically are interested in some subset of this data which is relevant to them, but they always want to have access to the most accurate version of this data. One particular use case of that is the case where somebody is transcribing a copy of a text because they actually want to have access to a high-quality transcription of it. This site provides an interface through which they can do that. If we have an API which allows them to export the latest copy of the data, then they have a ready-built workflow which they can use to transcribe a text and obtain a copy of it when they've completed it. The other motivation behind the API is that there are different use cases and different requirements for dealing with this material. People may want to deal with the data in different types of structured forms, and people may want to connect this system to other external systems in different ways. So the API, as it looks at the moment, consists of three major components. One of them is what we're calling CTP URNs, that is Uniform Resource Names, which specify some particular text or object within ctext. These are machine-readable identifiers that refer to either one particular edition of a text or one particular chapter of one particular edition of a text. The other main components are a JSON API, which provides a mechanism for extracting machine-readable data and metadata from ctext, and, slightly more unusually, a plugin system, which is designed to allow our users to create specifications which ctext can understand of how to connect ctext to other external projects. The idea here is that these are completely open and freely definable. Anyone can log into ctext and create these.
And if they create a plugin for their own project or for some existing project, they can share this with other users, and there's a mechanism for them to update it in the future when the link structure of external sites changes. What these look like in practice is something like this. This is a page from the ctext dictionary system, so the user has looked up a particular Chinese character here and there's various information from ctext. But the section highlighted in red here corresponds to the plugins which this particular user has installed at this point in time. Each of these links to the corresponding entry in a particular external dictionary. So there's a level of customization or personalization built into this, in that users choose which plugins they want to use: they can create their own, but they can also choose to install ones which other users have created. This helps to get around issues where some dictionaries, for example, may be subscription-based; some institutions may have them, others may not. If you have access to one and it's important in the context of your research, you'll probably want the plugin, but the plugin doesn't apply to everyone, because not everyone has access to this resource. The second main type of plugin at the moment is textual plugins. These basically do the same kind of thing with textual material, by sending a textual reference to an external resource with the expectation that the external resource will then use the JSON API to fetch textual data or metadata about that resource and do something with it. And of course what it does is specific to the external resource. The very simplest type of this is the plain text plugin for ctext, which is really an external application, which means that you can download it and modify it if you don't like the format that we export things in.
This simply fetches the data and provides a download link, or allows you to copy and paste the data, for a textual object in ctext. But external resources can do much more complex things. A really nice example of this is the MARKUS project, for marking up and identifying, using semi-automatic methods, proper names, place names and references to time periods in historical Chinese texts. There's a plugin for MARKUS, and MARKUS supports the ctext API, so it's possible to directly read any ctext textual object into the MARKUS interface and to immediately start performing these operations there. Another example is the Text Tools plugin, which performs very simple types of textual analysis on Chinese texts, for example regular expression search and aggregation. The nice thing about this is that it can make use of the structure of these texts, which is also exported through the API in a consistent way. It also provides things like automated identification of text reuse within arbitrarily selected texts from the system, as well as visualization of this data, in this case as a network graph. So the very last thing I want to talk about is also related to the API, and this is use of ctext data in digital humanities research and teaching. The motivation here is that what this system provides via the API is a very simple and consistent method of importing data into an external tool or program. The key thing here is that the data is always delivered in the consistent format specified by the API. And particularly for teaching use, in addition to the API itself, which is entirely specified through HTTP protocols, we also have a Python module which provides a way to interact with this API. The goal of the Python module is to add an additional layer of abstraction, which is particularly useful in the context of the teaching we're doing at Harvard, which is digital humanities practical methods in Python for students with a background in Chinese studies.
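As a rough illustration of the kind of short program the Python module is meant to enable, here is a minimal sketch that fetches a textual object by URN and runs a regular expression search over it. The function name `gettextasstring` and the URN `ctp:analects/xue-er` are assumptions based on my recollection of the ctext module's documentation and should be checked against it; `fetch_text` and `search` are hypothetical helper names. The fetch is kept in its own function so the search logic can be tried on a local sample string without network access.

```python
import re


def fetch_text(urn):
    """Fetch a whole textual object as one string.

    Assumes the `ctext` package is installed and that it provides
    `gettextasstring` (an assumption; check the module's documentation).
    """
    from ctext import gettextasstring  # imported lazily; needs network access
    return gettextasstring(urn)


def search(text, pattern):
    """Return every match of the regular expression `pattern` in `text`."""
    return re.findall(pattern, text)


# Local sample, so the search can be run without calling the API.
sample = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?人不知而不慍,不亦君子乎?"
matches = search(sample, "不亦.{1,2}乎")
for match in matches:
    print(match)

# With network access, the same search on live data would be:
#   matches = search(fetch_text("ctp:analects/xue-er"), "不亦.{1,2}乎")
```

Running the same search on a different text is then just a matter of passing a different URN to `fetch_text`.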
And the reason that this API is very useful for this is that it can greatly reduce the amount of time required for data wrangling in a teaching context, and also avoid needing to work with pre-chosen example materials, because any materials accessible through the API are available in the same consistent format. Also, again in a teaching context although nice to have generally, this tends to lead to simpler programs which have easier logic to follow. A really simple example of that is this very, very simple program here; if you're familiar with Python, it should be fairly clear what's going on. The first line just says we're going to use the ctext module. The second line fetches an entire textual object in a particular format. And then the remainder of the program performs a simple regular expression search against this text. The key thing here is that the URN identifies a textual object. So it's pretty obvious to anyone who's understood what I've said so far, and what URNs are in this context, how you would perform exactly the same function on a different textual object. You simply replace the URN with something else and you get the expected result. The function here, and this is really what the Python module contributes to the process, specifies the type of structure. In this case we're getting the textual object as a string, but we didn't have to get it as a string. We could have got it as something more complex, as you would be able to see in this program, except that it's going to be far too small, and it's far too complex a program to read through. But the things I want to highlight in this program are mainly in the first line, where we start out by specifying a list of, in this case, four textual objects, again specified using URNs, and then later in the program the data and metadata for those texts are extracted using the URNs and the API.
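The overall shape of such a program, a list of URNs at the top and logic that never refers to any specific text, can be sketched as follows. This is not the program shown in the talk: the URNs and corpus are made up, `fetch_chapters` is a stand-in for whatever API call returns a text as a list of chapters, the choice of character frequencies as features is an assumption, and the analysis step itself is omitted.

```python
from collections import Counter


def chapter_features(chapters, feature_chars):
    """Relative frequency of each feature character in each chapter."""
    rows = []
    for chapter in chapters:
        counts = Counter(chapter)
        total = len(chapter) or 1  # avoid dividing by zero on an empty chapter
        rows.append([counts[char] / total for char in feature_chars])
    return rows


def build_dataset(urns, fetch_chapters, feature_chars):
    """One feature row per chapter, labelled with the URN it came from."""
    labels, rows = [], []
    for urn in urns:
        for row in chapter_features(fetch_chapters(urn), feature_chars):
            labels.append(urn)
            rows.append(row)
    return labels, rows


# Made-up stand-in for the API: maps hypothetical URNs to lists of chapters.
FAKE_CORPUS = {
    "ctp:example-a": ["之乎者也之", "也也之"],
    "ctp:example-b": ["者者乎"],
}

urns = list(FAKE_CORPUS)  # changing this list is the only edit needed
labels, rows = build_dataset(urns, FAKE_CORPUS.__getitem__, "之也")
for label, row in zip(labels, rows):
    print(label, row)
```

Each row corresponds to one chapter and each label to the text it belongs to, which is the structure a later visualization step would need in order to draw one point per chapter and one color per text.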
And what this program actually does is to perform a principal component analysis visualization of certain features of these texts, used for authorship attribution or stylometry in this particular case, using the data obtained from the API for these particular textual objects. You can see in this visualization here that the data has all come from the API, the metadata has also come from the API, and the structural information about the text has also come from the API, because each dot on this graph represents one chapter of one textual object. The colors represent the distinctions between the four textual objects. So the key thing here is that this relatively complex-looking program is actually trivial to use and trivial to modify to perform exactly the same analysis on other materials, simply by changing the URNs, because the rest of the logic in no way depends upon the specific content of these texts. So it's very easy and intuitive, even for novice users who've never seen a Python program before, to see how you would change this, how you would experiment and how you would play around with this kind of technique. So the API makes quite sophisticated programs like this very easy to write in a way that's general and not specific to a particular data set. I'm going to stop there to allow time for questions. All of this is available and documented online, and I should say for those of you who are not working in Chinese primarily, we do have a dual English and Chinese interface and all of the explanations are available in both English and Chinese. So if you're interested, please take a look. Thank you.
I'm interested in how you got to this. What's the history of this, because it looks like there's a lot of work in here? And more specifically, I was interested in your plugins and dictionaries: that looks like a lot of work in mapping what you're doing to each of those different dictionaries. What's the process? OK, thanks.
Thanks very much for those questions. OK, so the first question, the history: basically this started as a very small project. It started out really, really small, as a system for navigating one very specific historical text that I was interested in, and gradually grew from there. The design was initially fairly general, but conceived of as an entirely static database system for working with textual materials, and textual materials which had a particular type of format that made them map very closely to database structure. The text in question was actually the Mohist Canons, if any of you are familiar with that. It's a text which has a really strange writing style. It's almost written like an encyclopedia, with very short entries. So it maps sort of naturally onto database structure, and it's the kind of thing where you're likely to want to search for particular entries. It's also a text which has terrible textual issues and is very, very difficult to interpret. These kinds of things informed the original design of ctext back in 2005. It grew organically from then until about 2012. Throughout that period of time, it was essentially the same system, gradually gaining additional functionality and additional content, but primarily a static database. The only fundamental change in the design was really in 2012, when we moved to this wiki model where things are serialized in XML so that users can actually edit them directly. Prior to that, it was effectively using stand-off markup, but in a custom-designed format, not XML. So, your question about the plugins? Actually, the plugins at the moment for dictionaries are much simpler than they might at first look. They don't actually do anything particularly complex. All they do at the moment is to send the reference to the particular character or word that's been looked up to an external system.
So in fact, each of those plugins consists of a very, very short XML file of maybe six or seven lines, which basically specifies some trivial metadata about the external system: what it's called, how to refer to it. The key thing it contains is a URL containing a component which can be replaced with the data that you want to send to the external site. In the case of dictionaries, the dictionaries that we support are all dictionaries which support searching within that dictionary for a particular term. Effectively, all this mechanism does is provide a way of specifying a URL template, into which you can add certain things in certain places to create a transformed URL pointing to a specific location in the external dictionary. So at the moment our integration with external dictionaries is fairly superficial. But in the future, we would like to have a mechanism to actually include external dictionary data within the ctext interface as well. And that's something we're working on at the moment.
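The URL template idea described above can be sketched in a few lines. This is only an illustration of the general technique: the placeholder syntax `{query}`, the helper name `expand_template` and the dictionary address `dictionary.example.org` are all hypothetical, and the actual plugin file format is not shown in the talk.

```python
from urllib.parse import quote

PLACEHOLDER = "{query}"  # hypothetical marker for the part to be replaced


def expand_template(template, term):
    """Substitute a percent-encoded search term into a URL template."""
    return template.replace(PLACEHOLDER, quote(term))


# Hypothetical external dictionary that supports search by URL.
template = "https://dictionary.example.org/search?q={query}"
url = expand_template(template, "禮")
print(url)
```

Because the only site-specific part is the template string, a plugin author can repair a broken link after an external site changes its URL structure by editing that one string.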