I'm Sayeed Choudhury. I work in the Sheridan Libraries at Johns Hopkins, and I'm joined here by Jaap Geraerts — I hope I got that right — who is a research associate at the Centre for Editing Lives and Letters at University College London. We are both involved in a project called the Archaeology of Reading that we'd like to talk to you about today. The principal investigator is Earle Havens, one of my colleagues in the Sheridan Libraries. And we have two of the world's preeminent scholars as the leads for this project: Tony Grafton at Princeton University, and Lisa Jardine from University College London, who sadly has recently passed away, but who was a very influential part of conceptualizing and moving this project forward.

We're going to take a slightly different approach. Sometimes when you hear these project updates, what you basically see is the end result: some sort of website, some sort of interface that shows what's happened. We do have that, and we can show it to you at the end during the Q&A if you wish. But we're taking a different approach of trying to show you how we got to where we are today. That's why we're playing on the name of the Archaeology of Reading — which is really a study of the history of reading practices — and thinking about the archaeology of how we built the infrastructure that supports this particular project. And I use infrastructure with a small "i", because we're not building something that's national or global in scale, per se, but rather something at a project and an institutional level. But we think that's part of a bigger picture, part of a broader network of infrastructure and connectivity.

If you heard Cliff's remarks earlier today, or if you attended Herbert Van de Sompel's session, you heard both of them talking about operating at scale, and Herbert in particular about moving away from a repository-centric view of the world to a web-centric view. So not that you have some content, you put it into a database or repository, and then you build some interface on top; but rather that you treat and prepare the content so that it can be accessed and used through lots of different services and applications, including ones that you may not even anticipate. That's the story we're going to try to convey. And infrastructure is typically thought of as technology, but it also includes humans, and the people part of this is very much the story we want to tell.

So, very quickly: the Archaeology of Reading — you can see the URL for the project website right there — is a project that's been funded by the Andrew W. Mellon Foundation for two years. You can think of this as the first phase of at least two phases of the project. The partners are Johns Hopkins University, University College London, and Princeton University, and each of them brings both a scholarly and a technology capacity to the project. The scholarly goal relates to understanding the history of reading practices, and Jaap will be telling you more about that. Then I will tell you a little about the technology goal of building the infrastructure, and describe that in greater detail.
But I do think it's important that Jaap go first, because even though I think the story of the infrastructure is particularly interesting, it's still done in service of scholarship — of teaching, research, learning, and publication. So I'll turn it over to Jaap, who will tell you about the scholarly parts, and then I'll come back.

Right. So what we're doing in this project, in a nutshell, is developing a tool that consists of a viewer containing the digitized images of 12 books annotated by Gabriel Harvey, as well as digital transcriptions of all of Harvey's annotations, which will be fully searchable.

To start with the corpus — the 12 books — and Gabriel Harvey himself. Gabriel Harvey was born around 1550; we don't know exactly in what year. He was born in Saffron Walden, a small market town close to Cambridge. He did his BA and MA at the University of Cambridge, at Christ's College, and finished his studies as a civil lawyer. But his legal career really wasn't that important, and luckily for us, what he did do was annotate many of his books. He had quite a large library, although we don't know exactly how many books he owned; to date, we know that around 200 of his books have survived. The 12 books we chose for our corpus deal with a variety of topics. Probably the main one is Livy's History of Rome. There are several books that deal with warfare and strategy, including Machiavelli's Art of War. And then, interestingly, we have two copies of Castiglione's Book of the Courtier, one in Italian and one in English, so it will be interesting to see the differences in the way he annotated those two books.

About the annotations themselves — as you can see in these two images, he wrote loads and loads of marginal notes, not only in the margins of the page but also in the printed text itself. He underlined words in the printed text as well. He employed a variety of marks, such as plus signs, equals signs, dashes, and slashes. And, interestingly, he also had a set of astrological symbols that he used to mark up pages in a book, and indeed parts of the printed text. If you look at the upper left corner of the left image, you see a circle with a dot in it, which is the sign of the sun, representing a king or kingship. Another symbol he frequently used was the Mars symbol, which denotes war and warfare.

Our project is very much the fruit of quite a recent historiographical development, which can be summarized as a movement away from the reader towards the history of reading. Traditionally, marginal notes were used to probe into the mind of the reader, to unearth the innermost feelings of the annotator. However, in 1990, Lisa Jardine and Tony Grafton published a seminal article, "Studied for Action: How Gabriel Harvey Read His Livy". In it they argued that Harvey did not read Livy just for the sake of reading; his reading had a really practical purpose. They showed that Harvey, in the capacity of a professional reader, read the book together with Philip Sidney in order to prepare Sidney for his diplomatic missions on the continent. Since the publication of that article there have been many other case studies of this kind — how X read his or her Y — but a study that addresses the larger question of the history of reading practices has not been produced.
Interestingly, although in the first phase of our project we work with Harvey, and in the second phase with John Dee — that is, with known, identifiable annotators — the project is not about Harvey. We do not want to do this project in order to add to the biographical record of Harvey himself. Rather, Harvey, as Lisa Jardine phrased it, is seen as an operator: he started reading and writing in one book, moved to another book, in which he referred to yet other books. So how can we follow Harvey — follow in his footsteps and along his intellectual pathways? Harvey, then, is seen as a neutral vehicle through which we can address the history of reading.

Based on these developments, we formulated some scholarly goals. One thing we needed to do was move away from treating the marginal note as the privileged, highest form of annotation, towards an XML schema — which we have developed — that can capture all the different forms of annotation. Interestingly, when we started generating these transcriptions, it became clear that one reason a study of the history of reading has been so difficult is the sheer size of the data. We have now generated all the transcriptions, using more than 102,000 tags; the underline tag alone, which captures the underscored words in the printed text, contains more than 220,000 words. So the scholar really needs tools to help him understand reading practices at this scale.

Another thing we needed to do, in order to discern patterns, was to provide a way of examining these different forms of annotation in conjunction with one another. It's not enough just to do a simple string search in his marginal notes, for instance; we also have to relate the marginal notes to the printed text, for example by looking for key concepts that appear both in the printed text and in the marginal notes. It is also important to make it possible to follow Harvey through his books. As I mentioned, in his marginal notes he often referred to other books, but he also referred to other pages in the very book he was reading. So we need to enable the user to follow Harvey through his books and to make those kinds of links possible.

To capture the annotations, we make use of XML. One of the advantages of XML, of course, is that it is flexible enough for us to formulate and design our own categories in which to put the data. Another benefit was that CELL had experience working with XML from previous projects. When we started, we did a survey of existing schemas, and of course the Text Encoding Initiative — the TEI standard — is very well known. It has been widely praised for its comprehensiveness, and it is able to do what it sets out to do, namely generating electronic digital editions of printed books. However, its comprehensiveness is at the same time its liability. TEI is fairly top-heavy; indeed, the manual of the latest version, TEI P5, runs to several hundred pages. More important for us, though, TEI focuses on printed books, whereas we focus on annotated books — objects that sit between manuscripts on the one hand and printed books on the other. TEI therefore is not really able to deal with manuscript annotations in printed books very well. Some elements of TEI can show where in a printed book annotations occur, but we want to capture much more information, such as the people, the books, and the geographical locations Harvey mentions in his marginal notes. We therefore decided not to use TEI, but instead to create our own bespoke schema.
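To make concrete the kind of cross-referencing such a bespoke schema enables, here is a minimal sketch in Python. The element and attribute names (transcription, page, marginalia, underline, symbol) are hypothetical stand-ins, not the project's actual schema; the point is relating marginal notes to underlined printed text on the same page, as described above.

```python
# A hedged sketch: find terms that appear both in a page's marginal notes
# and in its underlined printed text. Element names are illustrative only.
import xml.etree.ElementTree as ET

SAMPLE = """\
<transcription book="Livy, History of Rome">
  <page n="12">
    <marginalia>The arte of warre requires prudence and audacitie.</marginalia>
    <underline text="prudence"/>
    <underline text="audacitie"/>
    <symbol name="Mars"/>
  </page>
</transcription>
"""

root = ET.fromstring(SAMPLE)
for page in root.iter("page"):
    # Words Harvey wrote in the margins of this page
    margin_words = {
        word.strip(".,;").lower()
        for note in page.iter("marginalia")
        for word in (note.text or "").split()
    }
    # Words he underlined in the printed text on the same page
    underlined = {u.get("text", "").lower() for u in page.iter("underline")}
    # Key concepts appearing in both margin and printed text
    shared = margin_words & underlined
    print(f"page {page.get('n')}: shared terms -> {sorted(shared)}")
```

Run against the sample, this reports "prudence" and "audacitie" as concepts Harvey both underlined and took up in the margin — a trivially small instance of the pattern-finding the project aims at.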
The development of the schema started at the CELL headquarters in London — you can see the whiteboard there. My colleague Matthew Symonds and I started drafting a schema, but from then on its development was very much an iterative process. We sent the schema over to colleagues at the DRCC at Johns Hopkins, explained why we wanted to capture certain data and why we wanted to capture it in a certain way, and they commented on the schema and explained their preferences to us as well. After several of these interesting conversations, we came up with a version of the schema that was deemed ready, and from then on we started generating the transcriptions.

Alongside the creation of the schema came the creation of a transcribers' manual. On the one hand, this manual is a sort of log that explains why certain decisions regarding the schema were made and lays out the concepts and ideas that underpin it; on the other hand, it is literally the manual for the transcribers, explaining the schema and the way the tags should be used, in order to create a standardized dataset. All of this means that it was not a client-provider relationship. It was not a bunch of over-enthusiastic scholars who cooked up some nice ideas and at some point went to the technology people and said, look, can you build us this? From the start it has been a genuinely collaborative effort, joining scholars, "techies" as we call them in the office, and librarians, in order to create a really robust foundation on which the rest of the project rests. Having said that, the further development process will be explained by Sayeed.

Thank you, Jaap. Most of you probably don't know that I'm actually an engineer by training, even though I've been working in the library for many years. My advisor in graduate school had an interesting view of engineering: he used to call it a liberal art. I asked him what he meant by that, and he said engineering is about people, processes, and products, and the workflows that connect them. If you only think about the technology or the math, you're missing the entire liberal-arts aspect of engineering. And I think that is how we have approached this particular project. People often jump immediately to the product side of these kinds of projects, and it wouldn't be at all unusual to say: give us some use cases, we will extract requirements, and then we will build an interface that sits on top of your repository or database. I would say use cases are about translation of understanding, whereas this kind of engineering approach is about shared understanding.

One of the stories I like to tell is about Lisa Jardine. We had these regular Skype conversations, and she was participating in one of them, talking about her research, why she cares so much about these books, and what she tries to learn from them. And I said, you know, I'm really sorry, I have a meeting to go to in just a few minutes, but I'll listen for as long as I can. But I ended up listening to all of it.
I listened to everything she said, and I was late to my meeting, and she apologized. She said, I'm really sorry, I know you had this meeting and I've kept you too long. I said, please don't apologize. Please don't apologize for sharing your story, and please don't apologize for sharing it with such passion, because that's ultimately what we're trying to capture here. So it really is a process of trust and shared understanding.

Jaap showed you that whiteboard. If you're a technology person — and I mean this in the most loving way — the best way to engage a technologist is to write something within pointy brackets on a whiteboard. You've got us. If you're willing to sit down and go through that kind of process, then there's engagement, there's shared understanding.

Let me give you one example. Harvey is an operator, so Jaap gave you the scholarly reason why they're comfortable looking at Harvey in this first phase and why there are 12 books. One of the most interesting conversations early on was, again, technologists and scholars in the room, a raging debate about whether this would be a sufficient number of books and whether we should go beyond Harvey — and I'm completely lost. There are words like "prosopographical" — I think that was the word; to this day I don't even know what it means — but I'm trying to keep up with the conversation. After about an hour I say, I have a naive question for you: if you take these 12 books, and you take Harvey as a reader, do you think you're going to capture the universe of symbols and annotations and underlines and so on? And they said, we probably will. And I said, then from a technology perspective, I'm done. That's what I need to know, because you've defined the possible set of transcriptions and markups we will need. That kind of understanding comes about when you have people, processes, products, and the workflows that connect them.

What you get at the end of this is typically what I'll call the public results. You get this digitized corpus of books, so we can now look at those images. We have the transcriptions, and we have markup of all the physical annotations that you see in these books. We have two viewers that are IIIF compliant — again, if you heard Cliff's comments earlier, the International Image Interoperability Framework, or IIIF, is about interoperability in sharing and accessing images. It manifests itself in two APIs: an image API that allows you to deploy an image server, and a presentation API that allows you to show those images as a set of canvases.

We have a layered infrastructure that supports this. At the bottom layer is the storage; this is where the bits actually reside. On top of that is an archive layer, where the validation, management, and integrity checking of the content take place. It's at the third layer that the APIs are implemented. We have the IIIF image API and presentation API implemented in a generic enough way that we ought to be able to swap out image servers, and we ought to be able to swap out the viewers themselves. And the very top layer is the website — bookwheel.org — where you go to see the content and, ultimately, the viewers. So those are the kinds of things you typically see.
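To illustrate what "IIIF compliant" buys a client, here is a minimal sketch of a generic consumer of the two APIs just described. The manifest URL is a placeholder, not the project's actual endpoint; the structure and URL syntax follow the published IIIF Presentation API 2.x and Image API 2.x specifications.

```python
# Hedged sketch: walk a IIIF Presentation manifest and derive Image API
# URLs for each page. The manifest URL below is hypothetical.
import requests

MANIFEST_URL = "https://example.org/iiif/harvey-livy/manifest.json"  # placeholder

manifest = requests.get(MANIFEST_URL).json()

# In Presentation API 2.x a manifest contains sequences of canvases,
# and each canvas carries one or more image annotations.
for canvas in manifest["sequences"][0]["canvases"]:
    service = canvas["images"][0]["resource"]["service"]["@id"]
    # Image API URL syntax: {id}/{region}/{size}/{rotation}/{quality}.{format}
    thumbnail = f"{service}/full/!200,200/0/default.jpg"
    print(canvas.get("label"), "->", thumbnail)
```

Because any conformant viewer or script can do this walk, swapping image servers or viewers underneath — as the speaker describes — doesn't break clients: they only ever see the APIs.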
Now, the use cases. As Jaap mentioned, we did develop a set of use cases, but not at the very beginning of the project. We did it roughly six months in, because we first needed to build that shared understanding and that shared capacity to work with each other. The use cases were developed at that stage, and they built upon use cases from a previous project, the Roman de la Rose Digital Library — a set of digitized French medieval manuscripts that also carry annotations. We had a CLIR postdoc develop a set of use cases for that project, which we shared with this project team: can you build from these? Can you adopt some of them? Can you modify them? So in some sense, even if we can't get both project teams in a room together, we can share or translate that understanding through the use cases. We asked the scholars to do this, and they did, in an incredible way. These are very rich, detailed use cases — not a couple of sentences. They follow a very specific template that we've used for many of our other software development activities, covering preconditions, pathways, assumptions, and so on. This represented a very serious time commitment on the part of the scholars we've been working with.

Now, here are some results that you typically wouldn't see, because they sit on the infrastructure side of the project, but they are equally important, and in some ways they're at the heart of why we think this project can be a model for others. There is a data model that describes the content we've produced. A precursor to IIIF is something called Shared Canvas; in our view, Shared Canvas is very much the data model, and IIIF is a protocol for implementing it. Why should you care about data models? With the repository-centric view, you think about metadata and the sharing of metadata, and that's great and important, but it's a certain kind of interoperability. If you share your metadata, it's really about discovery and searching — knowing that something exists. But if I actually want to use your data, to process it and run analytics against it, metadata is probably not enough. I need to be able to look at your data model, compare it to mine, and see whether there are commonalities, bridges, or places where we can make connections. The W3C now has a working group on annotations, with a draft data model, and one of the things we intend to do is compare our data model to theirs, to see where there are commonalities, overlaps, or modifications we will need to make. I suspect that by working at that level, we'll have a much better chance of actually being able to use the data from each other's projects, not just discover that they exist.

As Jaap mentioned, we have this XML schema that was developed through an iterative process; in many ways it is an expression of the collaboration between the scholars and the technologists. And it allowed us to express a lot of these transcriptions and comments in an RDF format — linked data, if you want to call it that — which is at the foundation of what we're trying to do.
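For a sense of what that comparison involves, here is a sketch of a single Harvey underline expressed in the style of the W3C working group's model (the draft that became the Web Annotation Data Model). The canvas URL and region are hypothetical; the body/target/selector structure is the W3C model's.

```python
# Hedged sketch: one annotation in W3C Web Annotation (JSON-LD) form.
# The image URL and xywh region are placeholders; the structure follows
# the W3C model: a body, a target, and a selector pinning the annotation
# to part of a resource.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "type": "Annotation",
    "motivation": "highlighting",
    "body": {
        "type": "TextualBody",
        "value": "underline in pen, keyed to a marginal note on prudence",
    },
    "target": {
        "source": "https://example.org/iiif/harvey-livy/canvas/p12",  # hypothetical
        "selector": {
            "type": "FragmentSelector",
            "conformsTo": "http://www.w3.org/TR/media-frags/",
            "value": "xywh=220,1340,410,36",  # region of the page image
        },
    },
}

print(json.dumps(annotation, indent=2))
```

Comparing models at this level — does the project's underline element map onto a body, a target, a selector? — is exactly the kind of alignment that makes one project's data processable by another's tools, not merely discoverable.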
I mentioned the other project, the Roman de la Rose Digital Library. It was built some years ago — it started in the mid-1990s — and we used Shared Canvas, that precursor to IIIF, to create an interface for it. The lead scholar for that effort, Stephen Nichols, the medievalist, and the postdoc I mentioned, Tamsyn Mahoney-Steel, said they wanted to update that digital library as well. We said, we can do that for you, but what we'd rather not do is build you a new interface; we'd rather update the content. So in essence, we made the Rose data IIIF compliant. There's another set of manuscripts we have access to, by Christine de Pizan, and we've made that IIIF compliant as well. Now all of that content can be accessed by the same set of viewers and the same set of services, and we don't have to keep building new interfaces for all of these different kinds of content. It's not about updating the front end; it's about updating the processing of the data that lies underneath, in the back end, if you will.

We also have detailed statistics about the work of this project. If you're familiar with agile software development, there's a concept of velocity: how much you produce in any particular sprint or fixed period of time. While we didn't get quite that far, we are in fact looking at when the scholars made a lot more progress, where the bottlenecks were, what the constraints were, and what worked really well. So in the future, when we work on more content, or when other teams start to work on similar content, we may be able to say to them: if you take this kind of approach, it will help you move down this path.

And just last week we had a data release. If you go to the downloads page on the website, you can access a bunch of data we have produced: data about the books themselves, about the kinds of symbols we found, the frequency of those symbols, the distribution of those symbols — all sorts of data about the content itself. We have produced spreadsheets and charts and so on, but you can also come along, produce your own views, and put your own lenses onto that data; in fact, I would encourage you to do so.

Let me come back to the Mars symbol Jaap mentioned earlier, because for me this is a very clear-cut case of the difference between the two approaches I'm describing. If we had taken the "put the content somewhere, build an interface, work off use cases" approach, we would have ended up with a website or a viewer where you could page through and see all the Mars symbols. That's helpful, and that's useful, and an expert — someone like Tony Grafton or Jaap — could take it into a classroom, scroll through those pages, and talk about the possible implications. But by taking the approach we did, what we will be able to do is generate a virtual book of all the pages with Mars symbols: it will grab pages from different books and reassemble them into a virtual book of just the Mars symbol.
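A hedged sketch of that virtual-book idea: because IIIF canvases are addressable resources, a new manifest can simply recombine pages from several books. The manifest URLs and the symbol index below are placeholders for the project's real endpoints and released symbol data.

```python
# Sketch: assemble a "virtual book" of every page bearing the Mars symbol
# by recombining canvases from several IIIF manifests into a new one.
# All URLs and the symbol index are hypothetical.
import requests

SOURCE_MANIFESTS = [
    "https://example.org/iiif/harvey-livy/manifest.json",         # placeholder
    "https://example.org/iiif/harvey-machiavelli/manifest.json",  # placeholder
]

# Maps symbol name -> set of canvas IDs; in practice this could be built
# from the project's released symbol-distribution data.
index = {"Mars": {"https://example.org/iiif/harvey-livy/canvas/p12"}}

def canvases_with_symbol(manifest_url, symbol):
    """Yield canvases from one manifest whose IDs the index lists for a symbol."""
    manifest = requests.get(manifest_url).json()
    for canvas in manifest["sequences"][0]["canvases"]:
        if canvas["@id"] in index.get(symbol, set()):
            yield canvas

# A new Presentation API 2.x manifest whose sequence draws canvases
# from many source books.
virtual_book = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://example.org/iiif/virtual/mars/manifest.json",  # hypothetical
    "@type": "sc:Manifest",
    "label": "Virtual book: pages bearing the Mars symbol",
    "sequences": [{"@type": "sc:Sequence", "canvases": []}],
}
for url in SOURCE_MANIFESTS:
    virtual_book["sequences"][0]["canvases"].extend(canvases_with_symbol(url, "Mars"))
```

Any IIIF viewer that can display the original books could display this recombined manifest, which is the point: no new interface is built, only new data.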
Will that be useful? I don't know. But the point is, we can do it, and we have experts who can look at that and start to say: look, Harvey seemed to use the Mars symbol always in this case, or he never used it in that case, or it always appears around this particular word. Those are the kinds of interesting questions we didn't think about at the beginning of the project — we didn't build use cases around them — but we can ask them now because of the approach we've taken. I think the key is to be able to take the data you're given, deconstruct it, and reconstruct it in interesting ways.

And I'd like to propose a new metric for success for the projects we work on: somebody comes along and uses your data in an unanticipated way, without asking you for it. There shouldn't need to be a pairwise conversation — "this is the particular way we did this" — where we build a shim or a connector that allows us to work with your application. The data needs to be actionable enough that they can point at it, and whatever machine mechanism they're using can access it. And it would be fantastic if they did so in ways we didn't think about, because that would prove they've come up with new ways of using the data. Even though we have assembled a tremendous scholarly team, we know there are other people with interesting ideas — we know undergrads who want to try different things — and the key is that the data are in that kind of format and available for that kind of processing.

With that, I will end with a few acknowledgments. We are of course very grateful to the Mellon Foundation for their generous support, and each of our institutions has put forward significant support as well. There's a page on the website about the "archaeologists" themselves, so you can see all the people involved, including our advisory board. We have plenty of time for comments or questions, and I'm sure Jaap and I would be happy to hear your feedback. Thank you.