The plan I had was to have an informal discussion about it in two parts. First, I'd lay out the vision behind Open Library and what we're doing, and try to get you excited about it and feel part of it. And then I can give you a more candid status report about how we've done so far and how you can help.

The idea behind Open Library started out really simple. It spiraled out of control from there, but it started out simple. The idea is one web page per book. Books are such important artifacts. They're the repositories of knowledge in our culture. They're where you go when you have a great idea or a great piece of literature or a great piece of art. You write about it and you publish it in a book and it goes in a library. And it seemed kind of tragic to me that these important cultural artifacts never had a really strong first-class place on the web. They were always either on a publisher's website if they were in print, or a bookstore's website, or occasionally a couple of pages scattered across libraries. But there wasn't a definitive place you could go to find out about a book. And Open Library's goal is to be that place.

The first thing we realized is that if we're going to make a website this big and this ambitious, a website with a page for every book that tries to collect all the information about that book on that page, it had to be editable. This is the era of Wikipedia and user-contributed websites like YouTube. And so it doesn't make sense to be the definitive website unless users can contribute. So what we did is we decided to make it into a structured wiki, a new kind of wiki in which we store all the data about a book in a structured database format but anyone can go and edit it. So if you look at Wikipedia, this is the page on the book, and if you want to edit it, what happens is you get back this big block of text here.
And the problem with a big block of text is that while humans can sometimes edit it, although Wikipedia's editing syntax is somewhat confusing, computers can't. So it can't be reused in any kind of library system. It can't be reused by software. You can't do queries about it and find interesting things. We wanted Open Library to support all those things. So what we did is we made it more structured. If you go find a book on Open Library, this one, we have separate fields for all the different data elements like the title, the author, all of those sorts of things. And that lets us, for example, if you click on an author's name, automatically generate a bibliography and show you all the books that author has written. Because that data is stored in a structured way, when you add a new book and you say it's written by this author, we automatically know to add it to this page. And all of this data is available for people to reuse and recreate in different ways.

The other thing we realized very quickly is that it had to be really open. Our name is the Open Library. Openness is a core part of it. The plan was that this isn't something I could do, or a small group could do, or even a big company could do. This is something that really has to be a collaboration between a lot of different people. So we brought in publishers, we brought in libraries, we brought in book lovers, we brought in reviewers. We're trying to get everyone to come together and contribute their data to one site where it's all available for free, it's all available for download and reuse. Everything is completely open: anyone can contribute, anyone can take advantage of it, anyone can reuse it. We're just this kind of central hub coordinating all these different actors.

The other thing we realized was that you wanna have full text. So for books that are out of copyright, where we can scan them and put up the full text, we wanted to do that.
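The structured-record idea described above can be sketched roughly as follows. This is a minimal illustration in Python, not Open Library's actual code or schema: the `Book` and `Catalog` names, the `title` and `authors` fields, and the sample records are all assumptions made up for the example. It shows why keeping the author in its own field, rather than in a block of wiki text, lets an author page be generated automatically whenever a book is added.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Book:
    """One structured record per book (field names are hypothetical)."""
    title: str
    authors: List[str] = field(default_factory=list)


class Catalog:
    """Toy in-memory catalog that indexes each book by author when it is added."""

    def __init__(self) -> None:
        self._by_author: Dict[str, List[Book]] = defaultdict(list)

    def add(self, book: Book) -> None:
        # Because authors are a separate field, no text parsing is needed here.
        for author in book.authors:
            self._by_author[author].append(book)

    def bibliography(self, author: str) -> List[str]:
        # Titles of every book recorded for this author, alphabetically.
        return sorted(book.title for book in self._by_author.get(author, []))


catalog = Catalog()
catalog.add(Book("Second Example Book", ["Example Author"]))
catalog.add(Book("First Example Book", ["Example Author", "Another Author"]))
print(catalog.bibliography("Example Author"))
```

The same lookup over free-form wiki text would need a parser for whatever syntax editors happened to use, which is the reuse problem the talk is pointing at.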
Full text brings up some interesting questions about how to read books online. Nobody has really found a great model for reading books on web pages. You can have a really long HTML page that people can scroll through, you can have a little flip book interface, and we're trying both of those, but I think we need people to help us find ways to make these books more accessible to people.

We also wanna have a way for people to find them in their local library. So if people come to our site through a search engine and they come across a page that seems interesting, they wanna get a copy of that book, and so we wanna be able to say, oh, just down the street there's a library that has a copy, go and visit them. And in an era where people think research means going to do a search on Google, this is really important because it allows us to pull people from Google into physical libraries where they can go and visit and collect books. So our motto is kind of: on the side of every book, we have a little option where you can buy, borrow, or steal, or download, I guess. Buy by going to an online bookstore or a physical bookstore, borrow by looking at a local library or at one of these book trading services, and then download: if we have a scan, we'll make that available.

The other thing, of course, is we wanna integrate reviews, both professional reviews from major journals and publications and also amateur reviews like Amazon has, where people can contribute reviews.

We wanna have subjects. When we first started the project, we brought in a bunch of librarians, and the first thing they started arguing about was which subject system to use. Should we use the Library of Congress subject system, or should we use this dissident version of the Library of Congress subject system where the Vietnam War is classified as a real war, because the Library of Congress only changed that recently? And so our answer was very simple. We don't have to choose on the internet.
We don't have to put the books in one place or only put one category system on each card. We can store all the category systems and let people sort by whatever system they want. And so that's our goal, to be open in that way: if you wanna sort it by Dewey Decimal, you can add in Dewey Decimal numbers. If you want LCSH, you can add LCSH headings. All of the various options we wanna be able to put on our website and let people sort and browse through whatever mechanism is necessary, and try and connect up similar categories, to say this Library of Congress subject heading is very similar to this publisher subject heading and so on.

Same thing with identifiers. Right now, there's ISBNs, there's OCLC numbers. Every organization seems to come up with a new identifier each week, UPCs and so on. We wanna be the repository to take all the different identifiers, to take everything we can, put it in this one database, and then let people jump between them. So you give us an ISBN, we'll give you back an OCLC number. You give us an OCLC number, we'll give you back an Open Library number. We obviously have to create our own identifier system, because we're gonna have more books and different books than other systems do, but we're gonna be able to link that with all the other identifier systems.

The other big challenge is what, much to the confusion of my friends with children, the library world calls FRBRization (pronounced like "Ferberization"), which basically means connecting books, like physical books in a library, connecting those to the collection of all those physical books that are basically the same and in the same print run, connecting those to all the different editions, and connecting those to the translations and video adaptations and the CDs, and managing the links between all of that.
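The identifier lookup and the grouping of editions described above can be pictured with a small sketch. This is an illustrative Python example under stated assumptions, not Open Library's data model: the `Edition` and `Crosswalk` classes, the `work_key` field, the scheme names, and every identifier value are hypothetical placeholders. Each edition carries whatever identifiers are known for it, any one of them can be translated into another, and editions sharing a work key are treated as versions of the same underlying work.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Edition:
    """One edition plus its known identifiers (all names are hypothetical)."""
    title: str
    work_key: str  # editions of the same work share this key
    identifiers: Dict[str, str] = field(default_factory=dict)


class Crosswalk:
    """Toy index from (scheme, value) pairs to editions, grouped by work."""

    def __init__(self) -> None:
        self._by_identifier: Dict[Tuple[str, str], Edition] = {}
        self._by_work: Dict[str, List[Edition]] = {}

    def add(self, edition: Edition) -> None:
        for scheme, value in edition.identifiers.items():
            self._by_identifier[(scheme, value)] = edition
        self._by_work.setdefault(edition.work_key, []).append(edition)

    def translate(self, scheme: str, value: str, target: str) -> Optional[str]:
        # Give one identifier, get back the same edition's identifier in another scheme.
        edition = self._by_identifier.get((scheme, value))
        if edition is None:
            return None
        return edition.identifiers.get(target)

    def sibling_titles(self, scheme: str, value: str) -> List[str]:
        # Other editions recorded for the same work, in insertion order.
        edition = self._by_identifier.get((scheme, value))
        if edition is None:
            return []
        return [e.title for e in self._by_work[edition.work_key] if e is not edition]


xwalk = Crosswalk()
xwalk.add(Edition("Example Title", "work-1", {
    "isbn": "ISBN-EXAMPLE-1",      # placeholder values, not real identifiers
    "oclc": "OCLC-EXAMPLE-1",
    "openlibrary": "OL-EXAMPLE-1",
}))
xwalk.add(Edition("Example Title (large print)", "work-1", {"isbn": "ISBN-EXAMPLE-2"}))

print(xwalk.translate("isbn", "ISBN-EXAMPLE-1", "oclc"))
print(xwalk.sibling_titles("isbn", "ISBN-EXAMPLE-1"))
```

A real catalog would need more levels than a single work key (print runs, translations, adaptations) and would have to cope with conflicting identifier claims; the sketch only shows the lookup-and-group shape.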
The library world is taking some tentative steps toward that, but they've mainly been really focused on the physical copy of the book they have on the shelves, and so we're going to have to come up with new and different kinds of relationships between books to represent that. And we wanna store that all on our website, so you can jump from a physical book to the audio CD to a movie based on it, keep that all in one database, and keep the relationships between them structured. We also wanna let people add new kinds of relationships, like this book was inspired by that book, or this book is a rebuttal of that book, this book shows some serious errors in that other book, or this book is a new edition that should replace this old book. We wanna be able to store all those links between books in one website.

In terms of getting books to people, we're also experimenting with print on demand. We wanna be able to let you print out a book that we've scanned and that is available in full text, and have it mailed to you, so you can read it in physical form. We also wanna do scan on demand, which is the reverse side of the equation. We want you to be able to find a book that seems interesting in our catalog and, if it's not scanned, click a "scan this book" button and maybe pay $20 or something, and have someone go page it off the shelves, bring it to the scanning center that's nearby, scan through the book, and send you a PDF.

So the big vision we started with was that this would be a true library of ideas, a place where you could go and find an interesting book and jump from there to all the other books by that author, or other books on that subject, or rebuttals to that book, or reviews of that book. Just be able to float through this vast web of interconnected ideas that's represented by books, and make that really visible and exciting to people.

So the immediate question people ask us is, oh, isn't someone already doing this? So the major candidate is Amazon.
Right now when you want to link to a book, you link to Amazon. The problem with them is that they're basically a bookseller, and they give you a lot of information about selling books but little beyond that, and they really don't have much good data on stuff that's out of print, which is the core of what's really interesting about libraries and what's really unavailable on the web right now. Google, which I'm sure we'll talk a little bit more about later, has been going through and scanning books and collecting library catalogs and making them available through Google Book Search, but they have very few community features, and if you look at Google, they've never been good at building community features. The closest they came was Google Answers, and they shut that down eventually. And of course there's WorldCat. OCLC is a non-profit, and they run the site called WorldCat, which aggregates library catalogs from a bunch of different libraries. Unfortunately, their business model depends on selling this data to people, so they've been very hesitant about making it available for free online and letting people reuse it. So we really wanna be this public resource that's not controlled by any particular group, that everyone can use, unlike these other organizations.

So that's the big vision. I guess I'll pause for some questions about that before going on to tell you about how we've done so far. Is anyone confused, or think it's a bad idea?

Right now we're English language only, but obviously internationalization is a huge part. We're trying to get library catalogs from other countries, and we wanna be able to translate both the content of the website, so a summary of the book should be able to be translated into different languages, as well as the interface, so people can browse it in different languages. But we don't have that yet.

Sure. So the plan is to start with just monographs, as in books.
Serials is the next big task we wanna handle after this. It's a little bit harder because it's so much more complicated, in that with a monograph you basically get the title and the author and you're done, whereas serials are in these vast sets over time, and each individual article is interesting and people wanna read it. But that's the next big challenge we wanna tackle after this. Wendy?

I'm curious about what you're doing with the fuzzy connections. Is West Side Story an adaptation of Romeo and Juliet? How is the score connected to the performance connected to the video?

Yeah, that's an interesting question, because the library world's kind of binary, right? The systems usually only have one way to enter things. Right now we've mainly tried to adapt to that by having lots of different kinds of ways you can connect books, but we haven't really done anything fuzzy.

This is a similar question. In addition to storing categorizing systems, will you allow users to make up their own categorizing systems?

Sure, so we wanna have tagging and that kind of user-contributed cataloging as well. In the same way that we're gonna have to have a new identifier scheme to represent books in the Open Library, tagging is kind of our new category scheme.

It'd be interesting to know what you would advise the head of this library if they wanted to play with you in the sense and maybe the answer is so we have more than journals. Every loss of the library could help on journals. You know, you might have to do it. You know, it's what I suggest with you but maybe before you close I'd love to have your sort of free consulting in person to share your environment as a crowd for how a huge and helpful library could contribute to your question.

Sure, so let me talk a little bit about how we've done so far in each of these goals.
This was kind of the grand vision that we set out for ourselves, but as you might imagine, collecting a page for every book in the world has been hard, and so I can talk a bit more about how far we've gotten and how people can help fill in the gaps.

The software, as you can see, has been working really well. We thought it would be hard to build something that combined this kind of database structure with the flexibility of a wiki. It had never really been done before, and on day one, when we started importing books, we got six million from the Library of Congress, and that's bigger than any wiki that exists so far. But that's working really well. I'm really excited. We imported six million books. We've got around 10 million now and it seems to be working fine.

So, for a book we need a catalog record. So, we've got the catalog records from the Library of Congress. For full text, we only have around 400,000 so far, and we get the full text ones mostly through the Internet Archive's OCA scanning project. So, the Internet Archive builds these scanning machines. They truck them out to various major libraries. I think they have like 70 or something now. They have a lot of them. The Boston Public Library is one, and the Library of Congress and the New York Public Library and Toronto and so on. And then people fund them. They take books off the shelves, run them through the scanning machine, and they get uploaded to our servers and converted into PDFs and so on. We've been trying to work with Google and with other people who have copies of scanned books, and if Harvard has some scanned collections we'd love to include those as well, to try and bring all of those, where full text is available, into the Open Library.

For catalog data, publishers have been really helpful. They want to promote their books, so they send us lots of data about in-print books. Bookstores have also been pretty helpful but, again, mostly focused on in-print books. For libraries, hello.
For libraries, it's been a little bit harder persuading them to give us their catalogs. They've been kind of hesitant. It's been great getting six million from the Library of Congress. We've gotten another five million recently from the University of North Carolina, but we're still trying to push them, and just getting them to contribute data to our site would be really helpful, because it's that kind of data that allows us to do things like find in your local library and stuff like that.

For reviews, we're trying to scan review indexes, and this is another place we need help: if people have copies of review indexes digitized or on CD-ROM or something, we'd love those to be contributed. It's been really hard tracking down copies of these. Identifiers have been kind of a mess. Nobody really wants to share.

Scan on demand is making some good progress. We've got the University of Toronto and the University of North Carolina signing up, and so basically we've got this system where you go on the web, you pay 10 cents a page and a $5 fee, and they go and get it off the shelves, and they can turn it around within a couple of days now. And they remember that you paid for it: we put a little bookplate in the online version. So we think once we get that running smoothly, that'll be really popular.

Print on demand has also been pretty exciting. We got number four of these on-demand book machines in our offices in San Francisco. It's this huge machine that takes up half the room, but you feed some paper and ink and some glue in, and out comes a fully printed, pressed book on the other end within five minutes or so. It's really amazing to watch. So we have this machine just cranking out books. They're gonna send another one of them to New Orleans to reprint the New Orleans public library system that got destroyed, taking the hundreds of thousands of books we've scanned and just printing out copies of them and filling the shelves again. So that's been really exciting. But we still need a lot of help.
The biggest thing is with data. If you have data that you can contribute, about catalog records or about books you've scanned or reviews of books, anyone who has data about books, we'd love your contributions. If you don't have data but you have good social skills, we need help calling people and haranguing them until they give us data. If you can program, we need more coders. It's a vicious problem trying to reprocess the data and munge it into some form that's usable. And if you love books, soon we'd love your help curating and collecting them and annotating them. So I guess that's basically my progress report. If people have questions about the bigger project or about any particular aspect, I'd love to take them. That's it.

So where do you fall in the discussion about interlibrary loans for books in copyright?

So obviously right now you can do an interlibrary loan for books in copyright, and you can get the physical book sent to you, and often you can get a photocopy of the book sent to you. What we wanna do is digital interlibrary loan, so that we can scan the book, save a copy, and send you the PDF. And we've talked it over with some publishers and they seem okay with it. We've talked it over with some libraries and they seem okay with it. And there are a couple who are gonna go ahead with it and have us as their scanning partner and do digital interlibrary loan for books that you can't get in a bookstore.

They're out of print, right? But they're not yet out of copyright.

Yep.

I guess what I'm hearing is really interesting, because it sounds like you were saying that the publishers are having an okay time with what traditionally they've had problems with, which is digitizing this information and sharing it. But then you're getting resistance from the non-profit libraries. Can you maybe explain what's going on with both of their interests?

So yeah, the publishers are happy because this is yet another way to promote their books.
People come across it on the internet and they get interested in a book and the publishers sell another copy. And the publishers have a system set up for promoting their books. They have a system of ONIX feeds, as they're called, these XML feeds that describe their books and go to the major bookstores and so on. The libraries have been much more difficult. Mostly it seems just because of the complicated bureaucracy, and because they're worried about legal issues, they're worried about whether we're trustworthy. And so it's just been this long hard slog to persuade them to give us records. And I mean, there are a lot of librarians here. If you guys can give us advice on what we're doing wrong, or what the right way to approach libraries is, what we're missing, that would be really helpful.

What you've described so far has been mainly US related. Are you making an international push as well?

We wanna do an international push. We have better contacts in the US. We have some contacts in Canada. We're trying right now in India to get a collection of Indian books. We have a couple of contacts in some European countries to try that, but basically we just don't know people in the right areas. And so we're looking all the time for international people who can help us with it. But, you know, we've had enough trouble dealing with US libraries that dealing with other countries is a bit beyond our grasp at the moment.

Are there other libraries with consumer software?

The main people we've been working with is this site called LibraryThing, which is a popular web-based version of something like Delicious Library, where you upload a list of books and it builds you a catalog. The major programs like Delicious Library and so on all kind of pull from other library records. So we've been thinking, instead of trying to aggregate those, we go back to the source and try and get the data that way.
What are you doing to sort of reach the social tipping point that I think you'll need? Because I feel like there's a lot of effort towards getting the data. And boy, there's a lot you can do there, right? But I mean, if you look at the change log, almost nothing's going on. All of the effort seems to be on the outside of the application: making the application, talking to people.

Yeah, so the plan is to do it in two phases. One is getting the data into the right format. We need to be comfortable that the data is up there in a way that's stable, so that when people try and make contributions, they won't get lost the next time we try and import new data or change the format. And we're just not at a point where we can declare that kind of stability, so we've been holding off on encouraging new contributors. Once that happens, obviously, a big part of our push is gonna be trying to find people. And we've looked at the history of Wikipedia and of other wiki sites to try and figure out tricks for bringing people in and getting people to contribute. I think a lot of it is, as I suggested, people will come in through Google. There are an enormous number of people searching for books, and there's very little data about a lot of the out-of-print books. So if we can start to pull those people in and get them interested that way, we're hoping we can build a community through that.

Another question: what's the financial plan? How is the money working on all this?

So right now, we're mostly funded by the Internet Archive. It's a nonprofit in San Francisco that's been doing book scanning and other projects. We've gotten a grant from the California State Library Commission, or whatever it's called, to work on it, and we're applying for more grants for funding.
The hope is that long-term, it'll start to be self-sustaining through things like affiliate revenues from sending people to bookstores and stuff like that, and through a little bit of money we make off of things like scan on demand and print on demand. But for now, these startup costs are being paid by these foundations.

The tension between being the glue that connects various other tools and what people are doing with books, and wanting to do it yourself and wanting to have the content on demand in your office.

Yeah, I mean, there's this constant tension, because a lot of the other people on the project say, oh, I want to do that part, or I want to have this bookshelf feature, and I sometimes have to say, no, let's hold off and let other people develop this one. The fundamental constraint is we don't have the resources to do all of it ourselves. There are some things where we think we can add real value, like scan on demand, because we have the scanners and the library contacts at the Internet Archive that can make that kind of thing happen. With print on demand, we have a print on demand machine, we're one of the few people that do it, and we got it because of some other projects that we're working on. So we might as well use it, but we're also going to link to Lulu and other people doing print on demand operations.

Apart from the URL, I don't see a unique identifier.

Yeah, we're still working on picking the identifier system. I mean, right now the URL is our unique identifier, but we're trying to get feedback on what the right identifier is for people's different applications, and in the next round I think we're going to use that instead. And of course, we have an API so that you can query by all sorts of different things and get back, you know, various library records.

So Ferber, Ferberization comes from FRBR?

Yes.

Probably due to infection.
All my friends with kids get totally confused when I talk about Ferberization.

It's incredible. Kids cry, is that right? Structured crying. Structured data. It's very similar.

It's just we let the books cry at us until we feel sympathy and catalog them properly.

The crying. So FRBR is a fairly well structured set of metadata concepts that goes, I should say, from editions all the way up to broader ideas of what constitutes a book.

Right. The canonical example is Hamlet.

Or collections even. Or, yes. So is Hamlet the particular edition? Is it the large print edition? Is it a particular print run, and so forth? So there's this fairly structured thing in existence. And then I think it was Wendy who was asking about all of the messy relationships that one might create, such as "based on," you say, "refutes," "is a parody of," and so forth. So, sorry, a long introduction to the question, which is: it's easy to see how a centralized authority could sit down and come up with a FRBR that would be useful. And it's easy to imagine that people can make up whatever tags, whatever relationships they want, whether it's "a parody of" or "is funny if you know about." It's harder to see how you're gonna get the right degree of structure and coordination around the set of relationships, so that it's predictable what people are gonna see. So what's your plan for doing that? Or do you just want to see what emerges?

Yeah, I think, you know, it's kind of a combination, right? So we're gonna obviously have some built-in ones like "refutes" or "is related to" or "you probably would also like." But after that, we want to...

How are you gonna decide on those?

We pick our favorites, you know. We want to just kind of seed it with a couple that seem interesting to us.
And then after that, the database software and the wiki software is very flexible, so people can add their own fields, they can add their own tags, they can add their own relationships, and we'll see what happens. When we get a user community, our hope is that it will end up like Wikipedia, where people build a community and say, this is the right way to categorize this kind of object, this is the right way to connect these two things, and hopefully the same thing will happen with us. If it doesn't, we might need to hire some people to push things into line.

I was gonna say, you know, this goes somewhat to something David had a post about, Hamlet, where, you know, maybe the answer in these circumstances is not to impose the same sort of binary relationships that exist in other library data. Like, for example, in LCSH, something either is or is not a love story. Well, that's absurd, right? Things are more love stories and things are less, right? And the connections between books, whether Romeo and Juliet is like West Side Story, is also a similarly nuanced thing. And it would be cool if some system could capture that. You know, appropriately.

Yeah. If you know how to do that, that would be great. It's really tricky.

It doesn't, to some extent, because you can't change other people's tags. On LibraryThing, we just introduced a feature where we have a fielded wiki for putting in data, right? And the first thing that happened is people said, wait a second, your gender field doesn't have enough options. Wait a second, you're asking for BC/AD, I want BCE/CE, right? And the problem with cataloging is that someone wins, right? And you can offer more options. Like, well, we're gonna put in multiple genders. At some point, someone wins. And librarians really, really like to win.

That is an interesting problem.

Oh, that's not the librarians, it's the authors. Take David Smith. Which David Smith?
I mean, if you're David Nathan Smith, you want that and the dates in there, and you want to be specifically identified, and you don't want to get mixed up with other people, but it's hard to do that if everybody's just putting in the data. And some people don't know that it's David Nathan, they just know it's David Smith. And then the data gets messed up. So what do you want? Clean.

But there really is an answer to who wrote the book. There really isn't an answer to what is the gender of this transgender author.

Right. Why do you need that?

Well, because you want to know who are the female mystery writers in Nebraska.

You don't?

Yeah, I mean, the idea of letting everyone kind of pick their own answers is interesting. I don't think anyone's really tried a wiki that way. For the particular problem of making sure you get the authors right, what we want to do is have a kind of autocomplete dropdown. So as you type an author's name, it'll say, oh, did you mean the author of this book, or the separate author who was born this year and wrote this book? And hopefully that'll get people to pick the right author. So we've been working on that lately. It's not quite finished yet, but that's our hope. But as for letting everyone decide what someone's gender is, I mean, SJ will probably comment on this, but Wikipedia's kind of picked the system where you have a discussion and come to a consensus, and then you pick the answer that everyone can sign on to, as opposed to letting everyone pick their own point-of-view Wikipedia and write up the story from their perspective. I think it would be an interesting experiment, and I guess there are some wiki clones that are trying to do that, but we're already taking on so many challenges that that one would throw us over the top.

I was gonna ask a different question, which is whether you're hot linking to any databases right now.

Hot linking to databases, meaning?
Either they're drawing their data directly from your databases.

No, right now it's all imported, just because when you have 10 million records, if you want to do a live query every week to get it up to date, it's difficult. Things like price records we're gonna do on the fly, because they change frequently and because nobody really wants to query too much on them.

What? Because Phil G likes that a lot.

There is an answer. There is a best answer to the question of the gender.

Oh, I know both, can you play it out for me? There's another inside joke, which includes me.

Please.

Sorry, I first met Aaron a long time ago at a summer event around our stage there. Phil Greenspun, who some people in the room know. He's a luminary around Cambridge. He ran this company. We built some social software for building communities online, and one of the canonical experiments to see whether you knew how to use that software was figuring out how to automatically calculate the price of the biggest book that you liked by drawing from it in this office.

Kind of, thank you. Sorry, I'm sorry.

Makeup?

Frequently, in the context of worker discussion on Wikipedia, the answer to "we don't know what the correct gender is" is, well, we're phrasing the question completely wrong anyway. We're going to create a new way of cataloging this, or we're just going to introduce it differently, by writing a little note that explains that gender in this particular situation is less meaningful. There's a whole class of solutions like that which are easier in the context of Wikipedia because it is unstructured, and you can have ways of addressing or answering the question that are unstructured and one-off ways of doing it.

But the advantage of having a fielded wiki is that you get the structure of it.

That's right, that's right. And you want to know when the query you were making was poorly framed for the results.

That's right.
The kind of question is just. That's an extra bit of data. You're right that people collaborating in Wikipedia are very good at answering this question in some way. But that's in part because the types of answers in that context are less constrained than the world in this situation. So this was one of the problems we faced at the beginning: where to find the balance between structure that people can reuse and put into databases for other purposes, and flexibility like the kind you can see on Wikipedia, where you can completely revamp the page and make it look totally different. And the compromise we came to was that things would be structured, but you could change the structure on the fly. So. For the individual entry. Yes, for each individual entry or for a class of entries or something. So this is the structure for Making Comics, which is just a normal book. This is the thing that generates that page. It's a little ugly at the moment, but basically you get to say each property and give it a name and tell it what type it is. So, you know, right now DOI is a string and the source is a string and there's some dates in here somewhere. Doesn't it make it less useful for computers that are trying to access the database, because they can't predict when author's name gets changed to name of writer? Well, so, yeah, the hope is that people don't change the names of the fields, but that they add new fields, that they put an asterisk next to old fields and say, well, gender doesn't really make sense here. That might be the answer: there is no answer to gender in this situation, so it's not going to return to me. Right. Perhaps. You take that field out for that particular example.
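As a rough illustration of the "structured, but you can change the structure on the fly" compromise described above, here is a minimal sketch. It is not Open Library's actual code: the `RecordType` class, its method names, and the `TYPE_CHECKS` table are all hypothetical, chosen only to show properties that have a name and a type and that can be added, removed, or renamed per class of record.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical mapping from schema type names to Python types.
TYPE_CHECKS = {"string": str, "int": int}


@dataclass
class RecordType:
    """A per-type schema: each property has a name and a type name."""
    name: str
    properties: Dict[str, str] = field(default_factory=dict)

    def add_property(self, prop: str, type_name: str) -> None:
        if type_name not in TYPE_CHECKS:
            raise ValueError(f"unknown type: {type_name}")
        self.properties[prop] = type_name

    def remove_property(self, prop: str) -> None:
        # Dropping a field that makes no sense for this class of records.
        self.properties.pop(prop, None)

    def rename_property(self, old: str, new: str) -> None:
        # Rebuild the mapping so the renamed field keeps its position.
        self.properties = {
            (new if key == old else key): type_name
            for key, type_name in self.properties.items()
        }

    def validate(self, record: dict) -> List[str]:
        """Return a list of problems; an empty list means the record fits."""
        problems = []
        for key, value in record.items():
            type_name = self.properties.get(key)
            if type_name is None:
                problems.append(f"unknown property: {key}")
            elif not isinstance(value, TYPE_CHECKS[type_name]):
                problems.append(f"{key} should be a {type_name}")
        return problems


# Field names echo the ones mentioned in the talk (DOI and source as strings).
book = RecordType("book", {"title": "string", "doi": "string", "source": "string"})
book.rename_property("title", "name")
print(book.validate({"name": "Making Comics"}))
print(book.validate({"title": "Making Comics"}))
```

The second `validate` call shows the concern raised from the audience: once a field is renamed, a client that still sends the old name no longer matches the schema, which is why consensus on property names matters.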
But yeah, the hope is to, you know, make the schema reconfigurable on the fly and hope that, as with tagging, people come to consensus about what the properties are named, but they can change what properties they use in each particular case based on what it is. So, for example, if you're categorizing Bach things, there's a special Bach number that only makes sense for things written by Bach, and you want to be able to add that to just that class of records. And similarly, for books by people with confusing genders, you might want to take out the standard gender field. Would there be any value, or merely confusion? You're solving the multiple categorization scheme issue by accepting them all, or as many as there are: Library of Congress, Dewey. Would it just be too ridiculous, when people cannot decide, in some edit war over metadata, to allow multiple metadata pages for an entry? Does that just make it? I mean, you know, they can certainly do that, right? We can't stop them from creating new pages. I don't know if that's the solution that people are going to be happy with, but it might be. I want to ask Greg Crane a question. You did something like this 25 years ago with a finite corpus, right? If it was today, would you do it this way? Or open it up to the world? That is the big challenge. We are working on it. I think I can claim responsibility for kicking the OCA vigorously at last year's meeting to open up this books-on-demand scanning thing. And it's worked extremely well at Toronto, I have to say. So the question is, how do you get things out there? How do you allow people to pick what they want to build as collections? And then how do you bring to bear the specialized expertise that actually adds the real value? Because in fact, nobody's really interested in books.
They're interested in logical units, which may sometimes overlap with books, but they want a chapter, they want a poem, they want something which is inside, and usually subsets from multiple different sources at the same time. But most of the expertise in the world is distributed. So we're working on FRBRization, and we're working on adding value on top of an OCA-type collection and integrating that with other structured materials. And so the big question is how to effectively take advantage of this distributed labor. It's hard, and it's a tricky question, but clearly, from my perspective as a scholar, that's not just the means, it's an end, because we're redefining the relationship between what you do in the academy and the world. So Wikipedia is the dog, and the academy is the tail. How do you integrate professional academics with this massive populist movement and have a more interesting balance? But I agree with you that it's not done by picking, say, monographs or something. It's done by picking a genre, whether it's law or science fiction or classics or the classical world or something, but you want books, journal articles, photographs, maps. And this gets into the issue that it's not just books, it's objects. So we've been working on the integration of FRBRoo, the object-oriented version of FRBR, with the CIDOC CRM, which is the European ontology for dealing with museum artifacts, essentially. And we have always had, for 20 years, the problem of having museum objects and textual objects in the same environment, and that's the real-world problem. To some extent you're reduced back to the 19th century when you're dealing with books in isolation. And you also need more powerful systems; this is really cool in that you can actually add more powerful systems.
So we have the classicists who came up with what they call the Canonical Text Services protocol to define, you have chapter and verse of the New Testament, for example. You may have an enduring coordinate system to describe the contents of a textual object that is the same from edition to edition. How do you describe that? Because that actually describes the logical structures that you want to work on. Shakespeare doesn't do that, except they have the Globe Shakespeare, semi, they have a few standards. But there's the general issue: how do you add more structure within the book? It looks like you could probably do this within this environment. Yeah, that's the hope. And that's certainly gonna come up a lot with journal articles, right? 'Cause you wanna be able to structure them at many different levels. And similarly with music, you wanna be able to point to, like, songs within an album, within a collection of works. But right now, we have our hands full with just picking up monographs. So we've kind of been planning for that, but haven't done it yet. I mean, there's the problem where we actually had, we had obscure languages of a feather. So we put in Syriac, we put in Old Norse, we put in Sanskrit, we put in Greek, we put in Latin, by the OCA. And of course, one Sanskrit guy from Brown said, oh, they got the wrong edition. I wanted the seven-volume dictionary and I got, like, the one-volume addendum. They're idiots, they don't know what they're doing. And then we said, well, now we gotta fix this. But the mechanism was not in place. And there was some discussion among the scholars as to how you would go about actually adding the value that you need to give you that precision. And that's of course your general issue anyway, just viewed from the standpoint of those academics who, when given things for free, just complain.
Yeah, I mean, I've been in these scanning centers and it's pretty clear people there don't read Sanskrit. You know, basically, trucks of books come in, right? And they do their best to figure out what the hell book it is. They scan it as quickly as they can and put it back on the shelf. Our hope is that by opening it up like this, there are a lot of really committed people, like scholars, who will say, no, this edition has a different cover and it's got this mark here and, you know, it's a completely different book, so let's make that really clear on this webpage. And we hope that by letting them contribute that data to a public website, we'll be able to collect a lot of that knowledge from book lovers and book collectors and academic scholars who right now, if they send an email to the OCA about it, just hear, well, sorry, what can we do? You know, problem happened before now. It's important to realize that, at least in my experience, the library community resists community-generated data. In fact, I argued that we should allow people to have a little book cart in Widener Library: drop a book you want scanned, from whatever library, on the cart, and they check to see if it's in copyright, and if they can do it, they scan it. I was told that was bad because only librarians understood collection development. I was actually told to my face that I didn't understand how to build a collection. And similarly, how do you create your authority lists? How do you create all this data? So there's a culture, which has logical reasons behind it even if it seems to produce illogical effects, that you have to deal with. You're dealing with that when you're trying to get metadata and you can't get it. Right.
I mean, right now we're just working on getting it from the libraries; giving it back to the libraries in a way that they'll accept seems really hard. I mean, you know, libraries insist you follow the Anglo-American Cataloguing Rules, right? Which is this, like, 500-page book about how to properly capitalize Thai names and every weird edge case you can think of. And you're just not going to get people on the internet to do that perfectly. But here's the fundamental problem: even in a professionalized field like classics, there's a finite amount of material. Every Greek and Latin author, even a little fragment, has been cataloged to a high degree of precision, but nobody bothered to integrate that with the Library of Congress, the LC Name Authority File. So it's not Cicero comma Marcus Tullius. It's Marcus Tullius Cicero or M. Tullius Cicero. And there's no integration between the two. So you have a broad system which covers everything, but at a thin level, so only the most popular authors. And then you have an incredibly dense, very powerful knowledge base that covers everybody, but they don't interact. And how do you go about marrying the two back together? Well, we're doing it. I don't know who's ever going to take our records, but we're creating metadata that hopefully somebody will upload into the system. But it's not clear what the center of gravity is. Right. And I mean, you know, long-term, we'd love to be one of those places that integrates that data, by having a consistent identifier scheme and by letting anyone contribute and tag things. You know, it's not going to be overnight, right? It's going to be decades of kind of persuading people that the internet is okay, that letting people share things is good. But eventually I think there is going to be some big collaborative website.
And hopefully it'll be this one where that kind of stuff gets integrated, not just for books, but, you know, there's lots of areas of science where all the specialized knowledge is really distributed. I think what's significant about your work is the fact that it is open. The idea is that it's to be open and editable, and thus it doesn't have to be yours. Right. It's your principle that you are making available, so it can go any way it likes. And that's what's really important. Yeah, I mean, you know, the source code is up there, the data is up there. If someone takes the site and makes a new version on the same principles that completely outclasses ours, I'll be happy, because that means I can go do something else. I have my father's library, and I have a stack of unpublished books that don't have ISBN numbers and some papers that are kind of bound together. And I figure I want to scan them and upload their metadata somewhere. Can I stick those into the Open Library? Yeah, if you scan them, you can create a record and link to the scan. I mean, you know, there's a bit of a question about what stuff belongs in the Open Library, but right now we're not going to start kicking stuff out. We have lots of servers. We don't have to start taking things out. If worse comes to worst, we can start tagging which things people think are real books and which people think are, you know, manuscripts and documents, and then hide them from annoying people so that they don't complain. In theory, it could become a self-publishing mechanism. You could put your book in there and then order copies from your delivery system. Yeah, I mean, some people have actually wanted to do that, right? You upload your books to the Open Library and then you just link people to the page and they order it through print on demand, right?
And long-term, I think we'll start to see people do that. We're seeing more and more books published only digitally, in PDF files, and with a one-stop solution for getting them in the catalog and getting them printed out for people, I think that's going to become more attractive. And as soon as that happens, don't the pages get spammed? Because I've just written my great Hamlet Meets Einstein, and so I've added it to the Hamlet page. Well, it's the internet, everything gets spammed, right? Mark of success. Just means they love you. You know, and the thing that comes with spam is very dedicated spam fighters, who are just incredibly pissed off by spammers. And that's worked very well for Wikipedia. A lot of people don't realize this, but the reason Wikipedia's pages are so spam-free is because there's a group of, like, 50 people who sit and watch every edit to Wikipedia and say spam or not spam. Like, they have it as their screensaver, you know? So, yeah, I mean, we're not going to have anywhere near the volume that Wikipedia has any time soon. You know, they have dozens of edits every second. But I think we'll have fewer edits and fewer users, and we'll still have these vigilantes who will go out there and keep it from being too commercial. Can I ask a question of the librarians? We've heard from several of you, but what do you think? And if you like it, what are the opportunities for helping with this? I think it's fascinating. I'm not sure how it can help us. It would be good if libraries shared their data more. We're certainly talking in the library field more about sharing among ourselves and also trying to integrate into this wider world.
The library community is now engaged in rewriting the Anglo-American Cataloguing Rules, as was referred to, into something else, which it is hoped will be more able to integrate with other metadata communities. How successful that's going to be is a question, because of the desire to sort of maintain the legacy of traditions and data and all that kind of stuff. But we're trying to figure out where the future is taking us, and this might be one of the many directions it's going. So what's that mean? I mean, for bibliographic information, we pay to send our data to OCLC. We pay OCLC, so I'm not paying anymore. I don't understand why they wouldn't give it to you. They let Google. Yeah, OCLC won't give it to us. But do you want it in a certain format? Do you want only the monographs? We'll take whatever we can get. You know, we have programmers, so whatever the format, we can convert it into something useful. We just want the data and, I mean, we'd be willing to pay. I mean, we're not going to charge anything, obviously. We'd be willing to contribute time or money or whatever it needs. But just getting through the library bureaucracy has been impossible. It's always, oh, you know, I think we need to check with legal, and it's too complicated, and I mean, I just don't know what the trick is for. So you need to find the right person at OCLC? Yeah. Because I mean, it's better just to deal with OCLC rather than a whole bunch of individual libraries. It would be, but I mean, we've talked to OCLC at fairly high levels and they just won't give us the time of day anymore because, you know, we're going to be this open repository of library records and they want to keep control of that. And they say, well, we're happy for you to send us everything you have, but we're not going to give anything back, which is really too bad since they're a not-for-profit.
You know, they're supposed to be pursuing the same mission we are. Library records, I don't think can be catalogued anyway. Copyrighted. Right, they're not copyrightable. OCLC has some regulation. Like, they get libraries to sign contracts about how they can use OCLC records before they give them to them. Actually embracing that, I mean, I see the greatest thing about Open Library as being that it is an OCLC killer. And that is the greatest thing. You can take the records down. If you get enough people putting the records up, then libraries won't be paying for their catalog records. They never should have been in the first place. And I've seen you talk about this in a way that suggests you don't want to say that outright. Right? But yeah, I'm bound not to say that our goal is to kill Google or OCLC. So I didn't say that. More on that. But I mean, why not just explicitly say, look, you know, the enormous value here: there's all these libraries that are not gonna have to pay for cataloging records. Most libraries don't actually care that much about the quality of the record anyway. There's lots of small libraries in New Hampshire that aren't gonna pay up off the record. Yeah, I mean, they can't afford OCLC's fees, right? I mean, like, you have to pay tens of thousands just to be a member or something. And by having a- No, no, you go through a local. You go through a local consortium? But I mean, we'd love to be a free source for records. Well, I think somebody has to create the records; in fact, it shouldn't be a free way. Though, you know, everybody's trying to figure out where to get them free, but who's gonna create them? Well, I mean, they have been created. When new books are created, records are created by the publishers, by the Library of Congress for Cataloging in Publication. They're created, you know, by the libraries contributing to WorldCat. You know, they're created by librarians all over the place.
And it's not like the librarians wanna lock these records up or charge for them. You know, they're part of a public service mission to make books more available. And since it's not copyrightable, if we can get our hands on the catalog records, we don't have any legal restrictions on redistributing them. They're just facts, and facts in the U.S. can't be copyrighted. So we don't think it'll be too hard to use the data if we just get a couple of libraries to agree to contribute their collections. So, OCLC provides a number of services other than taking in data, packaging it, and shipping it back out. In fact, they employ a lot of people to do those intermediate steps. It seems like those kinds of steps, including some culling and that kind of thing, are ones the Open Library is going to need, wherever they come from. And framing the discussion as how this really awesome cooperative can become something different and better in an internet age might help them. I think that being an OCLC killer, in the sense of killing this pretty cool institution (I like it, sorry if I'm breaking the stone), that's wrong. But framing it in terms of, look, a lot of things that you've been doing, you now get for free, and there are lots of new things that people need to do, and we need experienced librarians who have exactly the skill set that your staff has developed over the past couple of decades: that could be pretty neat. So maybe the funding that they have is on a much larger scale than the kind of funding Open Library is getting from various sources. I'd like to see that discussion in public more. Yeah, I mean, I'd love to discuss this with the OCLC people more. I think they're in an unfortunate position.
Their mission is the same as ours, but they've built this huge office complex and hired lots of people and started buying up companies and other collections like RLG, because they have this enormous revenue stream from the cataloging records, and they've kind of gotten away from the public service mission. I think one of the things that's been valuable is we've kind of pressured them to be more open, and since we announced, they have this new worldcat.org where you can view OCLC records without paying one of their exorbitant service fees. And so we're hoping that we can keep the pressure on and make them more open, but unfortunately, there's just this kind of shift in mindset between the old generation, where you have the huge headquarters in Dublin and you charge carefully to the libraries who are in the club, and the one where everyone on the web is now caring about this data and wants to contribute to cataloging. I hope we can work together. I still hope, but. So this is to the librarians here again. Please, yeah, I really want more. A, what is stopping you from putting your hand on your heart and saying, we will deliver our catalog records to Open Library? And B, what would it take for the vision of Open Library to become the standard, to have moved away from OCLC? What would Open Library have to be? And the first question is obviously way more urgent and timely. Why not just hand over those? Because the records I have in my catalog came from OCLC and came under a license arrangement, so I can't give you those. Because I took them from Columbia or Switzerland or some other library. Do you know which one can go? Not easily. Yes. Well, so we've looked at the OCLC contract and it has an exemption for giving the records for which you own a copy of the book to another library or a non-profit. And we're both a non-profit and a registered library. So according to the contract, it seems like we can legitimately get a copy. Oh, are we okay, good.
So the loopholes, yay loopholes. So it's a non-trivial undertaking to put the records into some deliverable form, but it could be done. Or would you be willing to help out with that? Yeah, if you can just point us at it; I assume it's in a cataloging system. Do you just want a one-time feed, or do you want a regular feed? Just a dump of all the MARC records or something like that would be great. But the MARC records include a lot of journals. That's fine. We can take care of that. I mean, it'll come in useful. You don't want me to start tackling journals. Well, I mean, I can talk with big Harvard about it. It would be hard. Harvard runs the catalog, so I don't know. But I'm not sure, you know, where does it take us? So you get a lot more records. So that's the second question. I mean, do you have any non-copyrightable books in your system? I mean, you're just showing us Making Comics so far. Right, yeah, no, we have a lot of out-of-copyright books, and about half a million of them have been scanned. Let's see if I can find one with full text. I mean, I'm not particularly interested in promoting in-print commercial books. You know, I mean, Amazon makes an office as far as I'm concerned. I agree. But The Common Law, if you have Oliver Wendell Holmes, The Common Law. I know that's free on the web. So here's an example of a book that we've scanned. We put it in this little flip-book interface. But yeah, the reason we have books from publishers is because the publishers are happy handing over the data. The reason we don't have more out-of-print books is because it's been such a struggle getting them from libraries. You know, all the people at the project care more about out-of-print books than in-print books, and we want to weight the search engines so that those come up more. We want to bring those to more people. The goal is to pull people from the latest hot thing to the older, more interesting things that have been hiding on library shelves for...
But if a library has scanned an out-of-print book, why wouldn't you just link to it? Do you need to re-scan the whole thing or do you want to put a TIFF file? No, no, we'd link, yeah, we'd love TIFF files and we'd love a copy so that we can archive it. But we are linking: Stanford scanned a bunch of books and other people have scanned books, and we're linking to all the books that have already been scanned. You know, we don't want to duplicate effort on scanning books at all. Again, that's another place where we need records, so that we know which books have scanned copies. There, isn't there a risk that it would sort of annoy if there was something that would upvents and make you in the future? No. Is it you need or no, can't do anything? No, I mean if it's not against our contract, I don't... It would annoy who? OCLC. Oh, okay. I mean, the only thing I can think of is that they're probably planning on doing something like this themselves, or maybe thinking that they might be planning something. It's hard to see them doing that. You tell me. I don't know. What, to what? There was a WikiCat project, but as far as I know, it's gone. We talked to them and they sent us what they had so far, but yeah, it's kind of dead. Cool idea, they just didn't... Follow through. Maybe you can just get that to you. I think they've all got different jobs. I think we tried talking to them in there. I mean, that was the idea, for them to have some subset of data that was world readable, and everyone had access to comment on it. I mean, if you want bibliographic data and you really want a big file, the bigger the file, the cleaner the data has to be, or it just gets harder and harder to search accurately. It's the people that help you. Yeah, because I know when you're searching for that file, is there anything to do with it? Yeah, we have some microfilms, Karen. You'll get the larger data that you're working with.
Yeah, and that's why one of the things we want to do is get review indexes and stuff, to try and figure out which of the books people have been talking about throughout history, which ones got reviewed in the New York Times in the 1900s, and weight those more heavily than agricultural deposits. Also, by getting library catalogs, we'll see which ones are more widely held. So that's another indication of quality. But we have a lot of people working on this, trying to improve the quality of the records, trying to scrub the records, and trying to improve the search engine and things like that. But the plan is to do that by getting more data and integrating that into our algorithms. Right? I'd just like to ask you to speak up a little bit so that I can hear you better back here. Sure, sorry. How do you plan on promoting this once you've opened it up? So there are the standard ways of promoting websites, the various blogs and so on. We want to get ranked highly in Google as one of the ways to draw people in. We're also talking about a partnership with Wikipedia. So right now, when Wikipedia cites a book, there's a very tedious way of typing in the details of the book and noting it at the bottom of the page to say that some of the information in this Wikipedia article came from this book. We're working with them on making Open Library that system, so that instead of typing in the details of the book, you search for it on Open Library and then just link to the Open Library record. And I think that's gonna draw an enormous number of people to the site. Resolve an age-old problem, which is where that little ISBN shortcut goes. Yeah, the current solution's a bit absurd. Actually, every MediaWiki instance implements the ISBN shortcut. If anyone here runs MediaWiki, if you do a little ISBN colon and then an ISBN number, that'll automatically get converted into a link to some special book page, which, if you haven't set it up on your site, doesn't do anything.
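To make the ISBN-shortcut idea concrete, here is a small sketch of what such a rewrite could look like. It is an illustration under stated assumptions, not MediaWiki's or Open Library's implementation: the function names are made up, the pattern accepts either "ISBN:" or "ISBN " before the number, and the `https://openlibrary.org/isbn/...` address used as the link target is assumed for the example rather than taken from the talk. The check-digit rules for ISBN-10 and ISBN-13 are the standard ones.

```python
import re

# Assumed link target, used only for illustration.
BOOK_URL = "https://openlibrary.org/isbn/{isbn}"

# "ISBN" followed by a colon or space, then digits with optional hyphens/spaces.
ISBN_PATTERN = re.compile(r"\bISBN[: ]\s*([0-9Xx][0-9Xx -]{8,16}[0-9Xx])")


def normalize(raw: str) -> str:
    """Strip hyphens and spaces, and upper-case a trailing x."""
    return re.sub(r"[ -]", "", raw).upper()


def is_valid_isbn(isbn: str) -> bool:
    """Check the ISBN-10 (mod 11) or ISBN-13 (mod 10) check digit."""
    if re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(isbn)
        )
        return total % 11 == 0
    if re.fullmatch(r"[0-9]{13}", isbn):
        total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(isbn))
        return total % 10 == 0
    return False


def link_isbns(text: str) -> str:
    """Wrap each valid ISBN mention in wiki external-link syntax: [url label]."""
    def replace(match: re.Match) -> str:
        isbn = normalize(match.group(1))
        if not is_valid_isbn(isbn):
            return match.group(0)  # leave anything that fails the checksum alone
        return f"[{BOOK_URL.format(isbn=isbn)} {match.group(0)}]"

    return ISBN_PATTERN.sub(replace, text)


print(link_isbns("See ISBN 0-306-40615-2 for details."))
```

Validating the check digit before linking is a design choice in this sketch: it keeps mistyped numbers from turning into links to pages that cannot exist.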
But maybe you could even be on Wikipedia to have the MediaWiki done. Yeah, that would be great. We'll wrap up in a minute. What's the very short list of things that you need in order to make this project succeed? More data, more people contributing and using it, and book lovers. Tim's built an amazing community of book lovers that's been very responsive to new features and new ideas, and we're hoping to have something similar for Open Library. And a few more programmers wouldn't hurt either. Thank you very much. Yeah, thanks everyone. I'll be around to answer questions or talk to people if people want more. Thanks.