Good afternoon, everyone. I'm Ken Taylor. I'm the Cultural Producer North at the British Library. Welcome to our first webinar for the Leeds Digital Festival, Rethink research, illuminate history. We're really pleased to be part of the digital festival for the second year, and of course this time we are operating in a virtual environment. So we welcome people both in Leeds and further afield to this discussion. I'm going to hand over shortly to Dr. Mia Ridge. Mia is our digital curator for Western Heritage Collections, and she's also the co-investigator on Living with Machines, which is our joint venture project with the Alan Turing Institute and universities, and that's going to form the focus of what we talk about this afternoon. There'll be a particular segment of time for questions towards the end, but I will be in the background, so if you have any issues just pop a message in chat and I'll try and resolve them for you. Without further ado then, happy to hand over to Mia to start the more interesting part of our event. Mia, over to you.

Thank you everyone for joining us today and I hope you're all well and happy. I'm going to talk for about 40 minutes and then hopefully there'll be lots of time for questions. I've tried to anticipate common questions as I go, but not being able to see the audience makes it slightly harder. So hopefully Ken can pick up questions, and then anything that's not answered during my talk we'll pick up at the end. So as Ken said, I'm the digital curator for Western Heritage Collections, and I'll talk more about what that means, and a co-investigator for Living with Machines. Co-investigator is one of those pieces of academic language. In more purely tech talk it might be something like a responsible person or senior stakeholder. I'll be covering a lot of different topics across this talk but also providing a lot of links so that you can find out more outside of the talk.
So I thought I would explain something about the British Library, because not everyone is familiar with our work; talk about how we're responding to changing research practices, particularly changes in the digital environment; explore why we want to collaborate with academics and how it helps us understand the future of research; then go into more detail about a case study, Living with Machines, thinking about how we're combining data science and historical research methods; and then close by talking about where you can find out more. So the British Library's mission is for research, inspiration and enjoyment. We're the National Library of the UK. We are a copyright library, which means we receive every publication produced in the UK and Ireland and have done for some centuries. Our collections predate the existence of the library. We are formed from a number of different institutions' libraries, including the British Museum's library. One of the main challenges that we face is the pure size of our collections. So we say we have up to 200 million collection items. There's a gentle tussle between us and the Library of Congress about which is the biggest library in the world. We're definitely one of the two biggest libraries in the world. We have about 16 million books, lots of stamps, and manuscript volumes, which are handwritten or archival material. We have patent collections, maps, music scores. We're increasingly working on collecting sounds, because with sound recordings the physical formats are deteriorating over time. We record television and radio news. We work on the UK Web Archive in collaboration with other legal deposit libraries, so we collect websites from the British internet. And we are also sent, increasingly, terabytes a day of ebooks and e-journals from publishers. So publishers no longer necessarily send physical copies.
And I just wanted to note also that this photograph is from our Boston Spa site, which some of you may know, outside of Leeds, between Leeds and York. We add to our collections extensively every year. This means that discoverability across collections is a challenge, because there are so many different kinds of items. They are all cataloged to a different extent using different formats. We work extensively with international communities on using cataloging standards. But the standards for cataloging television broadcasts are different than those for cataloging 3,000-year-old Chinese oracle bones. So it can be very difficult finding items across the collection. We are not a lending library, so we can't lend you items outside of the library. We have reading rooms in London and also one reading room at Boston Spa. About 4% of our collections are either digitized, so we image, photograph, scan or record sound and moving image from physical items, or they're born digital. So that's things like ebooks, social media, websites and sound collections, music collections as well. We work closely with publishers and with readers to ensure that we provide legal access, and often this means that things are only available on site, which is a challenge for a library with an international readership and serving a national readership as well. So for the last 10 years at least the library has been exploring how digital access can broaden the number of people who can use our collections, and how and when they can use the collections. We can increase the convenience of using the collections, but we're also responding to the fact that the way that researchers work is changing. So about 10 years ago the library's digital scholarship team was set up, and our role is to make it easier to use digital collections for those mission-based purposes. So research, inspiration, creativity and enjoyment. There are several teams within digital scholarship.
I'll talk a little bit about the digital research team, but I would also really encourage you to look at the work of BL Labs, who do so much to enable and reward people using collections in interesting ways, and also the Endangered Archives project, which helps digitize at-risk collections around the world, making our British Library collections truly, truly international. We generally work, sometimes, as agents of change. You can imagine that a library has a lot of different departments and levels of experience within it, so we do a lot of work to invest in training for our staff and support scholarship. The digital research team specifically is a cross-disciplinary mix, so some people trained as librarians. I trained as a computer scientist but also as a historian, and that kind of hybrid work that we all do means that we try and understand and support digital scholarship as best we can in the library. So we do a lot of training, and we do a lot of collaborative projects. BL Labs do a lot of awards and events and competitions, and we try and be as chatty as we can about what we're doing on our blog posts and on social media. We also, as you'll see with Living with Machines, publish our papers via the research repository of the British Library and, where we can, we also share code as well as providing access to datasets, sets of digitized images and files. In many ways I describe my job as enabling the shift from reading pages to reading datasets. So the library is traditionally very comfortable with people coming into the reading rooms. They order up the books that they want, they might have a stack of three on their desk, they go through them. Increasingly they'll be reading physical books but also have other online reference sources next to them on a laptop. They might even be other books from the British Library's collections, but it is a kind of process of reading through, turning page by page.
So my job is to help the library think about what it means to read datasets, to turn collections into data, and that's very much part of a wider international movement around thinking about how to make libraries more accessible, and how to deal with the challenges of turning books, sets of images and sounds into datasets that can be analyzed in different ways. One thing that we've found is that collaborative projects help the library move more quickly. Those of you who work in large organizations know that change is hard, it's slow. If you come out of this talk learning just one thing, it's that the library, shown here at St Pancras, was designed to look like a ship, and of course ships are kind of notoriously slow to turn. So we find that collaborations are a little bit like tugboats that help nudge that ship along, and in particular we find that working with academics means that we can understand their needs better. We learn more about our collections and that benefits all readers of the library. When I say readers of the library I really mean anyone who uses the collections. You don't have to be an academic, you can be researching any topic you like. We run business support, we run lots of support for creative uses, artistic uses, teaching, learning, whatever. We try and bring techniques that we learn from academic collaborations, and collaborations with others, into our own practice so that we can then teach those methods across the library and support other researchers. And we also find that it's a great way of exploring new and emerging technologies at scale. So we can do small-scale pilots with our collections.
We do a lot of experimentation on things like methods for working with sounds, methods for working with maps, applying machine learning technologies to automatically caption images or understand the content of books. But because of the size of the collections, actually scaling up and applying those technologies at an operational level is immensely challenging, because when you're talking about 16 million books it's a bit different than the kind of collection that other institutions work with. So Living with Machines really came out of that desire to respond to the growth that we were seeing of digital scholarship and data science methods. So we'd started teaching things like text and data mining, exploring different methods within that to understand how they might affect the library. And at about the same time the Alan Turing Institute, which is the UK's national institute for data science and artificial intelligence, were moving into our St Pancras building, and it seemed like the best possible way to take advantage of their presence was to perhaps complicate their ideas about what data science meant, by asking them to look at material from the arts and humanities, and from the history of science as well as contemporary science, across the whole multilingual, international complexity of our collections in terms of the countries of origin, the languages, the concepts, as well as the different formats in our collection. We wanted to collaborate with subject and methodological experts, and I think it's really important that in this project we are using really rigorous historical and data science methods, and we can only do that by collaborating with others. But we also wanted to build on the library's expertise in research services and in public engagement. So we know a lot about how people use collections in the reading room through our work in digital research.
We know a lot about how people use resources in teaching, how people from non-traditional, non-computer-science backgrounds might start to approach digital methods, and the kinds of questions that data scientists or computer scientists or statisticians might have as they start to use our collections, because there's a lot of contextual information that makes it easier. Living with Machines is also a huge opportunity to understand the potential and challenges of AI and machine learning methods for cultural organizations, and we are learning a lot about the kinds of scale that you get when you ask data scientists to work with really, really big data that might not fit the definition of big data that they're used to. It's often very messy, it's often incomplete, it changes a lot over time, it's very inconsistent, and there are even things like cloud storage bills to deal with when you're working with petabytes of data once you unzip some of our digitized collections. And finally the project was also an opportunity to build on the digitization work that's happened across the library over the past decades and provide sort of worked examples that will help others think about how to apply these methods, not only in giant data science projects, but perhaps in smaller ways in their own research projects or in their teaching or their personal projects. So Living with Machines is approximately a five-year project. We are using the long 19th century really as a case study. The long 19th century is a historian's way of saying a time before and after the end of the 19th century, because these movements are never exact. If you think of the long 19th century as being the first industrial revolution, people were dealing with masses of change. Technology was changing ordinary lives. There were new methods of receiving information, new methods of transport, new kinds of work. It changed how time operated, it changed where people lived.
And we thought it was a very resonant theme at a time when new technologies like AI, robotics, whatever, are changing the way that people are working now and how we're thinking about technology now. So this 19th-century case study particularly fit the sweet spot of the kinds of questions that we wanted to explore, and in particular there was also the explosion of documents from the 19th century. So the Victorian era roughly coincided with the rise in new statistical methods, a flourishing of newspaper publishing and a flourishing of other sources of printing, and they're also sort of quite well collected. So there's a lot of material to work with, which of course brings its own challenges. So our aims are broadly to generate new historical perspectives. There's a lot of work being done on the history of the 19th century, but we wanted to see how we could nuance this or make it richer by using methods at scale, by perhaps going beyond the kinds of work that historians and others have been able to do looking page by page. How can we change their understanding by finding patterns that might only be evident when you can look at datasets rather than individual records? One of our main goals is to develop new computational techniques for working with historical research questions, and one of the main challenges is turning those techniques into something that people without necessarily a full data science background can access. So making these methods more accessible to the everyday historian, whether that's an academic historian or a historian working in the community as a hobby. We want tools and code that other people can reuse. We also wanted to help support the wider cultural heritage sector in using digital methods. There's a lot of work happening in this at the moment. Some of you might have heard of the Towards a National Collection series of projects. So it's a particularly ripe moment to be doing this kind of work.
We always want to increase the usage of our collections and this seems like a really good way to do that. And finally we wanted to be part of the public conversation about data science, about AI, and how the humanities and arts can have a role in those conversations. It's not just about maths and science and STEM. These quotes kind of express the complexity of the different ways that stakeholders see the project, but I wanted to highlight Ruth's statement, she's our principal investigator, about creating both a data-driven approach to our cultural past and a human-focused approach to data science. And I think both of those are really fruitful, and that intersection of creating ways for people to understand each other's disciplines, each other's ways of looking at the world, each other's ways of understanding valid research questions and how results are presented, that's one of the main challenges that we're taking on in this project. So really quickly, some of the benefits for the cultural heritage sector: providing models for really large-scale research collaboration and partnership. I've spent years working in the cultural sector and have worked extensively with teams within organizations on exhibitions and digital projects, but this is a huge project for us and there's been a lot of learning in terms of how you manage that. We think it's really important that people don't think of libraries as places where ideas go to die. We are leading digital innovation in some ways. There's a large piece of work around improving workflows, data processing, working with different forms of metadata. So kind of internal processes, but really the behind-the-scenes stuff that makes the library work. We also wanted to deal with issues around copyright and mixed-rights access, where some items are still in copyright and others are in the public domain. How do we work with expectations about open data when it's not possible for us to grant access to all forms of data?
And then finally we wanted to challenge the library slightly to think about how to incorporate digital content and data in the exhibition program, and hopefully we'll be seeing more of that up in Leeds at some point. The team is huge. We have a large number of co-investigators, so kind of project management board members, and people hired to work on the project, so a lot of postdocs or research associates, staff at the British Library and staff at the Turing, and people seconded from different academic institutions. And this has only been possible with the support of our funders, so UKRI and the Arts and Humanities Research Council. And it has been an amazing opportunity, particularly for the library but I hope also for the rest of the team, to learn together from each other about these different methods. So one of the first challenges we faced is how do you start a 20-something person, five-year project? We began with some major research themes and sort of focuses of work. So looking at things like biases in sources, not only the sources that we wanted to collect but understanding how our source selection process and the sources available to us would shape the questions that we could ask. Thinking about how language is used in these sources, so we're working a lot with computational linguistic methods. Looking at change over space and time, which is a really important sort of historical approach. Looking at communities, not only in terms of how the industrial revolution or mechanization affected communities, however you define them, in the UK, but also how we could engage different communities in our work, so things like crowdsourcing and public participation, how we could engage academic and other research communities in our work too, and how we functioned as a series of communities within the project. There is a massive piece of work to integrate and manage the infrastructure and think about interfaces to these datasets.
We're using a lot of cloud-based services but also things like Jupyter notebooks to provide relatively lightweight entry points into some of our data processes. And there's the sheer process of acquiring data and wrangling it, so managing the legal rights, managing credits, unzipping files, and transfers between networks that can take weeks. There are some really sort of back-end questions that are really important in our work. Some of the sources that we are looking at include the full text of newspapers, and trade and postal directories. These were sort of like the earlier versions of yell.com: if you wanted to look up a business, there would be an index or a printed listing of what businesses were where, perhaps street by street, what businesses were on each street. We want to look at working-class autobiographies as a way of accessing a voice that isn't represented in newspapers. We are looking at journals and diaries, novels, and parliamentary papers, because some of the legislation gives us a sense of where attention was being focused. A lot of the data that we want to use is tabular and that has its own challenges. Looking at birth, death and marriage records, they're highly structured if someone has already done the work of tidying them up for you. Similarly with census records, they provide a lot of access to understanding the life courses of people affected by the industrial revolution. We're doing a lot of work with computer vision on things like Ordnance Survey maps and the fire insurance maps, which give us a lot of detail about the composition of buildings and streets, and we're also looking at using images in publications in different ways.
These are mostly collections from the British Library, but we're also negotiating access to other collections, and we're very grateful to the National Library of Scotland for providing us with their digitized Ordnance Survey maps, because that's made life a lot easier, and hopefully they're learning from our project as well. And we're also grateful to Findmypast for access to newspapers digitized in the British Newspaper Archive. So I want to just spend some time thinking about how we are applying different methods to answer some of these sorts of questions and, as I said, there'll be a lot of terms that might go by quite quickly, but hopefully also some links where you can find out more. So one of the things that we wanted to understand is this: there were a lot of newspapers published, a lot of them were collected, and a percentage of them have been digitized. It's certainly not even 20 percent of the collection, and the selection process has been led by family historians, so the newspapers are selected to give a sort of geographical sample, rather than necessarily being focused on the number of newspapers produced in a given region. It's just trying to get one from each region. But there were at the time these newspaper press directories published that were sort of like media buying guides, so if you wanted to know where you should advertise to reach farmers or big landholders or the radicals, you could look in these newspaper press directories and see for any given newspaper the kind of audience that they aimed for and the tone that they had.
Sometimes they'd give a sample of the kinds of topics that they'd cover. So one of our researchers, Kaspar Beelen, is using these directories: he's comparing the printed directories with the newspapers digitized to understand the difference between what was available at the time and what's available digitally now, and he's been applying various methods to do this work in collaboration with others. This will really help us understand how much validity we can give to different findings. So if we say Leeds was leading in this, is it only because we haven't understood that newspapers from Manchester were selected differently and focused on different aspects of subjects that society might be interested in? So this is really exciting work and hopefully it'll have a lot of use outside of the project. One of the main things that we've done that we didn't plan at the start is create visualizations with research software engineer Olivia Vane to understand what's in our collections and what's been digitized, so that we can select certain newspapers for digitization to fill some of the gaps that we think might be there in the collection. So this visualization, which is interactive and is available as a video online, shows what we have in the collections and what's been digitized. It gives you a sense of the complexity of newspaper collections, because they would change their titles, some of them really often, and sometimes the same title would be reused for different publications over time. So this is a way of navigating through a really, really giant collection of newspaper titles to get a representative sense of what was being published at different times, and how we can build a dataset to be as representative as possible or to respond most closely to the kinds of research questions that we have.
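The comparison of press directories with the digitized newspapers described above can be pictured as a simple coverage count per region. What follows is only a minimal sketch in Python, not the project's actual method: the function name `coverage_by_region`, the `(title, region)` record shape and the placeholder titles are all assumptions made for illustration.

```python
from collections import Counter


def coverage_by_region(directory_titles, digitized_titles):
    """Compare titles listed in a press directory with titles that were digitized.

    directory_titles: iterable of (title, region) pairs (hypothetical record shape).
    digitized_titles: iterable of title strings.
    Returns {region: (listed, digitized, share_digitized)}.
    """
    directory_titles = list(directory_titles)
    # Normalize case so that "Title A" and "title a" count as the same title.
    digitized = {title.lower() for title in digitized_titles}
    listed = Counter(region for _, region in directory_titles)
    matched = Counter(
        region for title, region in directory_titles if title.lower() in digitized
    )
    return {
        region: (count, matched[region], matched[region] / count)
        for region, count in listed.items()
    }


# Placeholder data, invented purely for illustration.
directory = [
    ("Title A", "Leeds"),
    ("Title B", "Leeds"),
    ("Title C", "Manchester"),
    ("Title D", "Manchester"),
    ("Title E", "Manchester"),
    ("Title F", "Manchester"),
]
digitized = ["title a", "Title C"]

coverage = coverage_by_region(directory, digitized)
for region, (listed, found, share) in sorted(coverage.items()):
    print(f"{region}: {found} of {listed} listed titles digitized ({share:.0%})")
```

In this made-up example one title has been digitized for each region, but that is half of the listed Leeds titles and only a quarter of the Manchester ones, which is the kind of imbalance that could distort a finding if it went unnoticed. Real title matching would be much harder than an exact string comparison, because titles changed and were reused over time.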
I mentioned extensive use of computational linguistic methods. So they're looking at questions like how did people talk about machines, and what was happening when they did talk about machines, so looking at the source text to understand the social and cultural impact of mechanization. They're dealing with issues like optical character recognition errors. The process of automatically transcribing text with computers is not necessarily very accurate for earlier newspapers, so you'll often get kind of misspelled words that can make life difficult. They've been exploring methods like manual annotation to mark up the semantic structure of sentences, to understand where machines were given agency, and the easiest way to think about that is the stereotypical "computer says no" line from Little Britain, where the whole point is that it's not really the computer, it's a system set up and enabled by the computer that's saying no to people. They're doing really exciting work looking at geolocating mentions of places in texts, so that if we can differentiate between different places with place names that were reused, we can get a stronger sense of how different areas were changing through mechanization. The idea of lexicon expansion is that we might have a term like machinery or machines. It will have some metaphorical uses, so the machinery of parliament, the machinery of God, but also very literal uses in factories and other industrial settings. So how can we find terms that people at the time used, related to terms like machinery, to run the most accurate queries and find articles about machinery? And then looking, through all these methods, at how machines were given agency by writers over time and place, so how did that change and when? This is just an example of looking for place names in text. So Newtown, I think there are 14 Newtowns in the UK, so here we're looking for which Newtown is which based on the context of the text around the use of the
place name but also the publication. I mentioned the large amount of infrastructure work that's going on. They've had to do a lot of work to establish a secure environment and set up our compute systems. They have documented and developed data models, looked at best practice in reproducible, data-driven humanities research, and spent a lot of time supporting that. I mentioned, I think, that we're using Jupyter notebooks. We find that they're useful because they let you document and comment on code, and the code is also executable, so it helps people start to access and run code without having to have a whole kind of setup and computing environment. And they've done the initial analysis looking at when we have newspapers from and how many words tended to be in an issue, so that we can understand the composition of the datasets and the weighting of different findings that we might have across the project. I mentioned that we're looking to understand change over space and time. We are using maps as a way of understanding change in the industrial landscape, so we've done some work looking at automatically transcribing text from these maps. We have a blog post that talks about some of the results of this. So we're trying to locate places but also understand how land use changed across time, and trying to sort of pull out signals from these datasets. It's quite new work, I think, to use maps as datasets in this way, and we are applying methods that were developed for contemporary systems to work with historical collections, and thinking about how we can integrate annotation processes into something like human computation systems that improve the data analysis methods while you're working. And we're hoping that we can start to link these records to census records so we can understand where different industries were focused, where workers lived in relation to their workplaces, and how that changed over time. We are working with the public in lots of ways, increasingly. This is a crowdsourcing
task set up on a platform called Zooniverse, where we have been asking people to classify accidents and then annotate them to help build datasets for analysis. So on the right-hand side it's a column of text from a newspaper, I think it might be the Hayward Advertiser, where one of the paragraphs is about an accident. So we did a kind of keyword search in our dataset of articles to find things that might be about accidents, but then we needed to improve the accuracy of the dataset, so we have people classify these accidents as either being about an industrial or workplace accident, a transport accident or some other kind of accident, or perhaps the text might be unreadable in some way. And this has given us a dataset that we can then use to build further tasks. The reason that we focused on accidents is because it's an obvious impact of mechanization. It's sort of hard for us to think about it now, but people weren't used to working with machinery. If you're used to animal-driven machinery like carts and horses, or manually powered machinery, there's the fact that a machine can't see that perhaps someone's hair or clothing is caught up in the machinery, and that trains run on tracks and take time to stop and can't swerve. There are a lot of the kinds of things that we've learned to negotiate around that people were still learning at the time. So there were lots of different kinds of accidents, as well as accidents that were perhaps caused by not understanding the benefits of health and safety, or not valuing the workforce perhaps as much as people valued productivity and keeping the machines running. So we thought talking about accidents has a direct resonance for our research questions, but also it's a nice way into the project. On a technical level we're thinking about how we can integrate the results of machine learning into crowdsourcing, so we're not asking people to do more tasks than they need to, but we're improving the dataset all the time at the back
end and hopefully also finding ways to create more interesting tasks and improve that kind of process. And we found that this has been quite successful, so people are engaging with our research. It's been such an open question that people will post and say, there was a fire mentioned, does that count as an industrial accident? And it's difficult, because in some ways it does. If a machine shaft sparked a fire, that definitely does, and you can also find cases where fires in workplaces, because all the machinery was centralized and work wasn't happening in individual cottages or homes, meant that a whole community could be out of work. But it's very much been a gray area, and it's actually been really interesting to discuss that with participants and learn from their perspectives as well. And I just have to include these comments, because there is something amazing about the goriness of the process. It certainly makes you appreciate the fact that office work might feel a bit boring at times, but it's certainly a lot safer than working with early industrial machinery. So as an example of how we want to use these methods to work across the project, integrating the work of these different strands: we think the classified articles could be used as the basis of machine learning and those lexicon expansion methods to find more relevant articles. We also want to test, where we've got a list of names of people involved in accidents, with any place names, organization names and details about them, like it sometimes mentions their age or their address, whether we can work with the public, particularly family and community historians who are very expert in finding people in datasets, to understand the longer-term impact of these accidents on individuals, their families and communities. So questions like: does the person's family move not long after an accident? Are they still economically active? Are they in the poorhouse? Trying to understand how
these changes in mechanization actually did, literally, affect individual lives, but then also more broadly opening out again to look at how language around accidents changed over time. So I've discovered that boiler explosions were immensely powerful and could injure people working in a field across from the factory, or injure people passing by in the street outside. So what was it like to be drip-fed this language about how dangerous some of this machinery could be, as well as language about how exciting machinery could be, how it would make Britain more competitive with France in textile industries, for example? So we want to look at those kinds of questions as well. And broadly we want to organize talks and editathons, for the moment mostly online, but we hope also in local libraries, working with the Living Knowledge Network, which is the British Library working with local authority libraries around the country. That's a really exciting opportunity for us, and we want to get out there and bring data science into tiny branch libraries. So coming to the end, I just wanted to share these. These are some sample recent blog posts to give you a sense of how varied the subjects that we're talking about are. Whether your interest is in the history or in the code or in the library processes, we hope that there'd be something for everyone in this. And we'd really, really appreciate questions, because particularly not being able to run public events, it's difficult to know what people are interested in and how we can best help them understand the potential of this project for their own work as well. So we have research papers and datasets available on the British Library's research repository. If you search for Living with Machines or the name of an individual researcher you'll find some results there. We are increasingly making code publicly available on GitHub and trying to do that at the same time as making datasets available to work with, so you can combine the two. We have a mailing list which
is very low traffic and just kind of points to the latest news. Our website is where our blog is. If you have any questions please contact us at digitalresearch@bl.uk, and we're on Twitter as well. And this week we'll be taking part in the Day of DH, which is the Day of Digital Humanities, where we talk about the kinds of work that we do, because our work is very much at the intersection of data science, history and digital humanities, so we're trying to reach all those communities. So at that point I will stop and see what questions we might have coming in from Ken. Thank you all for listening.
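As a footnote to the place-name example earlier in the talk, the idea of deciding which place is meant from the surrounding text and from the publication can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the project's method: the function `resolve_place`, the gazetteer entries "Newtown A" and "Newtown B", their county labels and their clue words are all hypothetical.

```python
import re


def resolve_place(context, publication_county, candidates):
    """Pick the candidate place that best fits the article text and its publication.

    Each candidate is a dict with "id", "county" and "clues" (a set of lowercase
    words associated with that place). The score is the number of clue words
    found in the context, plus one if the newspaper was published in the same
    county. Ties go to the earliest candidate in the list.
    """
    tokens = set(re.findall(r"[a-z]+", context.lower()))

    def score(candidate):
        overlap = len(tokens & candidate["clues"])
        same_county = 1 if candidate["county"] == publication_county else 0
        return overlap + same_county

    return max(candidates, key=score)["id"]


# Hypothetical gazetteer entries, invented for illustration only.
candidates = [
    {"id": "Newtown A", "county": "County A", "clues": {"flannel", "mill"}},
    {"id": "Newtown B", "county": "County B", "clues": {"harbour", "quay"}},
]

article = "An accident at the flannel mill in Newtown."
print(resolve_place(article, "County B", candidates))
```

Here the clue words in the article outweigh the place of publication, so the first candidate is chosen. With no clue words at all, the publication's county decides. A real system would need a proper gazetteer and would have to cope with OCR errors in the text.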