Hi, everyone. Thank you very much for having me here today. I'll be discussing my use of Transkribus in my PhD research, which looks at legacies of race and slavery in the first eight editions of the Encyclopedia Britannica using a text-mining approach.

First published in 1768 in Edinburgh, Scotland, the Encyclopedia Britannica's approach to science, reason and the organisation of knowledge was a real exemplification of the Scottish Enlightenment and the wider Enlightenment movement. It was a reference publication, its title page describing it as "a dictionary of arts and sciences, compiled upon a new plan". As well as entries in alphabetical order, as you would see in a modern-day dictionary, this new plan included longer essays that unified concepts otherwise scattered across the Encyclopedia, in the hope of improving knowledge of topics such as shipbuilding, anatomy and other subjects that required a large amount of information. It was published at a time when anti-slavery sentiment was increasing in Britain while the Atlantic slave trade continued, and that is what I'll be focusing on in my research.

My research looks at the first eight editions, which run from 1768 to 1860, and this presents the opportunity to research many different stages in the movement towards abolition: the continuation of the slave trade up to abolition, and how attitudes changed after that. There is a huge amount of text to work with, and I'll give you a breakdown of exactly what's involved later on, but the editions range from three volumes in the first edition up to 20 volumes in the final ones I'm looking at in this period, so there is a huge amount of data to look at.

The data I'm using comes from the National Library of Scotland's Data Foundry, which makes selected digitised collections from the library available as open-access datasets that can be reused. The Encyclopedia Britannica dataset is very comprehensive and a really valuable resource, and I'm using some of the images included in it. It has XML files, METS metadata files at item level, and 155,000 images. It also has OCR generated from the in-house digitisation process, which, as you can imagine, contains a lot of errors. It was produced using industry-standard tools, but those errors are what brought me to my research with Transkribus. Having good-quality, machine-processable data is crucial in text mining, and the OCR just didn't quite cut it.

Just to give you an idea of what I was looking at originally with the OCR, and how we can move on from it with Transkribus: there were numerous errors, including misrecognition of the long s, which, for someone looking at race and slavery, was quite a big issue. Post-correction scripts can be written to fix this kind of error, but between this and a lot of other errors, it was going to prove too much for my research. Here are some examples of the different issues. In green boxes we have correct recognitions in the original OCR, which can be found via the QR code here. There was also the misrecognition of "slaves" as "flaves", which is not particularly helpful when you're trying to find certain words in the text. Because of this, I decided to move on and use Transkribus in the hope of creating a text with a lower character error rate. Here are a few other errors that appeared in the text.
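On the point about post-correction scripts: below is a minimal, purely illustrative sketch of the kind of rule-based fix-up such a script might do, assuming plain-text OCR pages and a hypothetical handful of long-s substitution patterns. It is not the Data Foundry pipeline or my actual workflow, just an indication of why this approach becomes unmanageable as the error patterns multiply.

```r
# Sketch of rule-based OCR post-correction for long-s ("f" read for "s") confusions.
# Patterns and file names are hypothetical, for illustration only.
library(stringr)

# A few example substitutions where the long s was read as "f".
long_s_fixes <- c(
  "\\bflave"      = "slave",   # "flaves", "flavery" -> "slaves", "slavery"
  "\\bfugar\\b"   = "sugar",
  "\\bfociety\\b" = "society"
)

fix_long_s <- function(lines) {
  str_replace_all(lines, long_s_fixes)
}

# Example: read one OCR page, apply the fixes, write the corrected page out.
ocr_lines <- readLines("volume1_page001.txt", warn = FALSE)
writeLines(fix_long_s(ocr_lines), "volume1_page001_corrected.txt")
```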
As you can see, it's quite difficult, so I was hoping that moving to Transkribus would let me create a more usable dataset. I trained my own Transkribus model: I hand-transcribed 27 pages of text from the first volume of the first edition, so 1768, which totalled 17,898 words and 2,368 lines of text. As you can see here, this model returned a character error rate of 0.95% on the training set and 2.41% on the validation set, which I personally thought was quite a good result for a first attempt, so I was happy with the initial results. I will also note that I used the preset model for the layout analysis tool because this was my first foray into Transkribus, so I thought that was the safest way to go into it.

After training, I decided to try out my model and see what the results were, and whether I could actually find some of the words I would potentially want in my research. I used the full-text search function to see the return rate on words relevant to my research: in this case, I searched for "slave" with an asterisk after it to catch instances of slave, slaves and slavery, which are highlighted here. I know the writing is quite small, but you should be able to see from the yellow highlights how many results I got back, which was quite promising. The search returned 13 instances of slaves, 10 instances of slavery and a couple of instances that weren't relevant, such as slavering and slaver. I knew the text wasn't going to be perfect, but due to the volume of text I needed a process; my aim was to achieve a character error rate low enough to give me a usable dataset.

Although the text quality I got back was significantly better than the original OCR, the layout of the text posed a bit of a challenge for Transkribus, and we've obviously had a lot of discussion about layout difficulties earlier today, so hopefully we might have some answers in the future. Most pages in the Encyclopedia Britannica are laid out in two columns. This is a small section of one of the pages: as you read down the page, you read the left column first and then the right column. In some cases, the entire page was recognised as a single text region, as you can see here with the green box around it, which meant that although the text was split into separate lines for each column, everything bled together, so looking at individual entries is quite difficult because they end up merged across the page. This is something that could be fixed manually on a small scale, but at the scale of data I'm working with it's quite difficult, so I was looking into other options. In other cases, lines were not split across columns: you can see that the very top line of the entry has been recognised as one full line running across the page, whereas it is actually two separate lines, and again the whole page was recognised as a single text region. The image below shows a similar issue, although the columns have been split into two, so this is something that was being picked up in the layout analysis.

To try to work around these issues, I tried training a baseline model and ran layout analysis on a sample of my text. This also returned mixed results: I had high hopes for it, but I think, due to the complete mismatch of some pages with quite tricky layouts, I just wasn't getting consistent results with anything I tried, so if anyone has been working on text with columns and has any advice, I would really appreciate it.
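As an aside, the same kind of wildcard search can be reproduced outside Transkribus once the text is exported. The sketch below is purely illustrative: it counts every word beginning with "slave" in a hypothetical folder of exported plain-text pages and drops forms such as "slaver" and "slavering" that are not relevant to the research question.

```r
# Sketch: count "slave*" hits in exported plain-text pages and drop
# irrelevant word forms. File paths are hypothetical.
library(stringr)

pages <- list.files("edition1_volume1_txt", pattern = "\\.txt$", full.names = TRUE)
text  <- tolower(paste(unlist(lapply(pages, readLines, warn = FALSE)), collapse = " "))

# Every word beginning with "slave" (the "slave*" wildcard).
hits <- str_extract_all(text, "\\bslave\\w*")[[1]]

# Remove forms that are not relevant to questions about slavery.
irrelevant <- c("slaver", "slavers", "slavering")
hits <- hits[!hits %in% irrelevant]

# Frequency of each remaining form, e.g. slave / slaves / slavery.
sort(table(hits), decreasing = TRUE)
```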
I've been in touch with the Transkribus team and have had some advice which I've tried to implement, but if anyone else has had a similar issue, please grab me after the talk, and I would love to have a chat.

So what does this mean for my research? My Transkribus model has helped to generate a corpus of text with a reasonably low character error rate, providing a good quantity of data that I will actually be able to use in my research. As I have such a large quantity, the ability to process editions at scale is crucial, and Transkribus offers the means to do this and to create a usable dataset. Initially, to narrow my scope, I'll be looking at the first and seventh editions, which are indicated here so you can see the page counts. These are the numbers of pages that include text: 2,659 pages and then 17,047 pages, so as you can see, it is a very large quantity of text to work with.

A massive advantage of digitally searching for keywords in the Encyclopedia Britannica is that you can get results that you don't expect, and this is something I'm really interested in in my research. There are entries for slavery, sugar, Jamaica, and I could go on with a huge list of them, where you might expect to see references to slavery, but it's the references you don't expect that are really a point of interest for me. These are just some very initial results looking at keywords from the first edition, volume one. I looked up instances of the use of "slaves", and you can see here a list of the entries in which they appeared. I also made a note of the context in which they appeared, so what the entry was talking about, and found that the majority of these entries discussed slavery in the context of ancient Rome rather than contemporary slavery. I've also done some close-reading analysis of the slavery entries across editions and found that this is a recurring theme in the early editions; it's only as the movement towards abolition becomes stronger and legislation is passed in Britain that you see a firmer discussion of contemporary slavery in the sources.

I'm still at a very early stage of running my searches, so my apologies that I don't have a lot of nice visualisations like the last presentation, which were fantastic, but that is something I'm hoping to improve and expand on as I move through my research. By identifying where my selected keywords appear in the editions and how frequently, I'm hoping to look at the networks created between the editions and across all of the entries. I'm also hoping to identify when new information comes into the editions as time progresses. With each new edition there is usually an expansion of the general word count of the Encyclopedia Britannica, so I'm expecting to see increased mention of certain keywords, and I hope to identify that in relation to slavery. I'm also interested in whether certain keywords are repeated across editions or whether you start to see a decline in the use of certain words. So I'm kind of throwing keywords out into the wind, seeing what comes back, and then trying to explore further from there using distant reading as well as closer analysis. I'm also hoping to use topic modelling to examine the entries relating to slavery more closely and get an idea of the concepts that come up frequently in these entries. I'm also using digital tools more widely, so I'm using the R coding language.
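As a purely illustrative sketch of what a keyword-frequency comparison across editions could look like in R, assuming the exported text sits in one folder of plain-text files per edition (the folder names, keyword list and edition labels here are invented for the example):

```r
# Sketch: compare keyword frequencies across two editions.
# Folder layout, keywords and labels are illustrative only.
library(dplyr)
library(stringr)
library(ggplot2)

keywords <- c("slave", "slaves", "slavery", "abolition", "sugar", "jamaica")

count_keywords <- function(dir, edition_label) {
  files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)
  text  <- tolower(paste(unlist(lapply(files, readLines, warn = FALSE)), collapse = " "))
  words <- str_extract_all(text, "\\b[a-z]+\\b")[[1]]
  tibble(edition = edition_label,
         keyword = keywords,
         count   = vapply(keywords, function(k) sum(words == k), integer(1)))
}

freqs <- bind_rows(
  count_keywords("edition1_txt", "1st edition"),
  count_keywords("edition7_txt", "7th edition")
)

# Simple comparison plot of keyword counts per edition.
ggplot(freqs, aes(x = keyword, y = count, fill = edition)) +
  geom_col(position = "dodge") +
  labs(title = "Keyword frequency by edition", y = "Occurrences")
```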
So I'll be exporting my text files and running them through R scripts to do my text mining and create visualisations; R has a lot of flexibility for this. The use of digital research methods and resources opens up a lot of new avenues of research, and hopefully this will help reveal the extent of the legacies of race and slavery in the Encyclopedia Britannica. There is so much information buried within these masses and masses of pages of text that it would be more than a lifetime's work to go over and identify it all by hand, so I'm very fortunate to have the text created by Transkribus to go on and do more detailed analysis that just wouldn't be feasible manually. I'm hoping to dig a lot further into this and will be sharing my findings with the Transkribus community as well. Thank you very much.

Thank you very much for this presentation; I think it was very insightful. Any questions? We have time for two, I guess. Yeah, go ahead. The good thing is that if somebody is using the microphone, the people joining virtually can also hear everything. It's the red button. It's on.

I was just wondering, Ash, because you showed us the dates for each edition, and there's quite a big gap, isn't there, from the late 18th century to the mid 19th century. Does the layout become more standardised in the later editions, and could that reduce the headache you've got in terms of recognising both columns?

The short answer is no. You have very complex layouts where most pages are just two columns of text, but many other pages include the longer essays, so you have columns, headings inserted into the middle of pages and then further columns, and it gets quite difficult trying to figure out exactly where you have to read, which is intuitive to humans but more difficult to instil into machine learning. It shouldn't be impossible, so I'd like to hear of any further solutions, but it doesn't become easier with the later editions, unfortunately, which is a shame.

Right, we have a second question. Have you ever tried out the block segmentation? This is a special method for printed text, and it should be able to work with the two columns.

I don't think I have, so that's fantastic. It's something that I've been going around in circles with, and I've tried asking a few different people.

I feel it's a simple solution for your problem.

Lovely, that's what I like to hear. It's a simple solution, that's great.

It's in the layout analysis at the top, where you see CITlab Advanced; you open it and then you can select printed block segmentation, and that's made for such problems.

Lovely, I'll try that, thank you very much.

Then we are happy that we could help. Thank you very much, Ash, from the University of Edinburgh.