I'm Maurice and this is Andrew. We're both from Victoria University of Wellington: I'm here from the Wai-te-ata Press, and Andrew is here from the team that was assigned to a project we had an idea for. For the press it all got real, got physical, in 2016 with the handover of a ton of Chinese type, which had arrived brand new in New Zealand in 1952. The type used to be carefully organised for use: the pages for each month's issue were typeset, sent to the printers, used to print the journal, and then the pages were returned, and the individual sorts, the metal of the type, were carefully replaced in the correct tray back in that office you saw. The goal is to re-establish a Chinese scholar's studio. We asked the team to help by analysing imagery of the newspapers. We put the imagery itself in Figshare, an online digital repository for researchers: in a private collection, each issue is a Figshare fileset, a set of high-resolution TIFFs. And now I can hand over to Andrew.

The entire publication runs to about 2,800 pages. That is a large amount of data to process, as each page takes around seven minutes on a reasonable computer, which means the entire run of newspapers would take about two weeks. So we decided to build a system that would run in parallel. We ended up using Spark, a framework for processing big data. It uses a cluster with a head node and a number of worker nodes, and we set each worker node to process one issue at a time. We ran this in the cloud on Microsoft Azure, because it was easy to set up and, as students, we got $300 of free credit each month.

Processing a page of the newspaper requires a number of steps: we had to find where the characters printed by the type were, turn those characters into a machine-readable format, and do frequency analysis to count how many there were of each one. (Illustrative sketches of the setup and of each step follow at the end of this section.)

The first step, layout analysis, was not trivial, as the layout changed a lot between pages and between issues. The pages also contained sections of ads, which included Chinese characters that were not printed by the type, so we wanted to ignore them. We tried some pre-built solutions, like the layout analysis component in Google's Tesseract, and OCRopus, but these had very poor accuracy. The approach we finally settled on used a deep convolutional network based on the GoogLeNet architecture, trained to classify which section of the newspaper small image samples came from. We could then run class activation mapping across the output for the entire page to get an idea of where the different sections were, and combined, these looked something like this. The output is pretty good at recognising the different sizes of characters, but not so great at separating the different fonts in the Chinese text and the ads. That's not a massive issue, as the OCR stage recognises that it finds no Chinese characters in those regions and ignores them.

For the OCR component, we ended up using Google's Tesseract. We attempted to build our own recogniser that would work specifically better for Chinese, but we were only able to achieve around 50% accuracy across 13,000 characters generated from Unicode. We could potentially have got higher accuracy if we had known the exact characters that were in the newspaper, but it is unlikely we would have done better than Tesseract, which got around 80% to 90%. From there, it was pretty easy to count how many instances of each Unicode character there were and put them into tables for each type size. Thanks for listening.
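To make the parallel setup concrete, here is a minimal sketch of the kind of Spark job described above, assuming PySpark. The `process_issue` function and the issue identifiers are hypothetical stand-ins; the talk confirms only Spark, a head node with workers, Azure, and one issue per worker.

```python
from pyspark import SparkContext

def process_issue(issue_id):
    """Hypothetical per-issue pipeline: fetch the TIFFs for one issue,
    run layout analysis and OCR on each page, and return character counts."""
    return issue_id, {}

sc = SparkContext(appName="newspaper-pipeline")

# Hypothetical issue identifiers; the real job would list the Figshare filesets.
issue_ids = [f"issue-{n:04d}" for n in range(1, 201)]

# One partition per issue, so each worker node handles one whole issue
# at a time, mirroring the setup Andrew describes.
results = (sc.parallelize(issue_ids, numSlices=len(issue_ids))
             .map(process_issue)
             .collect())
```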
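The layout-analysis idea, a patch classifier whose class activation maps localise sections across a full page, can be sketched as follows. This is our illustration in PyTorch with a tiny stand-in network rather than GoogLeNet, and with hypothetical section classes; only the general approach comes from the talk.

```python
import torch
import torch.nn as nn

NUM_SECTIONS = 4  # hypothetical classes: body text, ads, headings, margins

class PatchClassifier(nn.Module):
    """The GoogLeNet-style idea reduced to its essentials: convolutional
    features, global average pooling, then a linear layer over pooled features."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.classifier = nn.Linear(64, NUM_SECTIONS)

    def forward(self, x):
        f = self.features(x)         # (N, 64, H', W')
        pooled = f.mean(dim=(2, 3))  # global average pooling
        return self.classifier(pooled)

def class_activation_map(model, page):
    """Re-weight the conv feature maps by the classifier weights to get a
    per-section heatmap over the whole page (class activation mapping)."""
    f = model.features(page)     # (1, 64, H', W')
    w = model.classifier.weight  # (NUM_SECTIONS, 64)
    cam = torch.einsum("nchw,kc->nkhw", f, w) \
          + model.classifier.bias[None, :, None, None]
    return cam.argmax(dim=1)     # per-location section label, (1, H', W')

# Run the (trained) patch classifier convolutionally over a full page scan.
page = torch.rand(1, 1, 1024, 768)  # one grayscale page, hypothetical size
model = PatchClassifier()
section_map = class_activation_map(model, page)
```

The key point is that the same weights used to classify small patches re-weight the convolutional feature maps, so the classifier localises sections across a whole page without ever seeing pixel-level labels.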
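The OCR step itself reduces to a short call. A minimal sketch, assuming the pytesseract wrapper and Tesseract's traditional-Chinese traineddata; the talk confirms only that Tesseract was used, not how it was invoked, and the filename is hypothetical.

```python
from PIL import Image
import pytesseract

def ocr_region(path: str) -> str:
    """OCR one cropped text region from a page scan."""
    image = Image.open(path)
    # "chi_tra" is Tesseract's traditional-Chinese model. Regions with no
    # recognisable Chinese characters come back (near-)empty, which is how
    # the misclassified ad regions can simply be ignored.
    return pytesseract.image_to_string(image, lang="chi_tra")

text = ocr_region("issue-0001-page-03-region-07.tif")  # hypothetical file
```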
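And the final counting step really is simple, as Andrew says. A minimal sketch, with hypothetical type-size labels standing in for however the real pipeline grouped regions by size.

```python
from collections import Counter

def character_frequencies(text_by_size: dict[str, str]) -> dict[str, Counter]:
    """One frequency table per type size, counting Unicode code points."""
    tables = {}
    for size, text in text_by_size.items():
        # Count only CJK unified ideographs; skip punctuation and whitespace.
        tables[size] = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    return tables

tables = character_frequencies({"24pt": "新西蘭新聞新", "10pt": "中華民國中"})
print(tables["24pt"].most_common(2))  # [('新', 3), ('西', 1)]
```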
The students have really helped us to understand the complexity of the challenge we gave them. Now, here, as one example, is a page of a type catalogue from the foundry that made the type we have. The code the team wrote used knowledge of the fixed page format, and then Tesseract OCR, to get Unicode values. And here is a challenge that the team's character checker gave Yaowen: two variants of a character; do these match? The Taiwanese Ministry of Education's Dictionary of Variant Forms documents the character zhong, "crowd", as historically having had 40 variant forms. But how could Yaowen know whether Unicode code points exist for those 40 variant forms? This is the model the Unicode Consortium uses for variant unification of Han ideographic characters: it expresses written elements in terms of three primary attributes, semantic (meaning or function), abstract shape (general form), and actual shape (instantiated typeface form). (A small illustration of the variant question follows at the end.)

We have a much better idea now of the complexity in our work, and I've just shown you one example. The students have prototyped an infrastructure for us to scale out and begin action research into big-data analytic approaches to what we're doing. So it's thanks from me, Sydney and Yaowen, to Andrew, with Deakin, Enshi, Rachel, Tabitha and Thomas, and to NDF. Thank you. It's been fun.
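As a small postscript on the variant question: where variant forms do have their own code points, checking is a simple lookup, as in this sketch. The three forms of zhong shown here are our illustrative choice, not necessarily the forms on Yaowen's slide; the hard cases are the historical variants with no code point at all.

```python
import unicodedata

# Traditional, variant, and simplified forms of zhong, "crowd": each is a
# distinct code point because they differ in abstract shape, so Unicode
# does not unify them.
for ch in "眾衆众":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+773E CJK UNIFIED IDEOGRAPH-773E
# U+8846 CJK UNIFIED IDEOGRAPH-8846
# U+4F17 CJK UNIFIED IDEOGRAPH-4F17
```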