Hi. So my name is James Siddle. I work for Digital Science, which is a subsidiary of Macmillan Publishing. We develop tools and information services for scientists. Specifically, I work on SureChem. We annotate patents to provide chemistry data for drug discovery, and we do that annotation automatically. We also provide the technology that performs the chemical annotation for nature.com. The dataset covers on the order of 10 million patent documents and around 1.2 million distinct chemical entities. Now, automatically annotating information at that scale is hard, and there are some serious challenges. Things change over time, so annotations and the systems that produce them go stale. Natural language is ambiguous, and text can often legitimately be read in more than one way. Standards are frequently missing, or exist but aren't widely adopted. Expert human attention is scarce and expensive, yet niche domains depend on it. The sheer volume of data demands a lot of computing power. And annotation services can be hard for people to adopt and integrate into their workflows.
Getting different groups of people to work together, and to agree on how information should be described and shared, is also hard. There's also likely to be resistance to opening and connecting information. So, there are some pretty serious challenges to achieving that goal. But there is hope. There are opportunities that we can exploit to make it easier to automatically annotate. Not everything changes at the same rate. Sometimes there are systematic patterns that we can make use of. Standards can provide stability where they exist. We can build annotation services that embrace and acknowledge ambiguity. We can utilise a vast, willing pool of human volunteers or paid human annotators on the internet to provide disambiguation and regulation of that automation process. We can employ vast, cheap computing power for data processing, machine learning and searching. And we can build dynamic, flexible, focused services that allow on-demand annotation to simplify adoption. So that's all a bit hand-wavy; here's a concrete example. This is the SureChem annotation pipeline, and I'm just going to talk briefly through how it works in practice. I'll point out the ways in which we've made use of some of those opportunities, and I'll describe the role that people play in making automation possible. So first, we push a stream of patent documents into our pipeline as they're released by the different patent authorities; it's about 10,000 documents a week that we process. We apply entity extraction to every document, using natural language processing, machine learning and dictionaries to systematically identify chemical names and positional information.
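As a minimal sketch of what dictionary-driven entity extraction with positional information looks like (the dictionary entries, function names and example text here are illustrative, not SureChem's actual implementation):

```python
import re

# Toy chemical dictionary; a real pipeline would use a large,
# human-curated vocabulary plus ML-based recognisers.
CHEMICAL_DICTIONARY = {"aspirin", "acetylsalicylic acid", "ibuprofen"}

def extract_entities(text, dictionary=CHEMICAL_DICTIONARY):
    """Return (name, start, end) tuples for dictionary hits in text."""
    hits = []
    for name in dictionary:
        for match in re.finditer(re.escape(name), text, re.IGNORECASE):
            hits.append((name, match.start(), match.end()))
    # Sort by position so downstream steps see hits in document order.
    return sorted(hits, key=lambda h: h[1])

doc = "The patent claims a formulation of aspirin combined with ibuprofen."
for name, start, end in extract_entities(doc):
    print(name, start, end)
```

The positional offsets are what allow annotations to be displayed in place and fed back for curation later in the pipeline.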
Now, the key things to note here are that we're focused on chemistry, which is relatively static. We also exploit the systematic patterns in what are mostly standardised chemical names to reliably identify chemical text. And we use a large volume of human-curated chemical names from patents as training and test data. Having detected names, we then disambiguate synonyms, because you can have many synonyms for a single chemical. We use name-to-structure tools that are readily available, some of which are open source, some proprietary, and that gives us chemical structures. The good thing is that chemistry has well-adopted standards for chemical representations, so it's possible to do that. After disambiguating, we actually go on to re-ambiguate: not every tool is capable of processing every name, so we apply multiple name-to-structure tools, meaning that we may get multiple structures for a single name. We embrace that ambiguity and expose that information in our dataset. Because we're processing bad data, we sometimes have to apply OCR (optical character recognition) correction, for names that have transposed characters or misrecognised characters, that sort of thing. We use heuristics for this, but it's possible that we could use more sophisticated techniques; this is an area where we could do much better. Now, in addition to text content, we actually process a second stream of data: we retrieve chemical images and chemical attachments from the patent authorities. The chemical images are just structure diagrams, and the chemical attachments are pre-annotations provided by the patent authorities. We detect the chemicals in those images and we process those chemical attachments. The point is that we're trying to draw on as many sources of chemistry as we can to provide high coverage, though potentially at the price of precision. Having processed all of that data, we then have to standardise and check all the structures.
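The "multiple name-to-structure tools, multiple possible structures" step can be sketched roughly like this; the converter functions and their lookup tables are stand-ins for real tools (such as OPSIN or proprietary converters) producing SMILES strings:

```python
def converter_a(name):
    # Stand-in for one name-to-structure tool with limited coverage.
    return {"aspirin": "CC(=O)Oc1ccccc1C(=O)O"}.get(name)

def converter_b(name):
    # A second tool may cover different names, or occasionally disagree.
    return {"aspirin": "CC(=O)Oc1ccccc1C(=O)O",
            "caffeine": "Cn1cnc2c1c(=O)n(C)c(=O)n2C"}.get(name)

def name_to_structures(name, converters=(converter_a, converter_b)):
    """Collect every distinct structure any converter produces for a name.

    Returning a set, rather than forcing one answer, is what lets the
    pipeline embrace ambiguity and expose it in the dataset.
    """
    structures = set()
    for convert in converters:
        result = convert(name)
        if result is not None:
            structures.add(result)
    return structures
```

When the converters agree, the set has one member; when they disagree or only some succeed, every candidate is preserved for downstream search and curation.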
Now, because we're automated and have such a vast quantity of data to process, we have to filter, so we apply consistent filtering and safeguards to try to filter out the crap that we do sometimes get. We use various medicinal chemistry properties that are clear indicators of spurious chemistry. All of the above we expose via a RESTful web service, making it easier to search, to display and to export. We allow searching and exporting across all structures, even where there's ambiguity. We provide this simple web service to ease adoption, which is in contrast to many incumbents in this area. Finally, we allow manual curation of the resulting annotations. This is a necessary step for certain key documents, such as the chemistry on nature.com, because incorrect structures would be unacceptable. It also allows feedback into the entity extraction and structure generation processes that we have. Now, we do all of this in the cloud, which allows very large-scale data processing and flexible scaling. So, just to summarise the key points here: we're focused on a fairly static domain, which allows us to automate. We utilise standardised naming and chemical representations. We disambiguate to the extent that we can, but we don't hide disagreements where they exist. We use large quantities of human-curated data for training and testing machine learning models. We do all of this in the cloud, making our data processing easier and cheaper to host and to scale. And we provide simple, easy-to-use APIs to ease adoption. So where do people fit into this picture? Well, there are six main areas that I've identified. They provide training and test annotations: dictionaries, annotation examples, search terms, the things that people are interested in. These allow the development of automated classifiers and the disambiguation of the resulting annotations. They build the tools and the services to take the content and to classify all the things.
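The property-based safeguards mentioned above might look something like the following toy filter; the specific properties and thresholds are assumptions for illustration, not SureChem's actual rules:

```python
# Hypothetical filter: reject extracted "structures" whose basic
# properties make them look like extraction noise rather than chemistry.
def passes_filters(record):
    """Return False for records that look like spurious chemistry."""
    if record["heavy_atom_count"] < 3:      # stray symbols, fragments
        return False
    if record["molecular_weight"] > 2000:   # implausible for small molecules
        return False
    return True

records = [
    {"name": "aspirin", "heavy_atom_count": 13, "molecular_weight": 180.2},
    {"name": "noise",   "heavy_atom_count": 1,  "molecular_weight": 12.0},
]
kept = [r["name"] for r in records if passes_filters(r)]
```

Applying the same safeguards consistently across the whole corpus is what keeps a fully automated pipeline's output usable at scale.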
So: expertise in machine learning, natural language processing, image processing, and specific domains such as chemistry. These are foundational components of automation. They collaborate to build and apply standards: we make use of the SMILES and InChI standards and of catalogue numbers, which are core to the SureChem product, and these support disambiguation and linking. They customise annotations to provide corrections and expert feedback, giving alternative perspectives, interpretations and feedback on classification quality. They also provide feedback through brute-force application of systems such as Mechanical Turk, which is something we've experimented with but haven't applied in anger. It's an area where careful interaction design is needed, and careful use of any captured data, because niche entity types such as chemistry really do require expertise; you need to know how to capture information very carefully in crowdsourcing. And finally, they can pre-annotate. The patent authorities provide us with what are called complex work units, which are essentially chemistry data provided for us. These avoid the need for automation in the first place, but they should be taken with a pinch of salt. Now, the last thing I wanted to mention was how you can go and apply some automation yourself. It's tough, so I'd recommend getting someone else to do it for you. We have a number of internal web services that SureChem provides, for example to Nature, and also prototypically to some of the other portfolio companies related to Digital Science. But for wider consumption I'd actually recommend a service called SciBite. Have a look at this; it's really interesting. What they do is provide a web service that accepts arbitrary text and returns positional annotations and disambiguated entities.
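To give a feel for what such an exchange looks like, here is a hypothetical request payload and response: you POST raw text and get back positional, disambiguated entities. The field names and JSON shape are assumptions for illustration, not SciBite's actual API.

```python
import json

# Hypothetical request body: raw text to be annotated.
request_payload = json.dumps({"text": "GlaxoSmithKline markets Avandia."})

# Hypothetical response: each entity carries a type, the matched term,
# and character offsets into the submitted text.
sample_response = json.loads("""
{
  "entities": [
    {"type": "COMPANY", "term": "GlaxoSmithKline", "start": 0, "end": 15},
    {"type": "DRUG",    "term": "Avandia",         "start": 24, "end": 31}
  ]
}
""")

for entity in sample_response["entities"]:
    print(entity["type"], entity["term"], entity["start"], entity["end"])
```

The offsets let a client highlight entities in place, exactly as with the positional information in the SureChem pipeline.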
So they support life science entities such as genes, proteins, names of companies and various biological entities. And this is an example of a command that you can submit just to do some automated annotation right now. Last slide. So here's an example of the result that comes back. I won't try to go through it in detail, but essentially it's identified the three main important entities there: two drug names and a company, so GlaxoSmithKline, Avandia and Flovent. So in summary, automated annotation is hard but crucial to linking information together, and I'd recommend going out and trying it yourself right now. That's all I've got to say.