Well, thank you very much for the very kind introduction. I'm not sure I can live up to expectations after that introduction, but I'll do my best. So today's talk is on data-driven materials discovery. I'll begin by explaining the challenge of the problem, then show you how we're trying to solve it, giving you four case studies from the energy sector, and then I'll finish with where we're going next. So here is a picture of Alexander Fleming, very famously discovering penicillin by chance. Now, we've come a little further than that over time, but we're still, in the main, discovering materials by trial and error. And this picture here in the middle is taken from a biomedical website, which likened materials discovery to firing scalpels at a dartboard with a moving target. Now, wouldn't it be better if there were a systematic way to discover new materials? And of course, I feel there's an opportunity to do so using the recent advances in big data. So if you're going to do data-driven materials discovery, of course you need data. So let's think about what the ideal source of data would be. Well, ideally you would want the entire universe of all possible chemicals, all molecules that could ever exist, and for each of those chemicals, to know their cognate material properties. And that's because there's an inherent relationship between molecular structure and the function of a material. And people over the years have used empirical means to work out what these structure-function relationships are.
And if we could encode those relationships, those patterns about chemical and property space, in a way that a computer could understand, then we could link them up to a search engine and use it to probe that chemical and property space, find patterns in the data, and ultimately make predictions and do experimental validation of new materials for a given application, to suit your needs. So that's the dream. We don't have the whole universe of all chemicals and all of their properties. So what do we do? Well, to a first-order approximation, we do have that information, albeit in a highly fragmented form. And that form is the scientific literature. That could be academic papers, patents, company reports, and so on. There'll be lots of people in this audience who have written a paper on one material and its properties. There'll be another person in the audience who's written a paper on another material and its cognate properties. But if we could grab all of the information from all of the documents that ever existed and put it together, then the sum is greater than its parts: we would have, to a first-order approximation, all of chemical and property space, so that we could predict new materials from all of that historical data. So that's the essence of my talk. And what we've done to achieve that is write a whole load of software that mines text for chemical and property information and extracts it, along with image information, chemical schematic information, and chemical reaction information. Here are four software tools, our primary ones. And having grabbed all of that information from the input documents that you give it, the software also automatically compiles it into databases for you. So now you have a way of making materials databases for your own needs, i.e. bespoke databases, for your given application.
And once you've got that compiled data, we can drive it into a design-to-device pipeline: using the data to predict new materials, using the machine learning that you see a lot of now, where you classify and optimise in data analytics, finding patterns in the data that lead to your prediction of a new material, and then taking the lead materials forward for experimental validation. So that's the pipeline. And for time, I'm going to focus just on ChemDataExtractor, the text-mining tool. So let's look a bit further at how ChemDataExtractor works. The input is scientific literature, and it can literally be thousands and thousands, or even millions, of documents if you have them. ChemDataExtractor will interrogate the documents. It will always find, as far as it can, the chemical molecule, wherever the chemical name is written, or appears in a picture. And it will also grab the paired property quantities that you've asked for: you specify the properties you need for your particular device, whatever it is you want to make. It will then grab that paired information and put it into a chemical database for you. How does it work under the hood? Well, this is best explained by example. It's chemistry-aware natural language processing, for those in the field. What it does is take every sentence from every document, all those thousands of documents, and process it like this. So, "Figure 2 shows the UV-vis absorption spectra of 3a (red) and 3b (blue) in acetonitrile" — a fairly typical sentence in science. It takes the sentence and splits it up into its constituent parts: the words, the numbers, the punctuation, and so on. It then assigns a grammar to each of those parts. So "Figure" is a noun. "2" is a cardinal digit. "Spectra" is a noun, but it's plural.
And "acetonitrile" is what they call a chemical mention. So this is the chemistry-aware logic coming in. Acetonitrile is probably the solvent here. You can probably read that as a human, but, of course, the computer has to think harder and use logic to work that out. Having assigned a grammar to the sentence, it then turns the sentence into a hierarchical tree. So it's a figure, and it's figure number two in that paper. It's a figure about a spectrum. And the type of spectrum is a UV-vis absorption spectrum, not, for example, a UV-vis emission spectrum. And it's a spectrum of something called 3a and something called 3b in acetonitrile. Now, very crucially here, you have to look at 3a and 3b, because we don't know what 3a and 3b are yet, right? If you put all of that into a database at this point, you would lose all of the chemical-name information and it would be a totally useless database. So what you have to do, before you leave the document, is resolve what the labels mean. And usually in a scientific document, if you track back in the paper, which the computer then does, it will typically find the first instance of 3a, and just before it you'll see a long chemical name, right? So "(3a)", or maybe 3a will be in bold or something like that. So if you train the computer to search for that, then you can resolve what 3a and 3b are, and then you can pull out the right chemical information and its cognate property information. So that's why it's chemistry-aware natural language processing. So there's this technology, and then we're using, let's say, fairly standard machine learning these days for the data analytics. The other actually pretty hard thing is when you get to the experimental validation.
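The tokenise-and-tag step, and the label-resolution idea, can be sketched in a few lines. This is a toy illustration only, not ChemDataExtractor's actual API: the one-entry gazetteer, the tag names, and the "take the two preceding words" heuristic are all assumptions made for the demo.

```python
import re

CHEMICAL_NAMES = {"acetonitrile"}      # tiny stand-in gazetteer for the demo
LABEL = re.compile(r"^\d+[a-z]$")      # compound labels like "3a"

def tokenize(sentence):
    """Split a sentence into words, numbers and punctuation."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def tag(token):
    """Assign a coarse grammar / chemistry tag to one token."""
    if token.lower() in CHEMICAL_NAMES:
        return "CHEMICAL"              # chemistry-aware: a chemical mention
    if LABEL.match(token):
        return "LABEL"                 # e.g. "3a": must be resolved later
    if token.isdigit():
        return "CD"                    # cardinal digit
    if not token[0].isalnum():
        return "PUNCT"
    return "NN"                        # default everything else to noun-ish

def resolve_label(label, full_text):
    """Track back: find '(3a)' and take the name written just before it.

    Naive heuristic: grab the two words immediately preceding the label.
    """
    idx = full_text.find(f"({label})")
    if idx == -1:
        return None
    return " ".join(full_text[:idx].split()[-2:])

sentence = "Figure 2 shows the UV-vis absorption spectra of 3a in acetonitrile"
tagged = [(t, tag(t)) for t in tokenize(sentence)]

paper = "We prepared 4-aminobenzoic acid (3a) by the route in Scheme 1."
name = resolve_label("3a", paper)
```

The real system uses trained taggers and full grammar parsing rather than a gazetteer and regexes, but the flow is the same: tokenize, tag, then resolve labels before anything is written to the database.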
And I'm not going to talk so much about the experimental work today, but I do want to give you an overview of it, because it often involves complex facilities, and so I thought I'd give you a glimpse of the sorts of facilities where we do experiments. So what we're going to do now is a fly-by of the UK's Neutron and Muon facility. There's a glimpse of the experimental world that I mostly won't be able to talk about for time reasons, but hopefully that gives you some idea. So we've talked about the challenge and we've talked about the technology, so let's see how we apply it. I'm going to show you four case studies, all taken from the energy sector, to see if we can discover new materials. I'm going to start with solar: the sun. Here's a picture of the sun in Cambodia, just for your amusement. And I want to discuss how we can apply ChemDataExtractor to discover new light-absorbing materials for photovoltaics. Irrespective of the type of photovoltaic device, all of them need some form of light absorption, so I want to think about the underpinning problem at the molecular level. Let's look at this graph. The black jaggedy edge here, that's the solar emission spectrum. So that would be your visible light as a function of wavelength; that's what you see as a spectrum from the sun. What you want, in terms of grabbing all of the photons under the area of that curve — that's your goal — is a light-absorbing material that would ideally look something like that green line, which pretty much grabs all of the area under the curve, so all of the photons. Now, there's no one material that really does that, which is a problem. But people in device technology often combine, say, two different types of molecules: one that absorbs in the blue and one that absorbs in the red. And as a convolution of those two peaks, you get more or less the green line. That's the concept.
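The two-dye idea just described can be sketched numerically: model each molecule's absorption as a simple band, and the pair together covers more of the visible range than either alone. The peak positions and widths here are illustrative assumptions, not real dye data.

```python
import math

def gaussian_band(wavelength_nm, peak_nm, width_nm):
    """Simple Gaussian model of one molecule's absorption band."""
    return math.exp(-((wavelength_nm - peak_nm) / width_nm) ** 2)

def combined_absorption(wavelength_nm):
    """Sum of a blue absorber and a red absorber (illustrative peaks)."""
    blue = gaussian_band(wavelength_nm, peak_nm=450, width_nm=60)
    red = gaussian_band(wavelength_nm, peak_nm=650, width_nm=60)
    return blue + red

# Crude "area under the curve" over the visible range, sampled every 10 nm:
coverage_pair = sum(combined_absorption(w) for w in range(400, 701, 10))
coverage_blue = sum(gaussian_band(w, 450, 60) for w in range(400, 701, 10))
```

The pair's coverage is strictly larger than the single blue absorber's, which is exactly why device people co-sensitize with complementary molecules.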
So people often do that in the device world. So let's set ourselves the problem of making a database, using ChemDataExtractor, of the underpinning molecular property information: say, the wavelength maximum, to see where the peaks are; and, where present, the absorption intensity, which is called the extinction coefficient; and of course the material, the chemical name, as well. So that's the database we're going to build: material, property 1, property 2. We built that with ChemDataExtractor, and at the time it was just under 10,000 chemical molecules and their corresponding properties, grabbed from the scientific literature. And our goal is to get from 10,000 possible light-absorbing materials all the way down to, say, five lead candidates that we can take forward for experimental validation. The way you do that is to apply this design-to-device pipeline, like so, and ask really carefully chosen questions that sequentially filter the 10,000 down to just a few. The first question we ask, for example, is to remove all things containing metals: we wanted organic materials, for environmental regulations. That actually made our case quite hard; nonetheless, that's what we did. And that took us down to just 3,000. That's already a big jump, and you want quite big jumps early on, so you don't end up asking far too many questions. Then we ask a question that's relevant to the device — we have a photovoltaic technology. We knew that, in the device, we wanted molecules containing a carboxylic acid group, because we knew that the interface between the light absorber and the semiconductor that makes up the working electrode, as a composite, was working particularly well when we had carboxylic acid groups in our light absorber.
So we fixed that filter, and then you see an order-of-magnitude reduction, from 3,000 to 300, more or less. Then we wrote a mathematical algorithm that finds the optimum combination among those 300 or so you've got left — if you like, the best combination of blue-absorbing and red-absorbing, or at least the extremes. And then you go down another order of magnitude, to about 30 possible light-absorbing materials. At 30, you can actually go in manually and do things; it's a manageable number. What we did then was perform electronic structure calculations on all of those 30 or so, to check what we call the energetic alignment of our light-absorbing materials within the device. This is the idea that we can predict all these new light-absorbing molecules, but that's all in isolation — we've got to think about the device. You've got the electrolyte, you've got the other electrode; we've got to check that the energetics line up so that you get the right driving voltages. So we did that, and that brought it down to essentially a handful: in our case, five lead candidates to go forward for experimental validation. And here are the five. One good thing about the ChemDataExtractor type of technology is that, because we've mined the data from the scientific literature, we can contact the people who originally made these materials. We always keep the DOI as we go through, so we can track back and find the author for correspondence. And all of these molecules, by the way, were made purely out of scientific and synthetic curiosity — none of the authors had any idea they might be applied to photovoltaics. And I wrote to them — emailed them, actually — and said, hey, you know, we think your material might be useful for a photovoltaic application.
Do you still have some, or could you remake it, send it to us, and we'll put it into a device and test it for you, and we'll do this as a collaboration so everybody wins? And they all said yes. They all sent materials, we put them into devices, and this is what we got. This is a graph of voltage versus current density. For the experts in the room: remember, this is our lab work, but it's the relative differences that matter. The black curve is an important reference; it corresponds to the industry standard. Now, the industry standard is actually a metal-organic material. We said we wanted only organic materials, but as an industry standard we'll reference against the metal-organic one — it's one of the best-performing ones you can get. So if your curve in your testing gets anywhere close to that black line, you're doing well. I'm going to focus on two of the candidates, 6 and 15; those were their labels in the original publications. If you test them on their own in a device, you get the red and the blue curves. The voltage is okay, but the current density is not so great, because you're not close to the black curve. But of course the logic was to put those two together, to get blue-absorbing plus red-absorbing. And when you put them together, you get the mauve line, which is actually pretty close to the industry standard, yet we've only got organics in our system. So we were pretty happy about that. It achieves 92% of the power of the industry standard. Other people were happy too: it got onto the front cover of Advanced Energy Materials. So there's an example of data-driven materials discovery in action. And before I leave the solar topic, I just want to touch on one other thing, which isn't discovery per se, but I think it's relevant: you can also use ChemDataExtractor, the text-mining tool, to help with manufacturing.
That's because you can also mine existing things that people already know. Why would you do that? Well, for example, you can make nice histograms — of open-circuit voltage, current density, and power conversion efficiency — for all the known materials for two types of photovoltaics: perovskite solar cells and dye-sensitized solar cells. There's more data on this one because it's a much more mature technology: nearly 200,000 data points on perovskite solar cells. So that's what's known; this isn't discovery at this point. And it's not just the device and molecular properties you can grab; you can also grab information about the actual manufacturing process. It might be quite important for you to know what type of solar simulator was used as the testing device, or the active area of the test sample used for the solar cell. That type of information can be very valuable if you want to optimise the manufacturing process. So there are side uses of ChemDataExtractor as well. So that's the first case study. I now want to move to the second case study, and we're going to look at heat. This is a picture of a volcano erupting in Guatemala. I took the photo from a safe point here. You're laughing, but wait till you see what happens next: when it erupted, I was actually there on the ridge. That was from a video, by the way, and you don't ever want to be closer than that. That's just to wake you up and show you some heat. Anyway, I'm not going to do volcanic applications, but I am going to do thermoelectrics, and this also gives me an excuse to play with Lego. I made a little thermoelectric device, which I'll pass around in a second, but just to explain how it works for those not in the domain area: what you have here is a case where you have a cold surface on top and a hot surface on the bottom, and therefore a thermal gradient — and I'm going to super-simplify this, sorry, all physicists in the room.
So just like in the atmosphere, where a thermal gradient drives the convection currents that cause all the weather, here you have electrons essentially forming a convection current, in this case through a thermocouple — the p- and n-type semiconductors — and so you create electrical currents, and therefore a voltage is driven, because you've got the thermal gradient stimulating it from the top. And because the thermocouples are all in series, that creates an electrical circuit: electricity. So I'll just pass these around. I've got two, because I got particularly carried away playing with the Lego. Please send them back to the front, people at the back, because otherwise the folks in the lab will kill me for stealing all their Lego — I promised I'd return it. Cool. So that's how a thermoelectric material works. And in equations, here is the figure of merit, ZT. Just understand it as your proxy: a high ZT means a high-performing thermoelectric material. S is the Seebeck coefficient, and sigma is your electrical conductivity — you want that high. The thermal conductivity, kappa, you want low, because you want to keep that thermal gradient without it interfering with the electrics. And you want the temperature, T, high. And if we then use ChemDataExtractor to mine all of these properties, plus something called the power factor, we get this distribution. We literally just published this, I think, eight days ago — it's a new database. And just for interest, you see all these spikes; you're probably wondering what they are. This is real experimental data; it's not an artifact. What they are is actually the rounding of people's results: people will quote things to, say, one decimal place.
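The figure of merit just described is ZT = S²σT/κ, and it can be written out in a couple of lines. The numerical values below are illustrative, roughly Bi₂Te₃-like, and are not taken from the mined database.

```python
def figure_of_merit(seebeck_v_per_k, sigma_s_per_m, kappa_w_per_mk, temp_k):
    """Thermoelectric figure of merit ZT = S^2 * sigma * T / kappa.

    High Seebeck coefficient S and electrical conductivity sigma, low
    thermal conductivity kappa, and high temperature T all raise ZT.
    """
    return (seebeck_v_per_k ** 2) * sigma_s_per_m * temp_k / kappa_w_per_mk

# Illustrative room-temperature numbers (assumed, not mined data):
zt = figure_of_merit(
    seebeck_v_per_k=2.0e-4,   # 200 microvolts per kelvin
    sigma_s_per_m=1.0e5,      # 1000 S/cm
    kappa_w_per_mk=1.5,
    temp_k=300.0,
)
```

With these inputs ZT comes out at about 0.8, a plausible order of magnitude for a decent room-temperature thermoelectric, which shows why each term pulls in the direction described above.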
So then you get things spiking up at 1.3, 1.4, et cetera. And of course we present the data faithfully. We could average it and put Gaussians over each of those spikes, because probably they mean a bit more or a bit less than the quoted value, but this is the real data — that's why you see the spikes. And you can do things like look at the highest-ZT material. People may know that anyway. Okay, but what else can you do with this? You can forecast with it. You could take the year of publication — remember, these come from academic papers — and compute the average ZT, the figure of merit, per year. And then you can say: well, roughly, if that's a linear trend, by 2052 the average ZT reported in papers would be 1.5, for example. And if you're thinking, should I work on thermoelectrics? Well, if this is the year of publication and the number of records is going up like this, it's probably quite a good field to go into right now. So it's also good for helping you with forecasting — that's the point. So that's heat. Now let's look at hydrogen. There's been a lot of media interest, of course, in hydrogen fuel cells, for example, as a technology. And therefore you need a way of producing hydrogen. People mostly produce it using water, doing something called water splitting, which, unsurprisingly, means you split water into its constituent parts, hydrogen and oxygen. It's good because you generate what you need, the supply of hydrogen, but you also generate a clean side product, oxygen, which is also helpful. So no problem there. And there'll be a catalyst that may help. This technique is called water splitting, and this is a picture of me water splitting, for no reason. And here is the question: how do we apply ChemDataExtractor to help?
Well, there's no point in mining for water, or something like that. But we could mine, for example, for all the catalysts that have ever been produced for water splitting — and, unsurprisingly, we are making a database (we've made one, actually, though we haven't published it yet) of all catalysts for water splitting. So that may help with optimising things: finding the right catalyst for the hydrogen-production side of water splitting. But there's another use we could work on, which is thinking about the hydrogen itself. Once you've made hydrogen, it's a gas, so we've got to find a way of storing it. Now, what you would really want is to condense it into liquid form so that you could store it in a nice, small, contained little vessel. But there's not really a good way of storing hydrogen in liquid form, partly because the liquefaction temperature of hydrogen is about 21-ish kelvin — that's about minus 253 degrees Celsius. So that's really cold. So how do we find a refrigerant that will cool the hydrogen down, put it into its liquid phase, and keep it cool? Can we find a material like that? What we're going to try to discover for this purpose is a magnetic refrigerant. The idea is that some materials heat up when you apply a magnetic field to them, and when you take the field away, they cool down. So that can potentially be a way to make one — this is just my avatar, if you like, of a magnetic refrigerant; it doesn't really look like a fridge, but it gives you the idea. So what are the properties that make a magnetocaloric-effect material? They're these three — don't worry, it's just equations. We have the relative cooling power, RCP, as one of the parameters that govern the properties, plus the change in magnetic entropy and the temperature. So just remember: we have three parameters, if you're not a scientist. And what we're going to do now is our ChemDataExtractor thing.
We're going to mine the literature and find all materials that have reports of any or all of those three properties, and we'll see what we get. If we do that, there aren't actually that many: just under 3,000 materials, each with those numbers of the individual properties. Tc, by the way, is the Curie temperature; that's there because we wanted materials that remain ferromagnetic right down at our target temperature. And so that's what we got. In that sense, while we got this database, it didn't give us a brand-new material suitable for the hydrogen liquefaction temperature. But that's okay, because having got the database, we can at least make a regression model: we've now got data with structure and property information, so we can apply machine learning to build a regression model linking structure-function relationships for the magnetocaloric effect. That means that if we come in with a new material, we can use this regression model to predict its magnetocaloric properties. Okay, so now we need a new material. How do we get new materials? Well, we turn to our machine-learning world — super-geeky moment coming — and we apply something called a conditional, deep-feature-consistent variational autoencoder with a U-Net and a crystal-graph convolutional neural network detector. You'll be glad to know I'm not going to explain what all of that is. Essentially, it's a generative algorithm that creates new, hypothetical 3D crystal structures that can then be fed into this prediction. And the way it works, without going into too much detail, is this — the imaging world does something similar, in fact. You've probably seen on the internet a picture of a human face that morphs into some other human face. It's a bit freaky sometimes.
It's not quite like that, but it's sort of like that in essence, because you can treat a molecular structure like an image: it's just three-dimensional coordinates. Don't worry about bond lengths and bond angles and all those things; just see it as an image. And if you treat it like that, you can make a data distribution that's representative of molecular structure, or crystal structure in this case. Then you feed in something that's close to your target material. For example, with magnetocaloric-effect materials, we knew that Heuslers — a type of structure — or perovskites, or cubic materials in general, tend to make good magnetocaloric-effect materials. So we can say: let's take one of those that is known, send it through what they call the latent space — this sea of representative data distribution — and it will sample it, subject to a standard deviation. If it's a big standard deviation, the input structure is allowed to be perturbed a lot; if it's a small one, it won't be perturbed so much. It will then output a whole load of newly generated, hypothetical crystal structures. Most of these, by the way, might be totally unrealistic, because the model doesn't know anything about what's realistic as a bond geometry. But that's okay, because a lot of those will get screened out: you check whether the energetics make any sense, and if they don't, you just throw the structure away.
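The sampling logic just described can be shown with a toy sketch. A real VAE learns its encoder and decoder from data; here they are trivial stand-ins (flatten and regroup coordinates), purely to illustrate how the standard deviation controls how strongly a known input structure is perturbed into new hypothetical ones.

```python
import random

random.seed(0)  # deterministic demo

def encode(coords):
    """Stand-in encoder: flatten fractional coordinates into a latent list."""
    return [x for atom in coords for x in atom]

def decode(z):
    """Stand-in decoder: regroup the latent list into (x, y, z) triples."""
    return [z[i:i + 3] for i in range(0, len(z), 3)]

def generate(structure, sigma, n_samples=5):
    """Sample hypothetical structures around a known one.

    Larger sigma => latent vector is perturbed more => outputs stray
    further from the seed structure (and more will fail later screening).
    """
    z = encode(structure)
    return [decode([x + random.gauss(0.0, sigma) for x in z])
            for _ in range(n_samples)]

def mean_shift(samples, reference):
    """Average absolute displacement of samples from the reference."""
    flat_ref = encode(reference)
    diffs = [abs(x - r)
             for s in samples
             for x, r in zip(encode(s), flat_ref)]
    return sum(diffs) / len(diffs)

seed_structure = [[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]  # e.g. a cubic motif
gentle = generate(seed_structure, sigma=0.01)   # small perturbations
drastic = generate(seed_structure, sigma=0.5)   # large perturbations
```

In the real pipeline, each generated structure would then face the energetics check described above, with unphysical geometries simply thrown away.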
So we've now got this way of making hypothetically generated crystal structures, and once we've checked their energetics, they could plausibly be real, and they're different enough from the input that you're happy they're new. With these new materials you do some further screening to check a few things: you check for ferromagnetism, for example, and phase stability, that sort of thing. Again, it's this kind of inverse-pyramid pipeline that goes from, say, 1,000 down to 30. And with those lead materials, you can then apply the regression model I talked about — the one we got from the experimental data via ChemDataExtractor — because that links structure and function. So we can take our new hypothetical crystal structures and predict their magnetocaloric-effect properties. Once you've done that, you've got a prediction of a new material, and then we have to seek out a way to synthesise it. If we can do that, we can experimentally validate that material as a magnetocaloric-effect material, i.e.
a magnetic refrigerant. So here are the results. The blue points are the known ones from the literature that we grabbed — these are two of the properties, relative cooling power on the vertical axis and Curie temperature on the horizontal. So the blue is what was known, and the red points are predictions: brand-new materials. One of them is this one, and if you know your periodic table, you'll know that Pm happens to be the one lanthanide that's radioactive. That's annoying, isn't it? ChemDataExtractor doesn't know anything about safety, by the way. So we can't make that one, but we could perhaps substitute out the lanthanide, and we were quite keen, because that was quite a nice temperature range for the liquefaction of hydrogen. Otherwise, you could also go down here and pick something slightly different. So we're now trying to make that, and here I am at the neutron and muon facility, waiting with my sample, ready for it to go on — but we haven't quite synthesised it yet. And that's really all I have to say about hydrogen and where we got to on the discovery end of this project. But also, just so you know, with other predictions we could think about room-temperature magnetic refrigerants, which could be used for a different type of fridge, and we've got predictions there that we're also trying to make. So that's number three out of four, and now we turn to batteries. This is a picture of a mountain where you've had an electrode in the sky and an electrode on the ground, and you've had an electrical discharge — obviously, that's just lightning. This mountain has actually been struck by lightning and it's on fire; this was a few years ago now. Does anybody know where it is? Is it obvious to people? It's interesting — I'm getting lots of shaking heads, from the scientists as well. Any further thoughts from the scientists, particularly the neutron or synchrotron people?
I think I heard it — yes. That's the mountain here that's on fire, and this is the European neutron facility, with the European Synchrotron Radiation Facility next door to it. Anyway, that was just an excuse to show you the European facilities — nature's form of electrical discharge. But let's now think about batteries. You're probably getting quite bored of hearing this by now: we're using our ChemDataExtractor text-mining tool to mine the literature, in case you haven't got the message. In this case, we're going to mine materials and device information: the capacity, the conductivity, the voltage, the energy, the coulombic efficiency. And we get nice distribution graphs like these, and again you see this weird rounding effect, with the spikes at round values coming out again, because it's real experimental data. Now, in this case we grabbed all the material and device property information, but what ChemDataExtractor couldn't do was distinguish whether a material was an anode, a cathode, or an electrolyte. It got lots of materials and all the device properties, but it couldn't classify which was which, because that was never really programmed into ChemDataExtractor. So we thought: how are we going to do that? What we did, having got that database of nearly 300,000 data records from ChemDataExtractor, was go back and sort of reverse-engineer it. We had a scientific corpus that we fed in, which ChemDataExtractor processed, and we took from it the papers that had a successful hit — the ones containing battery information that made it into our database — and we siphoned off that subset. Now we've got a very battery-rich corpus of data, and we fed it into a different pipeline. Let me explain what the pipeline does. So we have our battery-rich data corpus, and we're then going to
make something called a data model. This is a different way of mining data, and it comes from technology out of Google AI — in fact, it's often the basis behind the Google search engine. What you do is take those sentences I talked about — remember how ChemDataExtractor works, where you have these sentences and you pull them apart and all of that — but instead of doing that, you make vector representations of them. So imagine a phrase like "the cat sat on the mat", and I write down "the cat sat on the mat" twice, identically. Now, you know that the subject and object are the cat and the mat, so there's a high correlation between "cat" and "mat", and you can build a bipartite graph linking those two with a high correlation. So you can start to get contextual information from the sentence by relating together the information you know is highly correlated, and the computer can do that by making this vector representation. Having made the vector representation, it feeds it into a very deep neural network, which, subject to all sorts of weightings, trains on all of the sentences you put in. So in some sense it's different from ChemDataExtractor in all sorts of ways, but in one particular way especially: instead of extracting the chemical and property information, you keep all of the sentence information — all of the corpus, all of the words, everything — and you put it into a model. I use that word very carefully. So what we've made now is a trained network that builds a model of all of that corpus, with all of the context of the information from the sentences. It's what they call a language model — simplified a bit, but that's basically what you're doing. Now, that's a really important distinction between
A database is a very static thing that you look at and run analytics on, but now we have an interactive model; that's why I'm calling it a model. The way to think about it, in my head at least, is as your computer. You have a motherboard, which does the core operations, and you have lots of peripherals, like your mouse, your printer, your keyboard and so on. Each peripheral has a different function: I can plug the mouse in and make my motherboard do different things. It's that kind of conceptual framework. So we've got this data model, which is a language model, and because it's interactive I can ask questions of it. I want to know, from the materials standpoint: is it an anode, is it a cathode, or is it an electrolyte? So I make up a whole load of questions that have designed answers, like "what is the anode?", "what is the cathode?", "what is the electrolyte?" and so on; you can make more complex sentences, of course. You make another database just of questions and answers, for the materials domain of interest, the battery domain, and then you mix that with lots of generic question-and-answer pairs about normal English language. You put it all together, and then you can ask the model questions. In this case we want to know what the anode, cathode and electrolyte are, as a classification, so we ask those questions, grab the answers, and then classify whether the materials that went into the original database are an anode, a cathode or an electrolyte. Hopefully that makes some sense. And this is what you get: pretty much what you'd expect from "what is the anode?", "what is the cathode?" and "what is the electrolyte?".
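The classification workflow can be sketched as follows. In the real system the question answering is done by a trained transformer QA model; the `toy_qa` function below fakes that step with simple pattern matching just to show the loop over designed questions, and the sentences, material names and regular expression are all invented for illustration.

```python
import re

def toy_qa(question, context):
    """Toy extractive QA: answers 'what is the <role>?' by pattern matching.

    A real system would answer with a transformer QA model trained on
    battery-specific plus generic question-answer pairs; this stand-in
    only illustrates the workflow.
    """
    role = question.removeprefix("what is the ").rstrip("?")
    match = re.search(rf"{role} (?:is|was) ([A-Za-z0-9-]+)", context)
    return match.group(1) if match else None

def classify_materials(context):
    """Ask the three designed questions and label each material with a role."""
    labels = {}
    for role in ("anode", "cathode", "electrolyte"):
        answer = toy_qa(f"what is the {role}?", context)
        if answer:
            labels[answer] = role
    return labels

paper = ("In this cell the anode is graphite, the cathode is LiCoO2, "
         "and the electrolyte is LiPF6.")
print(classify_materials(paper))
# -> {'graphite': 'anode', 'LiCoO2': 'cathode', 'LiPF6': 'electrolyte'}
```

Swapping the regex for a real QA model leaves the surrounding loop unchanged, which is the appeal of the designed-questions approach.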
So that's classification, and we can go one step further with this data-model concept; this is really a paradigm shift, I think, between it and databases. We can have not only a question-and-answering module but also a text-mining module. What if we actually blended ChemDataExtractor, a chemistry-aware natural language processing and text-mining tool, with this new technology from Google AI, to make BatteryDataExtractor? (You can tell I haven't got much imagination with my names, by the way.) This was published just a few days ago; it's the first property-specific text-mining tool for auto-generating materials databases, and we've done it on batteries. As I say, it blends two different technologies to get, hopefully, the best out of both worlds. It's also good, I think, because it has a probabilistic nature: because it's a model, you can get confidence scores, actual probabilities for each data record you obtain, telling you how likely the answer to the question you asked is to be right. That's really important, because how does somebody know whether they should trust my database that was auto-generated with ChemDataExtractor? We do experimental validation, obviously, but to earn that trust we need to think about real probabilistic confidence scores for all of the data records, so I think that's a good thing. The fourth thing is that we now have a new way to interrelate material and property text during data extraction. This is a really big problem, by the way. Imagine you've got a scientific paper and you want to mine the chemical and property information, but the chemical name is on page 1 of the document and the property information is on, say, page 5, and in between, on pages 2, 3 and 4, you've got a whole lot of other
chemical names and a whole lot of other property-type things. How do you know that the chemical name on page 1 and the property on page 5 are related, and not that something in between was related to either of them? Well, because you have a model now, not a database, you can, if you're careful, ask the right sequence of two questions that by definition interrelate them, essentially using a sort of analogue version of Bayesian statistics; it's a conditional probability, if you see how it works. Let me give you an example. Imagine I want to find a material with a voltage of two volts, just for argument's sake. I'm going to ask two questions, in this sequence. First: what is the value of the property name? The property name is voltage, and you get an answer; you're just dealing with the property at the moment, so let's say the answer is two volts. You can find that easily, because it's just looking in one area of the document. But then we ask, because remember we want that link between material and property: which material has a property name, the voltage, equal to the answer to the previous question? There's your conditional; that's your Bayesian bit. The answer to the previous question was two volts, so the question becomes: which material has a voltage of two volts? And that puts the two into relationship, in an analogue and, frankly, very geeky way of doing Bayesian-style statistics. That work was published recently on the front cover of Chemical Science, the Royal Society of Chemistry's flagship journal. Leading on from this is really where we're going with all of it. You can build up all these different modules with different peripheral functions, just like the mouse and the printer and so on with the central motherboard of your computer; in this case the analogy is the data model, and you can make it do lots of different things. So that's, I think, where we really want to be going: moving away from actual databases and into this more interactive type of zone.
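The two-question sequence can be sketched in code. Again the `toy_qa` function fakes the language model with a regex search (the real system uses a trained model that returns an answer plus a genuine probability); the document text, patterns and confidence value below are invented for illustration.

```python
import re

def toy_qa(pattern, context, confidence=0.9):
    """Toy QA stand-in: a regex search plays the role of the language model.

    The confidence here is a made-up constant; the real model reports a
    genuine answer probability for each question.
    """
    match = re.search(pattern, context)
    return (match.group(1), confidence) if match else (None, 0.0)

document = ("page 1: LiFePO4 was synthesised by ball milling. [...] "
            "page 5: LiFePO4 has a voltage of 2 V.")

# Question 1 deals only with the property: what is the value of the voltage?
value, p1 = toy_qa(r"voltage of ([\d.]+ V)", document)

# Question 2 is built from question 1's answer -- the conditional,
# 'Bayesian' step: which material has a voltage of <answer>?
material, p2 = toy_qa(rf"(\w+) has a voltage of {value}", document)

# The joint record carries a combined confidence, conditional-probability
# style: P(material, value) = P(material | value) * P(value).
record = {"material": material, "voltage": value, "confidence": p1 * p2}
print(record["material"], record["voltage"])  # -> LiFePO4 2 V
```

The key design point is that question 2 is not fixed in advance: it is constructed from question 1's answer, which is what ties the material to the property regardless of how far apart they sit in the document.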
So that's the software side of things, and I just want to close by thinking forward about the experimental world as well. This is a picture of the new Ray Dolby Centre at Cambridge University, which is nearly finished; this is where the Physics Department and others will be housed, so we'll continue to have all the mod cons on the experimental validation side, for example. And this is the Rutherford Appleton Laboratory, to which I'm 50% seconded. It already has these facilities and others: there's Diamond, the X-ray source; ISIS, the neutron and muon source; and the Central Laser Facility here. In future, I believe, it will also have all these extra things: a new development in the laser facility, nine new instruments already being planned for the existing neutron and muon source, and possibly an option for a new neutron and muon source, called ISIS II, over here or maybe at some other location. That reminds me to say that this site, the Rutherford Appleton Laboratory, is particularly large, but like many big organisations it has a plan to reach net zero by a certain year, and that's a really massive challenge. I've talked today about a design-to-device pipeline, but that's a linear pipeline, and if you really want to get to sustainability then you've got to adopt the circular economy, which is actually more of a butterfly shape. Our design-to-device work sits here, but we've really got to build an ecosystem: everybody talking, from the people on the county council dealing with recycling to the geeky scientists and engineers who are designing and making new materials. We've got to build that ecosystem, and that's a really different way of working that we've got to adopt. So you can tell, again, that towards the end
of my lecture I'm going to get a bit more light-hearted now. We can call user support with our Bat-Signal: that's us in the new facility, with the lasers reaching out, and they will call on the user community, which has an international reach. We can make our circular economy and, working even more closely together across the facilities, continue to do Olympic science for a sustainable future. And now for the grand finale: I want to thank people, and I want to thank people properly, because really this is a celebration of everybody with whom I've worked. So rather than just put up a slide, I made you a short video to thank everybody. Here goes.

That's fantastic, thank you so much; it's really special and different. Anyway, questions for Jackie. There are microphones, and remember there's an audience out there as well, so I'm going to stand at the front here so I can be seen. Thanks, Jackie. A number of your databases have 100,000 papers or so. How long does it take a computer to read 100,000 papers, and is it a laptop or a supercomputer?

We've got two different ways of answering that. The time taken to read the documents is not so long; what takes the time is actually the data-cleaning process. Once you've mined the data, you don't just get what you want; you also get lots of other things that don't quite work, and so you have to spend a lot of time cleaning the data. So I don't want to pretend that ChemDataExtractor is something where you just press the button and out pops the database. To give you a real example of it in practice, a fresh PhD student from, say, my group would take probably about two years to get a database that's in published work; obviously they get quicker, because they've learnt by that point, but that would be coming at it fresh. We do use supercomputers, for sure, and certainly it's a lot faster; for the really big runs we will use a supercomputer, but you could in theory, as
long as you've got time, do it on a regular desktop computer.

Thanks, Jackie, it was a great talk. You talked about extracting data from the scientific literature. Within our facilities, nationally and globally, there are huge amounts of data, and I just wonder what your thoughts are on extracting those data, in terms of challenges and opportunities, because clearly a lot of them don't end up in scientific publications.

I guess it depends on what condition the data are in, or at least what stage they are at in their production. You've got raw data, which we couldn't realistically mine, because you've got to process them first; that's where you have to reduce your data and analyse your data so that they become something meaningful that you can relate to the rest of the world. That's the first thing. Once they become processed data, you would be in a position where you could maybe publish them, or not. If those data were available to the likes of us, then we could actually take them, but we would have to fit them into some sort of internal framework, I think, and pipe them into the database format; there's no reason you can't put a side arrow into that pipeline. Of course, you're on better ground ethically than me: how ethical is it to take somebody else's data that they didn't really write up, and then use it without maybe asking them? After three years, I believe, it becomes open access anyway, so we get into policy issues and even possible ethical issues, and I think we have to think about that. There is something called DataCite, a framework that they've been very active at the Rutherford Appleton Laboratory in building, which, if the data are old enough, more than three years, actually allows you to go in and find even the proposal that people wrote to do the experiment, the metadata that was used in the log files, if they're electronic, if the users wrote them down, and
then the actual raw data. If you really wanted to, after three years, you could go in through this DataCite database and actually process it yourself. But again we get back to that ethical question: somebody else did the experiment, maybe they were doing a PhD, they finished, they left; should you go in, process it yourself and publish it? And who's the publisher, and who's the author? It raises some interesting questions.

I could actually ask one myself. As a computing person, I can see quite a number of tools that a computer scientist could construct from our end of things which might help; for example, a programming language designed for this sort of domain. Does this happen, or would it be unusual? I know it would be helpful, because I can think of quite a number of things which would make your life easier.

It depends what you mean by that; beyond programming languages there are other things to think of. We program everything in Python, partly just because that's what a lot of people know and because we want consistency across the board. But we could think about probabilistic programming languages, the likes of Julia for example, and that I think could help; it might make things more efficient. And certainly, and this relates to Russell's earlier point, if we're using supercomputers, maybe we can make it efficient enough that we could do it even better on a desktop; ultimately everything is, as you know, increasing, Moore's Law being exponential, it seems. So I think there is interesting stuff there. Anyway, any more? Andy.

Thanks, Jackie, that was marvellous. I'm thinking about the underlying physical mechanisms that make materials exceptional. You can identify, or point towards, an exceptional material; you had some in your plots. Can your system give you any insight into what is different about the mechanism in those materials that makes them exceptional, or is that still a job for humans?
So, yes, one thing I would say is that I don't think humans are going to become redundant. Some of the really novel stuff we will never get, because we're predicting based on trends; we would never have predicted the new quantum technology, for example, and in fact there's not enough data yet to make inroads there. But I think there are edge cases, and you could look for outliers: you could use ChemDataExtractor to find all the regular stuff and then say, what's this thing over here? Is it just a totally duff piece of data, or is it actually really, really special? At the moment we might look at the outliers, but we're pre-programmed a little bit to think that they're outliers and therefore don't count; we could look at them differently for that purpose.

Over there. Thank you very much. I'm wondering how reliant you think we are on changing trends in how data are actually published in the scientific literature, partly standards of English and language, but also aspects like the fact that a lot of papers now have far more material in the SI than in the paper itself, and I don't know whether you are mining the SI in the same way.

By the way, there are quite a few of the publishers in the room, so it's a very relevant question. I think there are different ways that regulation could help. Take the historical example of crystallographic data: for decades it has been mandated that you can only publish crystallographic information if you include a CIF, the crystallographic information file, as part of your submission. That's not the case for almost anything else, and so there's a lot of hidden data that could actually be really useful; that's one thing to say about the crystallographic information. Of course it's a burden on the publishers, but if it were regulated, people might do it; the chances of anybody doing it voluntarily, I think, are actually quite slim, despite the best of intent, and I can speak
from personal experience: when I publish a paper, the last thing I want to do at submission, when I've got all these files, is produce a crystallographic information file, and I curse a little bit, but I still do it because it's mandated; if it weren't, for all the best intent in the world, I wouldn't. So I think we have to find a way to regulate that sort of process. With regard to supporting information, you're right, a lot is increasingly going into it, and the problem with that, from my standpoint as a data extractor if you like, is that it's all PDFs, and PDFs are really hard to read. We've just made PDFDataExtractor, which you saw in the clip running through at the end; we made it because we can't access most of the supporting information, and we want to be able to do so. PDFDataExtractor is a code that will go onto the front end of ChemDataExtractor so that extraction from PDFs will be better. It's still not great, because ChemDataExtractor is of course optimised for markup-language extraction, which is far easier to access, so we have to think about that, and I'm always very happy to talk to publishers to see if we can work together to find a way through and improve that process for everybody.
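To make concrete why PDF text is harder to mine than markup: text pulled out of a PDF is typically littered with typographic ligature glyphs and words hyphenated across line breaks. The sketch below is not PDFDataExtractor itself, just an illustration of the kind of clean-up any PDF front end has to do before text mining; the example snippet is invented.

```python
def clean_pdf_text(raw):
    """Normalise text extracted from a PDF before text mining.

    Two common artefacts are typographic ligatures (e.g. the single
    'fi' glyph U+FB01) and words hyphenated across line breaks; HTML/XML
    sources suffer from neither, which is one reason markup is far
    easier to mine.
    """
    ligatures = {"\ufb01": "fi", "\ufb02": "fl", "\ufb00": "ff"}
    for glyph, ascii_form in ligatures.items():
        raw = raw.replace(glyph, ascii_form)
    # Re-join words split across lines: 'electro-\nlyte' -> 'electrolyte'
    raw = raw.replace("-\n", "")
    # Collapse remaining line breaks and runs of whitespace into spaces
    return " ".join(raw.split())

snippet = "The speci\ufb01c conductivity of the electro-\nlyte was measured."
print(clean_pdf_text(snippet))
# -> The specific conductivity of the electrolyte was measured.
```

Real supporting-information PDFs add further headaches (multi-column layouts, tables rendered as positioned glyphs), which is why even a dedicated PDF front end remains imperfect compared with markup extraction.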