This is a session on Utilizing Data in Computational Materials Discovery, something that nowadays is described with the catchphrase Big Data. I have put on the board a couple of examples of big data: two big pieces of information. For those of you who don't know much about Big Data but know a little bit of Latin, "data" is the plural of "datum", which is a given, a piece of information. So we are not going to be talking about big data in that sense; we're going to be talking about big data sets. As ever in science, technological advances motivated by other things have given us new opportunities. In this case, as you probably know, there are big warehouses being built in the desert of Nevada, housing lots of computers and lots of storage capacity. You can do the numbers yourselves: imagine of the order of a few billion people, each with of the order of several, quite a few, gigabytes of storage. That is an enormous technological change that has happened during the last years, most of it probably in the last decade. Most of it, of course, is photographs of cute babies taken by very proud parents or grandparents. But the technology itself is what concerns us here.
Also, within the effort being developed in those warehouses, there is lots of information about whatever each one of us is doing every day. And people are extracting information from that, learning useful things, some of them not so useful, from that amount of data. That has brought about a branch of science for the manipulation of large data sets, a brand new branch of science which is called statistics. I liked very much Gabor's quote yesterday about computational statistics. So we do have statistics, and we do have an enormous amount of computer power.
Computer power in the sense of high throughput is the kind related to that effort, together with quite a lot of managing of information. So we have to learn how to learn from that information. The other thing we can do, of course, is use that kind of technology to store our own kind of information. In the simulations we have been doing for decades, we normally wrote very large files to disks that used to be called scratch, because we would store a lot of information, extract what we needed for our particular science at that particular moment, and then erase most of it. So the idea came about that, with these new technological opportunities, we could store quite a lot of that information, and we produce a lot. Just think about the coefficients of the wave functions for a large system in a molecular dynamics simulation over a long time scale: immediately you have a huge amount of information. Storing that information systematically allows us to think that one could extract much more than the science that was extracted originally, when it was first calculated for a particular purpose. So you can repurpose that kind of information.
That of course brings challenges that relate to how to extract things from there, how to learn. On that side, we already saw a couple of very interesting talks yesterday, by Gábor Csányi and Michele Ceriotti, about how we learn from data. But we also have to face the challenges of the data themselves. Today we're going to see a first talk that essentially faces the question of how you characterize things in that data. Going back to my very useful example at the beginning, I was talking about photographs of cute babies. So how do you characterize the cuteness of the babies? You just have enormous amounts of bitmaps.
So you have to extract from there something which is not trivially extractable. You have to be able to describe what you want: descriptors. That is going to be the first talk. Everything I've said so far was about quantity, but whenever we work with something, we face the two sides of quantity and quality. That will be the second talk of today's session. So without more information, and after my brilliant presentation of big data, let me introduce today's first speaker. Shobhana Narasimhan is going to tell us about descriptors from small data: simple yet successful descriptors for self-assembly of organic molecules on surfaces.
Good morning. I would like to thank the organizers for inviting me to give this talk. I've called it descriptors from small data because nowadays, when you hear about descriptors, it's usually from big data. My descriptors are from very small data indeed. You will see how small. My collaborators are my student, Sukanya Ghosh; the experiments were done by George Thomas at IISER Trivandrum and his PhD student Pratap.
I guess you all know that one of the hardest problems in our field is that of structure prediction. For example, if you had two atoms, a yellow atom and a purple atom, and suppose their valences add up to an octet, then whether they would form the rock salt structure or the zinc blende structure or the wurtzite structure is very hard to predict. I mean, of course, we can do a DFT calculation and say which would be lowest in energy. But just looking at the atoms alone, can we see a pattern in which structures are lowest in energy? And the answer is, of course, no, it's very hard to see a pattern in that. I'll get back to that problem later in this talk. But what I want to talk about is the analogous problem for self-assembly.
So if I'm given two molecules, this yellow molecule and this violet molecule, I want to know: can I say something, looking at the individual molecules alone, about the geometry of the self-assembled architecture that they will form? Obviously, one is interested in that geometry. For example, here there are two pictures. In this one, the molecules have self-assembled so that there are cavities, whereas in this one, they have self-assembled so that they're tightly packed. And you would like to know which of these two will form, because in these cavities you can put other molecules, and here you can't. So this is an interesting question for applications, and I want to find descriptors for it.
I think most of you know what descriptors are, but I'll say a little bit about them anyway. They're some combination of physical properties of the system that correlates well with the property of interest. The important thing about them is that they should be very quick to compute. They should be faster to compute than either doing the ab initio calculation or doing the experiment. They may not be as accurate as either of those, but they generally help you narrow down the space in which you're looking for candidate materials, and so they save you a lot of time compared to doing the experiment or the DFT calculation. And how do you develop them? Initially, people developed them with physical intuition. Nowadays, increasingly, people develop them using machine learning techniques, for example. I asked my students for an example of a descriptor to give here, and one of them showed me this paper, which was published in the New England Journal of Medicine, which is, of course, a very prestigious journal.
It shows a correlation between two rather unexpected things: the number of Nobel laureates per capita in a country and the chocolate consumption in that country. And you see you have a rather good correlation between the two. Now, I show this for a reason, which is that when you have descriptors, you often find rather unexpected correlations. You see correlations between things where you really wonder why the hell those two things are correlated, and then you want to understand it. In this paper, which I actually went and read, the authors say that chocolates contain compounds called flavonoids, that flavonoids affect your intellectual capacity, your brain and your neurology, and that that's why these two are correlated. Of course, you can think of other explanations. For example, if a country is wealthy, then it can spend money on science, and it can also spend money on chocolate.
So, the systems and the methods. This is my data set, and I call it small data because it's really a very small data set. I have three host molecules, which I have chosen to represent as these bone shapes. They have phenylene ethynylene backbones, they have COOH groups at the termini, and then they have these alkoxy side chains. Now, the three of them differ in important ways. These two have the same backbone length; this one has a shorter backbone. These two have COOH groups at both ends; this one has one at only one end. These two have four alkoxy side chains; this one has only two. These are important differences between them. And then the guest molecules are naphthalene, phenanthrene, et cetera. The important thing about them is that I can look upon them as angular fragments of coronene. So this is a geometrical way of looking at them, which I think is useful.
So let's look at the host molecules first. The host molecules self-assemble into architectures by forming hydrogen bonds between the COOH groups at the termini. They can form two kinds of patterns; we know this from experiment. They can be either linear or hexagonal. This is an atomistic picture, and the red lines show how the hydrogen bonds are arranged. I'll show you that again. So this is one molecule, then the next molecule, then the next molecule, and that's called the linear pattern. And this is one molecule, and the next molecule, and the next molecule, and that's the hexagonal pattern. In both these patterns, there are cavities inside which the guest molecules can fit. So the first question I'm asking is: can I predict the relative energetics of the two patterns by looking in some way at just the isolated host molecule? That's the fundamental question I want to ask.
So how do we do it? This is a standard methodology when you're finding descriptors, except that it's usually done on much larger data sets. You have some experiments to guide you, you perform DFT calculations, you assemble a DFT database, and then you do some analysis. This analysis is some combination of intuition and regression; you do regression instead of machine learning because it's such a small set. You develop descriptors, and then, to check whether your descriptors are accurate, you predict something and you verify it. And these are my calculations. The DFT calculations are pretty standard. They were done with SIESTA, and the experiments were done with scanning tunneling microscopy. Using the DFT database, I calculate free energies for the hex patterns and the lin patterns. I need chemical potentials because the ratio of host to guest molecules is different in the two patterns. So let me show you some results.
The important thing is that, because of these subtle differences between the three host molecules (the length of the chain, the number of head groups they have, etc.), they form different patterns when they're deposited on graphene. The first molecule forms the hex pattern. This is the experimental STM image, and this is a simulated STM image. The third molecule forms the lin pattern, the linear pattern. And the second molecule doesn't form an ordered pattern at all; it forms what is a glassy pattern. Then, when I introduce the guest molecules, for example coronene, in some cases the pattern remains basically unaltered. This is in the absence of the guest, and when I introduce coronene, you can see it just goes and sits in these cavities, so you have these blobs here. In other cases, the pattern changes drastically on introducing the guest. This is PE4B. It forms the linear pattern in the absence of the guest, and when I introduce the guest, you can see the symmetry changes drastically and it goes from a linear pattern to a hexagonal pattern.
So, to summarize what happens: when I have just the host molecules, one forms the hex, one forms the glass, and one forms the lin. I can do DFT calculations to check the difference in free energy between the two patterns. A negative number means the hex is favored, a positive number means the lin is favored, and indeed that's what DFT gives you. If the difference between the two is small, there's a competition between the two phases, and that's why a glassy phase forms. And here is what happens when you introduce the guests. For this molecule, it stays hex. For this one, initially it is glass, glass, and then you have a disorder-to-order transition and it becomes hex. For this one, it's linear, linear, linear, and then it becomes hex, hex, hex. And we can reproduce all of this with DFT.
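The sign convention just described can be written down as a short sketch. This is only an illustration: the function name `favored_phase` and the tolerance `glass_tol` are hypothetical, since the talk does not say how small the free-energy difference must be before the glassy phase appears.

```python
def favored_phase(delta_f, glass_tol=0.05):
    """Classify the pattern from delta_f = F_hex - F_lin.

    Convention from the talk: negative favors hex, positive favors lin,
    and a small difference means competing phases, i.e. a glass.
    glass_tol is a hypothetical threshold in the same energy units
    as delta_f; the talk gives no numerical value for it.
    """
    if abs(delta_f) < glass_tol:
        return "glass"
    return "hex" if delta_f < 0 else "lin"
```

In practice the threshold would have to be chosen by comparing computed differences against the experimentally observed patterns.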
And in every case, experiment and theory agree about which phase should be favored. We can also look at these structures. We can look at the charge densities. These are maps of how the charge density is redistributed upon introducing the guest into the cavities for one of the molecules, which is PE4A. A red blob means there's an accumulation of electron density; a blue blob means there's a depletion. And every one of these lines of alternating red and blue represents a hydrogen bond. So you can see clearly that hydrogen bonds form between the host and the guest, and depending on which guest I've introduced, I have four bonds, six bonds, 12 bonds, etc. I can also easily calculate the energies of those bonds. I get a nice linear graph, and from it I have 0.082 electron volts per bond. So that is all fairly standard stuff.
Now comes the part of what we're doing that is a little bit different. We want to find a descriptor that will tell us whether we will have the hex or the lin. This is the descriptor just for the host molecules. And the three hosts differed, I told you, in the number of COOH groups they had, the number of alkoxy side chains they had, and the length of the molecule. So we look for a descriptor which has this form. It's a slightly non-intuitive form, and I give full credit to my student Sukanya for coming up with it. It has the form: number of COOH groups times the length of the molecule, divided by one plus the number of alkoxy side chains. And there's a power alpha here and a power beta here, which are unknown and remain to be determined. To get the alpha and the beta, you do a regression. What's plotted here is the DFT energy difference between the hex and the lin versus the value of this host descriptor. And the power alpha turned out to be one.
And the power beta is found to be one-eighth, 0.125. When you do this, you get a nice linear regression. Okay, so far maybe it seems like, what's the big deal? Is this useful? What does it say? It says that if this quantity is positive, you are in a lin structure. If it is negative, you're in a hex structure. If you're close to the boundary, you're likely to have a glassy structure. If you're well inside one of these regions, you're likely to be in the lin or the hex.
So let's see if we can predict anything. Now we look at other molecules which were not part of our initial database. We take four test molecules, and they also have features that are different from our original ones. For example, this one has no head groups at all, and these two have no side chains at all. We just apply this formula blindly. It takes a few seconds to evaluate this expression for each of these molecules. We do that, and we end up with these four stars. Based on where these stars are positioned, we can predict that this one will be a glass, this one will be a hex, these two will be lin, etc. Then you look at the experiments. We could only find experimental data for three of them, but those predictions are indeed correct. This one is a prediction which remains to be verified.
But we can do even better than that. We can do DFT on these structures and actually get the energy difference between the hex and the lin. These yellow squares are the results from DFT for that energy difference. This one, because it doesn't have any head groups, can't form the hex at all, so we can't get an energy difference; it can only form the lin. And you see that the stars, which are predicted from the descriptor, and the yellow squares from DFT fall almost exactly on each other. So this descriptor works. So now what about the guests?
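As a rough illustration of the host descriptor and the regression described above, here is a minimal sketch. The spoken description leaves the exact placement of the exponents open; this sketch assumes alpha acts on the numerator and beta on the denominator. The function names are hypothetical, and no data from the talk are reproduced here.

```python
def host_descriptor(n_cooh, length, n_side_chains, alpha=1.0, beta=0.125):
    """Host descriptor: (N_COOH * L)**alpha / (1 + N_side)**beta.

    Assumption: alpha applies to the numerator and beta to the
    denominator; the defaults are the fitted values quoted in the talk.
    """
    return (n_cooh * length) ** alpha / (1 + n_side_chains) ** beta


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_delta_e(descriptor, slope, intercept):
    """Predicted hex-minus-lin energy difference for a new host."""
    return slope * descriptor + intercept
```

One would fit the line on the descriptor values and DFT energy differences of the training hosts, then evaluate `predict_delta_e` for a new molecule and read off the sign: negative suggests hex, positive suggests lin, and values near zero suggest a glass.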
For the guests, we have a geometrical descriptor. You look at the guests, and you consider the polygon formed by the hydrogen atoms on the periphery of each guest molecule. You draw this polygon, then you draw a circle through it, and you look at the maximum number of vertices that lie on this circle. That number is just the guest descriptor. So in these cases, it is 4, 6, 8, 10, 12. That number alone is the guest descriptor. Now you can plot the same energy differences versus the guest descriptor. The black line is the free energy for the hex, and the blue line is for the lin. And again you can see that if the hex is favored, the black line is below. If the lin is favored, the blue line is below. And if glass is favored, these two lines lie very close to each other. The only thing is that you have some sudden jumps. That is because in certain structures, because of steric hindrance, you have a phase-segregated form.
Now you do what is standard when you work with descriptors: you plot a phase diagram in descriptor space. You have a host descriptor and a guest descriptor, and you have a phase diagram which tells you that if you are in this blue area, you have a linear structure, and if you're in the gray area, you have a hex structure. The circles are the 18 host-guest combinations we've considered, and the colors represent the DFT results for the difference between the hex and the lin. You see a nice color progression from here, where the lin is strongly favored, to here, where the hex is strongly favored. This is very standard when you're computing descriptors: you have a clustering of properties in descriptor space, and when you have that, you know that you've found the right descriptors. So now, if you have any other combination of host and guest molecules, you just have to compute the host descriptor and the guest descriptor.
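The guest descriptor described above can be sketched as a small geometric routine. This is one possible reading, not necessarily how it was evaluated in the work: it assumes planar (x, y) coordinates for the peripheral hydrogen atoms, uses a brute-force search over all triples of points, and the tolerance `tol` is a hypothetical parameter.

```python
from itertools import combinations
from math import hypot


def circumcircle(p, q, r):
    """Return (center, radius) of the circle through three 2D points,
    or None if the points are (nearly) collinear."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if abs(d) < 1e-12:
        return None
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ux = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    uy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return (ux, uy), hypot(ax - ux, ay - uy)


def guest_descriptor(h_positions, tol=1e-3):
    """Maximum number of peripheral H positions lying on a common circle.

    Every circle through three of the points is tried, and the points
    within tol of that circle are counted.
    """
    best = min(len(h_positions), 2)  # any two points share some circle
    for p, q, r in combinations(h_positions, 3):
        circle = circumcircle(p, q, r)
        if circle is None:
            continue
        (ux, uy), radius = circle
        count = sum(
            1 for (x, y) in h_positions
            if abs(hypot(x - ux, y - uy) - radius) <= tol
        )
        best = max(best, count)
    return best
```

The cost grows as the fourth power of the number of atoms, which is negligible for molecules of this size. With real coordinates the tolerance would need to be loosened to absorb small deviations from ideal geometry.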
Then you see where it falls on this phase diagram, and you know which structure you will have.
That's all very well, but it's a little dissatisfying, because you may think it's all just numerology. What does it mean? I went to a talk last week by a famous mathematician, and he defined mathematics as the art of finding patterns and then explaining why those patterns exist. So I have found the pattern, but I haven't told you why it exists. Can I say anything at all about why the pattern exists? Unfortunately, the answer is mostly no. This is very typical of the descriptors that are emerging, especially those emerging from machine learning.
This goes back to the first problem I told you about: whether, for octet compounds, the rock salt structure is favored, the wurtzite structure is favored, or the zinc blende structure is favored. This is a paper by Luca Ghiringhelli and Matthias Scheffler where they looked at many, many octet compounds and then applied machine learning to find the descriptors. The colors tell you the difference in energy between competing phases. And you see you do have a separation in descriptor space. But the descriptors that come out are really weird. This one is the radius of the s orbital of the A element minus the radius of the p orbital of the B element, times the exponential of the s orbital radius of the A element. This one is the ionization potential of B minus the electron affinity of B, divided by the p orbital radius of A squared. So these are really bizarre descriptors, which you can't easily give physical or chemical interpretations to.
So I do want to jump way back in history. I'm sure you all recognize what this is: Mendeleev's periodic table from 1871. We may not think of it that way, but what he basically did was find descriptors. The rows and columns in his periodic table were basically descriptors. He found patterns, and he didn't have explanations for them.
It was only in 1904 that the electron was discovered, and then there's the Rutherford model and the Bohr model. Only then did his descriptors make sense. So I'm not claiming that I've found something like the Mendeleev periodic table. But it is possible that at some point we will understand these weird descriptors we're finding. Why do I get this one-eighth power, and things like that? I do have some understanding of the guest descriptor. In the hex phase, the guest descriptor is the number of hydrogen bonds that are formed between the host and the guest. But in the lin, it is really weird looking. Even there it's not very clear that it is the number of hydrogen bonds, and they're certainly not identical bonds, whereas here there are clearly six nice identical hydrogen bonds.
So I'm done. I just want to tell you that, for the first time, we've succeeded in identifying descriptors for the self-assembly of molecules on surfaces. These descriptors can be computed at zero computational cost, as they depend only on the geometry and form of the isolated host and guest molecules in the gas phase. I do want to say, of course, that what we computed is probably valid for a limited class of molecules and would have to be generalized for other kinds of molecules. Thank you.