Welcome to What's Next, a seminar series presented by IBM Research, in which we spend time with some of our scientists and researchers learning about the exciting work they do. I'm Shaheen Parks. I'm a content strategist here at IBM. And I'm so pleased today to be joined by not one, but two guests. Next to me I have Dr. Thomas Ash from Evonik Industries.

Thanks for having me.

Glad you could join us. And we have Dr. Jie Chen, who is part of the MIT-IBM Watson AI Lab, based here in Cambridge. Today, Dr. Chen will be talking with us about using graphs and grammars to automate drug and material discovery. One of the key benefits of the methods he'll talk about today is that they require a very small amount of data to be effective. Other methods that you might use to achieve similar results would need orders of magnitude more training examples to reach even similar levels of effectiveness. Dr. Chen has degrees in computer science and mathematics, and his research interests include statistics, machine learning, and scientific computing. Dr. Ash has a background in chemistry and materials simulation. At Evonik, he is responsible for data science in the research and development organization, and he's part of the effort to move toward the digitalization of R&D. Today, Dr. Ash will provide a complementary industry perspective on the research work that Dr. Chen will talk about. With that, I'll hand it over to Jie.

Thank you, Shaheen. Hello. I'm very excited to be here to talk about recent joint work with MIT that generates molecules using a very small amount of data. This is an exciting innovation using graphs and grammars, and I'm going to talk about the details. Drug design and material discovery amount to generating a good candidate set of molecules so that we can screen them and find the ones that we need. So generating molecules is an important step. In the past decade, deep learning and AI methods have been flourishing in this regard. These methods, called generative models, are able to learn from existing molecules and generate new ones. A general limitation of these methods is that they require a large amount of training data. The method that I'm talking about today uses a very small amount of data; it is very data efficient. For example, with only a dozen training examples it can already generate molecules whose quality is as good as that of deep learning methods, which may require 100,000 training examples. In this method, we view a molecule as a graph and then learn a graph grammar to do the job.

If you think about getting molecules that are of good use to us, this is like finding a needle in a haystack, because the space of all possible molecules is vast. Such a space, according to certain estimates, would contain 10 to the 60 molecules or so. And if we talk about other applications, like materials and formulations, then the space is even larger. So the approach we are taking here is like a circle. We start with training data: molecules, represented here as graphs. Then we learn a graph grammar that can generate these molecules. The graph grammar serves two purposes. One purpose is that it is a generative model that generates new molecules.
But the other use of it is that it also gives us a very useful tool to get representations of a graph, so that we can use them to make better computational predictions of molecular properties. Now, predicting these properties is also an important step here, because otherwise the work has to be done by a human being who goes to the lab and runs tests or simulations. Computational predictions speed up this process. Furthermore, with computational predictions of the properties, we can apply numerical optimization to find the regions of molecules that are good candidates. This reduces the number of candidates that we need to screen when we look for the ones that we need. Our work here focuses on the first step, which is to learn the graph grammar from a small set of molecules.

Yeah, actually, that is pretty interesting, Jie, because you mentioned the entire space of small molecules is like 10 to the 60, something like that. But if you look into the chemical industry, we are very limited in what we would actually like to do on the large-scale assets that we have. So we need to focus, already at the beginning, on generating those molecules that are of interest to us, and we definitely see that these methods can be helpful there.

Exactly. So the grammar we are using here is a genuine innovation on top of deep learning, because it is a tool for us to narrow down the search space and find the good things that we need. And I think grammar is no stranger to us. When we learn a new language, we learn a grammar, and when we learn to write a program, we also learn a grammar that specifies how a program should be formed. Effectively, a grammar is a set of rules that tells us how to compose a complicated object, be it a sentence or a computer program. It turns out that graphs, which are the things we use to model molecules, also have grammars, and that is the gist of our work.

So when you say that the grammar is essentially a set of rules, is that similar to how we might describe an algorithm?

Yeah, exactly. A program contains a lot of statements, and these statements come in a certain order, which forms the algorithm that tells us: if we do the first step and then the next step, then eventually we complete the job. The grammar here also provides us all these elements, and we just need to figure out what the steps are in order to compose the graph that we need.

And actually, if you look into the education of chemists, it's basically the same. You learn a new language, but that language is not meant for communicating with people in general; it's meant for communicating with scientists. You learn chemistry basically as another language, which is foreign in the beginning, but once you get, let's say, a better feeling for it, you are somehow fluent in chemistry and you can actually speak that language basically the same way you speak English.

So the chemist is learning the grammar.

Exactly. Yeah.

I love that metaphor. Chemistry is also a language, so let's learn it. All right, so how do we learn it? First of all, we look at a molecule as a graph. The atoms here are the graph nodes, and the bonds are the graph edges. Now, of course, molecules are a little bit special, because there are certain structures that we definitely want to capture. For example, there's a ring here.
So to generalize the edges, which contain only bonds, we also use so-called hyperedges to capture things like rings. These hyperedges are a generalization of edges: an ordinary edge contains only two nodes, while a hyperedge can contain more. But for simplicity, I'll keep using the terminology of edges while describing this work.

So now we know that we'll be looking at a molecule using this graph representation. The next step is: what exactly is the grammar? Here I show an example of a grammar, which contains four production rules. The first rule starts from an empty state X, which represents nothing, and replaces this empty state with a ring. Now, inside this ring, there are some nodes that we know: for example, these blue nodes are specific atoms. But we also have open nodes, the R-star, the little black squares here. These are open nodes; we don't know what they are yet, but they can be replaced, if there is a match, by applying some other production rule. As you can see, for all of the production rules, the left-hand side is something that contains the open node R-star, and the right-hand side says that we replace it by adding nodes and edges. This is exactly the way we compose a graph: we iteratively add nodes and edges, and eventually we have the entire object.

So let me give an example. We start from the empty state X and simply apply applicable rules to expand the graph until there are no open nodes left. In the first step, as you saw on the previous slide, we use the first rule to replace the empty state with a ring. Afterwards, we see that there are two open nodes in the ring, so we apply another rule that replaces these two open nodes with another ring. Now we have two rings, but inside the rings there are still some open nodes. So we keep applying applicable rules to expand the graph, multiple times, until eventually there are no open nodes left. That is the graph that we want.

So, Jie, I imagine this is a very simplified example with four production rules. How many production rules would you expect in general, in a real example?

That actually depends entirely on the training data set. For example, for this particular case we are using a training data set of isocyanates, and there are only a dozen molecules. From there we can learn a grammar whose production rules generate all these isocyanates, and the number of rules is roughly five or ten. So it really depends on what you learn from.

And maybe just jumping in on that. In the first step, you basically come from an empty set, and then you have some scaffold of a molecule. So I expect that there are more of these rules, right? It's not just one that can generate something from an empty set, but more of those?

Right. So the rule that starts from the initial state is actually a very special rule, and it depends on the training set and on what we eventually learn. It is possible that there is only one rule that starts specifically from the start state; in that case, it basically says that every time, I'm going to start with this ring here. But it is also possible that the grammar we learn contains multiple rules with the start state. In that case, we have to choose which rule to use when we generate molecules.
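To make this graph view concrete, here is a minimal sketch, assuming RDKit and NetworkX as tools (the talk does not name any): atoms become labeled nodes, bonds become edges, and each ring is collected as a hyperedge, a set that can hold more than two nodes.

```python
import networkx as nx
from rdkit import Chem

def mol_to_graph(smiles):
    """Molecule as a graph: atoms are nodes, bonds are edges,
    and rings are kept as hyperedges (sets of atom indices)."""
    mol = Chem.MolFromSmiles(smiles)
    g = nx.Graph()
    for atom in mol.GetAtoms():
        g.add_node(atom.GetIdx(), symbol=atom.GetSymbol())
    for bond in mol.GetBonds():
        g.add_edge(bond.GetBeginAtomIdx(), bond.GetEndAtomIdx(),
                   order=bond.GetBondTypeAsDouble())
    # Each ring becomes a hyperedge connecting more than two nodes.
    hyperedges = [frozenset(ring) for ring in Chem.GetSymmSSSR(mol)]
    return g, hyperedges

g, hyperedges = mol_to_graph("c1ccccc1O")   # phenol, as a small example
print(g.number_of_nodes(), g.number_of_edges(), hyperedges)
# 7 nodes, 7 edges, and one hyperedge for the six-atom aromatic ring
```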
So there is a very natural question: how do we get these rules? How do we get a grammar? The way to get this grammar is through reverse thinking. Rather than starting from the empty state and arriving at the final graph, we start from the complete graph and strip off the graph, component by component, until eventually nothing remains. For example, what you see here is that we start with a molecule, and in the first step we remove these four nodes and replace them with a single open node R-star. Because of this reverse thinking, the production rule says the opposite: we replace the open node R-star with these four nodes connected by the edges. So every time we strip a component off the graph, we make a production rule, and eventually we make enough production rules that nothing remains. This is how we can get a grammar which, if applied in reverse, gives us back the graph.

Yeah, it's actually amazing, because what you can see here, and this is also an example that fits the data set that we showed, is that it has rules that generate real chemistry: chemical functions, or chemical entities that have a function.

Exactly. Later on I'm going to show you some specific experimental results, and there we can see that certain functional groups for a certain class of molecules actually appear in the production rules that we learned.

Right. So now we have talked about how we learn production rules. But we still have an open question, because the choice of which components to strip off, and the order of stripping them, is arbitrary. So what would be the best production rules, or the best grammar? This is where we inject machine learning, so that we can get an optimal grammar. The way we do this is to impose probabilities on the edges, because these probabilities determine which components are formed and which components are removed. Then we run an optimization loop. We start with an existing grammar, that is, the current production rules, use the grammar to generate a large set of molecules, and measure how good these molecules are. There are multiple indicators: for example, we can measure how many of them can actually be synthesized, or whether they are diverse enough to cover the entire space. Whatever the indicators are, we use them to go back and update to better edge weights. These edge weights give us another set of grammar rules, and we iterate until the optimization converges. When it converges, the grammar we get is one that generates molecules that are maximally synthesizable and also maximally diverse.
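The reverse thinking just described can be illustrated with a deliberately simplified toy, not the authors' actual algorithm: strip one lowest-degree node at a time off a small labeled graph, record a production rule for each removal, and replay the rules in reverse to rebuild the graph. The real method strips larger components, including rings, in an order governed by learned edge probabilities; the rule format and the ethanol example below are hypothetical stand-ins.

```python
import networkx as nx

# Toy molecular graph: the heavy atoms of ethanol, C-C-O.
G = nx.Graph()
G.add_nodes_from([(0, {"atom": "C"}), (1, {"atom": "C"}), (2, {"atom": "O"})])
G.add_edges_from([(0, 1), (1, 2)])

def extract_rules(graph):
    """Strip one lowest-degree node at a time, recording a rule per removal.

    Each rule says: an open node attached at `anchors` may be replaced by
    a node labeled `atom`. A toy stand-in for component-wise extraction."""
    g = graph.copy()
    rules = []
    while g.number_of_nodes() > 0:
        node = min(g.nodes, key=g.degree)          # peripheral node first
        anchors = sorted(g.neighbors(node))        # where it was attached
        rules.append({"atom": g.nodes[node]["atom"],
                      "node": node, "anchors": anchors})
        g.remove_node(node)
    return rules

def replay(rules):
    """Apply the recorded rules in reverse order to rebuild the graph."""
    g = nx.Graph()
    for rule in reversed(rules):
        g.add_node(rule["node"], atom=rule["atom"])
        for a in rule["anchors"]:
            g.add_edge(rule["node"], a)
    return g

rules = extract_rules(G)
assert nx.is_isomorphic(G, replay(rules))   # the grammar regenerates G
print(rules)
```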
That's actually great, and I want to jump in on that, because generating molecules is easy, right? I mean, you can just draw one on a piece of paper. But generating molecules that are actually synthesizable, that is the challenging task, because you do not want to spend hours in the lab only to eventually find out that you probably cannot make that molecule. You want molecules in your set that are predicted to be makeable in the end, because it doesn't matter if a hypothetical molecule has brilliant properties when you cannot make it; then it's just something you can throw away right away, right?

Exactly, and I think this is also one of the great things about this method. If you look at existing methods in the literature, many of them, when they learn the generative model, only consider how well you can reconstruct the original training examples. We, on the other hand, do not need to consider how well we can generate the existing training examples, because the grammar itself guarantees that. So in addition we look at other indicators, and one very important one is synthesizability. We use that as an indicator and optimize our model so that eventually we generate as many molecules as possible that are synthesizable.

So maybe let's look at some experimental results; I hope I can convince you further that we're doing a great job. What I'm showing here is a set of methods to compare against and a set of metrics to compare on. The bottom row, DEG, is our method, which stands for Data-Efficient Grammar. On top are some state-of-the-art AI methods used for generating molecules; many of them are in the style of encoding a molecule into a continuous latent space and decoding it back to the original molecule. When we compare them, we look at various metrics. For example, how many of the generated molecules can be synthesized, which is definitely one thing we focus on. But additionally we look at other metrics. For example, existing generative methods treat the training set as a distribution: when they build a model, they want to generate data whose distribution is aligned with the original distribution. So we have four metrics that measure how well the distributions are aligned. Then, unsurprisingly, we also have a metric that tells us what proportion of the generated molecules can actually be synthesized. And one more interesting metric is called membership. Membership applies when the training set consists of a specific class of molecules: when we generate, do the generated molecules also fall under this class? I think this is one of the rarer things to compare when evaluating methods like this. For all these metrics, the higher the better. You'll see that for many of the metrics we achieve the best result, and even for the one where we are possibly second, we are still close to the best. This is for the isocyanate class. Not shown here are some other training sets and some other classes, where the results show similar findings.

It's good to see that in action, because, I mean, you briefly touched on it: there are other methods that can generate valid molecules, where you get a molecule that actually is a molecule and does not have, say, a carbon atom with five bonds attached to it, or something like that.
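As a concrete illustration of the validity metric mentioned here, a minimal sketch assuming RDKit (the talk does not specify a toolkit): RDKit refuses to sanitize a SMILES string with a pentavalent carbon, so counting successful parses gives the fraction of valid molecules. The sample strings are hypothetical.

```python
from rdkit import Chem

# Hypothetical generated samples: ethanol, benzene, and an invalid
# SMILES with a pentavalent carbon (five single bonds on one atom).
generated = ["CCO", "c1ccccc1", "C(C)(C)(C)(C)C"]

def validity(smiles_list):
    """Fraction of SMILES that RDKit can parse and sanitize."""
    valid = [s for s in smiles_list if Chem.MolFromSmiles(s) is not None]
    return len(valid) / len(smiles_list)

print(validity(generated))  # 2 of 3 parse -> ~0.67
```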
And in that region we see that there are methods that are probably doing equally well as we do, right? 100% valid molecules. Brilliant, good to go. But for the interesting things that you touched on at the end, the synthesizability and the membership, we can see that this method is actually outstanding there, because of this RS value of 27%.

That RS is actually one of the tools that we use to measure the synthesizability of the molecule.

Exactly, good that you have that. I mean, it's at the bottom, right? Synthesizability. And that's something that we want to have, as I already mentioned: it should be synthesizable. But many of the challenges that we face are also in the last column that you have here: the molecules should be of a similar chemistry. Usually, if you have a certain chemistry in your company, you do not want to generate arbitrary molecules that somebody could eventually make; you want to make molecules that you can make and that fit your portfolio, right? And it's very nice to see that we can have membership scores that are amazing.

Yeah, it's just like: give me some isocyanates and I'll generate isocyanates for you, but not something else.

Exactly.

I do have a question for you. What do you think it is about this method that leads to these strong results, that's different from some of the alternatives listed here?

I would say one important thing is the data efficiency, as I mentioned at the very beginning. A lot of the methods here, the latent-space models I mentioned, are deep learning methods that require a large amount of training data. As a matter of fact, for this particular data set of isocyanates, there are only about a dozen training examples, probably less than a dozen, and that's simply because human beings don't know more than that. So when we try to work with these data sets, it is very natural that existing methods, which can only work with large data sets, would likely fail.

That makes a lot of sense.

Perfect. So maybe let me also show you some examples, more than just numbers. What I'm showing here is that for different data sets we can generate new examples; all of these are newly generated molecules. And Thomas, you are the expert here: is it possible that some of them you actually haven't seen before? That would really say we are generating something that you could be curious about.

Well, you can always draw them, right? But the interesting thing is mainly that you put in such a small amount of data, only 12 examples of a particular molecular class, and you are actually able to learn the language of that particular data set. So what I love to see here is that you actually generate isocyanates if you throw in isocyanates, and that doesn't have to be the case if you use some other method. I mean, you can basically draw each and every molecule; that's what I meant when I said it's easy to just draw them on a piece of paper. But these seem, I would say, at least reasonable: you have the functional group that you are expecting, and you have something carbon-based in the middle, right?
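The membership metric can be illustrated the same way, again as a sketch assuming RDKit, with hypothetical sample molecules: a SMARTS substructure query for the isocyanate group N=C=O checks whether each generated molecule falls under the training class.

```python
from rdkit import Chem

# SMARTS pattern for the isocyanate functional group: N=C=O.
ISOCYANATE = Chem.MolFromSmarts("N=C=O")

def membership(smiles_list, pattern=ISOCYANATE):
    """Fraction of parseable molecules containing the class's functional
    group; a toy stand-in for the membership metric discussed above."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    mols = [m for m in mols if m is not None]
    hits = sum(m.HasSubstructMatch(pattern) for m in mols)
    return hits / len(mols)

# Two isocyanates and one plain alkane -> membership of 2/3.
print(membership(["O=C=NCCCCCCN=C=O", "CC(C)N=C=O", "CCCC"]))
```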
Speaking of isocyanates, let's look at what exactly these production rules are, because we're all curious: why exactly do we generate isocyanates? Now, we played with three data sets here, and we show some examples of the production rules that we learned from the training sets. If you look at, for example, the top row, which is for isocyanates, in the first rule and the third rule you explicitly see the functional group that contains the nitrogen, the carbon, and the oxygen connected by double bonds. These are exactly the things we expect to see for isocyanates. And how did we get these production rules? We learned them, and not by someone telling us that, hey, you really need an N, a C, and an O connected by double bonds; we really just looked at the training data sets, and the algorithm figured it out.

Yeah, and if I look into the rules that you generated, just for the isocyanates as an example, it's also nice that you don't get any other functionalities. Because if I wanted to have an ester or an acrylate or something like that, I would put it in my data set, and then I could generate those. But I only put in isocyanates, so I'm basically generating only isocyanates, and that's actually pretty amazing.

I'm glad that you like it. Yeah, that is pretty much the work that we've been doing recently. One thing that excites me about it is that the method we developed makes very innovative use of graphs and grammars, and this allows us to train a model using only a very limited amount of data, so it is super data efficient. We hope that this method will significantly speed up the process of identifying useful molecules and materials, because in real life we just don't have that many molecules to start with.

Absolutely. And so, Thomas, I want to ask you that question: in real life, do these methods support the work that you're doing?

Yeah, hopefully. Traditionally, if you generate molecules, you always rely somehow on the creativity of a chemist, and we definitely want to embrace that creativity that a chemist has, but we want to support it with some kind of, I would say, artificial creativity. So that's what we do, yeah. We learn a language, the language of a particular data set, and we can give our chemists pictures of new molecules that they probably never thought about. And if we then add the property prediction part that you also mentioned at the beginning, it becomes a very powerful tool to quickly identify which routes I can actually navigate in my chemical space that would give me properties superior to what I already know.

So it sounds like both are really valuable pieces of the process: the generation, but also the property prediction?

Yeah, definitely, definitely. And it also comes in handy that this method is pretty data efficient, so we can in principle also describe the molecules in a very efficient way, which gives us probably a little more breathing room to do the property prediction on very limited training data. Because if you have data for 12 molecules, that's not much; and if you look into the chemical industry, it often happens that you do not have billions of molecules but more like tens to hundreds for which you can actually effectively generate data.

Right. And is that a question of being able to generate data, or actually even just of having knowledge of the molecules existing?
Well, what we always have to keep in mind is that generating data in chemistry is usually not like pushing a button and getting a result for a molecule. You have to think about how to make that molecule; then somebody has to go out and probably perform a multi-step reaction to make it, and then perform some other reaction on it to get, for example, polymeric data. And that can actually be quite involved, so you also have to keep the cost in mind, because it's very costly for us to generate knowledge on molecules.

I think this real-world perspective is so helpful, because a lot of the work we do here uses publicly available data sets, or images, or words: things that, unlike molecules, are maybe a little easier to come by. So while we don't have infinite data, we certainly often have more than 12 examples. Having the perspective that for this type of use case there's a real, natural bound on the amount of data we'll ever have really highlights the importance of having methods that can address that.

Yeah, that's definitely true. I mean, it's easy to take a picture, and if you train something on cat pictures, you can just take more cat pictures. But it's probably a little bit harder to generate data on molecules, so that's definitely true.

Yeah, absolutely. As I mentioned at the beginning, our next step will be looking at how we can predict the properties using a better method, getting back to this data-efficient paradigm. Data efficiency is not only a benefit of our method when we generate molecules; it is also important when we are trying to develop a prediction model for the properties. That is the next thing on our agenda, and we will be very excited to see how we can use this grammar, particularly the production rules and the order in which they are applied, as a good representation of the molecules when we build these predictors.

Yeah, I'd love to see that. Well, on that note, I want to thank both of you for taking the time to talk with us today, and I also want to thank our audience for listening in. It's been great, and we will be back again soon, so stay tuned and we'll see you all soon.