All right. Hello, everybody. I'm Yuanqing Wang from the Chodera Lab at Memorial Sloan Kettering Cancer Center in New York City. Today it's an absolute pleasure to share some of the findings and understandings we discovered in this little project, GraphNets for learning molecular physics. This work was done jointly by me, Josh Fass, Chaya Stern, and John Chodera in this group, and we also got a little help from a very, very good friend, Colin Law, at a small startup company in Shenzhen, China.

Just to give you a very short summary in case you are really, really busy: a GraphNet is a network that operates on the topological space of molecules, and it consists of three update and three aggregation functions. This network structure is capable of predicting per-molecule, per-atom, and per-bond attributes, and also force field parameters, in case you're interested in that.

So let's start with a brief introduction of what a graph is and what a GraphNet is. Here I'm going to define a graph as a set of sets: the set of edges; the set of vertices, or nodes, because "vertices" is such an awkward word to say; and the universal attributes, which form a sort of master node, the node you put on the graph as a whole entity. For proteins and small molecules, we model them as undirected, node-, edge-, and graph-attributed, unlabeled graphs. It has to be undirected because chemical bonds have a mirror symmetry, so they cannot be directed. With a few exceptions, I guess, where some kinds of complex bonds could be modeled as directed edges; but generally speaking, for a simple graph, we model the bonds as undirected edges. The other choices are just design choices.

Once you know the definition of a graph, a GraphNet is simply a set of functions that propagates information back and forth between the nodes, the bonds, and the graph as a whole, the master node, so to speak. The three functions on the left are update functions; the three functions on the right are aggregate functions. The update functions update each kind of entity at each time step. For example, phi^e is the update function for edges, phi^v is the update function for vertices or nodes, and phi^u is the update function for the master node. So phi^e takes e_k together with v_{r_k} and v_{s_k}; the notation stands for receiver and sender, but since we don't have a directed graph, these two are interchangeable. It takes the two vertices the edge is connecting, plus the master node, and updates the edge. The update function for vertices takes the aggregated edges, the hidden state of the vertex itself, and the master node. And it's pretty much the same for the global update function. The aggregate functions basically say: you have a set of entities, and you use a sort of smashing function to push it down to a summary of those entities. One necessary property of these functions is that they take a set of elements of a certain dimension and return something of that same dimension.

We can also include hyperedges. Inspired by MD calculations, for example, you can have a hyperedge that connects three atoms, which is an angle, or four atoms, which is a dihedral angle, and it updates and aggregates in a similar fashion. Another very interesting thing to do is pairwise attention: you can model the interaction between two arbitrary nodes by using an attention mechanism.
The attention mechanism seems complicated, but it's really just a linear transformation of your input, another linear transformation of your input, you transpose one of them, and then you do a matrix multiplication of the two. So each element is able to attend to, to see, and to interact with every other element. Google is using this sort of model in their translation apps, and there's a fun fact that it's super expensive to train and costs a lot of CO2 emissions.

If you take a look at this, you will realize that the attention mechanism is permutation equivariant, meaning that if you apply a permutation to your input, you end up with the corresponding permutation of the output matrix. And for a translation model, because you smash the large matrix back down onto a single 1D vector, the input and output should be equivariant end to end. So we can use this little property to test whether Google is actually using only the attention mechanism. If you pick a word that is symmetrical in a given language and translate it with Google's translation app, then whatever word you get in the other language should also be symmetrical. So let's test it. Take whatever Latin-alphabet language and translate the word "bling-bling," which is a symmetrical word: in any of them, Spanish, French, Italian, it's the same. If you go to Korean, it's symmetrical. This is Russian; it's still symmetrical. The same with Japanese; these are syllables, I don't know if anybody here speaks Japanese. But if you go to Chinese, this is not symmetrical. So the hypothesis is proven incorrect, and they are not using only attention mechanisms, although they claim "attention is all you need." Apparently it is not all you need.

Actually, quick question: if you go back to the previous slide, is that from a paper, this hypothesis about Google? No, it's just a test that I did. I mean, then it's not really Google claiming anything, right? Oh, "Attention Is All You Need" is the title of the paper that Google released. Oh, that's what I was asking. Yeah. Let's skip this bit.
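To make the permutation-equivariance claim concrete, here is a minimal numpy sketch (illustrative stand-in code, not anything from the slides): permuting the rows of the input to dot-product self-attention permutes the output rows in exactly the same way.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head dot-product self-attention over a sequence x of shape (n, d)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])          # (n, n): everyone attends to everyone
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = rng.permutation(n)
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)
assert np.allclose(out[perm], out_perm)   # permuted input -> identically permuted output
```

This is the property the palindrome test above relies on: if the whole model were built from such blocks, plus permutation-equivariant readouts, a symmetric input word would have to map to a symmetric output.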
So the beauty of GraphNets is that a graph here plays a double role: it is both the problem you are trying to solve and the space on which you're solving that problem. We'll talk about this a little later. The most common usage of GraphNets is to model a social network. Say you have a little gang of three friends here, and you want to model their interactions. The graph notation would be the set of all the vertices, the set of all the edges, and a global attribute that belongs to the graph as a whole. The first step is to update the edges, because normally in a social network, so to speak, the relationship, the friendship or bond between two people, is defined by those two people; you do not have a relationship first and then get two individuals as a result of it. So each edge is updated by itself, since this is a time-dependent model, by the two vertices it connects, and by the global attribute. The second step is to aggregate all the edges connected to a node and update the node based on that, on the node itself at the last time step, and on the global attribute. And finally, you aggregate everything and update the global attribute. So this is what it looks like if you want to express it in code, and we have an implementation of this at this link; a minimal sketch of one such block also appears below.

Speaking of the package, just one minute. Sorry, my GraphNet background is pretty poor; could you maybe explain the same thing in the pseudocode? Sure. For all the edges: each edge is updated by the edge itself, by its connected nodes, and by the global attribute, right? And in this case, the global attribute is like a global loss? No, it's a global attribute you put on the graph. For example, if you're modeling a group of people, say a team of workers, you could model their efficiency as an attribute. It could be a proxy; it could also be a hidden variable or hidden state; it doesn't have to be explicit. How are you representing that global attribute, as a vector? It's a vector. Is it like a one-hot encoding? Not necessarily.

Oh, by the way, you can do this in a synchronous or an asynchronous manner. We do it in the synchronous manner. One reason is that it's faster and cheaper when you use a lot of GPU. And also, if you do it in an asynchronous manner, you have to determine the order in which the updates are executed, and you might break the invariance.
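Here is that minimal sketch of one synchronous full-GN-block step, in plain numpy rather than the project's actual TensorFlow implementation; the phi functions stand in for small feed-forward networks, and sum is used as the aggregator.

```python
import numpy as np

def gn_block(v, e, u, senders, receivers, phi_e, phi_v, phi_u):
    """One synchronous GraphNet step. v: (n_nodes, d), e: (n_edges, d), u: (d,)."""
    n_e, n_v = len(e), len(v)
    # 1. Edge update: each edge sees itself, its two endpoints, and the global u.
    e = phi_e(np.concatenate(
        [e, v[senders], v[receivers], np.tile(u, (n_e, 1))], axis=1))
    # 2. Aggregate incident edges per node (both directions, since undirected),
    #    then node update from (aggregated edges, node itself, global u).
    agg = np.zeros_like(v)
    np.add.at(agg, receivers, e)
    np.add.at(agg, senders, e)
    v = phi_v(np.concatenate([agg, v, np.tile(u, (n_v, 1))], axis=1))
    # 3. Global update from aggregated edges, aggregated nodes, and u itself.
    u = phi_u(np.concatenate([e.sum(axis=0), v.sum(axis=0), u]))
    return v, e, u

# Smoke test with dummy "networks" that just slice back down to width d.
d = 4
take = lambda x: x[..., :d]
v, e, u = np.ones((3, d)), np.ones((3, d)), np.zeros(d)
senders, receivers = np.array([0, 1, 2]), np.array([1, 2, 0])
v, e, u = gn_block(v, e, u, senders, receivers, take, take, take)
```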
So, a little commercial time for the package we are developing. It's called GIMLET, for graph inference on molecular topology. Everything is written in Python and TensorFlow. It's open source, under the MIT license. And we even have our own input and output pipeline for small molecules, so if you don't like RDKit or something like that, you can use a TensorFlow implementation of those pipelines, and it can be put onto GPUs or whatever sort of supercomputer, compiled, and executed in parallel as TensorFlow graphs. By the way, the aspect of this project I'm most proud of is the design of the logo. The first bit of the logo is the letter G. There's a graph, but it's also a chemical graph: it's, what's it called, propylene oxide, rotated 90 degrees. And it's a cocktail glass, a gimlet glass. The drink looks something like this; it's a simple mixture of, I think, gin, lime juice, a piece of lime, and a little syrup.

All right, just a few more words on this. Usually the update functions are just feed-forward networks, and for the aggregate function you can use sum, but you can also use max, or average. These all have the property I just talked about: a function that operates on a set of elements and returns something that, in terms of dimensionality, looks like an element of that set.

A few flavors of graph nets: this is a full GN block, where everything is connected and everything is executed in the manner I just described. If you delete some connections, you get a message passing neural network. And if you delete all the connections between the entities, so that each entity is only connected to itself, it's just a recurrent neural network. That kind of tells us that this thing itself is sort of a recurrent neural network: the matrix multiplications with the same weights are carried out multiple times during one run of message passing. The name "message passing" comes from the idea that your neighboring nodes are writing messages to you, and you are updated by their messages. Here is a paper on message passing neural networks, although they are not as expressive as the general full GN block. There are a bunch of papers that discuss the different choices of update, aggregate, and readout functions. By the way, I don't think I introduced what a readout function is. The readout function is the function you apply at the very end, after all the rounds of message passing: you take the trajectory, or just one snapshot, of all of your entities and come up with an answer. For example, for predicting the efficiency of a group of workers, after a few message-passing rounds that model their interactions, you have all the trajectories, and you have a function that takes all of that into consideration and predicts a number. You use that in your backpropagation step and minimize the loss, if you have data.

All right, let's move on to the first part; what came before was part zero. The first part is about discriminative models for per-graph attributes, because per-graph, or per-molecule, attributes are what everybody is predicting in the molecular machine learning setting. Just a quick showcase of the results: if we use the traditional full-block GraphNet with a residual connection in the readout step, it already does pretty well. It is on par with, if not better than, most of the state-of-the-art models reported in the MoleculeNet paper. However, it does not do as well on something like lipophilicity, and we're going to talk about the reason.

Just by looking at the distributions of these properties, you can see that ESOL, which is water solubility, is of course a function of the size of your molecule. FreeSolv is a solvation free energy, and in an MD calculation it can be broken down into terms that can be associated with individual atoms, so it's sort of atom-additive, although not exactly so. Whereas for lipophilicity, the distribution has very little correlation with the number of atoms in your system. That's the reason why, if you use a mean aggregation rather than a sum aggregation in your last step, it gets slightly better: lipophilicity is strictly not atom-additive. So our question became: is there an association between the sum in the hidden space and the sum in physical space? In other words, if you have an extensive property versus an intensive property, do you need to adjust your architecture?

To answer this question, we prepared two very simple toy tasks. The first one is molecular weight and the second one is mean atom weight; those are the simplest things you could ever predict, right? By mean atom weight I just mean the molecular weight divided by the number of atoms. If you use a sum readout to predict the sum of atom weights, the molecular weight, it works perfectly. If you use a mean readout to predict the mean atom weight, it kind of works, although there is a paper discussing the fact that mean aggregation is not as powerful as sum aggregation and is harder to train. However, if you swap them, it's kind of miserable.
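A toy sketch of those two tasks (illustrative only; here we pretend the learned per-atom hidden state is simply the atomic weight): a sum readout recovers the extensive property and a mean readout the intensive one, and neither can stand in for the other without knowing the atom count.

```python
import numpy as np

atom_weights = {"C": 12.011, "N": 14.007, "O": 15.999, "H": 1.008}

def readout(node_states, mode):
    """Smash per-node states into one per-graph answer."""
    return node_states.sum(axis=0) if mode == "sum" else node_states.mean(axis=0)

mol = ["C", "C", "O", "H", "H"]                   # a toy molecule
h = np.array([[atom_weights[a]] for a in mol])    # pretend-learned node states

mol_weight = readout(h, "sum")    # extensive: ~42.04, grows with molecule size
mean_weight = readout(h, "mean")  # intensive: ~8.41, independent of size
```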
I'll come back to this later, but just to give another extreme example, suppose you just want to predict the average and the sum of atom weights for this set of molecules: cyclopropane, cyclobutane, cyclopentane, cyclohexane, all the way up to a cyclic n-atom ring. In the language of graph theory, these are cycle graphs. The correct answer for the average atom weight is around 12 if you have implicit hydrogens, and the correct answer for the molecular weight in this case is 12n, where n is the number of atoms. This is kind of obvious, right? But if you use a graph net to solve this problem, it gets a little trickier. At t = 0, all the nodes in all the molecules in this set are initialized with the same attributes. If that isn't obvious for smaller graphs, where you might include more features describing the local environment of each atom, it is clearly true for larger graphs: the carbon atoms in a large ring and an even larger ring are almost identical. And we can argue by recursion: if at time t this set is locally isomorphic, meaning all of the node states are the same and all of the edge states are the same, that is, all the molecules in this set are locally isomorphic, then after another round of message passing, because you update each edge from the edge itself and the connecting nodes, and you update each node from the node itself and the connecting edges, at t + 1 it is still locally isomorphic. So it is locally isomorphic at all times, right?

After however many rounds of message passing, you now want to use the readout function to get an answer. To simplify things, if we just look at the last frame of this trajectory, then you have a set of exactly identical nodes and exactly identical edges. If you use a sum function, it's going to give you different answers for different ring sizes, proportional to the size of the system. If you use mean aggregation, it's going to give you the same answer for every ring size. So if you use the sum function to predict the average weight, it's going to be wrong, and if you use the mean function to predict the total weight, it's going to be wrong. Of course, in a realistic situation, you could have a neural network that predicts both and combines them, whatever. But it tells us that, with a simple structure, it is important to know whether your quantity is intensive or extensive and to choose your aggregation function accordingly.

This problem also tells us that a graph net with mean aggregation is not even able to predict ring size. This seems to contradict the results of Loukas, that graph nets can compute many such things with a large enough hidden layer and enough rounds of message passing. But the thing is, that is proven on the basis of labeled graphs. If your graph is unlabeled, a lot of those properties break. And you do want an unlabeled graph, because you want the model to be permutation invariant or equivariant, depending on what kind of property you're predicting. The easiest way to achieve that is to use an unlabeled graph, or to discard the labels at some stage, and to perform the node and edge aggregation in a synchronous manner. I was also going to talk a bit about discriminative models for per-node and per-edge attributes, but that's too theoretical, I think.
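A sketch of the cycloalkane argument in code (a toy, hand-written update rule, but the symmetry argument is the same for any learned update): identical initial states on a cycle graph stay identical through every round, so a mean readout is blind to ring size while a sum readout scales exactly with n.

```python
import numpy as np

def message_pass_cycle(h, rounds=5):
    """Toy update on a cycle graph: each node mixes itself with its two neighbours."""
    for _ in range(rounds):
        h = np.tanh(h + np.roll(h, 1, axis=0) + np.roll(h, -1, axis=0))
    return h

for n in (3, 4, 5, 6):                        # cyclopropane, cyclobutane, ...
    h = message_pass_cycle(np.ones((n, 4)))   # identical carbon initializations
    # Every row of h is still identical, so:
    print(n, h.mean(axis=0)[0], h.sum(axis=0)[0])  # mean: constant; sum: grows with n
```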
All right, so let's talk about the applications first: charge prediction. Charge is a very important parameter in MD calculations because it determines the energies and all the forces at each time step, and charges are predetermined at the beginning of an MD calculation and fixed during its entire course. But the charging methods we are using right now are either super expensive or unreliable, like the empirical methods. So we're asking: can we use a graph net to approximate the results of expensive QM-derived charges?

We trained the graph net on two datasets. The first one is a set of charges derived from density functional theory calculations. It's accurate, but it is also a function of your conformer, and we take the lowest-energy conformer's charges as the charges associated with that molecule. Of course charges are a function of conformation, but at the beginning of an MD calculation we pretend they are not and assign each atom a charge regardless of its conformations. To address this, we are also generating charges using the AM1-BCC ELF10 method, which can be thought of as a Boltzmann-weighted average charge over the several lowest-energy conformers.

Just by looking at the problem, it seems intuitive to predict this directly: put the charge on the node and minimize the loss, and you get the charges. But this does not work very well, because of the net-charge constraint: the total charge could be zero, positive, or negative, and generic neural networks don't like that. If it were always a fixed sign, you could just use a softmax-style function to constrain it. The solution is to follow a previous paper and predict the first- and second-order derivatives of the potential energy with respect to the atomic charges, and use those to reconstruct the charges. Here's the idea: expand the contribution of the atomic charges to the potential energy as a Taylor series and keep only the first few terms. This is actually how electronegativity and hardness are defined. For a long time this has been standard practice: you measure, or use quantum mechanics calculations to determine, ionization potentials and electron affinity data, fit the hardness and electronegativity of all the atoms, and it's proven to work.

So we use the same idea, with the charge equilibration method proposed by Gilson et al., and formulate this as a two-stage optimization problem. First, the neural net predicts the electronegativity and hardness of each atom. Then, given those, you solve for the charges that minimize the potential energy subject to the net-charge constraint. Luckily, the second step is easily solvable with a Lagrange multiplier, the solution is analytical, and its Jacobian and Hessian are trivially easy to calculate. So you can put this end to end, flow your gradients through all the layers, and it works kind of well. The discrepancy, the RMSE, is around 0.02, which is in the range of the disagreement between AM1-BCC and DFT calculations. So we now have a method that is almost as accurate as AM1-BCC but 500 times faster. If you ablate some part, for example if you do not do the Boltzmann averaging, the performance is about the same. But if you directly predict the charges, it's miserable, because of the constraint I just mentioned.
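To make the analytic second step concrete, here is a sketch (toy numbers, hypothetical parameter values) of the Lagrange-multiplier solution: minimizing E(q) = sum_i e_i q_i + (1/2) s_i q_i^2 subject to sum_i q_i = Q, where e and s are the electronegativities and hardnesses the network predicts.

```python
import numpy as np

def equilibrate_charges(e, s, Q=0.0):
    """Closed-form charges minimizing sum_i (e_i q_i + 0.5 s_i q_i^2) s.t. sum_i q_i = Q.

    Stationarity gives e_i + s_i q_i = lam, so q_i = (lam - e_i) / s_i;
    plugging that into the constraint solves for the multiplier lam.
    """
    lam = (Q + np.sum(e / s)) / np.sum(1.0 / s)
    return (lam - e) / s

e = np.array([2.5, 3.1, 2.2])    # toy electronegativities (not fitted values)
s = np.array([5.0, 6.0, 4.5])    # toy hardnesses
q = equilibrate_charges(e, s, Q=0.0)
assert np.isclose(q.sum(), 0.0)  # the net-charge constraint holds by construction
```

Because this map from (e, s) to q is a simple closed-form expression, its derivatives with respect to the network outputs are cheap, which is what lets the gradients flow end to end.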
If you also take a look at the predicted hardness and electronegativity for each element, the distribution within each element kind of corresponds to our understanding of that element. For example, carbon is here: very hard, and roughly electroneutral. Hydrogen is everywhere. The halogens are more on the electronegative side, and phosphorus and nitrogen are more on the electropositive side.

This animation shows what the latent space looks like at each round of message passing, color-coded by element type, by hybridization, and by predicted hardness and electronegativity. At t = 0, when we initialize the message passing, you only see a few dots; I apologize that it's not very clear here. You only see a few dots because all a node knows at initialization is its element type and nothing else. But after one step of message passing, it knows its direct neighbors, so the points scatter, and the more steps you take, the further they spread out. And it seems that in the latent space it can already distinguish the different hybridization types, although we did not feed that in directly; it can determine whether an atom is aromatic or not; and you can see it pushing atoms with different predicted answers to different corners of the latent space.

How many time steps do you run? This is five time steps. It's one of the parameters we had to tune, and usually anything larger than six doesn't make a difference, because you can think of charges as locally determined by the neighbors; once you go past six, it stops mattering. Interestingly, if at time step one it already figures out its neighbors, that's actually pretty good; it looks like you might not need more than even two time steps, just looking at these. Right, because the information from the neighbors keeps aggregating onto the nodes. Actually, another question I was curious about is the structure of the network and what kind of parameters you have, because I don't know what overfitting looks like in this space, but it seems like a very complex model. Well, the structure of the architecture itself is kind of complex, but each composing function is only 30 to 64 dimensions, which is already sufficient. Right, but if you have more parameters than the training set, then you're always kind of overfitting. Right, but we tested it; it's not overfitting. No, just curious about the architecture. Well, if you look at the pure number of parameters in the model, it's not terrible compared to a traditional CNN model.

Okay, this graph shows the linear regression of each step's states against the previous step's states, which is to say, how predictable the next step is. At t = 0 you don't know anything; but as knowledge accumulates, the gain you get by allowing ever-further neighbors to pass their messages onto you becomes marginal.
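The locality argument can be sanity-checked with a receptive-field sketch (toy code, a linear chain rather than a real molecule): after t rounds of message passing, a node's state can only depend on atoms within graph distance t, which is why rounds beyond about six stop helping for a local quantity like a partial charge.

```python
import numpy as np

def receptive_field(adj, node, t):
    """Indices of nodes that can influence `node` after t message-passing rounds."""
    reach = np.linalg.matrix_power(adj + np.eye(len(adj), dtype=int), t)
    return np.flatnonzero(reach[node])

# A 10-atom chain: node 0 only "sees" atoms up to t bonds away.
chain = np.diag(np.ones(9, dtype=int), 1) + np.diag(np.ones(9, dtype=int), -1)
for t in (1, 3, 6):
    print(t, receptive_field(chain, 0, t))   # [0 1], [0 1 2 3], [0 ... 6]
```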
The last thing I'm going to talk about regarding this model is scalability. I was making a bet with Josh about it: since the locality is determined by the number of rounds of message passing and so on, maybe it would scale poorly with the size of the system. We were expecting the absolute error to show a not-very-steep increasing trend, but to our surprise, it's actually decreasing. We're still looking into that. This red zigzag is just a very ugly version of a violin plot. The reason we think it decreases might be a bias in our dataset: in a DFT calculation, if you have more aromatic rings in your molecule, the result can be more accurate, and molecules with more rings tend to be heavier.

All right. Now I'm going to talk a little about work in progress, and that's it. One thing we're doing right now is inter-hierarchical multitask learning. This is something kind of new: we want to look at whether the way this model predicts physical properties has any correspondence or agreement with how the properties themselves are physically determined. So we try to look at: can we do per-atom, per-bond, and per-molecule prediction all at once? Is there any improvement in performance compared to training them independently, is it better or worse? Or can we provide some of these attributes and let the model predict the others? Can we provide, for example, these two and predict that one? You can do all sorts of combinations. The per-atom attributes we have right now are just charges and hardness; the charge prediction is pretty solid at this point. For per-bond attributes, Chaya is working on predicting the Wiberg bond order, which is a bond order that is not discretized into integers. It's a float, but it still corresponds to a chemist's understanding of the bonding structure: rather than just 1.5 and 2, you can have any number in between. It can be determined by quantum chemistry, and it's a pretty good indicator of the energy needed to break that bond, so we are interested in predicting it as well. A nice thing is that you can derive all of these from one calculation, the AM1 step of the AM1-BCC calculation, and you can get the formation energy, or atomization energy, of the molecule along with them. We are now generating a dataset of roughly 3.5 million molecules from Enamine on which we're going to aggressively run these predictions. And we're also thinking that maybe we should just release this dataset as well; the name is pending, it could be ChargeNet or something, so that the machine learning community could use these sorts of architectures to do this inter-hierarchical learning, because right now most models focus only on per-molecule attributes.
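A hedged sketch of what such an inter-hierarchical readout could look like (the names, shapes, and heads here are all hypothetical; this is not the actual work-in-progress code): after shared message passing, separate heads read out per-atom, per-bond, and per-molecule targets.

```python
import numpy as np

def multitask_readout(v, e, heads):
    """v: (n_nodes, d) node states, e: (n_edges, d) edge states after message passing."""
    return {
        "charges":    heads["atom"](v),         # per-atom target, e.g. partial charges
        "bond_order": heads["bond"](e),         # per-bond target, e.g. Wiberg bond order
        "energy":     heads["mol"](v.sum(0)),   # per-molecule target, sum-aggregated
    }

d = 16
# Stand-in linear heads; in practice these would be small feed-forward nets.
heads = {name: (lambda x, w=np.ones(d) / d: x @ w) for name in ("atom", "bond", "mol")}
out = multitask_readout(np.ones((5, d)), np.ones((4, d)), heads)
# Training would sum the per-task losses, or hold some targets out as inputs.
```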
Another very interesting thing we're trying to do as a team is to use this model to directly predict the parameters of force fields, or rather, to come up with a machine-learned potential that is easy to sample but pretty much as accurate. Here's the idea. First of all, you may be familiar with ANI, or ANAKIN-ME, which is a neural net that takes the entire set of coordinates of your system into a large neural network and spits out the energy. It does pretty well: it can predict QM energies, as well as other properties, in high agreement. The only thing is that it's 1,500 to 15,000 times slower than a classical MD calculation, and you cannot use it to produce any trajectory longer than nanoseconds, because you need to do the forward and backward pass once per time step of your MD calculation, which makes it slow. But it's very accurate.

So we're thinking: can we preserve the traditional separation of bond, angle, torsion, and non-bonded terms, but use machine learning, something like a graph net, to determine all the parameters? This is just the traditional MM force field, and there are good things and bad things about it. The good things: the separation itself is good, and the harmonic functional form is good; it corresponds to our physical understanding of small molecules and proteins. But all the parameters are bad, or evil, because you need a lot of time to fit them. I guess the entire vision of the Open Force Field consortium, or part of it, is to come up with better, cleverer ways to fill in all these parameters. And there's also the r^-12 repulsion term, which is a necessary evil because it does not have real physical meaning. So we're thinking: can we preserve the good equation here, and the separation, but predict the bond, angle, torsion, and pairwise energies as polynomial functions whose parameters are filled in by a graph net?

This is where the hyperedges come into play. The parameters you put on the vertices, the edges, the angles, and the dihedrals come directly from the readout function of your graph net. For example, if you have a 6th- or 12th- or whatever-order polynomial, you use a graph net to get a number for each term of that polynomial, you calculate the energy, and you compare it with the data we have in QCArchive. And you can match not only energies but also Jacobians and then Hessians. This is something we're still working on. And of course, nothing is lost: even if you don't use higher-order polynomials, if you just have second order, harmonic and centered on some equilibrium value, then what you get is exactly a force field, just with cleverer atom typing: each atom in the system gets its own type. And when designing those polynomials, you can build whatever understanding you have of the physics into the structure of the polynomial. For example, we know that the non-bonded terms should go to zero at infinity; we know that the bonded terms should be roughly harmonic near the equilibrium value; all of this is easily encoded in the functional forms. Of course, we know that the torsions and angles should be periodic, so maybe you put a cosine or sine function on the theta before feeding it into the polynomial.
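A sketch of the functional forms being described, with the parameters imagined as coming from a graph net's readout over hyperedges (the parameter names and values here are placeholders, not a fitted force field): harmonic bonds and angles, and a cosine on the torsion angle so periodicity is built in.

```python
import numpy as np

def mm_energy(r, theta, phi, p):
    """MM-style energy with graph-net-supplied parameters p (arrays, one per term)."""
    e_bond    = 0.5 * p["k_r"] * (r - p["r0"]) ** 2              # harmonic near r0
    e_angle   = 0.5 * p["k_theta"] * (theta - p["theta0"]) ** 2  # harmonic near theta0
    e_torsion = p["k_phi"] * (1.0 + np.cos(p["n"] * phi - p["phase"]))  # periodic by construction
    return e_bond.sum() + e_angle.sum() + e_torsion.sum()

# Placeholder parameters, standing in for a graph net readout.
p = {"k_r": np.array([300.0]), "r0": np.array([1.52]),
     "k_theta": np.array([50.0]), "theta0": np.array([1.91]),
     "k_phi": np.array([2.0]), "n": np.array([3.0]), "phase": np.array([0.0])}
energy = mm_energy(np.array([1.54]), np.array([1.95]), np.array([0.3]), p)
```

Higher-order polynomial terms would just add more predicted coefficients per bond or angle, and the quadratic special case reduces to an ordinary force field with per-atom-environment parameters.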
And that's pretty much it. I do appreciate the support from Josh, Chaya, and John; they're fantastic. And my friend, for help with the statistical learning, and the great HPC infrastructure from MSK. Thank you. Thanks for your time. Questions?

Yeah, can you go back a slide? One slide before that. Okay. So which dataset would you use to derive those? QCArchive. QCArchive is an effort by the Open Force Field community, mostly, that runs a lot of calculations at different levels of quantum chemistry on a representative collection of small molecules that they believe is useful for deriving force field parameters. So the dataset is just quantum mechanical energies, plus their Jacobians, which are the forces, and Hessians. And correspondingly, you can calculate the first- and second-order derivatives of all of these from the neural network, and you just match: you can do force matching, and you can also try to match the energies. But there is a small issue, which is that there are offsets in your QM energies: the QM energy is offset by a bit, and you cannot express the entire QM energy using only the force field terms. Our strategy to deal with that is to predict the offset as well, using another graph net; it could be the same network with just one more readout, but you predict the offset as a function of the topology only, not the geometry. So now you have an equilibrium term, so to speak, an offset or geometry-independent term, plus all these geometry-dependent terms. The bottom line is: if we just use second-order terms here, then it's a clever atom typing plus a traditional force field. And even if you have higher-order polynomials, it's trivially easy to plug those numbers into a platform like OpenMM, where you can customize your bonded and non-bonded forces by using a string to represent the energy expression, and then just simulate it. As for speed, we haven't done the experiments yet, because we're still fitting all those parameters, but we reckon the speed will be roughly the same as traditional MM, because the most expensive part is determining the distances, the angles, and the torsions, and the actual evaluation of the energy as a polynomial is not expensive. So you can imagine it will be a little slower than traditional MD and a little less accurate than ANI, but you have this happy middle ground, and you can actually use it to simulate things and hope they agree with quantum chemistry.

Yes, please. This may be me being ignorant of the topic, but the batch size is crucial; some of the work we've done in the past comes down to that, and I know you spoke about synchronous and asynchronous updates and so on. I didn't plan on asking, but what does the batch size actually do to the model overall and to your results? Maybe you can touch on that. Well, batch size is just another hyperparameter that we tuned during the very aggressive hyperparameter tuning. But it is very important how we batch things, because molecules have different numbers of atoms, and if you just pad them all to the largest molecule in the dataset, you waste a lot of compute on dummy atoms. This is sort of an active research area. Our solution is to diagonally concatenate the adjacency matrices of different molecules into one larger graph and pad the rest. For example, if you say you want a batch size of 128 atoms, then you grab a 10-atom molecule, an 8-atom molecule, another 10, another 8, or whatever the sizes may be, and when you reach the point where grabbing another molecule from the dataset would take you past 128, you pad up to 128. By this means you only waste a few dummy atoms, and it has proven to be an efficient strategy. Alternatively, you can bin: you can have one net that deals with molecules with fewer than 10 atoms, one for 10 to 20, one for 20 to 30, 30 to 40, and so on and so forth, and let all those networks share parameters. But that is something like 20 times slower in training and inference compared with just padding and treating them as one larger graph. Right, and the implication is that once you have a trained model, you can run inference very quickly.
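A sketch of that batching strategy (illustrative; the real pipeline works on TensorFlow tensors, and the budget of 128 atoms is just the example from above): greedily pack molecules, diagonally stack their adjacency matrices, and pad only the remainder with dummy atoms.

```python
import numpy as np
from scipy.sparse import block_diag

def batch_molecules(adjacencies, budget=128):
    """Greedily pack adjacency matrices until the per-batch atom budget is hit."""
    batch, n_atoms = [], 0
    for adj in adjacencies:
        if n_atoms + adj.shape[0] > budget:
            break                             # next molecule starts a later batch
        batch.append(adj)
        n_atoms += adj.shape[0]
    big = block_diag(batch).toarray()         # block-diagonal: no cross-molecule edges
    padded = np.zeros((budget, budget))       # dummy atoms only in the leftover corner
    padded[:n_atoms, :n_atoms] = big
    return padded

mols = [np.ones((n, n)) for n in (10, 8, 10, 8)]  # toy adjacency matrices
batched = batch_molecules(mols, budget=40)        # wastes only 4 dummy atoms
```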
I don't know if you've done that benchmarking as well, like the inference time per sample on a test set? I did; it's about 0.002 seconds. So that's pretty quick. But one thing, I don't know if you've already looked at it: you pad up to a given number depending on what you want to accommodate in your batch. Have you played around with increasing and decreasing the padding, to see whether there are more efficient ways to do it? There is a difference in efficiency, of course, but there's also the question of how stochastic your stochastic gradient descent is; the batch size also affects the loss trajectory. We played with this, included it as a hyperparameter, and did some search. I guess that's the only way we can do it, because it's so black-boxy. How about batch normalization? No, we did not do that. Okay, I just wanted to make sure, because I think the batch size plays a crucial role when batch normalization is enabled, and batch normalization can help you get to your overall accuracy a little quicker. Right. So maybe that's one of the things you should try, since you were talking about training time and all that. Any questions from people on Zoom? I guess not. In that case, thank you. Thank you.