Let me share my screen and get started. This is joint work with Valeria de Paiva, and it is centered on the Curry-Howard isomorphism, the correspondence between intuitionistic propositional logics and typed lambda calculi, including linear type theory. The basic idea is to generate all the theorems of a small, well-behaved fragment of logic together with their proof terms, and then feed them to neural networks, which are universal approximators, to see whether they can learn to prove such formulas. The somewhat surprising thing here is that we got very, very high success rates, with the right kind of neural network, and I will also overview a little bit what kinds of neural networks are out there and which ones fit this task.

First, a word about the logics involved. Full propositional intuitionistic linear logic is undecidable even in its propositional form, but if we restrict it to the implicational fragment, that becomes decidable. In intuitionistic propositional logic that's not the case: it stays PSPACE-complete, and actually Richard Statman proved PSPACE-completeness as a result on that implicational fragment, which is equivalent to things like QBF, quantified Boolean formulas, in terms of complexity, and allegedly it's above NP; we kind of believe that. And what we want here is to have polynomial algorithms for generating the theorems, because when we turn them into test sets, by combining the tautologies and their proof terms, that can be useful for either correctness or scalability tests of linear logic theorem provers. When turned into data sets, of course, they would help training neural networks that are focused on neurosymbolic computations, which are getting very trendy in the neural net conferences. They are somewhat related to either theorem proving, or more specifically to reduced fragments of logic programming, like Datalog, for instance, and some reasonably good practical combinations of neural networks and symbolic computations have been happening in the last two years' NeurIPS conferences.
So, let me show you very quickly, instead of writing down the spec, how a formula looks in this implicational fragment of linear logic. This is the lollipop operator, approximated here as a dash and an 'o'. So, there's the formula. And this is the lambda term corresponding to it, with l marking the lambdas and a marking the applications. Not the de Bruijn form here, just plain variables, and the variables are denoted as numbers: zero is x, one is y. And then this is a binary tree with leaves in a set of variables. There is a canonical form for that which is relatively restrictive: instead of putting in all possible combinations of variables, we can restrict it to something that's counted by the Bell numbers. We will get to the sequences in the Encyclopedia of Integer Sequences. So, not all formulas are as nicely mapped as these two here. For instance, for the formula here, we have lambda x.x, the identity function, and this one has some redundancy. So, they are not necessarily mapped one-to-one, unless they are in normal form on the two sides. So, the Curry-Howard correspondence is a correspondence between computations and proofs; it works partly because the axioms are just isomorphic on the two sides. In its simplest form, it connects the implicational fragment of propositional intuitionistic logic, which we'll call IIPC here for easy pronunciation, with types in the simply typed lambda calculus. A low-polynomial type inference algorithm associates a type, when it exists, to a lambda term (a small sketch of this follows below). That requires unification with occurs check in general, but in the linear case it's not needed, because the mapping is one-to-one between a binder and the variable it binds. On the other hand, we have a harder, PSPACE-complete algorithm that associates an inhabitant to a given type expression, with the resulting lambda term, ideally in normal form (if it's not, we can normalize it), serving as a witness for the existence of a proof of the corresponding tautology. So we can use combinatorial generation of lambda terms plus type inference, which is easy, to solve type inhabitation problems, which are in that sense harder. Now, the formulas and the generators that we ended up with are covered in our ICLP 2020 paper, from about a month ago. Basically we had a few steps to get where we are, and we will see where we got on the next slide, but the process was deriving better and better programs for the purpose. First we start with the implicational formulas directly, so we have binary trees of size n, counted by the Catalan numbers, and then we label variables using set partitions, so that we do not have too many redundant combinations of variables. That's reasonably smaller than the exponential explosion you would get if every variable could go in every position, and it's counted by the Bell numbers. We have n plus 1 leaves for binary trees with n internal nodes. The linear lambda terms, which will be the proof terms, are derived first by introducing something called a linear skeleton on Motzkin trees, binary-unary trees, with constraints enforcing a one-to-one mapping from variables to their lambda binders; then we derive closed linear lambda terms, then we derive them in normal form, and after that we eventually end up with a very good generator, shown on the next slide.
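Since this is only described in words here, a minimal Prolog sketch of that easy direction, type inference for closed linear lambda terms, may help. The l/2 and a/2 constructors follow the slides, while the predicate names and the use of Prolog's existing -> operator standing in for the lollipop are my own illustrative choices, not the paper's actual code.

```prolog
% principal_type(+ClosedLinearTerm, -Type)
% Terms use l(X,Body) for lambdas, a(F,A) for applications, and a shared
% Prolog logic variable for each binder and its single occurrence.
% Inference works on a copy, so the original term is left untouched.
principal_type(Term, Type) :-
    copy_term(Term, Copy),
    infer(Copy, Type).

infer(X, T) :-                  % variable occurrence: its type is whatever
    var(X), !, X = T.           % the context demands; record it on the variable
infer(l(X, Body), (X -> T)) :-  % lambda: X ends up bound to the argument type
    infer(Body, T).
infer(a(F, A), T) :-            % application: F must map A's type to T
    infer(F, (TA -> T)),
    infer(A, TA).
```

For example, principal_type(l(X, l(Y, a(X, Y))), T) yields T = (A -> B) -> (A -> B) up to renaming. As noted above, no occurs check is needed in the linear case; for the general IPC case one would add a clause for repeated, already typed occurrences and switch to unification with occurs check.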
Pierre Lescanne had a very nice way to count how many of those terms of a given size exist, but his Haskell generator will go about two orders of magnitude below this, because of Haskell's memory inefficiency. Here we generate them on backtracking, and then we can go as far as we want, and we can even parallelize the generation, so we can compute those terms relatively easily. That's a good size for a neural network; billions are usually enough for training them, even less, and then of course we also have their types, because types are easy to infer, so that's the data set that we will eventually generate, and the generator itself is relatively compact Prolog code. The way we do it, actually, I'm just giving the main ideas here (a simplified sketch of a generator in this spirit follows below). We do not need unification with occurs check for checking that the variables corresponding to the same lambda binder have the same type, unifying them without forming cycles, because there is only one such occurrence, so that's relatively easy. What we do need to check is exactly that there is exactly one binding, and after we pick a binding we have a check-binding constraint here that is very easy to test: you initially use a fresh variable, then you bind it, and you can check whether it has been bound or not by using nonvar. With that you can find relatively efficiently the correspondence between the lambda binders and the variables, and for that you don't need to use De Bruijn indices. So this is the final derived generator. Now, if we look a little bit at the lambda terms that we have here on the left, a similarity kind of shows up; that's at least empirically how we found it out: the term and its linear type look very similar. The same applies for a few larger terms, where you can already notice some symmetries, and if the size definition is fine-tuned, it looks like, and that's the first empirical conjecture here, that they have the same size, the type on the right and the lambda term on the left. So this is the little Eureka moment: we see some interesting symmetries in the pictures. There are exactly two occurrences of each variable, both in the theorems and in their proof terms, of which the theorems are the principal types; theorems and their proof terms have the same size, counted as the number of internal nodes, so that's the size definition that we need. And then, generating all the IPL tautologies of size n reduces to generating, as I showed in the previous slide, all the proof terms of size n, and the bijection between the two will make it extremely easy also for neural networks to eventually learn, because it is a very simple correlation. Now, after reading a little bit about the theory behind everything here: Zeilberger attributes this observation to Grigori Mints, namely that there is a size-preserving bijection between linear lambda terms in normal form and their principal types, and he has a nice little paper where he basically sketches a proof of that, by exhibiting a reversible transformation of oriented edges in the tree describing the linear lambda term in normal form into the corresponding oriented edges in the tree describing the linear implicational formula, which acts as its principal type.
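Here is a minimal sketch, in the same spirit but not the paper's derived generator, of enumerating closed linear lambda terms of a given size on backtracking. The explicit size budget and the membership-based linearity check are my own simplified choices; the actual generator works on normal forms and uses the nonvar-based binding check described above.

```prolog
:- use_module(library(lists)).   % select/3

% lin_term(+N, -Term): enumerate, on backtracking, closed linear lambda
% terms with exactly N internal nodes (lambdas plus applications), using
% l(X,Body), a(F,A) and logic variables for occurrences as in the slides.
lin_term(N, T) :-
    lin(T, [], [], N, 0).

% lin(Term, AvailIn, AvailOut, SizeIn, SizeOut):
% AvailIn holds binders in scope that have not been used yet; a leaf
% consumes exactly one of them, which is what enforces linearity.
lin(X, Avail0, Avail, N, N) :-          % variable leaf: costs no size
    select(X, Avail0, Avail).
lin(l(X, B), Avail0, Avail, N0, N) :-   % lambda node: one size unit
    N0 > 0, N1 is N0 - 1,
    lin(B, [X | Avail0], Avail, N1, N),
    not_in(X, Avail).                   % X must have been consumed in B
lin(a(F, A), Avail0, Avail, N0, N) :-   % application node: one size unit
    N0 > 0, N1 is N0 - 1,
    lin(F, Avail0, Avail1, N1, N2),
    lin(A, Avail1, Avail, N2, N).

not_in(_, []).
not_in(X, [Y | Ys]) :- X \== Y, not_in(X, Ys).
```

For instance, lin_term(3, T) enumerates five closed linear terms with three internal nodes; the three of them that are in normal form are exactly the ones whose principal types also have size 3, which is the size-preserving correspondence mentioned above.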
So, with that we have obtained a generator for all the theorems of implicational linear intuitionistic propositional logic of a given size, as measured by the number of lollipops, and we did that without having to prove theorems, which is definitely good, because our first attempt was to reuse my intuitionistic propositional logic prover, which is a PSPACE-complete algorithm, and then restrict the resulting proof terms to be linear. That works, but going the other way around via the Curry-Howard isomorphism is much easier computationally. Now, this is a Goldilocks situation that points out the very special case of implicational formulas in linear logic and, equivalently, linear type theory, and probably that's why a lot of people study it: it's a very, very nice mathematical mapping that also relates combinatorial maps, types, and lambda terms all together. So, the data sets that we generate now: they also contain LaTeX formulas, by the way, as comments, so someone who wants to display them, or wants a very large example set of theorems and their corresponding linear lambda proof terms, has a few billion of them there, usable for correctness, performance, and scalability testing of theorem provers. They have this structure here, formula and then proof term, those are the pairs, and they are also good for testing deep learning systems on theorem proving tasks. Now, to make this a bit more interesting, some of the tests that I will show you also add non-theorems, because you would like to see whether it distinguishes between theorems and non-theorems, rather than just providing the proof of a theorem when it exists. So, some encoding is always needed to feed the neural networks easily. The encoding is a prefix form here: the lollipop is zero and so is application, lambda is one, variables are uppercase letters on the two sides, and we use a separator between them; with that encoding the formulas look a little bit like this in prefix form (a small sketch of this encoding appears at the end of this part). So that's now relatively easy, because they become strings, but they still have some structure, and the neural networks that we will need for that are networks that are good at processing sequences, mostly borrowed from natural language processing tasks. So, let me very, very quickly overview the minimal background in neural networks that we need for what's coming next. Theorem provers are computation-intensive search algorithms, basically; some of them are Turing-complete, some of them are PSPACE-complete or NP-complete, depending on the languages on which they work. There are a couple of ways that neural networks can help with the proof search. The first would be to fine-tune the search by helping with the right choice points; there are plenty of choice points, basically which of the rules should be applied first, and there is quite a bit of work on using them for that purpose. Another is to use them to solve low-level, perception-intensive tasks, for instance when you work with learnable ground facts labeled with probabilities; DeepProbLog is a system that does that reasonably well, and there was an invited talk at ICLP describing how well it performs on some computer vision tasks. So, the third way, which we are looking at here, is: can neural networks replace the symbolic theorem prover, given a large enough training data set? That's really cutting to the problem as directly as possible.
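To make that encoding concrete, here is a small Prolog sketch that flattens a formula and its proof term into one prefix token sequence. The exact alphabet of the published data sets may differ, so the sep token and the lowercase letter atoms (instead of the uppercase letters mentioned above, which would read as variables in Prolog) are assumptions made just for this illustration.

```prolog
% pair_tokens(+Formula, +ProofTerm, -Tokens)
% Prefix encoding in the spirit described above: 0 for both the lollipop
% (written here with ->) and application, 1 for lambda, letter atoms for
% variables, and a sep token between formula and proof term.
pair_tokens(Formula, ProofTerm, Tokens) :-
    f_toks(Formula, Fs),
    t_toks(ProofTerm, Ts),
    append(Fs, [sep | Ts], Tokens).

f_toks((A -> B), [0 | Ts]) :- !,
    f_toks(A, As), f_toks(B, Bs), append(As, Bs, Ts).
f_toks(V, [V]).                          % a type variable such as a or b

t_toks(l(X, B), [1, X | Ts]) :- !, t_toks(B, Ts).
t_toks(a(F, A), [0 | Ts]) :- !,
    t_toks(F, Fs), t_toks(A, As), append(Fs, As, Ts).
t_toks(V, [V]).                          % a named term variable such as x
```

For the identity, pair_tokens((a -> a), l(x, x), Ts) gives Ts = [0, a, a, sep, 1, x, x].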
And here are some of the key machine learning concepts that we need to watch for. Of course, honesty, which matters in everything: when we split the data set, we need to keep it very clean, just as for medical trials and so on, okay? Training and validation: training is what the neural network trains on, and validation gives a little bit of independent feedback; they're usually split randomly, but then a completely clean and untouched test set is also good, to see from an outsider's perspective how well the neural network performs. That's pretty standard in machine learning. Two things to avoid. First, overfitting, which works on training but fails on validation and then even more on testing. And then something else, which is relatively deep and serious: if we have high Kolmogorov complexity, so essentially random data, then the hopes for neural networks to work on it are very low, because they are approximators that try to encode regularities in the data, and if there is no such thing, if it's pure noise, then they won't work. So, the neural network's key element is a linear one: a matrix times the state or input at the previous stage, plus some kind of bias. That's a linear operation, and they cannot really be trained if they stay just linear, so a so-called activation function needs to be applied to it. This used to be the sigmoid or the hyperbolic tangent, but these days a very simple one seems to perform really well: the max of zero and x, taken component-wise, the so-called ReLU. Apparently that's also very friendly to GPU training of neural networks; it's very, very simple to compute in a vectorized form. Another very important thing is that we need differentiable functions. The gradients computed during backpropagation are the feedback that we send, based on the loss function that measures how far we are from the ideal result; that gets backpropagated stage by stage, and it's the derivative times some learning rate, which is needed to make sure that the backpropagation is not too aggressive, because then it might overshoot or get stuck in a local minimum, while otherwise it might converge too slowly. So, when we start training our theorem provers, we have the Curry-Howard isomorphism, and because both the types and the lambda terms are trees, we represent them as prefix strings and we adjust the size definition, as I mentioned before, so that they have exactly the same size. Now, among the available neural networks, one of the very good ones is the sequence-to-sequence network. They were originally used for translation, French to English and English to French natural language: a sequence on the left, a sequence on the right, and then training them on parallel text, very large data sets of translated text aligned sentence by sentence. So, in this case it would be aligning the proof term and the formula together, as if it were a translation from French to English or something like that. Another important thing is to use a recurrent neural network; an LSTM is a particularly good one in this case. What they do is handle the long-distance dependencies, and that's of course important also in natural language processing.
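To put the pieces just mentioned into symbols (standard textbook notation, nothing specific to this work): one layer applies an affine map followed by the ReLU activation, and each training step moves the parameters against the gradient of the loss, scaled by the learning rate.

```latex
h = \mathrm{ReLU}(Wx + b), \qquad \mathrm{ReLU}(z) = \max(0, z) \ \text{componentwise}

\theta \leftarrow \theta - \eta \, \nabla_{\theta} L(\theta) \qquad \text{with learning rate } \eta \text{ and loss } L
```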
Paul, could I just ask a question here? You're trying to learn a function from formulas, implicational formulas, to the proof terms, which are linear lambda terms; that's the high-level function: given an implicational formula, try to produce a proof term. And just a technical question: do the formulas have to be balanced, in the sense that the variables occur exactly twice, or rather exactly once negatively and once positively, or can they be arbitrary, that is, can propositional variables occur arbitrarily often in the implicational formulas? They may occur arbitrarily often; then the mapping is not a bijection between the lambda binders and the variables, we get into the implicational fragment of IPC, and it works on that too. I will show you the graphics in a minute for the two cases, actually, how well it performs in each. So, let me proceed with the thing itself, because I think we are getting close to being out of time. Let me show you right away the type of neural network. As I said, it is an LSTM, a recurrent neural network that handles relatively well the dependencies inside a tree written in prefix form, and they also, as I said, address some other known problems, for instance vanishing or exploding gradients as you propagate those derivatives over a hundred steps backwards in a deep network: two things can happen, either the gradient explodes or it vanishes, and one of the tricks that LSTMs use is that they also feed values through unchanged, so they have internal loops and a memory where they keep the previous state. It's like an extra feedback from the previous state of the memory, and that propagates over the deep chain of connections. There are some links here in the slides to where the data sets are on the web, and also the actual code for everything. And this is the very positive surprise: the accuracy curve over 100 epochs for the intuitionistic propositional linear logic case. And here is the loss function; usually they go in opposite directions, the loss is diminishing, and when the validation loss and the training loss follow each other very closely, it means that honest generalization has happened. Now, this is a very easy case, so I said, well, let's make it a little bit harder by also throwing in unprovable formulas, and then it has to recognize which is provable and which is not, besides generating a proof term. That also went very, very well, the same convergence over 100 epochs, relatively quickly, maybe one hour of GPU time. And this is now the harder case: we went to intuitionistic propositional logic, where a lambda binder can bind multiple variable occurrences, and this is the general case. It also performs reasonably well; we had a surprise here, probably when the size of the formulas changed radically, and then it dropped, but then it recovered relatively well, and it recovered both on the validation set and the training set. And that's the dual view, the loss curves for the two. My conclusions are that we obtained a generator for the theorems of a given size without needing a theorem prover, and we sketched their use as data sets for training neural networks, turning them into reliable theorem provers for the harder inverse problem: given a formula, find a proof for it.
This is the data set that we have, and now, as future work and, experimentally speaking, open problems: can we extend this to the full fragments of IPC or LL? Because IPC and LL in implicational form seem to be well covered. And of course, something that also needs to be seen experimentally is how they perform on larger human-made formulas, the repositories of those, and also how it works if we use random generators for formulas, either as training sets or as test sets for the results. Let me show you very, very quickly how this actually works. So, what you see here is the honest test on the test set for the relatively easy set, which is the one for the linear case, and this is the intuitionistic case here. Well, you will see that things are not as great when you do the final testing, but they are reasonably good nevertheless. So, this is the neural network working exclusively in inference mode: the training has happened and we are just doing the forward pass, we are not backpropagating anymore, so that is relatively quick. The pluses mean that we actually did well, we found the proof for the intuitionistic formulas. If you look here, you will see that the sizes do not match anymore, so we definitely have more difficulty here, and the formulas are a little bit larger as well. The correlation between the size of the type and the size of the lambda term is not predictable; there are some worst-case scenarios in which one is exponentially bigger than the other. That doesn't happen so far at this size, but we got a reasonably good success rate in this case as well. So, that's my talk, and I'm ready for questions. Thank you very much. It was very interesting to hear how these most modern tools, like neural networks, can be used in research in logic. Are there any questions or comments or suggestions? If I may, I would have a question, or more a comment, actually. Paul, I'm wondering if you are aware of the line of research where people try to synthesize tactics for SMT solvers, and they use neural networks to automatically find those tactics to solve different SMT problems. Have you looked into that, or how does it compare with what you are trying to do? That's a very interesting aspect when you have proof assistants, like Agda or Coq, in which you have to find as automatically as possible the next step, rather than having to feed the proof assistant with it, and tactics are the macros that help you do that. So, if it's a choice between tactics, and there may be a dozen of those, that's a reasonable thing to try to train a neural network on; now, synthesizing one, that definitely looks like a harder problem, and I believe that the way to do it would be similar in terms of training data: the training data would then be the proofs that they have, in terms of which tactic succeeds for what kind of formula pattern.
So, I believe that if you have a very large library, and they are building some such libraries of mathematical knowledge, then if you correlate the formula, its type, some kind of patterns in it, with the sequence of tactics used to prove it, that is something where neural networks can help, but I haven't yet seen very, very strong results. Even in the easier case of logic programming, the famous family program with inheritance, there are some papers trying to get close to that; Geoffrey Hinton started that in the 80s, and then once in a while someone else comes back with a different method. They are all grounding-based, in which you replace variables with all possible constants that you can have, and then you try to train on those. Other very, very interesting things are related to trying to mimic unification: for instance, instead of having actual constants, you would have embeddings, vectors representing those constants, and then some kind of similarity between those vectors, not Euclidean distance, usually cosine similarity, would be what replaces unification, and there is a lot of work centered on that as well. But what we are doing here is going at it directly: when we are in the propositional world, like with linear or intuitionistic propositional logic, there is no need for any of that, and the equivalent of grounding would be generating the formula and proof pairs directly. Now, what's open here is how well that scales to more difficult and larger problems: if you train on small problems, would that naturally scale to gigantic ones? Okay, are there any other comments or suggestions or questions? About neural networks, do you know about neural networks that could work directly on trees instead of sequences? Yes, there are some very good attempts, not just on trees, but actually so-called graph neural networks, in which there are encodings for graphs. The tree-to-tree algorithm, for instance, is an alternative to seq-to-seq here, but for propositional formulas it's, I think, probably overkill to use, for instance, graph neural networks. Now, the tree encodings: the idea would be to use them when you have interesting labels on the trees, rich properties, like you would have for, let's say, a chemical formula or something like that, but in this case we do not have that. When you do a tree-to-tree mapping instead of seq-to-seq, you would try to propagate, from the leaves towards the root, neighbor by neighbor, some of those properties associated with each of the nodes, but because here we do not have rich properties to propagate, that is not needed for this case, so the simplest encoding, in just prefix form, worked relatively well.
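As a side note, the cosine similarity mentioned a moment ago as the stand-in for unification over embedding vectors is just the standard definition:

```latex
\mathrm{cos}(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
```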
If I can ask a second question, related to the first one: do you measure the part of the effort of the neural network that goes into guessing the right tree structure behind the sequence? I mean, how many times does it miss the tree and find another tree? This is the honest measurement, only on the test set, and this is for the harder one, for the intuitionistic logic, so the pluses here would be when it finds the right tree, and the minuses here would be the cases when it gets it wrong. For the intuitionistic logic we get about 90 out of 100, right, but for the case of, let's say, the linear logic, or even the linear logic with non-theorems added and just marked with a question mark, the results are almost as good as the ones for the data that are just formula and proof pairs. So here it also has to know that a formula is not provable, in which case the question mark is generated, and it's still the seq-to-seq algorithm; the non-theorems correspond to the question mark symbol in the training set. And I think it might even get 100% on this, because it's the easy set, the linear logic set. Okay, thanks, any other questions or comments? If not, let's thank Paul.