Let's talk with Pedro Domingos. Hi Pedro, how are you doing? Great, how are you? Fine, fine, everything okay? Is there any movement in the White House? Can you see it from there? Well, I'm actually in Washington State, not Washington, D.C. So you're a little bit far, you're a little bit far, yeah. Yeah, and I'm not so far from Amazon, so tech, not policy. Yeah, well, Pedro is a professor at the University of Washington, and his keynote today will be about Markov Logic. So Pedro, thank you for being here with us. Let me encourage people to ask questions. You have the platform to ask questions, of course, in any language, and we will translate them for Pedro. So Pedro, thank you again, and let's hear your keynote about Markov Logic.

Alright, let's see here. Can you see my slides? Yes, we have them. Okay, very good. So I'm going to talk about unifying logical and statistical AI with Markov logic. Intelligent systems need to deal with the complexity and the uncertainty of the real world. The language of choice in AI for dealing with complexity is first-order logic, and the language for dealing with uncertainty is probability. So we need to combine the two to make progress in AI, and that's what this talk is about: how do we combine first-order logic with probability, and, equally important, once we have that unified representation, how can we learn and reason efficiently with it? The basic idea is very simple, and it's the following. A logical knowledge base is a set of hard constraints on the set of possible worlds. If we violate even one instance of one formula, the world becomes impossible. That's what makes logic so brittle, and that's what we want to avoid. So let's do the following instead: let's make the formulas soft constraints, such that when a world violates a formula, it becomes less probable instead of impossible. And as you violate more and more formulas, the probability just degrades gracefully.
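As a minimal sketch of the soft-constraint idea just described (the number of formulas and the weight below are invented purely for illustration): each violated formula reduces a world's unnormalized score by a factor, instead of zeroing it out the way a hard logical constraint would.

```python
from math import exp

# Toy scoring: k soft formulas, each with weight w. A world that
# violates v of them satisfies the other (k - v), so its unnormalized
# score is exp(w * (k - v)): the score degrades smoothly with each
# additional violation instead of dropping to zero.
def score(violations, k=5, w=2.0):
    return exp(w * (k - violations))

scores = [score(v) for v in range(6)]
```

Note that even the world violating every formula keeps a nonzero score; nothing is ever declared impossible.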
And we're going to give each formula a weight that corresponds to how strongly we believe in it. Something that we really believe in has a high weight, and something that we're not very sure about has a lower weight. Then the probability of a state of the world is just the sum of the weights of the formulas that the world satisfies, exponentiated and normalized. So that's the general intuition; let's make it a little more precise. A Markov logic network, or MLN for short, is a set of pairs (F, w), where F is a formula in ordinary first-order logic, with the usual syntax, and w is a real number. Together with a set of constants representing objects in the world, a Markov logic network defines a Markov network, an undirected graphical model, with one node for each grounding of each predicate in the MLN. A grounding of a predicate is the result of replacing its variables by constants, so that it talks about specific objects. And there is one feature for each grounding of each formula in the MLN, with the corresponding weight. Now, that's a bit of a mouthful, so let's see a simple example to illustrate these notions. Smoking causes cancer, and we'd like to get people to stop smoking, but it's hard to do that because people are influenced by their friends, and if their friends keep smoking, then people are likely to keep smoking as well. So let's start with a couple of statements in natural language: smoking causes cancer, and friends have similar smoking habits. We can translate these easily into first-order logic as follows: for every x, Smokes(x) implies Cancer(x), and for every x and y, Friends(x, y) implies that Smokes(x) is equivalent to Smokes(y), i.e. they're either both true or both false. Now, this was easy, but the thing that's a little odd here is that the statements in natural language were true, and these statements in logic are actually false, because not everyone who smokes gets cancer, and certainly not all pairs of friends have the same smoking habits.
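The "exponentiated and normalized sum of weights" just described can be sketched on the smallest possible case (the two propositional formulas and their weights below are invented for illustration, not from the talk):

```python
from itertools import product
from math import exp

# Two Boolean ground atoms a, b; two soft formulas with weights:
#   w1 on (a or b), w2 on (a implies b)
w1, w2 = 1.0, 0.5

def score(a, b):
    # sum of the weights of the satisfied formulas, exponentiated
    return exp(w1 * (a or b) + w2 * ((not a) or b))

worlds = list(product([False, True], repeat=2))
Z = sum(score(a, b) for a, b in worlds)                 # normalizer
probs = {(a, b): score(a, b) / Z for a, b in worlds}
```

The world satisfying both formulas gets the highest probability; worlds satisfying fewer formulas are less probable but never impossible.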
Now, we can make these true and useful again by making them statistical statements, by adding weights to these formulas and making them formulas in Markov logic. The formulas that we believe in more will have higher weights; in this case, smoking implies cancer will have the higher weight. So this is a very simple MLN with just two formulas, but what does it really mean? What does it represent in terms of the real world? Well, let's suppose we have a world with only two people in it, Anna and Bob, so two constants to represent them. Now let's follow our recipe for building a Markov logic network. We're going to have one node for every grounding of every predicate. So, for example, we're going to have Smokes(Anna), which is just the Boolean variable that's true if Anna smokes and false if she doesn't. Same thing for Smokes(Bob), and the same for Cancer(Anna) and Cancer(Bob). Now, what about Friends(x, y)? Well, we're going to have Friends(Anna, Bob), another Boolean variable. We're also going to have Friends(Bob, Anna), because friendship is not necessarily symmetric: Bob could be a much better friend to Anna than she is to him, and that actually happens a lot in practice. And we're also going to have the degenerate cases of Friends(Anna, Anna) and Friends(Bob, Bob), which maybe have to do with their self-esteem. So now what we have is a set of Boolean variables, and the MLN defines a probability distribution over them. What is that distribution? Well, the definition of an MLN says that there's going to be a feature for every grounding of each formula. And when you have a feature in a graphical model, what that means is that you have a direct connection between the nodes corresponding to the variables involved in the feature. So, for example, we're going to have an edge between Smokes(Anna) and Cancer(Anna), because there's a formula connecting the two, and the same for Bob.
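The grounding step just described can be sketched as follows (predicate names and arities taken from the running example; this is a toy enumeration, not any particular system's implementation):

```python
from itertools import product

constants = ["Anna", "Bob"]
arities = {"Smokes": 1, "Cancer": 1, "Friends": 2}

# One node per grounding of each predicate: replace each variable by
# every constant. Friends(x, y) yields four groundings, including
# Friends(Anna,Bob) and Friends(Bob,Anna) separately (friendship is
# not assumed symmetric) plus the reflexive Friends(Anna,Anna), etc.
atoms = [f"{pred}({','.join(args)})"
         for pred, arity in arities.items()
         for args in product(constants, repeat=arity)]
```

With two constants this gives 2 + 2 + 4 = 8 Boolean variables, exactly the node set of the ground Markov network in the example.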
What about the second formula, which is slightly more complicated? Now we have three predicates, right? So what we're going to have is a triangular clique connecting all three predicates in that formula, for every possible combination of instances. We're going to have a clique between Friends(Anna, Bob), Smokes(Anna), and Smokes(Bob), and so on. And now what we have is just a graphical model, a Markov network, over these variables. And what is the distribution that it represents? Well, the probability of a world is proportional to the exponential of the sum, over all formulas, of the weight of the formula times the number of groundings of that formula that are true in that world; you exponentiate and normalize. So this becomes an ordinary log-linear or exponential-family model, a Markov network. Notice that the Markov logic network by itself does not represent a distribution; it's more like a program for creating distributions, and that's actually what makes it powerful. An MLN is a template that represents many different distributions, some over very large worlds, some over very small ones, depending on what constants you apply it to. And at this point you might be thinking, well, this is nice, but it's going to be very inefficient, right? There's going to be an exponential number of variables, and this isn't even going to fit in memory. And of course, what we're going to see is how to make this all efficient enough to be usable in practice. But the first thing you can do, which is very easy but goes a very long way, is to just have typed variables and constants. If you have a predicate like WorksFor(x, y), you only need to replace x by people and y by organizations, and that already gets you a long way. Also, in this talk I'm just going to cover the very basics of Markov logic.
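Putting the pieces together, here is a brute-force sketch of the friends-and-smokers distribution over Anna and Bob (the weights 1.5 and 1.1 are invented for illustration, and no real MLN system enumerates worlds like this; the point is only to make the semantics concrete):

```python
from itertools import product
from math import exp

people = ["Anna", "Bob"]
pairs = [(x, y) for x in people for y in people]
w_sc, w_fr = 1.5, 1.1   # illustrative weights for the two formulas

def n_true(smokes, cancer, friends):
    # true groundings of: Smokes(x) => Cancer(x)
    n1 = sum((not smokes[x]) or cancer[x] for x in people)
    # true groundings of: Friends(x,y) => (Smokes(x) <=> Smokes(y))
    n2 = sum((not friends[p]) or (smokes[p[0]] == smokes[p[1]])
             for p in pairs)
    return n1, n2

def worlds():
    # all assignments to the 8 ground atoms
    for bits in product([False, True], repeat=8):
        smokes = dict(zip(people, bits[0:2]))
        cancer = dict(zip(people, bits[2:4]))
        friends = dict(zip(pairs, bits[4:8]))
        yield smokes, cancer, friends

def weight(w):
    n1, n2 = n_true(*w)
    return exp(w_sc * n1 + w_fr * n2)   # exp of weighted true counts

Z = sum(weight(w) for w in worlds())

def prob(event):
    return sum(weight(w) for w in worlds() if event(*w)) / Z

# P(Cancer(Anna) | Smokes(Anna)) vs P(Cancer(Anna) | not Smokes(Anna))
p_smoker = (prob(lambda s, c, f: c["Anna"] and s["Anna"])
            / prob(lambda s, c, f: s["Anna"]))
p_nonsmoker = (prob(lambda s, c, f: c["Anna"] and not s["Anna"])
               / prob(lambda s, c, f: not s["Anna"]))
```

Because the first formula has positive weight, conditioning on Anna smoking raises the probability that she gets cancer, exactly the soft version of the logical implication.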
But bear in mind that the full range of constructs of first-order logic, like functions, existential quantifiers, and infinite and continuous domains, are all possible, even though I won't cover them here. So how does Markov logic relate to first-order logic? Well, actually very nicely: first-order logic is the special case of Markov logic that you get when you let the weights go to infinity. When the weights go to infinity, the constraints become hard, and you're back to the case where violating just one of them makes the world impossible, which is what you want. Of course, the more interesting case is what happens when the weights are finite; that's what we're really here for. And here there's also something interesting you can say, which is that if the knowledge base is satisfiable, meaning you can make all the formulas true at the same time, and all the weights are positive, then each satisfying assignment, meaning an assignment of truth values to the ground predicates that makes all the formulas true, is a mode of the distribution. So the worlds that first-order logic likes are still there in Markov logic; they're just the modes of the probability distribution. And as you move away from the modes, the probability decreases gradually, which is the behavior you'd like to see. Even more importantly, however, the knowledge base does not have to be satisfiable. In logic, if there's a contradiction in the knowledge base, then anything follows, and basically things fall apart, which makes it very hard to build large knowledge bases, or knowledge bases that combine knowledge from multiple sources. In Markov logic there is no problem at all with contradictions: if there's a contradiction, you just add up the weights on either side and you get the resulting probability. What about the statistical side? The other property of Markov logic is that essentially all the types of statistical models that we know and love are special cases of Markov logic.
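One of the simplest such special cases can be sketched directly: with a single object, conjunction formulas of the form F_i(x) ∧ C(x), and evidence on the features, the conditional distribution of the class atom reduces to logistic regression (the weights below are invented; this is a sketch of the standard reduction, not any particular system's code):

```python
from math import exp

w = [1.2, -0.8, 0.5]   # weights on the formulas F_i(x) ∧ C(x)

def p_class(features):
    # Conditioning on the F_i and normalizing over C in {0, 1}:
    # score(C=1) = exp(sum of w_i * f_i), score(C=0) = exp(0) = 1,
    # which is exactly the logistic (sigmoid) model.
    s = sum(wi * fi for wi, fi in zip(w, features))
    return exp(s) / (exp(s) + 1.0)
```

With all features off, the prediction is 0.5; turning on a positively weighted feature pushes it up, a negatively weighted one pushes it down, just as in ordinary logistic regression.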
Things like graphical models, including both Markov networks and Bayesian networks, logistic regression, hidden Markov models, conditional random fields, and many of the deep architectures in use today are simple special cases of Markov logic, and we can often just write them down with a few formulas. The key restriction that you have to impose to go from general Markov logic to the statistical models we know is to assume that all the predicates are zero-arity, meaning they have no arguments, or equivalently that they have one argument. What this means is that you're assuming every object exists in its own separate world, so there are no interactions between objects; technically, the objects are i.i.d. If you think about it, that's a very strong limitation: it doesn't let you model things like social networks, or metabolic networks in biology, or the web. With Markov logic it's very easy to have interactions, as we saw in the example of friends and smokers: you just have relations with more than one argument, and then weights on the corresponding formulas. So that was the representation. It's nice because it's general and it's simple, which is what we wanted, but of course it's not going to be very useful unless we can do inference with it efficiently. So let's look at inference now. What is inference in Markov logic?
Well, in some ways it's no different from inference in any probabilistic model: you want to compute the probability of some query given some evidence, the MLN, and the constants that it's being applied to. Remember, the constants combine with the MLN to produce a Markov network, and when you have evidence, what that does is replace the evidence variables by their fixed values, so they disappear. So at the end of the day, what you want is to compute the probability of a query given a Markov network, and for this you can apply any probabilistic inference algorithm that you like: for example, Markov chain Monte Carlo, belief propagation, and so on. There is a very big problem with this, however, which is that the ground Markov network that you create, when you replace the variables by constants in all possible ways, is going to be too large. In fact, it's going to be exponential in the clause arity: if you have a clause with, say, three variables, and even just a thousand objects in your domain, you already have a billion groundings. So this is not going to scale; at that size it's not even going to fit in memory. So we need to do something else, and what is that something else? Well, this is where the idea of lifted inference comes in. In first-order logic you can prove theorems about infinite domains, like the real numbers, in a finite number of steps, because you do inference at the level of whole sets of objects, as opposed to one object at a time. What we would like to do is bring that to probabilistic reasoning. So let's think about how this might happen. The probability of a query is just the sum, over all the worlds where the query holds, of the probability of the world: the probability that Anna smokes is the sum of the probabilities of all the worlds where Anna smokes. Now, this is very easy, but of course the problem is that there is an exponential number of worlds, so you can't compute it this way. Now, if a world can be
divided into independent sub-worlds, so for example every family is independent of every other family, or every country is independent of every other country, then the probability of the world is just the product, over the sub-worlds, of the probability of each sub-world. This is already a big simplification; it's kind of like what is exploited in graphical models. But now, further, if we can group the sub-worlds into kinds of sub-worlds that all behave the same way, so there are families of one type and families of another type, then the probability of a world is just the product, over the kinds, of the probability of any sub-world of that kind, raised to the number of sub-worlds of the kind. And this is another exponential improvement, so with the combination of these two exponential improvements, inference can actually become tractable. Now, of course, usually in an MLN you don't have independent sub-worlds, because what would be the point? But the key thing is that once you start conditioning on evidence, your graph starts to break up into independent subgraphs, and that's when we can apply this. To go back to our earlier example of friends and smokers, with our two constants Anna and Bob, you have a big, complicated graph. Now, if you condition on smoking, the Smokes(Anna) and Smokes(Bob) nodes disappear, and because they disappear, the edges that go into them disappear as well. So now all we have is a bunch of independent sub-worlds, in this case each containing just one predicate. And further, if Anna and Bob either both smoke or neither does, then there are only two kinds of sub-worlds: Cancer(Anna) and Cancer(Bob) are one kind, and all the Friends predicates are another kind. So instead of computing the probability of the world by multiplying all of the individual ones, you can just, if you look at the lower right corner here, take the probability
of Cancer and square it, and take the probability of Friends and raise it to the fourth power. Notice how quick this has become. Of course, this is what happens on a good day. On a bad day, for example, Anna smokes and Bob doesn't, or vice versa, so we actually have four kinds of sub-worlds, and in the worst case this kind of lifting will actually not buy you anything, just as in first-order logic. What will typically happen is that you're somewhere in between the two. In the best case, the complexity of inference is on the order of the size of the MLN, which is really good because it's small; in the worst case, it's on the order of the size of the ground Markov network, which is bad; but typically you're in between, and there's typically a lot of lifting that you can do. There has been a lot of work in this whole direction, and we have managed to synthesize it all into a procedure that we call probabilistic theorem proving, which is a generalization of both theorem proving, as the name implies, and inference in probabilistic graphical models. The nice thing about PTP, as we call it, is that it encompasses essentially all the major types of inference that people do in AI. For example, in the lower left-hand corner of this diagram you have propositional theorem proving, which of course is a special case, and which is equivalent to satisfiability testing; that's where people started out back in the 60s. If you want to count the number of satisfying solutions of a formula, instead of just deciding whether it's satisfiable, you get the model counting problem. If you add weights to the formulas, you get weighted satisfiability, which is the same as inferring the most probable explanation; this is what is called MPE inference in a graphical model. And if you combine those two, you get weighted model counting, which is what probabilistic inference is. Now, all of these can be lifted to first order, and for example
when you lift propositional theorem proving, you get ordinary first-order theorem proving; when you lift weighted model counting, you get lifted weighted model counting; and if you combine all of these, you get the full probabilistic theorem proving procedure, which is really just lifted weighted model counting. What about learning? All of this is not going to be very useful if we don't have some way to learn the MLNs. You might be able to write down formulas, but they're probably partly incomplete and partly incorrect, and coming up with the weights is something that people are not very good at, so we definitely want to be able to learn these things from data. The data in this case is not going to be a single table, as in traditional machine learning; it's going to be a full relational database, and we're going to learn from that. In this talk I'm going to make the closed-world assumption, which is that a ground predicate that is not in the database is assumed to be false. Sometimes this is not the right assumption; in that case you can use EM versions of the algorithms that I'm going to talk about, but I won't cover that here. So there are two main tasks, as usual in machine learning. There's learning parameters, i.e.
the weights of the formulas, which can be done either generatively or discriminatively, and we're going to look at each of those in turn. And there's learning the structure of the model, which is learning the formulas themselves. Then there are other things, like transfer learning and discovering latent variables and whatnot, which have also been done in Markov logic, but which I won't go into here. So let's start with generative learning, and the good news is that this is actually surprisingly simple. We actually thought, going in, that because we're removing the i.i.d. assumption this was going to be very, very hard. As it turns out, the math is exactly the same: we want to maximize likelihood, we can maximize it by gradient ascent, and it's a convex problem, so there aren't even any local optima. And if you look at the expression for the gradient, it's actually very intuitive: the partial derivative of the log-likelihood with respect to a weight is just the difference between the number of times that formula is true in the data and the expected number of times it's true according to the model. So if the formula is true more often than the model says, its weight needs to go up; if it's true less often than the model says, its weight needs to go down; and once they all match, we've reached the maximum-likelihood solution. So this is all very straightforward, but there's one very big snag here, which is that to compute the expected number of true groundings of a formula according to the model, we have to do inference, and of course inference in general is intractable, and we have to do this at each step. So most of the time this is not going to be feasible. What can we do instead?
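The gradient just described can be exercised on a one-formula toy model where the model expectation is computable in closed form (the counts and learning rate below are invented; real MLN learners face exactly the intractable-expectation problem just mentioned):

```python
from math import exp

# Gradient of the log-likelihood w.r.t. a weight:
#   n_true(data) - E_model[n_true]
# Toy model: one Boolean atom per example, formula f = the atom,
# N i.i.d. examples, f true in 70 of them.
N, n_data = 100, 70

w = 0.0
for _ in range(2000):
    p_true = exp(w) / (exp(w) + 1.0)   # model's P(f true) per example
    grad = n_data - N * p_true          # data count minus expected count
    w += 0.1 * grad / N                 # gradient ascent step
```

At convergence the expected count matches the data count, so the fitted model reproduces the empirical frequency of the formula.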
Well, we can use a strategy that was first proposed for Markov networks by Julian Besag back in the 70s, which is to use something that is similar to the likelihood but tractable to compute. What he proposed, and what is used for MLNs as well today, is what is called pseudo-likelihood. The pseudo-likelihood is just the product, over all variables, of the probability of the variable given its neighbors in the data. This is tractable, because computing the probability of a variable given its so-called Markov blanket is a simple operation, and it's also a consistent estimator, meaning that as the amount of data you're learning from grows, the estimates of these conditional probabilities go to their true values. It's widely used in areas like vision and spatial statistics, but of course it has some limitations. The main one is that it tends to work well only for short chains of inference, not surprisingly, because in pseudo-likelihood, inference is one step away, from a variable to its neighbors. What this means is that when you have longer chains of inference, pseudo-likelihood will often give you very poor results. So what can you do in that case? Well, you can do discriminative learning, which is actually usually the best thing to do, because it tends to give better results throughout machine learning, and here as well. The idea in discriminative learning is that we know a priori which variables Y we're going to be querying and which variables X are going to be evidence, and then all we do is optimize the conditional likelihood of Y given X, as opposed to the joint likelihood of all the variables, which is a much harder problem. And now everything works pretty much the same way. The nice thing is that as you condition on evidence, a lot of the modes of the distribution tend to disappear. What makes probabilistic inference hard is that there's usually an exponentially large
number of modes, and they're widely separated. But once you start conditioning on evidence, they start to disappear, and often it winds up being the case that the answer can be well approximated by just the probability at the peak, at the single biggest mode. So instead of having to do full probabilistic inference, we can just find the most likely state, which instead of being #P-complete is only NP-complete, and use those counts to give our answer. More concretely, how can you do this? Well, there are a number of ways, but the simplest and the oldest is to use an algorithm that was first proposed by Michael Collins for training hidden Markov models: the structured perceptron. The structured perceptron works as follows. It assumes that the network is a linear chain, because that's of course what an HMM is. It starts out with all the weights at zero, and then it does the following some number of times: it finds the most likely state of the hidden variables, for example, what words you're speaking given the observations, for example, the sounds being heard; this is of course a typical application, and there's a famous algorithm for doing this, the Viterbi algorithm. So you use the Viterbi algorithm to infer the most likely state of Y, and then you do a gradient step, where the gradient is the difference between the count of each feature in the data and its count in this MAP state of Y that we inferred. We multiply that by the learning rate, add it to the weights, and that's one step of the gradient update. At the end, we return the average of the weights over all steps. Why the average and not the last result? Because you tend to generalize better that way; this is both seen empirically and there are theorems to that effect. Now, what we want to do is generalize the structured perceptron to Markov logic, and what we actually need to do to make that happen is very simple: just allow the network to be an arbitrary graph, and
now instead of Viterbi we use probabilistic theorem proving: at each step we infer the most likely state of Y using PTP, and the rest of the algorithm works as before. This is quite simple, it's very effective, and it's probably the most widely used algorithm for learning MLN weights, though there are some more sophisticated ones that tend to be faster. What about learning structure? We'd like to be able to learn the formulas themselves from data, not just learn weights for predefined formulas. If you think about it, this problem generalizes both feature induction in Markov networks and inductive logic programming, which deals with inducing logic programs from data, so there's a lot that we can draw on here. There are also some significant differences. One is that in inductive logic programming you only learn Horn clauses, because that's what a Prolog program is made of, and the goal here is to learn arbitrary clauses, so you need something a bit more general. Also, in inductive logic programming you typically use some kind of accuracy or information-gain measure as your evaluation function, but here we're learning a probabilistic model, so it should be a likelihood. And if you think about how this is going to work, every time we modify a formula, so we have a new candidate formula, we need to relearn the weights, and this is potentially a big problem, because relearning the weights is itself not very fast, and we're going to have to do it for potentially millions of candidates. So you'd think this would be a big bottleneck. Surprisingly, it actually turns out not to be, if you do a couple of simple things. One is to start the new weight optimization at the old weights, because when you change one formula, most of the weights don't change. And if you use a fast optimization algorithm of the quasi-Newton type, like L-BFGS, you typically converge in just an iteration or two. Surprisingly, the bottleneck turns out to be something else: it's counting clause
groundings, just counting the statistics that you need. This actually turns out to be the bottleneck, because doing that, surprisingly, is itself a #P-complete problem, and while for weight learning we only had to do it once, at the beginning, now we have to do it for every candidate formula. Fortunately, here there is also a simple solution, which is to subsample: you don't actually need to count every single grounding of the formula Friends(x, y) and so on; you can subsample, say, a thousand groundings, and from those you can infer, within some bounds, what the actual count would be, which is enough to do the learning. Using these various techniques, learning Markov logic networks is about as fast as traditional inductive logic programming, or learning features in Markov networks, which is not super fast, but it's as fast as the things that we unify, and later we will see how to do things that are even more efficient. To be more concrete, when you do structure learning there are a number of choices that you have to make. The first one is the choice of initial state. If you want to learn purely from data, you can start with unit clauses, so isolated predicates, and then start adding things to them; or you can start with a hand-coded knowledge base that somebody wrote down, and then revise it by learning. Often this is actually the ideal use of Markov logic: you put in your knowledge, and then you revise that knowledge using structure learning. Then you need search operators. The obvious ones are, of course, adding and removing a literal from a formula, and also flipping the sign of a literal, because people often write down good formulas whose implications are in the wrong direction. We also need an evaluation function. The natural one to use is pseudo-likelihood, because it's the most efficient, but we need to add to it some kind of prior to combat overfitting, and the obvious thing to do here is the same thing that people do in
graphical models, which is to have a prior that penalizes divergence from the initial network. Every time you add or remove a literal you pay a penalty, and you only make the change if the gain in likelihood exceeds that cost; very simple, but very effective. And then, finally, you need to choose what kind of search to use. People have tried all kinds of things: beam search, which comes more from the ILP side, shortest-first search, which comes more from the Markov network side, and a whole bunch of others, and in some ways it's a matter of taste which one you use. Okay, so we have Markov logic as the unified representation, and we have some fairly efficient inference and learning algorithms for it. Are we done? Actually, no, we're not done by any means, because this is still not industrial strength. To have something that can be deployed in industry, it has to be very scalable and it has to be very reliable; anything that requires approximate inference is a little bit dubious to use in production. So what we need to do is what people have done in classical AI, which is to come up with tractable subsets of Markov logic, such that inference in them will always be efficient and exact, and then we can really deploy this and be confident that it will work as expected. And that's exactly what we've done: we've developed something called Tractable Markov Logic, which, as the name implies, is a tractable subset of Markov logic. Syntactically, this language is very similar to the classic knowledge representation languages: basically, it has objects and subparts and class hierarchies, and it's by exploiting them that inference is made efficient. In Tractable Markov Logic there are three types of weighted rules, plus facts. Subclass rules, as the name implies, are rules of the form "a family is a social unit"; a subclass fact is something of the form "the Smiths are a family". Then there are subpart rules, things like "every family has two adults"; notice that not every family
has two adults, but that's okay for us, because at the end of the day these rules are going to have weights and are just going to be statistical regularities. A subpart fact is something like "the first adult in the Smiths is Anna". And then, finally, there are relation rules, like "in every family, every adult is a parent of every child": again, not always true, but true a lot of the time, and we can learn a weight for it. And finally there are just simple facts, like "Anna and Bob are married". The remarkable thing is that inference in this language is always linear in the number of rules and objects. It's not just tractable, it's linear, and in fact in practice, when you actually evaluate a query, it's sublinear: linear would be looking at all the rules, but typically you only have to look at a small subset of them. So what happens at the end of the day is that in Tractable Markov Logic you can actually do interactive querying: answering a query typically takes a fraction of a second, and it gives you probabilistic answers that are exact, not approximate. Why does that happen? Well, the reason is that, by design, the structure of the knowledge base mimics the structure of the computation that has to happen when you're computing probabilities. In particular, the probability of an object given its class is the product, over the subparts of the object, of the probability of the subpart given the class; this is one thing that makes things tractable. The other is that the probability of anything given a class is a mixture model: it's the sum, over the subclasses, of the probability of that thing given the subclass, times the probability of the subclass. So for every object in the knowledge base there's going to be a product node in the resulting sum-product network of the computation, and for every class there's going to be a sum node, summing over the subclasses. And as an
application of this, we have developed what is, to our knowledge, by far the largest probabilistic knowledge base ever built. We did it by extracting information from classic web sources like DBpedia, YAGO, and NELL and merging them all together, which of course is something that Markov logic is very good for, and at the end of it we have a knowledge base with millions of objects and billions of parameters; but, as I mentioned, it can return exact answers to queries in fractions of a second. This, of course, is just one of many applications. Markov logic has been applied widely in natural language processing and in information extraction, which are obvious applications, but also in things like link prediction and social network analysis, in robotics and vision, and in computational biology. Cyc, the famous large knowledge base, has had parts of it made probabilistic using Markov logic. In personal assistants, the famous DARPA CALO project that Siri grew out of used Markov logic as its core representation and inference engine. And there are many others. Markov logic has been widely researched in academia; in fact, this body of work is one of the most widely cited bodies of work in AI of recent times. It's also been used widely in industry, not just by the large tech companies like Google and Facebook, but by many others as well, and research and progress continue. So, to summarize, I would say that at this point we have largely succeeded in unifying logic and probability using Markov logic. There are of course many languages for doing this, but Markov logic is the most general and simplest, and also by far the most widely used and the most developed. What Markov logic does is just assign weights to first-order formulas and then treat those as the features and weights of a probabilistic model. We and many others have developed powerful inference and learning algorithms for Markov logic, and I believe that we now have a good foundation for a modern AI where you don't have to start with
logic and then deal with uncertainty using hacks, or start with graphical models and deal with relations using hacks, which is what people did before. There's of course still much more to do. If you're interested in finding out more about Markov logic, I recommend you take a look at the article that came out last year in the Communications of the ACM. There's also a book that's a little bit older; there are things we talked about here that aren't covered in the book, but it still covers a lot of things that the article doesn't. There's also a nice website, Alchemy, whose URL is right here, that contains the Alchemy open-source implementation, but also pointers to other open-source implementations, as well as papers, pointers to the literature, MLNs, data sets that you can learn from, and so on. Thank you, and I will take questions now.

So, thanks so much, Pedro. Let's give some minutes, well, seconds, to get some questions. People are a little bit shy today, so maybe there are no questions, but let's see. No questions. Since there are no questions, thank you, Pedro, for being with us today. Take care and keep in touch. Maybe people are shy now, but maybe later they'll drop you a line or something like that; that's the idea, to keep in touch and continue the conversation. So thank you, Pedro. Thank you.