Okay, welcome. It's good to be here. My presentation today is on Bayesian networks, and specifically some tools for using them in Ruby. As a day job, I crimp cables; if you were here earlier, you would have seen me doing that. I also like to work with Ruby. First, I'd like to set the stage by talking about some of the problems Bayesian networks solve and how they work. Much early research in the field of artificial intelligence focused on expert systems and first-order logic, although nowadays systems are much more sophisticated. Most of these fledgling systems attempted to solve problems by storing numerous facts about their area of expertise and deducing answers using predicate logic. Here's an example of first-order logic, also called predicate logic. These sentences state: for all x, if x is a person, then x is mortal; Socrates is a person; therefore, Socrates is mortal. As AI became commercialized in the early 80s, the sector initially had great success, but the so-called AI bubble burst toward the end of the decade as companies failed to deliver on their extravagant promises. Perhaps part of the appeal of this emphasis on logic was that it relied on things computers are really good at, like storing a lot of information and quickly executing many operations. But it has some profound limitations. Consider, for example, this rule: for all patients p, if p is experiencing a toothache, then p has the disease called cavity. The problem is that this rule is wrong. Not all patients with toothaches have cavities. You could try to fix this by changing the rule to account for all possible causes of toothaches, such as gingivitis, an abscess, maybe the patient got punched in the mouth or something. Unfortunately, the list of possible causes is pretty much unlimited. You could also try reversing the causal order to say that cavities cause toothaches, but this isn't true either, because some cavities don't cause toothaches.
The problems with logical agents are these: it's beyond our capability to make a complete list of all the possible antecedents and consequents needed to ensure an exceptionless rule, and it would be too difficult to even use such a rule if we had one. Even in cases that aren't beyond our capability, there's a point of diminishing returns after which it's just not worth spending any more time. In other words, just solving the problem isn't enough; we also have to solve it in a reasonable amount of time, and I would say the lifetime of the universe probably wouldn't be a reasonable amount of time. There are no useful areas of knowledge that have been completely mapped out by a grand unified theory of everything. In fact, Gödel's incompleteness theorem tells us that in any useful, consistent formal system, such as first-order logic, there will always be statements that are neither provably true nor provably false; in other words, a useful system can never be both complete and consistent. And even if it were possible to know all the rules, we might be uncertain about a particular situation because we don't have the time to run all the necessary tests, or something like that. It's interesting to consider, though, that people deal with uncertainty all the time. Never being able to know anything with a likelihood of 100%, we are forced to take leaps of faith every day; in fact, just getting out of bed in the morning requires us to make a number of assumptions. Whether we realize it or not, we employ these strategies in every decision we make. Our knowledge can at best provide only a degree of belief in the possible outcomes for a given decision. The main tool that experts use to quantify and express these degrees of belief is called probability theory. Stuart Russell and Peter Norvig, in their monumental text on artificial intelligence, which I'd highly recommend, articulate this well.
They say, "probability provides a way of summarizing the uncertainty that comes from our laziness and ignorance." I probably wouldn't be inclined to be quite so hard on ourselves as to call it laziness, since there are often good reasons we don't or can't obsessively optimize our decisions. But let's assume we're good Calvinists and we enjoy self-flagellation: laziness it is. I hope you'll bear with me while I take a few minutes to go over some of the basic concepts in probability theory that are used in Bayesian networks; I think it'll help lay a foundation for understanding them a little better. First, propositions are statements about degrees of belief, or assertions that such and such is the case. This proposition says that our level of belief that cavity is false is .9, or 90 percent. In this proposition, cavity is called a random variable: some aspect of the problem space that can take on multiple states. Random variables can be boolean, meaning they can only be set to true or false; discrete, meaning they can take on many different distinct states (boolean random variables are a special case of discrete ones); or continuous, meaning they can take on any value from the real numbers. Each random variable has a number of different states it can take on, and the list of all possible states for a random variable is called its domain. The domain of the variable cavity is true and false, as you can see here. There are a few fundamental pillars, or axioms, of probability upon which everything else rests. First, every probability must be between zero (something that is never true) and one (something that is always true). Second, which I basically just said, is that anything that's always true has a probability of one, and anything that's always false has a probability of zero.
And third, the probability that either one proposition or another is true is equal to the probability of the first, plus the probability of the second, minus the probability that they're both true together. To understand this, you can think of it like this: we take the probability of one event and add it to the probability of the other event to get the probability that either event happened. However, we've counted the intersection between the two events, the times when they both happen together, twice, so we have to subtract the intersection once to get the right probability. We can also derive a few other truths from these axioms. Clearly, A or not A is a certain event: one or the other has to be true, but both can't be true at once. A shorter way of saying this is that the events are mutually exclusive. From axiom two, which says that a certain event has probability one, we learn that the probability of A or not A is equal to one. Since A and not A are mutually exclusive events, we can use axiom three without having to worry about the possibility that both might be true simultaneously, so it reduces to: the probability of A plus the probability of not A equals the probability of A or not A, which we have just learned is equal to one. From this, we also learn that subtracting any probability from one gives the inverse of that probability, the probability that the proposition is not true. Bear with me for a few more minutes; this probability stuff won't take too much longer. The prior, or unconditional, probability of something is the degree of belief given to it in the absence of any other information, or based on information that is outside the problem space. This proposition says that our prior belief about the likelihood of a cavity, before we gain any other information, is 10%.
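These axioms are easy to check numerically. Here's a quick sketch in Ruby; the toy joint distribution is mine, purely for illustration:

```ruby
# A toy joint distribution over two boolean variables A and B.
# Each entry is P(A = a AND B = b); the four entries sum to 1.
joint = {
  [true,  true]  => 0.04,
  [true,  false] => 0.06,
  [false, true]  => 0.36,
  [false, false] => 0.54
}

p_a       = joint.select { |(a, _), _| a }.values.sum  # P(A)       = 0.10
p_b       = joint.select { |(_, b), _| b }.values.sum  # P(B)       = 0.40
p_a_and_b = joint[[true, true]]                        # P(A and B) = 0.04
p_a_or_b  = joint.reject { |(a, b), _| !a && !b }.values.sum

# Axiom three: P(A or B) = P(A) + P(B) - P(A and B), 0.46 either way.
# And the derived rule: P(not A) = 1 - P(A).
p_not_a = 1.0 - p_a
```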
Notice that we can drop the "equals true" and just say cavity inside the parentheses when we're simply talking about the case where it's true. This raises an interesting, sort of "tastes great, less filling" debate that's been going on for a long time among statisticians: where do probabilities come from? The frequentist approach says they come from experience about the world; that is, we count up all the times we found something to be true and divide by the total number of samples to get its probability. The Bayesian approach expands probability theory by allowing us to reason about beliefs under conditions of uncertainty. It recognizes that there are outcomes we have never experienced that still have some probability. For example, what are the odds that Barack Obama will win the 2008 presidential election? We can express a belief about this event even though it has never happened before. Furthermore, even events that have occurred before may not have occurred frequently enough for us to accurately determine their real probability. For this reason, probabilities are usually expressed as degrees of belief, all of them influenced by something called a subjective prior whose effect diminishes the more observations you make. Another interesting aside is that some cosmologists and philosophers have observed that the so-called laws of physics themselves are probabilistic, and that such things as time and the speed of light are not actually constant throughout the universe. Pragmatist philosophers like William James and Charles Peirce believed that truth is what works, and that physical laws are like well-worn pathways that only become so after a long Darwinian sifting process, similar to the notion of a subjective prior that gets modified as we gain more experience. Sometimes we want to talk about the probabilities of all possible values of a variable, and that's called a probability distribution.
And this example above shows the prior probability distribution of the variable weather. To shorten the notation, we can use a bold, non-italicized P to indicate that what follows will be a vector of probabilities for all the states of the given variable, rather than a single probability. And sometimes we want to talk about the combined probabilities of all the variables in a system; a joint probability distribution that covers this complete set is called the full joint probability distribution. This is the full joint probability distribution of weather and cavity. The probability of each combination of states is in the rightmost column, and I've also included the prior probability of each state in parentheses next to its name, just so you can compare them with the combined probabilities, commonly known as posterior or conditional probabilities, which I'll explain in a minute. A typical full joint distribution will not list the priors next to the variables like I did. Notice that the sum of all the probabilities is one; in a logical world, the sum of all possible atomic, mutually exclusive probabilities has to be one. Notice also that in this table, the probability of sunny and cavity occurring together is equal to the probability of sunny times the prior probability of cavity. Anytime that happens, you know the variables are independent of one another; in this case, the behavior of the weather is not dependent on whether or not you have a cavity. Notice too that the probability of sunny and cavity occurring together plus the probability of sunny and not cavity occurring together is equal to the probability of sunny. That summing process is called marginalization: given the full joint probability distribution of a set of variables, we can determine the probability of an individual variable by summing up all the cases where it holds. This process got its name from the way actuaries would write the sums of observed frequencies in the margins of insurance tables.
So in this case we could calculate the probability of weather equals sunny by summing over all the states of cavity where it holds. When we calculate the probability of variable a by summing over all the applicable probabilities of variable b like this, we say that b is marginalized out of the probability of a and b. Here we have a generalized formula for the marginal probability distribution of a from the joint probability distribution of a and b. As we saw in the table, you just sum the probability of a given each state of b; notice the vector notation refers to a whole distribution, not individual probabilities. Another thing to mention is that the comma here is just another way of saying logical and. I should mention, though, that the complexity of inference using the full joint distribution is exponential in both time and space. The probability table for a domain described by n boolean variables is on the order of two to the n, the time it takes to process the table is the same, and these bounds only get worse with multi-state discrete variables. For these reasons, it's just not practical to use the full joint probability distribution for inference, but it helps us understand some of the basic operations used in Bayesian networks. Once we've obtained some evidence about a previously unknown random variable, prior probabilities no longer apply; instead we use a conditional, or posterior, probability. The notation for expressing a conditional probability is one proposition, then a bar, then another, and you read it as "the probability of a given b." You can also think of a prior probability as a special type of conditional probability that's conditioned on no evidence. Conditional probabilities can be defined in terms of unconditional probabilities, and that's what this defining equation shows right here.
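Marginalization is easy to see in code. A small sketch, assuming (as the table implies) that weather and cavity are independent; the weather priors here are illustrative numbers, not the slide's:

```ruby
# Build the full joint distribution of weather and cavity from the priors.
# Because the two variables are independent here, each joint entry is
# just the product of the two priors.
weather_priors = { sunny: 0.72, rainy: 0.1, cloudy: 0.08, snowy: 0.1 }
cavity_prior   = 0.1

joint = {}
weather_priors.each do |w, pw|
  joint[[w, true]]  = pw * cavity_prior
  joint[[w, false]] = pw * (1.0 - cavity_prior)
end

# Marginalize cavity out:
# P(sunny) = P(sunny, cavity) + P(sunny, not cavity)
p_sunny = joint[[:sunny, true]] + joint[[:sunny, false]]
# p_sunny recovers the prior for sunny, 0.72
```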
The probability of a given b is equal to the probability of a and b together, divided by the probability of b. I did a quick little example so you can see how this actually works. Let a denote the event that a puppy in a kennel is female, and let b denote the event that a puppy is a Labrador. In a kennel of 100 puppies, suppose 40 are Labradors and that 10 of those are female. That gives us the probability of a and b, puppies who are both Labradors and female, as 10 over 100. The probability of b, puppies who are Labradors, is 40 out of 100, and the probability of a given b is 10 over 40; that is, 10 of the 40 Labradors are female. So we can see that plugging all these things in gives us a valid equality. We can also derive something called the product rule from this equation. When I do equations I like to show every step, because I'm not that good at math, so here it goes: first you multiply both sides by the probability of b, then you cancel the terms, then you flip the equality, then you change the order of the factors on the right-hand side, and you've got the product rule. You can think of this as saying that for a and b to be true, we need b to be true, but we also need a to be true given b. You can also turn it around and say that the probability of a and b is equal to the probability of b given a times the probability of a. We can put all of these things together to come up with something called Bayes' rule: the equation for conditional probability and the product rule combine to produce it, and this simple equation is what underlies all modern AI systems for probabilistic inference. Bayes' rule shows us how to update our belief about a hypothesis in light of new evidence: our posterior belief, the probability of a given b, is calculated by multiplying our prior belief by the likelihood that b will occur if a is true. We have the Reverend Thomas Bayes to thank for this equation.
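The kennel example translates directly into a few lines of Ruby:

```ruby
# 100 puppies: 40 are Labradors, and 10 of those Labradors are female.
total       = 100.0
p_a_and_b   = 10 / total        # P(female AND Labrador) = 0.10
p_b         = 40 / total        # P(Labrador)            = 0.40
p_a_given_b = p_a_and_b / p_b   # definition of conditional probability

# p_a_given_b is 0.25: 10 of the 40 Labradors are female.
# The product rule runs the definition in reverse:
#   P(A and B) = P(A | B) * P(B)
product = p_a_given_b * p_b     # recovers P(A and B) = 0.10
```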
It was published posthumously in the Philosophical Transactions of the Royal Society back in 1763, through the recommendation of a friend of Bayes who knew what a brilliant man he was and that his modesty had kept him from publishing much of his work. Not extremely well known in his own time, his work is now the foundation of most probability theory. I should add that there are some anachronisms in this portrait that have cast doubt on its authenticity, but I just couldn't resist throwing in a picture of some old guy, even if this isn't really what Bayes looked like. The power of Bayes' rule is that it is often difficult to directly compute the probability of a given b, but we might have information about the probability of b given a; it lets us compute a given b in terms of b given a. I'll show a quick example of how that might work. Suppose we're interested in diagnosing cancer in patients who visit a chest clinic. Let a be the event that a person visiting the clinic actually has cancer, and b the event that a person visiting the clinic is a smoker. We know the prior probability of a from past data, because we know that 10 percent of all the patients visiting the clinic end up having cancer. We want to compute the probability of a given b; that is, we want to know, if they do smoke, how much more likely they are to have cancer. It's hard to find this out directly, but we know the probability of b, because this information is also collected from incoming patients. And we can get the probability of b given a by checking our records and determining the percentage of patients diagnosed with cancer who were also smokers; suppose that percentage was 80 percent. We can now use Bayes' rule to compute the probability of a given b. So, in light of the evidence that a patient is a smoker, we revise our prior probability from point one to point one six.
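Plugging the clinic numbers into Bayes' rule looks like this. The talk doesn't state P(b) explicitly, but the 0.16 result implies the smoking rate among incoming patients is 50 percent, so that's assumed here:

```ruby
# Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B)
p_cancer              = 0.10  # prior: 10% of patients end up having cancer
p_smoker_given_cancer = 0.80  # 80% of cancer patients were smokers
p_smoker              = 0.50  # smoking rate among incoming patients (assumed)

p_cancer_given_smoker = p_smoker_given_cancer * p_cancer / p_smoker
# Evidence that the patient smokes revises our belief from 0.10 to 0.16.
```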
This is a significant increase, but it also shows that the low overall probability of cancer has the greatest influence. Often the probability of b is thought of as a normalizing constant, so we pull it out; we can actually recover the probability of b through marginalization, so we separate it from the rest of the inference process and apply it at the end. When we do that, we have this general form of Bayes' rule with normalization. One more tool we really need for Bayesian inference is the chain rule. You'll recall that we derived the product rule from the formula for conditional probability. We can expand the product rule to more variables like this, and here we have the generic version for an arbitrary number of variables. This is called the chain rule, and it's important for Bayesian networks because it allows us to calculate the full joint probability distribution of any domain of random variables: you start with something that isn't conditioned on anything else, like the probability of a sub n, use that to calculate the next probability, and so on, until you've built up the whole full joint probability distribution. By itself this rule doesn't reduce the complexity of inference, but it's a tool we'll combine with something called conditional independence. Earlier we talked about independence: a is independent of b if its posterior given b is the same as its prior; in other words, b has no effect on the probability of a. Another very important concept for Bayesian networks is the notion of conditional independence. If the probability of a given b and c is the same as the probability of a given c, then we say that a and b are conditionally independent given c, and understanding this concept was a crucial step in the development of Bayesian network theory.
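The chain rule can be verified against any explicit joint distribution. A sketch with three boolean variables and made-up numbers:

```ruby
# An arbitrary joint distribution over booleans a, b, c (sums to 1).
joint = {
  [true,  true,  true]  => 0.08, [true,  true,  false] => 0.02,
  [true,  false, true]  => 0.20, [true,  false, false] => 0.10,
  [false, true,  true]  => 0.12, [false, true,  false] => 0.18,
  [false, false, true]  => 0.10, [false, false, false] => 0.20
}

# Helper: total probability of the entries matching a condition.
def prob(joint)
  joint.sum { |states, pr| yield(*states) ? pr : 0.0 }
end

p_c       = prob(joint) { |_a, _b, c| c }      # P(c)
p_b_and_c = prob(joint) { |_a, b, c| b && c }  # P(b, c)

p_b_given_c  = p_b_and_c / p_c                 # P(b | c)
p_a_given_bc = joint[[true, true, true]] / p_b_and_c  # P(a | b, c)

# Chain rule: P(a, b, c) = P(a | b, c) * P(b | c) * P(c)
chain = p_a_given_bc * p_b_given_c * p_c       # recovers 0.08
```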
Before researchers understood this concept, they didn't know how to reduce the complexity of the operation, and they didn't see any reason for pursuing it. To understand conditional independence a little better, let's take a few examples. In our first example, Alice tosses one coin and Bob tosses another. Event a is the result of Alice's toss, and b is the result of Bob's toss. It's pretty obvious that a and b are independent: the outcome of one is not going to affect the outcome of the other. Now let's suppose that Alice and Bob toss the same coin, and assume there is a possibility that the coin is biased towards one side. In this case a and b are not independent: observing that b is heads or tails causes us to increase our belief that a's result will be the same, because we have at least a tiny prior belief that the coin could be biased. In this example, variables a and b are both dependent on a separate variable c, the "coin is biased" event. Even though a and b are not independent, it turns out that once we know the value of c, any evidence about b can't change our belief about a. Specifically, the probability of a given c is equal to the probability of a given b and c. In a case like this, we say that a and b are conditionally independent given c. In a lot of real-life situations, variables that seem to be independent are actually only independent conditional on some other variable. Let's suppose that Alice and Bob live on opposite sides of the city and come to work by completely different means: Alice takes the train and Bob drives his car. A will represent the event that Alice comes to work late, and b will be the event that Bob comes to work late. It would be tempting to assume that a and b are totally independent.
However, even if Alice and Bob lived in totally different countries, there could be factors that affect them both, such as an international fuel shortage or, you know, a nuclear holocaust or something like that. But when we're developing our model of uncertainty, we usually don't take crazy things like that into account. In this case, for example, both a and b might be affected by a train strike, which we'll call c. Clearly the probability of a will increase if c is true, but the probability of b will also increase, because of the extra traffic on the roads. So you thought these things were independent, but you realize they might not be. The key is that once we actually know the state of c, a and b are independent of one another. That doesn't mean that knowing c doesn't affect your belief about a and b; it just means that once c is known, learning a won't change your belief about b, and vice versa. And this is an actual example of a Bayesian network. It's taken a long time to get here, sorry about that. A Bayesian network is a directed acyclic graph whose nodes represent random variables and whose edges represent influence, or causality. Each node has a state table with the conditional probabilities of its own states given the states of its parents. Conditional independence means the nodes only have to keep track of conditional probabilities for the nodes that directly influence them, that is, their immediate parents. This means that a Bayesian network is a complete and non-redundant representation of the domain. It also means that a Bayesian network is much more compact than a full joint probability distribution. The full joint distribution for a Bayesian network can be generated using this formula here, where all you need to do is look at each node's states and the probability of the node given its parents' states.
On the left you see a notation that represents a certain arrangement of variables in the network, and on the right you have the way of calculating the probability of that arrangement. This compactness is an example of a characteristic usually found in so-called locally structured, or sparse, systems: no matter how big the system gets, each component only has to communicate with a few other components, which really reduces the complexity of operations in the system. Local or sparse structure like that is usually associated with a linear growth curve rather than an exponential one. With Bayesian networks, it's reasonable to assume that in most domains each random variable is directly influenced by at most k others, for some constant k. If we assume we have n boolean variables, just for simplicity's sake, then the amount of information needed to specify each conditional probability table will be at most 2 to the k numbers, and the complete network will be specifiable by just n times 2 to the k numbers. In contrast, the full joint probability distribution contains 2 to the n. Just to give you a concrete example of how big a difference this is: suppose we have 30 nodes, each with five parents, so k equals 5. The Bayesian network requires 960 numbers, but the full joint distribution requires over a billion. This compactness also makes it possible for Bayesian networks to handle complex domains with many variables. Let's look at a simple example of a Bayesian network. In this domain we have four variables: one, cloudy: is it cloudy outside; two, sprinkler: was the sprinkler turned on; three, rain: did it rain; and four, grass wet: is the grass wet. We can see that cloudy has no parents, so for our purposes the things that influence whether or not it's cloudy outside are outside of our problem space; we just have a prior belief about it.
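That arithmetic checks out:

```ruby
# n boolean nodes, each with at most k parents:
# the network needs n * 2**k numbers, the full joint needs 2**n.
n = 30
k = 5
network_numbers    = n * 2**k   # 30 * 32 = 960
full_joint_numbers = 2**n       # 1_073_741_824, over a billion
```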
Cloudy has a prior probability of 0.5, or 50 percent: half the time we expect it to be cloudy and the other half not. Sprinkler and rain are influenced by whether or not it's cloudy. For rain that's pretty obvious: cloudy skies increase the chances of rain. But they also increase the chances that the custodian will disable the sprinklers for the day, so cloudy has an influence on sprinkler as well. The probability that the sprinklers will be enabled given cloudy is 0.1, and the probability that they will be enabled given not cloudy is 0.5. The probability that it will rain given cloudy is 0.8, and given not cloudy, 0.2. Whether or not the grass is wet is influenced by both the sprinklers and the rain. Since it has two parents, grass wet has to have conditional probabilities for all the combinations of its own states and its parents'. If both the sprinklers and the rain are true, then it's highly likely the grass will be wet: 0.99, or 99 percent. If one or the other is true, then the probability is 0.9; otherwise there's no chance it'll be wet. Notice that the tables here show only the rows for each variable's true state, to save space on the slide, but internally the variables have to store the probabilities for the combinations of all of their own states and those of their parents. That would mean two numbers for cloudy, four each for sprinkler and rain, and eight for grass wet. Inference in Bayesian networks occurs by passing evidence about variable observations to the network and then querying the network for the posterior probabilities of unknown variables. So here, let's query the network where we've only observed that the grass is wet, and we want to know the probabilities of the other variables' states; we can query the network for their posterior probabilities.
We can see that the probability of cloudy has increased slightly given the evidence, and we can also see that the probability of sprinkler is a lot lower than that of rain, because rain is a more likely explanation for the wetness than the sprinklers, based on the conditional probability tables we have there. Now let's see what we get when we observe both that the sprinklers were turned on and that the grass is wet. Now the fact that the sprinklers were turned on becomes the more likely explanation for the grass being wet, and the probabilities of both cloudy and rain drop precipitously. By the way, you can also query a network with no evidence at all and see the likelihood of the states based on just their priors and their conditionals. A lot of different algorithms have been developed for doing inference on Bayesian networks, and they use combinations of the formulas we looked at earlier, such as marginalization, Bayes' rule, and the chain rule. They generally fall into two categories. First there's exact inference. Although many networks have characteristics that allow them to be solved in polynomial time, the general problem of exact inference on Bayesian networks is NP-hard. For those of you who might be unfamiliar with complexity theory, this basically means that the time or space required to solve the problem grows exponentially with the problem size, and you could potentially be waiting for something like the lifetime of the universe for an algorithm to complete, given a problem of sufficient size. But in practice the algorithms are good enough for fairly complex Bayesian networks to be dealt with efficiently on today's computers. Not always, but a lot of the time. And then there's the other general classification of inference methods, which is approximate inference.
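For a network this small, those posterior queries can be reproduced exactly by brute-force enumeration of the full joint, using the product formula from before. This is plain Ruby rather than SBN; the CPT numbers are the ones from the slides:

```ruby
# The grass network's CPTs (true-state rows, from the slides).
P_C = 0.5                                     # P(cloudy)
P_S = { true => 0.1, false => 0.5 }           # P(sprinkler | cloudy)
P_R = { true => 0.8, false => 0.2 }           # P(rain | cloudy)
P_W = { [true,  true]  => 0.99,               # P(wet | sprinkler, rain)
        [true,  false] => 0.9,
        [false, true]  => 0.9,
        [false, false] => 0.0 }

# Full joint via the product of each node's conditional probability.
def joint(c, s, r, w)
  (c ? P_C : 1 - P_C) *
    (s ? P_S[c] : 1 - P_S[c]) *
    (r ? P_R[c] : 1 - P_R[c]) *
    (w ? P_W[[s, r]] : 1 - P_W[[s, r]])
end

BOOLS = [true, false]

# P(rain | grass wet) = P(rain, wet) / P(wet), marginalizing the rest out.
p_wet = BOOLS.product(BOOLS, BOOLS).sum { |c, s, r| joint(c, s, r, true) }
p_rain_and_wet = BOOLS.product(BOOLS).sum { |c, s| joint(c, s, true, true) }
p_rain_given_wet = p_rain_and_wet / p_wet
# Rain comes out far more likely than the sprinkler as the explanation:
# about 0.71, versus about 0.43 for the sprinkler.
```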
Approximate inference is usually used when you want results faster, when you don't need quite as precise a result, or when your problem size is so big that it's impractical or intractable to solve with exact inference. Most of these approximation algorithms use stochastic methods: Monte Carlo algorithms that bounce around the network and set the variables to different states randomly according to their probabilities. There are two main ways that learning takes place in Bayesian networks. First, it's not always obvious what the actual structure of the network should be. Whether a variable has a causal influence or is actually a symptom of a more fundamental cause is often unclear, as we talked about earlier. So computer scientists have developed a number of algorithms for analyzing the level of apparent independence or causality between variables and suggesting possible relationships between the nodes. These algorithms take sample points (a sample point is just a state for all the variables in the system) as input, calculate each node's conditional probabilities, and analyze those probabilities to see how the variables influence each other. The other essential learning task in Bayesian networks is learning the actual parameters: you take sample points and have each node build its state table based on the frequencies of the various states it sees. So we've barely had time to scratch the surface of the underlying math involved in Bayesian networks, and I hope I haven't scared you off too much with the details, but in practice they're actually pretty easy to work with. One of the great things about them is that you can just sort of let the data speak for itself. You might not even be aware of all the relationships in the domain, but if you let the network learn on sufficient data, it can render pretty amazing results.
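To make the stochastic idea concrete, here's one of the simplest such methods, rejection sampling, on the grass network. This is not the algorithm SBN itself uses; it's just the easiest way to see sampling-based inference work:

```ruby
# Rejection sampling: draw each variable from its CPT in topological
# order, throw away samples that contradict the evidence (grass is wet),
# and estimate the query from the samples that survive.
srand(1234)  # fixed seed so the run is repeatable

p_c = 0.5
p_s = { true => 0.1, false => 0.5 }
p_r = { true => 0.8, false => 0.2 }
p_w = { [true, true] => 0.99, [true, false] => 0.9,
        [false, true] => 0.9, [false, false] => 0.0 }
flip = ->(pr) { rand < pr }

kept = 0
rainy = 0
100_000.times do
  c = flip.(p_c)
  s = flip.(p_s[c])
  r = flip.(p_r[c])
  w = flip.(p_w[[s, r]])
  next unless w        # reject: this sample disagrees with the evidence
  kept += 1
  rainy += 1 if r
end

estimate = rainy.to_f / kept
# estimate approximates the exact posterior P(rain | wet), about 0.708;
# more samples give a tighter estimate.
```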
As a general rule, as long as your network structure is reasonably accurate and you use all the available data during the learning process, then when you query your network you're getting close to the best possible predictions given the evidence. This means you're nearly optimally leveraging all your knowledge of the domain. There are a few caveats, and sometimes it might take a long time to do inference on the network, but it can work very well. For the rest of the presentation, I'd like to show you how to use a library I've been working on that lets you use Bayesian networks in your Ruby applications. This library is called SBN, or Simple Bayesian Networks. It started out as a class project for a class on Bayesian methods I took in the BYU computer science master's program, and I got the pseudocode for the algorithms from Artificial Intelligence: A Modern Approach by Russell and Norvig. It was written in Ruby, but it was kind of sloppy, done on a short deadline, and so it was kind of slow and inefficient. I later rewrote it in C++, and Ryan Davis asked me why I decided to cross over to the dark side when I could have just used RubyInline, which, if you're not familiar with it, lets you write C code inside your Ruby app. In retrospect I probably should have listened to him, but I thought it would be good to be able to share the code with a wider audience, and I figured I could write a Ruby extension for it later. When I revisited the code two years later, I decided to just try starting from scratch again in Ruby and see where I could improve the efficiency, and the funny thing is that the latest Ruby version, which doesn't even use any C code yet, ended up being faster than the C++ version.
I don't know why; that probably says more about my poor C++ skills than about Ruby's speed. But I will say that Ruby's built-in methods for hashes and arrays have been pretty tightly optimized, and I think a lot of the improvement this time around came from relying on those as often as it made sense. This experience is also a reminder that, as Donald Knuth famously said, premature optimization is the root of all evil. First code it as simply and elegantly as possible, then profile it, and then optimize the areas that are most obviously inefficient. By the way, when I wrote the conference abstract for this talk I thought of titling the project SBN4R and basing it on the C++ library, but I decided to change it to SBN, partly out of respect for Sergio Espeja's recent BN4R library (the name was just too close) and partly to avoid confusion. So let's look at how we would recreate the grass network with SBN. First we require the necessary files. Then we instantiate a new network, giving it a title. Then we create each node in the network, passing each one the network it belongs to, its name in the form of a symbol (strings are automatically converted to symbols), and the probabilities for its state table; I'll tell you more about the order those go in in a second. If you don't specify a node's states, it defaults to a Boolean variable with states true and false. We then create the edges in the network by calling either add_child or add_parent. Adding a child also creates the corresponding parent relationship and vice versa, so it doesn't matter which of those methods you call. Finally, we set the observed evidence in the form of a hash with symbols for keys and values: each key is the name of a node you've observed, and each value is its state.
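To make the structure concrete, here's a plain-Ruby mockup of the same grass network. To be clear, this is not SBN's actual API; `ToyNode` is a hypothetical class written just for this sketch, using the classic textbook probabilities for this example. It shows the point about edges: add_child also records the reverse parent link, so either method builds the same edge.

```ruby
# A toy stand-in for a network node (hypothetical, not SBN's API).
class ToyNode
  attr_reader :name, :parents, :children, :cpt

  def initialize(name, cpt = {})
    @name, @cpt = name, cpt
    @parents, @children = [], []
  end

  def add_child(node)
    @children << node
    node.parents << self   # the reverse (parent) link comes for free
  end

  def add_parent(node)
    node.add_child(self)   # and vice versa
  end
end

sprinkler = ToyNode.new(:sprinkler)
rain      = ToyNode.new(:rain)
# grass_wet's table is keyed by [sprinkler_state, rain_state]; the
# numbers are the standard textbook values for this example.
grass_wet = ToyNode.new(:grass_wet,
  [:true,  :true ] => { true: 0.99, false: 0.01 },
  [:true,  :false] => { true: 0.90, false: 0.10 },
  [:false, :true ] => { true: 0.90, false: 0.10 },
  [:false, :false] => { true: 0.00, false: 1.00 })
sprinkler.add_child(grass_wet)   # edge built from the parent side
grass_wet.add_parent(rain)       # edge built from the child side

evidence = { sprinkler: :false, rain: :true }
# Both of grass_wet's parents are observed here, so the answer can be
# read straight out of its table: P(grass_wet = true) = 0.9.
row = grass_wet.cpt[[evidence[:sprinkler], evidence[:rain]]]
```

When some parents are unobserved you can't just read the table like this; that's where the inference algorithms come in.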
In this case, sprinkler has been observed as false and rain as true, and at the end I print the output from querying the network about the probabilities for grass_wet. So let me just run this real fast and see how it works. You've got to have your demo. Kind of boring, but we get back that the probability of the grass being wet or not being wet... oh, thank you. There we go. Let me run it again. Yay! All right, so as you'll see, at this point SBN uses an approximation algorithm; I haven't yet implemented exact inference, though I hope to do that soon. So it doesn't return exactly the same results every time. The precise result that exact inference would return is 0.1 for false and 0.9 for true, but sometimes it's close enough for jazz, and the number of samples you take in the approximation algorithm determines the accuracy of the result as well. Let me go back to the presentation here; I always lose my mouse on these other monitors. Okay, so the algorithm it uses is called the Markov chain Monte Carlo algorithm. The order in which the probabilities are supplied is as follows. You always alternate fastest through the states of the variable whose probabilities you're supplying, then through the states of that variable's parents in the order the parents were added, from right to left, with the rightmost (most recently added) parent alternating first. So I'll show you an example of how to do that. Let's say we have a variable a with two parents, b and c; a has three states, b has two, and c has four. I would supply the probabilities in the following order. First you go through a's states for the first states of b and c, and you have to remember that each of these tuples, these little groups of states, must have probabilities that add up to 1.0.
Then you go through a's states for b's first state and c's second state, then c's third state, then c's fourth state; at that point you move on to b's second state and c's first state, and so on until you've covered the whole table. There's another way to enter the probabilities that's possibly less confusing, though a bit more verbose: you can specify an actual hash for the combination of states a table entry belongs to, and then just give it the probability for that combination. Parameter learning is another thing SBN does for you. Although it's sometimes useful to specify the probabilities in advance, a lot of times we start with a clean slate and are only able to make a reasonable estimate of the variables' probabilities after we collect a lot of data. This process is easy with SBN: you just pass in complete sample points for all the nodes in the network, as many as you want, and run the learn method. Alternatively, you can pass in one sample point at a time and, when you're done, call set_probabilities_from_sample_points. The network stores the sample points you pass in, so if you add more sample points later it keeps the old ones and continues to benefit from everything you've given it. The library also supports serializing your network. Right now it uses a format called XMLBIF, which is one of the few open formats for Bayesian networks; it was proposed and promoted by the author of JavaBayes, another cool little tool you can download. The networks this library creates can actually be opened in JavaBayes, so you can mess around with them there in a graphical way if you want.
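The ordering rule above is easy to get wrong, so here's a sketch of it in plain Ruby for the a/b/c example just described (illustrative only; the state names like :a1 are made up for the example). Ruby's Array#product happens to enumerate combinations with the last array varying fastest, which matches the rule: the variable's own states alternate fastest, then the most recently added parent (c), then b.

```ruby
# Variable :a has 3 states; its parents :b (2 states) and :c (4 states)
# were added in that order, so :c is the rightmost parent.
b_states = [:b1, :b2]
c_states = [:c1, :c2, :c3, :c4]
a_states = [:a1, :a2, :a3]

# Array#product varies the last argument fastest, so this yields the
# combinations in exactly the order SBN expects the probabilities:
# [:b1, :c1, :a1], [:b1, :c1, :a2], [:b1, :c1, :a3],
# [:b1, :c2, :a1], ... through all 2 * 4 * 3 = 24 entries.
order = b_states.product(c_states, a_states)

# Each consecutive group of three probabilities (one per state of :a)
# must sum to 1.0, since a must be in some state for each b/c combo.
```

If you print `order` and write your probability list alongside it, it's much harder to misplace an entry.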
At present, the sample points you've given your network aren't saved in the XMLBIF file, but I hope to add that soon. There are also a few advanced variable types that make it easier for you to handle real-world data, and I'll talk about those now. First there's the string variable, which is used for handling string data. Rather than setting a string variable's states manually, you rely on the learning process: during learning, you pass the observed string for this variable with each sample point. Each observed string is divided into a series of n-grams, short character sequences. A new co-variable is created for each of these n-grams, whose state will be true or false depending on whether that snippet was observed in the evidence. These co-variables are managed by the main string variable and are transparent to you, the developer; they inherit the same parents and children as the string variable managing them. By dividing the observed string data into fine-grained substrings and determining separate probabilities for each substring's occurrence, you can get an extremely accurate understanding of the data. But I have to confess that the practicality of this feature is still in doubt. It has the potential to greatly increase the accuracy of your inference, but it often makes your network structure way more complex, to the point of being intractable and impossible to infer on. It was a good idea, but it's kind of like those early attempts at flight you see in the funny old movies. Still, it might come in handy for some of you. I also created a variable type called numeric variable, which you use for handling numeric data. Numeric data is continuous, so it's more difficult to categorize into discrete states, and the current inference algorithm SBN uses can only work on discrete variables.
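The n-gram idea behind string variables can be shown in a couple of lines. This is a generic sketch of n-gram extraction, not SBN's internal implementation, and `ngrams` is a hypothetical helper name: each distinct character sequence it returns would become one true/false co-variable.

```ruby
# Illustrative n-gram extraction: break a string into all of its
# overlapping character sequences of length n. Each unique n-gram
# would act as a boolean co-variable (observed / not observed).
def ngrams(string, n)
  chars = string.downcase.chars
  return [] if chars.length < n
  (0..chars.length - n).map { |i| chars[i, n].join }.uniq
end

grams = ngrams("SAFEWAY", 3)
# => ["saf", "afe", "few", "ewa", "way"]
```

You can see why this blows up the network: a single observed string spawns one co-variable per unique n-gram, each wired to the same parents and children as the managing string variable.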
The workaround for this is a process called discretization, which divides the numbers into buckets, one per range. The thresholds for these buckets are based on the mean and standard deviation of the observed data you've already passed to the network, and they're recalculated every time you run learning on the network. So even though you lose some accuracy by discretizing your numbers, the numeric variable at least makes things handier, because it dynamically adapts to your data and handles the discretization for you. As long as your data is somewhat normally distributed, it should do all right for you. I wanted to end by talking about a real-world example of how you might use this. Let's say I've written a budgeting program and I want to improve the accuracy of its expense tracking. I'd like it to automatically categorize my transactions into budget categories with as little human intervention as possible, and to keep learning as I categorize more things. So category is the variable we'll query when we have a new transaction to classify. We set the other variables down at the bottom there, and we assume that the category is the main influence determining what they end up being. We're not sure whether data such as the day of the week and the day of the month has much influence on the category, but there might be a trend in there that we're not aware of, and we've got the data, so why not use it. A string variable is used for the merchant identification string, and the amount is a numeric variable. I was hoping to have a demonstration of this to show you today, but it's still not ready for prime time; I'm just not getting the results I would expect, and I may need to use a different network structure. But anyway, hopefully I'll have more to show you soon.
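Here's one way such mean/standard-deviation bucketing could look in plain Ruby. The three-bucket scheme and the `discretize` helper are assumptions for illustration (SBN's actual thresholds may differ); the point is just that the bucket boundaries are derived from the observed data, so they shift as more data arrives.

```ruby
# Illustrative discretization: bucket a number by where it falls
# relative to the mean and standard deviation of the observed data.
def discretize(value, observations)
  mean = observations.sum.to_f / observations.length
  variance = observations.sum { |x| (x - mean)**2 } / observations.length
  sd = Math.sqrt(variance)
  case value
  when -Float::INFINITY...(mean - sd) then :low   # below one sd under the mean
  when (mean - sd)...(mean + sd)      then :mid   # within one sd of the mean
  else                                     :high  # one sd or more above the mean
  end
end

observations = [10, 12, 14, 16, 18]  # mean 14.0, sd ~2.83
discretize(5,  observations)  # => :low
discretize(14, observations)  # => :mid
discretize(20, observations)  # => :high
```

Rerunning this after each learning pass is what makes the variable "dynamically adapt": the same raw value can land in a different bucket once the observed distribution shifts.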
Some future improvements I'd like to make to this library: I'd like to add exact inference, support real continuous variables, serialize your sample data along with your network, and in general really speed things up. If I had more time, I was going to show you a cool little mini-lesson on array languages, vectorization, and parallelization; you can really do some cool things. There's a library out there called MacSTL that I'd recommend you take a look at. It lets you operate on whole arrays of data using conventional math notation on arrays, and under the scenes it converts your code into code optimized for the multimedia instruction sets in today's processors. Those instruction sets let the processor execute an instruction on, say, 8, 16, or 32 numbers at a time rather than one number at a time, so rather than iterating over all the numbers you have to process, you can pass them through in chunks. MacSTL abstracts all that away from you: you just convert your algorithms into array-like notation and it takes care of the rest. The author shows some algorithms being sped up something like 450 times just by using his library, which is really cool. I'd also like SBN to be able to learn the structure of the network (right now you have to specify the structure yourself) and to give you more control over the precision of your inference when you're using approximate inference. Finally, I'd like to thank some of the people who got me interested in this subject, including some of my professors, and mention the cool tools available for the Mac for making all those neat little graphs you saw. That's it. Any questions? Right here. That's probably a good question. There are some Bayesian libraries that have been done for R already.
I've used R a few times, but I'm not a big guru on R. I bet it would be a good idea to do this there as well, though, and I guess you could connect R to Ruby that way, so that's worth pursuing. Anything else? Thank you.