Okay, good afternoon everybody. We want to talk about one of the oldest techniques in AI in a single lecture, which will be tough to do, so we will skip the details of the algorithm (I will post them later), but I want you to understand the concept with at least one example: how it is done. This is, of course, about decision trees, a quite old technique in AI. The idea of decision trees is basically that intelligence can be captured in a set of if-then-else rules that provide branching for classification. Which means, from the get-go, that decision trees are a classification technique, like neural networks, like SVMs. So the keywords here: you have some rules, and these rules are of an if-then-else nature. If this is true, then this; otherwise, that. It matches human intuition, and we put a lot of hope into decision trees back in the 60s and 70s. We said this is the way to go: expert systems. You can come up with a million rules and there you go, you can simulate an expert. Because who is an expert? Whoever has a lot of knowledge. And how do you represent knowledge? If I ask you how you invest your money, you say: if the market is like this and the government is doing this, then I will put my money here. So there must be some rules behind any piece of knowledge. The other keyword for us here is branching, because if you cannot branch, you cannot classify. Branching is discrimination; branching is saying what is what.
Now, we know we can display or visualize trees as arrays. Say I have a tree that contains the numbers 1 to 10: node 1 at the root, with children 2 and 3; going from left to right, 2 has children 4 and 5; 5 has children 6, 7 and 8; and 3 has children 9 and 10. This is where the data structures course comes in handy: we put 10 numbers in a tree. But a tree is an abstract idea; it has to be something I can code. And we still don't know how to construct a tree that contains some intelligence. That's a very important question for us in this lecture: how do I construct a tree? I don't need all the details, but what is the main idea? We usually get an Excel file with some numbers, so we get data; how do I construct a tree from data? First, though, I have to answer how to implement a tree, which is not a specific AI question. We can store a list indexed 0 through 10 and record parents. Index 0 is a dummy field, and node 1 has no parent. Then I look at parents and say: who is the parent of 2? Node 1. Who is the parent of 3? Node 1. Who is the parent of 4 and 5? Node 2. Who is the parent of 6, 7 and 8? Node 5, 5, 5. Who is the parent of 9 and 10? Node 3. So a tree like this looks like this in the computer: it's basically an array. Everything is an array, everything is a list, everything is a set. You don't see the structure; well, if you are a really good coder, you see it. The drawn form is just for teaching. The branching is embedded: if you say 5 is the parent of 6, 7 and 8, I can reconstruct the tree. When I get to node 5, the question is: where can I go?
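As a quick sketch of that parent-array idea (the node labels and the dummy index 0 are exactly the ones from the example; the helper name `children` is my own):

```python
# Parent-array representation of the tree from the example.
# Index 0 is the dummy field; node 1 is the root and has no parent.
parent = [None, None, 1, 1, 2, 2, 5, 5, 5, 3, 3]

def children(node, parents):
    """Reconstruct the branching: where can I go from this node?"""
    return [i for i, p in enumerate(parents) if p == node]

print(children(5, parent))  # -> [6, 7, 8]
print(children(1, parent))  # -> [2, 3]
```

So the drawn tree and the flat array carry exactly the same information; the structure is reconstructed on demand.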
When you get to 5, you can go to 6, you can go to 7, you can go to 8, and you cannot go anywhere else. That sense of putting something into a data structure is very important for trees, because they have been used for a long time. Another example from undergrad: logical propositions, a quite old example. Say you want to evaluate (a AND b) OR (NOT a AND NOT b). How do I implement something like this? How do I write a program that can take any logical proposition of any length and say whether it is true or false? That requires a certain level of intelligence, I would say; that's reasoning, and reasoning is intelligence. So, if I start with a, there are two cases: a is true or a is false. Then I go to b, and b can also be true or false, and then I have to reach a decision. Take just the a AND b part: a true and b true gives true; a true and b false gives false; a false and b true gives false; a false and b false gives false. So for one section of the proposition I can deduce, I can reach conclusions. Of course, the trees you actually want to work with are not that simple. If you want a tree that can do the same type of classification as a deep convolutional neural network, a DenseNet with 200 layers, it will have more than three nodes; a little bit more than three nodes. I only did this part, not the entire proposition, just because we don't have enough time; you can figure out the rest, which is not a big deal.
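To make the branching concrete, here is a minimal sketch that walks all four truth-tree branches of the full proposition (the function name `proposition` is my own label):

```python
from itertools import product

# Evaluate (a AND b) OR (NOT a AND NOT b) on every branch of the truth tree:
# a true/false, then b true/false, then decide.
def proposition(a, b):
    return (a and b) or (not a and not b)

for a, b in product([True, False], repeat=2):
    print(a, b, "->", proposition(a, b))
```

It comes out true exactly when a and b agree, which is what the two surviving branches of the tree say.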
The point is just that you can use trees to process logical propositions. Now a fun example: can we recognize animals with trees? Of course we can. Animal recognition: how do I know what type of animal I'm dealing with? Maybe I ask for the color of the animal: is the skin color gray? Maybe that is in my data. The answer is either yes or no, and you should notice this yes/no, this branching in two different directions, true and false: I'm doing binary classification, like SVM did. (From the beginning this was a plus point for multi-layer perceptrons: they can give you a number between zero and one, not just a zero or a one, whereas with SVM we started with purely binary decisions.) Decision trees, likewise, seem naturally binary: if I want more than binary, many more edges have to emanate from that first node, which makes the tree really messy. If, instead of zero and one, I wanted to depict all the numbers between zero and one, even at some discrete step, think how many edges I would need; it would be messy. So let's keep it binary. Okay, it's gray. My next question: is the animal large? Again yes or no; and don't come back at me with fuzzy logic, no, no, I don't want to hear fuzzy logic. Yes or no; I want to keep it simple. If yes, I make a decision: it's gray, it's large, what can it be? It's an elephant. Ridiculously simple, but that's the attribute I had, that's the feature I had. If no, okay: it's gray but not large, so it's a mouse. I can only work with what I have. Then I go to the other side: it's not gray. Oh, okay, I ask another question: can it fly?
Say it cannot fly. (I'm drawing the 'no' on the other side here, inconsistently, just because I want room to write more on that side; it doesn't mean anything. Usually we keep it consistent: yes, no, yes, no. I'm breaking the consistency purely for my own convenience.) So: what is not gray and cannot fly? It's a frog. That's a possibility; there is some probability that it is. If it can fly, I ask another question: is it active at night? If yes, it has to be an owl. If it is not active at night, maybe it's an eagle. Not a realistic animal-recognition program, but you get the point. I'm looking at attributes, which is what decision-tree terminology calls features: we don't talk about features, we talk about attributes. Every attribute has a value, and on the values we branch; this is branching here, this is branching too, and at the bottom, at the leaves, we make decisions. Give me a little more data and I can graduate from kindergarten and maybe do something sophisticated, but I need data. Color, size, the ability to fly and activity at night is not enough to do animal recognition; that's a table with just four columns. I would need 200, 300 columns; four or five hundred attributes of every animal. Is it a mammal? Does it live on land or in water? Many, many other things, to really do something that makes sense. Okay. Please keep in mind that we are going in reverse order: usually any machine learning or AI course actually starts at the beginning, with decision trees, because this is old stuff, and gets to deep learning and reinforcement agents toward the end. Why did we not keep that order?
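The toy classifier above is literally a chain of if-then-else rules; here it is sketched in code (the attribute names and the five animals are the ones from the example; everything about real animals is, of course, grossly oversimplified):

```python
# The toy animal tree as nested if/else rules. Branch order follows
# the drawing: gray? -> large?; not gray -> can it fly? -> active at night?
def classify_animal(gray, large, can_fly, active_at_night):
    if gray:
        return "elephant" if large else "mouse"
    if not can_fly:
        return "frog"
    return "owl" if active_at_night else "eagle"

print(classify_animal(gray=True, large=True, can_fly=False, active_at_night=False))  # -> elephant
print(classify_animal(gray=False, large=False, can_fly=True, active_at_night=True))  # -> owl
```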
Why? Because, I tell you, when we get to a certain point and bring in advanced versions of decision trees, you could, if you wanted to, compete with convolutional neural networks. That's the amazing thing: what those networks can do, decision trees can do too, with such a simple concept. But we have to establish a learning scheme for them. How do you learn a tree? How do you construct a tree? Well, we have to plant them. So: how to plant a tree? Okay, let's look at decision trees and classification again. We learned that classifiers draw lines. Say I have two features, feature one and feature two, and I have some squares as one class, some circles as a class, some crosses as a class, and some triangles as a class. If I could draw two lines, say a vertical one at W1 and a horizontal one at W2, I could perfectly separate those four classes. I'm simplifying, not talking about bias and everything; cut me a little bit of slack, I just want to make a point. Now, if you're doing something like this, forget about the animals and the logical propositions: I get numbers. Sometimes those numbers don't even have labels; I don't even know the name of that column in the data set somebody gave me. 'Here are my measurements, classify them.' How do I apply a decision tree to that? Okay: I can ask, is feature one greater than W1? That's a question, and it means I'm looking along this direction. Looking along this direction alone, I still have uncertainty: it could be squares.
It could be triangles so Then I may have the answer yes or no Then if it is yes, I will ask the question is F2 greater than W2 Then I will get yes or no Then I can make a decision If F1 is greater than W1 and F2 is greater than W2 Then I have a square If it is not, I have a rectangle on the other side. I will ask whether F2 is So F2 is greater than W2 again. I get yes or no So if yes, I get a circle if no I get a cross So I can do it I can understand formulate conceptualize a typical classification problem which seems in 99 percent. This is what AI does to classify data. I Can understand it in a way that the decision tree the booring decision tree that we suffered from it in the second year Can be part of AI It's unbelievable what linear algebra dot product is intelligence so Decision trees DTS they have nodes that verify nodes to verify Slash evaluate an attribute so we have nodes This nodes Evaluate or verify attributes look at the attributes. What are attributes your features the columns the columns of your data Then they have branches that embody attributes Values so the branches look at the values and yet this is the weakness of decision trees the branching has to be binary That restricts us somewhat maybe you can go to three or four. I don't know but You cannot go hundred you cannot have a branching of hundred it gets really nasty. We lose the overview So let's keep it at branching to is always yes or no zero or one Younger old and so on so let's keep it. Let's keep it binary and At the bottom the leaves categorize instances Which is a slash classify Classify instances instances So at the bottom you classify we knew that We knew that for many years, but we didn't really know how to effectively Use decision trees because we didn't have algorithms to automatically construct them in The 70s up to mid 80s. We fundamentally manually created decision trees So the work of AI expert was to sit down and say Yeah Gray yes or no. 
Yes? Then I ask you this; and if that is yes, then I go here. It was manually designed. You could get a grant and say, over the next five years I want to design a tree like that, and people would give you, well, maybe not half a million dollars, but a hundred thousand dollars to do it, because everybody knew it was difficult. This guy has to come up with a tree with a hundred thousand nodes; oh my god, he will get lost; so give him some money, make him happy. Having to design trees manually was the major restricting point for decision trees: we couldn't automatically create them. How do you create a decision tree? How do you grow a tree? We didn't have an elegant answer until the mid 80s; we had many answers. (Sometimes I exaggerate to make a point, and I count on your intelligence to realize that I'm exaggerating; the onus is on you. These things are usually not black and white in reality, we know that.) Okay, so: why could decision trees be a good AI choice? I don't know how that question sounds. What I'm trying to say is: who says I should go with decision trees? We have support vector machines; why should I go with decision trees? Well, don't you like nature? Trees are beautiful. More seriously: the output is discrete, and that can be a nice property for some applications. I don't need an estimate, I don't need a probability or a likelihood; I need a discrete answer.
Tell me yes or no; tell me it is this, or it is this; it is discrete, the output is discrete. If I see an application where the output is discrete, the first thing that comes to my mind is not a neural network; the first thing that comes to my mind is: oh, decision trees could be a good choice. And here comes what is, for me, the number-two killer reason: decision trees are a good choice when no large data set is available. Somebody gives you a file; you are new in the company; 'here is all the data we have, two hundred rows.' Two hundred! The hungry monster of a deep network can do nothing with two hundred rows. With decision trees I can easily come up with a very impressive, well-functioning classifier even if you give me ten rows, ten measurements. We also use decision trees when the data is noisy, and when we know the classes are disjoint: when we know that a square shares nothing with a circle, that an elephant shares nothing with a mouse once I put all the attributes together. There are other reasons too, but for me this is the major one. Whenever in practice I have resorted back to decision trees, especially when I do consultation for companies, which happens a lot, decision trees have been the best: low risk, not hungry for data, very reliable, fast to train, compact, not needing a lot of storage. Perfect. Do you think the customer cares what you call your AI technique? 'My method is a chaotic evolutionary neural network, deep, shallow, whatever.' Come on; I just need a solution. Give me a solution and call it whatever you want. And yes, if the data is noisy: making things discrete and binary generally helps, because it's a sort of filtering, basically. If the data is noisy and you are doing some sort of classification with neural networks, you need a lot of representative data to take care of the noise.
You also need samples of the noise; and if I have just a few samples of the noise, a discrete decision tree will be a much better choice. So what now? I'm trying to answer the question: okay, you convinced me, but how do I grow a tree, for example when I have many attributes? Why do I ask that? Because with five attributes it's easy; even I can sit down, like in the 70s, and design the tree on paper. I can do that when the problem is small. But if I get an Excel file with 500 columns, which is 500 features or attributes, I cannot do that manually. So how do I do it? You convinced me, but unless you can show me how, I will stick with my MLP and SVM. The question is basically: how to select the best attribute, so as to generate the most compact branching. Now I'm trying to become more specific. Even with the five attributes you gave me (the skin color of the animal, its size, where it lives, whether it is active at night), how do I know which of them is the more important feature? Is it the skin color or the size? Even with just two, color and size, which one is more important, and why does it matter? Because I have to put something at the top. Take this tree and rotate it 180 degrees: that node is the root, it goes into the ground, like a real tree; this is where you start, where everything happens, where the minerals get drawn up into the other branches. So it is very important who goes at the top. Who goes at the top? The most important attribute, I would say, because if the most important attribute is at the top, the biggest branching happens there: the most important question is the first yes/no. And then the less important, and the less important, and so on down. So how do I do that? How do I determine the best attribute with respect to branching?
Why with respect to branching? Because this is the secret of decision trees; this is what turns a boring decision tree into an exciting classifier. How do I know whether the top node should test the skin color of the animal, or its size, or its ability to fly, or its activity at night, or whether it's a mammal? What goes here, at the top? Flip a coin? No, that's not going to work. You need some intelligence here, I would say, and that is probably what makes decision trees very important. Okay. First of all, let's emphasize it again: let's restrict things to binary. I'm saying it officially: guys, I don't want to do non-binary, it would be difficult. Let's restrict ourselves to binary; everything you give me is yes or no. Then I have S, which is the set of my training samples. I have training samples, don't I? If I don't have training samples, I don't have anything; there is no AI without training samples. So I have training samples, and the question is whether I train a network with them or a decision tree. Then I have S+ and S-: S+ are the positive samples and S- are the negative samples. 'Positive' and 'negative' are just the words we selected for the binary distinction.
Yes and no, true and false, whatever: two situations, two values only. So I have a set of training data, and some of the samples are positive, some negative, with respect to the classification; your classification is binary. Which means the probability of a positive sample is the cardinality of the set of positive samples divided by the cardinality of all the training data, and the probability of a negative sample is the cardinality of S- divided by the cardinality of S:

p+ = |S+| / |S|,  p- = |S-| / |S|.

Cardinality means the number of members in the set: how many items are in that list, in that vector. A fancy way of saying it, but that's the mathematical terminology. Now, when I see this, I immediately think about entropy. So what is the entropy of S? Why do I think about entropy? First of all, I don't know anything else; it is the early 80s, the only tool I know is a hammer, so I see everything as a nail. Information theory: I don't have anything else. Do you know anything else? I don't. So you give me a data set and the first thing that comes to my mind is: let's calculate its entropy; maybe we get somewhere with it. The entropy is of course

Entropy(S) = - p+ log2(p+) - p- log2(p-).

So we calculate the entropy of the data. We know this from information theory; try to get into the historical context. That was the time of cybernetics, Norbert Wiener; the entire field of AI was driven, behind the stage, by, among other things, notions from information theory. Communication was big: we were trying to figure out how to phone without a cable, and that was important. All we talked about in the 70s was the information channel. Everything was information; it still is, everything is information.
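As a sketch, here is the entropy formula in code, written directly from p+ = |S+|/|S| and p- = |S-|/|S| (the convention 0 * log2(0) = 0 handles pure sets):

```python
from math import log2

def entropy(n_pos, n_neg):
    """Entropy of a set with n_pos positive and n_neg negative samples."""
    total = n_pos + n_neg
    h = 0.0
    for n in (n_pos, n_neg):
        p = n / total
        if p > 0:            # convention: 0 * log2(0) = 0
            h -= p * log2(p)
    return h

print(round(entropy(7, 7), 2))  # a 50/50 split is maximally chaotic -> 1.0
print(round(entropy(9, 5), 2))  # -> 0.94
print(entropy(4, 0))            # a pure set has no chaos -> 0.0
```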
So we know the following from information theory. (To your question whether starting with entropy is arbitrary: again, go into the historical context; we didn't have any other choice. If it were today and I sat down fresh, I would know two or three other equations I could use instead of entropy, but all of them come back to entropy. It is such a fundamental equation for us that I'm not sure when we will replace it with something totally different.) The fact is this: the optimal-length code for a message. You see, this is entirely information theory: you want to send a message to somebody, and the question is what the optimal-length encoding is, such that I don't waste bits, and such that when I send the information and some bits get corrupted, I can reconstruct the message on the other side of the channel. It's not about learning, it's about communication. A message with probability p needs

- log2(p) bits.

So the optimal-length code for a message with probability p is -log2(p) bits: you need that many bits to optimally encode it. Take that as the fundamental statement of information theory. So if you have a message that is frequently used, do we use a longer or a shorter code for it? Shorter, of course, because it's very frequent; I don't want to use big codes for something I'm sending every five seconds. And a message that I only send every ten years? Make the code as long as you want; you are sending it every ten years. That means the entropy quantifies the 'expected' number of bits. I put 'expected' in quotes because we never know the accurate probabilities; we only ever have estimates, because we can never observe events infinitely, only over a very limited time.
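A small numeric sketch of that statement, and of how entropy falls out of it as the expected code length (the probability values here are made up purely for illustration):

```python
from math import log2

# Optimal code length for a message of probability p: -log2(p) bits.
def code_length(p):
    return -log2(p)

print(code_length(0.5))    # frequent message, short code -> 1.0 bit
print(code_length(0.125))  # rare message, long code      -> 3.0 bits

# Entropy is then the expected code length over a whole message distribution:
probs = [0.5, 0.25, 0.125, 0.125]
expected_bits = sum(p * code_length(p) for p in probs)
print(expected_bits)       # -> 1.75
```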
So whatever you get is an expected value, not the actual value. Entropy, then, quantifies the expected number of bits to encode the class of a randomly drawn sample. Let's see how far we have come: we started with AI, went back to the roots of information theory, and we are at entropy. Okay, so what? I want to construct a tree. Tell me whatever you want about Shannon and Wiener and cybernetics and communication channels; I want to figure out how to construct a tree when you give me a table of numbers. That's what I'm after. Push your terminology on me, I will patiently listen, but at the end I want an algorithm that takes a table of numbers and constructs a tree for decision-making. But to construct a tree, we need to know how much we gain when we add a specific attribute. We're trying to nail it down: never mind entropy and probability for their own sake, I want to construct a tree, so I need to know how much I gain; how much value every attribute adds. You give me a table; go back to the first two lectures, where we talked about principal component analysis. You get a table of 1,000 columns and the question is how many of them are garbage. Same question here: I want to know which of those columns add value to my classification, which of them are the principal components. 'Okay, then use PCA.' I may use PCA, but PCA will give me the 10 principal components, and I still have to construct a tree with those 10 components as my features. So how much do I gain? I start with skin color and then I want to add the size of the animal: is that beneficial?
Would it make my classifier better? How much do I gain if I add a specific attribute? Once you ask this question, you prime your mind to go after the first algorithm that automatically constructs a decision tree. How do I measure the gain? The question becomes: how do I measure the gain of any specific attribute (a feature) when I'm placing the first node, then the second, then the third, then the fourth; when I'm putting a tree together? So people sat down and said: let's define a gain function that takes the data and takes the attribute. You give me a data set, which is your table of numbers, and you give me an attribute, separately. What I can do at the moment is, of course, calculate entropy; I don't know anything else. The entropy of the data is the expected number of bits to encode it, the optimal message size. Okay, good. Now, as I put things in place, I want to reduce the entropy. Whatever reduces entropy is good; do we agree? Entropy is chaos; I don't want entropy. If my attributes get bigger and bigger, they lose their discrimination. I want to make things compact, lean; lean in bits, not chubby. How do I do that? I take the entropy and subtract whatever remains after the split:

Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|S_v| / |S|) * Entropy(S_v)

where Values(A) is the set of all values of the attribute A, and S_v is, of course, a subset of S: the instances for which A takes the value v. So the gain becomes the expected reduction in entropy upon sorting on A. I'm looking at subsets: I grab only the instances with a given color, only the instances with a given flight ability.
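The gain formula can be sketched directly in code; samples are (attribute value, label) pairs, and the helper names are my own:

```python
from math import log2

def entropy(labels):
    """Entropy of a list of binary labels."""
    h = 0.0
    for value in set(labels):
        p = labels.count(value) / len(labels)
        h -= p * log2(p)
    return h

def gain(samples):
    """Gain(S, A): expected reduction in entropy upon sorting S on the attribute."""
    labels = [lab for _, lab in samples]
    remainder = 0.0
    for v in set(val for val, _ in samples):             # v in Values(A)
        s_v = [lab for val, lab in samples if val == v]  # the subset S_v
        remainder += len(s_v) / len(samples) * entropy(s_v)
    return entropy(labels) - remainder

# An attribute that separates the labels perfectly recovers all the entropy:
print(gain([("a", "yes"), ("a", "yes"), ("b", "no"), ("b", "no")]))  # -> 1.0
# An attribute that says nothing about the labels gains nothing:
print(gain([("a", "yes"), ("a", "no"), ("b", "yes"), ("b", "no")]))  # -> 0.0
```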
I'm grabbing instances; I'm looking at different attributes. If I start with the size of the animal, is that good or bad? By how much can I reduce the entropy? Because I have understood classification as an information channel, my measure of goodness is minimum entropy. You see, we can formulate any problem within any framework you want, if you are good in that framework. Who would have thought we'd get SVM using linear algebra while everybody was working with backpropagation? Everybody is using a hammer, and you go buy an axe and say: I'm working with an axe; an axe is much better, I can chop things. If you understand your tool, you understand the problem: redefine the problem within your own framework. So we define it and say: the entropy is there; I want to select the attributes that minimize it. I subtract; I minimize the entropy. I want a decision tree with the lowest possible entropy. What does that mean, in the simplest terms? How many nodes does a tree have for a reasonable-size problem? 5,000 nodes, 10,000 nodes, half a million nodes? That requires storage; you have to store them, right? Do we have an understanding of bits and bytes? I have to store that tree, so I want small trees. Occam's razor! I want to solve the problem but have low entropy, which, on my limited understanding of it, means small trees. And thank goodness we have Occam, because Occam also told us: keep it simple. Keep it simple means low entropy, because if it is big, it becomes chaotic. Okay, good.
So now we want an example. I was torn between doing something rather theoretical and something practical, so let's do something practical. This is an example from a textbook; I wrote about it somewhere, I don't even remember where, but I'm pretty sure that if you type it into a search engine you will find where it comes from. It's called play tennis: you want to write a program that somebody uses to decide, should I play tennis today or not; or, for the club owner, should I keep my club open today or not? Then you look at the measurements. So let's take the measurements: day 1, day 2, day 3, and so on down to day 14 at the bottom. Data science is fun, isn't it? Okay, 14 is good enough. The club owner gives me 14 measurements: 14 randomly selected days on which he looked at the outlook. The outlook on day 1 was sunny; then sunny again, overcast, rain, rain, rain, overcast, sunny, sunny, rain, sunny, overcast, overcast, rain. Data collection is a painful process; oh my god, my hand hurts. So outlook is one attribute: on each of the 14 days I measure or observe the outlook. Is it sunny? Is it overcast? Is it rainy? Then I look at the temperature. Say it takes the values hot, mild and cool, just to keep it simple. Over the days: hot, hot, hot, mild, cool, cool, cool, mild, cool, mild, mild, mild, cool, and at the very bottom again mild. There are other attributes, but I will skip them; they measure humidity, they measure wind, so there are many more, but I will skip them.
I just want to look at two attributes. I'm assuming that by looking at the outlook and measuring the temperature this guy can make the decision: should we play tennis, which for him means, should I keep my club open or not? That's a business decision, right? If I keep it open, will people come? I have to pay for the operation; very serious. 'Who would go play tennis? Not me.' It's not about you, it's about business; they will pay you good money to write that program. If you don't want to do it, refer them to me, I will. And the decisions: day one was no; then no, yes, yes, yes, no, yes, no, yes, yes, yes, yes, yes, no. So these were the attributes, and this is the output, which is the decision. Nothing has changed: we are used, in AI, to somebody giving us a table with many, many rows, where the columns are features and there is usually one last column, in most applications, which is the desired output. The club owner (I assume a male, for some reason, or maybe prejudice) tells us that he made these decisions to close the club down or keep it open, and that they were good decisions. So this is the gold standard for him; that's why we want to learn it: we know these decisions were correct. You cannot learn from shaky decisions; the data has to be good. Therefore, when somebody comes and gives me data like this, I don't start writing Python code. I spend some days going back and forth to make sure this is not garbage. How did you get this data? 'Well, we hired a co-op student, he did it.' Which university? Okay, fine, we continue. Who did the data collection? What was the bias? What was the frequency? Was there noise?
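The two-attribute table, as read out in class, can be typed up directly (one (outlook, temperature, decision) triple per day; humidity and wind are the skipped columns):

```python
# The 14-day play-tennis table from the lecture: (outlook, temperature, decision).
days = [
    ("sunny",    "hot",  "no"),   # day 1
    ("sunny",    "hot",  "no"),
    ("overcast", "hot",  "yes"),
    ("rain",     "mild", "yes"),
    ("rain",     "cool", "yes"),
    ("rain",     "cool", "no"),
    ("overcast", "cool", "yes"),
    ("sunny",    "mild", "no"),
    ("sunny",    "cool", "yes"),
    ("rain",     "mild", "yes"),
    ("sunny",    "mild", "yes"),
    ("overcast", "mild", "yes"),
    ("overcast", "cool", "yes"),
    ("rain",     "mild", "no"),   # day 14
]

n_yes = sum(1 for _, _, d in days if d == "yes")
print(n_yes, len(days) - n_yes)  # -> 9 5
```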
Were there any manual changes? You have to make sure, because this is all you get; you have to make sure that the data is good. Data is everything. But we are not talking about that part here, it's not our business; we usually assume in this course that the data somebody gives me is well posed, well behaved, good data. Okay. So now, using this table and everything we talked about with information gain, this made-up equation, I want to construct a tree. How can I construct a tree? I'm just looking at these two attributes; it's not even difficult. I didn't add humidity, I didn't add wind, I didn't add the weekday, I didn't add any of that. I just have two attributes, so the question is this: which one is the better feature, outlook or temperature? How do you make that decision? Because if I cannot make that decision and I sell that software, that software is not intelligent. So, an example: outlook. Now I'm separating out outlook as one attribute. Outlook has three values: sunny, overcast, and rain. The problem is not three-valued; the classification is still binary, yes and no. I just have multiple values for my attribute, which has nothing to do with the classification itself. If I look at my data set, it has nine positive and five negative samples: count them, nine yeses and five nos. You can calculate the entropy of this. We know how to calculate entropy: minus the probability of the positive cases times the log of that probability, minus the probability of the negative cases times the log of that probability:

entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94

The entropy of this data set is 0.94.
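As a sanity check, that entropy calculation is a few lines of Python (the helper name is mine):

```python
import math

def entropy(pos, neg):
    """Shannon entropy, in bits, of a (yes, no) count pair."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0  # a pure set has no disorder
    p, n = pos / total, neg / total
    return -p * math.log2(p) - n * math.log2(n)

# the full play-tennis set: 9 yes days, 5 no days
print(round(entropy(9, 5), 2))  # 0.94
```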
So high entropy is not good; this is not going to be easy. You should know the data; you should not just develop this and sell it for five hundred dollars. Know the value of the data and what it takes to come up with an AI solution for it. Now, I messed up the board, because I want to write something below this, so let me rewrite it here. We had S with nine plus and five minus, and entropy 0.94. For sunny we have two plus and three minus. What does that mean? I have two sunny cases that contributed to a yes decision: not here, not here, not here, one here (sunny, yes), one here (sunny, yes). And I have three sunny cases that contributed to a no decision. So I have five sunny days in total: one, two, three, four, five. That gives me a subset, S_v, for the value sunny of the outlook attribute. I can calculate the entropy of this: 0.97, even higher. The subset of sunny days has a higher entropy than the entire data set. What does that mean? Sunny alone is probably not a good value; I cannot make big decisions with it. And if the entropy is one, just flip a coin, because anything goes. Now look at overcast: I have four pluses and zero negatives. Overcast, yes; overcast, yes; overcast, yes; overcast, yes. Overcast is always yes. How much is the entropy? Do you need a calculator? Every time I see overcast, it's a yes. There is no chaos: the entropy is zero. That's a good feature. The problem is that you cannot classify with one feature; we need more. Rain: we have three positives and two negatives, and the entropy is again 0.97. Okay. We calculated these numbers, and we have to do something with them, because now I can compute the gain.
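The three subset entropies can be checked the same way (the helper name is mine; the counts are the tallies just made from the table):

```python
import math

def entropy(pos, neg):
    """Entropy in bits of a (yes, no) count pair."""
    if pos == 0 or neg == 0:
        return 0.0  # a pure subset, like the overcast days, has zero entropy
    p, n = pos / (pos + neg), neg / (pos + neg)
    return -p * math.log2(p) - n * math.log2(n)

# (yes, no) counts for each value of outlook
for value, (pos, neg) in [("sunny", (2, 3)), ("overcast", (4, 0)), ("rain", (3, 2))]:
    print(value, round(entropy(pos, neg), 2))
# sunny 0.97 / overcast 0.0 / rain 0.97
```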
I calculated the entropy for every subset S_v: S_v for sunny, S_v for overcast, S_v for rain. Now I weight each one, the number of elements in the subset divided by 14, multiply by its entropy, which I calculated, and combine that with the entropy of S, which I also have. So now calculate the gain for me: how much do I gain if I put outlook at the top? The root is the point of departure; you start there, and you want to reach compact decisions really fast. So what is the gain of my data set S when I look at outlook as my attribute? It is

gain(S, outlook) = entropy(S) - (5/14) entropy(S_sunny) - (4/14) entropy(S_overcast) - (5/14) entropy(S_rain)

The fractions are the subset sizes: you have five sunny, four overcast, five rain. If you put all those numbers in, you get 0.246. So the information gain if I put outlook as my main attribute is 0.246. What does that mean? Nothing, at the moment, because I don't have anything to compare it to. Which means I have to look at every other attribute in this table to see how much I would gain if I went with temperature, how much with wind, how much with humidity, how much with the weekday. I need to calculate all of that. "Oh, I have to write a for loop."
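That for loop, for the two attributes we kept, could look like this. The (yes, no) counts are the tallies from the table, including the temperature tallies we are about to do by hand; all names are mine, and the small difference from the 0.246 on the board comes from rounding the intermediate entropies:

```python
import math

def entropy(pos, neg):
    if pos == 0 or neg == 0:
        return 0.0
    p, n = pos / (pos + neg), neg / (pos + neg)
    return -p * math.log2(p) - n * math.log2(n)

def gain(total, subsets):
    """Information gain = entropy(S) minus the size-weighted subset entropies."""
    n = sum(total)
    return entropy(*total) - sum((p + q) / n * entropy(p, q) for p, q in subsets)

ATTRIBUTES = {
    "outlook":     [(2, 3), (4, 0), (3, 2)],  # sunny, overcast, rain
    "temperature": [(2, 2), (4, 2), (3, 1)],  # hot, mild, cool
}
gains = {name: gain((9, 5), subsets) for name, subsets in ATTRIBUTES.items()}
best = max(gains, key=gains.get)  # the attribute with the maximum gain
print(best, {k: round(v, 3) for k, v in gains.items()})
# outlook {'outlook': 0.247, 'temperature': 0.029}
```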
Yes, please, you have to write a for loop: for i from one to the number of attributes, calculate the gain, and find the maximum. You want the maximum gain. Okay, the next one is temperature. I intentionally took just two attributes; in a real case you have several hundred features. Again, S doesn't change: it is still nine positive and five negative. The data set doesn't change, and its entropy is still 0.94. Here I also have three values: hot, mild, and cool. For hot I see two plus and two negative: hot, no; hot, no; hot, yes. And I should have one more. Where is it? I don't see it. I must have made a mistake reading the table: yes, this one is hot. Because if I mess that up, everything will be wrong. That's the data; I cannot change the data, I just made a mistake reading it. But if instead of hot I had counted it as mild, everything would be messed up. So: I have two hot days where the answer was no and two hot days where the answer was yes. What is the entropy? Fifty percent of the time yes, fifty percent of the time no. Is that a good feature? It is not; flip the coin. The entropy is 1.0, one hundred percent entropy, oh my god. So I cannot count on it being hot: it can be hot and I go play tennis, or it can be hot and I stay home and watch TV. For mild I have four pluses and two minuses, which makes the entropy 0.92, and for cool I have three pluses and one minus, and the entropy is 0.81. So far, cool is the attribute value with the lowest entropy; I like that, it makes the decision making a bit easier. Then I have to calculate the gain of S given temperature, which is

gain(S, temperature) = entropy(S) - (4/14) entropy(S_hot) - (6/14) entropy(S_mild) - (4/14) entropy(S_cool)

which gives me, at the bottom,
0.029. So this is the gain for outlook, and this is the gain for temperature. What do you think, should I start with temperature or with outlook? Of course, I start with outlook, if there is no other feature. If there are other features, I have to do the same thing for all of them and then decide. So if outlook and temperature are the only things I have, I will start with outlook. Outlook goes at the top, and my branching is sunny, overcast, rain; for each of them, the temperature could be hot, mild, or cool, and the decisions would be yes, no, yes, no, yes, yes, yes, no. Okay, and that's all we need; you're done. That was the mid-1980s: Quinlan came up with the idea of constructing a decision tree based on information gain. I will find that textbook and post this example online, but you can easily find it; it's a good example. Sometimes, for the sake of education, we plagiarize each other's work and don't give credit, but these are things we can easily look up; if you want to know who came up with the play-tennis example, you can find the textbook. So: the gain of the data set S taking outlook is greater than the gain of S taking temperature. That means outlook starts at the top, assuming there are no other attributes. Outlook, we said, could be sunny, overcast, or rain. Then we said, okay, if that's the case, if it is sunny, what about
temperature? Temperature could be hot, mild, or cool. Same thing again under overcast: hot, mild, cool; and under rain: hot, mild, cool. Then you have to make a decision for each one of these cases. That's the branching. Now, for some of these cases you have already seen a day. Have I seen a day that was sunny and mild? If I have seen it, it's not a big deal to say keep the club open or not. But what happens with a sunny day that is cool? That's a very unusual combination, and you have no measurement for it. If you don't have a measurement for it and you make a prediction, and it's correct, that's intelligence. If it was in the table, well, we always exclude the training data from testing; anything else would be cheating. Generalization is what you can tell me about unseen situations. So if this combination was never seen, and the tree still says yes, that's a performance: I generalized my if-then-else rules to unseen situations. That takes intelligence. And how did we come by that intelligence? Using entropy and information gain. "Really? But those things are boring." Well, not if you give them to passionate researchers; they make something with it. Okay, so: decision tree construction algorithms. There is of course a bit more detail necessary, and we are already two minutes over. The first one was the so-called ID3, which stands for, okay, you have one minute, Iterative Dichotomiser 3. That was the first algorithm Quinlan came up with, in the mid-1980s, and it is entirely based on what we talked about: the loop, plus the information gain, plus the manual example. That's ID3. It's in MATLAB, it's in Python, it's in R, it's everywhere.
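The whole construction, the entropy, the gain loop, and the recursive branching, fits in a short sketch of ID3 restricted to our two attributes (function and variable names are mine, not Quinlan's; rows are dicts mapping attribute names to values):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attributes):
    """Tiny recursive ID3 sketch: split on the highest-gain attribute."""
    if len(set(labels)) == 1:
        return labels[0]                              # pure node: stop
    if not attributes:
        return Counter(labels).most_common(1)[0][0]   # no attributes left: majority

    def gain(attr):                                   # information gain of one split
        g = entropy(labels)
        for v in {r[attr] for r in rows}:
            sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
            g -= len(sub) / len(labels) * entropy(sub)
        return g

    best = max(attributes, key=gain)
    branches = {}
    for v in {r[best] for r in rows}:
        keep = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        branches[v] = id3([r for r, _ in keep], [lab for _, lab in keep],
                          [a for a in attributes if a != best])
    return {best: branches}

OUTLOOK = ["sunny", "sunny", "overcast", "rain", "rain", "rain", "overcast",
           "sunny", "sunny", "rain", "sunny", "overcast", "overcast", "rain"]
TEMP = ["hot", "hot", "hot", "mild", "cool", "cool", "cool",
        "mild", "cool", "mild", "mild", "mild", "hot", "mild"]
PLAY = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

rows = [{"outlook": o, "temperature": t} for o, t in zip(OUTLOOK, TEMP)]
tree = id3(rows, PLAY, ["outlook", "temperature"])
print(next(iter(tree)))  # outlook  (the higher-gain attribute, as computed by hand)
```

The tree comes back as nested dicts, so the overcast branch is the string leaf "yes", exactly as the hand calculation predicted.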
You use it, people use it, and they don't know it's ID3. And there are successors called C4.5 and C5.0; these are just the names people came up with for those algorithms. You don't need to implement them, you just need to use them. So the question now is: what is overfitting? Can you overfit with trees? Of course you can. You think overfitting is just for deep networks? No, you can also overfit with a tree. How do I know I overfit? If you get a very large tree, then you most likely overfit. If you come up with a gigantic tree with half a million nodes: whoa, take it easy, what did you do? The branching is supposed to help us come up with compact trees, not many levels; lookup is logarithmic, but it still takes time to construct them. We prefer short, small trees. Occam's razor. Why is that? Why do we prefer small trees? What is the rationale that a smaller tree is a better tree? The same applies to networks: why do some of us still prefer shallower networks? If I can do it with seven layers, I will not use fifteen. Why? Because the probability that a small network, or a small tree, randomly fits a difficult problem is very low, while a gigantic tree can fit almost anything. The smaller it is, the more customized to the problem it has to be. So how do we avoid overfitting? One way: grow the full tree and then do some post-pruning. Let it grow as much as you want, then come back and say, okay, this branch I don't need, this branch I don't need, this branch I don't need. How do you know you don't need it?
K-fold cross-validation is also applicable to trees. So I can cut a branch and run the validation: does the accuracy drop or stay the same? If the accuracy stays the same, that branch was not necessary. Of course, if I want to implement this it's more than a half-day job; k-fold cross-validation running on a decision tree takes a little bit of time, but we have packages that help us. The other option is to stop early, when the branching is not statistically significant. I like that one more, because I generally do not prune the trees in my backyard either: I let the animals and birds prune them, and I don't touch them unless they are sick. So I will go with this: stop when the branching is not adding any value any more. And what is value? K-fold cross-validation, or leave-one-out: I hold out ten percent of the data for validation, and as I grow, every time I add branches, I validate. Am I getting the same result? Then why should I branch more, when my accuracy is staying the same? Okay, I will upload some additional information. What I didn't talk about is random forests, where I don't create one tree, I create many, many trees. So I have many decision makers, like a room full of experts. With random forests you can beat deep networks in many cases, so I will also provide some information on random forests.
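To give a flavour of the idea, here is a toy "forest" of one-level trees (decision stumps), each trained on a bootstrap resample of the 14 rows with a randomly chosen attribute, voting by majority. This is only a sketch of the two key ingredients, bootstrapping and random feature choice; real random forests grow full trees and re-sample features at every split, and all names below are mine:

```python
import random
from collections import Counter

# (outlook, temperature, play) for the 14 days of the lecture's table
ROWS = [
    ("sunny", "hot", "no"), ("sunny", "hot", "no"), ("overcast", "hot", "yes"),
    ("rain", "mild", "yes"), ("rain", "cool", "yes"), ("rain", "cool", "no"),
    ("overcast", "cool", "yes"), ("sunny", "mild", "no"), ("sunny", "cool", "yes"),
    ("rain", "mild", "yes"), ("sunny", "mild", "yes"), ("overcast", "mild", "yes"),
    ("overcast", "hot", "yes"), ("rain", "mild", "no"),
]

def train_stump(sample, attr):
    """One-level tree: map each value of column `attr` to its majority label."""
    votes = {}
    for row in sample:
        votes.setdefault(row[attr], []).append(row[2])
    return {v: Counter(labels).most_common(1)[0][0] for v, labels in votes.items()}

def random_forest(rows, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # bootstrap resample, same size
        attr = rng.choice([0, 1])                  # random attribute per tree
        forest.append((attr, train_stump(sample, attr)))
    return forest

def predict(forest, row, default="yes"):
    """Majority vote over all stumps; stumps that never saw this value
    fall back to a default vote (an arbitrary choice in this sketch)."""
    votes = [stump.get(row[attr], default) for attr, stump in forest]
    return Counter(votes).most_common(1)[0][0]

forest = random_forest(ROWS)
print(predict(forest, ("overcast", "cool")))
```

The "room full of experts" intuition is exactly the majority vote in `predict`: each weak tree is wrong in its own way, and averaging many of them cancels much of the error.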