Welcome back to the third module in our demo of different methods for predicting who will win individual head-to-head baseball games. So far we've looked at the Elo-based model from the 538 website and found that it actually does a pretty good job, though it's perhaps a little too conservative in assigning probabilities to the favorites. We've also looked at the win probability model from Bill James, which uses the fraction of games each team won to come up with the probability that one team will beat the other. What we found with that model was that we really need to know each team's true winning average at the end of the season to predict the games within that season. That's a really silly requirement, because we don't know the final record ahead of time; if we knew the final record, we'd already know who won each game. We also tried the record from the end of the previous season, which didn't do so hot, and the record up to the day of each game, which didn't do so hot either. So far the 538 model is looking pretty good and, like I said, it's a little bit conservative.
The third model, the one I want to look at with you today, is the betting line. The betting line reflects who people think is the favorite to win a game. People bet on all sorts of things and, for better or worse, people bet on baseball. We might even think about betting on baseball based on the results of these analyses. What betting lines reflect, in a way, is what we might call the wisdom of the crowd: people are putting their money where their brain is, or their heart, or their mouth. You might think the Cubs are the best team in the world, but how much are you willing to bet on that? For a single game, how much people are willing to bet on team A versus team B gets reflected in the betting line.
What people have found in a lot of different applications is that the crowd, in aggregate, actually does a pretty good job of estimating certain quantities. For example, people have gone to state fairs and asked, hey, looking at this cow, what do you think it weighs? Individual estimates might be all over the place, but the average of those estimates tends to home in on the right weight. People have done this for a lot of different things, not just the weight of a cow. So the question I have for today is this: when we look at the wisdom of the crowd as expressed in the betting line, how do those probabilities compare to the actual outcomes of the games people are betting on?
That's what we're going to do today. We're going to take the analysis we've done so far and integrate the probabilities calculated from the betting lines. There are a couple of things we need to figure out: we need a source of betting line data, and we need to know how to convert a betting line into an actual probability. This information isn't included on the 538 website, and I'm finding that it's actually kind of hard to find. So this is going to force us to learn a bit about how we can use R to parse HTML files and extract the information we want. We're going to do some web scraping to get at these betting line data.
As we've done so far, we're going to open up our GitHub issue tracker. Looking back at the code from before, one thing I forgot to do with you at the end of the last module was to commit and merge our branch into master, so I went ahead and did that after I signed off. I also closed issue two, so if we look at our issues now, both of them are closed. So I'm going to create a new issue.
The new issue is to validate the betting line model. I jotted down some notes earlier for a checklist to pop into the issue tracker. We need to find a site with the betting line data. We need to figure out how to convert the betting line data to probabilities. We need to calculate the probabilities from the betting line data. We need to make a table with the probabilities of winning for each game. We need to join that table, let's call it the win line probabilities data, into our favorite_win_prob data frame. And then we need to plot the performance of the win line model relative to the 538 and winning percentage models.
All right, cool. I'm excited to work with you and show you how we can go about extracting data from websites. We're so used to having data come to us as CSV files, but the Internet is full of data. If we can learn some simple tools like the rvest package from the tidyverse, we can use them to extract a lot of information from the Internet.
So there we are, and I'm going to create a new branch. I see that my master is in red, which tells me I've got some uncommitted changes. If I do git status, I see that I generated a file, day2.eps. This is a plot I was using as a thumbnail for my YouTube videos, as well as for the Twitter announcement of the new module. I'll mv day2.eps to my desktop. It isn't being tracked, so that's not going to cause any problems. Now we're back to green. I'm going to do git checkout -b to check out a new branch, and I'll call it validate betting line model.
All right, here we are. Again, if we do git status, it tells us we're on the validate betting line model branch, everything is up to date, and we're good to go.
So we need to find a website that will give us the betting line data. If we look at the 538 website, there's nothing there, so I'm going to Google for MLB betting line. Money line or betting line; I guess money line is another name for it. Let's click on this link and see what it tells us. The next game is tomorrow night between the Cubs and the Cardinals, and it doesn't actually have the betting lines yet. But further down it tells us that a money line, used in baseball and hockey, takes the place of a point spread. You're simply wagering on the contest at a given price rather than against a point spread, so the team you wager on has to win the game outright, regardless of the score.
A minus sign always indicates the favorite, whereas a plus sign indicates the underdog. Say the betting line is minus 130: if you bet $130 and win, you make $100, so you get back $230 in total. If you're betting on the underdog at plus 120, then a $100 bet wins $120, so you get back a total of $220. It's a better payout if you bet on the underdog because, of course, they're less likely to win. So this tells us how it works financially, but it doesn't tell us exactly how to convert that to a probability. We'll come back to that in a minute.
Let's click on some of these links to see what we get. Here's a betting line for the game tomorrow night. St. Louis at Las Vegas is minus 136, so they're the favorite: if you bet $136 on the Cardinals, you'll win $100.
Whereas if you bet $100 on the Cubs, you'll win $126, at least according to this site. What I'm not seeing here, however, is historical data: it tells us the odds for the next game but not for previous games. I'd love to get data going back to 1871, but I know that's a hard sell.
Here's a site called OddsShark. They don't have anything yet for the upcoming game. Let's click around on the money line and odds settings, and maybe on MLB. Again, this doesn't quite tell us what we want; it doesn't look like we can scan backwards. What we're looking for is the ability to look at, say, last week's games or last year's games, and I'm not seeing it here. Back to my Google search. Let's click on ESPN: it says the line data is currently unavailable. Hmm. This is a problem, right? We know what data we want, but sometimes the data is just hard to find.
I'm going to open up a few more links. Let's try Sportsbook Review. Here we have the game that's going to be played tomorrow night, and again there's no betting line data yet. But they do have a little calendar. If I go back to the 15th, which was Sunday, I see the Cubs versus San Diego. The Cubs were the favorite, minus 138 to plus 127, and the Cubs ended up winning. We also see the betting line data for the other games that day.
So this is for July 15th, 2018. Often, when a website makes a page like this for a specific day or a specific search query, it embeds that query in the URL, and sure enough we see 2018, 07, 15 up there. So if we change it to 2015, 07, 15... maybe that was the All-Star break, so let's put in 01. Ah, for 2015 we have data. If we go back to 2010, yep, we've got data for 2010. Let's try 2005. No data for 2005. Let's look at 2008. No data there. 2009.
We have data for 2009. Let me try 2008 again with a different date, because we might be hitting the All-Star break. If I put in 0501 for 2008, there's nothing there, but for 2009 we see this page. What this tells us is that we can get odds data going back to 2009. That's about nine years' worth of data, which is really exciting. It's not back to 1871, but that's still pretty cool. The other thing is that I can use R to generate the URL for any date on which there was a game, because the URL goes year, month, day.
The next thing I want to know, looking at this page from May 1st, 2009: at the top we have Miami versus the Chicago Cubs, with the win line odds, and the Cubs are heavy favorites over the Marlins that day. Can I get the names of the teams as well as the score and the betting line?
The tool we have for this is the browser's inspector. I'm using Safari, but Chrome and Firefox have something similar. If I right-click on the area of the page I'm interested in, I can choose Inspect Element, and that pulls up the HTML that goes into making that spot in the table. When I click on this div, it's highlighted across the middle of the page, so this div is the part of the HTML that represents these data. If I keep highlighting further down and click on this div, I see that I'm in the score box. This tells me the game is final and gives me the score: 6 for Miami and 8 for the Cubs. So that gives me my score information.
I'm also interested, obviously, in the teams, so how do I get those? It seems like the divs are in a weird order, because when I click on this one, it highlights the game numbers rather than the score.
If I look at the teams in this div, it tells me team 624, and in that span with the team name, MIA, Miami. Similarly, in here we see CHC. I wonder what this opener has. It must be hidden, because it's not shown where I'm highlighting; it's showing the score, which is a bit weird. So this must have been the opening bid. Then there's the event line consensus, and then the first book, where again we have the 250 and the minus 270, which I suspect corresponds to Pinnacle. I'm not sure what that is. Consensus... line histories... I'm going to hide that. That 230, I think, must have been the opening bid.
So what we're interested in getting is the date, the names of the teams, the scores (you can see the highlighted code changing as I hover), and then the information in the event line book tag. Again, we're going to try to extract this information from the table and then get our probabilities from that win line.
I'm going to copy this to use as my practice page. The library we're going to want is rvest, which is part of the tidyverse, I believe. For right now, I'm going to make a new file, an R script, to have something to play around with. I'll say read_html of this and save that as html_page. If we look at html_page, it doesn't show us much: it says it's an XML document with a head and a body, which are the first two layers of the web page. I'm not going to pipe that into anything yet; I'll keep it separate for now.
Looking back at the website, the div I'm interested in has a class called event lines. You can see that if I highlight this div, it covers pretty much exactly what I want it to cover.
So it's a div with the class event lines. I can take html_page and do get nodes with div, period, and the class name. Let me just check the capitalization: yep, it's lowercase event, uppercase Lines. And I'm sorry, the function isn't get nodes, it's html_nodes.
It takes a little bit of doing to figure out what exactly we're looking at. This is now returning the event lines div and everything contained within it. From this, we want another html_nodes call to get the div that describes each game. I see content delayed, but content final is probably more of what we want. And see how all of these have the classes event holder and holder complete? So instead of event lines, I'm going to try holder-complete. That didn't work; it should be div, period, holder-complete. Great. Now we see 15 divs, one for each of the games played on May 1. Each row of the table is represented by a different node from our HTML page.
If we look at the first div, the one with ID 21712, it should have all of the information we want. Just looking through here: it's got the score, then a check box that I'm not worried about, the time, and the teams. So we want the teams, and then we want the money line.
For practice, I think I can pull out the first node by indexing like this. I'm not sure what that output is, but I guess what we're looking at now is the content of the first node. And that didn't work. All right, so let's just work with what we have. What we want are the team names, so we do html_nodes with span, period, team-name. That gives us those span lines, and if we add html_text, we get back our two team names. I'll save this as team_names. With the 1 here, this gives me the team names for that first game.
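The node selection walked through above can be sketched on a small inline page, so it runs without hitting the site. This is only a sketch: the HTML is hand-written for illustration, its class names (eventLines, holder-complete, team-name, total) are my reading of what's described in the session rather than something copied from the real page, and the second game is a placeholder.

```r
library(rvest)

# Hand-written stand-in for one day's page. Class names are assumptions
# based on the walkthrough; the second game is a placeholder.
html <- '<div class="eventLines">
  <div class="event-holder holder-complete" id="21712">
    <span class="team-name">MIA</span>
    <span class="team-name">CHC</span>
    <span class="total">6</span>
    <span class="total">8</span>
  </div>
  <div class="event-holder holder-complete" id="placeholder">
    <span class="team-name">AAA</span>
    <span class="team-name">BBB</span>
    <span class="total">1</span>
    <span class="total">2</span>
  </div>
</div>'

html_page <- read_html(html)

# One node per game: "div.holder-complete" is a CSS selector meaning
# "a div whose class list includes holder-complete"
rows <- html_page %>% html_nodes("div.holder-complete")
length(rows)

# Team names and scores for the first game
team_names <- rows[[1]] %>% html_nodes("span.team-name") %>% html_text()
scores <- rows[[1]] %>% html_nodes("span.total") %>% html_text()

team_names
scores
```

Note that html_text always returns character values, so the scores come back as text and would still need converting to numbers.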
If I put a 2 here and looked at team_names, I'd get the Mets and Philly. So we'll leave that at 1 for now.
Similarly, we want to look at the scores. For now, I'm going to copy this line, because I want to parse the scores out the same way, changing the span team-name part. Let's try total and see what happens. Six and eight, which is the score here. Excellent.
Now we want the money line. Again I'm going to copy this line; we'll clean this up in a bit. We want the money line for that first game. Looking at the HTML: event line opener, consensus, and then the div for event line book. So let's see what happens when we do div, period, event line book. If we look at money_line, we get nothing. I'm going to focus on just this part. That gives us nothing too, so maybe I'm typing the class name wrong. Yep, the L should be capitalized. Now this gives us the 10 event lines. If I add html_text, we get the text for all of the lines: 250 and minus 270 is the one we want, and 245 and 265 is the next one. So instead of html_nodes, we can use html_node, which returns only the first match, and that gives us the text of the money line we want.
I'm curious what this looks like without html_text: we get two nodes with that event line book value class. I'd like the output to be a vector, not just one concatenated string, so I'm going to try another html_nodes call with the book value class and see what we get for money_line. And that gets it for us: we now get 250 and minus 270. Excellent. Just as a test, let's throw a 3 in here. What do we get for money_line? 120 and 128, and that's the third game there. Excellent.
All right, so what we want to do is create a function, parse_table_row. We'll say function.
And then we'll say row. I was going to give it the page and the row, with team_names starting from page, but actually we don't want to give it the page, we want to give it the rows. Basically, the place to call it would be right about here, after we've pulled out the row nodes: we do some type of map where we give it the rows as input and ship each one to parse_table_row. Inside the function, team_names starts from row, and the same goes for scores and money_line: we take out the old stuff and put in row. Then we return c of team_names, scores, and money_line.
Let's see what happens if we try this. Could not find function map. So let's load the purrr package. That worked, and now map runs. I think purrr was among the libraries we've been loading before; anyway, we'll figure that out when we bring this into the analysis. Right now I'm just sandboxing.
So we have map. Let's make that map_dfr. Argument 1 must have names. So we can get explicit about this: team1 equals team_names[1], team2 equals team_names[2], score1 equals scores[1], score2 equals scores[2], money_line1 equals money_line[1], and money_line2 equals money_line[2]. Let me double-check that this works. Argument 1 must have names, again. I'm not quite sure what that means. We're returning a vector here; maybe it would be better to return a data frame instead. So let's put tibble in there in place of c. If we run this, that gives us the output we want. That's pretty exciting.
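Here is a minimal sketch of the row parser and the map_dfr call described above, again run against hand-written HTML. The class names (holder-complete, team-name, total, eventLine-book, eventLine-book-value) are assumptions pieced together from the session, and the second game is a placeholder. Returning a tibble from each call gives map_dfr named columns to bind by row, which appears to be what the "must have names" error was asking for. The sketch maps over row indices instead of the node set itself, just to sidestep any differences between purrr versions; map_dfr also needs dplyr installed.

```r
library(rvest)
library(purrr)
library(tibble)

# Hand-written stand-in for the real page; class names are assumptions.
html <- '<div class="eventLines">
  <div class="holder-complete">
    <span class="team-name">MIA</span><span class="team-name">CHC</span>
    <span class="total">6</span><span class="total">8</span>
    <div class="eventLine-book">
      <div class="eventLine-book-value">+250</div>
      <div class="eventLine-book-value">-270</div>
    </div>
    <div class="eventLine-book">
      <div class="eventLine-book-value">+245</div>
      <div class="eventLine-book-value">-265</div>
    </div>
  </div>
  <div class="holder-complete">
    <span class="team-name">AAA</span><span class="team-name">BBB</span>
    <span class="total">1</span><span class="total">2</span>
    <div class="eventLine-book">
      <div class="eventLine-book-value">+120</div>
      <div class="eventLine-book-value">-128</div>
    </div>
  </div>
</div>'

parse_table_row <- function(row) {
  team_names <- row %>% html_nodes("span.team-name") %>% html_text()
  scores <- row %>% html_nodes("span.total") %>% html_text()

  # html_node() keeps only the first book; html_nodes() then returns its
  # two values as separate elements instead of one pasted string
  money_line <- row %>%
    html_node("div.eventLine-book") %>%
    html_nodes("div.eventLine-book-value") %>%
    html_text()

  # A one-row tibble has column names, so the rows can be stacked
  tibble(
    team1 = team_names[1], team2 = team_names[2],
    score1 = scores[1], score2 = scores[2],
    money_line1 = money_line[1], money_line2 = money_line[2]
  )
}

rows <- read_html(html) %>% html_nodes("div.holder-complete")

games <- map_dfr(seq_along(rows), function(i) parse_table_row(rows[[i]]))
games
```

At this stage every column is still character; converting the scores and money lines to numbers is a separate step.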
A couple of things I'm going to take care of here, because it will make life a lot easier: everything is formatted as a character, which isn't ideal. So I'm going to wrap both scores in as.numeric. The money lines have a plus or a minus sign, and if there's a plus, I want to get rid of it. So we do as.numeric around str_replace, replacing the plus with nothing in money_line[1], and the same for money_line[2]. This needs a second closing parenthesis for the overall tibble. Great.
Let's see what happens if we run all that. Empty pattern not supported. You know what, I think I've got my syntax wrong for str_replace: it wants the string first, and then the pattern and the replacement, and I had those out of order. So we come back here: the string is money_line[1], the pattern is the plus sign, and we replace it with nothing. And I think I need to escape the plus with a double backslash, because the plus sign has a special meaning in regular expressions. Excellent.
Now we want to scale this up to all of the dates. I'm going to cut this code and bring it back to analysis.R, down here before we do our tidying, because that's where we'll add the code to extract the web pages. I'm going to run all the code we ran yesterday, and then take the dates from favorite_win_prob by pulling the date column. That gives us all the dates, but first I need to filter on date greater than or equal to... what did we say was the earliest, 2009? So I'll filter for dates after 2009-01-01 and then pull the date column. That gets us those dates.
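The argument order and the escaping are the two things that tripped me up, so here is a small sketch of just that step. The input values are the ones from the first game; the variable names are only for illustration.

```r
library(stringr)

money_line <- c("+250", "-270")

# str_replace(string, pattern, replacement): the string comes first.
# "+" is a regex quantifier, so it is escaped as "\\+" in an R string.
cleaned <- as.numeric(str_replace(money_line, "\\+", ""))
cleaned

# An alternative that avoids regex escaping: match the plus literally
cleaned_fixed <- as.numeric(str_replace(money_line, fixed("+"), ""))
cleaned_fixed
```

Both versions give 250 and -270 as numbers, with the minus sign left alone.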
What you'll notice is that a lot of the dates in here are duplicated, so I'm going to push this through unique. Now we have one value for each date. What we see here goes back to 2013, and if I do tail to look at the end of the vector, we see it goes back to opening day, April 5, 2009. Excellent.
We need to save this as betting_dates, and I'll move things around a bit to put each step on its own line. I'd like to get my HTML pages now, so maybe instead of betting_dates I'll call this html_pages. We pipe the dates into paste0, where we give it the site's URL and replace the date part with what's coming through the pipe. But we need to remove the hyphens first, so we do str_replace_all on what's coming through the pipe, replacing the hyphen with nothing. Let me double-check that this works. Yep, that removes the hyphens. Then we pipe that into paste0 to generate html_pages, and we get our URLs. I'm going to grab a random URL to make sure it works. Excellent, that works.
So we've done this for one page, and we want to repeat it for all of the dates in our html_pages vector. We want to churn through all of them, creating this table for each and concatenating them all together. I'm going to create a new function called get_money_line_data that takes a URL. It will read in the page, extract the nodes, and give us the table for that day, however many games there were. Then we'll use a map function again to send all of the html_pages into get_money_line_data and get back the combined table. To test this, I'm going to create a variable, url. I'll steal one from what we had; let me use one from 2009. Then we say read_html of url and pipe that in here.
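The URL-building step above can be sketched like this. The base URL is a made-up placeholder, not the site's real address, and the dates are just a few examples with a deliberate duplicate; only the year-month-day pattern comes from the session.

```r
library(stringr)

# Hypothetical base URL; substitute the real one from the browser
base_url <- "https://example.com/odds/?date="

# A few game dates, with a duplicate as in the real date column
game_dates <- as.Date(c("2009-04-05", "2009-04-05", "2009-05-01"))

html_pages <- game_dates %>%
  unique() %>%                  # one entry per day
  as.character() %>%            # "2009-04-05"
  str_replace_all("-", "") %>%  # "20090405"
  paste0(base_url, .)           # the dot marks where the piped value goes

html_pages
```

Base R's format(game_dates, "%Y%m%d") would produce the same hyphen-free dates in one step, if you'd rather skip the string replacement.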
And this should work. This is the table for August 15, 2009. That's excellent. I'm going to comment out the test url so it's not a bother. Then we'll do map_dfr, where we give it html_pages and send each one to get_money_line_data. Let's see how long html_pages is. It's about 2,000 long.
Before I run the whole thing, I want to double-check a couple of things, so I'm going to make a test data set. I'll take the tail of html_pages, save it as test, and put that in here. I've got to read this whole function in. It's running, but it has to go and pull each page, so it's kind of slow. It tells us there are about 64 lines. Something I'm going to put in, just to know where things are and that things are moving, is a print of url, so every time through it spits out the URL it's working on.
One other thing I want to test: sometimes formatting changes over time, so I want to put in a 2013 date to make sure this all works. What we see in the output is that for the 2013 data, it's inserted the name of the pitcher into team1 and team2. So I'm going to come back up to my parse_table_row function and use str_replace_all, with the team name as my string, and then I need a pattern and a replacement. The replacement is going to be nothing. For the pattern, I'm going to put in a space, a hyphen, dot star, and then backslash n, dot star, because the dot reads to the end of the line but doesn't match the newline character. Let me run this and then this to see if we get improved output. It looks like nothing's changed. It's possible that this isn't really a space from the space bar; it might be a tab. So I'm going to put in a backslash s here, and I realize this other one should also be a double backslash.
Let me leave that the way it is with the double backslash n for now, because in R you need double backslashes, not just a single one. Run this, and it's still screwing up. So let's be safe and put in backslash s, which is the metacharacter that matches whitespace. We run this and this, and we see that for team1 we got it right. Excellent. So we copy this down and do the same for team2, reload, and now everything looks good. I'm going to try one more date. Let's try one from 2017, and hopefully this will work too. Excellent.
So now we're ready to run this for all of the dates, and I'm going to save the result as money_line. This is going to take a while because, like we said, it has to go through a few thousand pages. Hopefully my internet connection holds up. I'll sit back and watch and see what happens. Ah, it stopped because I had put in test rather than html_pages. Let's give it another shot.
Well, excellent. It took about 45 minutes or so, but it finally finished. If I take a look at the tail of money_line... you know what I forgot? I forgot to put in the freaking date. Dang, Pat. That was a fail. So let's see if we can redeem this. We're going to have to go back up here and figure out how to get the date out of the row. We'll want row piped into html_nodes, so let's go back and look at our HTML, and I'm going to grab that link as a tester. Where were we? Here. If I look at the node we were on, you'll see there's an attribute, rel, which has the date in it. The div's classes are event line, odd, and status complete. So we select the node with status-complete, then html_attr with rel, and we can plug that into the tibble as date equals date.
I hate it when I do that stuff. We run this, and now we have the date. Fantastic. It's formatted a bit weird, though, so if we test it with ymd, that's not going to work. We could do that thing again where we just strip that crap out; that'll probably be easiest. So str_replace on date, with a pattern of double backslash s, dot star, replaced with nothing. We run these lines, let them rip, and voila.
It would be nice to have that date in date format. I'm going to do a quick search for lubridate, convert chr to date. What do these people do? as date. Let's give that a shot. It probably wants a capital D: as.Date. Great, that looks right. Do I still have my test URLs in here? Let's test with money_line to make sure everything works. If we look at money_line, we get the dates, and if we do tail of money_line, we get everything too. I guess that's all there was; it's only six rows long, so tail doesn't really matter. So let's come back here and put in html_pages, and I need to take out my test url. We'll run this again. See you in another 45 minutes.
So we ran all that, and everything looks good. One thing I want to make sure of is that I don't have to rerun this code too many times. I suspect the website is going to get pretty pissed off if I do. So I'm going to add a write_csv command, with path equals data/money_line.csv, and give it money_line. In my terminal I'll do mkdir data, then run the write_csv line in my console. If I do ls data in my terminal, I see my file is there.
Now I'm going to take all the code I ran and copy it into a new R script. Hopefully you can appreciate that this is not the ideal way to be doing things.
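Here is a rough sketch of the two whitespace-based cleanups and the save-to-disk step described above. The raw strings are guesses at what the scraped text might look like (I'm assuming the date is followed by a time, and the team name by a tab, a hyphen, and a made-up pitcher name), so the patterns may need adjusting for the real pages. The file goes to a temporary directory here rather than data/.

```r
library(stringr)
library(readr)
library(tibble)

# Assumed shapes of the raw scraped values
raw_date <- "2009-05-01 19:05:00"
raw_team <- "MIA\t- J. Doe"

# "\\s.*" matches the first whitespace character and everything after it
clean_date <- as.Date(str_replace(raw_date, "\\s.*", ""))

# "\\s-.*" matches whitespace, a hyphen, and the rest of the line
clean_team <- str_replace_all(raw_team, "\\s-.*", "")

money_line <- tibble(
  date = clean_date,
  team1 = clean_team, team2 = "CHC",
  money_line1 = 250, money_line2 = -270
)

# Save once so the slow scrape doesn't have to be repeated
csv_path <- file.path(tempdir(), "money_line.csv")
write_csv(money_line, csv_path)

# Later sessions start from the file instead of the website
money_line_cached <- read_csv(csv_path, col_types = cols())
money_line_cached
```

Passing col_types = cols() lets read_csv guess the column types as usual while keeping it quiet about what it guessed.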
I guess I don't need wesanderson, and I don't think I need broom. That should work. Ideally this would be a function, and I'd give it the dates from favorite_win_prob to run it. But we've kicked this around enough that I'm going to stop for now and save it as get_money_line_data.R. I'll close that. Back where we took that code out of the analysis, we're going to do read_csv to read the file into money_line. I'll run this. I'm not sure what the argument name is, but anyway, that reads it in.
One of the things we need to think about is how to calculate the win probability. While the scrape was running, I did some Google searches and found this page on how to convert odds. If we search it for moneyline, we see converting moneyline odds. If you have a minus money line, which is the favorite, it's minus that number divided by minus that number plus 100. If you have a plus, it's 100 divided by the money line plus 100.
So let's write a function that does that. We'll call it get_money_line_prob, a function of x. If x is less than zero, so if it's negative, we do minus x divided by the quantity minus x plus 100; else, 100 divided by the quantity x plus 100. I'm going to put this up with my other function and load it into R.
Then I'll do a mutate. We've got money_line1 and money_line2, so I'm going to create money_prob1 equals get_money_line_prob... actually, I'm sorry, we're going to need a map here, because the if only handles one value at a time. So map_dbl, where we give it money_line1, comma, get_money_line_prob. I'm going to copy this down for money_line2 and close the parentheses.
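The conversion rule above can be written out as a short sketch. The function name follows what's said in the session; the test values are the money lines from the first game.

```r
library(purrr)

# Convert a single money line to an implied win probability
get_money_line_prob <- function(x) {
  if (x < 0) {
    # favorite: risk -x to win 100
    -x / (-x + 100)
  } else {
    # underdog: risk 100 to win x
    100 / (x + 100)
  }
}

# if() looks at one value at a time, so map over the vector
money_lines <- c(250, -270)
probs <- map_dbl(money_lines, get_money_line_prob)

round(probs, 3)
sum(probs)
```

For plus 250 this works out to 100 / 350, about 0.286, and for minus 270 it is 270 / 370, about 0.730. The two add up to a bit more than 1, which I take to be the bookmaker's margin, so these are probabilities as implied by the prices and not a set that sums to exactly one.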
For now I'm going to run that. Missing value where TRUE/FALSE needed. This makes me worried that something weird is going on, so I'm going to send money_line to summary and see what the output looks like. Sure enough, my money lines have some NAs in them. In my handy-dandy cheat sheets, there's a function in, I believe, the tidyr package, yes, under handling missing values: drop_na. It removes any rows that have NAs in their columns. So if we pipe the data through drop_na before the mutate, this works great. We no longer get that error, and we have the values there. Excellent.
Next, I'm going to steal code from up above, where we get the favorite and the probability of the favorite, and add that to my mutate. Instead of WP live I'm going to put in money, and instead of win_prob1 and win_prob2 I'm going to put money_prob1 and money_prob2. We run this, and we now have whether the money line favorite won, as well as the favorite's money line probability.
Then we can do our inner_join. I'm going to join favorite_win_prob with this, by date, team1, and team2. The output is two rows, which makes me a little worried that my team1 and team2 aren't matching up, that perhaps things are in a different order in the two data frames. If I look at this, and then at favorite_win_prob with n equals 15, I notice that in my money_line data frame I've got some two-character abbreviations, whereas in favorite_win_prob everything is three characters. Like KCR versus KC, and CWS versus CHW. So I'm going to need to convert between the two, and this is a pretty big pain in the butt. Let me see if I can do it quickly. I'm going to create a table where one column has the names from money_line and the other has the names from what we've been working with, and then we'll convert the names.
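The missing-value error and the drop_na fix can be reproduced on a toy table. The second row here is a placeholder game with no posted line, made up for illustration.

```r
library(tidyr)
library(tibble)

money_line <- tibble(
  team1 = c("MIA", "AAA"),
  team2 = c("CHC", "BBB"),
  money_line1 = c(250, NA),
  money_line2 = c(-270, NA)
)

# NA < 0 is itself NA, and if() cannot branch on NA; that is where
# "missing value where TRUE/FALSE needed" comes from
is.na(NA < 0)

# Drop every row that has an NA in any column
complete_lines <- drop_na(money_line)
complete_lines
```

drop_na also accepts column names, for example drop_na(money_line, money_line1), if only some columns should be checked.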
And so I'm afraid that this is going to require a fair amount of plugging and chugging. I can get my favorite win prob, and I only want the team names since 2009. So this is the list of those names. And so this is my old moneyline. I'm going to do the same thing. We have a CSV. I do pull team1 and we're going to do unique. And here we've got some random crap at the end that I don't want. And so we'll put that here. And like I said, I'm going to now make a table manually. I'm going to speed along here so you don't have to watch me do that. Alright, so I went ahead and created a new table for name convert. And if we look at it here, we see things like Anaheim Angels and Los Angeles Angels of Anaheim, whatever the names were, Arizona and Arizona, Atlanta, Chicago W and Chicago White Sox. Okay, so we need to add in here then a join. So we'll do inner_join. What will we do? We will pipe into a join with name convert, and for by we're going to use team1 equals ML. And if we look at what that generates, we then see that we've got the FWP name over here in this column. Before we rename that, I'm going to run it again to convert team2. And so now we have FWP.x and FWP.y. And I am going to select out team1 and team2. And we're going to rename. And of course, I forget how to rename it. That's on my dplyr cheat sheet: rename. So we're going to then rename FWP.x as team1 and FWP.y as team2. No, that's the other way around. And so now we've got team1 and team2. And these are what we saw before, you know, like we had KC and TB as part of the moneyline data, and now we have KCR and TBD. All right. So the other thing I noticed, when I look at this output where team1 and team2 are Toronto and Boston and so forth, if I look at the first 10 lines of, let's do whatever, favorite win prob, we see, so you have SDP, let me just find one where there's overlap.
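The two-pass name conversion can be sketched as below. The lookup table contents, its column names (`ml` for the moneyline abbreviation, `fwp` for the one used in favorite win prob), and the example rows are assumptions for illustration; the real table was built by hand and is not shown. Here "select out" is read as dropping the old team columns.

```r
library(dplyr)

# Hypothetical lookup table: moneyline abbreviation -> favorite_win_prob abbreviation
name_convert <- data.frame(
  ml  = c("KC", "TB", "CWS"),
  fwp = c("KCR", "TBD", "CHW")
)

# Hypothetical moneyline rows (not real data)
moneyline <- data.frame(
  team1 = c("KC", "CWS"),
  team2 = c("TB", "KC"),
  moneyline1 = c(-150, 120),
  moneyline2 = c(130, -140)
)

converted <- moneyline %>%
  # first pass converts team1; this adds a column named fwp
  inner_join(name_convert, by = c("team1" = "ml")) %>%
  # second pass converts team2; the clashing fwp columns become fwp.x and fwp.y
  inner_join(name_convert, by = c("team2" = "ml")) %>%
  # drop the old abbreviations, then rename(new = old)
  select(-team1, -team2) %>%
  rename(team1 = fwp.x, team2 = fwp.y)
```

Note that `rename()` takes `new_name = old_name`, which is the "other way around" that is easy to trip over. Because this uses `inner_join()`, any team missing from the lookup table silently drops its games, so it is worth comparing row counts before and after.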
So TBD and MIN are swapped here, with MIN and TBD. So that's a pain in the ass. And so when we do our join, if we had time, I'd go back and yet again change the ones and twos for generating the moneyline data. And so we're going to do team1 for team2, and team2 for team1. And actually, you know what I might do here would be to change it here as I read it in. And so I'm going to do team1 equals team2, team2 equals team1. And you know what, those are really all the places that matter, because we already have fav money won. And in fact, screw it. So if we join this now, we have a much larger data frame. And that's pretty sweet. So I'll also add score1 equals score2 and score2 equals score1. This looks pretty good. So this now has a lot of what we want. And I'm going to then select out these values. Great. And so we have our scores, whether the favorite won, and the probability. And I'm going to now save this back as favorite win prob. Get rid of that. Okay, and now we have our favorite win prob data frame. That's got 24,000 or 25,000 or so records since 2009. We now want to make this tidy. I'm going to add here the moneyline, or win prob. This is going to get fav money won and fav money prob, and gather win prob. We run this: tidy win prob. If we look at the tail of tidy win prob, there we go. Excellent. We can then... "missing argument to function call." Oh, that comma shouldn't be there. I'm going to then look at the overall win probabilities. And just to make it easier to see, I'm going to arrange by mean. And what do we see? So this is in ascending order. Let me make it descending. And we see that win prob, the moneyline model we've been working on today, is about as effective at predicting that the favorite wins as the win percentage using the current season. And it's a little bit better than the 538 model. So that's pretty slick.
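The tidy-then-summarize step can be sketched roughly as follows. The wide table, its column names, and the outcomes are invented for illustration, and only two models are shown; the real `favorite_win_prob` table has more columns than this. `gather()` is used because that is what the walkthrough uses, though tidyr now recommends `pivot_longer()` for new code.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per game, one "did the favorite win?" column per model
favorite_win_prob <- data.frame(
  game      = 1:4,
  fte_won   = c(TRUE, FALSE, TRUE, FALSE),
  money_won = c(TRUE, TRUE, FALSE, TRUE)
)

# Long format: one row per game and model
tidy_won <- favorite_win_prob %>%
  gather(key = "model", value = "won", fte_won, money_won)

# Fraction of games in which each model's favorite won, best model first
overall <- tidy_won %>%
  group_by(model) %>%
  summarize(mean = mean(won)) %>%
  arrange(desc(mean))
```

`arrange(mean)` sorts ascending by default, which is why the first attempt in the walkthrough showed the weakest model on top; wrapping the column in `desc()` flips it.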
What I'd like to do now is recreate that plot that we've been looking at of performance over time. We're not going to go all the way back to 1871 because we don't have the probability data, the moneylines, going back that far. But we should still be able to see some pretty good stuff. And I think all this stuff will stay the same, except we have WP current, and then we will also have win prob. Right. Let's give this a shot. And it's a bit hard to see. For some reason in here we have four lines, and we probably have more than that. I forgot to do our filter. So we're going to do a filter: model equals equals FTE, or model equals equals win prob, or model equals equals WP current. Pipe that. And so now we will get those lines. And so those are pretty steady over the last nine years or so. And I think we've got overall win prob coming in here in our geom_hline. I'm going to leave that for now. Actually, let's go ahead and just put the same filter up here. We get effectively the same thing. So you can see the little blue line right on top of WP current, and the 538 just below that. Okay. Excellent. So next we want to go ahead and look at how well the observed win probability matches the predicted win probability for all these models. And again I'm going to add in this filter to focus the data on these three models. Otherwise we'll have a whole bunch of lines overlapping each other, and it's just going to be a horror show. So this all, I think, looks good. I want to add in here my win prob and remove these two. Win prob. And so this is what it looks like, which is pretty slick. It's pretty messy, but you can imagine that most of the data that we have is really in this range between, say, 0.5 and 0.7, and that the three models do a really good job of replicating each other over this range. With the 538 we tended to see it overpredict. But again, that was with all the data going back to 1871. And so I need to update this to be since 2009.
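The comparison of observed and predicted win probability amounts to binning games by the predicted probability and computing the fraction of favorites that actually won in each bin. The sketch below is one way that could look, not the code used in the walkthrough; the model labels, column names, bin width, and data are all assumptions.

```r
library(dplyr)

# Hypothetical tidy table: one row per game and model (not real data)
tidy_win_prob <- data.frame(
  model    = c("fte", "fte", "win_prob", "win_prob", "wp_current", "wp_last"),
  fav_prob = c(0.52, 0.54, 0.61, 0.62, 0.58, 0.70),
  fav_won  = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

calibration <- tidy_win_prob %>%
  # keep only the three models being compared
  filter(model == "fte" | model == "win_prob" | model == "wp_current") %>%
  # bin the predicted probability to the nearest 0.1 (bin width is an arbitrary choice)
  mutate(bin = round(fav_prob, 1)) %>%
  group_by(model, bin) %>%
  # observed fraction of favorites that won in each bin, plus the bin size
  summarize(observed = mean(fav_won), n = n(), .groups = "drop")
```

A well-calibrated model has `observed` close to `bin` in every row. Keeping `n` alongside is useful because bins far from 0.5 to 0.7 contain few games and will look noisy. The three-way `|` filter could equally be written as `model %in% c("fte", "win_prob", "wp_current")`.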
And so I'm going to say the three models do an excellent job predicting the true fraction of games that the favorite won. Excellent. So that's pretty cool. So what we're seeing is that the wisdom of the crowd, with this win prob model here in blue, more or less matches what we'd get with the 538 model or the WP current model. So remember, the WP current model assumes that we have all the information about the season on day one to know the wins and losses heading into that game. Whereas the 538 bakes in information about, you know, pitcher and home field advantage and distance traveled and all that kind of stuff. And the wisdom of the crowd is probably taking all that into account too, as well as a bunch of homers who want to, you know, bet for the home team, their favorite team. And so to see that work out that way is pretty cool. Alright, so let's go back to our issue tracker. We figured that out. Calculated the probabilities. We made the table. We joined the moneyline probabilities with the performance, and that's all good to go. So I'm going to go back to my terminal and do git status. So we've got git add analysis. I want to see how big that file is before I add it. Yeah, it's pretty small. So I'm going to go ahead and git add the moneyline data. If that file was too big, then I wouldn't really store it in version control. And so I will do git commit -m, and the message will be "incorporate the moneyline probability model, closes #3". And I do git checkout master. I'm going to say no, I'm going to git merge. You know what? I'm a little bit worried. I'm going to git checkout validate betting line model. Save that. And that should be good. Okay, so then I'm going to git checkout master. Git merge validate betting line model. Git status, git push. And so refresh. All is good. Excellent. So we'll be back tomorrow with the final installment of these demos looking at win-loss predictions in Major League Baseball data.