Welcome back. We're ready to get going again. Hopefully you enjoyed the first demo, where we looked at the 538 Elo-based model: how it picks the favorite and what probability it gives that team of going on to win. As we saw, it's pretty hard to predict who's going to win a baseball game. I think the average was about 56% or so; that's how often the favorite actually went on to win the game using that Elo model.
One of the things that I think is really cool is that because the Elo model is a probabilistic model, like a lot of the political models that 538 and other sites are using these days, it gives you a probability that a team is going to win. They don't just say this team is the favorite; they say this team is the favorite and we think they're going to win 55% of the time. Because there's such a long arc of history in baseball data, we're able to go back and look at all the games that had a 55% favorite and ask, how often did the favorite actually win when they had a probability of 55%? And lo and behold, it's about 55%. So I would say that's a good validation. As we looked out at higher probabilities, what we actually saw was that 538's model was a little bit conservative. This tells us that when they say it's maybe 55%, it's actually more like a 57-58% chance that that team is going to win. I think that's pretty good.
So that's one model, relatively new: this Elo-based model, originally developed for chess, that takes into account the prior history of a team's record. As I said, until about a week ago, the Milwaukee Brewers were actually ahead of the Cubs in the standings, but according to the 538 model, the Cubs were by far ahead of the Brewers. Now the Cubs went on, I'm sorry, the Brewers went on an 0-for-5 tear, getting shellacked by the Pirates last weekend, and they've now fallen in the standings, and the Cubs are in first place with the Brewers in second.
So, will that be how the teams shake out at the end of the season? Who knows, but at this point at least, the Elo model is doing a better job of representing the relative rankings of those teams than the actual win-loss records did. And that's of course because we don't have all the information about the quality of a team until the season is done.
So, that's one model, Elo. Another model, developed by Bill James back, I think, in the early 80s, is based on the win-loss records of the individual teams: if you know one team's winning percentage and the other team's winning percentage, you can use those two averages to predict, with a certain probability, which team will go on to win the game. When those models were being developed, they assumed that we have a lot of information about the teams, that we know their true winning percentages. That's really hard on, say, day one, when everybody is zero and zero, to estimate or extrapolate what a team's true win-loss record will be at the end of the season.
So, something I want to show you: if we go to the Riffomonas website, under the 2018 All-Star break, in the initial description that I put up, I have the Bill James formula for predicting the probability that one team will beat another based on their win-loss records. We can use this formula to predict which team will win. So for today's demo, what I want to do is implement this model.
We have a question, though. What do we use as the winning percentage for the two teams? I think there are a few different options. One would be to base it on the team's record to that point in the season. I think that's the most informative, right? Say I'm a gambler and I want to bet on a team and see who's the favorite. I want to know where they are to this point in the season.
A second approach would be to say, well, let's use last year's record and see if last year's records can help us estimate the probability of a team winning games this season. And a third approach that we can try is a bit academic, and it's really how this has been implemented when people have been testing and developing the win percentage model: use the record at the end of the season. Now, I don't know the record of the Cubs or the Brewers or the Cardinals at the end of this season, because that's still two or three months away. But we can still use it as a way to validate that we're implementing this model correctly. Okay? So, again, this is what we're going to be doing. We're going to add this to our data analysis, and we're going to see how this approach compares to the Elo-based model.
So, we'll go back to our Finder, and I'm going to go ahead and double-click on my Rproj file. The other thing I'm going to do is go back over to my repository and start a new issue. I'm giving it a real title: Analyze win percentage model relative to 538 Elo model. What do I need to do? I need to calculate the winning percentage for each team prior to the start of the day's game. I also need to calculate the winning percentage for each team at the end of the season, and at the end of the last season. Then we need to integrate the winning percentages into our data frame with the 538 model, and we need to calculate the, I'm going to call it the WP probability, for each game. What else do we need to do? We're then going to want to plot the ability of the WP model to predict the favorite over time, and then we're going to plot the observed versus the expected probabilities for the WP models and the 538 Elo model. Simple, right? All right, I'm going to submit this as an issue and we'll get going.
I'm going to go to my terminal and create a new branch. So I'm going to say git checkout -b, validate WP model. Excellent. I'm then also going to open up my analysis.r file, and we're going to start to modify it to bring in new code for our winning percentage data. Let's get organized here a little bit. To remember where we were: we read in the game data. We then made this favorite_win_prob data frame from the game data, which had the season, date, team1, team2 (their names), whether the favorite won by the 538 model, and the probability of the favorite by the 538 model. Missing from this are the scores of the games, and we're going to need those scores to calculate the winning percentage, the winning average, for each team over time. So I'm going to go ahead and add score1 and score2, and I need to run all this, and it should work again. I've closed RStudio and reopened it, and hopefully it will continue to work, because we want this to be reproducible. There we go. So if we look at favorite_win_prob, we see now we have the season, the date, team1, team2, score1, score2, whether the favorite won, and the probability.
I'm going to push this to the side, because what we want to do now is calculate the winning percentage for each day of the season for each team, the winning percentage at the end of the season, and the winning percentage at the end of the previous season. I'm going to call the day-by-day winning percentage our win_losses_live; that's kind of our live record. We're not capping it or setting it at a specific value.
So we're going to say win_losses_live, and I'll leave that there, but I'm going to comment it out, because as I develop my pipeline I really like to see the output. I'm going to take favorite_win_prob and pipe that to a mutate, and we're going to determine who the winner and loser are for each game, because as we go through the season, I want to keep track of the number of wins and the total number of games each team has played by day. So I'm going to say win1 equals score1 greater than score2, and win2 is score2 greater than score1.
And then who is the winning team? What I'm going to do is make a new column that pastes together the team name and whether or not they won. I want to get to a point where I have a data frame with the date, the name of the team, and whether or not they won on that day. Because then I can take that data frame, group by the team and by the season, arrange the data by date, and use a cumulative function to see what fraction of its games an individual team has won as of each date. Then, if I know the record for each team at any given time, I can join that back to my large favorite_win_prob data frame to see the winning average at that point in the season.
So I've got these two columns, or I guess four columns: the names of the two teams and whether or not each one won. I want to push that all together, and the way I'm going to do it is to paste the two sets of columns together, then gather to make it tidy, and then split it back apart. You'll see as I go through here. So we'll do team_win1 equals paste, team1, win1, sep equals underscore.
So it's important to give it a separator that we can come back and separate on. team_win2 is paste, team2, win2, sep equals underscore. And if we look at this, we see that in team_win1 we have the Padres, false. And if we looked at team_win2, that comes out true.
We're now going to gather to create a column called one_two. That's going to be the key, and its values will be team_win1 and team_win2. The second column, the value, will be team_win, and we gather team_win1 and team_win2. So one_two is kind of a dummy variable that has team_win1 and team_win2 as the alternating values throughout the data frame, and the value is going to be this pasted-together thing, like SDP_FALSE. Okay. Let me make this a little bit wider and we'll run this again. What we see now is that we have one_two, so this is team one. If we were to look at the end of this, we'd see team two, and then we have the name of the team as well as whether or not they won.
From that, we will then separate. We're going to separate that team_win column into team and win, and our separator is the underscore. What we see now is that we still have this one_two column, but that's really not relevant. We have the team and whether or not they won.
Our data frame is now significantly larger. It's got 433,000 rows, whereas if we were to look at the dim of favorite_win_prob, it's half the size, right? We've basically doubled the number of rows, because now each team in each game has its own row. So this game has been doubled so that both SDP and CHC have a Boolean value to indicate whether or not the team won. Okay. So we've separated that.
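Here is a minimal sketch of that paste, gather, separate pattern on a two-game toy data frame. The scores and dates are invented, and the column names are my best guess at what is being typed, so they may not match the real script. I have passed convert = TRUE to separate so that win comes back as a logical rather than a character string.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the game data; dates and scores are invented for illustration.
games <- tibble(
  season = 2018,
  date   = as.Date(c("2018-07-13", "2018-07-14")),
  team1  = c("SDP", "SDP"),
  team2  = c("CHC", "CHC"),
  score1 = c(4, 6),
  score2 = c(5, 2)
)

team_results <- games %>%
  mutate(
    win1 = score1 > score2,
    win2 = score2 > score1,
    # Glue team and result together so they travel as a single value.
    team_win1 = paste(team1, win1, sep = "_"),
    team_win2 = paste(team2, win2, sep = "_")
  ) %>%
  # Stack the two pasted columns: one row per team per game.
  gather(key = "one_two", value = "team_win", team_win1, team_win2) %>%
  # Split the pasted value apart again; convert = TRUE turns "TRUE"/"FALSE" into logicals.
  separate(team_win, into = c("team", "win"), sep = "_", convert = TRUE) %>%
  arrange(date)

print(team_results)
```

gather still works but has been superseded in tidyr by pivot_longer, so newer code might reach for that instead.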
The other thing to note is that, unfortunately, when we separated it, win came out as a character. I believe we can say convert equals TRUE, and we should now get a logical. Okay, that's what we want. Otherwise, when we do things like our means and sums, R isn't going to know what to do with the character values.
We're going to now arrange the data frame by date, and so now we see those 1871 games. We have Fort Wayne and Cleveland, I don't know what WS is, and Atlanta, and so forth. So we've sorted the data frame by date, and now we're going to group the data by season and team, and then do a mutate for wins, losses, and average. For wins, we can do cumsum of win; for losses, cumsum of not win, exclamation point win; and the average is wins divided by wins plus losses. Of course, there are many ways we could do that. Instead of losses, we could have said games and used the n function, whatever. We'll get the right answer.
So if we run this now: object AVG not found. I put in two equal signs. Normally I put in only one equal sign when I was supposed to do two. Never have I ever put two when I was supposed to put one. All right, here we go. To make it easier to see, I'm going to pipe this, and we're going to select team equals CHC and season equals 2018. Why isn't it doing that? I must have a typo in here somewhere. Sorry, this should be filter, not select. So this is the Chicago Cubs and what they've done over the 2018 season. And now let's select date, team1, team2, and we'll do wins, losses, AVG. So we now have the development of the team's record over the course of the season. This shows the first week or so, where the Cubs played the Marlins, Cincinnati, Milwaukee, and Pittsburgh, right? So at about 10 games into the season, they were .500.
It took the Cubs a while to get going this year, okay? So again, this is the Chicago Cubs' record for every given day in the 2018 season. But the other thing you should note is that this is the record after that game was played. On the morning of March 29th, the Cubs were zero and zero. On the morning of April 10th, the Cubs were five and four. So we need to add a lag to indicate the record at the beginning of the day. I don't want to know the record at the end of the day; I want the record at the beginning of the day.
Over here, we can add a function for lag. There's a lag and a lead function in dplyr. So if we run this, and again, I wrote select when I needed filter for team and season 2018. I'm going to just copy this over here, because this will drive me nuts. And I'm going to add my select to do date, team1, team2, wins, losses, AVG. And again, that's select, not filter. All right. So we see that it puts in an NA for that first game. We saw that the Cubs won their first game, so they were one and zero at the start of the second game on March 30th. What we want to do is replace those NAs with zeros.
What we can do is take na.omit and wrap that around these two. And if we run that, it's going to complain, because the column wins must be length 31, not 30. The problem is this: say x is 1 to 10, and I do lag of x. I now have that NA in the first spot. If I do na.omit of lag of x, I now have a vector that's one unit shorter. So what I can do is c, zero, comma, na.omit of that. And that's what I want to do. So we're going to do c, zero, comma, that, and this should work. Although I think for my AVG, it's going to give me an NA. So let me think, how do we get that not to be an NA? Let's do the same thing we did up here with the wins and losses, with na.omit on that.
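To make the lag problem concrete, here is a small sketch with invented results for one team. It first shows why na.omit(lag(x)) comes back one element short, then builds the record as of the start of each day. Instead of c(0, na.omit(lag(...))), this version uses the default argument of dplyr's lag, which is a swapped-in alternative that fills the first slot directly; the column names are illustrative guesses.

```r
library(dplyr)

# Why na.omit() alone fails inside mutate(): the result is one element short.
x <- 1:10
length(na.omit(lag(x)))        # 9, the leading NA was dropped
length(c(0, na.omit(lag(x))))  # 10, padded back to full length with a zero

# Invented results for a single team, one game per day.
games <- tibble(
  team = "CHC",
  date = as.Date("2018-03-29") + 0:4,
  win  = c(TRUE, FALSE, FALSE, TRUE, TRUE)
)

record <- games %>%
  arrange(date) %>%
  group_by(team) %>%
  mutate(
    # lag() shifts the running totals down a row, so each row holds the
    # record going into that day's game; default = 0 fills the first row.
    wins   = lag(cumsum(win), default = 0),
    losses = lag(cumsum(!win), default = 0),
    # Before any games are played there is no average; use 0 as the walkthrough does.
    avg    = if_else(wins + losses == 0, 0, wins / (wins + losses))
  ) %>%
  ungroup()

print(record)
```

Either approach should give the same numbers; the default argument just avoids the length mismatch in the first place.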
Let's see if that works. That's great. So that replaces it with a zero, and we've got what we want.
Now let's simplify this data frame. As we look at it, we see that the data frame is currently grouped by season and team, so we need to ungroup it. We're going to do ungroup, and then we want to select the season, the team, the date, and the AVG. I don't think we need the wins or losses. That should work; let's see what this looks like. So that's the AVG over time. If I pipe this to get my Cubs, I guess I don't need this. That works. I could test this by doing print, n equals Inf. And this tells me that the Cubs' winning percentage right now is .587. And if you go to ESPN.com and look at the current standings, it is .587. So this all works great. Now we have this variable, win_losses_live, that we can use to join into our larger data frame, our favorite_win_prob data frame, to add in a column where we can then get the predictions based on the live wins and losses.
The next thing I need to add is the season win-losses. For this, we're also going to bring in favorite_win_prob, and most of this is going to be fairly similar, so we're going to basically copy all this down. Again, I'm going to just develop the data frame: I'm going to comment out the variable name, get rid of that pipe, and run this to see what it looks like. You'll remember we have the season, the date, team1, team2, but that junk really doesn't matter. What really matters is the team and what they did on that date in that season. So we've got that, and win is a logical; just a double check. We'll then pipe that into a group_by. We've got this big data frame, and we're going to group by the team and the season.
And then we're going to summarize to get current_avg, the average for the current season, and we're going to do mean of win. If we run that, we then see each team, the season, and the current average. So again, if we filter on team equals CHC, we see the Cubs' records over the first however many years. So that's the current season.
Now, I'd like to get the past season, and you'll see that this is currently grouped by team. What we then want to do is a mutate, because we now want to make the average for the last season. We'll call it prev_avg, and this is again going to use lag. So we're going to do c, something, comma, na.omit of lag of current_avg. And the question is, what do we want to put in there for that first value? You can imagine that if this is 1871, we don't have the record for the teams in 1870, because they didn't exist. Similarly, as the league went through expansion and added teams like the Arizona Diamondbacks, that team didn't have a record the year before it existed. So I'm going to just put in 0.5 to give them the benefit of the doubt. Typically expansion teams really suck, and maybe we should make this more like 0.3. But you can change that in your own code and see if it really matters.
So let's run this and see what we get. What you'll see is that up here, Anaheim had .432 in 1961; they were one of the expansion teams of the early 60s. Then .432 shows up the next year, for 1962, as the previous average, and the previous average for 1961 should be 0.5. So that all looks great. This is still grouped by team, so I'm going to ungroup it. That's all good. And we now need to assign this to win_losses_season. Excellent. So now we have our win-losses as we go through the season, the win-losses for the end of that season, and the win-losses for the previous season.
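As a sketch of the season-level step, the snippet below summarizes invented per-game results into an end-of-season average and then lags it within each team to get the previous season's average, filling a team's first season with 0.5. The team codes, data, and column names are made up for illustration. One caveat worth keeping in mind is that lag takes the previous row, so a team with a gap in its seasons would pick up its last recorded season rather than literally the prior year.

```r
library(dplyr)

# Invented per-team, per-game results across two seasons.
team_results <- tibble(
  team   = rep(c("AAA", "BBB"), each = 4),
  season = rep(c(2017, 2017, 2018, 2018), times = 2),
  win    = c(TRUE, TRUE, TRUE, FALSE,    # AAA: 2-0 in 2017, 1-1 in 2018
             FALSE, FALSE, TRUE, FALSE)  # BBB: 0-2 in 2017, 1-1 in 2018
)

win_losses_season <- team_results %>%
  group_by(team, season) %>%
  # mean() of a logical is the fraction of TRUEs, i.e. the winning average.
  summarize(current_avg = mean(win), .groups = "drop") %>%
  arrange(team, season) %>%
  group_by(team) %>%
  # A team's first season has no predecessor, so assume a .500 record.
  mutate(prev_avg = lag(current_avg, default = 0.5)) %>%
  ungroup()

print(win_losses_season)
```

Changing the default to something like 0.3 is a one-word edit if you want to test how much the expansion-team assumption matters.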
What we want to do is take these three pieces of information now and fold them into our favorite_win_prob data frame, so that we have all four models, these three plus the 538 Elo model, next to each other in the same data frame, where we can easily compare them to each other. So we need to join win_losses_live and win_losses_season into favorite_win_prob.
dplyr has some really nice tools for joining data frames that we're going to make use of here. We're going to take favorite_win_prob and do an inner join, where we join favorite_win_prob with win_losses_live. What are we going to join by? We're going to do it by team1 equals team: the team1 column from favorite_win_prob and the team column from our win_losses_live. We're also going to have it join on season and date. And we're going to copy this, because we're then also going to join on team2. And at this point, what did I do wrong? This needs a c. So now we have season, date, team1, team2, score1, score2, whether the 538 favorite won, its probability, and then the averages for those two teams at that point in the season. Okay. So when the Padres and Cubs played, their winning averages were .408 and .587, respectively.
Now we want to use those winning averages to calculate the win probability. So we'll do a mutate with a win_prob1 for team one and a win_prob2 for team two. Similar to what we've done elsewhere, we're then going to want to add whether or not the favorite won under the WP model, which I'm going to call fav_wp_live_won, and fav_wp_live_prob. To get the win prob, what we'd like to do is call get_wp. And what are we going to give it? We're going to give it AVG.x and AVG.y. For team two we can do one minus win_prob1. And here we're going to do an if-else on win_prob1 greater than win_prob2.
Then team one was the favorite, so we want score1 greater than score2. Otherwise it's going to be score2 greater than score1. This is similar to what we'd done before. And here we're again going to do an if-else: if win_prob1 is greater than win_prob2, then team one is the favorite and we want to return that win probability. Otherwise team two is the favorite and we want to send back its probability, win_prob2.
So you're probably saying, Pat, what's this get_wp function? We haven't defined that yet. In fact, if we run this, it's going to complain: could not find function get_wp. So we need to define get_wp. I like to put all my functions up at the top of my code. We need a function get_wp, and I'm going to give it a and b as the two averages that are coming in from down below, where I ran my mutate. If I give get_wp two averages, I want it to tell me the probability that the team with the first average is going to win. You'll recall the formula for calculating the win probability, so we want to put that in now. It's going to be a times one minus b, divided by a times one minus b plus b times one minus a. So this is the win probability. If we run this and call get_wp with, say, 0.6 and 0.5, then the probability that the team with the 0.6 will win is 0.6. Great.
Something we might notice, though, is that if we do get_wp of 0, 0, say it's the first day of the season, it's going to give us not a number. If we do get_wp of 1, 1, it's also going to give us not a number. Those would generally happen on the first or second day of the season. If we do get_wp of 0, 1, it gives 0; get_wp of 1, 0 gives 1. So we need to add some logic here to say, if a equals 0 and b equals 0, and I think we want the double ampersands, then I'm going to make the call that it should return 0.5. Else return this value.
So again, if we run this and do get_wp of 0, 0, it gives 0.5. If we do 0, 1, it's still 0. If we do 1, 1, it's still not a number. So we need to add some more logic. I'm going to copy that, so that if both of those are 0 or both of them are equal to 1, it returns 0.5. Now get_wp of 1, 1 gives 0.5; get_wp of 0, 0 gives 0.5; get_wp of 0, 1 gives 0. If we do 0.5, 0.5, we get 0.5. And if we give it 0.7, 0.7, we also get 0.5. Excellent. So we have a function now that works and that we can use to calculate the win probability.
Where were we? If we run this now, do we get an error? We see now that we've got win_prob1 of .327 and win_prob2 of .673. Let's put these numbers in just to double check: 0.408 and 0.587 gives us about 0.327. That's great. I'm going to run this again just to make sure it behaved correctly at the tail. We're still getting these NaNs for 0 and 0. I think one of the problems here is that we ran get_wp by giving it both vectors in their entirety, and it's really not set up to be vectorized. What we want to do instead is map. So we do map_dfr, and we send it AVG.x and AVG.y, and then we give it get_wp. What this does is step through these two vectors, AVG.x and AVG.y, the two columns of the data frame, and for each pair of values it runs those through get_wp. map_dfr would return a data frame, basically a single-column data frame. Actually, we don't want dfr, we want double, so dbl. map_dbl doesn't return a data frame; it returns a double, which is a numeric vector. So if we run this: Result 1 is not a length 1 atomic vector, for win_prob1. I used map instead of map2. map is for when you have a single vector, map2 is for when you have two vectors, and I think it's pmap when you have any number of vectors that you're feeding into a function. So now if we run this, that works.
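Pulling the pieces of this step together, here is a sketch of what get_wp might look like with both edge cases handled, followed by map2_dbl applying it pairwise to two vectors. The function body follows the formula as it is read out above; the exact layout of the real function is an assumption, and the input vectors are just the spot-check values from the walkthrough.

```r
library(purrr)

# Probability that a team with winning average `a` beats a team with
# winning average `b`. Written for scalar inputs only.
get_wp <- function(a, b) {
  # 0 vs 0 and 1 vs 1 would divide zero by zero; call those a coin flip.
  if ((a == 0 && b == 0) || (a == 1 && b == 1)) {
    return(0.5)
  }
  a * (1 - b) / (a * (1 - b) + b * (1 - a))
}

avg_x <- c(0.408, 0, 1, 0, 0.7)
avg_y <- c(0.587, 0, 1, 1, 0.7)

# map2_dbl() walks the two vectors in parallel, calls get_wp() on each
# pair, and collects the results into a plain numeric vector.
win_prob1 <- map2_dbl(avg_x, avg_y, get_wp)
win_prob2 <- 1 - win_prob1

print(round(win_prob1, 3))
```

The reason the function cannot simply be handed whole columns is that if with && and || is meant for single values, not element-wise comparison over a vector, which is why the mapping step is needed. A vectorized rewrite with an element-wise conditional would be another option.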
And if we then look at the tail to see if this is behaving where the inputs are zero, we now see that we get the right result of .5 and .5. So again, we use map2_dbl to take two columns and, for each row in those two columns, send the values to get_wp and return the results as a double vector. These map functions are really powerful, and there's a whole bunch of different ways that we can output the data.
We're going to pipe this into a select, because we don't necessarily want all of this information. We're going to return the season, date, team1, team2, score1, score2, fav_538_won, fav_538_prob, and we want fav_wp_live_won and fav_wp_live_prob. So we run all this. We now get season, date, team1, team2, the scores of those games, whether the favorite by the 538 model won and its probability, and whether the favorite by the live win percentage won and its probability. Excellent.
Now what we're going to do is repeat this, but we're going to join win_losses_season into, let me take this out, favorite_win_prob. What I need to do is update this so that we write over what we currently had, so that favorite_win_prob now has those columns. Now we want to add four more columns: two for the prediction based on the current season's end-of-season record, and two for the end of the previous season's record. It's going to be very similar, so I'm going to copy these first couple of lines down and work off of them. Instead of win_losses_live, we want win_losses_season, and we'll join this. Date is missing from the right-hand side, win_losses_season. I don't want to join by date, because I'm just looking at the season, right? win_losses_season doesn't have a date column. So I'm going to remove date from the join, and we now see that we have added current average, previous average, current average, previous average. The x's are the columns for team one, and the y's are the columns for team two.
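The two-step join can be sketched on a single invented game. The by argument pairs team1 (and then team2) in the game data with team in the lookup table, while season and date match on their shared names. Because both joins bring in a column called avg, dplyr adds the .x and .y suffixes on the second join. The date and column names here are illustrative assumptions.

```r
library(dplyr)

# One invented game and the two teams' records going into it.
games <- tibble(
  season = 2018,
  date   = as.Date("2018-07-13"),
  team1  = "SDP",
  team2  = "CHC"
)

win_losses_live <- tibble(
  season = 2018,
  date   = as.Date("2018-07-13"),
  team   = c("SDP", "CHC"),
  avg    = c(0.408, 0.587)
)

joined <- games %>%
  # First pass attaches team1's average ...
  inner_join(win_losses_live, by = c("team1" = "team", "season", "date")) %>%
  # ... second pass attaches team2's; the clashing `avg` columns become avg.x and avg.y.
  inner_join(win_losses_live, by = c("team2" = "team", "season", "date"))

print(joined)
```

For a season-level table with no date column, the same pattern applies with date dropped from by.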
And so now what we're going to do is add another mutate. We'll do win_prob1 equals, and I'm going to live on the edge again and copy these down. Remember, it's current_avg and prev_avg, so I'm going to copy these. This is going to be our win_prob1 and win_prob2. And then this is going to be fav_wp_current_won and fav_wp_current_prob. This then is going to be win_prob1, bam, bam, bam. I think that's all good. I'm going to copy these down. Oops, copy those down. And instead of current_avg, we're going to do, what is that here? prev_avg. And then this is going to be fav_wp_prev. If we run this, we now see that we've got the current_avg columns, those four columns we added, the win probs, and then whether the favorite won and the favorite's probability from the current season, as well as from the previous season.
I'm then going to select, similar to what we had up here, so I'm going to copy this select command down. But we're going to add in fav_wp_current_won, fav_wp_current_prob, fav_wp_prev_won, and fav_wp_prev_prob. So now we have our data frame looking like we want it to, and we have it stored as favorite_win_prob.
I'm going to go ahead and revisit our checklist. We have calculated the winning percentages. We've integrated them into our data frame with the 538 model. We've also calculated the winning percentage model probability for each game. We now want to make the plots, so we'll return to RStudio. To calculate our overall winning percentage, actually, we're not quite ready for that. We want to get our data into a more tidy format. The idea of tidiness is that each column holds the same type of data. So I would like to have a column that says model, which might be 538, WP live, WP current, or WP prev, and then the probability, and then whether or not the favorite won.
So instead of having eight columns that are very difficult to compare across, I want to have three columns that tell us the model, whether or not the favorite won, and the probability that the favorite would win. We saw this earlier when we were calculating the winning percentage across the season. So we now want to make the data frame tidy.
We're going to take favorite_win_prob and we're going to mutate. I'm going to create a column called FTE for 538, because it's not ideal to have column names that start with a number. So I'll have FTE, and I'm going to again do my paste, where I paste together fav_538_won and fav_538_prob, sep is the underscore. I'll then also add WP_live, where we paste fav_wp_live_won and fav_wp_live_prob, sep is the underscore, comma, and then WP_current is paste of fav_wp_current_won and fav_wp_current_prob, sep equals underscore, and WP_prev is paste of fav_wp_prev_won and fav_wp_prev_prob, sep is underscore. So these are our four models, and that ends our mutate. We run this and we now get these extra columns that we can't quite see. If we select FTE, WP_live, WP_current, and WP_prev, we then get these columns, right?
What we're going to do now is like what we did before with gather. We're going to gather these four columns together, and then we're going to separate them on the underscore, where the first column will be whether the favorite won and the second column will be the probability. Great. Before I gather, I'm going to remove those fav columns that are taking up a bunch of space. So I'm going to select, minus, starts_with, and then quote fav. When we run this, we'll see that we no longer have those fav columns, right? So it's already a bit more compact. We'll then do gather. So we'll gather to create a column model and then we'll create won_prob.
And the columns that we're going to gather together are FTE, WP_live, WP_current, and WP_prev. All right. If we fire that up, we now see that we've got basically the same thing, but now we have the model column with FTE, WP_live, WP_current, and WP_prev, as well as the won_prob column. Now we want to separate won_prob into won and prob, and we're going to separate on the underscore. This now gives us our tidy data frame, where for every season, every game, and every pair of teams, we have the score, the model, whether the favorite won, and the probability that the favorite would win. I'm going to save this now as my tidy_win_prob.
Now I'm ready to revisit this overall win prob, where I take tidy_win_prob and pipe that to a group_by. I'm sorry, we don't want to group by season; we want to group by model, because we want to look at the overall win percentage across all years. So group by model, and then we're going to do mutate, I'm sorry, nuts, not mutate, I'm going to do summarize: mean equals mean of won. That's not good. We have this problem again, where we did the separate but it turned the values into characters. So up here we want convert equals TRUE. We rerun that, look at tidy_win_prob, and we now see that it's formatted correctly. And if we run this, we now see the overall fraction of games where the favorite actually won. If we use the current season's end-of-season win-loss records to calculate our probabilities, that gives us the best model. Of course, that does us no good, because I can't see into the future. We'll call this overall_win_prob, and we'll use it as we go along. So now we want to plot the fraction of games that the favorite has won over the history of baseball.
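The tidying step just described can be sketched on a toy data frame with only two of the four models. The column names (fav_538_won and so on) are my reading of what is spoken, and the values are invented. The flow is to paste each model's won and probability columns together, drop the originals, gather the pasted columns into model and won_prob, and separate with convert = TRUE so that won is logical and prob is numeric.

```r
library(dplyr)
library(tidyr)

# Two invented games with predictions from two of the models.
favorite_win_prob <- tibble(
  season           = c(2017, 2018),
  team1            = c("SDP", "SDP"),
  team2            = c("CHC", "CHC"),
  fav_538_won      = c(TRUE, FALSE),
  fav_538_prob     = c(0.58, 0.61),
  fav_wp_live_won  = c(TRUE, TRUE),
  fav_wp_live_prob = c(0.67, 0.55)
)

tidy_win_prob <- favorite_win_prob %>%
  mutate(
    FTE     = paste(fav_538_won, fav_538_prob, sep = "_"),
    WP_live = paste(fav_wp_live_won, fav_wp_live_prob, sep = "_")
  ) %>%
  # Drop the wide per-model columns now that they are pasted together.
  select(-starts_with("fav")) %>%
  gather(key = "model", value = "won_prob", FTE, WP_live) %>%
  # Without convert = TRUE both new columns would be character.
  separate(won_prob, into = c("won", "prob"), sep = "_", convert = TRUE)

# Fraction of games in which each model's favorite actually won.
overall_win_prob <- tidy_win_prob %>%
  group_by(model) %>%
  summarize(mean = mean(won))

print(overall_win_prob)
```

With the data in this shape, adding the other two models only means adding two more pasted columns to the mutate and the gather.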
We're going to group by season and by model, and we're going to then summarize the fraction that the favorite team won, so this is going to be the mean of won. Okay, so it's basically the same thing we did here, but here we're also grouping by the season. I'm going to do an ungroup to liberate that, and then we're going to pipe that into ggplot. So x, and then y is fraction favorite won, which is not there. We're going to group by model and color by model. To modify our geom_hline, we're going to give it data, which will be our overall win prob. Our aes is going to be yintercept, which here will be mean, group will be model, color will be model, and we can get rid of this color light gray. Then we have our geom_line and theme_classic. Our titles are going to have a bit of a problem, so let's just run this and see what we get. Column model is unknown. So let's see where it gagged here. If we run that, model unknown. We gave it fraction win prob; we want to give it tidy win prob. So we run that, and we see what we've been looking at, where we've now got our four models, FTE, WP current, WP live, and WP prev, along with the four horizontal lines. The data look pretty messy, but on the whole, as we saw, WP current does the best. Again, that's where you use the winning averages to calculate the win probabilities, but the averages come from the end of the season and are applied to games in season. Actually, using the previous year does the worst, and the live comes in third. For our labels, we can throw in season and fraction of games won. We'll say that the 538 model... Let's see. Let me get rid of some of this. So: the winning percentage model can outperform the 538 ELO model if it uses the end-of-season winning averages. And we talked about how that's kind of useless. So we run this. Let me just throw in a line break here to help it format a little bit nicer. There we go. I can also add in scale_color_manual, and we can then say name equals NULL.
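The per-season plot with one horizontal line per model might look something like the sketch below. The toy data, the column names, and the object names are assumptions for illustration; the point is the pattern of passing a second, summarized data frame to geom_hline and putting a line break in the title with "\n".

```r
library(dplyr)
library(ggplot2)

# Toy tidy data: two games per season per model (values assumed)
tidy_win_prob <- tibble(
  season = rep(c(2016, 2017, 2018), each = 4),
  model = rep(c("FTE", "FTE", "WP_live", "WP_live"), times = 3),
  won = c(TRUE, TRUE, TRUE, FALSE,
          TRUE, FALSE, FALSE, FALSE,
          TRUE, TRUE, TRUE, FALSE)
)

# One overall mean per model, used for the horizontal reference lines
overall_win_prob <- tidy_win_prob %>%
  group_by(model) %>%
  summarize(mean = mean(won))

season_plot <- tidy_win_prob %>%
  group_by(season, model) %>%
  summarize(fraction_favorite_won = mean(won)) %>%
  ungroup() %>%
  ggplot(aes(x = season, y = fraction_favorite_won,
             group = model, color = model)) +
  # The hline layer gets its own data frame and its own aesthetics
  geom_hline(data = overall_win_prob,
             aes(yintercept = mean, group = model, color = model)) +
  geom_line() +
  theme_classic() +
  labs(x = "Season", y = "Fraction of games won",
       title = "The winning percentage model can outperform the 538 model\nif it uses the end-of-season winning averages")
```

Printing season_plot draws the figure. If the data frame piped into ggplot lacks a model column, ggplot reports the column as unknown, which is the error seen in the demo.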
That will get rid of this legend title of model. We can then say breaks equals... I'm going to put these in alphabetical order because R does weird things with factors like this: FTE, WP current, WP live, WP prev. And then our labels. We'll say c of 538, WP Current, WP Live, and WP Previous. And then our colors. I'm going to do something a little funny here. We're going to use a package called wesanderson. It's a set of palettes with colors inspired by Wes Anderson movies, and I kind of like the way they look better than the default ggplot colors. But, you know, to each his own. So I'm going to do wes_palette with Darjeeling2. If you Google R Wes Anderson palettes, you get this GitHub repository from Karthik Ram where he shows the various Wes Anderson palettes. It is on CRAN, so you can install it through RStudio. And you can see what the different palettes look like here. We need to add our library call, so up at the top I'm going to add library wesanderson. And let's see. Here we go. Oops. Oh, this should be values. Two Ts. And so you see the colors are a little bit less in your face, and it looks nice. Of course, you can play with that. Let's move on. I'm going to skip the part where we did the binomial fits. Instead, what I'd like to do is plot these four models with the predicted winning percentage on the x-axis and the observed winning percentage on the y-axis, and not worry about the binomial fit. We've done that before, so let's just move on to generating that plot. So we're ready now to generate this plot comparing the different models and how they perform at different predicted win probabilities. I'm going to scrap most of what we have over here because I'm not going to go back and do that binomial fitting. You can do that on your own. Really, all that gains us is the cloud along the 45-degree line.
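As a rough sketch of the legend and color changes described here, the following applies scale_color_manual with a Wes Anderson palette to a toy line plot. The data frame and the model level names are assumptions for illustration; wes_palette("Darjeeling2") comes from the wesanderson package, which needs to be installed from CRAN first.

```r
library(ggplot2)
library(wesanderson)

# Toy per-season fractions for four models (names and values assumed)
toy <- data.frame(
  season = rep(c(2017, 2018), times = 4),
  model = rep(c("FTE", "WP_current", "WP_live", "WP_prev"), each = 2),
  fraction = c(0.57, 0.58, 0.60, 0.59, 0.55, 0.54, 0.53, 0.52)
)

styled_plot <- ggplot(toy, aes(x = season, y = fraction,
                               group = model, color = model)) +
  geom_line() +
  scale_color_manual(
    name = NULL,  # removes the legend title
    breaks = c("FTE", "WP_current", "WP_live", "WP_prev"),
    labels = c("538", "WP Current", "WP Live", "WP Previous"),
    values = wes_palette("Darjeeling2")  # the argument is values, plural
  ) +
  theme_classic()
```

Keeping breaks and labels in the same order is what ties each display label to the right model.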
And also, each of our different models is going to have a different sized cloud because they're predicting different probabilities, so that just gets kind of messy. So we're going to try to finish off here with a simple plot, again where our x-axis contains the predicted win probability and the y-axis has the actual win probability. I'm going to do this all in one pipeline straight to ggplot. We'll use this tidy win prob, where we're going to mutate our prob to have only two significant digits to simplify things. We're going to then group by that probability, and we're also going to group by the model type, because we want to know the predicted and observed win probabilities for each model as well. Excellent. So if we run this... I still have that in here, so we want to replace this with won. We then see that we've got all four of our models, the games that had those probabilities, the wins, and the observed win probabilities. We can then kick this out to ggplot with our aesthetic. For x, we're going to plot the probability. For y, we're going to plot the observed. We're going to group by model and color by model. We can then do geom_abline with aes, intercept equals 0, slope equals 1, and I'm going to do color equals gray. We will then add geom_line. And let's see what this looks like. Great. So we see again our predicted win probability and the observed win probability. This gray line is where the models would fall if they performed perfectly. That green line for the current season prediction looks really good. There is a fair amount of variation that I suspect would go outside of the cloud if we were to plot that. But it doesn't... Maybe it's a little underbiased. But perhaps if we averaged out the probabilities of the 538 and the current season win probability, we'd be right on that line.
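A minimal sketch of this predicted-versus-observed plot is below. The toy data and all names are assumptions for illustration. It uses round(prob, digits = 2) as one way to get two digits, and it passes the intercept and slope to geom_abline as fixed parameters rather than inside aes, which draws the same reference line.

```r
library(dplyr)
library(ggplot2)

# Toy tidy data: predicted probability and outcome per game (values assumed)
tidy_win_prob <- tibble(
  model = rep(c("FTE", "WP_live"), each = 6),
  prob = c(0.551, 0.549, 0.554, 0.602, 0.598, 0.601,
           0.552, 0.548, 0.553, 0.699, 0.701, 0.702),
  won = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE,
          FALSE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Bin games by rounded predicted probability, then count games and wins
calibration <- tidy_win_prob %>%
  mutate(prob = round(prob, digits = 2)) %>%
  group_by(prob, model) %>%
  summarize(games = n(), wins = sum(won), observed = wins / games) %>%
  ungroup()

calibration_plot <- calibration %>%
  ggplot(aes(x = prob, y = observed, group = model, color = model)) +
  # Gray 45-degree line: where a perfectly calibrated model would fall
  geom_abline(intercept = 0, slope = 1, color = "gray") +
  geom_line() +
  theme_classic()

print(calibration)
```

Points above the gray line mean the favorite won more often than the model predicted; points below it mean the model was overconfident.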
But again, this is kind of an academic exercise because we can't predict today's or Thursday's game based on the records at the end of the season. We don't have a time machine. Sorry. If you're out there, time travelers, let us know what the final season win probabilities are. But then, of course, we would actually know who wins Thursday because, time travel. Anyway, let's make this look a little bit nicer. I'm going to add my theme_classic and my scale colors. Let's see what this looks like. It looks pretty good. We need to add our labels. x is the predicted win probability, and y would be our observed win probability. Then our main title is going to be: the 538 and WP current models generate more reliable win probabilities than the WP live or previous. Excellent. Let's add another line break in here, and I'm going to add a subtitle to say that this is based on all data since 1871. You run this, and we get a nice looking plot. Again, you can play with the colors if you want. I'm pretty happy with how this looks, and it's kind of cool to see how this current season win probability model works out. The idea of using the current season's winning percentage as it develops doesn't do very well. Of course, that's in part because the model was derived assuming that we had exhaustive information about the performance of that team that year. And it's interesting to see that last year's team records don't really indicate how this year's team records are going to fare. It's certainly been a lot of fun to look at how this win probability model compares with the 538 model. So stick around for the next module, where we're going to use the betting line data. We don't have as much data, going back to 1871, say, but at the same time it's another version of modeling wins and losses, based on the wisdom of people who are putting their money where their mouth is, so to speak.
So until then, keep playing with this data and this code, and see if you can come up with your own model that you could then dovetail in with what we've already done.