So this is the week of the 2018 Major League Baseball All-Star break here in the United States. I'm a big baseball fan. I don't really care much about the World Cup, which is also going on at this time; baseball is my favorite sport. I'm a big Cubs fan. One of the sites I really like watching to keep track of my favorite team and other teams throughout the season is 538. They post predictions about who's going to win each game, who's going to make it to the playoffs, who's going to get to the World Series, who's going to win that. And I've always wondered: how accurate are those predictions? They never show any validation data to show that the results are right. A couple years ago when the Cubs won the World Series, I think 538, after maybe game two or three, had the Cubs at a 7% chance of winning the World Series. Of course, they went on to win the World Series. And so we don't really know how accurate that model was. In many ways, this is analogous to the 2016 political predictions that they made. I think they gave Hillary Clinton a 70, maybe 80% chance of winning. And of course, Donald Trump won. So we don't know whether the model was right or wrong, right? If she was an 80% favorite, that means 20% of the time Trump could have won. We can't play the election over again in the multiverse with a thousand different replications of that election. Thank goodness; it would be pretty painful to find out whether, in those thousand different universes, Trump would have won 20, 30, whatever percent of the time to validate the model. Well, with baseball and sports, we can do that. A typical baseball team plays 162 games a year. There are 30 different baseball teams. So there's a large number of games being played over the course of a season. My thought, and what I want to explore with you during this week of tutorials, is to see: can we take a prediction? Say 538 says the Cubs have a 56% chance of beating the Cardinals.
Well, if they play a couple dozen games over the course of a season with that 56% probability of winning, how many times do they actually win? Is it 56%? Is it 40? Is it 80? And then what do we think of that? There are some other ways of modeling it, too. 538 uses a model called Elo, which is largely based on a rating system that was developed for chess, where you compare two players going up against each other. They've adapted it for baseball. Their model includes a lot of other things besides just wins and losses: how far the visiting team has had to travel, how much rest they've gotten, who's pitching, who has home field advantage, those types of factors. So the model gets a little bit more complicated than just looking at the wins and losses of each team. There is also a model, of course, based purely on wins and losses that we'll look at later in the week. And another model that is out there that we don't typically think of as a model is the betting line. If the bookmakers tell us the Cubs are a favorite over the Cardinals, well, how often is that right? Those betting lines are shaped by the people that are betting on the game, so perhaps there's some wisdom in the crowd in assessing who's going to win. And so my question is: if I want to know who's going to win a game, or have a good sense of who's going to win a game, what model should I follow? We could maybe put some money on this: say I'm going to put 100 bucks on the favorite for each game. Which modeling system should I use, and how much money would I make by the end of the season? Through this, I want to answer these questions, but I also want to explore with you various tools that we have for doing data analysis and making that data analysis reproducible. Because I want you to look at what I've done and go forward and say, well, Pat looked at these models; I have found a fourth model.
Or: Pat made some assumptions that I don't quite like. Or: Pat did this with baseball; I'd like to repeat it with NBA statistics. The idea being that you can use my methods, my approaches, to build upon or to revise, to explore these questions further, or to go on a tangent and look at different questions as well. So I hope you stick with me over the next few tutorials as we explore these questions and have a little bit of fun during the All-Star break. Go ahead and navigate over to the 538 website at fivethirtyeight.com. If you've never been to the 538 website, one of the things that really sticks out to me is that they do a lot of great journalism that's built around data. As you see in the titles across the top of their banner, they cover everything from politics, sports, science and health, economics, and culture. They have articles on the eight different types of rock movies, or things in science and health; they did one on gut health. I think politics and sports are where most of their strengths lie. They do things like aggregating different polls. The name 538 comes from Nate Silver's ability, in, I think, the 2012 or 2008 election, to accurately predict all 538 electoral votes. And so they do a lot of great analysis that's really shaped by and shaped around data. If we go to the sports tab, as I said before, one of the great things about sports is that there's a lot of repetition. There are a lot of iterations, a lot of the same game, so to speak. So in baseball, the Cubs after the All-Star break are going to play the Cardinals five times in four days. We basically get to see the same teams play five times in a very concise period of time. And there are similar types of things in soccer (the World Cup, which just ended), tennis, and lots of other sports that they cover that, again, are all built around statistics and data analysis of those sports.
And so the one we're concerned about this week during the All-Star break is the MLB predictions. If you go ahead and click on "more MLB predictions", this brings you to their table of ratings, their rankings of the different teams in Major League Baseball based on their Elo rating. And so that's this column here. You see my beloved Cubs are in here, in first place in the NL Central, with an Elo rating of 1568. They trail the Astros by about 30 points. I don't really have a good sense of what that means, but I think they're pretty far away from the Astros. The Cubs, as I'll show you here, have been pretty consistent in their Elo rating over the course of the season. One of the things they do with this Elo rating, then, is to simulate future games. The Cubs so far have won 55 and lost 38. That's 93 games played; there are 162 games total in the season, so there are about 69 games left. And so they're simulating the rest of those to say that the Cubs will probably win about 42 more games to get to 97 wins, which I think would be a spectacular season. They then estimate, based on those simulations, the probability that the Cubs will make the postseason, win their division, and then go on to win the World Series. So things are looking pretty good right now as a Cubs fan. The American League certainly looks pretty strong. The Cubs, at least by the Elo ratings, and also by the win-loss percentages, are doing the best in the National League. Up until the last weekend before the All-Star break, the Brewers actually had a better record than the Cubs, but the Cubs had consistently had a higher Elo rating than the Brewers over the course of the season. And so if you believe the Elo rating, or the predictions from the 538 website, you'd think, wow, this model really had baked into it that the Cubs were a better team than the Brewers, even though the records indicated that the opposite was true.
And perhaps what's going on is that it's taking into account things like who each of the teams has played, and perhaps in the first half of the season the Cubs had a harder schedule than the Brewers did. We can click on the link for the Cubs and we'll see basic summary statistics about their chance of making the playoffs, the standings, as well as the upcoming games. And so we know that Kyle Hendricks, "The Professor", as he's nicknamed, is slated to be the first pitcher in the games back against the Cardinals. We don't know who the Cardinals are going to have pitching against the Cubs, but as of right now, when there are no games during the All-Star break, the Cubs have an Elo rating of 1568 to the Cardinals' 1508. Kyle Hendricks gets us a bonus of seven points towards our Elo rating; the average generic Cardinals pitcher has an Elo adjustment of one. The Cubs will be coming off this long All-Star break with a lot of rest, and it'll be a home game, so they get quite a bit of a bump towards their Elo rating, whereas the Cardinals have also had the rest, but they're traveling up the highway to Chicago to play the games. At the end of all this, we have all these adjustments, and this then gives a chance of winning for the Cubs at 62% and the Cardinals at 38%. And in fact, if you look across all of the games, and again, this is without knowing who the actual pitchers are, the Cubs have roughly a 62% chance of winning each of these five games. So based on what we'd expect over these five games, we'd expect the Cubs to win three of the five. Of course, five is a small sample size. If we were to, say, flip a coin that was weighted three-fifths to two-fifths five times, we might get a five-game sweep; the Cubs might lose all five games. But the question that I have is: if we were to take all the games where a team is favored 62% to 38%, does the favored team win 62% of the time? Okay.
And so that's what we're going to tackle today as we work with these data from 538. You can see that the Cubs have been pretty consistent in their Elo rating over the course of the season, never really deviating or moving around too much. And then these are the summaries of the last games that the Cubs played. They finished the first half of the season sweeping the Padres, which was great to see. One of the things I love about 538, beyond just their great reporting and their ability to bake data into everything they do, is that they really go out of their way to make a lot of their data and a lot of their code publicly accessible. And so you can read about the Elo ratings and how they've come up with these. They have an article on the complete history of Major League Baseball, but you can also download the data. So we're going to click on that link, and they will give us the data that goes into generating these Elo predictions. I'm going to click on "info" here. It was last updated two days ago, so that would have been Sunday. I'm a little bit behind in doing these; I've had some technical difficulties and some craziness at home and other things, so we're filming this on the Tuesday of the All-Star break. This takes us to a GitHub page for the 538 website; the repository that we're working in is called data, and in there they have a directory called mlb-elo. As I was getting ready to do these tutorials, I looked at the headings for their data table and noticed that a lot of them didn't make much sense. So I went through and tried to annotate, or give a definition to, the column names. And although it would have been nice for them to cite me or thank me, they took what I wrote and they put it right in here. You can see we've got the date of the game, the year of the season, whether it was a neutral-site or playoff game, and the home and away teams.
And then also all the various Elo ratings, the probabilities, and then the scores for those games. If we click on this link for the file, this will open up the CSV file that contains all of these data for every game that's been played in Major League Baseball. CSV stands for comma-separated values, and it's a quite large file. You'll see here the column headings that were defined in that README file, as well as the data for each game, including games that haven't been played yet. So we see that, again, the Cubs are playing the Cardinals five times this week. At the end of the season, they're going to finish by playing the Cardinals as well. And so this is September 30th, 2018; it hasn't happened yet, but these are their predictions for what's going to happen. We can scroll to the end of the sheet to May 4th, 1871, where I think it's Fort Wayne and Cleveland that played each other, and I believe Fort Wayne beat Cleveland 2 to 0. So again, baseball is very rich with data, and there's a lot of great information that is just recorded by stat geeks and people that like to work with data. This is the file that we're going to be using as we go through our analysis. As we get started, I want to keep everything under version control, so I'm going to go to github.com. If you don't already have an account set up, I'd really encourage you to pause the video and go create one now. Click on this plus sign to create a new repository. I was going to call this baseball-model-analysis; instead I'm going to call it baseball-wl-model-analysis, for win-loss. And the description is going to be: 2018 All-Star break demonstration, analyzing the 538 Elo model and other models for predicting winners of MLB games. Great. I'm going to make it public. I could make it private, but then I'm going to have to pay extra for my account.
If you have an academic account, which you can get through GitHub's education features, you can then have private repositories, although, as a professor at a university, I've never gotten around to setting up my own. So I'm going to leave this as public. I'm really excited for other people to see it and to perhaps fork it, make suggestions, and improve what we do as we go through here. I want to initialize the repository with a README. I'm also going to add a .gitignore file for R, and I'm going to add an MIT license, which is a pretty permissive license. The .gitignore file is going to be customized to those kind of nuisance R files, if you will, that I don't want to keep under version control; I want Git to be able to ignore those. So I'll go ahead and click "Create repository", and this now creates my very simple repository. Before we go on, I'm going to create a new issue. What I'd like to do is drive my analysis by going off of issues. And so the first issue is going to be: analyze 538 Elo model performance. Issues can be used to keep track of bugs, to report bugs on other people's repositories, or to keep track of bugs on your own. The way it's set up, it can be a discussion. I also like to use it as a way to structure the analysis and to be kind of a checklist of the things that I need to work on as I go through my project. The other thing I'll add is that you can format the text here using Markdown; an asterisk, a space, and an open and close square bracket (* [ ]) will give you a checkbox, and I'll show you what that looks like here in a minute. So: we need to download and format the CSV data with 538's history of MLB data. I want to download it, and I also want to make sure all the columns are in the right format and everything looks good. I also want to ascertain who the favorite was for each game and whether the favorite won.
I also want to see what fraction of the time the favorite has won over the course of the history of baseball; I want to plot the fraction of games that the favorite has won over the history of baseball. I would also like to know: do the probabilities generated by the Elo model actually bear out with real baseball data? In other words, if it says the Cubs have a 62% chance of winning a game, and they play 100 games where they have a 62% chance of winning, do they win 62 of those 100 games, or something close? If they do, then that would tell us that the probability is actually meaningful. So again, I can click on this preview tab and I see that I get a checklist. I can then submit my new issue. I can also add more information; sometimes I'll use these issues to keep track of different bits of data. So for my raw data CSV file, I'll put in a link. There's also a great Wikipedia page on the Elo rating system. This is a great background set of information about the Elo model: how it was developed for chess and how it has been modified for other games. And if you look down at the mathematical details, it tells you how you can take the two ratings and convert them to the probability that player A or player B would win. So I'm going to copy this because that might be useful later as we go through our analysis. So: background. And of course you could input these links as hyperlinks; you can make a hyperlink with Markdown by doing this. For these purposes, I'd like to have the naked URL out there, just because I might want to copy and paste some of those links. Great. So this gives you a sense of how we might set up and structure our issues to be a to-do list as we go through our project. The other thing we're going to do is work within RStudio.
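Those mathematical details on the Wikipedia page boil down to a simple logistic formula. As a quick sketch in R (this is the standard chess Elo formula; 538's version layers on the pitcher, rest, travel, and home field adjustments we'll see later, so treat these numbers as illustrative):

```r
# Standard Elo expected-score formula: the probability that a player
# or team with rating r_a beats an opponent with rating r_b.
elo_win_prob <- function(r_a, r_b) {
  1 / (1 + 10^((r_b - r_a) / 400))
}

# Two evenly matched teams should each have a 50% chance of winning
elo_win_prob(1500, 1500)  # 0.5

# A 60-point edge, like the Cubs' 1568 vs the Cardinals' 1508, works
# out to roughly a 58.5% chance of winning before any adjustments
elo_win_prob(1568, 1508)
```

Note that the two probabilities always sum to one: elo_win_prob(a, b) + elo_win_prob(b, a) == 1, which is why 538 can report 62% and 38% for the two sides of a game.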
I'm using R version 3.5, which came out in April; I think it is the most recent version of R. For the purposes of doing this demo, I'm going to modify my screen a bit so that I have my code up here in the upper left, and then I want to put my console and terminal up here in the upper right. So I'm going to go to RStudio preferences, Pane Layout, and I'm going to make this pane my console. Apply, and voila, it works. Great. Next we're going to make a new RStudio project, where we can click on this link to create a project. I'm going to check out a project from a version control repository. It's going to be Git, obviously, and I want to get the URL for this. So I'm going to click "Clone or download", and I'm going to copy that into my repository URL. I'm going to call this baseball-wl-model-analysis, and it's going to save it to my desktop, which I'm cool with. And I'm going to tell it to open in a new session. So we'll create that project, and it reopens. You can see now that in my local folder, I have the contents of that repository. Let me go ahead and open what that looks like over here, just to prove to ourselves that I do in fact have this on my desktop: baseball-wl-model-analysis. You'll notice that we don't see the .gitignore file that we have here; remember that the leading period indicates to the operating system that it should be a hidden file. So we don't see it here, and that keeps things clean. We can come up to our terminal and do ls, and again we see the same things; ls -a, and we see the hidden files like .Rproj.user, .git, and the .gitignore file. I have my bash prompt set up so that it tells me what branch I'm on, and in red if there are changes that need to be committed. I'm going to go ahead and expand this to the full screen, and I'm going to then do git status.
And it tells me that I've got this .gitignore file and the .Rproj file that have been updated. I'll do git add .gitignore and the baseball-wl-model-analysis.Rproj file, then git commit -m "Set up RStudio project". Excellent. I'm now going to create a new branch with git checkout -b, and we'll call it 538-validation. And so we now see that we're on the 538-validation branch, and we can also do git status. It says: on branch 538-validation, nothing to commit, working tree clean. Excellent. So we're going to come back to our console, and I'm going to create a new R script. I'm going to save this into my baseball-wl-model-analysis directory; maybe I'll just call it analysis.R, that works. What I like to do across the top of my R script is to create a banner that is a bit of a preamble and tells me in the future, and you who are following along, what's going on in this file. So at the top I might say: file, analysis.R; author, Pat Schloss; date, July 17, 2018; purpose: this script runs the analysis to validate the 538 Elo model and other models for predicting who will win individual baseball games. We could make this more fancy and include things like the dependencies, the requirements, those types of things, but for now, and because you don't want to watch me type boring stuff, we'll leave it at that. Most R scripts don't even have this much, so even having this is a big improvement over a lot of the code that I've written and that I've seen in others. A lot of the coding that we're going to do is going to make use of various packages from the tidyverse. We can import them by using library(tidyverse); tidyverse is a metapackage that contains a bunch of other packages, things like dplyr, ggplot2, lubridate, forcats, and readr. And if you're not sure whether or not you have the tidyverse installed, you can come down to the packages tab in the lower right corner and type tidyverse.
You should see it there. Do not click that box; I'll explain why in a second. And if you don't have it there, then you can come in here and install it. I already have it installed, so I'm not going to worry about doing that. The reason I don't like that checkbox is because I want to be able to run my analysis.R script anywhere, whether I'm in RStudio or running it from the command line. If I click the checkbox, RStudio runs library for me in this session. So I'll do it here: run library(tidyverse), and it runs that, right, so it's loaded. But if I run this R script somewhere else and the script doesn't have that library line, it's probably going to complain because it doesn't know what dplyr is. That's why I don't like to have that box checked. Instead, I like to put my library function calls inside of my individual R scripts. I also like to leave them at the top of the file, so that if you come along, it's clear to you which packages need to be installed to run the code. So we'll start with library(tidyverse), and we'll see that it attaches the packages ggplot2, tibble, tidyr, readr, purrr, dplyr, stringr, and forcats. The tidyverse also comes with a number of other packages that are installed but aren't loaded by default, and one of those that we'll see is the package lubridate. Very good. So the first thing we're going to want to do is create a variable, an object, called game_data, which is going to come from read_csv(file = ...). And the great thing about the readr functions like read_csv, and really all the read functions, is that the file doesn't have to be a physical file on my computer; it can actually be a URL pointing at a file on the web. So I'm going to go back to my issue tracker, and I'm going to copy that URL. What am I doing? Too many tabs open. And insert that here as the path for my file.
And so since I've got my cursor on this line, I can hit Command-Enter, and that will automatically run it over here in the console. It takes a couple seconds to load, but once it does, we see that it's read in the file and parsed it, putting the columns into different formats. Sometimes when I've done this in the past, it hasn't always formatted those columns the way I'd want. We can get another sense if we type game_data, and then we can see the different columns and the data that's in here. This looks like that CSV file we were looking at on the website, but it's formatted a little bit nicer. When you output a table, a data frame, from the tidyverse, it limits the width of the table that's shown. It shows as many columns as it can across the width here, then says there are 12 more variables that it couldn't include, and it tells you how each column is being stored. Because I want to be a little bit defensive, I'm going to tell read_csv exactly what I want the columns to be read in as. And so we'll do col_types = cols(), and then I'm going to say date = col_date(). I'm not going to do all 26 columns; I'm only going to do the columns that I'm really concerned about, because I know I'm going to use them as I go along. So season = col_integer(), rating1_post = col_double(), rating2_post = col_double(), and then we're going to want score1 = col_integer() and score2 = col_integer() as well. Okay. The 1 is for the home team and the 2 is for the visiting team. Now I get an error: unmatched opening bracket. So I need to finish this with a closing parenthesis. And I'm going to go ahead and run this, grab some coffee, and if we look at game_data, we see that things are formatted correctly; they've been parsed as we've specified.
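Put together, the read step looks something like this. The URL here is my assumption about where 538's raw mlb-elo CSV lives; substitute whatever link their README points at today. I pin down only the columns we'll rely on and let read_csv guess the rest:

```r
library(tidyverse)

# Read the 538 game-level Elo data straight from the web, being
# explicit about the types of the columns we care about.
game_data <- read_csv(
  file = "https://projects.fivethirtyeight.com/mlb-api/mlb_elo.csv",
  col_types = cols(
    date = col_date(),
    season = col_integer(),
    rating1_post = col_double(),  # post-game rating, home team
    rating2_post = col_double(),  # post-game rating, visiting team
    score1 = col_integer(),       # runs scored by the home team
    score2 = col_integer()        # runs scored by the visiting team
  )
)
```

Any columns not named in cols() still get read in; read_csv just guesses their types, which is fine for the ones we aren't depending on.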
One of the things that I don't want in here is that there are a lot of games that haven't been played yet, that don't have scores. So if I do game_data and select date, score1, score2, there are a lot of games in here without scores because they haven't been played yet. I'd like to filter this, to keep the rows where the date is less than the current date. I don't want today's games; I want the games that were played before today. And so I could go ahead and say the current date is 2018-07-16, but if I run this on Sunday, I'm going to have to open this file and edit it. Instead, we can make use of a lubridate function, now(). Let me just double-check that. Running the now() function from lubridate gives us the current date and time. And so we can say current_date <- now(). I guess I could have just put now() down in the filter, but I want to make it a little bit more explicit. So we can run this, and I can do game_data, select date, score1, score2, and we see that we have all of the games that have already been played, going from July 15th all the way back to, what was it, May 4th, 1871. So this now gives us our game_data data frame that we can use for all of our subsequent analysis. With this, I'm going to save that, go to my terminal, and do git add analysis.R, git status, and git commit -m "Load and format game data". Now I can go forward, and if I screw anything up, because I'm using version control, I can always come back to this commit and have a clean slate. I can also now come back into the issue and check off the first item on my to-do list. So the next thing I want to do is ascertain who the favorite was for each game, and whether that favorite team won. To do this, I'm going to create a new object called favorite_win_prob. And we're going to have many ways to figure out who the favorite was.
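The load-and-filter step we just committed looks roughly like this. It's a sketch: the toy tibble below stands in for the real game_data so you can see the behavior without downloading anything, and the column names match what we set up in the read step:

```r
library(tidyverse)
library(lubridate)

# Toy stand-in for game_data: one game already played, one future game
game_data <- tibble(
  date = as.Date(c("2018-07-15", "2018-09-30")),
  score1 = c(5L, NA),
  score2 = c(1L, NA)
)

# now() gives the current date-time, so we never hard-code a date;
# keeping date < current_date drops the games that haven't happened yet
current_date <- now()

game_data <- game_data %>%
  filter(date < current_date)

# Sanity check: every remaining game should have scores
game_data %>%
  select(date, score1, score2)
```

The payoff of now() is that the script stays correct whenever you rerun it; the filter always tracks the day you run the analysis.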
So today we're going to use the 538 model; tomorrow we're going to use the win-loss percentage; and then eventually we'll get to the betting line to determine who the favorite was. So, based on the Elo scores: favorite_win_prob, we can get this from game_data. We're going to pipe that into a mutate command, where we will create fave_538_win and also a fave_538_prob. And we need to add code in here to fill that fave_538_win column. I'm going to use an if_else statement: if_else(rating1_post > rating2_post... hold on, I don't want rating1_post, do I? Is that what I want? Sorry, that should be rating_prob1 and rating_prob2. The rating_post columns hold the actual rating scores; the rating_prob columns hold the probabilities, which are what I actually want. I'm a bit worried now that I've got the wrong one, so I'm going to run that and print out game_data to make sure rating_prob1 and rating_prob2 are what I think they are. Great. So: if team1 has the higher probability, then that team is the favorite, and we want to know whether score1 was greater than score2. If rating_prob2 is higher than rating_prob1, then team2 is the favorite, and we want to know whether score2 was greater than score1. And again for the probability: if_else(rating_prob1 > rating_prob2, ...); if team1 was the favorite, its probability was rating_prob1, and if team2 was the favorite, its probability was rating_prob2. Great. So this creates two new columns on our game data. We're then going to run a select command: season, date, team1, team2, fave_538_win, fave_538_prob.
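Cleaned up, that whole chunk looks roughly like this. The rating_prob1/rating_prob2, score, and team column names follow 538's mlb-elo README; fave_538_win and fave_538_prob are names I made up for the new columns, and the sketch assumes game_data has been read in and filtered as above:

```r
library(tidyverse)

# For each game, work out which team the Elo model favored, whether
# that favorite actually won, and what its stated win probability was
favorite_win_prob <- game_data %>%
  mutate(
    # rating_prob1/rating_prob2 are the model's win probabilities for
    # the home (1) and visiting (2) teams; the larger one is the favorite
    fave_538_win = if_else(rating_prob1 > rating_prob2,
                           score1 > score2,
                           score2 > score1),
    # the favorite's stated probability of winning
    fave_538_prob = if_else(rating_prob1 > rating_prob2,
                            rating_prob1,
                            rating_prob2)
  ) %>%
  select(season, date, team1, team2, fave_538_win, fave_538_prob)
```

Note that fave_538_win comes out as a logical column (TRUE if the favorite won), which sets up the trick we use next.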
And that's it. So we can run this and see what we get: the season, the date, team1, team2, whether the favorite won, and what the probability of them winning was, right? So on Sunday, the Cubs played the Padres; the Cubs were favored to win at 65%, and they did in fact win. We also see that Oakland, or somebody, I don't know who it was, maybe Oakland, was favored over the Giants, just by a little bit, but the favored team ended up losing. All right. So this takes care of our second task, which was to ascertain the favorite, and we can check it off our list. Now we want to plot the fraction of games that the favorite has won over the history of baseball. And maybe I'll go ahead and add some comments in here: load and format baseball games that have already been played; ascertain the favorite. Excellent. So now what we want to do is plot the fraction of games that the favorite won. I'll get rid of that environment tab; I never really used it anyway. We're going to build a plot where the x-axis is the season and the y-axis is the fraction of games that the favorite won during that season, to see whether or not the model's performance has varied over time. And so we will take favorite_win_prob and pipe that into group_by(season); that way we take the big favorite_win_prob data frame and chunk it according to season. We will then do summarize, and we will say fraction_favorite_won is the mean of the fave_538_win column. fave_538_win is a logical column, trues and falses, and a TRUE has a value of one, a FALSE a value of zero. If you have a vector of, say, 10 trues and falses, the mean will tell you the fraction of those 10 that are TRUE. Alternatively, the sum will tell you how many things in that vector were TRUE, and the sum divided by the length of the vector gives you the same fraction as the mean.
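That trick of calling mean() on a logical vector is worth seeing on its own. A tiny example:

```r
# Logical values coerce to numbers in arithmetic: TRUE is 1, FALSE is 0
wins <- c(TRUE, TRUE, FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

sum(wins)   # how many TRUEs there are: 7
mean(wins)  # the fraction that are TRUE: 0.7

# mean(x) is just sum(x) / length(x)
sum(wins) / length(wins)  # also 0.7
```

So summarize(fraction_favorite_won = mean(fave_538_win)) gives us, for each season, the fraction of games in which the favorite actually won.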
So this is a cute way to very easily calculate the fraction of games that the favorite won. We'll then pipe that summarize output into ggplot, and we'll say aes, the aesthetic: our x is season, our y is fraction_favorite_won. We will then do geom_line, and for now, let's see what we get. "Mapping must be created by aes." Oh, I always do that: for some reason I use pipes instead of plus signs between ggplot layers. Run that, and we get the trajectory of the fraction of games that the favorite won over the course of the last 130-plus years of baseball. So it shows us what we want to see. I'd like to gussy this up a bit and make it look a little bit more presentable. I'm not a big fan of the default background, and I would like to have a zero-to-one scale on my y-axis. I would also like to put a line across indicating the overall fraction of games that the favorite has won over the entire history of baseball. This will allow us to see, you know, is it falling off here, or is it relatively constant with time? We'll start by coming back up, and I want to define an overall_win_prob variable: I will say that that is the mean of favorite_win_prob$fave_538_win. I can run this and see that, on average, about 57.5% of the time the favorite does in fact win their game. And so I can then do geom_hline(yintercept = overall_win_prob); I thought I'd have to wrap this in an aes() call, but it turns out I didn't need to. And so here we see our horizontal line at about 0.575. I'm going to add theme_classic(); this gives it a white background. I also want to add coord_cartesian, and we'll do ylim from zero to one. So now we go from zero to one. This hline is kind of dark, so I'm going to make its color light gray. And you can kind of see that that line is light, but if you look closely, you'll see it's on top of the trend line.
So I can reverse the order and put the geom_hline line of code before geom_line, and now the layering is reversed. Okay, this is looking pretty good. Let me add some labels. We'll do labs with x equals "Season" and y equals "Fraction of games favorite won." I'd also like to give it a title, so I'll say something like "The 538 model does a better than average job of predicting the winner of baseball games" — I need to put this in quotes. And I'll do a subtitle, something like "Since 1871, the favorite has won ...% of their games." I can use the paste function in here to concatenate together various strings, so I'll insert round(100 * overall_win_prob, digits = 1). That should work great. I see I'm missing a space here after the number — and now that looks pretty nice. One thing I do notice is that after about the 1950s, the model doesn't do as well as it did before the 1950s. So perhaps it would be good to retrain the model using data from the more modern seasons. Baseball has gone through various iterations in its history: whether the ball was more lively or not, the height of the mound, whether it was during an expansion period. More recently, the American League and National League have actually been playing each other, whereas before about 2000 or so they never played each other. So the game has had a lot of iterations, and you can imagine that a model to predict winners and losers might change over that history. So this is great: it shows us how the model performs in predicting winners and losers over the course of its history. I would say it's better than average — better than flipping a coin — but not by much.
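To make the subtitle step concrete, here's a minimal sketch of the paste-plus-round trick. The 0.575 value is just an illustration standing in for the computed overall_win_prob from the video, and the exact wording of the subtitle string is my paraphrase.

```r
# Stand-in for the computed overall fraction (about 0.575 in the video)
overall_win_prob <- 0.575

# Build the subtitle by concatenating strings with paste();
# sep = "" avoids paste's default single-space separator
subtitle <- paste("Since 1871, the favorite has won ",
                  round(100 * overall_win_prob, digits = 1),
                  "% of their games", sep = "")
subtitle  # "Since 1871, the favorite has won 57.5% of their games"
```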
And so I suspect that baseball is really random, or has a high likelihood of seeming random, just because the teams are very similar to each other and the margin of difference between them is quite small. So I'm going to go ahead and commit this and say "generate plot showing change in performance over time." Excellent — and we'll go ahead and check this off our to-do list. The final thing I'd like to do today is determine whether the probabilities generated by the model actually bear out in the real baseball data. As we've seen, predicting who's going to win and lose a baseball game can be pretty hard. Saying who the favorite is gives us a dichotomous variable: are they the favorite or not? But what things like the ELO model, the betting line, and other models give us is a probability. It's like saying Clinton was the 70 or 80% favorite to win the 2016 election: like I've been saying, we can't run that election a thousand times — thank goodness — to see if she wins 70% of the time and thereby validate the model. What we'd like to say is, okay, predicting winners and losers is hard, but when the model gives a probability of a win or loss, is that probability based in reality, or is it just a random number? So we'd like to plot the fraction of games that were won by the favored team when they were favored at a specific probability. We'd expect those points to fall on a 45-degree line with an intercept of zero and a slope of one. We're going to do some slightly more advanced R work now to figure that out. So we're going to plot the observed versus expected fraction of games won by the favorite, and again, for this we're going to use all of the data going back to 1871. You could use the filter function to focus in on specific periods of baseball history.
So maybe you only want 2018, or everything since 2015, or maybe you just want to look at the year 2017 — but we're going to look at the full history of baseball. I'm going to add some space here just to move everything up. So we're going to work with the favorite_win_prob data frame, and I'm going to pipe that into a mutate. If we look at favorite_win_prob, you'll see that the probabilities go out to the thousandths place. To build up the numbers in each bin, I'm only going to go to the hundredths place — I'm going to round all of these probabilities to the hundredth, so they'll be 0.65, 0.51, 0.65, 0.55. Then I'm going to aggregate within those bins and figure out the fraction of games that were won within each. So first I mutate fave_538_prob to be round(fave_538_prob, digits = 2). To make sure that works, we run it, and we see that we have in fact rounded to two digits. We're then going to group_by fave_538_prob, and within that we'll summarize: games for the number of games played — sorry, I'm going to call it games — which we get with the n() function; then wins, which is the sum of fave_538_won; and then observed — sorry, not avg — which is wins divided by games. We run that, shrink this down a bit, and run it again, and I'm going to pipe this to print with n equals Inf so we get the entire table. We see the probability of winning the game, the number of games that have been played at that probability since 1871, the number of wins, and the observed frequency. So for games where the favorite had a 63% probability, the favored teams actually won 65% of the time. And so it looks like these numbers run a little bit hot, right?
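Here's a dependency-free sketch of that round-then-aggregate step on toy probabilities, using base R's round and tapply in place of the mutate/group_by/summarize pipeline from the video. The vectors and their names are illustrative stand-ins, not the real data.

```r
# Toy probabilities for the favorite and whether the favorite won
fave_538_prob <- c(0.654, 0.512, 0.649, 0.554, 0.651)
fave_538_won  <- c(TRUE,  FALSE, TRUE,  TRUE,  FALSE)

# Round to the hundredths place so nearby probabilities share a bin
bin <- round(fave_538_prob, digits = 2)

# Games, wins, and observed win frequency per bin: the base-R analogue of
# group_by(fave_538_prob) %>% summarize(games = n(), wins = sum(fave_538_won),
#                                       observed = wins / games)
games    <- tapply(fave_538_won, bin, length)
wins     <- tapply(fave_538_won, bin, sum)
observed <- wins / games
cbind(games, wins, observed)
```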
The observed is actually a little higher than you would predict. So that's pretty cool — let's see what this looks like. And also, you know, when you flip a coin, say, 9,984 times, what do we expect to see for the number of heads? Do we expect to see about 5,000? Or is 4,886 outside what we'd expect just from random variation? So what I want is a plot where the x-axis is the predicted probability of winning and the y-axis is the observed frequency. I'm going to have a line with a slope of 1 and an intercept of 0, and around that line I want the confidence interval for what I would reasonably expect given the number of games that were played. On top of that, I'll plot what I actually observed, to see whether the observations fall within that band around the line. Hopefully that makes sense — that's what I'm going for here. To do that, we're going to use a binomial model to fit the data. I need to save this as a variable, which I'll call all_predicted_observed, and run that to store it in memory. And that looks right. So now we're going to work with this to do the binomial validation. We'll take all_predicted_observed and pipe it into group_by(fave_538_prob), and within that we're going to nest the data. If we look at this — for now I'm going to bring this down to its own line; okay, it doesn't like me doing that, so I'll comment it out — we see that we get the probabilities we had before, and it has converted each line's games, wins, and observed into its own little table. What I want to do, for each of these probabilities, is take the games and wins and calculate the 95% confidence interval I would expect in the number of wins.
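As a quick aside, the coin-flip question above can be answered directly with binom.test from base R's stats package. The 9,984 and 4,886 counts come from the narration; this check is my addition, not part of the original script.

```r
# Is 4,886 heads out of 9,984 flips consistent with a fair coin?
res <- binom.test(x = 4886, n = 9984, p = 0.5)

res$p.value   # comes out below 0.05, so this is slightly more
              # lopsided than random variation alone would explain
res$conf.int  # 95% confidence interval for the true heads probability
```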
Okay, so we're going to use the map function and the tidy function — sorry, the tidy function from the broom package — to make that all work. One of the problems with this, though — well, we'll see the problem here pretty soon. After the nest, we can say mutate, and we'll create a new column called binomial — really a column of data frames — where binomial equals map. The input to map from this pipeline, the data coming into mutate, is by default called data. Then we'll say function(df): this is an anonymous function, where df is the data frame that map pulls off of data for each row it goes through. I'm going to do some indenting in here just to make it clear where things go. Inside, we say tidy, and we'll do binom.test, and for now I'm just going to pseudocode this: x equals the number of wins and n equals the number of games — and get my parentheses right. If we look at the help for binom.test, we see that it "performs an exact test of a simple null hypothesis about the probability of success in a Bernoulli experiment," where x is the number of successes and n is the number of trials. So if I run binom.test with, say, 3 and 5 — the probability of winning 3 out of 5 games, where p is 0.6 — we see that our estimated probability of success is 0.6 and the 95% confidence interval is about 0.15 to 0.95. So this is essentially what we want to do, but for all of the rows in our all_predicted_observed data frame. The number of games n is going to be our games column — sorry, df$games, because we're giving this function df. And for x we want df$games, but we also want it times df$prob. But we don't have a probability column in here.
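The 3-out-of-5 example from the narration is runnable as-is, since binom.test lives in the stats package that ships with base R. Here it is, with comments on the pieces of the returned htest object that the pipeline later relies on.

```r
# Exact binomial test: 3 wins in 5 games against a hypothesized p of 0.6
res <- binom.test(x = 3, n = 5, p = 0.6)

res$estimate  # observed probability of success: 3/5 = 0.6
res$conf.int  # 95% confidence interval, roughly 0.15 to 0.95 --
              # very wide, because 5 games is very little data
res$p.value   # no evidence against p = 0.6 with so few games

# broom::tidy(res) flattens this htest object into a one-row data frame
# with estimate, p.value, conf.low, and conf.high columns
```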
We've got it outside, right? If we look at this data frame as it comes into the mutate, the probability is on the outside, and on the inside are the number of games, the number of wins, and the observed probability. We need that probability inside. I don't really know how to look outside the tibble to another column, so what I'm going to do is add a column that copies it: prob equals fave_538_prob. This copies the fave_538_prob column into a new column within my tibble. If we look at this — missing my pipe — I now see the nested data is one by four: we've added prob, and it's nested within each probability, like 0.5. So again, what we're doing is mutating to add a binomial column by running binom.test, and the tidy function we wrap around binom.test takes its output and turns it into the rows of a data frame. Okay, let's see what that looks like. The output is — "could not find function 'tidy'." This comes from the broom package, so we need to say library(broom). Now we try this: "x must be an integer and non-negative." The problem is that df$games times df$prob is going to be a fraction, so I need to wrap it in as.integer. The other thing that occurred to me is that we need to add p = df$prob to the binom.test call. If we run this now, we see the probability, the data tibble, and our binomial column in the data frame. We can now do unnest, and that unnesting opens up those data frames so we can see the fave_538_prob column, the four columns coming from the data column when we nested, and the columns that are the output from running the binomial test. And then we can come back here and do select.
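For comparison, here is a dependency-free sketch of the same per-bin calculation: instead of nest/mutate/map/tidy/unnest from tidyr, purrr, and broom, it loops over rows with sapply and calls binom.test directly. As in the narration, x is the expected number of wins (games times the predicted probability, truncated to an integer), so the resulting interval is the band you'd expect around the 45-degree line. The data frame is a toy stand-in for all_predicted_observed.

```r
# Toy stand-in for all_predicted_observed: one row per probability bin
d <- data.frame(
  fave_538_prob = c(0.55, 0.60, 0.65),
  games         = c(200L, 150L, 100L),
  wins          = c(118L,  95L,  70L)
)

# For each bin, the 95% CI around the *expected* win frequency,
# given how many games were actually played at that probability
ci <- t(sapply(seq_len(nrow(d)), function(i) {
  expected_wins <- as.integer(d$games[i] * d$fave_538_prob[i])
  binom.test(x = expected_wins, n = d$games[i],
             p = d$fave_538_prob[i])$conf.int
}))

d$observed  <- d$wins / d$games  # what the favorites actually did
d$conf.low  <- ci[, 1]
d$conf.high <- ci[, 2]
d  # compare observed against [conf.low, conf.high] in each bin
```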
I guess it's not essential, but it makes me feel better about life to have things a bit more compact: fave_538_prob, games, wins, then observed, and then conf.low and conf.high. So now, for all of our probabilities, we have what we observed, plus the low and high ends of the confidence interval. Great. So now we want to plot this, of course, and I'm going to come back here and save this as a data frame. I generally like to work like this — building the pipeline so I can see the output at each step as I go through — and then come back and save it as a new data frame for input to other pipelines and plots. So I can take this now and pipe it into ggplot, where my aes has x as fave_538_prob and y as observed. Then — plus sign, I did that again — geom_point. If I run this, what does it look like? We see a pretty good line here, and as we saw when looking at the table, I believe it sits a little above a 45-degree line. So we can add an abline with y intercept — or, sorry, intercept — equals 0 and slope equals 1 inside aes. Sorry, it needs to be geom_abline. So that's our 45-degree line, and we see that our points do largely fall above it. I can then also add geom_ribbon — and I forget the syntax for geom_ribbon. By my desk here, I have a stack of cheat sheets, so I'm going to grab my ggplot cheat sheet: I know it's called geom_ribbon, but I forget what it wants for aes. It wants ymin and ymax. Great, so we'll do aes(ymin = conf.low, ymax = conf.high). If we run this — keep hitting that t — we get this ugly mess: our ribbon is filled with black, basically, and it's on top of everything else. So let's move the abline and ribbon above geom_point in our chain of plus signs.
I'm going to put my line in front of my ribbon, and I'll do color equals light gray. If you look closely, you'll notice it's the border of the ribbon that's light gray — it's not color I want, it's fill. So that looks good. Maybe I'll make the color of the abline dark gray, so it's a little more subtle and puts more emphasis on our actual observed points. And again, I'm not a big fan of the default styling, so we'll do theme_classic, and coord_cartesian with ylim from 0 to 1. See how this looks — and that looks pretty slick, right? We need to change our labels, but we can already see that the 538 model seems to be a bit conservative: if it says a team has a 70% chance of winning, they actually win more often than that, according to reality. And that's pretty interesting. The other thing we saw before is that most of the games actually fall in this middle range, and where most of the games occur, the model is probably doing a pretty good job — you really can't tell much of a difference. It's where we have fewer games that the observed values start to tilt above the predicted probability. All right, so finishing this up, we can do labs with x equals "Predicted probability of winning" and y equals "Observed probability of winning." And then I'll say main — I think it should be title; I'm a recovering base R user, and I get some of the syntax swapped between base graphics and ggplot — "The 538 model under predicts the true ability of the favorite to win," with the subtitle "All games from 1871 to present." Excellent, that looks really nice, and I think it makes the point pretty nicely that, again, 538 is under-predicting the true ability of the favorites to win. So we'll finish up here by going into our terminal. I need to save my analysis file, then git add and git commit, and the message will be "compare modeled to predicted win fraction."
I'm sorry — modeled to observed. And then we'll add "closes #1." So this finishes our branch, closing out issue one. If we come back to our issue tracker, we can check that off. Then I can do git checkout master — now it's upset, so I need to close this file first. We can then do git merge 538_validation, which takes all of the code from the 538_validation branch and merges it into master. So everything is good: our branch is ahead of origin/master by four commits, so we can do git push, which pushes our code up to GitHub. I hit refresh here, and we see that because I added that "closes #1," it notes that it closed this issue just now. And if we look back at our code, we now see that we've got our analysis.R file in here as well. Okay, so this brings us to the end of the first demo. In the next demo, we're going to add a new model based on the teams' actual win-loss records up to that point in the season. We also haven't added any way to store graphs in here, and we don't really have great structure yet, but I'm not too worried about that right now. I know this has gone a little long, but I really hope you've enjoyed seeing this demo of how I work with R and the various tidyverse packages to make some inferences, to see whether we can validate the 538 model and other models that are out there, and to try to improve upon them. And I think what we saw was that the 538 baseball model — the ELO model — actually does a pretty good job of predicting winners and losers, and that's pretty awesome. So I'll see you tomorrow, and enjoy the baseball: the All-Star Game is tonight.