Count data: a tale of Poisson and predicting football results.

Hello, my name is Vernon Gayle. I'm Professor of Sociology and Social Statistics at the University of Edinburgh, and I'm part of the ESRC National Centre for Research Methods. At the current time, due to the restrictions placed on us by the COVID-19 pandemic, the National Centre for Research Methods are unable to deliver any standard face-to-face research methods training courses. I hope that sometime in the near future you'll be able to join us in person at the University of Edinburgh. The film that follows is about the analysis of count data.

Count data. Consider the following. How many times did you go to the cinema last year? How many people has your best friend slept with? How many goals have your favourite football team scored this season? The answer to any of these questions is likely to be a count, which means it is a positive whole number, i.e. an integer. The number must be positive because you can't have minus two visits to the cinema, and it must be a whole number because you can't make a three-quarter or 0.75 visit to the cinema. Similarly, sexual partners and goals are also counted in positive whole numbers, rather than as fractions or decimals.

Examples of social science data that take the form of positive counts are legion. For example, how many burglaries take place in a neighbourhood? How many women under 20 gave birth last year? Or how many cases of a disease were diagnosed? Indeed, the question of how many anythings will usually be answered with a count. Given the prevalence of count data in the social sciences, for many years it has puzzled me why most social scientists know very little about analysing count data. In reality, social science data analysts tend to know more about analysing either binary, i.e. nought-one, outcomes or continuous, i.e. metric, measures.

The Poisson distribution is integral to analysing count data.
The Poisson distribution is named after the French mathematician Siméon Denis Poisson, who was also a Fellow of the Royal Society of Edinburgh. His name is one of the 72 names inscribed on the Eiffel Tower.

Lord Tennyson laments that in spring a young man's thoughts turn to love. I'm a middle-aged football fan and, by contrast, my thoughts often turn to the final game of the season. My interests lie in Scottish League 2, the fourth tier of Scottish men's professional football. I'm a Stirling Albion fan, and here I'm pictured with our mascot, Bino the Bear. I'm going to use an example that I developed before the final day of the football season in Scottish League 2 back in 2018.

I'm going to follow an analytical approach that was used to analyse data in the English Premier League by Professor Sir David Spiegelhalter. Here I am at the Royal Statistical Society teaching a course, and Professor Spiegelhalter, who was then President of the Society, came in to say hello to us. In my view, David is the greatest living British statistician. You may well have seen him on TV during the current COVID-19 pandemic, and he's a recognisable voice on Radio 4. David's a really nice fellow, but he's also great fun. For example, he's the reigning, first and only, world champion in Loop. This is a version of pool invented by Alex Bellos and played on an elliptical table with a single pocket in the green baize.

League football matches end in the home team winning (a home win), the away team winning (an away win), or a draw, when both teams have scored the same number of goals. Many games end without either team scoring, and typically games end with each team scoring only a few goals. There are, of course, the occasional shockers, for example when a team will suffer a 6-0 defeat. There are also occasional goal fests, where both teams stick it in the onion bag half a dozen times. In 1885 Arbroath thrashed Bon Accord 36-0, and in 1984 Stirling Albion beat Selkirk 20-0.
Routinely, however, most games end with a modest number of goals despite the large number of opportunities to score.

Here is the example. As Cher would say, let me turn back time. As the last day of the 2017-18 football season approached, the winner of Scottish League 2 was still undecided. My own football club, Stirling Albion, was due to be battling for a place in the playoff competition. When consuming a traditional half-time pie, I have often ruminated on the veracity of a statistical approach to predicting match outcomes and final scores. Here is a list of the five games that made up the last day of the season. In this example, I'm going to bring some statistical thinking to the prediction of the outcomes of the matches and to predicting the final scores.

To make things interesting, I consulted a fellow fan, a guy that's followed the club since his teens. Every fan thinks that they're an expert, but I prefer to consider this as pseudo-expert knowledge. Here are his predictions. He thought that Clyde vs Berwick would end 2-1, Cowdenbeath vs Annan Athletic would end 0-0, Montrose vs Elgin would end 2-0, Peterhead vs Edinburgh City would end 3-0, and Stirling Albion vs Stenhousemuir would be a 2-all draw. I've also constructed a set of random predictions decided by a seven-sided dice, and here they are.

Now let's just look at the data that have been generated by the games played in the league so far. Montrose are at the top of the table with an impressive 76 points, whereas Cowdenbeath are at the bottom of the table with only 22 points. The final game of the season for Stirling Albion will be played at home. As a fan, I would like to think of our ground, Forthbank, as a modern-day manifestation of a Roman Colosseum where football foes are routinely vanquished and pies and Bovril refresh the Senators and Equites. The reality is somewhat different. So far this season, we've played 17 home games and won only eight.
Stirling Albion will be playing local rivals Stenhousemuir on the final day of the season. Stirling have lost four of their last five games, compared with Stenhousemuir, who have only lost once in the last five matches. I'd like to predict the outcome of the forthcoming match and the other games on the final day of the season.

Stirling Albion are playing Stenhousemuir in the final game and, as we can see, Stirling have played 35 matches. They've won 16, drawn six and lost 13. Their opponents, Stenhousemuir, have also played 35 matches, but they've only won 15, they've drawn eight and they've lost 12.

The first measure that we're going to construct is called attack strength. It's a measure of how good the team is at scoring goals. Montrose, who are top of the league, have scored 59 goals. Cowdenbeath, who are at the foot of the table, have only scored 23 goals. The average number of goals scored by each team in the league is 49, i.e. the ten teams have scored 490 goals in total. Stirling Albion have scored 60 goals and Stenhousemuir have scored 55. If we take a ratio of the team's goals scored over the league average, then we have a measure of their attack strength, or the quality of their attack. Stirling Albion have scored 60 goals and the league average is 49, so they have an attack strength of 1.22. Stenhousemuir have scored 55 goals and the league average is 49, so they have an attack strength of 1.12. We can infer that Stirling Albion score about 22% more goals than the league average and Stenhousemuir score about 12% more than the league average.

The second measure that we're going to construct is called defensive weakness. How bad is the team at defending, measured by conceded goals? The average number of goals conceded by each team in the league is 49, i.e. the ten teams have conceded 490 goals in total. There's a beautiful symmetry here, simply because when one team score a goal, the other team concede a goal.
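The attack-strength ratio described above can be sketched in a few lines of Python. This is an illustration rather than the notebook shown in the film: the function and variable names are assumptions, and only the goal totals quoted in the talk are used.

```python
# League-wide totals quoted in the talk: ten teams, 490 goals scored so far.
TEAMS_IN_LEAGUE = 10
TOTAL_GOALS = 490
league_average = TOTAL_GOALS / TEAMS_IN_LEAGUE  # 49 goals per team

# Goals scored so far this season, as quoted in the talk.
goals_for = {"Stirling Albion": 60, "Stenhousemuir": 55}


def attack_strength(goals_scored, average):
    """Ratio of a team's goals scored to the league average per team."""
    return goals_scored / average


for team, scored in goals_for.items():
    # A value above 1 means the team scores more than the league average.
    print(f"{team}: attack strength {attack_strength(scored, league_average):.2f}")
```

A value of 1.22 reads as "about 22% more goals than the league average", which is how the figures are interpreted in the talk.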
If we take a ratio of the number of goals that the team concede, goals against, over the league average for conceded goals, then we have a measure of their defensive weakness, a measure of the lack of quality of their defence. Stirling Albion have conceded 51 goals when the league average is 49. They have a defensive weakness of 1.04, whereas Stenhousemuir have conceded 46 goals when the league average is 49, so they have a defensive weakness of 0.94. We can infer that Stirling Albion let in about 4% more goals than the league average, whereas Stenhousemuir let in about 6% fewer goals than the league average.

There are two more measures that are required. The home average is the average number of goals that home teams score. The away average is the average number of goals that away teams score. Teams scored 255 goals at home in 175 matches. Therefore the average number of goals that a home team scores in a league game is 1.46, i.e. 255 divided by 175. Teams scored 235 goals away from home in 175 games. Therefore the average number of goals that an away team scores in a league game is 1.34, i.e. 235 over 175.

How many goals can we reasonably expect when Stirling Albion play Stenhousemuir? Putting the information together, we can work out the expected goals for each team. This is the information required for calculating the number of expected goals when Stirling are playing at home. The average number of goals scored by a home team is 1.46. But Stirling are not an average team. They usually score about 0.22, or 22%, more. Their attack strength is 1.22. They are also playing Stenhousemuir, who have an effective defence and have only conceded 46 goals when the league average is 49. Stenhousemuir have a defensive weakness of 0.94. Therefore, given Stirling's better than average scoring ability and Stenhousemuir's slightly better than average defence, I estimate that Stirling can expect to score 1.67 goals: 1.46 times 1.22 times 0.94, i.e. 1.67 expected goals.

Stenhousemuir are playing away.
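The home side's calculation described above can be sketched as follows. The names are illustrative assumptions, not the code from the film. The talk multiplies figures that have already been rounded to two decimal places, so the sketch rounds each factor in the same way; feeding in the unrounded ratios gives a slightly different product.

```python
LEAGUE_AVERAGE = 49                 # goals scored (and conceded) per team
MATCHES_PLAYED = 175                # league matches played so far
HOME_GOALS, AWAY_GOALS = 255, 235   # totals quoted in the talk

home_average = HOME_GOALS / MATCHES_PLAYED  # goals per game for a home team
away_average = AWAY_GOALS / MATCHES_PLAYED  # goals per game for an away team


def defensive_weakness(goals_conceded, average=LEAGUE_AVERAGE):
    """Ratio of goals conceded to the league average; below 1 is a good defence."""
    return goals_conceded / average


def expected_goals(base_rate, attack, opponent_weakness):
    """Base scoring rate scaled by own attack and the opponent's defence."""
    return base_rate * attack * opponent_weakness


# Stirling Albion at home against Stenhousemuir, rounding each factor
# to two decimal places as the talk does (1.46, 1.22 and 0.94).
stirling_expected_goals = expected_goals(
    round(home_average, 2),
    round(60 / LEAGUE_AVERAGE, 2),
    round(defensive_weakness(46), 2),
)
print(f"Stirling Albion expected goals: {stirling_expected_goals:.2f}")
```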
The average number of goals scored by an away team is only 1.34. But Stenhousemuir are not an average team. They usually score about 0.12, or 12%, more. Their attack strength is 1.12, remember. They are also playing Stirling Albion, who have a slightly suspect defence and have conceded 51 goals when the league average is 49 goals. Stirling have a defensive weakness of 1.04. Therefore, given Stenhousemuir's better than average scoring ability and Stirling's sadly slightly weaker than average defence, I estimate that Stenhousemuir can expect to score 1.56 goals, i.e. 1.34 times 1.12 times 1.04.

Now we have an expected number of goals for the two teams. It's possible to plug this information into the Poisson formula. In football, once the referee blows the whistle and play commences, in the 90 minutes that follow there are many chances to score a goal, but few of these chances end in a success. In statistical terms, we might consider this as a large number of trials with a low chance of success. The Poisson distribution expresses the probability of a number of events occurring in a fixed period of time if these events occur with a known average rate and are independent of the time since the last event.

Here is the Poisson formula. Lambda is the expected number of goals, and e, which is 2.71828, is Euler's number, which is a mathematical constant. k is the number of events, in this example 0 through to 6 goals, and k with the exclamation mark is k factorial, so when k is 6, k factorial is 6 times 5 times 4 times 3 times 2 times 1. So here we have the probability, which is equal to Euler's number to the power of minus lambda, the expected number of goals, multiplied by, all in brackets, lambda to the power of k divided by k factorial.
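The formula just described, P(k) = e^(-lambda) * lambda^k / k!, can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the notebook from the film: the names are hypothetical, the expected goals are the rounded values quoted in the talk, and the last function multiplies the two teams' probabilities, which treats their goal tallies as independent.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return exp(-lam) * lam ** k / factorial(k)


# Expected goals quoted in the talk.
STIRLING_LAMBDA = 1.67       # home side
STENHOUSEMUIR_LAMBDA = 1.56  # away side: 1.34 * 1.12 * 1.04, rounded
MAX_GOALS = 6                # the talk considers 0 through 6 goals

stirling = [poisson_pmf(k, STIRLING_LAMBDA) for k in range(MAX_GOALS + 1)]
stenhousemuir = [poisson_pmf(k, STENHOUSEMUIR_LAMBDA) for k in range(MAX_GOALS + 1)]


def scoreline_probability(home_goals, away_goals):
    """Joint probability of a scoreline, assuming independent goal tallies."""
    return stirling[home_goals] * stenhousemuir[away_goals]


for k in range(MAX_GOALS + 1):
    print(f"{k} goals: Stirling {stirling[k]:.2f}, Stenhousemuir {stenhousemuir[k]:.2f}")

print(f"1-1 draw: {scoreline_probability(1, 1):.3f}")
```

Because only scores from 0 to 6 are tabulated, each list sums to slightly less than 1; the small remainder is the probability of seven or more goals.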
Plug the information for Stirling Albion into this formula. For one goal the probability is 0.31; for two goals the probability is 0.26. The predicted probability of Stirling scoring zero goals is 0.19, or 19%. The predicted probability of Stirling scoring one goal is 0.31, as we've seen, or 31%. The predicted probability of Stirling scoring two goals is 26%, 0.26, and so on.

Here are the probabilities for each of the two teams. There's a 31% chance of Stirling Albion scoring one goal. Stenhousemuir are also most likely to score only one goal; there's a 33% chance. If we multiply 0.31 and 0.33, we can estimate the overall probability that Stenhousemuir versus Stirling will end 1-1. So 0.31 times 0.33 gives 0.103 when the unrounded probabilities are used. This suggests that there's a 10% chance of a 1-1 result, i.e. a draw. In statistical terms, I've assumed that each event, i.e. each goal, is independent.

Let's take a closer look at this chart. As a fan, I'd be delighted for the match to end with Stirling winning 6-0, but I can estimate that there is only a 0.001 chance of this result. Stirling have a very low probability of scoring 6 goals and Stenhousemuir have a 21% chance of not scoring a goal, so the result is roughly 0.01 times 0.21; with the unrounded probabilities it is about 0.001.

Here are the predictions for the results of the other four matches that should be played on the final day. I've estimated that Stirling Albion and Stenhousemuir will end 1-1, that Clyde and Berwick will end with Clyde winning 1-0, and that Cowdenbeath and Annan Athletic will end with Annan Athletic winning 1-0. Montrose, who are very strong, will beat Elgin City 2-1, and Peterhead, also a strong side, will beat Edinburgh City 2-0.

Over the years I've noticed some striking empirical regularities. Birds fly, fish swim, and colleagues sometimes accuse me of using an outdated software package or programming language. In the examples here I've used Python, simply as a defence against this familiar accusation. Here is the Python code running in a Jupyter notebook. You don't get much more hipster than that. Other software, though,
and statistical languages are available.

More complex models. The models outlined above are simple Poisson models, a product of only a few terms, i.e. home advantage, attack strength and defensive weakness. But we could extend these models. I'm going to pause for a bit and get you to think about some potential ways that these models could be extended.

More complex models. These models use data for the whole season, but could we put more emphasis on more recent results? We might also consider that some teams have a better, or even worse, home advantage than the league average. Also, there's no information on the composition of individual teams. For example, new players may have joined during the season, or some influential players may be injured. It might even be advantageous to include other information, for example on the weather or even the state of the pitch. Stenhousemuir, for example, and a small number of other clubs play all of their home games on synthetic pitches. The sports betting companies use much more complex models than the ones shown above that incorporate more information, and they also have football experts advising them.

Here are the classified results. Clyde 1, Berwick 2. Cowdenbeath nil, Annan Athletic [score unclear]. Montrose 1, Elgin City 1. Peterhead 2, Edinburgh City 1. Stirling Albion 1, Stenhousemuir 1.

The outcomes. The statistical method only predicted one correct score. It did, however, predict the correct result for three of the five matches. The statistical method beat the fan, who only predicted two correct results. The dice only managed one correct result. Neither the fan, a pseudo-expert, nor the dice predicted any correct scores.

A word of caution. We do not advocate using the methods outlined above for gambling. We stress: we do not advocate using the methods outlined above for gambling. There was a popular saying when I was a boy that it was not by chance that at my local bookmakers there were five windows for placing bets and only one window for collecting winnings.

More complex models. The models outlined above
are simple Poisson models that are the product of only a few terms, i.e. home advantage, attack strength and defensive weakness. But we could extend these models, as we've noted above. You might also have thought of some additional information that could be included in the analysis. On further reflection, an underlying problem is that any score combination, nil-nil to six-all, is one of 49 cells on a seven-by-seven grid, so each specific score has a very low probability. One technical extension might be to develop a set of confidence intervals to test the coverage of predictions. It might also be prudent to check if the Poisson distribution is the most appropriate distribution to use when modelling the scores in a lower division football match. Who knows, I might even get around to doing some more work on this one of these seasons.

We have been discussing count data and thinking about how to use it in analysis. We said at the start of this film that examples of social science data that take the form of positive counts are legion, and I used the examples of how many burglaries take place in a neighbourhood, how many women under 20 gave birth last year, or how many cases of a disease were diagnosed. Indeed, the prosaic question of how many anythings will usually be answered with a count. I also said that, given the prevalence of count data in the social sciences, for many years it's puzzled me why most social scientists know very little about analysing count data.

The technique known as Poisson regression estimates models of the number of occurrences, i.e.
counts, of an event. The Poisson distribution has been applied to diverse events. For example, Ladislaus Bortkiewicz analysed the number of soldiers kicked to death by horses in the Prussian army. This is probably the first use of this approach. Over the years I've read various slightly pedantic discussions as to whether or not the data were for officers only, or if it included both mules and horses, but I must confess I don't really care. Clarke analysed patterns of hits by buzz bombs launched against London during World War II. Thorndike analysed telephone connections to a wrong number.

If you're familiar with regression models or the generalized linear modelling framework, then you will have seen equations like this before. In essence, there is a left-hand side, i.e. the outcome variable, and a right-hand side, a set of explanatory variables. Here they're written as x1 through to xk, and then finally there is an error term. It's easy to make the conceptual leap to having a count variable as the outcome and the regression model using information from the Poisson distribution.

There are several different models that are suitable for modelling count data. The Institute for Digital Research and Education at UCLA provide this excellent page with examples using Stata, SAS, SPSS, R and Mplus.
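The slide showing the regression equation is not reproduced in this transcript. As a rough sketch in standard textbook notation, which is an assumption rather than the speaker's own rendering, the first line below is the familiar linear regression with an error term, and the second is the usual form of a Poisson regression with a log link. The symbols beta, epsilon and mu are introduced here purely for illustration.

```latex
% Linear regression: outcome, k explanatory variables, and an error term
y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki} + \varepsilon_i

% Poisson regression (standard log-link form, assumed here):
% the count outcome y_i follows a Poisson distribution with mean \mu_i
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}
```

In the second form the explanatory variables act on the logarithm of the expected count, which keeps the predicted mean above zero.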
Here is an example of a historical paper that has just been published that employs models for count data. A stellar early career researcher, Dr Sarah Stopforth, and her colleagues model the number of school GCSEs gained at grades A* to C. They undertake a sensitivity analysis comparing alternative statistical models suitable for count data. Then they use a negative binomial regression model rather than a Poisson model because there is evidence of overdispersion. Negative binomial regression can be used for overdispersed count data, which is when the conditional variance exceeds the conditional mean. In their data set there were high proportions of young people with zero counts, so a zero-inflated model was used.

In conclusion, we have been discussing count data and thinking about how to use it in an analysis. The prosaic question of how many anythings will usually be answered with a count. Given its prevalence in social science research, it's worth learning how to analyse count data. I hope that watching this video and using the accompanying materials will help you to better understand count data and how it can be analysed. Thank you very much.