I'm Salvatore Babones, and today's lecture is on the global standardization of data. The data included in the international data infrastructure are becoming increasingly standardized, routinized, and easily accessible. On balance, that's a good thing. But it does result in the problem that people, even professional researchers, tend to ignore concepts that are not captured in these standard versions of databases. It can also be easy to forget that data must be manipulated, and I don't mean that in a sinister way; I mean that data must be handled or packaged before they can be used effectively. Most of the time, the data do not speak for themselves.
Data series are increasingly standardized across countries through the coordination of all sorts of intergovernmental standard-setting bodies. The best known, of course, is the United Nations, but there is also the European Union, which standardizes data across the 28 European Union member states; the OECD, the Organisation for Economic Co-operation and Development, which standardizes data across some thirty-something rich countries plus a few middle-income countries; and the World Bank, which has as its members most of the countries of the world and produces standardized data for some 210 countries and country-like entities.
The global standardization of data about countries and about the people who live in them started with the spread of national income accounting, that is, GDP measurement, after World War II. National income accounting is the compilation of all economic activity in the country into a single system of national accounts. This first system of national accounts was developed in the United States in the 1920s by Simon Kuznets, and it's based on a series of industry surveys and service surveys that measure economic activity. 
Literally, the Bureau of Economic Analysis and the Census Bureau conduct surveys of businesses asking them how many units were produced in each of several thousand standard industry classifications of products. All of this is then aggregated up to a total national accounting system.
During World War II, the United States used the system of national accounts as a very effective planning tool. Simply put, the United States knew in great detail how much was being produced in every small, fine-scale industrial sector in the country, in a way that other countries simply didn't. They didn't have a firm grasp of what was being produced at the time. So other countries had overproduction of some products and underproduction of others. Famously in the Soviet Union, even after World War II, at times only one kind of food would be available in stores, so there would be an entire supermarket aisle full of canned peaches but no other goods, because nobody was correctly planning the economy at a very fine-tuned level. During the war, the United States was able to plan at an extremely detailed level and, as a result, was able not only to outproduce the rest of the world, but to do so while keeping grocery stores well stocked in the United States itself.
The success of the U.S. in using the system of national accounts in World War II led to it being rolled out to other countries after the war. In fact, one of the conditions of receiving U.S. economic aid and aid through the World Bank after World War II was that countries should use the system of national accounts to understand and plan their economies. Many of the common statistics you hear about in the news are derived from the system of national accounts and the surveys that go into it. Things like gross domestic product or GDP, the inflation rate, the trade surplus or trade deficit, all of that is part of the system of national accounts. 
Data from the system of national accounts and related statistics make up the core of the world's international data infrastructure. The calculation of GDP requires the collection of so many different kinds of economic data that all of it becomes part of the standard-issue data available in the world. Things like industrial output, agricultural output, prices, numbers of companies, size of companies. Every company in a system of national accounts has to report how many employees it has, at what different levels, engaged in what kinds of activities. All of this enters the system of national accounts.
The further calculation of GDP per capita, GDP per person, requires the collation of GDP data with demographic data. So if you want to know GDP per capita, you have to know the population. In order to know the population, you have to know vital statistics like births and deaths. You also have to know how many people are moving in and how many people are moving out. So the whole package, GDP per capita, includes economic data from the system of national accounts and demographic data from censuses and intercensal surveys.
There are also intergovernmental organizations that produce a bewildering array of data sets, but many of them are purely for specialist use. I'll just leave this card up for a moment to give you some idea of all the different sorts of databases that are available in the international data infrastructure, but these are specialized data sets that are rarely used by anybody outside the academic and research communities.
Nongovernmental organizations, NGOs, also compile and publish data. But the trick is that NGOs publish data in support of their own missions. They don't tend to produce data just for the general public or for government planning use. They produce data because they have an interest in producing data. So for example, Freedom House, the U.S. 
democracy promotion think tank, unsurprisingly produces a report on freedom in the world, while Reporters Sans Frontières produces a press freedom index. Now, do we really know that these indices, Freedom in the World or the Press Freedom Index, capture levels of freedom or levels of press freedom in an accurate and valid way? Well, it's difficult to say. Faith in the data set really simply depends on faith in the organization. In the PowerPoint version of the lecture, you can click through to some data sets from Amnesty International about the numbers of prisoners and numbers of people executed for crimes in different countries, or data sets produced by newspapers like The Guardian, or even data sets from Wikipedia itself.
Taking into account the many different origins of data in the international data infrastructure makes us think about the limitations those data might have. Obviously, data that come from the system of national accounts, censuses, and intergovernmental organizations have the imprimatur of an official stamp certifying them. On the other hand, those organizations collect the data that they want for official purposes, which may not be the same data that you want as a researcher or as a citizen or as a commentator. Thus, we have limitations in how we can use data. Social scientists of all kinds make use of data from the international data infrastructure, but we have to be judicious in using them and always be asking whether the data that are available are the data we really wanted, or would have asked for at the time.
Now, to paraphrase Donald Rumsfeld (again, on the PowerPoint version, if you click through you will find the video): Rumsfeld was asked about the Iraq war and deficiencies in U.S. equipment during the invasion of Iraq in 2003, and he said you go to war with the army you have, not the army you might want or wish to have at a later time. And the same might be said of data. 
If you want to analyze the world, if you want to participate in global policy debates, if you just want to know what's going on in the world, well, you have to use the data you have, not the data you might want or the data you might wish to have at some later time.
And of course Wikipedia and Google searches rely heavily on the data we have. If you want to find out if Tajikistan is a democracy, you can ask Google, but Google will go to standard democracy data sets like Freedom in the World and give you an answer based on those. You want to know the population of Australia? Well, Google will give you an answer, but the answer will be pulled from the World Bank. Interestingly, it will not come from the Australian Bureau of Statistics, which would give the official population of Australia; the figure is easier to access via the World Bank, so that's where Google will get it. The same goes for the GDP per capita of Ecuador, lists of countries by imports, poverty rates. Google will automatically go to the international data infrastructure rather than going to national sources.
And that's the interesting thing about all of these data. You might be looking for data about a specific country, but when you search Google or when you go to Wikipedia, almost always you will get an answer from a global organization, whether an IGO like the World Bank or an NGO like Amnesty International or Freedom House. You won't get an answer from a national organization, because the internet is a global space, not a series of 200 national spaces.
One very interesting, politically sensitive economic indicator that is simply not systematically collected is the level of inequality. Now, this is a chart of income shares of the top 0.1%, that is, the percent of income in four different countries that accrues to the top 0.1%. That is one of many indicators of inequality. These data actually come from the World Top Incomes Database compiled by Thomas Piketty and his team in Paris. 
These data mostly come from national tax databases. Piketty and his team have pulled these all together into an easily accessible online database that can be used across countries. Now, as you can see, inequality has risen dramatically in the United States. Back in the 1970s, the United States and the UK had levels of inequality similar to France's and just a little bit higher than Sweden's. Inequality in Sweden continued to fall into the 80s and then rose a little bit. Inequality in France has risen just a tiny bit. But look at inequality in the United States. It's risen dramatically. The United States went from having 2% of income going to the top 0.1% of the population to having 9% of income going to the top 0.1%. The UK has followed along in the footsteps of the US.
Yet this massive increase in inequality is not available from any globally standardized data set. The Piketty team, a team of academic researchers, has pulled together data from dozens of individual countries' tax databases, but there are lots of limitations. The data differ by country. There's no standard system, like the system of national accounts, for collecting inequality data. So all of it is indicative. It has to be taken with a grain of salt. When you see a change like that of the United States, with inequality quadrupling between the 1970s and the 2010s, I think you can take that as a real difference. But are all of these minor differences in France real? Is the difference between France and Sweden really substantively significant? Well, it's hard to say, because the data come from different sources, were collected using different methods, and are not strictly comparable across countries, because there is no system of national inequality accounts. We just don't have it.
But luckily, for all its faults, and GDP per capita is criticized in many different ways, GDP turns out to be a pretty good indicator of most other things about a country. 
Countries that have high GDP per capita tend to have good ratings on everything that's good and bad ratings on everything that's bad. Maybe the one exception is carbon dioxide emissions: higher GDP per capita is related to higher carbon dioxide emissions. But it's also related to higher levels of education and to people having more telephones. It's related to people living in cities, not in isolated rural communities. High GDP countries have lower population growth and lower fertility rates. They have much lower infant mortality rates, higher life expectancy, higher levels of immunization. They tend to have more service-focused economies and fewer people actually in the labor force; in rich countries, people don't have to work as much. And they have lower rates of consumption: people aren't living hand to mouth. They can actually save.
Now, all of these relationships, and I just went down this column, hold up no matter what measure of GDP you use. I won't go into the details, but these are six different ways to operationalize GDP. And despite the six different forms of GDP data, they all line up with all of these things about society. Interestingly, the one really important social indicator that is completely unrelated to GDP per capita is inequality. It doesn't matter whether you're a rich country or a poor country or somewhere in the middle, you can have either low inequality or high inequality. It all depends on the policy in that country and has virtually nothing to do with the level of GDP in that country.
Well, once you go out to the international data infrastructure and you actually want to use these data in a research project, or just to create a graph or to tell a story, the data have to be put into a database to be used for analysis. Most data from the international data infrastructure are freely available to download from online databases. 
If you search Google and you get a number for something you searched, then instead of taking it from Google, you can go back to the original source, which more often than not is the World Bank's World Development Indicators. All of these data are available free for anybody. Now, some of the specialized data are not, but most data that any ordinary policymaker or researcher would choose are freely available on the internet.
Databases in the international data infrastructure usually have three dimensions, so they're a little difficult to visualize, because you have to think in three dimensions. You have a country, you have a variable, and you have a year. So take any bit of data, say, the GDP per capita of the United States is $55,000. Well, that's for the United States, that's GDP per capita, and that's 2017. It has a country, a series, and a year. And you can see that very clearly in the World Development Indicators. I won't click through now. There's an entirely separate video available from my YouTube site about how to use the World Development Indicators. But I will be showing you some data that have been downloaded from the World Development Indicators.
So for example, here's a standard representation of two-dimensional pages of data. These are pages of data that include a country code, a country, the infant mortality rate, and the log base 10 of GDP per capita. 3 means 10 to the third: 10 times 10 times 10 is $1,000. 4 means 10 to the fourth: 10 times 10 times 10 times 10 is $10,000. So you can see Antigua has about $10,000 GDP per capita. Australia is rather higher, somewhere around $50,000. Albania is rather lower, at about $3,000 GDP per capita. And here we have the data for 2000, the data for 1990, the data for 1980. So here we have the countries down the rows, the series or variables along the columns, and the year as a page. 
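To make the country, series, and year structure concrete, here is a minimal Python sketch of the "one page per year" layout, together with the conversion from a log base 10 value back to dollars. The country codes, series names, and every number in it are made-up placeholders for illustration, not real World Development Indicators figures, and the function names are hypothetical.

```python
# pages[year][country][series] = value
# All values below are invented placeholders, not real World Bank data.
pages = {
    1990: {
        "ALB": {"infant_mortality": 35.0, "log10_gdp_pc": 3.0},
        "AUS": {"infant_mortality": 8.0, "log10_gdp_pc": 4.0},
    },
    2000: {
        "ALB": {"infant_mortality": 25.0, "log10_gdp_pc": 3.0},
        "AUS": {"infant_mortality": 5.0, "log10_gdp_pc": 4.0},
    },
}


def lookup(pages, country, series, year):
    """Return one data point: every value needs a country, a series, and a year."""
    return pages[year][country][series]


def dollars_from_log10(log_value):
    """Undo the log base 10 transformation: 3 -> 1,000 and 4 -> 10,000."""
    return 10 ** log_value


if __name__ == "__main__":
    log_gdp = lookup(pages, "AUS", "log10_gdp_pc", 2000)
    print("AUS 2000 GDP per capita (placeholder):", dollars_from_log10(log_gdp))
```

Each key of the outer dictionary plays the role of one page, which is why a single number can only be retrieved by naming all three coordinates.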
So you can think of this as literally a set of pages, one on top of the other, one page for each year. It's very difficult for people to conceptualize even a simple three-dimensional database. I mean, you can have 4, 5, 10, 20-dimensional databases, but even a three-dimensional database is hard enough to imagine. So we usually use two-dimensional data sets for analysis. It's hard to work with a series of pages stacked one behind the other on your computer screen, so instead of having them stacked, we look for ways to represent those pages on a single flat, two-dimensional sheet. And there are basically two ways to do that. You can have a horizontal format, in which the pages are laid out left to right in a single continuous horizontal arrangement, or a vertical format, in which the pages are laid out top to bottom, one below the other. And I'll show you some examples of that.
So here's a horizontal format representation of the same data. In horizontal format, we have, again, the countries on the rows and the variables on the columns, infant mortality and GDP. But notice that we have infant mortality in 1980, GDP in 1980, infant mortality in 2000, GDP in 2000. And in fact, you could have as many columns as you need: you could have infant mortality in 1980, 81, 82, 83, 84, etc. That is, you could have 31 columns for infant mortality, 1980 through 2010, and then you could have 31 columns for GDP, 1980 through 2010, etc. If you lay the years out horizontally, we call that horizontal format.
The alternative, obviously, is to lay the years out vertically. So here we have, again, countries on the rows and variables on the columns, but each row is a different year. And again, instead of just having two years, we could have Aruba 1980, Aruba 1981, Aruba 1982, Aruba 1983. We could have one entry for every year, going right down the rows. 
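The two layouts can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stacked pages are held in a nested dictionary, the column naming scheme (`series_year`) is my own choice, and all values are invented placeholders rather than real data.

```python
# pages[year][country] = {series: value}; all numbers are invented placeholders.
pages = {
    1980: {"ALB": {"imr": 60.0, "gdp": 3.1}, "AUS": {"imr": 11.0, "gdp": 4.3}},
    2000: {"ALB": {"imr": 25.0, "gdp": 3.4}, "AUS": {"imr": 5.0, "gdp": 4.6}},
}


def to_vertical(pages):
    """Vertical format: one row per country-year, one column per series."""
    rows = []
    for year, page in pages.items():
        for country, values in page.items():
            row = {"country": country, "year": year}
            row.update(values)
            rows.append(row)
    # Keep each country's years together, running from top to bottom.
    rows.sort(key=lambda r: (r["country"], r["year"]))
    return rows


def to_horizontal(pages):
    """Horizontal format: one row per country, one column per series-year pair."""
    rows = {}
    for year in sorted(pages):
        for country, values in pages[year].items():
            row = rows.setdefault(country, {"country": country})
            for series, value in values.items():
                row[f"{series}_{year}"] = value
    return [rows[country] for country in sorted(rows)]


if __name__ == "__main__":
    for row in to_vertical(pages):
        print(row)
    for row in to_horizontal(pages):
        print(row)
```

With two countries and two years, the vertical version has four rows and the horizontal version has two; adding more years makes the vertical sheet longer and the horizontal sheet wider.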
So we could put the years down the rows, in which case we have vertical format, or we could put the years across the columns, in which case we have horizontal format. Those are the two basic approaches we could use. The choice of horizontal or vertical format depends on what you want to do with the data. So, for example, say you want to chart, that is, to produce a graph like this one. This is a graph of imports plus exports, total trade as a percent of GDP, for Australia (the blue line) and for the world as a whole (the red line), running from 1960 through, I think, 2015. You can see here the 2008 global financial crisis. Australia, as you can see, had a sharp drop in trade and then recovered very rapidly. The world had a sharp drop in trade and then recovered a little more slowly than Australia. Well, if you want to create a graph that has the years going left to right, then obviously you would want data in a horizontal format, in which the years spread out from left to right. It makes it very easy to produce a graph. On the other hand, if you're doing statistical analysis of variables that are measured at multiple points in time in a country's history, 1980, 1990, and 2000, well, then you would use a vertical format.
In addition to data about countries, the international data infrastructure also includes data about people. These are surveys that have been standardized and then given in multiple countries; sometimes up to 80 or 90 countries will have the same survey translated into different languages and offered in different countries. There are three major, well-known social survey programs that for years now, I mean over a period of several decades, have repeatedly conducted surveys in a global panel of countries. They are the USAID (United States Agency for International Development) Demographic and Health Surveys, which focus on poor countries. 
The International Social Survey Program, which focuses mainly on rich countries. And the World Values Survey, which has a mix of rich and poor countries and focuses on values, as you might guess.
The Demographic and Health Surveys are really the main source of the health information needed to support international aid. Whenever you read about global rates of HIV infection or rates of infant and child mortality, or if you read about what contraception use is like in Africa versus South Asia versus Southeast Asia, these kinds of statistics all come from the Demographic and Health Surveys. And interestingly, these are supported by the U.S. government. These are not a United Nations effort or a World Bank effort. These are actually United States Agency for International Development-sponsored programs.
The International Social Survey Program, ISSP, is a program run by a series of universities. In Australia, it's the Australian National University. And it has a strong tilt towards political science questions. The ISSP asks people for all sorts of political opinions about things like inequality, the environment, gender roles in the family, and whether or not the government should be involved in people's lives. Each year, the program also has a focus topic. So they've run focus topics on things like social capital, gender relations, or attitudes towards democracy. And every few years, these special topics come up again for reconsideration.
And finally, the World Values Survey was originally developed to investigate the relationship between modernization and democratization. Its most famous product is the World Values Map. Here's an example of the latest World Values Map, from Wave 6 of the World Values Survey in 2015. The World Values Map uses survey questions that have been asked of individual people. 
Again, these are surveys that have been given to panels of around 1,000 people in each of 80 different countries. The map uses multiple questions to come up with a composite scale. The first dimension is the degree to which people have to focus on survival, or are instead more focused on developing their own sense of self, on self-expression. As you would think, richer countries like Canada, Australia, New Zealand, Sweden, Norway, and Denmark tend to have more of a focus on self-expression. I mean, the need to just pay the bills and put food on the table is not so pressing in places like Canada and Sweden as it is in other countries. Poor countries, or countries undergoing transition or having some kind of economic dislocation, tend to focus much more on survival values.
The second dimension here is traditional versus secular values, with traditional meaning more religious countries or countries that are more grounded in family values. Places like Sub-Saharan Africa, Islamic countries, Central America, Colombia, Ecuador, Trinidad, Qatar, Ghana, and Nigeria are very traditional countries. The most secular countries, where people are not really bound by tradition, are places like the former communist countries (Estonia, Latvia, Lithuania), East Asian countries (Japan, Hong Kong, Taiwan, Korea), and again Scandinavian and Northern European countries like Sweden, Norway, and Finland. If you take these two dimensions together, they make a kind of axis between the most modern countries, those that have the most self-expression and secular values, and the least modern countries, those that focus most on tradition and just making ends meet, with other countries falling somewhere along that dimension.
One huge advantage of standardized comparative social survey projects is that they provide data on people, not just on countries. Since they provide data on people, you can actually study things like, in this case, the correlation between education and income in different countries. 
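As a rough illustration of what a per-country correlation involves, here is a minimal Python sketch that groups individual respondents by country and computes a Pearson correlation between years of schooling and income within each group. The country labels and all respondent values are invented for the example, and the choice of Pearson's coefficient is an assumption made for the sketch; it is not taken from the World Values Survey or from any published analysis.

```python
from collections import defaultdict
from math import sqrt

# Invented respondents: (country, years_of_schooling, income).
respondents = [
    ("Country A", 8, 10), ("Country A", 12, 20), ("Country A", 16, 30),
    ("Country B", 8, 20), ("Country B", 12, 10), ("Country B", 16, 20),
]


def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_by_country(rows):
    """Return {country: correlation between schooling and income}."""
    grouped = defaultdict(lambda: ([], []))
    for country, schooling, income in rows:
        grouped[country][0].append(schooling)
        grouped[country][1].append(income)
    return {c: pearson(xs, ys) for c, (xs, ys) in grouped.items()}


if __name__ == "__main__":
    for country, r in correlation_by_country(respondents).items():
        print(country, round(r, 2))
```

The point of the sketch is that this calculation needs one row per person; it cannot be done from a country-level database that holds only one number per country, series, and year.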
This chart is from research I've done using the World Values Survey. I think there are some 80 or so countries in this database, and for each country what I have is the correlation between levels of education, the number of years people have spent in school, and the amount of money they make. So you can see in Albania there's a strong positive correlation. The longer you spend in school, the more money you make. In countries that have undergone severe, rapid change, for instance Armenia, there's virtually no correlation at all. It's actually slightly negative, but effectively that's no correlation. But countries that are more stable and that have stable stratification systems, the United Kingdom, the United States, Uruguay, all have positive correlations between education and income. There's even some evidence that the correlation is strongest not in the most developed countries, but in places that have maybe higher levels of tradition rather than a focus on self-expression or people's own accomplishments. So in many countries it may be true that people from the same families are the only ones who are able both to get educated and to make high incomes. You see some of the strongest correlations in places like, for example, Tanzania, where there's an extremely strong correlation between education and income.
Key takeaways. First, the system of national accounts, or SNA, includes data for GDP, but not for economic inequality, which is a big gap in the system of national accounts. And by the way, that gap can't easily be filled. The gap is due to the way the data are collected. The system of national accounts doesn't collect data on every individual's income. It uses surveys of companies about how much they pay and their payrolls. So you can't easily get income inequality between people out of the system of national accounts. 
Second, horizontal format and vertical format are two different ways to represent three-dimensional data in two dimensions. And really, this has to do with what you do with the time dimension. Do you lay time out horizontally from left to right, or do you put time down the rows from top to bottom? And finally, standardized comparative social survey projects are very useful for providing data on individuals, individual levels of education or income and the like, rather than just on countries. And in fact, the fact that they have individuals in their data makes it possible to calculate some very rough or rudimentary inequality measures using social survey projects.
Thank you for listening. You can find out more about me at salvatorebabones.com, or you can also sign up for my monthly Global Asia newsletter.