Okay, well, before I begin, I want to make a little point about the term "big data." In April 2011, my colleagues and I published a special issue of Historical Methods titled "Big Data." This is the Google Trends graph of searches for "big data," and there's where our article appeared. I never really figured out why we didn't get famous for inventing this fad, but at any rate, there you are. So I'm going to talk about three topics. The United States has lagged behind other developed countries in the use of linked demographic and administrative records, and I'm going to explore some of the reasons for this. In particular, I'm going to tell you about the 1965 proposal for a national data center, which collapsed in a frenzy of paranoia about the perils of Big Brother. Then I'm going to talk about several 21st-century data curation and linkage projects that are effectively attempting to fulfill the vision of the national data center and have the potential to provide a new framework for understanding the past. And finally, I'm going to close with a discussion of new paranoid threats to data. Even though the United States has lagged behind Europe in the development of linked demographic and administrative records, it has traditionally been a leader in providing open access to powerful public use data. And very soon that data may disappear because of a new frenzy of paranoia. So let me start with the 1965 National Data Center. In the United States, we have no organization called Statistics USA. Instead, we have 14 different major national statistical agencies; here they are. In addition to these, the federal government has 89 minor statistical agencies. On top of that, vital records are a state responsibility. Here, for example, is a birth certificate from the state of Utah; I have a marriage certificate from the state of New York and a death certificate from California. The content of vital records varies from state to state, and most states charge money for access to them.
And then when you go to the other records that are commonly linked in Europe, education records and health records, it's even more of a mess: a mix of public and private, with no centralized database. These are my parents, Richard and Nancy Ruggles, and they worked hard throughout their careers to overcome these obstacles. In the late 1950s, they began working on the problem of integrating individual-level and aggregate-level data to improve national income accounting. That led them to the Census Bureau, where they began analyzing microdata, a term they coined to refer to individual-level data. In 1962, my dad was appointed chair of the Social Science Research Council's Committee on the Preservation and Use of Economic Data, and in 1965 the committee produced what became known as the Ruggles Report. I should probably include my mother in this picture, because she probably wrote most of the report; but this being the 1960s, she didn't get any credit. At any rate, the report noted that there was a lot of data in the federal government, but it was dispersed over many agencies. There were lots of obstacles to using these data, and it was absolutely impossible to link them together, because each of the agencies was very proprietary and wouldn't let you mix its data with data from other sources. The report was pretty forward-looking. It specifically recognized the power of microdata, noting that unaggregated micro-level information offers far greater analytic potential than tabulations. You've got to keep in mind that microdata was a very new concept: in 1965, there were very few microdata sets available. The report also recognized the importance of old data; you don't just need the new data, you need the old data as well.
So, the report urged the creation of a new federal data center with these goals: to preserve the data, to open access to the research community, to provide more consistent documentation and consistent data formats, to harmonize data across agencies, and to provide record linkage, so you could link not only across agencies but also across time to build up life histories. They submitted the report to the Bureau of the Budget, and the Johnson administration appointed two additional committees to figure out the logistics of the proposal. Everybody agreed it was a great idea, and the Johnson administration decided to go full steam ahead and establish the new data center. But then it hit the fan. There were multiple hearings in both the House and the Senate about the invasion of privacy that would occur if the National Data Center were established. The most vehement critic was this guy, Cornelius Gallagher, a Democrat from New Jersey who chaired the Invasion of Privacy Subcommittee. He cited the horrendous potential for "a computerized and dehumanized version of hell." Then the press picked it up; there were hundreds of articles on the subject. Here are a couple of examples. Here's the Herald Tribune: "Data Center Plan Called Invasion of Privacy." The Washington Post said it was a harbinger of Big Brother. The New York Times, here, in "Data Bank: Peril or Aid?", described it as an Orwellian threat to personal privacy. Here's the Pittsburgh Post-Gazette: "Computer as Big Brother." And it was in all the major news magazines: Look, The Atlantic, Newsweek, Forbes, Time, U.S. News and World Report, The New Republic, The New Yorker. There were even two separate articles in Playboy. Here are a few of these. This is the New York Times Magazine: bureaucratic efficiency could put us in "chains of plastic tape." And Look magazine, which was one of the biggest magazines in the country in the mid-1960s: "Will It Kill Your Freedom?"
And The Atlantic ran a somewhat more highbrow article about the National Data Center and personal privacy. The social scientists tried to defend the idea. They pointed out that data security could actually be improved with a central system, because the status quo was an ad hoc patchwork across all these different agencies, some of which had very little in the way of privacy controls. But nobody listened. The Johnson administration quickly backed down and abandoned the idea of the National Data Center. So why was there such a panic? Well, partly it was fear of the computer, the electronic brain, which in science fiction often proved to have malevolent intent. Some of the fears were more realistic. J. Edgar Hoover was the director of the FBI, and his abuses of power had been coming to light. It was an open secret that the FBI maintained files on millions of Americans that were sometimes used to intimidate and blackmail. Because the FBI was corrupt and political, the entire federal government was suspect. In the wake of the firestorm created by Gallagher and the press, the data center simply disappeared. The Ruggles Report did have one direct consequence: the National Archives established a data archive branch, known as the Machine-Readable Data Division, and the government began to preserve electronic records. And in the end, my parents did kind of get their national data set. They continued to work with the Census Bureau, the Social Security Administration, and other agencies to conduct analyses and make data usable. In the 1970s, they began working on a project to construct a longitudinal file from the Annual Survey of Manufactures and the Census of Manufactures, linking records for individual firms over time. And to make this data available to researchers, the Census Bureau established what it called a research data center in 1982 to maintain and disseminate the longitudinal establishment database in a secure environment.
And that grew into the Federal Statistical Research Data Centers, which now include data not just from the Census Bureau but from 12 federal agencies, with 30 branches around the country; this is where they're located. So you could argue that the vision of a national data center is, in a sense, kind of coming to pass. But it's coming to pass mainly because of some large data infrastructure projects, and that brings me to the next section, where I'm going to tell you about these new big data developments that are, I think, having a transformative impact. First, there are two data collection projects I'm going to talk about: the National Historical Census Files Project, which has been going since 2002 and is virtually complete, and the Big Historical Microdata Project, which started a little earlier and will go a little longer. And second, I'm going to talk about two record linkage projects: the Census Longitudinal Infrastructure Project and the Multigenerational Longitudinal Panel. So, the National Historical Census Files Project first. This is a collaboration of IPUMS and the Census Bureau, and the goal was to recover all of the internal census microdata for the period 1960 through 2000. So we scoured the Census Bureau, found every copy of the data, and verified that the records actually match the published tabulations, that all the data was there. Then we converted them into harmonized IPUMS format, wrote documentation and so on, and made them available through the Federal Statistical Research Data Centers, because of course this is all restricted data that's still confidential. And it's a big scale: 1.1 billion records. We should have been done by now, but we're just now working on the most recent census and getting it finally converted into IPUMS format; we will be done shortly. We got slowed down a little bit. The most challenging part of the project was 1960.
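That verification step, checking recovered microdata against the published tabulations, can be sketched roughly as follows. This is a toy illustration, not the project's actual code; the field names, categories, and counts are all invented:

```python
# Hypothetical sketch: aggregate recovered microdata and compare the counts
# against a published census tabulation, so missing or duplicated records
# stand out. All field names and figures below are invented for illustration.
from collections import Counter

def tabulate(records, field):
    """Count microdata records by the value of one field."""
    return Counter(r[field] for r in records)

def verify(records, field, published):
    """Return categories whose microdata counts disagree with the published
    tabulation, as {category: (observed, published)}."""
    observed = tabulate(records, field)
    return {cat: (observed.get(cat, 0), pub)
            for cat, pub in published.items()
            if observed.get(cat, 0) != pub}

# Toy example: three recovered records vs. a published two-category table.
records = [{"sex": "male"}, {"sex": "female"}, {"sex": "female"}]
published = {"male": 1, "female": 3}          # published table says 3 females
print(verify(records, "sex", published))      # {'female': (2, 3)} -> one record missing
```

A discrepancy like this is exactly how a missing chunk of data, such as the Chicago-area gap discussed next, would surface.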
When we went to do the 1960 census, we found that every copy of the data we could find anywhere in the Census Bureau was missing the whole Chicago area. And that was a problem. So, this is a FOSDIC machine. FOSDIC is the Film Optical Sensing Device for Input to Computers, and it was the means by which the 1960 census was converted into machine-readable form. It was the first high-speed optical mark recognition system in the world, and it was actually built at the Census Bureau. It used these bubble sheets, just like the ones you'd use for a standardized test: you fill them out with a number two pencil. Here's the question "Is there a clothes washing machine in this unit?", with answer categories for wringer or spinner washers, automatic washers, and so on. The bubble sheets were microfilmed, and then the FOSDIC machine read the microfilm; you can see the operator there loading some microfilm. It seemed to me that the key to solving the missing-Chicago problem was to find the microfilm. After some investigation, I located the microfilm frozen on shrink-wrapped pallets in this cave in Lenexa, Kansas. It's a big cave; there's the interior. So we set up a scanning station inside the cave, scanned all the microfilm reels for the Chicago area, got them converted into modern machine-readable data, and merged them into the data sets. So 1960 was fixed. The second big project I want to talk about is the Big Historical Microdata Project. This is a collaboration of IPUMS with FamilySearch and Ancestry.com, the two largest genealogical companies in the United States. Through this project, we are creating complete microdata for the United States from 1790 to 1950. We have now released all the data for 1790 through 1940. The 1950 census won't be released to the public for another year, in April 2022, and we will process it when that happens.
But this is a very large-scale project, which would have been very expensive if we did not have the collaboration of the genealogical companies to do the digitization. So this is the situation we are in. The green is the data from the National Historical Census Files Project, the 1.1 billion records that are in the Federal Statistical Research Data Centers. The peach-colored bit is the data we collaborated with the genealogical companies to digitize for the period 1790 to 1950. And the blue down there is the public use data that the Census Bureau has been releasing for the last 60 years. Okay. So now I'm going to talk about the two linking projects. Like the two data projects, the first linking project deals with the recent period, the restricted data, and the second one deals with the older data that is in the public domain. The CLIP project, as we call it, is a collaboration of IPUMS and the Census Bureau. The goal is to link records from the censuses from 1940 to 2020, both to each other and to administrative records. We have completed the linkage of 1940 to the 2010 and 2020 censuses, and we have about a dozen research projects underway in the Federal Statistical Research Data Centers. The most challenging part of this project is that from 1960 to 1990, the Census Bureau never digitized the names, and so we are working on optical scanning for those. This is the FOSDIC form for 1990; it's a bubble sheet very much like the form for 1960, and you can see the names are up here at the top. The issue is that we have to scan those names and interpret them. The work we've done so far is very promising, and we expect that this will happen, but it will not happen quickly because, you know, Census is Census. So the linking strategy for CLIP begins with 1940, and we link 1940 to the Social Security enrollment database, the Numident.
One of the nice things about the Numident is that it includes all the name changes people have over the course of their lives. So it includes maiden names for married women; indeed, if a woman had multiple serial marriages, it includes all the names she had at each point in her lifetime. That makes the record linkage easier. We also use World War II military records to help us, in some cases, with identifying people in the Numident. The Numident is already linked to the 2020 census and to all of these other administrative sources, so it's really an extremely rich database. Right now, we have not done the 1960, 1970, 1980, or 1990 censuses, because we don't have the names; but when they get added, it will of course be much better. The second linkage project is called the Multigenerational Longitudinal Panel, or IPUMS MLP. Again, this covers the earlier period with the public data. It's much easier to work with because it's outside of the RDCs, but unfortunately the data is not as good. The goal is to link 1850 through 1940 to each other and to other records. We did our first release last summer, and we're about to come out with a new version; we plan to come out with new versions annually, virtually indefinitely, because you can always improve this sort of thing. So again, our strategy is to start with 1940, but this time we link backwards, to the 1850 through 1930 censuses. This only goes back to 1850, not 1790. To do that, we use a public version of the Numident, which helps us link women, and we also have access to some genealogical data that can help with that problem as well. And then we link to military records, vital records, and most importantly to CLIP, so that you can bring MLP into the Federal Statistical Research Data Centers and have coverage all the way from 1850 up to 2010.
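The core linkage idea, matching a census record against a reference file that carries every name a person ever used, can be sketched roughly like this. All names, years, IDs, and the blocking rule are invented for illustration; real linkage systems add fuzzy name comparison and many more blocking and comparison variables:

```python
# Toy sketch of multi-name record linkage. Because the reference file keeps
# every name a person has used (including maiden names), a census record can
# match on ANY of them. Everything here is invented for illustration.

def normalize(name):
    return name.strip().lower()

def link(census_record, reference_file, year_tolerance=1):
    """Return IDs of reference entries with a plausible birth year whose
    recorded names include the census name."""
    name = normalize(census_record["name"])
    matches = []
    for entry in reference_file:
        if abs(entry["birth_year"] - census_record["birth_year"]) > year_tolerance:
            continue  # blocking: skip entries with implausible birth years
        if name in {normalize(n) for n in entry["names"]}:
            matches.append(entry["id"])
    return matches

reference_file = [
    {"id": 1, "birth_year": 1920,
     "names": ["Mary Smith", "Mary Jones"]},  # maiden and married names
    {"id": 2, "birth_year": 1920, "names": ["Mary Brown"]},
]
census_1940 = {"name": "Mary Jones", "birth_year": 1921}
print(link(census_1940, reference_file))  # [1] -- matched via married name
```

With only a single name per person, the married-name record above would have been unlinkable to a maiden-name record; carrying the full name history is what makes linking women across censuses feasible.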
So that brings me to my last topic: the revival of paranoia. In an echo of the paranoid streak that foreclosed the National Data Center in the 1960s, we now face threats to some of the most intensively used demographic data in the world. The Census Bureau, as I mentioned, has been a world leader in providing access to public population data. In 1962, the Census Bureau created the first public use microdata file, and it also created the first electronic small-area data file. The 1962 data was distributed on 13 Univac tapes, or you could get the one-in-10,000 version on 18,000 punch cards if you didn't have a tape reader. At any rate, last Friday, at the ACS data users meeting, the Census Bureau described a plan to replace the American Community Survey microdata, which is the descendant of the 1960 public use sample, with fully synthetic data by 2024. This is a bad thing, because the ACS is one of the most intensively used scientific databases in the world; according to Google Scholar, last year there were 12,000 publications based on ACS data. Under the synthetic plan, they develop statistical models describing the relationships between the variables in the ACS, and then construct a simulated population that's consistent with those models. Obviously, you can then only look at relationships that have already been baked into the model, so this is not good for studying unanticipated relationships; it impedes new discovery. And the Census Bureau recognizes that you can't do most of the kinds of analyses people are doing now with the ACS. The model only incorporates individual-level relationships, so analysis across household members, for example, won't be possible. You can't study ethnic intermarriage, or family structure, or the impact of a partner's education on fertility, or anything like that. All of those kinds of questions will be foreclosed. So why are they doing this?
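Before getting to the why, here is a toy illustration of the limitation just described, with entirely invented data. The "synthesizer" below preserves each variable's marginal distribution but deliberately models no cross-variable relationships, so a perfect husband-wife education correlation in the source file vanishes in the synthetic one:

```python
# Toy illustration: fully synthetic data reproduces only what the model
# encodes. This model keeps each column's marginal distribution and nothing
# else, so an unmodeled cross-variable relationship is destroyed.
# All data is invented for illustration.
import random

random.seed(0)

# Source data: spouses' education levels are perfectly correlated.
source = [("college", "college")] * 50 + [("hs", "hs")] * 50

def synthesize(data, n):
    """Sample each column independently from its own marginal distribution,
    i.e. a model with no cross-variable terms."""
    col1 = [r[0] for r in data]
    col2 = [r[1] for r in data]
    return [(random.choice(col1), random.choice(col2)) for _ in range(n)]

def share_matching(data):
    """Share of couples with the same education level."""
    return sum(a == b for a, b in data) / len(data)

synthetic = synthesize(source, 1000)
print(share_matching(source))     # 1.0  -- perfect correlation in the source
print(share_matching(synthetic))  # ~0.5 -- the relationship is gone
```

An analyst who only had the synthetic file would conclude that spouses' education levels are unrelated, which is exactly the problem with foreclosing unanticipated relationships.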
Well, the push is coming exclusively from the Census Bureau itself. In this respect it's very different from the Big Brother paranoia of the 1960s, which was driven by politicians, by the newspapers, and ultimately by the public; nobody outside the Bureau is asking for this. The change is being driven by a radical reinterpretation of census law. This is the chief scientist of the Census Bureau, and he's arguing here that the Census Bureau cannot reveal characteristics even if people's identities are fully protected. Under this vision, microdata has always been illegal; according to John, the Census Bureau has just been blatantly violating the law ever since 1962. But the risks are in fact very low. Nobody has ever been identified; there's not a single documented case. It's very difficult to calculate a risk, because there's nothing in the numerator: there are no examples where anyone has been identified. And if by some miracle somebody were identified, there wouldn't be very much harm, because there isn't very much sensitive information in the American Community Survey, and besides, most of the data in there could much more easily be obtained from other sources. So if you weigh the profound cost of eliminating the ACS microdata against the fanciful benefits for respondent confidentiality, the Census Bureau just has no justification for closing access to the data. So my conclusions: the 1965 data center was a great idea, but it was foiled by paranoid delusions. It would have increased confidentiality and respondent safety, but there was no buy-in. Nevertheless, time passes, and the current new projects promise to fulfill the vision of the National Data Center. But the paranoid delusions continue, and now they're coming from inside the Census Bureau.
And so I'm going to do everything I can to try and prevent these delusions from destroying the crown jewels of American demographic data infrastructure. Thank you.