[unintelligible] and we're going to do some of the research that's taking place with it. It's great to see a lot of students using the data. We actually have fewer students using our data, for reasons Katie can get into, but basically the access process is much more difficult in the United States, in a way that's not good. I'm not saying that we're not proud of it, but it's slow and kind of hard. Here it's pretty easy to get this data, as far as I can tell.
Okay, so this is the... [unintelligible]. This is a joint academic-government collaboration. Michigan. We've been working together for 12 or 13 years and we've been working together on this for the last week or so. Up until a year ago, it was flipped: Katie was at the University of Minnesota and I was at the Census Bureau, and now we're still [unintelligible]. Basically we use some administrative data in service of our linkage that I think is pretty different from what happens here, and that's what I wanted to talk about. And then Katie's going to jump in and talk about the specific infrastructure that we're building. I'll sort of give you the spoiler alert.
Our goal for, I don't know, five years, maybe ten years, is to have as close to 100% of every census from 1850 to 2020 linked as we can. This is going to be a multi-generational longitudinal infrastructure. Three years ago that was just kind of a hope. I think we're actually on the path to where we're going to succeed, but there's a lot of work before us. So Katie's going to talk about the state of the infrastructure right now and how people get it. Even though it's a patchwork right now and we don't have all the censuses linked yet, we have quite a few of them, and people are using it. That's where you can really see how our data access protocols compare to the ones that you already have. And then our goals to finish it. Okay.
Oh, so I went to graduate school at Carnegie Mellon University in Pittsburgh, Pennsylvania, in the 90s. I was a data guy. I've been hearing a lot about how hard it is to get and use the data. I want you all to think back. It used to be harder. This wasn't even linked data. This was just microdata. This isn't me, but it may as well have been. That's a really awesome computer, where he could load those reel-to-reel tapes himself. At Pittsburgh in the 90s, I had to send a mount command to the mainframe and then call someone over in a different building and say, I sent a mount command, and he would go mount the tape. Anyway, that was how it worked. Oh, and I had a five gigabyte allotment on the server, which was a huge deal. They would call me every month and say, are you done yet? We need that space. Anyway, I was using that for the 1980 census PUMS. Katie, the 90s. But she used the time use survey in a similarly difficult way. It was hard. It was torture. Getting this data, finding the space to put it somewhere, analysing it. You'd get a certain number of cycles overnight, and it was hard. But it was also hard because this was just cross-sectional stuff.
A lot of people have talked about using the cross-sectional data. It was hard because each file had its own layout and its own coding scheme. In some files men had a sex value of zero, and in others they had a value of one. That's a really easy one. Occupational codes changed radically over time. Everything changed over time. And it was hard. It was really hard. I spent a lot of time doing the grunt work of just getting a string of censuses that I could do cross-sectional analysis on over time, with a lot of content. I wasn't alone. Oh, those are the code books from the different samples. Our Census Bureau made, using confidential data, public use microdata samples from 1940, 50, 60, all the way through the present. I used those, but every code book was different. Every file was different. Getting it loaded onto the server and getting to analyse it was all a huge challenge.
When I finished graduate school, I went to the University of Minnesota. You may know this, but here it's called the SAR, the sample of anonymised records. These are called PUMS files in the United States, public use microdata samples. It's the same thing. It's anonymised microdata that you can use on your own computer. And I went to work on a project in Minnesota, IPUMS, where the I stands for integrated. We integrated all the data and gave the variables the same names over time. So it was called sex in every sample, never gender. We gave the codes, as much as possible, the same values over time. Men are always a one, and women are always a two. We did integrated occupation codes when possible, although something like that introduces a certain amount of sloppiness, which we could just document, and we make the originals available for whoever wants that original, sort of unharmonized data. So, I mean, I worked there for 10 years. Katie was there for part of that time as well.
And I think really the process of doing this kind of work in the 90s was so traumatic that, for a bunch of us, it became sort of a mission to make this data easy to get and easy to use. I say that because a lot of people here have been talking about using cross-sectional data. But I also think it's relevant to what we're doing now, and what I think you're doing now, which is trying to accomplish the same thing with longitudinal data. It's really hard to use longitudinal data responsibly. And we're trying to make it easier, much like what was done in the past with cross-sectional data. You should know you can go to ipums.org. All of this data is freely available to anyone in the world. And also, IPUMS International has almost 100 countries that they have microdata for. I hope you all know. You're just all lying around. Never mind. You did it! Look at that! Ipums.org.
I'm actually going to talk about that for a second. A lot of this data wasn't just kind of hard to use because you had to get the reels shipped to you and the codes were weird. A lot of it was completely inaccessible. And as Katie said, we did this not just for the United States, but for a lot of countries around the world. This is Sudan. I can't remember which year this is. But this is what it looked like. Nobody was using the data. It was completely inaccessible and it was degraded. So we, University of Minnesota employees, went there, helped to get this ready, shipped it to a firm in New York City that specializes in recovering data off of old media, sent the data to the University of Minnesota, and integrated it over time. We have multiple years disseminated freely now. This is public use microdata. This is Bangladesh. It looks like 1980. Same thing, that's mold. This was in very bad shape, but it was all recovered. And this isn't just countries around the world. This is a picture of a project that I carried out in 2007.
This is where the US census original data is stored. In the mid-2000s, the Census Bureau had made PUMS files, which I already described, but had not retained the original microdata, not all of it. It was stored on reel-to-reel tapes, just like you've seen a bunch of already. And we said, hey, can you try to recover the 100% data? Because there's a lot of stuff people are going to be able to do inside the Census Bureau that they can't do with the PUMS files. And you can monitor their access. They have a great system for providing secure access to data for researchers. And nobody had asked the question yet. So for 1960, 70, 80, 90, the data was on reels. It wasn't even on a disk anywhere. And some of it was degraded, just like the Bangladesh data that you saw.
So we got a grant to go down into, this is a cave in Kansas. We worked in the cave for several months recovering old data. Scanning it, it was on microfilms. Scanning the microfilms using, I'll show you. Oh, so there's a microfilm. That's the inside of the cave. Look at that gate. That's where we store our stuff. This is how the microfilms were originally processed. This is the 1960 Census. The FOSDIC machine, this was a bubble form. This is how those bubbles were read. They were read optically with this thing that would shoot light through the microfilm. And this was the modern FOSDIC machine that we set up in 2007, where we would scan the data, get digital images, and process them to make a data set. We did that with the 1960 Census. The other files were fully recoverable without this kind of intervention. But for 1960, in the United States, we were down in a cave making a new data set because the original had degraded.
In 2010, I went to the Census Bureau. I left Minnesota and I worked on the American Community Survey. It largely replaces the interesting parts of our decennial census. It's an annual survey of one and a half percent of the population.
That's got a ton of questions, and our decennial census now has very few questions. It's a short form only. I started to work on the kind of things we're going to talk about, which is linking data and providing linked data in secure facilities to researchers. Any questions so far? Sure.
You're explaining that it's hard for you to access the data, but in America don't you have the rule that you can't make the data available for 100 years?
72. 72 years. So the 1940 Census was released in 2012. We're actually waiting for that. Any other questions? Okay.
So there's kind of a big break in... Of course people are linking all kinds of data to everything all the time. I'm going to have a couple of slides just focused on linkage projects using decennial census data in the US. So it's narrow. There's a big break in how this is done, related to the question you just asked. Everything from 1850 to 1940 is public. Again, I don't want to make it sound simple. It was released as microfilms, just like the one that I showed you from 1960. Katie and I were involved in many projects at Minnesota where we typed in samples from those microfilms and made public use microdata samples available. It's great. People use that to do all kinds of things. It's largely in the process, however, of being superseded by full 100% files produced by genealogical companies for genealogical research. After I left, Katie and some of her colleagues arranged access to them for research purposes. So the companies created the files largely to make money, and they provided them for research purposes to the University of Minnesota. Even with names, they can be shared under usage agreements with data security plans. That's only been available for three or four years, and there's really exciting research coming out, not only using 100% cross-sectional data, because there's a lot you can do that you can't do with the 1% samples we originally made, but linking it.
This kind of thing has a precedent among historians. Even in the 60s, before people were really using computers for this work, there was manual linkage of decennial census data taking place, where you would take everyone in Boston in 1850 and try to find them all in 1860. First by looking at Boston, then maybe looking a little further afield. So some of the best research in U.S. social history was already based on linked data, but these were tiny samples, not linked very systematically. Just in the past few years, now that we have the genealogical data, this complete count, it's opened the door for automated linkage. Even these automated efforts with complete data get very low linkage rates. Say between 1860 and 1870, you might expect to get maybe 10% to 15% linkage, even with the best linkage technologies that we have. And that's because there are so few linkage keys to work with in those old censuses. We don't have date of birth. I'm not sure people knew their date of birth. We have age. So it's basically age, name, and state of birth. Some of our states are really big. So the number of unique cases you can get is small. If you really want high quality links that you trust, you don't get many of them.
And that, in a nutshell, is what has been up with linkage for this period until maybe the past year: it's out there, people are doing it, they have the data, they have the tools, and they're linking maybe 10%, if you widen your parameters a little bit and accept that you're going to have some error. And that's it. So the first part of every article is, here's why it's okay that I lost so many people, trying to explain how representative the people you've got are, and only then getting to your analysis and understanding the data. Oh, and the other approach is to not even try to do everyone, but to do people who are easier to link, like married couples, because you have more linkage keys: you have both of their names.
So you can look for them together in multiple censuses and do a lot better. But then you're writing an article about married couples, which is a new kind of problem, because it probably wasn't what you wanted to study. If that was what you were interested in, you're in business, because you can link them really well. But if it wasn't, no. If you want to link women from childhood forward, impossible. Their names change. Very hard. It's only been in the past couple of years that there are bigger and more systematic efforts trying to use more administrative records, so it's easier to do something like link women. You can observe them in their birth record and their marriage record and see their name change. Once you have that information, you can really do it pretty well in the census. With the marriage record, you've got that couple, so you have an opportunity again to use more linkage keys and get much better linkage rates. This is just getting going. Katie and I are both on the board of a project that I think shows a lot of promise for us to finally get to much denser linked samples for these years. That's the status of that research.
Now, from 1950 to the present, it's pretty different. This is private data. It all has to happen within the Census Bureau, on census computers, by census employees, according to the rules of the Census Bureau, which is great. They have a great computing environment, but it's harder and slower, and it's harder to disseminate when you're done. And the Census Bureau has a great linkage program, with very high linkage rates, usually, from one census to another. From 2000 to 2010, we linked maybe 90% of the cases. And the reason isn't really that the record linkage is space age technology. It's more that the linkage keys are better. We do have things like date of birth.
And, something I'll talk about a little more, we use administrative records to facilitate the linkage, sort of like they're beginning to do with the older historical data, in a way that changes the whole ballgame. One more point about this linkage here. The Census Bureau started this, before I got involved or Katie got involved, with 2000 and 2010. Prior to maybe two years ago, those were the only linked decennial censuses that the Census Bureau had, and they didn't do this to facilitate research. They put linkage keys on these files so they could carry out the census more effectively, so they could use it for missing data allocation, basically for programmatic purposes. They want to link income data from the tax service, from the Internal Revenue Service, to the American Community Survey or to the 2000 census, not so they can do a research project, but so they can allocate the income value. So it's much more for operational purposes than for research. They've had a great linkage program for many years, but it wasn't there to support people like us. Not until very recently. Did you want to add something?
Okay, the last thing I want to talk about is how we do it. So this is focused on that 1950 to the present, the stuff that happens inside the Census Bureau that's really good, where it gets really high linkage rates. I'm going to read all this. This is more relevant in a U.S. context, where the question is always kind of close to the surface even if it doesn't come up. Like, why do you think you can have all my data? What makes you think you have the right to do this? Because most federal agencies don't. The privacy laws in the U.S. are such that for the data to be shared, for one agency to share it with another, you have to get the person's consent. You know, that's difficult. There are a lot of people who want to, and a lot of people who don't want to. The Census Bureau has an exception to that law, which is called the Privacy Act.
They have an exception to the Privacy Act where data with fully identifying information can be shared with them if it's in service of their goals, the goals of the Census Bureau, which are to produce statistics about the population and the economy. So it's kind of that simple. If they can justify a statistical need for the data, they can request it from any agency in the government, from any state. They can request it, but it's not guaranteed. No one has to give them their data. Within our government, even the ability to request it is not entirely unique. There are a couple of other agencies with this kind of right, but it's pretty special. And what this says is basically that the reason they can request data is so they don't have to bug the public quite so much. If we can use this data to create products, to improve missing data, to add value to our data products in a way that reduces burden on the public, we're allowed to. So we have a ton of administrative data, and they started by saying it's mainly for these kinds of purposes, for operational purposes, for making the agency more efficient. Just recently they've said we can also use it to conduct research, important social and economic research, which is what we're going to be talking about a little more.
This is a sample of the data that the Census Bureau holds. This is in addition to census data and survey data. This is our tax data. This is housing subsidy data, tenant rental assistance data, childcare support. Medicare is everyone over age 65 in the US, so we have a lot of information about who they are and what kind of services they receive. Medicaid is data on poor people under age 65 and the services that they have access to. So you can see there's a ton of data sources that the Census Bureau got, almost all of these, originally to support their surveys, make them more efficient, and reduce burden on the public.
Nevertheless, it's all there and can be used for research that's related to the mission of the Census Bureau. So Katie and I are going to be describing a frame of census records that we've created, but the frame, I think, is just that. It's sort of a trunk. These are going to be the branches of the tree, where there's so much more you can do when you link it all in. This is all already linked to every census that we're going to describe.
So here's how we link. We have homegrown software that has a name, but it's a bunch of SAS programs and [unintelligible] and things like that. We call it the Person Identification Validation System. It assigns a unique number to each person. We call it a PIK, a protected identification key. On every file where I show up, I have the same PIK, if we did it right. Unless we didn't. That's how that works. So it's trivial, when the work is done, to find me on every file we have, because I'm assigned the same value on this variable. The Census Bureau initially developed this to unduplicate records. In the 2010 census we had a lot of people who filled it out more than once, and we needed a way to identify duplicates. That's how this started. It's grown from there to be used to support programmatic purposes and also research.
And how we PIK data, how we put these keys on data: this is based on the social security number. It's our social welfare program. Everybody is assigned an SSN at birth now. Anyone who comes to the country is assigned one if they have a work permit. Basically if they enter the labor force, or if they receive any kind of benefits from a state government or local government, they receive an SSN. This is not complete coverage, but it's pretty darn close. And that's our frame. If you don't have an SSN, we're not putting a PIK on you. So I'll talk about that. That does create a little bit of bias.
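The point that a shared key makes it trivial to find one person on every file can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file names, the `pik` field name, the PIK values, and the helper `index_by_pik` are all hypothetical, and the records are made up. It is not the Census Bureau's software, which the talk describes only as a set of SAS programs.

```python
from collections import defaultdict

def index_by_pik(files):
    """Map each PIK to the (file name, record) pairs where it appears.

    `files` is a hypothetical dict of file name -> list of record dicts.
    Records whose "pik" is None received no key and cannot be linked.
    """
    index = defaultdict(list)
    for file_name, records in files.items():
        for record in records:
            pik = record.get("pik")
            if pik is not None:
                index[pik].append((file_name, record))
    return index

# Made-up records: one person observed twice, one unkeyed record, one other person.
files = {
    "census_2000": [{"pik": "P001", "age": 31}, {"pik": None, "age": 40}],
    "census_2010": [{"pik": "P001", "age": 41}],
    "acs_2015": [{"pik": "P002", "age": 25}],
}

index = index_by_pik(files)
appearances = [file_name for file_name, _ in index["P001"]]
print(appearances)
```

Once every file carries the key, following a person over time is a lookup rather than a matching problem; the record without a key simply drops out, which is the source of the bias mentioned above.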
So we have this big file for SSN holders that has their name, their date of birth, and their state or country of birth. To that, we add tax data. We add their street address, the place where they live, which we have every year because we get tax data every year. So a year-specific street address. And then we do some name standardisation and some address standardisation so it's easy to work with this file. We call that the reference file. Every file we get, including the census, we link to this thing that's built from social security data and tax data. Because that links you to a social security number, and there's a one-to-one between every PIK and every social security number. Does that make sense?
All right, I'll maybe go through this kind of quickly. This is how we link. We can't compare every record to all the other 300 million records when we do this. We don't have that kind of computing power. So we do blocking, which is what most people do when they do record linkage, where you say, well, I'm just going to look within these groups. But we do five different blocks. The first thing we do is, if we get your data with an SSN, which we often don't, like on the census we don't ask for the SSN, so we don't always have this as a linkage key. But if we do, it's great, because it usually works. So within that block, within the SSN, we'll go and look at the social security data and say, does the name match? Does the age match? Does the place of birth match? And we'll declare a match. That works really well. Then we'll look within the address. So we'll compare the 2000 census address we have for you to the address we observed in tax data from the year 2000. We'll look at that and then say, does this person have the same name, the same date of birth, the same place of birth? If that fails, then we sort of look nearby. If that fails, we'll look through everyone with a last name that begins with A. For me, we have an A block. So this is how we do it.
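The cascade of blocks described above can be sketched roughly as follows. This is a simplified illustration, not the actual system: it uses three passes instead of five (the "look nearby" pass is left out), it compares fields by exact equality, and it accepts a match only when exactly one candidate in the block agrees. Those choices, along with every field name, function name, and record below, are assumptions made for the example.

```python
from collections import defaultdict

# Each pass privileges one variable as the blocking key.
# Records are only ever compared to reference records in the same block.
def ssn_key(rec):
    return rec.get("ssn")

def address_key(rec):
    return rec.get("address")

def initial_key(rec):
    last = rec.get("last_name")
    return last[:1].upper() if last else None

PASSES = [("ssn", ssn_key), ("address", address_key), ("name_initial", initial_key)]
COMPARE_FIELDS = ("first_name", "last_name", "birth_year", "birthplace")

def build_index(reference):
    """Group the reference file into blocks, once per pass."""
    index = {name: defaultdict(list) for name, _ in PASSES}
    for ref in reference:
        for name, key_fn in PASSES:
            key = key_fn(ref)
            if key is not None:
                index[name][key].append(ref)
    return index

def agrees(rec, ref):
    """Assumed rule: every comparison field is present and equal."""
    return all(rec.get(f) is not None and rec.get(f) == ref.get(f)
               for f in COMPARE_FIELDS)

def assign_pik(rec, index):
    """Try each pass in order; return (pik, pass name) or (None, None)."""
    for name, key_fn in PASSES:
        key = key_fn(rec)
        if key is None:
            continue  # e.g. census records arrive without an SSN
        candidates = [ref for ref in index[name].get(key, []) if agrees(rec, ref)]
        if len(candidates) == 1:  # assumed conservative rule: unique match only
            return candidates[0]["pik"], name
    return None, None

# Made-up reference file and incoming records.
reference = [
    {"pik": "P001", "ssn": "111", "address": "12 OAK ST", "first_name": "ANN",
     "last_name": "ALLEN", "birth_year": 1970, "birthplace": "OH"},
    {"pik": "P002", "ssn": "222", "address": "9 ELM ST", "first_name": "BOB",
     "last_name": "BAKER", "birth_year": 1965, "birthplace": "PA"},
]
index = build_index(reference)

same_address = {"ssn": None, "address": "12 OAK ST", "first_name": "ANN",
                "last_name": "ALLEN", "birth_year": 1970, "birthplace": "OH"}
moved = {"ssn": None, "address": "77 PINE RD", "first_name": "BOB",
         "last_name": "BAKER", "birth_year": 1965, "birthplace": "PA"}
not_in_reference = {"ssn": None, "address": "1 MAIN ST", "first_name": "CY",
                    "last_name": "DOE", "birth_year": 1980, "birthplace": "TX"}

result_same_address = assign_pik(same_address, index)
result_moved = assign_pik(moved, index)
result_missing = assign_pik(not_in_reference, index)
print(result_same_address, result_moved, result_missing)
```

The same comparison fields are used in every pass; only the blocking key changes, which is what keeps the number of comparisons manageable. A person who is not in the reference file comes back with no key at all.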
We use all the data we have every time, but it's just a matter of which variable you privilege to reduce the comparison space. Because we can't just take every record and compare it to every other record. Too much computation.
Okay, I'll try to go through this real quick. So this is Medicare data. Like I said, this is a social program that everybody over age 65 can enroll in. It comes with social security numbers. And with that data, when I try to put a PIK on it, it always works. Almost all the time, I find my PIK based on the social security number that I received with my Medicare data. Because those are good SSNs, because you get benefits from this program. People give the program good data because they get something from it. In the 2010 census, however, we don't ask for the SSN, so it's that address block here. We compare your 2010 census address to the 2010 tax address that we have, and look at the people and see if they're the same. They almost always are. 94% of the time, that block is the key to our being able to assign a PIK. It's not always the same. Not everyone. People move, things happen. But you can see we get most everyone else in a comparison space defined by the first initial of their last name. So we PIK, as I said, about 90% of the records on these files. These include surveys as well as the decennial censuses, which have the whole population.
So I'll kind of cut to the chase on this. The PIKs we assign, we use very conservative techniques. They don't really have errors; the error rates are very small. The links we make are very good links. That is not our problem. Our problem is bias, because we only make a link if you have a social security number. If you don't have a work permit, you don't have a social security number. So we do really well with these groups. Good for us. Not as well with these groups. In fact, this group, we don't PIK a single one of them.
If you use our data, our longitudinal data, they're excluded from your analysis because we have no way of linking them reliably. Okay? And this is my last figure. Don't try to read the small words. Here's the takeaway. For us to link a case, we basically have to observe it in our administrative records. If we don't, we're not linking them. This is administrative records coverage. In purple, coverage is extremely high. People who live in these areas we will be able to PIK, because they exist in our administrative records. In these areas, with more immigrants and fewer English speakers, much lower levels. They're still high. So this is mostly somewhere around 80 to 90% of the population we can see in our administrative records. This is based on a match between the census data and administrative records data. But I guess my point is that this bias I was describing isn't just by individual-level variables. There's a clear geographic bias in who we're able to link, because of the way we link. So I'm going to leave it there. Unless there are any questions.
I'm going to be difficult. We're supposed to write two sentences then time to file a link. You said about using the same thing. And do you do any other side of the process way? Where they would be still same. I remember. I'm not a fan of that. I'm not a fan of that. The time has been amazing because actually it would be happy to know that this is about one man on a boat with a different concept. So I think we have to work on that. Work on the concept. And we're going to create a new concept.
Katie, you want to say that one? I don't know. I was like, great! Yeah, but no. I don't know. Non-response rates? I don't know. There's an undercount. There's also an overcount though, right? There are also people counted twice. I don't think it would be a census if they don't respond. No, no, no, no, no. It's really high. It would be really small. It's small. They go after everyone.
The census is very, very expensive, because if you don't respond we go after you. And we kind of know where you are because of the administrative records. I mean, that's the deal. 67% is self-response. So you follow up with the rest of them? Tons of people are hired. Lots of people you talk to will say, oh, I worked for the census. I went to people's houses. They needed a job, they were young, they go and do it. They try to get local people, so you're in your own neighborhood. So it's one of your neighbors that did it. Yeah, it's really high. It's close to longer. I'm working on the vibe. So it's a basic thing that in the come time it's going to be a little warm.
One of the things that you think that would be very favorable for those people is that it's going to be very interesting to the census to be aware of the counts of violence. What did you say? And from that, if I understand it correctly, the particular areas that I'd have thought would be most interesting for the census to open up are the ones that are under-represented in the administrative records. Yeah. But, different from what you just explained about it being very expensive, is it that we cannot look at those at all? I mean, it's just a general question: surely it must be physically possible to get the others.
So what's going on here is that we're not linking them and following them over time, because we don't have them in the administrative records. Some of them are in the census records. They're in the American Community Survey. We do not ask directly. The citizenship question, since I've been here, is being added to the short form. And on our yearly American Community Survey they do ask if you're a citizen. We don't ask, but are you here legally? That's not a thing. We do estimates of it in every place. I mean, it's kind of there, and whether that's important to the Census Bureau or not, I don't know. I won't speak for them right here. Don't tweet anything about that, please.
So I'm going to... Yeah. I'm glad Trent introduced us. This is good. As you can tell, we have a long history. I'm pretty new to the Bureau. Trent hired me and then he left me, much like he did at IPUMS as well. But we still work together, right? A long, drawn-out history of doing very cool stuff, and I'm excited to talk more about it. I think we're making our way and doing a lot, and that's partly what we're going to talk about. We've been collaborating in different ways through university and government partnerships for a long time. So it's just sort of fortuitous, actually, how it all worked out timing-wise.
One of the things that we have been working on a lot, that we've talked about, right, we have this 2010, but the other big innovation was adding 1940. The 1940 data became public in the US in 2012. This is an image of it. All of the images went online. This was a pretty big deal. We had a big party when I was at Minnesota, right? I mean, obviously we all dressed from the 40s because it was cool. So Minnesota has the 1940 data. They worked with Ancestry.com to get it. There was money exchanged, there was data entry, and then the Census Bureau worked with them to get it, and then Trent tried to link it into our infrastructure. So link it in, get these PIKs on it. However, right, we definitely don't have SSN back then. And we don't have birth date, which is kind of a big deal at the individual level. So what we use are names, and then sex, age, state and country of birth, and parents' names, which is a really big deal for children, and where people live. It's a little bit different from these other ones. It's really different from using 2010. So we had to really change how we do things to try to identify people in 1940. We also really hoped that we could do better than some of that historical linkage that Trent was referring to, where we get about 10%. Sometimes even down to seven.
They're also messy, those older files. They're hard to read, and it's really inconsistent whether you get names right. So we've changed stuff here to make it work for us. And this is what we got in 1940. So we have everyone in 1940. On average, we got 42% of the population. So not like the 90s we get with modern data. You can see for the children, for the young people, we're getting to about 75%. It doesn't look like we'll push past that, but that's what it is. It's 75% because we have those parents' names. So we're able to use their parents and them to find them. So we are linking 40% of these people into our system, and then we can link them forward. And as it gets older it gets lower.
So here are the numbers. In 1940, we have 53 million cases. So this is a little different from, like, your study, which is about 5 million. We have 50 million we can link in, who now have a PIK. When we link them up to 2000, we get about 26 million who are still alive there, 15 million in 2010, and 3 million into our American Community Survey. So this is pretty exciting. This was really exciting for us, but also really exciting in the US, because there's a big renewed interest right now in [unintelligible], later-life outcomes, social mobility, things like this. So this is going to facilitate a lot of research, and that's what we wanted to do.
So now we're talking about 2000, right, and 1940. We have a big gap. You guys have some nice spots there in the middle of that gap. We have a big gap. As Trent mentioned, we digitized data for 60 to 90. We have the full long form for those years. In the early 2000s it's restricted data. Researchers can use that; I'll show you how. But we don't have names on the digitized data. In most cases we don't have addresses all that well, either. Like 30 digitized. It's really difficult and expensive to add names by hand, which is why they were not added.
When we have names, though, then we can put a PIK on these and link them to other things. Ding ding ding. What's next? What's next first is the 1990 Census Name Recovery Project. 1990 is sort of the next one; it's the easiest one, and I'll show you why. Partly because names were handwritten on our forms. We'll show you. And we have these forms on microfilm reels right now, and most of the other variables are already in the microdata file. So this is what it looks like. This is the form; people filled it in. All the names are there, so we need to capture all of those names. And the big key here is that they also have a person number. You can see up top it says person 1, person 2, person 3. That we put in the microdata, so we have that order in the household in our microdata. And the other thing we have is this household ID. So if we can capture the names, the person number, and the household ID, we can link them to the microdata we already have.
So we did a test to see how much it would cost and what would happen. We did a test of two scanners. This is the one that Trent showed you before, and then a new one. The Eclipse one, old technology. And what we did is we took 40,000 cases across multiple reels and had people hand-enter them into this app. So that was our truth data; we just had people at the Census Bureau hand-entering them. And then we tested out two optical character recognition companies. This is where the computer grabs the image of the name and turns it into letters. We gave them 20,000 cases as training data, and then we ran it on our other 20,000 cases to test it. This is what we got. We think this is pretty good. I was not optimistic about this, I will admit that. If any of you have tried the over-the-counter ones, like I tried this on a PDF and it was like absolute shit.
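The join described above, attaching captured names to existing microdata by household ID and person number, can be sketched as follows. The second function shows one plausible way to score OCR output field by field against hand-keyed truth records on a held-out set. Record layouts, field names and function names are hypothetical, not the Bureau's.

```python
def attach_names(microdata, captured):
    """Join captured names onto microdata rows by (household_id, person_number).

    Returns (linked, unlinked): rows that received a name and rows that did not.
    """
    by_key = {(c["household_id"], c["person_number"]): c for c in captured}
    linked, unlinked = [], []
    for row in microdata:
        hit = by_key.get((row["household_id"], row["person_number"]))
        if hit is None:
            unlinked.append(row)
        else:
            linked.append(
                {**row, "first_name": hit["first_name"], "last_name": hit["last_name"]}
            )
    return linked, unlinked


def field_accuracy(truth, ocr, fields):
    """Share of hand-keyed records whose OCR value matches exactly, per field.

    truth and ocr both map a form-image id to a dict of field values.
    """
    if not truth:
        return {f: 0.0 for f in fields}
    result = {}
    for f in fields:
        correct = sum(
            1 for image_id, t in truth.items() if ocr.get(image_id, {}).get(f) == t[f]
        )
        result[f] = correct / len(truth)
    return result
```

Because the join needs the household ID to be read correctly before any name can be attached, OCR accuracy on that one field bounds how many names can be recovered, which is why it is worth scoring each field separately.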
I mean, I think we got like 6%, and we were like, screw it, and we stopped, right? This is the thing that worked. These we gave to some pretty high-tech companies, and they got household ID 85% of the time, and first name and last name close to 70%. So we think this is pretty good. In the end something else happened that makes this extra good: we found a digitized list of addresses for every house in 1990. So we have street addresses, rural routes, apartment numbers, and we think we can actually use this without any names to identify people. So Trent and I, with some other people, have a project right now where we're not even using the names at all, and we're identifying people using our administrative holdings, using the tax data from 1989 to find people. So this is really exciting for us. We're really hopeful that in the next couple of years we'll be able to share it with people. Again, it probably won't be where we are with 2000 and 2010 percentage-wise, but people are really excited about this, right? We have another time point.
So that's sort of our big infrastructure stuff. I'll talk about our next round and future plans later, because that's what we've done so far. But now, how do people get this data? We have all this data, and we've talked about how the Bureau can use it. Can researchers use it? Well, researchers have been using linked census data for a while. We actually have linked employer-employee data coming from the states, the 50 states: employment files linked to business data. The Census Bureau in the U.S. actually does a ton of business stuff, not just demographic, and those have been available to researchers for a long time. Also, it's pretty common to use linked data across surveys. Things get a little funnier with the census in the U.S. because it's mandatory to fill out. Which it probably is here too, right? Our surveys are optional.
So people effectively opt in to being linked; they even have to uncheck a box if they don't want to be linked across time. That's a really easy thing. With mandatory surveys and censuses it's much more serious; they're very serious about it. Even though this is completely protected, the names are stripped off, the addresses are gone, you need a lot of permission to use them, even within the Census Bureau. Obviously there are some technical issues, in that we don't have all of them. But even the ones that we have have never been available to external researchers at all. You had to be a Census Bureau employee to use them.
However, then along came the Census Longitudinal Infrastructure Project. Trent was on the team that created this, and what it did is it worked on linking the mandatory-response census and survey data, just like we did with 1940. It integrated this with extra data, the list that Trent showed you, using it to create this panel, and it's supporting research projects through Census Bureau and academic collaborations, so, to do research using it. So this is why it's good. Yay, this is great.
We had to sell this as good to the Bureau. Actually, all research that goes on with the restricted data, and I'll talk about that, has to benefit the Census Bureau. People write like 10-page statements about how their research is going to benefit the US Census Bureau. And basically what we're saying as Census Bureau employees is that we would do this research, but instead these academic collaborators are going to do it for us. So it's a big ask. This is how we sold it to the Census Bureau: to help them and help with linking. Census approved it in 2014. Right away there's a governance board that runs it. Our first projects were approved in early 2015, and there were publications out by 2016.
So people are clamoring to get this data in the US. I get probably, I don't know, an email every other week asking for it. So this is the core. You saw that: 2010, 2001 through present, and 1940. This is sort of our core of CLIP, the Census Longitudinal Infrastructure Project. But as Trent mentioned, what else happens is that we can link in other Census surveys, so other surveys that we do at the Bureau, and we can also link in our administrative records, like Medicare and Medicaid, and other data sets as well. We even have long-running panels run by universities. Anywhere where there's PII, we can bring it into the Bureau and link it in. So if you have something small, different schools, things like that. Also, we're hoping and planning to link in that 1850 through 1930 data. Minnesota recently got a grant to work on that project to link across; Trent and I are involved with that as well. So this is sort of our big full CLIP infrastructure that we're building.
We have nine collaborative research projects, and they're working out of seven Federal Statistical Research Data Centers. I'll talk about what that is; that's how people are accessing this data. These are the projects. You can see in them a lot of mobility, migration, immigration, so, as we talked about, immigrant intergenerational incorporation, gentrification, stuff we've actually heard about here, right? Things that you guys are doing with the data here as well, with the samples; this is what's going on with the full data. We have some different policy impacts. Loads of people want to do stuff, all different, across fields, across all sorts of social sciences.
Here was our first NBER working paper on intergenerational mobility, which came out in 2016, and it got covered in the news. And then Trent and team just had a Demography paper published on second-generation outcomes of movers in the Great Migration, which was covered in the news too. So this is really exciting for us. Part of this pilot project was to show that this was feasible, that good research is coming out of it, that researchers are really responsible and can handle this data, and that the Bureau can handle it. That's a big part of what we're doing as well.
So, how people access this data and other restricted data. It's slightly different from what's going on here, but similar as well. A Federal Statistical Research Data Center is probably like what you have here. It's a computer lab, there's no internet in there, and there are computers where you can work on the data, in a pretty difficult Unix environment. But they are US Census Bureau facilities hosted by research institutions, so hosted at universities. Universities pay a big part of the cost, but the Census Bureau directs all the work in there, and there's a Census Bureau employee at each of them running it. Here's where they all are around our country, which is kind of big. I'm the middle one; I run the Boulder RDC and work in the middle of America, but there are lots on the coasts, you can see. I used to have people out there who would travel all the time: fly to California, fly to DC, drive to Kansas City. So people go to these places and work with the data there. We do not have an option like you have; it's very encouraging and interesting to hear that you can send in code later. My people have to go and do all of their work by themselves. We don't do anything as on-site employees; we don't get their data ready for them. You just get access to the big data files that you have approval to use, and you have to run wild with it on your own. So it's a
little bit different model than here. And every project has to go through a formal approval process. As I mentioned, people write a big statement about how it benefits the Census Bureau. You also have a big research proposal, and the Census Bureau and any other agencies whose data you're using, so the IRS or whoever, have to approve your project. So it takes people like six months to a year to get access, and part of that time is also going through a thorough background investigation. You mentioned you had to do like three trainings; I tell people to plan on like a half day to a day. There's training, you have to get fingerprinted, there's a background check, so you're basically becoming an unpaid federal employee. So it's a costly investment; the Census Bureau is making an investment. This also means we want you to have a big project. Most people do their research out of there for at least five years, and they get a bunch of papers out. There's some pretty high-level work that comes out, and the papers usually place really well. It's good stuff.
All results are formally reviewed, just like they are here. It sounds like there are people here on the disclosure review teams; we do that too. So everything goes through disclosure. There's no intermediate output; you're not allowed to do output analysis outside, so it all has to be in the form of final tables.
So what are we doing now, what are we getting excited about? Well, these pilots have been awesome. People are excited and want to do that; I mean, you guys know that this is cool. So we're trying to get them formalized so that anyone can do it and we can expand access. We're also improving the 1940 data. Trent and I have a project recovering some historical tax records in the middle of this period, from 1969 to 1989, so that's pretty exciting. We would like to recover the 1960 to 1980 decennial data as well, so we're in talks with foundations to get money to go in and scan those images
again, get names, whatever we can, to link those to the data we already have, and, as I mentioned, link in the historical data going back to 1850. 1850 was our first individual-level microdata in the US. Before that the census was just household-level, with counts of people in the household. So that's our first individual-level one. I think for you guys it's '71? '41? '41.
Then data delivery improvements: public metadata, improved documentation, helpful data processing tools. We're trying to make it easier for people to get access, and then, once they get access, the data is pretty hard to use; it's really difficult in our computing environment. Since I've been sitting in this room, I already leaned over and I was like, wait a second, once we have all the data, a public file of linked data that's anonymous? It's been kind of inspiring. That wasn't in the back of my mind. It's a little further out, but that's there. So that was like a big one.
You said the projects have to benefit the Census Bureau. One of the conditions we have here gives reasons for doing research, and it says you can challenge official statistics. So you could legitimately do research with the LS to demonstrate problems in the LS. Would challenging the Census Bureau, in order to improve your statistics in the long run, count as benefiting?
Kind of. So we have 13 criteria. We give them to you; we say we have these 13 things and we'd like you to pick some. One of them is improving our data, improving our weighting methods, helping with, yes, if something is wrong. There's a whole project that's telling us how we do our geography incorrectly; really their whole project is about how they don't think our census geographies represent things well. So yes, I think that is in there. But we do have a list, and one of them is giving information about the population. There are some lines
that are pretty easy to get, and then there are some other things that are a little harder. And that's part of the role of these FSRDCs and the people working in them: to support researchers doing it and help them write these statements.
Yeah, I'm not sure which one you're referring to, but sure. Well, so the map that I had, where yellow was bad and purple was good, that was the result of a project like you're describing. It took the 2010 census and actually compared it to our admin data. In those yellow areas, where we had a bunch of people in the census who weren't in the admin data, we have no way to represent them other than to conduct a census. Now, we can work harder to get more admin data, so some states say yes; a lot of it is at the state level. So we could probably do better, and you might be in a position to do a lot better than we are if you have a better system of administrative records. But right now in the United States we need a census.
Do you find it also to be the other way around, that you've got people in the admin data who are not in the census?
There may be a little bit of that. It's a great question. I just watched the IRS give a talk, and they didn't find a lot of people paying taxes who weren't in the census.
I think there's another thing going on that didn't come up there that could matter for you too, which is that right now our census is a short form, so it asks very few questions. That's one thing to replicate. It's another thing to get all the questions in the long form, or the ACS, from administrative sources; not all of them, but a lot. I think that actually with the Nordic countries that do all admin records, when you talk to their researchers, it's sort of amazing, but then sometimes they hate it. I do some research on contraception and birth and fertility, and you can see in their stuff who's going to get
contraception, if they get it from certain places; then they have the records. But you're not getting fertility. They're frustrated; they wanted some survey questions, and I can see that. I think it could be a loss for research if we get away from all of our surveys. It sounds like they're facing that up there right now. There are some really great things about it, and then there are some really good questions.
Is part of that because of the cost? Are they working on getting rid of it?
It's mandated; they have to do it. It's in our Constitution. It's a pretty big deal. Could they switch to sampling? They can't. I think they're counting everyone; they're not getting away from it. There are some cost issues. That's another thing, the long form: one in six of the population got the long form, which was 28 pages of detailed questions. We did away with that in 2010; it's just 10 questions to everyone. So that's part of what we were referring to. But ending the decennial census, I don't think it's in the cards in the short term.
I guess in the UK we have a sort of feasible spine from the National Health Service, so nearly everyone is on the National Health Service register; if you were born here, you were born in the National Health Service. There are not that many people, I think, who aren't on it.
I don't know about that, which is maybe why, I mean, you guys have really high linkage rates, it looks like. We don't know any of that, those of us who do the linkage now. It just looked from your numbers like, whoa, you guys have a lot of people.
Yeah. So the plan for 2031 is that ONS will use as much as they can from the admin data, but they'll have a separate survey to sort of calibrate for all the other stuff, and that survey might be, say, some percent; I'm not sure, I can't give the number. But just in the same way that the census amount of
three didn't make that, certainly amount of three, and then all the stuff that comes from the admin scales, that was feeding to the sort of people the census didn't. That's the current plan, but who knows.
Any more? One more quick question.
I have just one quick question: as a UK researcher, what's the best way?
So the linked data: right now we're not taking any more pilots. If you're in the US, we say wherever, it doesn't matter right now, until we have it formalized. In terms of getting the restricted data in general, you can be from anywhere, but you have to be working in an RDC in the US. So if you're associated with one of those universities, or have a research partner there, you can. I have US citizens who live over here who are still on projects, right, and come back and have people there. I have foreign nationals who live in the US at our university and go in and use the data too. So I think if you really want to do this, find a partner.
Yes, it's actually three. It's not consecutive either; it's three years in the last five. So that is a rule, but it's a rule for you to go in and access the data. So if you really wanted to do something, you can definitely be on a research team with someone else who goes in and works with the data. They do the data work, you can talk to them about it without talking about numbers, they can do all the analysis and bring it out, and you can write a paper on it and be involved in that way too. I have a couple of people like that: they're on research projects and they live over here, but they're still on the papers. The Bureau isn't telling people who can be on the papers. But they have a partner over there who's doing the data work.
Is there a list of eligible partners?
No, and we're not a dating service. Well, if you google "Census Bureau researchers" or something, there is a website that lists everyone
who signed up. They email all staff at Census and say, hey, if you're a researcher, sign up for this thing, and it says your background, your interests and what you do. That's the closest thing you can do.
Yeah, like a Census Bureau one. Because what we say is, if people want to be on base, they're not in the UK, or they're not in the public sector, or when I say the public sector, and actually we'll try, and we might do this once, and, you know, that has got a grant, right, to me, because it's the idea of me acting as staff, and it's thinking, it's thought, because really I like it