I'm going to talk today about linked data — data linkage. Linked data is something that is almost universally relevant to research. It crops up in lots of fields: service provision, computer science, data science, health research, social research and all sorts of areas, and there's a lot of different terminology around it. Now, I come from a background of health research — I'm an epidemiologist primarily, though I also had some training in economics — so the statistical terminology in particular can vary a lot across disciplines. I'll try to highlight where that might be the case, but if you've got any questions or you don't understand the terms I'm using, please feel free to interrupt me or put your hand up, and don't be afraid. It's better that we're all on the same page, and if you don't understand something I've said, chances are somebody else in the room doesn't either.

There are going to be three parts to today's talk. In the first part, I'm going to talk about some practical aspects: what is linked data, how can you access it, and what can you use it for. In the second two parts, I'm going to talk about methodological problems that you might run into. Firstly, problems generic to the use of administrative data — because when we as researchers talk about linked data, we're often talking about secondary data, that is, data that haven't been collected for research purposes — so I'll talk about problems that are generic to the use of secondary data for research. And then I'm going to talk about linkage error, which is what goes wrong when data are linked.

So, first part: practical aspects. We're going to talk about what data linkage is, what you can use linked data for, how records are linked — how data linkage actually operates — and how you can access it. I'm going to run through this section fairly quickly and focus more on the methodological topics later on, because for the large part this is stuff that varies a lot depending on what sort of data you want to use, and you'll have to navigate these minefields for yourselves.

The OECD definition of data linkage is a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event or an entity that are not available in any single record. The classic example is when you've got two different files that you want to merge together: the same people appear in both files, but you've got different information about them in each. You might have information about their hospital attendances and admissions in one file, and you might want to link it to information from a register of deaths, to work out how many of the people admitted to hospital died afterwards, or how long it took them to die. But data linkage can also happen within a single file, when data are not recorded in the unit of analysis that you want to use. Say you're interested in people, and you have a file of hospital admissions — hospital data are recorded at the level of admissions — then linking those admissions together, so that you can tell that these two admissions relate to the same person, is itself data linkage, even though it's only one file that you're working with.

So, what is linkage used for? Obviously, to merge information — that's a bit of a circular definition. It can be used to evaluate data quality, by triangulating against external reference data.
It's also used in service provision and core business activities: if services need to be coordinated or merged, or companies need to be merged, or lists of clients need to be merged, these are data linkage problems. But what we're primarily interested in — well, what I'm primarily interested in, anyway, as a researcher — is what new research questions you can address by bringing information from different places together into the same place.

Some examples of research questions. By linking information on flight arrivals to hospital data, researchers were able to demonstrate the increased risk of venous thromboembolism — blood clots in your veins — associated with regular long-haul flights. That research used administrative passenger information from flights, linked to patient information from hospitals. Another example: researchers were able to look at the increased risk of mortality experienced by patients treated for drug use problems in the period shortly after discharge from hospital. Patients who might have had a heavy drug problem are off the drugs while they're in hospital, then are discharged back into the community, find themselves near a supply of drugs again, and are at increased risk of overdose because of the tolerance they lost during their time in hospital. That research was conducted by linking together data from drug treatment registrations with death registers and hospital data.

So, how are data linked? Very quickly and broadly, there are two different approaches to linking records. Deterministic approaches are based on rules of agreement: if these records agree on name, and agree on date of birth, and agree on address, then we'll treat them as the same person. That's a rule based on a pattern of agreement over a set of matching variables — name, address, date of birth, that kind of thing. In probabilistic linkage, every possible pattern of agreement is assigned a score that can be translated into an estimate of the likelihood that a pair exhibiting that pattern of agreement across the matching variables is, or is not, a match.

Depending on how these are implemented, they can produce very different results, or they can produce exactly the same results. There are a lot of opportunities to tweak and modify each of these procedures — in how you define the rules, and in how you set up the parameters for the probabilistic linkage — and they can be perfectly equivalent procedures or they can be very different. Generally, probabilistic linkage is more flexible because it accommodates larger numbers of matching variables much more easily. Imagine having to work out which rules to use when you've got ten different matching variables: there are two to the ten — 1,024 — different possible patterns of agreement that could appear in your data, so it's very difficult to work out which rules of agreement you should incorporate into your linkage. Probabilistic linkage makes it much easier to accommodate large numbers of matching variables — I'm currently doing a linkage with 23 matching variables — and it's easier to accommodate distance measures of partial agreement.
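To make that distinction a bit more concrete, here is a minimal sketch of what the two approaches might look like for a single candidate pair of records. Everything in it — the field names, the m- and u-probabilities, the threshold — is made up for illustration rather than taken from any real linkage; the partial-agreement and frequency ideas that come up next are hiding in the similarity score and the u-probabilities (a rarer value has a smaller u, so agreement on it earns a bigger weight).

```python
# Sketch of deterministic vs probabilistic linkage for one candidate record pair.
# Records, m/u probabilities and threshold are invented for illustration only.
from difflib import SequenceMatcher
from math import log2

rec_a = {"surname": "Deutsch", "forename": "Jon",  "dob": "1980-03-12", "postcode": "LS2 9JT"}
rec_b = {"surname": "Deutsch", "forename": "John", "dob": "1980-03-12", "postcode": "LS6 1AN"}

# --- Deterministic: a fixed rule over exact agreement ----------------------
def deterministic_match(a, b):
    # "Treat as the same person if surname, forename and date of birth all agree."
    return all(a[f] == b[f] for f in ("surname", "forename", "dob"))

# --- Probabilistic (Fellegi-Sunter flavour): score each field --------------
# m = P(field agrees | records are a true match)
# u = P(field agrees | records are not a match)
params = {
    "surname":  {"m": 0.95, "u": 0.01},
    "forename": {"m": 0.90, "u": 0.05},
    "dob":      {"m": 0.98, "u": 0.001},
    "postcode": {"m": 0.80, "u": 0.02},
}

def similarity(x, y):
    # Continuous "partial agreement" in [0, 1]; e.g. Jon vs John is about 0.86
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def match_weight(a, b):
    total = 0.0
    for field, p in params.items():
        agree_weight = log2(p["m"] / p["u"])                 # reward for agreement
        disagree_weight = log2((1 - p["m"]) / (1 - p["u"]))  # penalty for disagreement
        s = similarity(a[field], b[field])
        # Interpolate between the full-agreement and full-disagreement weights
        total += s * agree_weight + (1 - s) * disagree_weight
    return total

THRESHOLD = 8.0  # invented cut-off: pairs scoring above it are classified as links

if __name__ == "__main__":
    print("deterministic:", deterministic_match(rec_a, rec_b))  # False: forenames differ
    w = match_weight(rec_a, rec_b)
    print(f"probabilistic weight: {w:.1f} ->", "link" if w > THRESHOLD else "non-link")
```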
So, you know, Jon without an H and John with an H — we might say, okay, these are pretty similar. They're not a perfect match, but they're pretty similar, and maybe that's something like 75% agreement. You can have these continuous measures of partial agreement, and you can factor them into probabilistic linkage much more easily. You can also accommodate frequency-based weights: agreement on Deutsch, which is quite a rare name, is much more likely to mean that a record pair is a match than agreement on Smith. These are things that are difficult to incorporate into deterministic, rule-based approaches. So, generally, probabilistic linkage is more flexible and can be more sensitive — it can detect more of the matches than deterministic linkage — whereas deterministic linkage is generally simpler, easier to implement and easier to interpret.

So, how do you access linked data? The basic questions you need to work through are: who owns the data? You've probably got at least two different sources of data that you want to join together, and you need to work out who the data providers for those sources are. They may operate within different legal constraints and have different administrative processes. A very relevant question is: have the data already been linked? Some data are linked routinely; otherwise you're asking for the data to be linked specifically for your research project. Where are you going to store the data during analysis? The requirements can differ between data providers. Some might expect you to run your analysis on their systems — they won't let the data leave their own safe havens — some will be happy to hand the data over to you, and sometimes you might be able to use a third-party safe haven like the ones provided by the Office for National Statistics or the ADRN.

So, as I mentioned a moment ago: are the data already linked? There are two models for data linkage. One is project-by-project, ad hoc linkage, which is often what's done in England. But various other places around the world — including Wales, Scotland, Australia and Canada — have established systems for routine, ongoing linkage of administrative data, where high-value data sources, like hospitalisation data, social data or education data, are routinely linked every year in order to make them available for an unspecified range of future research projects. One of the big advantages of routine ongoing linkage, apart from the efficiencies and economies of scale, is that the more data that are linked together, the better the linkage becomes: you're more likely to know that somebody changed their name when they got married, or moved address, and so you can pick up those changes that occur in the data. If you're only linking two files together, you've got much less information about the different names that people have used and the different addresses they've lived at. But there isn't much routine linkage happening in England, unfortunately. Some examples are the Oxford record linkage system and the Clinical Practice Research Datalink (CPRD), and Hospital Episode Statistics are routinely linked to ONS mortality data. These are all pretty much within health, not across disciplines, unfortunately, and they're also often geographically restricted to certain areas like Oxford or Bradford.
Now, in terms of who's going to do the linkage: sometimes a data provider might do this for you. It's very rare that you'll be doing any linkage yourself as a researcher, because it generally requires access to the identifiable data — names, addresses, dates of birth — which providers are much less willing to share, and which you're not actually interested in as a researcher anyway. You're generally only interested in the substantive variables, the clinical variables, the payload data: you don't care who somebody is, only what happens to them or what they do.

A best-practice model for data linkage is known as the trusted third party model. In this model, two data providers each send a file of just their identifiers — the names, addresses and any other variables that are going to be used for matching — to a trusted third party, or data linkage unit, which performs the linkage on those variables without ever seeing any of the sensitive clinical or payload data. The linkage unit creates a pseudo-ID — a random number assigned to each person — and sends it back to the data custodians, who then have data that are essentially linked already, because the same study ID or pseudo-ID appears in both of their data files, but they don't have access to each other's data. They can now send researchers a file that contains the clinical or service data the researchers are interested in, plus that random study ID, with all of the identifying information — names and addresses — stripped away. So the linkage unit or trusted third party only handles the identifiable data and no sensitive clinical data; the researchers only handle the sensitive clinical data and not the identifiers, which are sensitive in their own right; and there's no unnecessary disclosure of information between the data providers either. This is known as the trusted third party model of data linkage, and it's generally considered best practice.

So what are some of the practical things you're going to have to consider in applying for administrative data? There's a whole series of approvals that are likely to be required. Approvals from the data providers, and there can be multiple stakeholders within each provider that you usually need to get through: the Caldicott Guardians or committees who oversee and approve the release of data for research, and who might need to separately approve the release of identifiable data for linkage; the data processing team — the people who are actually going to process the data and give it to you — who have to be on board and understand what you want them to do with the data and how you want it to come out of the provider; and the data linkers, who will also need to be on board. There may be ethics committees, and different ethics committees may be required for different data providers. If you're accessing health data without consent, then it will need to go to something called CAG, the Confidentiality Advisory Group; your own institution is likely to have data protection officers; et cetera, et cetera. The timescales for navigating all of these approvals can be anything — in my experience, I've certainly never had anything close to three months.
I'm currently working on an English project that is pushing three years now, and an Australian project where we got the data after six years of negotiating and renegotiating with government departments, with a long list of delays and changes along the way. Across a period of three or six years, the administrative processes can change; the administrative personnel will certainly change if it's in the public sector; data security requirements can change; legislation can change; agreements between the data providers can change; the scope of your data request may well evolve in that time; and the approvals that you acquired early in the process may need to be renewed, or may expire, before you get to the end of it. It's worth acknowledging, I think, if you're talking about accessing public administrative data, that the average public servant really doesn't stand to gain anything at all by letting you have their data, and they have the world to lose if it blows up in their face in some media fiasco about inappropriate use of, or access to, government data — and the media love to make a fiasco about anything related to data breaches, confidentiality and privacy. Their career may go down the toilet, and so there are some perverse incentives built into the system, I think, that encourage these processes to be strung out and scrutinised. It's right that they should be scrutinised, but it's not right that they should take six years. Costs can vary wildly too, anywhere from nothing at all up to very large sums — I know of one routinely linked health dataset where a full extract costs $200,000 per year to access.

On the note of these costs and timescales: if you're doing a PhD and you want to incorporate linked data into a doctoral programme, and you're talking about accessing new linked data — data that you or your supervisors don't have yet — that's fine, that's great, it's ambitious, but have a plan B. That's what I did for my PhD, and now I've got a project that is up and running and that I can keep working on in my post-doctoral years, but I couldn't use it in my PhD.

So, on to the interesting stuff: methodological challenges with administrative data. Often when we're linking data, one of the datasets might be a primary dataset — data that have been specifically collected for research. We might be working with a cohort study or a trial, and sometimes these are supplemented or augmented through data linkage to administrative or registry datasets. So there's a spectrum of data. You've got primary data at one end, which have been collected specifically for research; you've got administrative data at the other end, which are a by-product of running a service or a business; and registry data fall somewhere in the middle — registers of deaths, registers of births, cancer registries, things like that. These are sort of semi-administrative data, but they are collected for research — just not necessarily for your research project specifically. They're being collected for a non-specific range of research and monitoring purposes.
So we're going to talk about secondary data here, particularly administrative data that haven't been collected for research: understanding the recording and quality of that administrative data, understanding the population coverage of the dataset, and highlighting how important it is to understand what actually causes the data to be recorded — because it's not your study.

Administrative data — I think we've covered this — are routinely collected data, and their collection is often related to financial management. Hospital Episode Statistics, for example (if you're working in health data you should know what those are), are all about hospitals being reimbursed for the services they provide. They're not about surveilling the population of patients for research or any other purpose; they're about getting paid, and they're recorded and coded with really that one purpose in mind. That affects what's recorded and how it's recorded. Administrative data can also be collected for clinical management and audit, for registration services (which we've covered), sometimes for service evaluation and delivery, and so on.

So, some questions to ask any time you're using administrative data, which will help you understand the quality issues. Why were the data collected? Which data had to be collected for that purpose, and which data didn't have to be collected but were collected anyway? These can differ quite a lot in quality. Which relevant data were not collected because they weren't required? What unit were the data recorded in? Were they recorded as people? Not usually, when you're talking about services or administrative data — usually it's some sort of event, like a hospital admission. If that's not your intended unit of analysis, how were the data internally linked, or how are you going to link them? A lot of work is done with single datasets like Hospital Episode Statistics without the analyst or researcher really appreciating that these are data that have already been linked — because they weren't recorded at the level of the person, they were recorded at the level of the admission — and that linkage might have errors; it's certainly not going to be perfect. Failure to consider and account for those errors in your analysis can mess up your results, or the conclusions you draw from them.

How were the data recorded? Who recorded them — was it the person who provided the service, or a coder working from clinical notes? Were they recorded in free-text fields? Even if people only had a certain set of responses they were meant to enter, there's a big difference between whether they're allowed to enter that set as free text or whether they have to choose it from a drop-down box. Even things like this affect the quality of the data that are recorded, because drop-down boxes are, of course, less subject to mistakes — perhaps. Were there any validation or quality-assurance processes applied after the data had been entered into the database? And have the recording practices changed over time, or have the validation and quality-assurance processes changed over time? The answer, for most administrative datasets, is almost certainly yes, and you've got to be really careful, if you're doing any sort of analysis over time, that you're comparing the same thing — that your analysis of differences over time isn't just reflecting differences in how the data were recorded.
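One simple way to check that last point — whether recording has changed over time — is just to look at how complete a field is by year (and, if you like, by provider). A minimal sketch, assuming a hypothetical admission-level table; the column names and values are illustrative only, not from any real dataset:

```python
# Sketch: has recording of a field changed over time?
# Assumes a hypothetical admission-level DataFrame with columns
# 'admission_date' and 'weight_kg'.
import pandas as pd

admissions = pd.DataFrame({
    "admission_date": pd.to_datetime(
        ["2010-02-01", "2010-07-15", "2011-03-09", "2011-11-30", "2012-05-21", "2012-08-02"]
    ),
    "weight_kg": [72.0, None, None, 80.5, 66.2, 91.0],
})

completeness_by_year = (
    admissions
    .assign(year=admissions["admission_date"].dt.year)
    .groupby("year")["weight_kg"]
    .agg(recorded=lambda s: s.notna().mean(), n="size")
)
print(completeness_by_year)
# A jump or drop in 'recorded' between years is a warning that an apparent
# change over time may just reflect a change in recording practice.
# Grouping by provider as well (groupby(["year", "provider"])) shows
# between-provider differences in recording.
```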
So, are there differences in recording practices, or in the quality of recording, between the people who contribute to a dataset — between providers, between hospitals, between general practices, and so on? What actually triggers data to be recorded at all? And in particular, is the recording of data — or the likelihood that data were recorded — related to any variables that you're interested in as a researcher? If you're working with general practice data from something like the CPRD, which I mentioned before, are people who visit their GP more likely to have their weight measured if they're overweight, or if they have diabetes? And how is that sort of selection bias in the data going to influence your analysis?

What is the coverage of your dataset in terms of the target population for your analysis — we'll talk more about coverage in a moment — and how has that coverage changed over time? Coverage might be geographic, but it can also vary along other dimensions. For somebody to be included and reflected in most administrative datasets, they have to have had access to the service, which can be limited by geography, by language, by disability, or by their lifestyle or work pattern — if they spend half their time working overseas or remotely, that might affect how often, or how likely, they are to be reflected in the dataset. And they have to have actually utilised the service. Whether or not they actually utilise it can be affected by, well, supply-side economics — how much of the service is actually available to be used — by whether there are competing services, other options for them to use, by local variation in the quality of the service or the perceived quality of the service, and by cultural factors. All of these things affect whether someone actually utilises the service and is therefore reflected in the data.
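One rough way to put a number on coverage is to compare counts in your dataset against external population estimates for the same groups. A minimal sketch — the figures and age bands are entirely invented; in practice the external estimates might come from something like ONS mid-year population estimates:

```python
# Sketch: rough coverage of a dataset against external population estimates.
# All numbers are made up for illustration.
people_in_dataset = {"0-17": 41_000, "18-64": 152_000, "65+": 38_000}
population_estimate = {"0-17": 118_000, "18-64": 310_000, "65+": 47_000}

for band in people_in_dataset:
    coverage = people_in_dataset[band] / population_estimate[band]
    print(f"{band:>6}: {coverage:.0%} of the estimated population appears in the data")
# Very uneven coverage across groups is a hint that 'who gets into the data'
# is related to characteristics you may care about in the analysis.
```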
Often a service is provided on an area basis, and people come and go from areas: people die and are born, people immigrate and emigrate — locally, regionally and internationally — and these events are often not captured in the administrative data. So you're often not sure what your population denominators are: you might know how many people used the service, but you don't know how many people didn't use it. And variables might not always be observed: if somebody has emigrated, or was away, they might have died or had services elsewhere, but you treat them as if they didn't die or didn't have services elsewhere.

All of these things can be translated into information bias or selection bias. Those are epidemiological terms — economists and some other social scientists might have a different idea about what I mean when I say selection bias. Economists often use "selection bias" to refer to what epidemiologists would call confounding, which is differences in the characteristics of different groups. When I talk about selection bias, I'm talking about differences in the probability of being included in your data. Information bias is misclassification: you treat someone as if they're alive when in fact they're dead — a categorical variable that has been incorrectly measured. Measurement error is the same thing but with continuous variables: how long somebody lived for is a continuous variable, and if you haven't picked up that somebody has died, then in any sort of survival or time-based analysis you might overestimate how long they lived — that would be measurement error. Once you understand these problems in terms of information bias and selection bias, you can adjust for them, and I highly recommend the texts by Lash and others — there's a textbook and there's a journal article — which are a good place to start on correcting for information bias and selection bias.

So now, the last part: linkage error. This is what's specific to linked data, and we're going to talk about how to understand it, how to assess it and how to address it. These are all things that are currently done really poorly in most research that uses linked data — partly because we haven't been working with linked data on a big scale until quite recently, and partly because it's just really complex and difficult to understand, so I think people for the most part ignore it, and for a large part don't even realise the linkage that has gone on behind the scenes when they're analysing things like Hospital Episode Statistics.

So, linkage error. There are two types of errors that can occur in linkage: missed links between records that belong to the same person or entity, and false links between records that belong to different people. Missed links are primarily caused by errors and missing data in the matching variables, or by variation in the values of those matching variables over time. So women generally have higher rates of missed links — why do you think that is?
Right — they change their name when they get married. False links occur primarily because of a lack of uniqueness, or discriminatory power, in your matching variables: people with very common names are more likely to be subject to false links, and certain ethnic groups that have much less variation in their surnames, for example, generally have higher rates of false links in their data.

If you've done any work in epidemiology, you might be familiar with sensitivity and specificity from screening tests. The same two-by-two table can be applied to linkage error, and the same statistics — sensitivity and specificity — can be derived from it. It's just a two-by-two table of link status (whether the records were linked) against match status (whether they truly relate to the same person or to different people). If you can populate a table like that and get estimates of the sensitivity and specificity of the linkage, then you can correct for linkage error.

I said linkage error is complex, and that's because it can corrupt your data in a lot of different ways. It can cause misclassification: you treat someone as alive when in fact they're dead, because you missed a link to the death register. It can cause measurement error: you overestimate how long someone lived, because you missed a link to the death register. It can cause missing data: if you're linking to a birth register, you expect everybody to have a birth record, and you're linking to get some additional information about their early-life circumstances, but for some people you can't find the link — you know there was meant to be one, you couldn't find it, so you don't have the data for those people. It can cause selection bias: if you exclude everyone with missing data, you've got selection bias; and if the criteria you use to define your study sample — the people you want to analyse — are misclassified or measured with error because of linkage error, that can cause selection bias too. If you've got missing data, you've got less data, and you've generally got less statistical power; and if you've got lots of measurement error and misclassification creating noise in your data, you've also got less statistical power — wider confidence intervals, less precision in your estimates.

And lastly, linkage error can cause something really horrible that is a combination of most of these things, which is the phenomenon of splitting people into multiple parts and merging people together. A missed link might mean that instead of one person who was admitted to hospital twice, your data treat them as two people who were each admitted to hospital once. In effect that person then has a 200% chance of being included in your data, and they're misclassified in both of those records. This is a really horrible form of corruption that involves both selection bias and misclassification or measurement error, and it's really specific to data linkage. And of course the inverse happens as well, where you merge different people together into single units.

So, some useful questions to ask to help you understand the impacts of linkage error on your analysis. Is the presence or absence of a link meaningfully interpreted? For example, when linking to a death register, we interpret the presence of a link as meaning that the person has died, and we interpret the absence of a link as meaning that the person hasn't died.
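Staying with that death-register example for a moment: if you can fill in that two-by-two table, even roughly, you can use it to correct a linkage-derived estimate. A toy illustration with invented numbers — not the speaker's method, just a standard misclassification correction applied to a proportion:

```python
# Toy illustration: correcting a linkage-derived proportion for linkage error.
# All numbers are invented. 'Sensitivity' here is the proportion of true
# matches that were linked; 'specificity' is the proportion of true
# non-matches that were correctly not linked.
def corrected_proportion(p_observed, sensitivity, specificity):
    # Standard misclassification correction for a binary outcome
    return (p_observed + specificity - 1) / (sensitivity + specificity - 1)

p_obs = 0.08            # 8% of the cohort linked to a death record
sens, spec = 0.90, 0.999

p_true = corrected_proportion(p_obs, sens, spec)
print(f"observed: {p_obs:.1%}, corrected: {p_true:.1%}")
# With these numbers the missed links outweigh the (rare) false links,
# so the corrected proportion is a little higher than the observed one.
```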
So, if you are interpreting the presence or absence of links like that, you need to understand what the implications of linkage error will be for your analysis. And if not — if you're generally expecting there to be a link for everybody — then how are you going to handle the missed links? There are different ways to handle missing data, and some of them are better than others. Does inclusion in your analysis — which is what I'm talking about when I say selection — depend on successful linkage? If linkage is not successful, are people going to be erroneously excluded from, or erroneously included in, your analysis? Is there possible splitting and merging? If there are multiple possible records per person in one file or both files, then there's always going to be possible splitting and merging, and if you're connecting more than two files, then there's always possible splitting and merging too. And, really importantly, is linkage error likely to be related to the variables that you're interested in? Something we know about information bias and selection bias is that they're much worse when the misclassification, or the probability of selection, is associated with the variables you're analysing. If they're not associated with the variables of interest, then there might be less of a problem — there might not be a problem at all in terms of selection bias, and there might be less of a problem in terms of information bias. And there are some ways that we can test this.

So, five ways that you might be able to assess linkage quality. Sometimes we can look at a gold-standard subset of our linked data: a subgroup for whom we know the link status, or are confident in it for some reason — maybe they had a high-quality, well-recorded unique identifier like NHS number that we were able to link on, but we didn't have that information for everyone else. If that subgroup is representative of the whole group, then you can work out all the values in your two-by-two table and adjust for linkage error. But gold standards are like gold: they're rare. Something that is much more commonly possible is comparison with external reference statistics. These might be statistics generated from another source, or from the entirety of one of the datasets, that you're confident in and that your analysis sample should be consistent with — that should represent the same population — and you can look at differences in the characteristics of your sample compared with those external reference statistics. Sometimes you can conduct procedural sensitivity analysis: using different decision rules in deterministic linkage, or different thresholds in probabilistic linkage. I didn't have time to talk about thresholds in detail, but they are the estimated probabilities or match scores above which you choose to classify pairs as links and below which you classify them as non-links; you can change that threshold, just as you can change the decision rules, and if the data linker provides you with those probabilities or match scores, or the different rules that were used, then you might be able to do this yourself. Comparisons of linked and unlinked records aren't going to be any help at all if you're talking about linkage to a death register, but if you're talking about linkage to a birth register, then you can compare the characteristics of the people for whom you could find a link to the birth register with the characteristics of the people for whom you couldn't, and that will tell you about the distribution of missed links with respect to your variables of interest — hopefully, if they're in your primary data file anyway.
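A sketch of that linked-versus-unlinked comparison, assuming a hypothetical cohort table with an indicator for whether a birth-register link was found; the variable names and values are illustrative only:

```python
# Sketch: do the people we failed to link look different from the people we linked?
# Hypothetical cohort-level DataFrame; 'linked_to_birth_register' marks whether
# a birth record was found, and the other columns are variables of interest.
import pandas as pd

cohort = pd.DataFrame({
    "linked_to_birth_register": [1, 1, 1, 0, 1, 0, 1, 1, 0, 1],
    "maternal_age":             [24, 31, 28, 19, 35, 22, 30, 27, 18, 33],
    "deprived_area":            [0, 0, 1, 1, 0, 1, 0, 1, 1, 0],
})

summary = cohort.groupby("linked_to_birth_register").agg(
    n=("maternal_age", "size"),
    mean_maternal_age=("maternal_age", "mean"),
    prop_deprived=("deprived_area", "mean"),
)
print(summary)
# If the unlinked group is younger or more deprived than the linked group,
# missed links are related to your variables of interest, and dropping the
# unlinked records would build that selection bias into the analysis.
```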
And something you can do to look at the distribution of false links is to identify unlikely or implausible scenarios: people having activity after they're supposed to have died, or people being admitted to hospital twice at the same time — scenarios that may exist in your data but shouldn't. These won't let you measure the full extent of false links, but they might at least let you measure the association of false links with the variables you're interested in, which is quite helpful.

In terms of addressing linkage error, the first and most basic thing you should do is acknowledge and discuss it, and try to unpick what the likely impacts are for your analysis, using those questions I gave you before. Then there are broadly two approaches to correcting or adjusting for linkage error. You can correct for it in a post hoc way, by conducting sensitivity analysis or quantitative bias analysis after your main analysis, which attempts to adjust for the impacts of linkage error in terms of information bias and selection bias. I'm currently working on a classification system for studies of linked data that will help you identify what the impacts are for your analysis, and that will also help with that informal discussion too. Or there are missing-data-based approaches to the analysis of linked data, like imputation-based approaches, and there are some inverse-probability-weighting-based approaches starting to be developed as well. These are for the enthusiastic analyst — they're pretty new and pretty complicated, but they're being developed now. They're techniques based on how we handle missing data, applied to linked data, that prospectively account for the uncertainty in the linkage process and feed it directly into the analysis.

I've listed some useful resources for you: I teach a course on missing data with Katie Harron through the ADRCE; there are some good textbooks on both data linkage and sensitivity analysis and quantitative bias analysis that I think are relevant; some articles that are good places to start if you want to work with linked data or data linkage; and some of the work that will be coming out soon. So, thank you.