This module, as I'm sure you're already aware, is on the ethics of data usage and security, and it's broken up into three sections. I've given the learning objectives on the next slide; there are four items: identifying harms stemming from genomic data privacy breaches, understanding anonymization and its limitations in preserving privacy, understanding how controlled access works as an alternative or complement, and finally understanding why and how SSH is used to securely interact with a virtual machine. The presentation itself has three parts. It starts by asking some of the big questions around privacy and bioethics in genomic cloud computing. The next section talks more specifically about some of the duties you might be asked to undertake in order to access people's data. And the final part covers some security concerns, focusing on SSH connections to cloud virtual machines but also talking about things a bit more broadly. So the first half is more privacy-oriented.

The topic is evolving rapidly and is quite timely; it has maybe been timely for the last 20 years, but it continues to be. This is an article from this Monday in the Globe and Mail about Canada's move towards a genetic anti-discrimination law. I'm not sure if people have been following this, but it's one of the debates that keeps recurring in genetics. Many other countries where genetic or genomic research is happening already have such laws; the US, of course, has the 2008 Genetic Information Nondiscrimination Act (GINA). Canada is perhaps moving towards this: a bill was passed in the Senate and Parliament is now debating it. This article is from three Toronto academics arguing for its adoption. They ask you to consider a scenario where your child is suffering from a potentially fatal disorder, and after several arduous months of searching for an answer to its cause, you are offered a genetic test, one needed to give doctors a clue to explain the illness and help treat it. The concern, then, is that if this information becomes available to, say, employers or insurers, you could be discriminated against in all kinds of ways. We've seen this kind of example in the US, where some employers ran genetic testing as part of their pre-employment medical screening, and that's one of the things that prompted the US act. It may seem a bit remote from the work we're doing, analyzing people's data in the cloud in bulk, but a number of issues come up that relate to this work either directly or indirectly. So I've listed... sorry? This is the first of many questions, I'm sure. So, are there cases in the US? That's a good question; I haven't encountered any, because most of the cases I've heard about came pre-2008. There was a bunch of famous cases, including, I think, an NFL player who was alleging genetic discrimination, but that was around 2006 or 2007. Since 2008 there may be GINA case law now, but I'm not actually aware of it. Is it something you've heard about? I don't know; it could be either that the law is working or that no one is bringing cases under it. It seems like areas like insurance, which are already pretty highly regulated, will sometimes push boundaries, but they tend to comply with clear laws.
So I wouldn't be surprised if there wasn't one. Yeah, the insurance case is interesting: even without doing genetic tests, insurers can already ask a lot of questions about disease history, just not about a genetic test result you may have had. The last time I presented in this room, actually, an intense debate broke out, because some people don't see a problem; they say insurers are already asking all kinds of health questions, so what's the harm? Other people do see one. The way discrimination works in law is that discrimination of some kind happens all the time and is accepted; what the law does is set out prohibited grounds of discrimination, such as race, gender, and the other things we're normally used to thinking of. So it becomes more of a social question: is discrimination based on your genetic predispositions or genetic history something we are willing to accept as a society or not? You can also get into more concrete, less values-laden debates about it. For example, some people argue that if we prohibit all genetic discrimination in insurance, people will go and get tested anyway, those who aren't predisposed won't buy insurance, and premiums will go up. So there's that argument too, although how you weigh it will depend on your values. Someone asked: if there were a genetic predisposition to smoking, and the insurer asks whether you're a smoker, could you say, well, I'm a smoker, but I have a genetic predisposition for it? I'm not aware of such a test, but it would be an interesting link to make. Anyway, I should move on; I'm worried I've put too many slides in here and might not get through them.

So I've listed a bunch of harms, including discrimination, and the laws especially target insurance and employment. Genetic information is inherently health information and on its own has the potential to reveal a predisposition to certain kinds of diseases and conditions, but also, as we increasingly store genomic data alongside phenotypic data, it's possible that having your genetic information released could indirectly release that phenotypic information, because the two can be associated. So if you're in a database of people with a specific disease, and your name isn't included but your genetic data is, it's possible that people could identify that. Paternity information is obviously one that comes up fairly often with genetic testing, and some of these issues have led to the creation of genetic counsellors, who are now mandatory in a lot of cases. Where genetic data is used as a biometric identifier, it can be linked with identity theft. We've already talked a bunch about the legal jeopardy issues, which people have been especially aware of since the Snowden revelations, so there are a lot of concerns on the law enforcement and national security side. I'm not planning to talk much in this presentation about the international border crossing issues Francis alluded to before, but we can if you like.
But the biggest issue actually is future uses. A lot of the harms I've listed we haven't seen so much of yet, but things are moving so fast in terms of research and our ability to sequence DNA cheaply, to have people's DNA available, and to do different things with it, that the risks are hard to gauge. One distinctive thing is this: with other kinds of private information, like my credit card or bank information, if it's somehow compromised, that's bad; I could have quite a bit of money stolen, or maybe not, as the case may be. But at the end of the day, once that information is compromised I can close my bank account, cancel the credit card, and get a new card number. If your genetic or genomic data is compromised, you can't close down your genetic account and open a new one. It's the DNA you're stuck with for your life. That's what I mean when I talk about DNA being static over the course of our lives. And it's tricky because it also reveals information about others, your relatives.

One of the big debates currently, actually, is whether we should be making laws and policies specifically around genetics at all. On the non-discrimination side there has been a clear trend towards genetics-specific legislation and policy, but in some cases people argue: we already have laws that address privacy, we already have laws and regulations that address confidentiality, we even have health-specific ones, so why do we need to talk about genomics specifically? It's an ongoing, unsettled debate, but this graph comes out of a fairly recent academic piece where the authors were trying to make the case that there is something specific about genetic information that does require, in some cases though not all, specific regulation: it's the coincidence of the six factors they identify. Because even the point I made about not being able to close down my DNA and get a new copy actually generalizes to all biometrics; maybe some biometrics you can change, but for the most part not. You're not going to get a new set of fingerprints. So in these authors' estimation, the coincidence of these six factors means we need specific rules around genomics; others disagree strongly, and the debate continues.

And so, again, Francis alluded to this before, but what we end up with at the current time, apart from the laws around non-discrimination, are basically general legal duties and general policy duties that apply to the handling of genomic data without being specific to it. There's, for example, personal information protection law, and again it's quite fractured: in Canada, at least, we have distinct sets of laws governing the public sector, the private sector, and the health sector, all of which can apply to genomic research depending on the context. You might think the health-sector laws would cover everything, but they tend to apply only to hospitals and similarly specific settings, which means they don't apply in a lot of research contexts. We've also got different laws across provinces and federally, so it becomes a bit confusing.
Although they tend to be similar, it can become a bit confusing which specific laws even apply in a given research project, especially given that projects can span provinces and even countries quite easily. Another big area of law is ethical research oversight: people in academic, non-profit and other contexts where you're getting a lot of public funding are often subject to institutional review. Increasingly there are also data access compliance offices that will review projects before you get access to people's data; we'll talk about the ICGC's DACO in particular a bit later. And the third one I've listed here is the increasing desire to get access to people's clinical data, which is more controversial because people haven't explicitly consented to that the way you consent to having your data used in research, so patient confidentiality duties come up in that context.

Someone asked whether that last one is really an issue if the data was properly consented. One of the tricky things is that when you go to consult your doctor, even if they ask, by the way, do you mind if your data is used for research, there's a sense that you might feel more compelled to say yes, because you're there seeking treatment, than if you had voluntarily shown up to participate in a research project. The related point raised was that there's clinical data that simply isn't available for research because it was never consented for that use. And that's right: you definitely have to give informed consent in order to be treated, but that doesn't mean you've consented to having your data used for research. So say I've got pancreatic cancer and my doctor says, I know of this project, would you contribute your DNA; you're in a vulnerable position in that process. That said, I think most clinicians are aware of this, and it's one of the reasons there's extra protection there, because you are in a more vulnerable situation. Not to say it's impossible; if it's consented properly, fine, the issue is when it's not. There are, through research ethics boards, sometimes exceptions, placed under strict controls that are supposed to safeguard people's confidentiality, so there can be other ways to get that data besides explicit consent. But it's a fairly new area, and clinicians often take their confidentiality duties very, very seriously; there's much more pushback from them than from researchers as far as data sharing goes. So there's a bit of a tug there, a real one. What the ICGC seems to be doing right now, I think, is making things very clear to clinicians from the start of the project, because they got so much pushback from the clinicians. It seems like a really different culture between the clinician side and the researcher side, and it will be interesting to see how that project goes forward in particular.

I'm glossing over what are all big topics, but the thing I wanted to bring up, and we've already been talking about it, is that one of the unifying threads between them is the idea of informed consent.
And so through all of these, whether you're dealing with personal information protection and privacy law, researchers' duties, or clinicians' duties, informed consent is supposed to be the idea that people have some kind of control or say over their data. There's a tension there too, as it sometimes gets reduced to more of a checkbox, agree-or-don't-agree kind of thing; maybe I don't want to say it's like a cell phone contract, but there's a tension. And again, this is something Francis was talking about: when you get a project like the International Cancer Genome Consortium, which is worldwide, with projects in a variety of countries, it's tough to get the consents that were collected to be uniform, so people haven't necessarily all consented to the exact same thing in the exact same way. They did try; they basically provided a template that projects were allowed to use, I think. Part of that is cultural, but there are also legal reasons: the rules of informed consent, research oversight and so on don't always work the same way everywhere. I already talked about the different laws just across Canada, so if you think worldwide it becomes a real challenge. There was apparently one country which actually... well, that makes it easy, I guess. So it becomes tricky, especially as genomics projects like the ICGC become so internationalized, to fulfil all these duties at once.

But moving on: I've talked a bit about harms to the people whose genomic data is being analyzed, but there are also potential harms to researchers resulting from privacy or confidentiality breaches. In the case of a breach or misuse, there's the generalized harm that people will lose confidence in the field and be less likely to share their data, which matters if we want this research to continue. Individual researchers can risk having their funding cancelled. There are also reputational issues and community sanctions within the field, which can be really significant. There are fines under different privacy laws for breaches. And occasionally, though we don't see this so much in Canada, there can be criminal or penal sanctions; I've cited one case here from Europe, the Google Italy case, where a six-month jail sentence was ordered, though I don't think it was actually carried out. In the US as well, I'm not sure if people are familiar with HIPAA, but it's the biggest health information privacy law there, and it also allows for criminal sanctions in certain cases.

On top of this, some of the novel risks we're seeing can complicate informed consent. We've been talking about law enforcement and Patriot Act surveillance, and about genetic discrimination. The idea of being informed is supposed to be quite broad: you should have an idea of the risks to which you're subjecting yourself. So how do you communicate to people, oh, by the way, there's also a chance the NSA might harvest your data and do, well, we're not sure what with it? Or, as someone joked, a chance that someone looking at your results decides they don't like your mutation? All of these things are putting a real strain on the privacy and policy-making field.
And regulation that becomes overly and needlessly cumbersome can stall research. If there are incompatibilities between different jurisdictions or different organizations that aren't based on cultural or other factors we might think are significant, they can cause huge delays. We've already talked about this a bit, but to push on this tension: since the beginning of the Human Genome Project there has been huge recognition that this data is extremely important to human health, and through the open data and open science movements there have been strong pushes to share it from the very beginning. As early as 1993 there was an NIH-DOE joint subcommittee that put out guidelines encouraging the rapid release of data because of its benefits to other scientists, and this has only increased over time, with basically all funding agencies now requiring genomic data to be released rapidly so that the most health benefit can be gotten out of it. So there's this tension between increasing privacy risks and increasing pushes to release data. Again, I think Francis may have shown this exact same graph, but you can see the exponential growth in the genomic data being generated. And again, I'm maybe repeating a lot of what Francis went through, and stealing someone else's thunder, but the idea is that with the amount of data we're now generating, cloud computing seems to be, if not the only way, then at least a much more efficient way to process and analyze it. That has pushed this turn to the cloud, but there are also certain risks associated with cloud computing. One of them, especially if you're dealing with a big company like Amazon, is bargaining power: a smaller research team might not have much, whereas a much bigger institution might be able to get a better deal for itself. I'm not just talking economically, but in terms of setting the contractual rules, for instance whether the provider can change your agreement whenever they feel like it, with or without notifying you, that kind of thing.

Someone pointed out that this has come up in practice: with the agreement arranged so that the cloud provider doesn't take on the liability, it's OICR that becomes the next target, and OICR has arranged things so that it doesn't take on the liability either. So now, instead of taking about a week to accept an application, in some cases it's taking as long as a month, because the lawyers and tech transfer offices at the applying institutions don't want to take the risk either. Nobody wants to take the risk, so an agreement that used to take a week now takes a month. That's interesting, and I'll come back to it in a bit.

So this tension has, unsurprisingly, resulted in people searching for some way to get the best of both worlds. And until maybe ten years ago, or perhaps more recently, people really looked to anonymization as the solution to all of these concerns: something that could solve the problem of informed consent and privacy breaches while still allowing us to have data for research.
From the legal side, it's based on the idea that privacy law and these other laws only regulate what's called personal data. Someone asked how that plays out in the access policies; sure, I'm happy to talk about that. The basic idea is that if there's no way to link the data back to an identifiable person, it isn't treated as personal data. There are two main ways of thinking about this. I wasn't necessarily going to go into it, but it is important, especially in the research context, and there are a few different approaches. You can just strip off the direct identifiers: we don't store this person's name or their telephone number or anything else like that with the data, therefore we can't figure out who it is, therefore it's anonymized, therefore they can't experience any harms, and so it doesn't need to be regulated. There you've got the best of all possible worlds: we can use the data without exposing anyone to privacy risks. That was more or less the idea ten or more years ago. Another possibility is that instead of stripping everything off, you keep some kind of random number or code that allows you, when you link to another database, to re-identify that person for purposes like the ones Francis talked about. In Europe this is often referred to as pseudonymization; in North America it's more often called coding; there are a bunch of different names. One of the things about anonymization is that, assuming you want to be clear, it's important to define terms like these very carefully, as you just asked me to, because they are used in different ways. Words like de-identification and anonymization are used totally differently around the world, so it is important to be precise.
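To make that distinction concrete, here is a minimal sketch in Python of the two approaches just described. The record fields and the code format are hypothetical, purely for illustration; real projects use much more careful schemes.

```python
import secrets

# Toy illustration (not any project's actual pipeline): two ways of handling a record.
record = {"name": "Jane Doe", "phone": "555-0101", "diagnosis": "melanoma", "genotype": "..."}

# 1. "Anonymization" by stripping the direct identifiers entirely.
anonymized = {k: v for k, v in record.items() if k not in ("name", "phone")}

# 2. Pseudonymization / coding: replace identifiers with a random code and keep
#    the code-to-identity mapping in a separate, tightly held linkage table,
#    so the participant can later be re-contacted or withdrawn.
code = secrets.token_hex(8)
pseudonymized = {"participant_code": code, **anonymized}
linkage_table = {code: {"name": record["name"], "phone": record["phone"]}}

print(anonymized)
print(pseudonymized)
```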
Unfortunately, as I've maybe been suggesting already, there has been a real loss of confidence in anonymization as a means of reconciling all of these concerns, especially as more and more sophisticated re-identification attacks have come up. Anonymization isn't specific to genomic research, but it's important within it; it applies to privacy concerns more broadly. The first things that started coming up were open data, big data cases. AOL, for example, released thousands of users' search queries for the public good, or for crowdsourcing purposes, so that people could analyze them and pull interesting information out of them, and they considered the data anonymized because names, phone numbers and so on had been removed. Unfortunately, they hadn't realized that people have a tendency to do things like search for their own name, and within a few weeks people came out saying they had re-identified a dozen or more individuals in this supposedly de-identified data. That's a simple example, but over time the re-identification attacks have become more sophisticated. Netflix, for example, released a set of people's movie-viewing habits, without names or other information, asking whether someone could crowdsource a better algorithm for recommending the next movie you'd want to watch, and soon afterwards people were able to figure out which individuals had watched which movies.

And so after all these escalating breaches, and ever more creative methods, there has been a real shift. Sorry? Yes, that one is more in the health-specific context rather than general data. So the consensus increasingly is that you can't fully anonymize data and have it remain useful. There may be some exceptions to that, but it's especially true in the case of high-dimensional data. People sometimes talk about geolocation as having some of the same features, but I think genomics is really the ultimate in high-dimensional data. One scholar, Paul Ohm, argues, and I don't think I would go this far, that privacy law should continue to apply even to fully de-identified data, at least for the more sensitive forms.

Also, as Francis just mentioned, there are criticisms of anonymization that aren't specific to privacy but are based on research or ethical duties. One of the ideas behind informed consent is that consent isn't like a contract, where you make an agreement and both parties are bound to it afterwards; it's something you willingly give but can withdraw at any time. In the research context you're supposed to be able to withdraw. But if we can't even figure out which data is yours, because we've fully anonymized it, we can't do that anymore, so some people criticize anonymization on that ground. Similarly with the duty to return incidental findings. That refers to cases where you're not looking for a specific disease or condition, but you find someone in your dataset who has one; the paradigmatic case is coming across someone with a really serious disease for which we also have a very effective treatment. There's an increasing argument that you have an ethical obligation to return that information to the person so they can act on it, and if the data is anonymous, that isn't possible either. It does get difficult in terms of researcher liability: there's no obligation to actively look for these things if you're not looking for them, but the question is, if you happen to come across something, how far does the obligation go? I was actually reading a case about this yesterday, and it's interesting because it's not totally clear how far it extends. And there are other criticisms based on the fact that, as anonymization becomes increasingly difficult, you have to strip away more and more data to be sure it's actually properly anonymized, and that hinders the research value.

Now, getting more specific to the genomic context: even as recently as about ten years ago, there were arguments being put forward, at least on my reading of what this academic is saying, that so long as you strip off the person's name, phone number, et cetera, you can consider genomic data anonymous. And even more recently there have been people saying, come on, really, how are you going to re-identify someone if you just have a bunch of their base pairs with no name attached?
But more recently in this field, like others, there has been a series of ever more sophisticated re-identification techniques. I've got a sub-sample of them here; there are others I've left out. In 2004, one paper found that 75 single nucleotide polymorphisms could uniquely identify someone, which isn't the same thing as telling you their name, but would uniquely pinpoint that person as opposed to someone else. A really famous and important study was done in 2008 by Homer and colleagues, who showed that partial genetic information could be used to determine whether a person belonged to a study's control group or its affected group, and that one had huge ramifications. Sorry, go ahead. The question was: if all seven billion of us had our genomes sequenced somewhere in the cloud, that first claim would clearly hold, but otherwise, if I just have 75 SNPs from somebody, how can I actually identify the person? Well, wait until the Gymrek paper from 2013. But first, the Homer paper is important because of what it was using. In the US there's a giant database you may know of called dbGaP, the database of genotypes and phenotypes, and it had initially been largely or exclusively open access, so people could access it very easily. The data analyzed in the Homer paper came from that database, and it was aggregate data, so you can almost think of it, not quite like census data, but along those lines. It's GWAS data, so SNP data, thousands of SNPs per individual; but even with a relatively small number of SNPs from somebody, you could compare them against the published aggregates and see whether those SNPs are more common in, say, the schizophrenia group or in a different group, and therefore say, aha, I know which group this person belongs to, case or control. So yes, you could see whether someone had the disease being researched or not; schizophrenia is one example, and there are a bunch of other diseases where the same applies. This paper sent shockwaves through the field, and it ended up that the dbGaP data was taken off open access, so it was no longer available to the public at large, which I'll come back to.
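To make that Homer-style comparison a bit more concrete, here is a toy Python sketch of the intuition (a sketch only, not the actual statistic from the 2008 paper, and the numbers are made up): for each SNP, compare the individual's allele against the published allele frequencies of the study's case group versus a reference population; a consistently positive score suggests the person contributed to the case group's aggregate data.

```python
# Toy membership-inference sketch. Each allele is coded 0 or 1 for a biallelic SNP,
# and the frequencies are for allele "1" in the case group and a reference population.
def membership_score(individual_alleles, case_freqs, reference_freqs):
    score = 0.0
    for allele, p_case, p_ref in zip(individual_alleles, case_freqs, reference_freqs):
        p_case_obs = p_case if allele == 1 else 1 - p_case   # freq of the observed allele in cases
        p_ref_obs = p_ref if allele == 1 else 1 - p_ref      # freq of the observed allele in reference
        score += p_case_obs - p_ref_obs
    return score  # > 0 hints at membership in the case group, < 0 at the reference group

# Hypothetical values purely for illustration:
print(membership_score([1, 0, 1, 1],
                       [0.42, 0.30, 0.55, 0.61],
                       [0.40, 0.33, 0.50, 0.58]))
```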
So then the Gymrek paper in 2013 was interesting because, through genealogy sites like ancestry.com and others that have come up, they were able to go back and link bioinformatic profiles to those databases and re-identify people by name in up to 10% of cases. I think it was exclusively males because the method relied on patrilineality, people sharing their last name through the male line. But it was another interesting, creative way of doing it. And most recently there was a paper by Cai and colleagues that re-identified individuals based on only 25 randomly selected SNPs, using Wellcome Trust data. So I think it would be hard at this point to say that a whole genome sequence is anonymized in the way people seemed to think it was ten years ago. There has been a kind of crisis of confidence in anonymization. It's important to be aware of, because anonymization is still used and talked about quite a bit, but it's no longer seen as the solution that can resolve the tension I was talking about before.

With that collapse has come a bunch of new approaches, technological, organizational and legal, to try to fill the gaps, and I'll briefly mention some of the technological ones. Some that I find quite interesting are novel cryptographic methods that are somewhat anonymization-like. The one I've got in the graph here is an example from a project called DataShield, where you've got data distributed across different repositories, and you can run aggregate studies and statistics on it and get results back without having to reveal any individual's data; in that sense it's similar to working with aggregate statistics. The generalized version of this kind of technique is called secure multi-party computation. The second technique I've listed here is homomorphic encryption, which I'm not sure if people have heard of, but in theory it's ideally suited to the cloud context. The notion, which I think is quite amazing, is that you can send people's data encrypted into the cloud, which is pretty standard, we can do that quite easily, but then also send encrypted instructions for analysis into the cloud, have Amazon run that analysis without knowing what it is, and get back a result that itself remains encrypted, so you're able to leverage the cloud infrastructure without disclosing any details to the cloud provider. It's pretty ingenious. The drawback, unfortunately, is that while there have been several academic proof-of-concept papers, it hasn't yet scaled to real-world situations in as generalizable a way as we might like, but it's a promising new technique. So, so far these haven't filled the gaps, but they're interesting.
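To give a flavour of the DataShield-style idea of computing on distributed data without pooling individual records, here is a toy Python sketch; the site names and values are made up, real systems add disclosure checks, and secure multi-party computation adds cryptographic guarantees on top of this basic shape.

```python
# Toy federated-analysis sketch: each repository exposes only aggregate summaries
# (here a sum and a count), and the analyst pools them into an overall mean
# without ever seeing individual-level records.

site_a_values = [4.1, 3.8, 5.0, 4.4]   # held at repository A, never shared
site_b_values = [3.9, 4.6, 4.2]        # held at repository B, never shared

def local_summary(values):
    # This is all a site is willing to return to the central analyst.
    return {"sum": sum(values), "n": len(values)}

summaries = [local_summary(site_a_values), local_summary(site_b_values)]
pooled_mean = sum(s["sum"] for s in summaries) / sum(s["n"] for s in summaries)
print(pooled_mean)
```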
Similarly, there's, I don't know if you'd call it a technique, but at least an analysis or a new approach, put forward by a researcher at Microsoft, called differential privacy. The idea is interesting in that it's actually intended to be future-proof: you're meant to be able to use a mathematical formula to say that, based on the results we're publishing and making available, it's mathematically provable that no matter what happens in the future, a person won't be able to be re-identified beyond a certain degree. The way you essentially do it, again usually dealing with aggregate results, is by looking at whether any individual's participation or non-participation in the study would affect the published result, and therefore their privacy, by more than a very small threshold; and essentially, if the result doesn't meet that threshold, you add a bit of noise into the equation until it does. It has had some real-world uses, but it's limited because noise injection degrades the data. One of the problems is that there's an inherent tension between anonymization and the kind of genomics research people are doing, where researchers are often most interested in the outliers, while what you want to do in anonymization is explicitly get rid of the outliers. That tension makes these things hard to scale to the real world.
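For the noise-adding step itself, the standard tool is the Laplace mechanism. Here is a toy Python sketch of the idea (a sketch only, not a vetted differential privacy implementation, and the epsilon value is just an example):

```python
import math
import random

# Toy Laplace mechanism: publish a count with noise calibrated to the query's
# sensitivity (adding or removing one person changes a count by at most 1) and
# a chosen privacy budget epsilon. Smaller epsilon means stronger privacy but
# noisier, less useful output.
def dp_count(true_count, epsilon, sensitivity=1.0):
    scale = sensitivity / epsilon
    u = random.random() - 0.5                                   # uniform on (-0.5, 0.5)
    noise = -scale * math.copysign(1, u) * math.log(1 - 2 * abs(u))  # Laplace(0, scale) sample
    return true_count + noise

print(dp_count(true_count=137, epsilon=0.5))
```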
So now I'll move on to what has been one of the more commonly used alternatives in genomics since this crisis of confidence in anonymization, which is controlled access regimes. This is what dbGaP in the US moved to after the Homer paper was published. We used to have fully open access data, and now we've got two different tiers, open access and controlled access; this is an example of the ICGC datasets kept in each. My sense, and maybe Francis can fill in the gaps, is that the distinction is that controlled access data is considered more identifiable and open access data is not. Someone asked whether it also takes the sensitivity of the data into account, or whether it's just identifiability. Generally, the way the laws and policies work is that they have big categories of sensitivity, but they usually do things like treat all health data as sensitive, so it may not be broken down that way, even though you could imagine some health data being less sensitive than other health data, right? Francis added that, for example, geographical location was a concern, so it has to be kept in the controlled tier, although the flip side is that the same information can be exactly what you need for the research; but the major criterion is of course whatever could identify the individual, so the genomic sequences themselves.

A lot of this is similar to the law I talked about in the US, HIPAA, which has a specific rule about how you de-identify data. It says, for example, that you can only use the first three digits of the person's zip code, because all five would narrow things down too far, and there are similar rules around reporting very old ages. It's interesting, because there was one famous case about 15 years ago where a researcher, I forget her name right now, but people refer to her as the queen of re-identification, found that based on someone's birth date, sex, and five-digit zip code you could uniquely identify something like 87% of the US population, so almost everyone, which at the time was seen as surprising: how are you going to put that together? Things have really changed since then, and I think it's valuable when somebody actually challenges the premise like that, the way it was done with the zip code. Maybe it comes across as provocative, but they wrote a paper about it, so they declared what they had done. People sometimes distinguish between black hat hackers and white hat hackers, where the black hat ones are trying to get your information to do devious things and the white hat ones are trying to expose flaws so they can be fixed, and I think these researchers would see themselves in the latter category. I find them to be helpful too: as long as they're not trying to compromise things, they're actually trying to strengthen security and privacy systems.
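That birth date, sex and zip code result is about quasi-identifiers, fields that aren't identifying on their own but can be in combination. Here is a toy Python sketch of that kind of check, with hypothetical records purely for illustration:

```python
from collections import Counter

# Records whose (birth_date, sex, zip) combination appears only once are
# re-identifiable by anyone who can link that combination to a name, for
# example via a public registry. None of these fields is a direct identifier.
records = [
    {"birth_date": "1961-03-14", "sex": "F", "zip": "02138"},
    {"birth_date": "1974-02-12", "sex": "M", "zip": "02139"},
    {"birth_date": "1974-02-12", "sex": "M", "zip": "02139"},
]

combos = Counter((r["birth_date"], r["sex"], r["zip"]) for r in records)
unique = [combo for combo, n in combos.items() if n == 1]
print(f"{len(unique)} of {len(records)} records are unique on these quasi-identifiers")
```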
One participant said they were on the other side of this: with all the controlled access data, there are hardly any eyeballs on that data. Consider the millions of dollars spent generating it relative to how much it actually gets accessed. I think the bar was set too high, too conservative in my mind; it's protecting everybody, but protecting it in a way that means nobody is looking at it, and I think we're hurting the mission of the ICGC by not making the bar lower. And yes, it's tough, because people are always looking for the win-win, but at some point it becomes a tension between privacy and research, and the question is where you're going to set the bar: too far one way, or too far the other? They continued: the thing is, every second person is going to be affected by cancer at some stage of their life. Is it treatable or even preventable in some form? Maybe in that case this information should be more likely to be shared. And part of it is where the bar sits: I'm a scientist, I have an ORCID iD, so I'm a scientist with an identifiable identifier; maybe that's all I should need to be able to access the data, rather than having to go through this whole complicated process that prevents me from doing it. I totally respect and work within all the rules and regulations imposed on me and so forth, but at the same time, as a patient advocate in a way, I could argue that we need to lower the bar to get more scientists into the data, because if you have more eyeballs on the data, maybe somebody is going to make a discovery.

There is a real tension there. I was going to go through some of the ICGC access process next, and I'm not trying to brush off the comment; I actually want to delve into it more deeply. So there is all this data, indexed in the ICGC data portal, which you can look at as of now, and the Collaboratory is set up to work with this ICGC data, including the controlled access portion. I'm going to show a bit of the process you'd go through if you want to access this data for a future project, which essentially means heading to the Data Access Compliance Office, setting up an account, and asking for the data. And like Francis was just saying, there are certain requirements you have to fulfil. I've got a few slides here just going through the process: log in, create your account, and then the next step is the application. The application form has a bunch of sections. First you have to give your name; section B is the name of the authorized institutional representative, including affiliation and contact details. And I think this is what we were getting at: at this point you have to have an institution that's willing to sign off on your access and that is going to take responsibility for the proper handling of the data. The idea is that someone is keeping you accountable. Francis noted that at a university it would typically be the office that signs off on grants, the research services office, say; they're the ones who sign, and in doing so they're saying, yes, this person is who they say they are, and he or she will handle the data appropriately.

As was said a minute ago, one of the things people are looking at is this gap. We have those two tiers, controlled access and open access, and people are looking at whether there should be an intermediary zone somewhere in between, for data that's a bit less sensitive or less identifiable, where you would still have to go through a process, but it might be somewhat shorter. I think that's focused especially on this second requirement, that you have to have an institution willing to sign off on you. They're thinking of cases, and these are more rare, where for example someone has a family member with a particularly rare disease. It touches them very deeply, they start looking into it more and more, and they almost become, I don't know if people have heard of citizen scientists, but they almost become experts in the field of that one rare disease.
And so there's an argument to be made that such a person has a legitimate interest in having access to data about people afflicted by that disease, but no way of getting it: they're not linked to an institution, so they have no one to sign off on them. The difficulty is how you maintain accountability while also giving as much access as possible. So where this bar should be set is something that's still very much being worked out in the policy and regulatory fields. But for now, at least for the ICGC data, this is the process we've got: the ICGC controlled access application. Beyond the institutional affiliation, you have to give information about your project; it looks a bit like institutional ethics review, if you've done that before, a similar kind of process. There's a data access agreement you have to sign with a whole bunch of conditions, including that you won't try to re-identify any of the individuals. In both the open access and the controlled access databases there is no directly identifying information, no one's name is in there; but part of the idea is that the controlled access data is more identifiable, so you agree you won't try to re-identify people, you won't inappropriately disclose the data, and you'll follow all the other policies listed in the application.

I think I've got those listed here; I've tried to group them under categories. They're mostly pretty short, two or three pages each, though some run longer, and when you sign up for DACO approval you're asked to read and abide by them, including the ICGC's guidelines. Oh, they're not coming up. All right, here we go. So I tried to group them a bit, and there are a bunch of different smaller obligations that come into play. The ICGC Guidelines 2008 is a broad policy document, although it does include some more specific updates. For example, there's a publications guideline with information about publishing the results of your research in an academic journal or elsewhere; there are obligations, for instance, to credit the ICGC, or the project that collected the data. There are also obligations around the rapid release of data, which we talked about before. For quite a while, some people have expected data to be released into open access databases within 24 hours of being generated. One of the worries about that was that people wouldn't be willing to do it if someone else could start analyzing the data, scoop them, and get a publication out first on the same data. One of the techniques to avoid that is the established practice of embargo periods, where you upload data for everyone to use, but no one else is allowed to publish on it until you have first had a chance to, or until some time limit expires.

Was there a hand up there? The answer given was that it's actually a bit more nuanced: the genomes themselves are available, and what the ICGC is trying to protect against is someone else doing, say, the analysis of the first 100 genomes right away. So there's an embargo period, but if you release the genomes and haven't done that analysis within the time limit, then others can.
I think it's two years, actually. If you haven't done anything in the two years since you released a genome, then somebody else can, and is encouraged to, do the whole-genome analysis. And if the first genome came out but you didn't reach 100 genomes for a year, then when you do reach 100 you have one other year; it's two years from the first genome. So it is relatively complicated, depending on when the genomes show up and so forth, and we now have a website with all the dates saying this data is non-embargoed, you can do whatever you want with it, or this data is embargoed until such-and-such a date. This is one of the areas where the US has been moving a little bit away from embargoes, and TCGA has as well, and the reason given was that scientists were having a hard time figuring out whether the data they were analyzing was under embargo or not. So TCGA did that page first, with the dates when the embargoes run out, and we finally did the same. And TCGA is part of the ICGC, but it's US, so it has separate rules, separate repositories, separate data access. When we talk about the ICGC we often really mean the non-TCGA part of ICGC business; TCGA is part of ICGC but it's separate, so there are really two separate data sets, with TCGA covering cancer in the US.

If you look at the publication guidelines, and a lot of this I'm glossing over quickly for purposes of time, I just want to touch on the main points: if you're interested in publication, you might want to check out the publication policy, and the duties are essentially as I've described. As I said, there's also a bunch of privacy and security policies. The Global Alliance for Genomics and Health one and the ICGC one are fairly high-level, principles-oriented approaches; the other two are shorter, and I'll go into more detail a bit later about the security best practices for controlled access data. There are also, similar to what we were talking about before, data sharing obligations for people who sign on, based on principles that come out of the genomics research community around the rapid release of data. And finally there are intellectual property policies. Again, you'll have to read them to get the full idea, but essentially people who make use of ICGC data aren't allowed to apply for intellectual property that would apply directly to the ICGC data they obtained, and they're not allowed to apply for intellectual property rights that would block other people's ability to access or use that data. Those obligations come from the ICGC guidelines I've listed at the top; the bottom two documents extend a bit beyond that and take a more pragmatic approach to IP. The NIH best practices, if I boil them down, say that if you come up with a genomic discovery that is basically at or near the point where it can be turned into a marketable commercial product, you shouldn't apply for IP on it, they're asking you not to; if it's going to require quite a bit more R&D work before it can get to market, then it's permissible to try to apply for patents and IP on it.
The idea is to make the data as likely as possible to produce results that help people; so, at least as I understand the impetus behind it, if a discovery is going to need private-sector investment to get to market, more IP rights are allowed, and if it doesn't, you're encouraged not to seek them. There were two hands up here. The first point was about the sequence itself: if it's the DNA sequence as such, the principle mentioned earlier applies, so you cannot put IP on the sequence, as opposed to, say, a biomarker or something built on top of it. Yes, and there's a big Supreme Court case in the US about that. Obviously you want to encourage drug development in the cancer world, and developing a drug is very far away from the sequence and requires a lot of intellectual work, so that's very much patentable; but the sequence itself is not. And the discussion is now moving closer to raw data, including raw clinical data, and there are starting to be some challenges in the courts in Canada too, so we'll see where they come down on it; unlike the US, where the Supreme Court weighed in pretty decisively. What was the other hand? It was on the same point: the observation that a lot of companies get around this by patenting the cDNA instead, which has been tried by a number of companies.

So, getting back to the DACO: this was, I think, a suggested reading beforehand, but it's a published article on the experience of the DACO over its first five years, and in it they were pleased to report that when people submitted the application we saw online, they were able to process it within about a week and give people access to the ICGC data within 24 hours after that. As mentioned earlier, it's now taking a bit longer in the new cloud context. Someone asked what date the paper was; I don't see a date on it. But they've had a steady increase; they're cheating a little here because the chart shows cumulative applications processed, but you can see a steady increase in new projects getting access.

So with that, I've given the big principles and where to look for more on the privacy side, and we can talk about it further in discussion. I want to move on to the security side of things, which is distinct but related, and I think it's important to keep privacy and security in mind as two separate things. Security breaches can result in privacy breaches, but they're not quite the same, and security is characterized by trade-offs similar to privacy's, if you remember that you can't have perfect anonymization and perfectly usable data.
Again, you can't really have perfect data security unless your computer is shut off and never connected to the internet, the same way you can't have zero highway deaths and still have people driving cars. That's not to say we should take this lightly; I think it's important to follow best practices, and as mentioned before the consequences can be serious, but if you're looking for something that will absolutely guarantee you have no problems, you're unlikely to find it. So I'm going to focus on best practices for the next couple of sections, talking especially about SSH. You can make an encrypted connection to your VM through password-protected SSH, but there are some shortcomings, including people using weak passwords, people losing or forgetting their passwords, and various other problems, so we're going to be talking about connecting to a virtual machine using SSH key pair authentication, which is a common alternative. You can also use both at the same time, a password plus the key exchange, as a kind of two-factor authentication, but if that isn't required you can use key pair authentication on its own. If everyone is already very familiar with SSH, perhaps I can breeze through this section, but I'm not sure.

Just to explain the general idea of the key pair exchange: if you have an insecure channel, represented by this orange line between two computers, you want some way to communicate securely without anyone else who is listening being able to read it. For our purposes it's probably going to be your laptop on one side, and on the other side the Cancer Genome Collaboratory, the cloud genomic research facility you want to connect to, with the internet in between. The way it works, and I think we'll be doing this in the next workshop, George will be presenting how to do it if I understand correctly, is that you get your laptop to generate a pair of encryption keys: first what's called a private key, and from that is derived a public key. You can think of them, though maybe it's a bad way to think of them, a bit like a username and a password: the public key is kind of like your username, it allows you to be identified, and the private key is kind of like your password, you don't want anyone else to have access to it. The analogy breaks down somewhat, because there's a special cryptographic relationship between the two keys, with the public key derived from the private key, but it's one way to think of it. The first thing you need to do to set up key pair authentication is to send a copy of your public key over to the cloud, and you'll notice there's already a potential problem here: the whole point was that we had an insecure channel, and we've just sent a public key across it, so with your ISP and a whole bunch of other intermediaries in between, a malicious person could have swapped it out for something else as part of an attack. One way around that is to copy your public key over on a USB stick or by some other physical means, but in practice there are ways, which I think we'll cover in the next workshop, to securely copy your key even over the theoretically insecure network.
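As a rough idea of what the key generation step looks like in practice, here is a minimal sketch that shells out to OpenSSH's ssh-keygen (it assumes OpenSSH is installed; the file name and comment are hypothetical, and the workshop's actual steps may differ):

```python
import subprocess
from pathlib import Path

# Minimal sketch: generate an SSH key pair locally with OpenSSH's ssh-keygen.
# ssh-keygen will prompt for an optional passphrase to protect the private key.
# The private key stays on your machine; only the .pub file gets copied to the
# cloud, where it is appended to ~/.ssh/authorized_keys on the virtual machine.
key_path = Path.home() / ".ssh" / "collab_demo_key"   # hypothetical file name

subprocess.run(
    ["ssh-keygen",
     "-t", "ed25519",             # modern key type
     "-f", str(key_path),         # where to write the key pair
     "-C", "collaboratory-demo"], # a comment so you can recognize the key later
    check=True,
)

print(f"private key: {key_path}")
print(f"public key:  {key_path}.pub")
```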
Once that's set up, essentially through the magic of encryption there's a key exchange process in which the virtual machine you're connecting to can verify that you are authentically who you say you are, without your private key ever being disclosed, and you can then set up what is basically an encryption tunnel, where all the communication passing between you and the virtual machine is encrypted. People along the network can see that you're communicating, but they can't see what you're communicating about, and there's no worry about anyone listening in or injecting other, malicious communications between you. The nice thing about it is that while the shell part of Secure Shell originally just refers to the ability to execute commands remotely, through SSH tunneling you can also do things like transfer files or run basically any arbitrary service remotely, including remote desktop, so you can essentially be set up to act as though you're controlling the virtual machine in the cloud directly from your laptop.

The general idea as far as security obligations go is to minimize the number of ways things can go wrong, in every way you can think of. You're going to be running the SSH listening service on a specific port, and you want to keep every other port closed as much as possible. One thing people often do is run the SSH server on a non-default port number, so that someone scanning for a way to connect won't find it as easily; this is a security-through-obscurity approach, so it's definitely best combined with more reliable techniques, but it's something people sometimes add on. Your firewall should be blocking as many of the remaining ports as practical, you shouldn't have unnecessary services running, and it's obviously crucial to limit any physical or other access to your private key. There are also basic habits such as shutting down your virtual machine when it's not in use and consulting the available resources, including the security guide, which I'll go over in a bit more detail shortly. It's a good idea to prohibit password-only SSH connections to the virtual machine, so that no one can try to guess a weak or empty password and only key pair authentication can be used. And in general, don't ignore any SSH authentication warnings you see; if you don't understand one, ask someone who does before you agree to override it. One common warning looks like this: if you remember the diagram of the SSH communication, we transferred our public key to the cloud, which allows it to identify us, but we never got the cloud's public key, which would let us identify it. This is essentially the warning message that comes up if the server's cryptographic fingerprint seems to have changed, so maybe they're not who they say they are, which can happen for a variety of legitimate or illegitimate reasons.
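Pulling a few of those points together, here is a rough sketch of what the client side of such a connection can look like, along with the matching server-side settings; the host name, user, port and key path are hypothetical, and your VM's actual values will differ.

```python
import subprocess
from pathlib import Path

# Sketch of connecting to a cloud VM with key pair authentication on a
# non-default port. Server-side, the matching hardening usually lives in
# /etc/ssh/sshd_config, for example:
#   Port 2222                    # non-default port (obscurity only, not a real control)
#   PasswordAuthentication no    # disallow password-only logins
#   PermitRootLogin no           # no direct root logins
private_key = Path.home() / ".ssh" / "collab_demo_key"  # key from the earlier sketch

subprocess.run([
    "ssh",
    "-i", str(private_key),       # which private key to authenticate with
    "-p", "2222",                 # the VM's (non-default) SSH port
    "ubuntu@vm.example.org",      # hypothetical user@host for the VM
])

# The same encrypted connection can carry a tunnel for other services, e.g.
# forwarding a local port to a web service running on the VM:
#   ssh -i ~/.ssh/collab_demo_key -p 2222 -N -L 8080:localhost:80 ubuntu@vm.example.org
```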
So, getting close to the end now, before we have a discussion if people want one, I'll talk a bit about what the ICGC security best practices call for. Again, you'll have to read the whole document to get the full idea, but it's based on the idea that you should run an independent assessment when you begin your project, tailored to that specific project, to figure out what security measures should be in place; those will differ based on the size and nature of your project. The agreement does list a number of specific considerations to take into account when doing so, which are largely about minimizing potential vulnerabilities. They include things like policies for when data will be destroyed, which should happen when it's no longer needed; the ability to log who is entering the system, so it can be audited; and a variety of physical safety measures: you should avoid physically transferring data as much as possible, and if you have to, the data should be encrypted at rest. The expression they use is that you should treat physical data like cash, and transport it using the security measures you would use if you were transporting cash. As far as network security goes, end-to-end encryption of the SSH variety is always preferred, and the people who are going to be using the system should be trained. The cloud-specific recommendations are mostly about the relationship with the cloud service provider: making sure the provider is reliable, that the agreement will allow you to fulfil all your obligations, that there's a clear division of responsibility between the cloud service provider and your project, and so on. I think I've covered most of it; for the full details you can look at the security best practices document, which is only five or six pages, so it's not too painful. It's also sometimes helpful to consult outside resources with more specific advice. I've pulled up one here from a blog; it's good to make sure you're looking at reputable rather than non-reputable advice, but some articles have specific ideas for hardening SSH security in the cloud virtual machine context specifically.

I've already talked a bit about reviews and auditing. One thing to avoid is the tendency to consider risks and set up security procedures only when establishing a new system; periodic review is important, and, when possible or when called for by the nature of the project, third-party auditing. That's partly because your project evolves: for example, you've got different people with access to your data who've all sent their public keys to your virtual machine, and you might want to review whether those keys should still be there or should be removed (there's a small sketch of that kind of check below). But aside from your project evolving, there's also the fact that data security best practices are constantly evolving, with new attacks and new vulnerabilities constantly coming out. Not to be overly alarmist, as maybe this article headline is, "New cloud attack takes full control of virtual machines with little effort", which came out I think a month ago, and I think it only works in a very specific context, but a lot of these techniques are unforeseeable. The idea there was that by writing very aggressively to one space in RAM, I believe, you could actually flip some bits in another part, not the user space but the system space, and take control. It's a somewhat isolated case, but it is a good example of why ongoing review matters, as security vulnerabilities are constantly changing and being updated. And so with that, I think I'm ending a fair bit early, but if there's any discussion or further questions that people want to raise, I'm happy to take them.
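Here is the small sketch mentioned above: a quick way to list the public keys currently authorized to log in to a VM, so keys belonging to people who have left the project can be spotted and removed. It assumes a standard ~/.ssh/authorized_keys layout and would be run on the VM itself.

```python
from pathlib import Path

# List the public keys currently authorized on this machine. Each line of
# authorized_keys is "<key-type> <base64-key> [comment]"; the comment is what
# usually tells you whose key it is, which makes stale keys easy to spot.
auth_keys = Path.home() / ".ssh" / "authorized_keys"

for line in auth_keys.read_text().splitlines():
    line = line.strip()
    if not line or line.startswith("#"):
        continue                      # skip blank lines and comments
    parts = line.split()
    key_type = parts[0]
    comment = parts[2] if len(parts) > 2 else "(no comment)"
    print(f"{key_type:<15} {comment}")
```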