Thanks. Hi again, everyone. I introduced myself briefly earlier on. My name is Mark Phillips. I work at the Centre of Genomics and Policy at McGill University in Montreal. As I mentioned, I have a background in computer science going a ways back; I worked as a computer programmer for a while, then went back to school, went to law school, became a lawyer, and I look a lot at issues of privacy, data protection, and some other ethical issues. You may have noticed that I have not mentioned life sciences anywhere in any of those qualifications. I have no formal training in life sciences, but I've picked up quite a bit as I've gone along. So I'm going to try to stay somewhat in my lane. This field seems increasingly interdisciplinary with computer science, especially on the bioethics side. So I'll be talking partly about some of the legal and ethical issues involved, and partly giving a bit of background on cloud computing, using virtual machines, et cetera, starting to move into the more practical aspect of this whole workshop. I'm not going to try to lay out specific rules of exactly what you need to do, because this field is developing and emerging really rapidly. There's a lot of debate and, to some degree, confusion about what's happening. So my goal is to raise the different ethical issues, help you notice when one is in play, and help you discuss things with others when they come up.

I've listed a few objectives, most of which I've already touched on: understand the importance of the cloud for genomic research, identify the legal issues, and then talk a bit more specifically about SSH, which is used to interact with virtual machines and will be the first step of some of the work you'll be doing this week.

I've broken this up into four sections, and this first section is the cloud portion. In the field, we're noticing a huge move into the cloud. The first thing that comes up here is a report from the European Commission: they're planning to build what they're calling the Open Science Cloud, which is going to link existing infrastructures in Europe to allow for medical research. In Chicago, there's an initiative you might be aware of called the Genomic Data Commons, which is more specific to genomic science. And there's a similar project, the one we're going to be primarily working with this week, based out of Canada but international in scope, called the Cancer Genome Collaboratory. These projects have really been pushed by the increasing amounts of genomic data, which are becoming increasingly difficult for researchers to download, analyze, share, and work with, not just on their own laptops but even in university high-performance computing centres. And there are other reasons as well, delving into some of the legal and ethical issues, why researchers would like to have control of their own infrastructures and be able to establish the rules and procedures by which they work.

So, first question: what are we talking about when we talk about the cloud? Just to lay out the baseline, the most commonly cited definition is from the US National Institute of Standards and Technology. What they say is that cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources.
And there are some examples: networks, servers, storage, applications, and services that can be rapidly provisioned and released with minimal management effort or service provider interaction. You see a lot of words in there like ubiquity, convenience, rapidity, and minimality of management, which to my mind are not very concrete; they're open to interpretation. So this whole definition does say quite a bit, but at first it can be a bit confusing what exactly it means, what counts and what doesn't.

Another way people tend to think about it, which breaks down some of the elements in that previous definition, is in terms of a few essential characteristics. When we talk about on-demand self-service, we're thinking about not having to interact with a person to set up a new account or a new system; it's convenient in that way, you can do it yourself. Broad network access means you can access it from a variety of places, not necessarily the entire internet, but often you can access the cloud from your phone, from your laptop, et cetera; it could also be within a larger wide-area or more private network. Resource pooling means these are often multi-tenant services: you might be running a virtual machine that seems to be all your own, and is all your own, but it might actually be running on the same physical system as someone else's software. These services are shared, serving a bunch of different people. Usually that's not a problem; in rare cases, there can be difficulties related to that. Rapid elasticity is the next one: you can easily scale up and scale down what you're doing, both the storage you're using for your projects and the amount of computing power, the amount of memory, et cetera, which is different from a conventional setup. If you're running, say, an academic computing centre, when you're not using it, it sits there idle; you've paid for it and you're having to maintain it anyway. Whereas in the cloud, there's this idea, at least, that you pay only for what you use. That relates to the last characteristic I've got here, measured service: the idea that what people are doing is being measured, obviously for billing purposes, but not only for billing, also to figure out how to optimize use for quality of service, et cetera. These are some of the basic notions that come up whenever cloud computing gets discussed, and they're good to familiarize or re-familiarize yourself with so that when they're mentioned, it's clear.

People generally talk about three different service models that cloud providers can offer, and they're kind of on a continuum. Infrastructure as a service is any kind of cloud provision that gives you pretty much raw resources: close-to-raw storage space, raw compute power, raw memory, or some combination thereof. The opposite end of the spectrum is the bottom one, software as a service, which would be things you're used to viewing through a browser. If you're viewing your webmail through a browser, that's still a cloud service.
You don't know exactly where your data is being stored or how it's being sent to you, and you have control only through a very limited portal. The platform as a service model is somewhere in between: there are some tools or pipelines that you have access to, but it's still at a lower level. What we're going to be working with this week is pretty much the first two layers: infrastructure as a service, in the sense that we can fire up virtual machines that we can do pretty much whatever we want with, but we can also draw on an existing library of analytic pipelines and tools that are platform-like.

You'll also hear people talk about the different deployment models for clouds. The first two are public and private cloud, and to some degree, at least I find them counterintuitive. When we think of public services, you think of, or at least I think of, something being provided by the state. But this is really the opposite: with public cloud, we're usually talking about things like Amazon that are made available to anyone, whereas private clouds are clouds that have been built for specific purposes. Another deployment model is the community cloud, which might fit in here, Francis? (Audience: Often academic work?) Yeah, exactly, you could think of it that way, that makes sense. Was there another question here? (Audience: I was thinking that the Collaboratory project would fit more in the community cloud, where it's groups of people that have gotten together, but would you say that's more private than what we're working with here?) The academic part is kind of orthogonal to that. I think the Collaboratory is public in the sense that it's publicly accessible; you'll see that the cloud is not hidden behind a firewall, so it can be seen. But it's not commercial, or only slightly commercial; it's difficult to categorize, and it's run by an academic group as well. So I feel like a lot of these things defy categorization, but because the terms are thrown around quite a lot, I find it's still helpful to orient to them.

I'll go through this quickly, but just to remember, in a different context, what's pushing the drive to the cloud. Traditionally, email is what I, and I think a lot of people, first became oriented to the cloud through. The whole idea is that you can connect to a machine from anywhere. It got rid of the previous paradigm where you had to fill in a bunch of information: what server are you connecting to, on what port, what's your SMTP server? If you sent an attachment that was too big, it would get sent back. This all became outdated as people started using these services. The very short, quick list of the benefits is that it's easy, cheap, and powerful. And you can see, in a context like ours with the International Cancer Genome Consortium, where we have groups from all around the world sequencing data, analyzing data, and sharing data, that these data sets very quickly become very large, and it becomes very difficult to work with them, as I mentioned, on your own machine, let alone an academic computing centre. This is one of the reasons that people are turning to the cloud.
And similarly, it's not just the size of the data. This is a slide borrowed from a colleague at OICR. If you've got these quite large sets of genomic data in different parts of the world, the goal is to do something similar here: to have a kind of harmonized portal, this is through the ICGC ARGO project, including also clinical data, so that in the same way, from wherever you are on your laptop, you can connect to it without having to spend hours, weeks, or months downloading all this data to a local machine over very slow links. And then also to have the analytic tools and pipelines you need, many of which are already pre-built for you, and you can combine them as you like. Here we're getting more into the ethical and legal side of things: data access and compliance approvals, because often, to have access to different data sets, you need to go through some kind of process to get approved as a bona fide researcher, et cetera. With your authorizations in place, the idea is to make it simple to run a specific analytic pipeline across all the donors of a specific kind in the different clouds, and to do this conveniently from wherever you are. So this is where things are headed, where we're trying to get to.

This is another slide that tries to show the benefits of the cloud over academic computing centres. It came out in an article a few years ago making the case that the cloud is needed now. There are still a lot of people working in academic data centres, obviously. In practice, I don't know if people have had experience, but my sense is that things are still a bit more even, although the trend seems to be towards the cloud.

The Collaboratory infrastructure that we're working with this week is built on OpenStack, an open-source cloud infrastructure that lets us do things like fire up virtual machines. The basic idea of a virtual machine, whether people are familiar with it already or not, is that you can fire up, say, an instance of a certain operating system, say Ubuntu Linux, and tell it: I want to run on so-and-so many processors and have so many gigabytes or terabytes of storage space. It can be created basically instantly and remotely from wherever you are, without having to have the actual hardware near you. So a lot of what we'll be doing is firing these up. There's a web interface to OpenStack that's been customized a bit by the Collaboratory people. You'll be able to fire up a virtual machine in the cloud and then, through basically a command-line or other interface, connect to it securely, so that no one else can listen in and no one can intrude on what we're doing; there's a small sketch below of what that connection step can look like.
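To make that concrete, here is a minimal sketch of connecting to a freshly launched VM over SSH from a script. This is illustrative only, not the Collaboratory's actual setup: the IP address, username, and key path are hypothetical placeholders, and it assumes the paramiko Python library as one common way to script SSH.

```python
# Minimal sketch: SSH into a cloud VM with key-based authentication.
# Assumes paramiko is installed (pip install paramiko); the IP, username,
# and key file below are hypothetical placeholders, not real values.
import os
import paramiko

client = paramiko.SSHClient()
# In real use you should verify the host key against a known value;
# auto-adding it (as here) is a shortcut that weakens security.
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(
    hostname="203.0.113.10",   # the VM's address from the OpenStack dashboard
    username="ubuntu",         # the default user on Ubuntu cloud images
    key_filename=os.path.expanduser("~/.ssh/collab_key.pem"),  # your private key
)
stdin, stdout, stderr = client.exec_command("uname -a")  # run a remote command
print(stdout.read().decode())
client.close()
```

The point of the key pair is the same as what we'll do interactively with the `ssh` command: the private key stays with you, the public key is installed on the VM, and the session is encrypted end to end so no one in between can read it.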
So I'll move into the second aspect, which starts to raise some of the privacy, ethical, and legal issues. In the broader context, what we've been seeing in recent years, especially in healthcare, and if you've been following the news at all, information security is probably something you're aware of, is wave after wave of what seem to be tectonic events. Everything from ransomware, which has been growing over the last few years: the idea is malicious software that gets implanted onto your computer, encrypts your entire machine so you can't read anything, and then demands a ransom from you, usually in Bitcoin, and you won't be able to get any of your data back until you pay. Large institutions have fallen prey to this. But not only that: with wireless networking, there was a huge WPA2 vulnerability disclosed six months ago; there was Spectre; wave after wave of large breaches and large vulnerabilities disclosed. Healthcare has seen a huge number of breaches, which obviously brings this closer to home. On the research side, there have been breaches and problems, but to a much smaller degree; there have been far fewer high-profile breaches so far. But my suspicion is that as there starts to be more and more integration between research and clinical data, as we're seeing, and as people need to rely on clinical data to figure out the significance of different variants, there's at least the risk that things could move more in this direction.

And so this slide, I don't want this to feel like we're in a high school class and I'm a police officer talking to you about the dangers of drugs, et cetera. But in the U.S. there's a law called HIPAA; I'm not sure if people are familiar with it or have heard of it before. It's the Health Insurance Portability and Accountability Act. Different countries in the world have really different approaches to privacy law. This is one of the big important ones in the U.S., and one of the notable things about it is that it's health specific. I should back up: it's not actually a privacy law, as you can tell by the name, which talks about health insurance and portability rather than privacy. But a few years after it came into effect, an aspect was added called the Privacy Rule, which is a pretty comprehensive framework for dealing with the privacy of health information. Explicitly, it's supposed to deal with clinical data, but when that clinical data gets shared with researchers, it is still subject to HIPAA. It's enforced by the Office for Civil Rights in the U.S. Coming out of some of the breaches, there have been some big enforcement actions: there was recently a settlement of $5.5 million that an organization called Advocate Health Care in Illinois had to pay in 2016, related to earlier data breaches that had compromised four million people's records. So you can see that the more people's records we're storing, not only is there a bigger risk in terms of the number of people who could be harmed through a breach, but there also becomes more incentive for malicious actors, who will say: this is a large, comprehensive, enriched database, that gives me much more incentive to try to get in. We'll be talking more about HIPAA and its specifics a bit later on; this is just an introduction.

In the European Union, I don't know if people have heard of the General Data Protection Regulation; I see a bit more nodding than with HIPAA. It comes fully into force this May 25th, so in a few months, and it's quite different from HIPAA. It deals with all kinds of privacy and personal data in Europe, so it's not health specific in any way.
It's not specific to the public sector, private sector, or health sector, but it's highly influential. Europe is seen as the driving force behind a lot of privacy and data protection in the world. For example, for Canadians here: probably the most famous privacy law in Canada is called PIPEDA. It's a federal private-sector law; if you've heard of a Canadian privacy law, it might be that one. It was adopted in the late 90s, basically, and people don't like to admit this I think, in response to the precursor to the European regulation: oh, this new thing is coming into force in Europe, they're only going to allow international transfers of data to countries that have significant, strong data protection, so we're going to enact this.

The new regulation that's coming into force is going to raise the potential penalties a lot too. Again, this is not health specific, but the fines are going to be up to 20 million euros or 4% of a company's annual worldwide turnover, whichever is greater, which is a steep penalty, seen as larger than anything we've seen before. And there are cases in which EU data protection laws carry penalties of imprisonment. There was one case, I forget which company it was, one of the big tech companies, it wasn't Google, I don't want to name the wrong company, but some of its Italian officers, they didn't end up serving it, it was resolved on appeal or something like that, but at one point there was a verdict saying these people should be penalized with imprisonment for a data protection violation. (Audience: Was it Facebook?) I don't think it was Facebook either; I should look it up. We'll talk about it a bit more, but there has also been some litigation around Facebook in Europe.

This is another headline, and the subtext is that insiders are mostly to blame. Just to say: what we're talking about when we talk about breaches isn't only a hacker sitting in some shadowy place. When we're counting the full number, breaches also include things like an employee, accidentally or maliciously, doing something wrong with data. You sometimes see things, often coming out of California it seems, where someone in a hospital really wants to look at some famous person's health records, goes and looks them up, and gets in trouble. So it's partly things like this, which are not necessarily very high tech. At the same time, if you're designing a data protection or privacy system well, you probably shouldn't grant everyone in your health system access to anyone's records. There are ways to design these things to minimize the impact of whatever harms can take place.

Coming specifically to the cloud: this is not that recent anymore, I think it's from a year or two ago, but there are new attacks coming out every day. This was a pretty wild one, and it was very specific to the kinds of things we're talking about. It was what they were calling a rowhammer attack. And it's so unforeseeable, the way these things work.
This was an attack where, if you were running in one of these shared cloud environments, it turned out that if you flipped bits in memory on and off really quickly, between ones and zeros, you could have an effect on some other client's memory, and there were ways to use that to get into their stuff. People were very surprised when it came out. It just goes to show that, as things are developing so quickly, it's good to remember the general principles, to do your best to keep things secure and up to date, but it's hard to foresee everything that's happening, and to remember this.

So, back to the slide that was initially about email. What you see, as I was saying before, is the ease, the power, and the convenience. But there's also stuff happening that isn't necessarily visible if you're only seeing the convenience. When you have emails in the cloud, for example, they aren't just sitting there. They're actually being looked at and analyzed, not necessarily just by you. We often say that if you're receiving a product for free, it might actually mean that you're not receiving it for free; it might mean that you are the product and you're actually being sold, in this case to advertisers. So your emails are being combed through to target you with advertising, which some people may be comfortable with and others might not be. Of course, there's other stuff going on here too. As we learned in 2013, there are state agencies who are quite interested in everything that's happening in the cloud. And not only that group of state agencies: there's also a consortium called Five Eyes that includes the Canadian intelligence authorities, as well as those in the UK, Australia, and New Zealand.

What's also going on is that this journey is happening across borders, which adds difficulties. Say something happens with your data that you're not comfortable with, and say you're in Canada and it happens in the United States. Can you take legal action? Do you have a legal right to do so? Against whom, and how is it practical? This is just to show some of the issues. There are additional ones I've raised here too, like the subcontracting that often happens: if there's a cloud company that can't do everything it needs to do on its own, it might pay someone else to do it. If those people mess something up, how are they responsible? And again, we talked about hackers coming in. So there's a lot more happening in the cloud than you might first see. These aren't exactly the models we're working with, and it's what we're trying to get away from, I think, with resources like the Collaboratory. But I think it's important to keep this whole backdrop in mind.

So, along with the ease, the low cost, and the power, some of the fears people have are opaqueness, dependence, and loss of control in general. When you're using your own laptop, in general you can do what you like with it, although increasingly there are some laws saying you can't do certain things even with your own laptop. Whereas if something's out there in the cloud, there are fears that it's suddenly out of your hands. Sometimes those fears are warranted, sometimes not.
It can also be difficult if you want to leave the provider you're on, if they're not providing enough portability to take what you're doing elsewhere. And you don't honestly know what's happening behind the scenes. This is just a citation to an article on this by a professor at McGill.

I talked about this a bit already, but one of the fears is that we're usually made subject to standard-form contracts. Unless you're a really big institution or consortium, if you're buying cloud services from Amazon, you're not going to individually negotiate your terms of service; they're imposed on you. Some of those offer certain privacy protections, but there are also standard terms that are not advantageous to the consumer. For example, it's fairly standard for the cloud service provider to say: we are allowed to change the terms of this agreement without your consent. Sometimes they'll notify you of the changes, but it's good to be aware of these kinds of things. Stability and dependence can be a concern. There have been some mid-sized cloud providers that have quickly shut their doors overnight, leaving people in a bit of a lurch: well, I have all this data in there, I don't actually have the capacity to store it myself, that's why I put it in the cloud, and how am I supposed to get it out of there in time if you're closing next month? That's probably not as much of a concern if you're going with one of the big cloud providers, but similar problems can come up if suddenly the terms of service change and you don't want to be there anymore. Something to be well aware of. Similarly, the contracts will set out who is liable if there's a data breach or something else goes wrong, and it's good to be aware of who will take responsibility for what.

I mentioned third-party surveillance before. A lot of the time, the surveillance laws don't have many exceptions. It's not as though you can make a contract saying there will be no third-party surveillance of this information. But there are some mechanisms that can at least try to minimize it. For example, there's actually a big case before the US Supreme Court right now between Microsoft and the US government, I think it's the FBI. The dispute is that Microsoft has a subsidiary in Ireland that's holding a certain amount of data, and there's a subpoena saying that Microsoft in the US should have to retrieve that data from Ireland and give it to the state. Microsoft is saying: no, you can't control things that are outside the jurisdictional territory of the US. There's been a huge amount of litigation and fighting over it. I brought that up to mention that you can try to contract with, or work with, organizations that will explicitly state that they'll take whatever legal measures are necessary to protect information from widespread surveillance, rather than ones that will just hand it over right away, although often this is happening in secret, so it's difficult to tell.

Getting back to the specific context: that was more general data protection stuff, but now let's come back to the health and genomics context. One of the things that's difficult here is that, with some exceptions we'll talk about, most privacy and data protection law is not genomic specific, or even health specific, at all.
I talked about HIPAA before as an exception. So it's not always clear how the laws are going to apply in these different contexts, and there's actually a pretty big debate about whether they should be specific. Some people fiercely say no: genomic data is not actually that different from any other health data, or maybe any data at all, and it should just be governed by the same rules. Other people say no, there's something specific about this that merits a certain type of protection.

To talk about some of the things that could be specific to genomics: one thing that comes up a lot is the possibility of discrimination. I'm not sure if people are familiar with this idea. There was a law passed in the US called GINA, the Genetic Information Nondiscrimination Act. The idea is that there are people interested in using people's genomic data to make certain decisions about their lives. It could be any number of different actors, but the ones we usually end up talking about are insurers and employers. There was a case, I think just before GINA came into force in the US, where, I'm pretty sure it was an NFL player, who didn't have his contract renewed because the team had access to his genetic profile, decided he was at risk of serious disease fairly soon, and did not want to extend the contract. Canada, very recently, was actually one of the last developed nations to adopt a Genetic Non-Discrimination Act, a year or two ago, but it's of limited scope; it only applies to certain federal and public-sector contexts. The idea in these laws is usually things like: insurers can't force you to have a genetic test against your will, and if they somehow get your data through other means, they can't use it against you. Similarly with employers.

But there's a certain amount of controversy about this too, and it's interesting because it becomes a question of values in some ways. A lot of people say, well, this is kind of ridiculous: all that insurers do is collect health data to develop risk profiles, to figure out how much a person might or might not cost, so how is genetic information any different from any other type of information? And of course there's an opposing view that says no, this is going in a really bad direction if we start allowing people not to be hired for jobs, or not to have access to adequate health care, because of their genetic profile. Generally the trend is toward these protections. The reason I bring this up here is that it's one of the things that can impact even something like an informed consent statement in a research project, because normally the idea of informed consent is that you have to make people aware of all the risks they might be subject to. So it's possible that you might have to make them aware of the possibility of discrimination if their genetic data is leaked, which might be mitigated if a framework like this is in place.

So that was the first thing. Then, in general, there's the idea that health information is thought to be sensitive, especially medical information. Sometimes the information in these databases speaks directly to a person's susceptibility to disease, the inferences you can make based on their DNA. But there are also indirect things you can learn.
If you have someone's DNA and you can find out that they participated in a study, and you know that the study was only studying people with a certain disease, you can infer that the person had that disease, which is something that's happened in some of the attacks I'll talk about a bit later on. Paternity information is one that comes up sometimes. I'm not sure how much this has resulted from breaches, but it comes up often in genetic ethics: if people get genetic tests and discover that one of their biological parents is not who they thought, that can have a considerable impact on them. There are possibilities of identity theft; I'm not sure how practical these are yet, but people are increasingly turning to biometric identifiers, so biometrics obviously become sensitive in a different way. As well as some other legal issues.

Another thing I want to get to, along with the vulnerabilities we've talked about so far, is that some things are specific to DNA. The field is moving so fast that we're not actually sure what kinds of issues might come up in the future, and there's an argument that these risks are exacerbated by the fact that DNA is static, immutable, in a way that some other data is not. For example, people might have heard of the big breach at Equifax, the big credit reporting agency that had a really terrible data breach where people's data was leaked. If your credit card company were breached and your credit card information leaked, that could be a big problem; you could have issues around identity theft and so on. But the difference between there and here is that if my financial information is breached, if my credit card is compromised, I can close my credit card, get a new number, and hopefully move on, even if I've lost some money. It's trickier with your genetic information. If your DNA is breached and it's out there, you can't close the account and change your base pairs. You're kind of stuck with it, right?
There are a whole bunch of, I guess, clichéd metaphors about Pandora's boxes and un-ringing bells that you could use, but that's the basic idea. Another set of harms, of course, is those to researchers, which are often going to be reputational. If your project is one of the ones that's breached, it can create problems for the field as a whole; it can also cause problems for getting future funding. There are the possibilities of fines, although in practice, in research, I think the consequences tend to be more reputational and career oriented. I haven't heard of a case where a researcher has gotten a huge fine or been sent to jail for a data protection breach, but it's something that's possible under laws coming into effect. Oh, here I've actually got a citation to that Google case: it was Google Italy, so I was wrong in saying I thought it wasn't. The jail sentence case was Google Italy versus, I think that's the Italian data protection office.

Like I mentioned before, some of these novel risks complicate giving informed consent to participants. Say you want to make someone aware of even something like using data in the cloud. If your data is being stored in the cloud by Amazon, and say you're in Europe or somewhere else, there's a possibility that intelligence agencies are going to suck it up in one of their big collection drives, but we're not actually sure what the consequences of that are. Should you communicate that in an informed consent form, and if so, how? The same goes for all these novel genetic issues I've been talking about. There's also the risk that if regulation gets overly cumbersome, medical research can be stalled, especially if the regulation is disproportionate and not actually geared to the possible effects.

So, shifting gears and moving in the opposite direction, I want to talk now, and I think it was touched on a bit in the previous presentation, about a different set of rules that push the opposite way. Since the start, open data has been a huge part of genomics. In my previous computer science days I was, I mean, at least very interested in, I don't know that I played a big role, but interested in open source software projects, contributing to them and using them, and those kinds of values have been part of genomics since the start. Pretty much right away it was recognized that, because initially the cost of sequencing even one genome was so great, and now, even as the cost comes down, we're sequencing so many more genomes that there's still a lot of money going into it, this resource shouldn't be hoarded, as some people put it. It should be shared; it should be used to maximum effect. And there's also the fact that a lot of the R&D money that went into this was public money, state funding, which is an added reason why this data shouldn't be misused or left unused. Over time, and actually very quickly, this became part of the policy principles that have guided the field ever since. In 1996, the Bermuda Principles were a pretty landmark example. They already said that researchers should release any sequence assemblies larger than a kilobase, preferably within 24 hours; rapid release is really a key principle. They also asked researchers
to waive their intellectual property rights, so that sequences are freely available and people can get maximum benefit out of them. The idea was also that finished, annotated sequences should be published immediately. What they were pushing back against at that point, I think, was mostly fears of scholarly competition: the idea that, I've done all this work to produce this result, I don't want someone scooping me, I don't want someone publishing my findings before I've had the chance to, when I'm the one who got the grant and spent the money. One of the mechanisms that's been drawn on, and I'm sure people are aware of this if you're all in health fields, is the publication embargo: even if the data is shared, no one else is allowed to publish on it until the primary researcher has had the first chance to do so. Embargoes seem to be falling a bit out of favour recently; I think they weren't in the most recent genomic data sharing policy from the NIH that I mentioned here, or at least they're much less heavily emphasized. But at that time, that was one of the main factors competing with the push toward open data and open sharing. (Audience: The other thing about the Bermuda agreement is that, at the same time, Celera was starting to come onto the scene, so there was also that competition between the private sector and the public effort.) That makes sense, thanks.

I also reference here a similar, more recent Canadian example. Tri-agency here refers to the three big Canadian funders, covering health research, natural sciences, and social sciences, which have their own open access policy. This one is more focused on publications, but there are also data sharing policies that exist in Canada.

So, the overarching principles we're seeing embodied here: I mentioned rapid release of data; as mentioned in the previous presentation, it's often mandatory to deposit data in approved open access repositories so people can have access to it; increasingly, there's a desire to see results published in open access journals; and I mentioned embargoes already. One issue I'm not touching on much, and probably should touch on more, is that there's also a bunch of discussion over the degree to which intellectual property rights are legitimate over genomic sequences. People may have heard of the U.S.
Supreme Court case of Myriad, where the Court found that at least naturally occurring genomic sequences weren't subject to patents, and there's been a push in that direction as well. Again, the main potential sanctions here are the risk of losing funding or reputation, which is already pretty significant. We'll get to this a bit later, but I'm not sure a hefty jail sentence is needed, although in certain contexts there are people pushing in that direction, which I think is wrong-headed. Despite all this, there still seems to be, and this came out recently, a publication reminding people to deposit their DNA sequences, because it's still not happening across the board, for whatever reason: whether people are worried about competition, or just haven't gotten around to publishing their sequences, or find it too cumbersome, et cetera.

But I think it's important to contrast these two pushes, because they put researchers, and people in your chairs, in a somewhat awkward position, with really competing policy and legal obligations: the open data push is telling you to share everything as much as you can, and some of the privacy restrictions are saying you should protect things to a certain degree. It almost feels like it's coming to a stage where you have to share everything except what you're prohibited from sharing. It becomes a bit trickier; there's less discretion available.

(Audience: Can I share my point of view on that? With DNA sequences, there is so much to discover that the scientific community won't even have time to go look at it all, and so obviously there are knowledge and discoveries to be made in aggregating large data sets and having access to very, very large data sets. And in the ICGC, the International Cancer Genome Consortium, there was no IP on the DNA sequence itself. Of course, people and companies and whoever can have IP downstream, with drugs and so forth, so it's not against people making money; it's against people making money on the knowledge of the data, which should be shared by all, so all can make discoveries and we can get more eyeballs on the data. The important thing is that there's so much data and so much that we don't know yet; we need more people with access to look at it and make discoveries, and if we prevent that from happening, we're shooting ourselves in the foot.)

That's my understanding of the compromise, correct me if I'm wrong, that ICGC has tried to come to around intellectual property in its policies: in general, if you're a company or someone accessing the information it's storing and producing, you shouldn't be trying to patent that data or place intellectual property on the data itself. (Audience: But if, because of the sequence information, you're able to find a target for which you can make a drug, you can take out all the IP on that you like, just not on the protein or DNA sequences.)
Yeah, and my understanding is that the underlying principle is really to get at this idea: avoid intellectual property where possible, to enable the most sharing. But it's been argued that certain products just won't get to market unless a company has enough interest, in the form of a certain amount of intellectual property, in the final leg of it, and so the door has been left open for that in the policy.

So, now moving into the third part of this four-part presentation, which raises some of the privacy and data protection issues that come up. The first one I'm going to talk about is the cross-border issue, and the way it's playing out in Europe, because I think that's the key context. This is the precursor to that data protection regulation that's coming in soon: the precursor came out in 1995 and was called the Data Protection Directive. The new successor was published in the law books in 2016 but is only coming into force this May. They're similar, although there are some legal differences. For example, a regulation is directly enforceable throughout the entire European Union, whereas with a directive, each country had to adopt a version of it into its own law. But it's a similar, amped-up version.

The idea, right from the start, is that the law isn't only meant to prevent certain uses of data; part of its subtitle is about promoting the free movement of data. The thinking was: if we have harmonized privacy and data protection rules around the Union, then we can allow data to flow pretty freely through the European Union and be assured that it will be given the same protection everywhere. The immediate question is: what if the data goes outside the European Union? What if we have an ICGC member that wants to send data outside? How does that work? There's a mechanism in the law whereby the European Commission can look at another country's legal instrument and decide that it provides adequate protection, and the country can be approved that way. As I mentioned, Canada's PIPEDA was adopted pretty soon after and applied; it got deemed adequate. In the US, they took a fairly unique self-regulatory approach: there wasn't a law that they adopted, but the Department of Commerce put together a policy under which a company could self-certify, saying, we're going to follow these published principles and we'll make public claims saying as much, which means the FTC can take enforcement action against us. It tries to reach a similar level of protection, and it was also approved. PIPEDA was approved probably around 2000; I think the Safe Harbor was around 2002. But as you may be aware, 2000 and 2002 are, not prehistoric, but old in internet years.

So then this guy, his name is Max Schrems. Now we're moving into the post-Edward-Snowden era; obviously things really changed when Snowden revealed that widespread surveillance was happening. This young law student named Max Schrems decided this was a problem, and he ended up starting a court challenge based on his Facebook data. He said: look, my Facebook data is being shared across the Atlantic. Facebook in Europe is headquartered in Ireland.
He was Austrian, but he started this case, Schrems versus the Data Protection Authority of Ireland, and it climbed up through the courts all the way to the European Court of Justice. He was essentially challenging the adequacy decision, not the agreement itself, but the fact that it had this kind of green check mark. The European Court of Justice ended up, oh, sorry, I'm getting ahead of myself. While this case was ongoing, Quebec, the province in Canada that I come from, which has its own private-sector privacy law, started making moves to try to get adequacy status. Again, this is post-Snowden. People were surprised when the regulators signalled that it was not going to pass, and Quebec ended up abandoning the attempt. Some people criticized this, because Quebec's law was widely seen to be stronger than the federal PIPEDA law and to offer better protections, and some have argued it shows the regulators are just totally incoherent. My thinking is a bit different: things really did change after 2012-2013 with Snowden, and they started looking at things a lot more strictly.

After this is when the Schrems case finally came down, in 2015, and the Safe Harbor agreement was struck down as no longer adequate, which really threw things into confusion, because, and this is different from the Quebec case, there were already companies relying on it, right? And then suddenly, overnight: okay, this isn't deemed legal anymore, what do we do? They ended up quickly putting in place a successor mechanism called the Privacy Shield, which is not that different from Safe Harbor, and it's been subject to court challenges of its own; so far it has survived, but it was deemed adequate right away. There are ongoing challenges, and the European Commission has said it needs to be renegotiated, pointing to a number of weaknesses. One of the problems with Safe Harbor in particular was that companies were self-certifying that they were following these principles, but it turned out some weren't actually doing anything about it in their daily practice, which was obviously a problem. But the main things the court focused on, I think, were the bulk state surveillance issues, among others.

I'm actually putting a question mark beside the Canadian law, PIPEDA, here too, because under the directive there was no mechanism to review adequacy: once a law was deemed adequate, it was adequate forever. Under the new regulation, adequacy decisions are actually going to be reviewed every, I think, three years; I forget the exact period. And a number of people are asking: well, Quebec's law was seen to be stronger than PIPEDA and didn't pass, so is PIPEDA going to withstand the scrutiny? Actually, a couple of weeks ago, a bunch of amendments were proposed to PIPEDA to hopefully bring it more in line. All this to say that when personal data crosses borders, it's definitely an ethical issue that's being taken seriously now, and there's a lot of confusion around it. It's good to know on what legal basis you are sharing data across borders. I'm glossing over these adequacy decisions quickly, but they're not the only mechanism allowed; it doesn't mean that if you're in a country without an adequacy decision, you can't have Europeans' data ever.
There are other alternative mechanisms, but it's good to be aware of some of the controversies playing out now.

Another, separate issue that you may or may not have encountered. I'm honing in on data protection and privacy law here, but it applies in other contexts too, like research ethics and medical informed consent law. One approach that traditionally was seen as the way to address these issues and get the best of both worlds comes from the way the laws are defined. I'm using a very US-based term here, personally identifying information; whenever you see that term, you know it's a US person talking. Europeans would call it personal data; in Canada we tend to call it personal information. The idea is that these laws are defined so that they only regulate personal information. Other kinds of information that aren't personal, meaning they don't relate to a person, are not regulated at all, and you can do what you like with them. If that's not clear, maybe the simplest way to think of it is how it was imagined back in the day: say you have someone's hospital record with a bunch of information, some of which is interesting to you, some of which is not. You remove their name, their phone number, their patient ID, anything that can identify them, and what's left could be seen as not being personal information; it's just information that could be about anyone, but it's still helpful because it has some value. So, implicit in these laws, there's another category of non-identifying information, what we sometimes call anonymized or anonymous information, that's not regulated. The idea was that this was the way you could satisfy the push for open data and also have perfect privacy, because no one's at risk, but still use the data.

The first question to ask, though, is: how do we decide what is personal or identifying information and what is not? In the law there are two different approaches. There's a contextual approach, which is basically an ad hoc, case-by-case assessment where you look at the data and say: well, this person's zip code was included, but only the first digit or two, so you could probably never figure out who they are; it would refer to thousands of people. The legal terminology usually used is whether it's reasonably foreseeable that the data, alone or in combination with other data, could be used to re-identify the individual. Less commonly, you'll see a rules-based approach, and we'll look specifically at that law HIPAA that I talked about before, because it's pretty exceptional: it actually has an almost algorithmic process you can go through, where the law says, if you follow all of these steps, we will consider that your data has been de-identified and you're not subject to any of our regulations anymore. They actually have a list of 18 fields that you need to remove to satisfy this.
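Just to illustrate the flavor of that rules-based, checklist-style approach, here's a toy sketch in Python. The field names are invented for illustration, it covers only a handful of the 18 categories, and it is in no way a compliance tool.

```python
# Toy sketch of a HIPAA-Safe-Harbor-style, rules-based de-identification
# step. Field names are hypothetical; the real rule lists 18 categories
# of identifiers, only a few of which are shown here.
IDENTIFIER_FIELDS = {
    "name", "street_address", "phone", "fax", "email",
    "ssn", "medical_record_number", "health_plan_number",
    "account_number", "biometric_identifiers", "full_face_photo",
}

def deidentify(record):
    # Drop the listed identifier fields outright.
    out = {k: v for k, v in record.items() if k not in IDENTIFIER_FIELDS}
    # Dates: the rule keeps only the year; birth_date here is "YYYY-MM-DD".
    if "birth_date" in out:
        out["birth_date"] = out["birth_date"][:4]
    # ZIP codes: only the first three digits may be kept (and even those
    # must be dropped for sparsely populated areas).
    if "zip" in out:
        out["zip"] = out["zip"][:3]
    return out

record = {"name": "A. Patient", "zip": "60607", "birth_date": "1970-06-15",
          "phone": "555-0100", "diagnosis": "type 2 diabetes"}
print(deidentify(record))
# {'zip': '606', 'birth_date': '1970', 'diagnosis': 'type 2 diabetes'}
```

Notice that nothing in a checklist like this looks at the diagnosis field, or at DNA, which is exactly the gap discussed next.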
You can see the advantages and drawbacks, I think, to both. Obviously, if you're trying to comply with the law, it's much nicer to have a defined algorithm: go through these steps and you're done, and you're positive that you're complying, whereas with the contextual approach, who knows what is compliant and what is not. The downside is that the fields you need to remove don't necessarily correspond to what will actually anonymize something. It's very possible to remove fields that you didn't need to remove, fields you might have wanted to use, where the data would still have been anonymized even if you'd kept them in. And it's also quite possible to remove all 18 fields and still have a data set that's highly identifiable. So, as my last point here says, it's an even less appealing approach in the big data era.

But just to look at what the HIPAA Privacy Rule requires. It's called the Safe Harbor, but it's a different safe harbor from that international data sharing agreement I was talking about before; they're totally separate things that for some reason got the same name. They say you've got to remove names; any geographic subdivisions smaller than a state, except that you can keep part of a zip code; any dates; any telephone numbers; any fax numbers, so you can tell it's not a brand-new law; electronic mail addresses; a whole bunch of different things. And then at the end, I don't know if I've included it here, I don't see it, there's an additional catch-all: any other identifying code or number, et cetera. You can also see item P here: biometric identifiers, including finger and voice prints. The next question I have is: what about DNA? You don't see DNA listed anywhere. If you have, say, a single nucleotide variant, is that anonymized or not? If you have a whole genome sequence, can that be considered anonymized or not? Interestingly, we don't really have an answer. I've known some people who've tried to specifically ask the US regulators whether DNA is covered by some of the vaguer terms here, and so far they've refused to answer, so we don't necessarily know. (Audience: And the Homer paper?) Yeah, no problem, I will get to it soon.

So, there are some people who, until recently, and I probably shouldn't pick on anyone, science moves on, thought that anonymization could reconcile all the problems we're having. It went far enough that people were arguing fairly recently that DNA, even, seemingly, whole genome sequences, could be seen as de-identified: these are just strings of letters; how are you going to link them back to someone's name? How would that be possible? There's a paper from as recently as 2007 that said, if you're storing DNA without identity data and there's no clue from whom the sample originates, such a sample is de facto anonymous, and they define anonymous DNA as a sample stored without data or a code. To be clear, I don't think this is true now, and I don't think it was even true then; I'll get to that in more detail later.

First, though, I put in a slide that I think ties in here, which goes further into the idea that there's a tension between open data and privacy. If you look at the definition of open data, the way it's defined, they use phrases like: the data must be provided as a whole; it must be
free to use; there can't be legal restrictions on its use; it has to be usable for any purpose; and it has to be freely redistributable. These are pretty much the mirror image of all the techniques we have for privacy protection. The things we do when we're trying to protect data are: we don't provide the whole thing, for example by anonymizing it; we place restrictions on it, so you can't share it with certain people, meaning it's not free to use; or we say you can only use it for specific purposes and you can't redistribute it. So I think we're still looking for ways to reconcile these two things, but there's a really core tension there.

Now I'll get to the fall from grace that we've seen for anonymization over the last few years to decades, though it started pretty early on. There were some famous examples. For instance, Netflix at one point decided it wanted a better algorithm for suggesting which movie you might want to watch next. So they decided to release anonymized viewing histories of, I don't know if it was all, but a large number of Netflix users: take off the person's name and account number, and people can feel free to play around with the data. Pretty quickly, some researchers said: oh, we're able to re-identify a whole bunch of people just based on their movie-watching habits, which maybe seems surprising. You could see this as kind of a hack, and not a hack in the derogatory sense, a real trick: what they did was say, well, there are some people who have a tendency to watch a Netflix movie and then right away review it on, say, IMDb or Rotten Tomatoes. If we compare the timestamps of when they watched movies with when they rated them, and we see matches all the way down, we can figure out who's who. You'd have to say, okay, this person would have to be one of those people who rates movies, but increasingly we've seen more and more clever ways, ways you'd never initially think of, to re-identify data.

There's one famous example, quite old now, with de-identified medical data, on this almost-Venn-diagram slide you see here. They removed names and so on, but they left zip code, birth date, and sex, and it turns out that in the US, something like 87% of people are uniquely identified by their zip code, birth date, and sex. It also turns out those three things are included in voter lists, and you can go down and get a copy of the voter list. So this data that had been anonymized could be re-identified. I mentioned the Netflix one; there was another case where AOL, for similar reasons, published people's search histories, the search requests people were sending, without their names, calling it anonymized, without recognizing that people have a tendency to search for their own name, which would of course be included in what was released. So the consensus that's emerged, and this might be slightly too strong, but it's close, is the idea that data can't be fully anonymized and still remain useful.
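The zip code, birth date, and sex example is worth making concrete, since the mechanics are so simple. Here is a toy sketch of that kind of linkage attack, with invented data: join the "anonymized" medical rows to a public list on the shared quasi-identifiers.

```python
# Toy linkage attack in the style of the voter-list re-identification:
# joining "de-identified" medical records to a public list on shared
# quasi-identifiers. All data here are invented.
medical = [  # names removed, so nominally "anonymized"
    {"zip": "02138", "dob": "1945-07-01", "sex": "F", "dx": "hypertension"},
    {"zip": "60607", "dob": "1980-03-12", "sex": "M", "dx": "melanoma"},
]
voter_list = [  # public record that still carries names
    {"zip": "02138", "dob": "1945-07-01", "sex": "F", "name": "J. Doe"},
    {"zip": "60607", "dob": "1992-11-30", "sex": "M", "name": "K. Roe"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")

for m in medical:
    matches = [v for v in voter_list
               if all(v[q] == m[q] for q in QUASI_IDENTIFIERS)]
    if len(matches) == 1:  # a unique match means re-identification
        print(f"{matches[0]['name']} -> {m['dx']}")
# prints: J. Doe -> hypertension
```

No field in the medical rows is an "identifier" on its own; it's the combination, joined against an outside data set, that does the damage.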
This next slide is kind of bridging the two topics. I'm bringing it up here again to talk about how unique DNA is, which bears on anonymity, but it also ties into what I was talking about before — the debate over whether genomics should be specifically regulated. This is a paper that came out, I think, a couple of years ago, where a set of researchers said: there's nothing totally specific to DNA, but here's a set of six features — some of them shared by other biometrics — that, taken as a constellation, give you some specificity as far as re-identification of genomic data goes. I'm listing a whole bunch of papers here — I guess I didn't list the actual publication details, but hopefully they're still findable — and over the last, say, ten to fifteen years, we've seen an increasing ability to re-identify data, or re-identify genomic data. Two studies, I think, are of particular importance. One was the Homer paper that Francis mentioned — and these were all seen as surprising re-identification methods at the time. Homer's study specifically involved the US's dbGaP, the database of genotypes and phenotypes — and correct me if I'm wrong, but yes, these were the results that had an impact on dbGaP. It was a GWAS study, so the output they ended up with was aggregate data: not any one person's data, but an aggregate result. But it was at such a level of granularity that Homer and his fellow researchers discovered that if you had someone's DNA already, you could figure out whether or not they had been in the study just from the aggregate. Previously, there had been a strong tendency to think that aggregate data is anonymized data — it doesn't relate to one person, it relates to hundreds or thousands of people. And — sorry, yes, the Homer paper was about case-control data, so they were able to see whether people belonged to the case group or the control group, to bin them into those two categories — exactly, so you could figure out whether someone had been diagnosed with the disease. So obviously this paper in particular sent shock waves through the field. This is why I was thinking of dbGaP: as I understand it, dbGaP had previously been fully open access, and it was right after the Homer paper came out that they said, no, we have to close this down and only allow certified researchers to have access, because we're realizing there's actually sensitive information coming out here. One other notable study I'll draw attention to — these papers follow an interesting trend — is the Gymrek paper from 2013, one of the first where, starting from DNA alone, researchers were actually able to go back and re-identify people, to figure out their names. If I remember correctly, they used a rather interesting and complicated route through Ancestry.com, exploiting the fact that family names tend to be shared along patrilineal bloodlines — I can't remember exactly how it worked. Some people had still been insisting that there was no way to go from a pure genetic sequence back to someone's name, and at least in some cases we now know that's not so. I think that's the one that used the cell line data, yes. And in more recent papers we're seeing things like — and I think since this 2015 paper they may have gotten even more granular — being able to, well, I'm not sure 're-identify' is the right word here, it might just be 'uniquely identify,' but in any case, identify or uniquely identify an individual based purely on 25 randomly selected single nucleotide polymorphisms from Wellcome Trust data. Which is very different from what people were saying very recently before — that an unlimited amount of DNA shouldn't be considered identifiable.
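To give a feel for how the Homer-style membership test I mentioned works, here's a heavily simplified sketch of the idea — not their actual statistic or data; all the frequencies are invented:

    # Simplified Homer-style membership inference. For each SNP, compare a
    # person's own allele frequency (0, 0.5, or 1) with the published study
    # aggregate (the "mix") and with a reference population. If the person is
    # consistently closer to the study aggregate than to the reference, that
    # is evidence they were in the study.
    def membership_score(person, study_freqs, ref_freqs):
        return sum(abs(y - pop) - abs(y - mix)
                   for y, mix, pop in zip(person, study_freqs, ref_freqs))

    person      = [1.0, 0.5, 0.0, 1.0, 0.5]   # the individual's genotypes
    study_freqs = [0.9, 0.4, 0.1, 0.8, 0.6]   # published aggregate result
    ref_freqs   = [0.5, 0.6, 0.4, 0.3, 0.2]   # reference population

    # A clearly positive total suggests membership in the study group.
    print(membership_score(person, study_freqs, ref_freqs))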
One of the other difficulties in this area of anonymization is that the terminology is really poorly standardized, especially in the regulatory context, which makes it unclear what people are talking about. This is a paper I co-authored on that topic, among others. So the point here is: try to be skeptical of anonymization. I shouldn't pan it excessively — it's always useful to reduce the identifiability of data when you don't need the extra detail, to minimize risk — but it's not the panacea we thought it was before. And when you are using it, especially in guidelines and policies, it's really helpful to always define the term you're using, because a lot of the key terms are used by different people in really different, contradictory, and confusing ways. Generally, when we say 'anonymization,' what we mean is that something has been irreversibly de-identified — or at least that we can't foresee it ever being re-identified. At the extreme, I think some genetic information could qualify: if you say X percent of the population has this variant, that's still genetic information, but unless you're talking about a very, very rare disease affecting one person or something, it's hard to imagine how it would relate to any one individual. 'De-identification' is sometimes used as a synonym for anonymization, but sometimes it means something much broader. EU law, for example, uses the word 'pseudonymization,' which basically means you add in some mechanism that preserves the possibility of re-identifying someone — like a code that you store elsewhere with their name — so that if you ever do need to go back, you can, while still distributing the data with reasonable assurances. Sometimes de-identification is taken to include that, sometimes not. The idea is just: if you're ever making recourse to these concepts, be very clear about which one you're using.
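To make the pseudonymization idea concrete, here's a minimal sketch of my own — invented names, not any particular system's implementation:

    # A minimal sketch of pseudonymization: replace the direct identifier
    # with a random code, and keep the code-to-name table separately (ideally
    # under different access controls), so re-identification stays possible
    # for whoever holds the key table. The record is invented.
    import secrets

    key_table = {}  # stored elsewhere, under separate access control

    def pseudonymize(record):
        code = secrets.token_hex(8)
        key_table[code] = record.pop("name")  # keep the link, but not in the data
        record["pseudonym"] = code
        return record

    shared = pseudonymize({"name": "Jane Doe", "variant": "BRCA1 c.68_69delAG"})
    print(shared)                          # safer to distribute
    print(key_table[shared["pseudonym"]])  # holders of the table can go back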
So, continuing the criticisms of anonymization and de-identification techniques, there are also some coming from opposite, or at least different, directions. Some people argue that there are cases in which we have to return results to a patient — say we find a clinically significant variant that we weren't looking for and didn't know was there, and it's highly dangerous and actionable: we should be able to re-identify that person, find out who they are, and alert them. Similarly — I guess this isn't strictly related to privacy law — there are usually duties to allow someone to withdraw from research, and if we don't know who they are, they might have a hard time withdrawing. And there are also people saying that anonymization sterilizes the data so much that you lose a great deal — it takes so much out that it might even encumber you if you want to do longitudinal studies: how are you supposed to follow someone over time if you don't know who they are? So other people are saying we shouldn't be doing it for those reasons. Basically, there was so much focus on anonymization before that, now that the confidence in it has crumbled, there's been a bit of a scramble to search for alternatives to take its place — what mechanisms are we going to use? Some people say we should find new technical solutions; others are looking for legal solutions; others organizational ones, et cetera. This is a map I found kind of interesting — it wasn't from the same paper as the flower-style exceptionalism diagram from before — in which one set of researchers argue about which parts of the genomic research process admit technical solutions to privacy and which don't, so that you have to rely on the law. I'm not sure I agree with how they categorize everything, but I think it's an interesting approach. Oh, I notice — am I going until 12? When is this over, 12:30? Okay, I may have to rush through some of what I've got, and this part in particular. There's an interesting next step that people are looking at — I'll cover it in a couple of sentences — and one that I strongly disagree with, though I think I'm an outlier. There's a certain set of, especially, law- and policy-makers who are saying: anonymization has failed, the technical solution doesn't work for us anymore, so we should turn to a legal solution instead. What we'll do is de-identify the data technically as well as we can, still release it to the public, and then attach a really severe criminal punishment to any attempt to re-identify it — and then it'll be just as though we had made it technically unidentifiable. I'm really opposed to this idea, and I'm surprised it's seeing as much success as it is. And so this is one example — not to pick on Australia; they're doing a similar thing in the UK — where public Medicare data was published online for researchers, and they actually did a pretty good job. I'll zoom through this quickly; the articles about it are pretty easy to find if you want. They included their methodology for de-identifying the data, which I think is good from an openness standpoint, and they relied on encryption. What ended up happening was that there were academics who were able to re-identify the data, and the government's response was essentially to say: we're going to adopt a criminal penalty if anyone tries to re-identify the data — it's too late now, it's already out there. The law still hasn't come into force, but it would make it a crime to intentionally re-identify data published by the government, and as I mentioned, we're seeing this in other countries, like the UK. I probably don't have time to talk about it more, but it's an area of interest of mine, so if anyone wants to discuss it, I'd be happy to. Some of the other, more technical novel approaches people are looking at are cryptographic ones. Homomorphic encryption has been really interesting, especially in the cloud context: it's a technology that would allow you not only to upload encrypted genomic data, or other data, to the cloud, but also to upload encrypted operations that the cloud service provider can't read but can perform on that data, returning an encrypted result. So it's kind of the ideal technology for the case where you have a cloud provider, operating across borders, that you don't trust: you want to hide the information from them, but you still want to be able to use their resources. Unfortunately, although there have been proofs of concept and small operations carried out with it, it hasn't yet scaled to a large size, though there's active development in the area.
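To give a flavor of computing on encrypted data, here's a sketch using the python-paillier library (installed as 'phe'). Paillier is only additively homomorphic — a long way short of the fully homomorphic schemes I'm describing — but it shows the core trick: the cloud adds up numbers it cannot read. The allele counts are invented.

    # pip install phe -- additively homomorphic Paillier encryption
    from phe import paillier

    public_key, private_key = paillier.generate_paillier_keypair()

    # Researcher side: encrypt per-sample allele counts before upload.
    counts = [2, 0, 1, 1, 2]
    encrypted = [public_key.encrypt(c) for c in counts]

    # Cloud side: sum the ciphertexts without ever seeing the plaintexts.
    encrypted_total = sum(encrypted[1:], encrypted[0])

    # Researcher side: decrypt the returned result locally.
    print(private_key.decrypt(encrypted_total))  # 6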
Differential privacy is another one people may have heard of. I don't want to say it's similar, but it's likewise a technical solution rather than a legal one: roughly, you add carefully calibrated statistical noise to aggregate query results, so that the published output barely changes whether or not any one individual's data is included.
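Here's a minimal sketch of the standard building block, the Laplace mechanism, applied to a toy counting query — the cohort is invented, and real deployments are considerably more careful than this:

    # The Laplace mechanism for a counting query. A count changes by at most
    # 1 when one person is added or removed (sensitivity 1), so noise drawn
    # from Laplace(0, 1/epsilon) gives epsilon-differential privacy.
    import random

    def dp_count(values, predicate, epsilon=0.5):
        true_count = sum(1 for v in values if predicate(v))
        # sign * exponential is one standard way to sample Laplace noise
        noise = random.choice([-1, 1]) * random.expovariate(epsilon)
        return true_count + noise

    cohort = ["TT", "CT", "TT", "CC", "CT", "TT"]  # invented genotypes
    print(dp_count(cohort, lambda g: g == "TT"))   # noisy count of TT carriers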
I was going to talk a bit more about some responses to surveillance. One I'll highlight, in the US, is an interesting NIH mechanism called certificates of confidentiality. These actually came out of research in the seventies: researchers wanted to study injection drug users, but the drug users were essentially saying, I'm not going to participate in your study if you're going to hand my data about drug use over to the police. So a new mechanism was adopted that allowed studies a certain level of immunity from surveillance and state-ordered disclosure, and it's been strengthened over time. It's worth looking into if providing additional confidentiality to the participants in your research is of interest to you. But the approach that's mostly been used in practice is moving from open access, as we mentioned, to controlled access. The idea with ICGC is that there's still some open-access data — one data set that's open to everyone — and another set of data that you can only access with the approval of the Data Access Compliance Office. The general dividing line is: if we think something is anonymous, we can make it open, or perhaps if we think it's fairly anonymous and also not very sensitive — there's a kind of calculus done around that. For the ICGC project in particular, through their online web data portal — I think you'll be seeing this more; it will probably be what we're talking about a lot this week — you can actually look through all the data they have, it's all indexed there, and you can build up a manifest of what you want. If what you want is in the controlled tier rather than the open tier, you have to go through their online form to ask for approval. I'm showing you some of the details here — there are some forms you have to complete, you have to create an account — and sorry, I'm flipping through this a bit quickly, but you're going to go through it in more depth. So maybe two quick things about it. One is that there's an undertaking people have to give: a promise not to try to re-identify the data. There's no criminal sanction attached to trying, but you're still expected not to try to re-identify ICGC data. You do also have to be affiliated with an institution and have it sign off as well. Ideally there could be a way for citizen scientists and others who don't have an affiliation to gain access, but legally there hasn't been a way to make that work yet. And just to raise a bit of what you end up agreeing to through this — I'll go through it quickly too, since Francis will be talking about it more — you see a lot of the things we've been discussing. There are guidelines about how best to use the technology; there's what your academic institution or sign-off authority has to agree to; and they incorporate some of the data-sharing principles, including ones I didn't talk about so much, that come out of the Bermuda principles we mentioned, the Fort Lauderdale principles of 2003, and the Toronto principles of 2009. So you do agree to a certain amount of open data. As far as intellectual property goes, it's what we mentioned before. They incorporate some best practices and other guidelines from elsewhere, and I've also got here some background information on the experience of the Data Access Compliance Office, which actually operates out of my office at McGill. So, quickly, at the end here, let me go through some aspects of more practical virtual machine security and usage best practices. Security is obviously separate from privacy, but related, and I'm going to be talking about VMs here. As I mentioned at the outset, the way this works technically is that you fire up a virtual machine and then connect to it — the idea being through an encrypted channel, so that no one else can listen in or gain access. It's possible to connect by SSH using the standard username-and-password approach, but it's good to avoid that when possible, for a number of reasons — especially if you're working within a team where you don't know how strong other people's passwords are going to be. Oh, sorry — the way the slide is set up, it makes it look as though password-only authentication is distinct from SSH, but you can actually use SSH with passwords. What we want you to end up with here is SSH using encryption keys to connect in all cases. So you have public and private key pairs that we can generate — we want to build strong keys — and SSH ends up being convenient if you want to use a number of services. The general idea of what we're doing here is connecting two computers through a secure, encrypted channel. One side today is going to be, basically, your laptop; the other side is going to be your virtual machine running on — this is the logo for — the Collaboratory; and in between them is the internet: a bunch of people you might not trust, who can potentially listen in. This is why you want the encryption. What you're going to be doing a little later today is generating, on your machine, a private key, from which a public key is derived. We then need to securely communicate the public key to the virtual machine — to the Collaboratory, actually — so that you can start the communication. You might have noticed that the key is going over the public internet, but we do have a way to be sure it ends up at the right destination and not in some malicious hacker's hands — although, because it's the public key, anyone can have access to it without that being a problem. Through the magic of SSH and its key-exchange process, you can then create an SSH tunnel: your communications still obviously pass between the two machines, but they're encrypted, so no one can read them. We're mostly just going to be opening up an SSH terminal window to execute commands, but it's also possible to do things like transfer files this way. I don't think we'll be transferring files, because part of the idea is that it's much easier to analyze this genomic data in the cloud, where we have a lot of compute power, than on our local machines. You can in theory also run a remote desktop and so on over the SSH tunnel, but as I understand it, that's not what we're doing.
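Roughly, that first step looks like this on the command line — the file names, the user name, and the address below are placeholders, and the exact workflow today may differ from this sketch:

    # Generate a strong key pair locally; the private key never leaves your laptop.
    ssh-keygen -t ed25519 -f ~/.ssh/collab_key -C "workshop-key"

    #   ~/.ssh/collab_key      the private key: keep it offline, never share it
    #   ~/.ssh/collab_key.pub  the public key: the part you hand to the provider

    # Once the public key is registered with the virtual machine, connect:
    ssh -i ~/.ssh/collab_key ubuntu@<vm-address>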
So here's an example of something almost like what you'll see with the web interface for firing up instances and virtual machines — I mean, this is just a command prompt, but there will be some kind of interface. You're going to connect over a certain port. In general, it's a good idea to have the SSH server listening on a random port in the dynamic range — the set of ports I've got here — though I'm not sure how it will be set up today. The idea is that this buys a small amount of security, in the sense that if you avoid the standard ports, an attacker might not know which port your server is actually running on. I wouldn't rely 100% on that alone as a security mechanism. And then you want your firewall blocking as many of the remaining ports as possible, unless they're for services that you need and trust. As for the private key I mentioned before: that's what you especially want to limit access to, including physically, and you never want to put it online. There are cases — you can do searches on Google and find people's private keys — and that is not a good thing to have happen. The public key, the way the magic of SSH works, is public: anyone can read it without problems. On an ongoing basis, if you have people you're working with and you've shared keys, it's good to replace and regenerate them. It's also a good idea to shut down your virtual machine whenever it's not in use — partly for security reasons, but also because if something's running that you're not aware of and it's eating up a whole bunch of cycles, you're going to end up with a giant bill for it. Once in a while you hear stories of professors complaining, 'I had this postdoc who just left the machine on, and we got a giant bill for thousands of dollars.' Beyond this quick overview, it's good to consult further resources — there are ways to harden security configurations further. As I was trying to mention before, it's good to prohibit password-only SSH connections and use key-based authentication instead.
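On the server side, that kind of hardening boils down to a few standard OpenSSH sshd_config directives — the port number below is just an arbitrary example from the dynamic range:

    # /etc/ssh/sshd_config (excerpt) -- illustrative values only
    Port 54321                  # a non-standard port from the dynamic range
    PasswordAuthentication no   # key pairs only; no password-only logins
    PubkeyAuthentication yes
    PermitRootLogin no          # never allow direct root logins
    # restart the SSH service after editing for the changes to take effect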
And when you see warnings, it's good to make sure you understand them. For example, here's one that's trying to prevent a man-in-the-middle attack — to make sure that the machine you're trying to form a connection to is actually the machine you want to connect to. Then I've got the start of — just a picture of — the ICGC policy for controlled-access data. This policy has specific guidance on a number of other issues; I'm pretty much out of time, so I'm not going to go through them, but some concern local infrastructure and some are cloud-specific, including guidance for specific providers, depending on which company you're going through. And the last thing I'll say — I also realize I'm the last thing standing between you and lunch — is that audits and accountability are also provided for in the policy document. That matters in an ongoing way, because you don't want to consider security risks only when establishing a new system: partly because your system is going to develop over time, but also because new vulnerabilities come up all the time, as I mentioned at the outset. It's good, when possible, to have a certified auditor reviewing your system, and to regularly review your keys. So yes, as I said, not only does your project evolve — so does the state of data security, as well as best practices. It looks like there are ten minutes left, if people have questions, or if people are dying for me to go over the Australian example — which, well, I mostly did talk about — but otherwise, maybe lunch time.