Okay, next up is Adam Huffman from the Big Data Institute, so another non-European person coming to talk at FOSDEM about the possible conflict between data security and high performance. Hello, is that audible? Is it on? It's on, yeah. No one says it isn't, so I'm assuming it is. So, yeah, I'm going to talk about data security, and I specifically mean data privacy, that kind of thing, not whether your data is actually there. That's someone else's problem. I'm going to start by considering the sort of assumptions that we've had in HPC and how they might not necessarily fit with sensitive data, clinical data; why the nature of the science that people are trying to do is changing; the sort of intrusion of the real world into the cosy little bubble of HPC; and some approaches to deal with it. After all that doom and gloom, maybe a tiny little bit of optimism at the end. Why am I interested in this? I started at the Big Data Institute in October, and it's a new job, so you've got lots of new questions, lots of new problems to deal with. I don't have many of the answers. I have some of the answers. I think I know most of the questions. I know people like Todd have already faced data security in quite a serious way for a long time, but I'm talking about other sites that have maybe never had to deal with this kind of problem before. And I'm generally talking about biomedical data, because that's what I know the most about, and specifically in England, because it makes a difference. I'll come back to that later. So this is our nice new building. My office is about there. It's an office with a window, which is a luxury these days. As with all these new institutes, you get a blurb saying how fantastic it is, but the thing that's relevant here is that we're talking about integrating large, complex, heterogeneous data sets, and about crossing different departmental boundaries.
So epidemiologists talking to biologists, talking to clinicians, talking to hospitals, talking to engineering, talking to mathematics. This is an example of one of the projects that we're hosting, called Minerva and Me, where members of the public can upload images of faces, and they have algorithms that can actually diagnose diseases based on features in the faces. As you can imagine, there are rather significant privacy implications to this. I actually had a contact from the PI of this project last week because he's got a new PhD student and he wants to use GPUs to process the data. But the only machine that has access to the data does not have a GPU, and he can't use the GPUs in our main GPU cluster, so we have to take one of the K80s and put it in a different machine so this PhD student can actually get his work going. So what are the kind of assumptions that we've had in HPC, and how may they not be helpful? I think there's been a general attitude that if you've got your isolated network, then whatever you do behind it is fine, because no one's going to get into that network, so there's no problem. And you've got your GPFS nodes where root can talk to every other node, and that's fine. In my previous job it was biology data, but not humans, so it was just molecules. There are no ethics committees for molecules, campaigning for molecule rights. Or particles, in my previous job before that. There are no protests about whether particular particles are being looked after properly. In order to get this high performance, we actually try to remove a lot of the security barriers that you would normally have. And who's going to check? As long as people get their results, no one really cares. As I said, some people have always taken a different view. So I found this guy, who says things have changed. He's got a very red hand, so maybe that means he knows why things have changed. I don't know.
So I think there is actually pressure from the hyperscalers, from the Googles and the Facebooks, because scientists hear about deep learning and machine learning and think, why aren't we doing that kind of thing? Why are we stuck with these old clusters? We want to do more ambitious science. We want to take all these datasets and do something clever with them. The instruments that generate the data are becoming increasingly capable. You might have a robot microscope that's running for a whole day, taking tiny slices, and that generates masses of data, which leads the scientists to think, can we address more interesting problems? As I said before, people are collaborating across different boundaries, and the funders are saying, we want the kind of science that gets us onto the BBC News or onto the Google News homepage, not just something in a journal that hardly anyone outside of academia ever reads. An example of the datasets that people are interested in is UK Biobank. That's a study of 500,000 people, where they've been taking various measurements of them. The study started in about 2004. They're still talking to these people, and they're now adding imaging data for at least 50,000 of them. There's another one, the Chinese biobank, with another 500,000 people. There's also the 100,000 Genomes Project; the clue is in the name as to how many people are in that one. And there's an American project which was, I think, started by the previous president. That one is a million Americans. Again, it's a longitudinal study, with lots of different measurements for these people. And the conditions of access for these kinds of datasets are all different. So then you see the sort of problem: what if you quite legitimately want to take a UK Biobank cohort and combine it with one of these others, and they have clashing requirements?
How do you deal with that? Among the sorts of data that people are looking to integrate are electronic health records. I saw a talk by someone at our place where he had to get full ethical approval before he could do any computing on the electronic health records. That's a process that takes months. There are also things like Hospital Episode Statistics: whenever you go to hospital in the UK and something happens to you, it'll be recorded. There are databases for prescription data, although interestingly, they have no knowledge of whether prescribed drugs are actually bought or ever used. They just have no idea. This is an example of a project trying to use big data on heart disease to improve treatments, but it faces all these sorts of problems with integrating different datasets. The speaker said that there was a really good database called CPRD, but he just was not allowed to use it, so he had to use a different database which is not as good, because the ethics committee said, well, you can't use it that way. In this diagram the details don't really matter. It's by someone who's quite senior in the NHS in England, and he's drawn some of the different sorts of data within the NHS, and it's a big mess. No one really knows what all these things are. They all have different standards. They all have different formats. There are lots of people who have a say in whether you can actually use them. And again, this delays everything. The theme I'm trying to get across is that it's not that the individual computation is going to be slower. If you look at the delay between when a scientist has an idea and when they can actually come out with their result, because of all this extra admin it's going to take a lot longer, and you need to start thinking about these things as soon as possible. When you're using these protected data sources, it usually implies there's data sharing going on.
And data sharing implies that you have to have an agreement between different parties, between different universities, possibly in different countries, and these agreements actually need to be audited. As I said before, you could have clashing requirements, so you might get some sort of unpleasant Venn diagram, where the intersection isn't really what you wanted it to be. There's also the possibility of reputational damage. It's actually quite difficult to lose your job in a university, but if you do something that means the university gets a load of bad press, then your position might be in a bit more danger than normal. So here, one of the departments in Oxford had a data sharing audit, and this is a public website. There's no login or anything here. That's the data sharing audit there. Anyone can go to that website and see what was said in the audit. Maybe it said the department did okay. But maybe it said, we were really rubbish and we had to fix everything. That's going to be public now. So if your documentation of everything is not very good, everyone is going to know about it. And this could include activists who are keen to publicise bad practice. This is NHS Digital, which is specifically for England; if we were looking at Scottish data, that would be completely different. The point of this quote is that the people doing the audit looked in what he described as almost forensic detail at the computer that was downloading the data. The example there was that they had to use BitLocker to encrypt the machine. They said in their documentation they were using AES-256; in fact, they were using AES-128. The auditors noticed and they complained, so they then had to have an action plan saying how they were going to fix this. Also, these computers are all in a secure computing room, so to get any access to the data you have to be on this list, have a card, biometrics, that kind of thing.
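As a toy illustration of the kind of detail the auditors caught: the AES variant is fixed entirely by the key length, so a documented-versus-actual mismatch like the one above is mechanically checkable. This is a hypothetical sketch (the function and policy names are made up), not how BitLocker is actually audited.

```python
# Hypothetical sketch: the AES variant is determined by key length
# (16 bytes -> AES-128, 24 -> AES-192, 32 -> AES-256), so an automated
# check can catch "documented AES-256, actually AES-128".
def aes_variant(key: bytes) -> str:
    variants = {16: "AES-128", 24: "AES-192", 32: "AES-256"}
    if len(key) not in variants:
        raise ValueError(f"invalid AES key length: {len(key)} bytes")
    return variants[len(key)]

def matches_documentation(key: bytes, documented: str = "AES-256") -> bool:
    """True only if the key in use matches what the paperwork claims."""
    return aes_variant(key) == documented

# A 16-byte key is AES-128, which fails an AES-256 policy check.
assert aes_variant(b"\x00" * 16) == "AES-128"
assert not matches_documentation(b"\x00" * 16)
assert matches_documentation(b"\x00" * 32)
```

The point is that this sort of compliance check can run continuously instead of waiting for an auditor to notice.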
And this is not to get into a data centre, as you might expect. This is just for the people who need to get into the room where they can then talk to the data that they're interested in. So there's a whole other layer of bureaucracy just to talk to the data. I know of another big university where they are actually employing external contractors to check that the systems are compliant with these regulations when they're looking at US National Institutes of Health data. If they didn't have this sort of proactive audit, then the NIH would come in and audit all their systems, which would probably take years. I think it's reaching the point with these clinical data sets where you have to actively prove that you're doing the right thing. You can't just wait and think, maybe they'll catch us out, maybe they won't. So this is DeepMind, the famous Google company. People outside the UK may not remember, but there was a care.data project which was supposed to take UK healthcare data and make it available for research. It turned into a complete fiasco. So now they're creating this kind of buzzword-compliant, semi-blockchain system which can audit every single access to clinical data, which is addressing this idea of proving you're doing the right thing before anyone proves otherwise. And I think there is a lot of complacency which can't really survive anymore. We need to be engaging with these people before the problems arise, illustrated here by the cosy bubble of big data and HPC with a pin attacking it rather nastily. And I think the Facebook fake news affair and the sort of high-profile errors that some of the other big companies have made have made the public a lot more sceptical. We need to bear that in mind.
So some people might say, why don't we just have these simple, separated, air-gapped systems? But actually this makes things more expensive and it cuts down on the flexibility. Really, you want to be able to run your analyses wherever; you don't want to have to have different buildings, different rooms. This is one approach, from another university in the UK whose name I can't quite bring myself to say. Here's the patient data in the nice safe NHS network; it goes over to the research network; this mysterious anonymisation process happens; and then it goes to the usual storage. So what does this anonymisation mean? Really, this arrangement is only acceptable if it's perfect. So how anonymous are we really? The thing I'm pointing to there is an article where, at Queen Mary in London, some patient rights activists put in a freedom of information request to see some clinical trial data which the researchers claimed had already been anonymised. But what did that actually mean? This all went to a tribunal, and again, that's months and months, and it's affecting when the research can be published. And it's actually becoming easier and easier to take data that's supposedly been anonymised and recover some of the original identifying information. So can we take supposedly anonymised data and just run it on a general-purpose cluster? That's an open question, and I found that there was a two-day workshop in December, with about 25 talks, by the European Medicines Agency (in England, of course), which implies to me that anonymisation is far from being a solved problem. So the sort of approach I'm thinking of is taking this idea of immutable infrastructure, where everything is under source control, it's in git, and then you roll out your infrastructure in a documented way, and you can prove what's happened, using OpenStack.
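The "set out your policies explicitly and prove what's happened" idea can be sketched very simply as policy-as-code. All the names and the inventory below are made up for illustration; a real cloud would express this declaratively in the platform's own policy tooling rather than in ad hoc Python.

```python
# Toy policy-as-code sketch with hypothetical names: declare which
# networks VMs built from each image class may attach to, then flag
# anything in the (made-up) inventory that violates the declaration.
POLICY = {
    "clinical-data-image": {"secure-net"},
    "general-image": {"secure-net", "campus-net"},
}

inventory = [
    {"name": "vm1", "image_class": "clinical-data-image", "network": "secure-net"},
    {"name": "vm2", "image_class": "clinical-data-image", "network": "campus-net"},
    {"name": "vm3", "image_class": "general-image", "network": "campus-net"},
]

def violations(inventory, policy):
    """Return names of VMs attached to a network their image class forbids."""
    return [vm["name"] for vm in inventory
            if vm["network"] not in policy[vm["image_class"]]]

print(violations(inventory, POLICY))  # ['vm2']
```

Because both the policy and the check live in source control, the audit trail is the commit history: you can show exactly what the rules were at any point and when a violation was detected.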
This is not necessarily virtualisation; this could be for bare metal as well. I know that several major universities are managing their clusters using OpenStack, and I think one of the talks earlier today referred to that as well. The advantage there is that you have to explicitly set out these security policies: how the networks apply, which networks the users are able to use to do what. There's actually a talk later today about GDPR, this regulation which probably has lots of clever things to say; I think people are very complacent about GDPR as well. And there is a project within OpenStack called Congress which, I must admit, when I first saw it several years ago, I thought looked absolutely incredibly boring. Why would anyone ever want that? It applies policies to the rest of your cloud so you can monitor when a policy is violated. Say you could have some rules saying VMs of a particular image should only be on this type of network and not another type of network, and you can monitor this proactively or reactively. Of course, now I think it's quite handy that these people have been working on that. Or maybe my life has just become more boring, one of the two. Here's an example from one of their talks, where this VM is using the default security table, so it's flagged as an error; there's no error if it's in the secure table. Really, you'd think this should never happen, but the reason why people like me and Kenneth and Todd have jobs is because these errors do happen and we're supposed to fix them. I've said that already, and it gives the advantage of auditability. When a non-technical political group is looking at your infrastructure, they're not interested in your reassurance. They want something they can take back to their ministers and politicians to prove that it's all working properly. Also, because you're working on so many projects, you tend to need to delegate some admin rights. So can you be reassured that this
delegated administration does what it's supposed to do? Are people using rights the way they should be? And it forces you to document things, really, which is what you need to do. I think there are some ideas in the containerisation world that we can use, particularly the sort of static analysis which is originally being used for finding security vulnerabilities; in principle we should be able to use static analysis to see if containers are treating data privacy issues in the wrong way. I think it's generally a good move for a lot of the really dependency-filled pipelines, particularly in bioinformatics, to be put into containers, and we can hopefully reuse all the other people's good work here. What about the cloud, this thing you keep hearing about? It's not as fashionable as some of the other buzzwords, but we also need to be able to run a lot of these analyses on public clouds, because certainly the funders that I hear about always ask the question, when you're thinking about on-premises infrastructure, have you thought about running in the cloud? So could we possibly encrypt absolutely everything in our virtual infrastructure? Could we use AVX-512 to do that? Maybe, but on the other hand, Meltdown and Spectre particularly affect VMs. On the other hand, maybe we could use AMD's secure virtualisation. Someone actually asked about this in a dev room yesterday, and whether Volte had support for this, and they didn't. But then in December someone found an absolutely massive backdoor in this, quite similar to the Intel Management Engine backdoor. So the take-home story, if you like, is that you really need to start thinking about this when you're designing your systems; that as a community, HPC needs to be much more outward-looking; and we need to expect people to challenge us on what we're doing, not just to be left alone as the ones in the corner, the sort of slightly smelly ones no one wants to talk to. We need to be actually engaging with people, and also we need
to give more realistic estimations of job time when people talk to us. This is the stuff we know about already: maybe you've got an 800-terabyte data set that you need to move from one place to another because it can't be processed in the original location; maybe it's going to take some time to anonymise; maybe you've got to talk to all these committees. So you need to factor that in when you're talking to people. Those are the images that I've stolen, and that's it. Thank you. Questions? Any questions? Yes.

Hi, I think one of the main challenges between the question of privacy and security for medical data is not only the question of how to anonymise, but also what about anonymised data that can be re-identified with various techniques? For example, if you have anonymised data for a clinical study or medical study or epidemiology, usually these data are open data, under open licences, and you can use them for research. But one of the problems is, what if some malicious people find a way to re-identify people after the anonymisation?

I agree. I did mention in passing that it seems like it's becoming easier to do that all the time; people are finding more and more ways of identifying data that's supposed to be anonymous. So to answer your question: I don't have a solution for that, sorry. More questions?

So we actually have some folks at Livermore who are trying to build a system for handling HIPAA data on our systems, because it's more tedious to deal with than classified data, just because of the restrictions. I'm not familiar enough with the area, but in our case I think we were able to get to a state where we would store the data encrypted, keep the credentials for accessing it somewhere else, and only ever decrypt it in memory, and people were pretty confident that that would be approved. Has anyone looked into building an automated system to make this easier for clusters? I don't think, in the end, that what they built would impact performance that much; just when you load
the data, you get it in memory, and then you can do what you want with it, so it's a transfer cost. The individual computation probably isn't that affected; it's all the other fluffy talking and planning that you need to do. Thanks. More questions? If not, let's thank Adam again.
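The scheme described in that last exchange, store the data encrypted, keep the credentials elsewhere, and decrypt only in memory, can be sketched as follows. This is not the actual Livermore system; it is a minimal illustration in which a toy XOR stream cipher (never use this for real data) stands in for a vetted cipher such as AES-GCM, and the key stands in for credentials fetched from a separate store.

```python
import hashlib
from itertools import count

# Toy stream cipher for illustration only: a SHA-256-derived keystream
# XORed with the data. A real system would use a vetted AEAD cipher.
def xor_cipher(key: bytes, data: bytes) -> bytes:
    stream = b""
    for i in count():
        if len(stream) >= len(data):
            break
        stream += hashlib.sha256(key + i.to_bytes(8, "big")).digest()
    return bytes(a ^ b for a, b in zip(data, stream))

# The key stands in for credentials held in a separate credential store.
key = b"held-in-a-separate-credential-store"
ciphertext = xor_cipher(key, b"patient_id,measurement\n123,4.5\n")

# Only the ciphertext ever touches shared storage; decryption happens
# in memory, inside the analysis, and the plaintext is never written out.
def analyse(ciphertext: bytes, key: bytes) -> int:
    plaintext = xor_cipher(key, ciphertext)   # XOR is its own inverse
    return plaintext.count(b"\n")             # stand-in for the real analysis

print(analyse(ciphertext, key))  # 2
```

As the questioner suggested, the overhead is a one-off decryption cost on load rather than a per-operation penalty, which is why the individual computation is largely unaffected.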