So, let's pass the floor now to Jordi. So... Okay. Thank you for inviting me to this meeting. I will talk from a different perspective, let's say, because I am co-managing an archive of human genomic data, and I am not a lawyer or an expert on all these ethical and legal issues. I'm more a technical and biological guy who is trying to run a service and make it useful for its users, okay? This service is called the EGA — you see the logo up there, I think — and it is co-run by the CRG and EMBL-EBI, that is, the European Bioinformatics Institute and the Centre for Genomic Regulation. The CRG's name can be a little misleading in a legal context: when I say regulation, we are not talking about regulation in the sense of laws, but regulation in the biological sense — what switches given genes on or off in our bodies, in our cells, at different times in our lives. That is the kind of regulation we are talking about. We are a centre involved in all kinds of genomic research, and in particular we co-manage this archive. These are our funders. One of the things we are very grateful for is the BSC, which hosts everything for us — and they are also hosting the RDA meeting today, I think. So they are a partner of ours too, although they are not exactly running the service, okay?

So, data sharing is wonderful, as we all know by now. In the genomic world, what happened is that people started to sequence, initially bacteria and those kinds of organisms, and very soon they started to sequence human data — some specific genes, for example, to detect or better understand specific diseases. Other people were sequencing too, and it became clear that sequencing just a few people is not enough to draw biological conclusions. So they thought: if we can increase the sample number, that would be wonderful, because sequencing is very expensive. If we can share everything, we can have bigger, better samples, and we will save money. Everyone said it was a good idea, the funders said it was a good idea, so: let's do it. Let's put this data somewhere we can all just use it. So we decided to share, and so on.

But because this is human data, it is sensitive data, and very soon some people said: come on, this data is very sensitive — it's my genome, it's myself, if you want. How could you share this data without my consent, or without me even knowing you are doing it? The big archives at that time were hosting bacterial data, or animal data, or plant data, and when they started to hold some human data, they decided something needed to be done. The first solution was: let's put this data in a jail, okay? I never know whether to picture it as a jail, or as protecting a politician or a king, someone who needs to be shielded from the crowds — because sometimes it looks like the data is a culprit, something guilty that needs to be hidden because it is a bad thing, and sometimes it is just the opposite: this data is wonderful and needs to be protected because it is the best thing in the world. So we put some police on it, deciding who is allowed to go near this data and who is not.
We put some lawyers on it — who will use this data, what the conditions for using it are, and what the legal terms are for doing so — and we also put some kind of bodyguards on it, judging whether what you are doing to this data is good or bad. So you can approach this guy to get an autograph, a signature, something like that, while we watch that you are not doing anything bad to this thing, okay? That was the first approach for all the human genomic or genetic data we started to share: put everything under control and make sure nothing bad happens to this information.

This matters because this is your identity, if you want. If you watch those forensic films or TV series, they identify the killers and culprits by looking at the DNA in many different places — on bottles, on handkerchiefs, everywhere. They can determine whether a person was somewhere at some point in time from their DNA. So having access to it can tell you many things about a person: their ethnicity, the kinds of diseases they may have, indeed the prognosis of their life, if you want. You can learn very relevant things about a person by looking at their DNA. And it is not just about you: when you share your genome, you are telling things about your mum and your dad, and about your brothers, your cousins, and everyone related to you. So it is ethically delicate stuff, and something needed to be done.

So the EBI, which has one of the biggest nucleotide and genomic archives in the world, decided to put things under control, and they created the EGA. The EGA, the European Genome-phenome Archive, was created in 2008, and in 2013 the CRG joined the effort. What we do is take genomic information from studies — research studies, usually clinical studies — and store it in our secure, long-term archive, so it is there for tens of years, probably. We then ask the owners of the data to set up what we call a Data Access Committee — there are two of them in this example. When a user wants to access the data, we redirect them to the relevant Data Access Committee for the study they are interested in, and those people, through a set of tools, grant the user access at the EGA. We then create the personal user accounts and deliver the information, so the user can access everything that sits under controlled access. What this means is that the data owners make a submission, sending us the data and the metadata, annotated with ontologies where possible; we provide them with tools to control who accesses the information, and we provide the end users with tools to get it — currently mainly by downloading it to their own premises, where they run their own analyses. So that is, let's say, the life cycle of the EGA. I hope this is clear enough.
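[For the written version of this talk, here is a minimal sketch of that life cycle — submission, Data Access Committee review, controlled download — in Python. Every class, accession, and the one-line review policy is an invented illustration, not the real EGA interface.]

```python
# Toy model of the controlled-access life cycle described above.
# All names and the review policy are hypothetical, not EGA's real APIs.

from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

@dataclass
class DataAccessCommittee:
    name: str
    policy: Callable[[str], bool]          # human review, abstracted away
    approved_users: Set[str] = field(default_factory=set)

    def review(self, user: str, proposal: str) -> bool:
        if self.policy(proposal):
            self.approved_users.add(user)
            return True
        return False

class Archive:
    """Long-term secure store; releases files only after DAC approval."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[DataAccessCommittee, List[str], dict]] = {}

    def submit(self, accession: str, dac: DataAccessCommittee,
               files: List[str], metadata: dict) -> None:
        # Data owners send data plus ontology-annotated metadata.
        self._store[accession] = (dac, files, metadata)

    def request_access(self, accession: str, user: str,
                       proposal: str) -> List[str]:
        dac, files, _ = self._store[accession]
        if dac.review(user, proposal):     # user redirected to the relevant DAC
            return files                   # then downloads to own premises
        raise PermissionError(f"{dac.name} did not approve this request")

# Hypothetical usage:
dac = DataAccessCommittee("StudyX-DAC", policy=lambda p: "alcohol" in p)
ega = Archive()
ega.submit("EGAD000TOY", dac, ["reads.bam"], {"tissue": "blood"})
print(ega.request_access("EGAD000TOY", "researcher1", "alcohol abuse study"))
```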
So we see that this is rather bureaucratic: for everything you want to access, you need to apply, send letters or emails, sign documents, and commit to using the data for good, and so on. And sometimes you are getting a very large amount of data — genomic data is big, okay? Currently the EGA holds about four petabytes, and that is just one copy; we keep three. Four petabytes means 4,000 terabytes. If you go to any electronics shop you can buy disks of one or two terabytes; we are talking about 2,000 to 4,000 disks like that — and three copies of them. So it is a lot of information. You don't need all the datasets, but we have single datasets of about half a petabyte, 500 terabytes. Only a few institutions in the world are able to manage that volume of information.

So you apply, and after, let's say, two months you get access, and then you go in and realize the information is not exactly what you wanted. Because when you sequence people, some parts get sequenced better than others, there are issues, so maybe the thing you are looking for is simply not there. You spent money, time, and resources applying, getting the data, opening it, and looking at it, only to realize that what you were after is missing. So people started to say: what if we could share just something — something that shows only aggregate data? Summaries, anonymized in the sense of being aggregated. That sounds good: safety in numbers. But quite soon mathematicians and statisticians realized that being in numbers is good — you are hiding inside the flock — but with the right tools, like this hawk here, someone can still pick out one specific individual, an animal in this case, a starling probably, from the flock. If you are not the one picked, you are happy; if you are the one picked, you are not happy at all — from that point on you are nothing, you are dead, mainly. So this is not enough.

This is an example of a paper that looked at this, not in genomic data but in proteomic data: just from your blood you can get all your proteins, and from those proteins we now have the resolution to see the specific mutations you carry, which can identify you among other people. So the issue is not just your DNA. Indeed, there are now studies showing that from your microbiome — the bacteria on your skin and so on — we can tell whether you have been in a room, or which house in a neighbourhood is yours, just from the signature of your bacteria there. We can say how many people are in your family, whether you have dogs, and how many dogs, and so on. These kinds of things are starting to happen, so the boundary of privacy is more and more blurred. Safety in numbers, as I said, is not a good solution, so we need to do something else. Well, so we have this sharing dilemma that you already know.
Here, we have expressed the dilemma like this: we have these data protection initiatives and directives, and we also have these openness directives, and they conflict. It is the same in the genomic area, as I explained: one side is looking for sample size, the other is keeping people's privacy as far as possible. Now, this guy is able to handle four horses, not just two — a demonstration he did in Canada; he is French, I think, or Belgian, I don't remember right now. If he can manage four horses, I hope we can do the same with all this legal stuff we are talking about.

One solution is to create connected silos: some institutions sign contracts among themselves to share data, maybe in a single repository or maybe in distributed ones. Several hospitals, for example, or several academic institutions, can agree to share data. But again, that only serves that small community, so it is not a good solution either. At some point something called the Global Alliance (for Genomics and Health) was born — I think in 2013 or 2014. It is a community of biomedical research centres working with genomic data, trying to work out how to share better, to harmonize what we share, and to foster reuse of the data. The Global Alliance has several work streams — I will go quickly here — and it looks at both aspects: the technical side — the formats, the ontologies, the APIs (the programmatic interfaces) and web services we can share — and also all the ethical and security matters around this. That is at the global level; in Europe there is something called ELIXIR. ELIXIR looks at the same kind of things but also at infrastructure — where we put the data, which clouds we can use, which network connections, and so on. So it is broader, let's say: the one is standards; the other is standards plus infrastructure.

In the context of the Global Alliance, we started to work on two different models for sharing data and describing data-use consent. I had to leave this morning for a meeting, but in the first session it was said that consent is a very relevant thing, and this community is working on it because it is one of our issues. We need to share this data as much as possible, as I said, because it is expensive in many senses and very valuable in others. And we keep running into cases like this: yesterday, for example, I got a request from a researcher here in Barcelona, who told me that the consent from their patients — the donors who gave their DNA — says the data can only be used in the context of alcohol abuse, which was the context of that study. So we can only reuse this data if the requesting study is related to alcohol abuse — not drug abuse, or addiction in general, or anything like that. If it is not related to alcohol abuse, we cannot use the data. That, of course, is very bad, because we cannot reuse these people for other things — even as controls, as people not having cancer, say — or for any other purpose that would be very useful.
So this is the kind of thing we are trying to do. Again, this is written up in papers and documents: we are trying to classify rules that are quite common in consents, so that both the requester and the data owner can describe their consent to share and their consent to reuse. Then we can match them. It is not fully automatic, but what we are trying to provide the Data Access Committee is an interface that says: everything matches — the consent on your data and the proposed study of the requester — you just need to say okay, and it is done. You don't have to read all the documents. That is one model, and it is simple: around 20 different categories, some mandatory, some optional. There is another initiative, called ADA-M, that is much more detailed, much more granular, and much more oriented towards being fully machine readable. The idea is to describe everything in a format a machine can read and, if possible, act on. For example, if it says the data is open to everything, you don't need to check anything; if it says general research use — which is quite common, luckily — there is not much more to do; but if it starts saying the data is only usable for this ethnicity or in this geographic context, then you need to look a little further. I think it is an interesting step.

And similar to the data tags that Peter showed us, we have started with three layers of access. The first is fully open or public — we mark it green. The second is registered access: we are trying to share some aggregate data with people whose identity we know, and in this context ELIXIR is trying to define what they call the bona fide researcher — meaning that if you have, for example, published a paper, or we can check that you have an institutional account at an academic place and are not administrative staff, then we give you automatic access to some data that is not sensitive enough to identify anyone. And then we have the most restrictive level, controlled access — the one in red — which requires you to apply, probably with you or your institution signing papers and making commitments, and so on.

So this is the context we are working in, trying to make the data useful. I haven't talked about moving the code to the data; it is something I didn't prepare for this talk, but it is another strand. As I said, our data is quite big and the resources for analyzing it are quite demanding, so what we are trying to do is let people run their analyses close to the data. This is nice, but it means the burden of paying for the computing resources falls on our institutions, and that is another story that is not easy to solve — because we are talking about very, very big computers, not laptops: machines with thousands of processors, hundreds of gigabytes of RAM, big disks, and so on. Very expensive machines are needed to process this kind of thing.
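[As an illustration of what "machine readable" could mean here, a small Python sketch of matching a request against consent terms and the three access tiers. The category names and rules are simplified inventions for this example, not the published consent-code or ADA-M vocabularies.]

```python
# Illustrative consent matcher in the spirit of the models described
# above. Categories and rules are invented; real vocabularies differ.

from enum import Enum

class AccessTier(Enum):
    OPEN = "green"        # public
    REGISTERED = "amber"  # bona fide researchers, aggregate data
    CONTROLLED = "red"    # apply to the DAC, sign agreements

# A dataset's consent, reduced to machine-readable restrictions.
dataset_consent = {
    "tier": AccessTier.CONTROLLED,
    "use": "disease-specific",     # vs. "general-research", "open"
    "disease": "alcohol abuse",    # the example from the talk
}

def matches(consent: dict, proposal: dict) -> str:
    """Return 'ok', 'no', or 'review' (a human DAC must decide)."""
    if consent["use"] == "open":
        return "ok"
    if consent["use"] == "general-research":
        return "ok" if proposal.get("purpose") == "research" else "review"
    if consent["use"] == "disease-specific":
        # The proposal must target the same disease as the consent.
        if proposal.get("disease") == consent["disease"]:
            return "ok"
        return "no"   # e.g. a general addiction study would be rejected
    return "review"

# The request from the talk: reuse for addiction in general -> rejected.
print(matches(dataset_consent, {"purpose": "research",
                                "disease": "drug addiction"}))  # 'no'
```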
So these are my teammates — this is slightly out of date — my teammates in both teams, EBI and CRG, and these are our funders; you can see there are plenty of them. Thank you, and we are open for questions, I think.

Thank you, Jordi. I think with this third presentation we have pretty much closed a cycle, where we have seen different ways of solving the question of how you minimize transaction costs — in the economic sense — when you try to access data. And I think this is actually one of the primary objectives of the European Union's policies around what is now called the European Open Science Cloud, and also the European Cloud Initiative: how you combine all these little elements of infrastructure in order to minimize transaction costs. However, before I give the floor to the audience: whenever we go through these discussions, I always think of the old saying of the science fiction author Arthur C. Clarke that the answer to the machine is in the machine. But I am really curious: we keep doing these rounds, trying to find technical solutions for what is basically a legal problem. And some of these solutions require really extensive development work — tagging schemes, workflows, technical and semantic expressions of licences, access protocols and facilities. So how viable do you think it is to keep doing this — other than us finding each other in these lovely places? What is the benefit of going for technical solutions to what is essentially a legal problem?

Well, I saw some similarities but also some differences. I will focus on the text-and-data-mining similarity and difference between Stelios and me, and on data tagging versus the levels of access with Jordi. I think there is also a different starting point, mainly between Stelios and me, because we do not even have that yet. I mentioned as an example the private digital library of 160,000 works which sits on one person's computer — he has probably backed it up — and he may not even give it to us, so we cannot even begin to make it accessible, although the approach would seem very interesting. We have so far looked into the material — well, into the metadata of the material — and it really is a jungle of copyrights that apply there. I would be so grateful if there were a possibility to harmonize or standardize that, but I am also afraid that will be really, really hard. That is why we thought: let us try, in the kind of permissive society the Netherlands still is — similar to its tolerance of soft drugs — to at least get permission, or have the publishers tolerate, that researchers can start using that body of knowledge in a protected environment. So I am not sure about the technological solution — it will certainly not be the only thing; you certainly need to do something in advance as well. And with respect to the layers of access: I do think that is a very important idea — or not just an idea, it has been practice for some time — but as far as I know it is still not really settled how many levels of access you would need.
The Harvard example gives, what is it, six or seven levels, but to be honest we are not entirely sure that would apply, because privacy legislation, for instance, actually only says you can have access or you cannot. Trying to derive those different levels will take a bit of, how should I say, interpretation — but that is not very different from today, where I see very different interpretations existing under the same copyright legislation. For instance, our Central Bureau of Statistics is very, very strict, while the Social and Cultural Planning Office — and the two often work together, they even collect the same kind of data — is much more liberal. This just says there is more to it than the law, and more to it than technological solutions.

I'll pass to Stelios, but just to say two things on what you said, which are important clarifications. I think the access layers relate very much to consent: personal data are black and white, but consent is not. We know that from the work of other groups — Oxford has done a lot of work on this issue, and Hewlett Packard has done a lot of work on consent management — and it would be interesting to combine that research one day. With regard to the question of statistics offices and cultural institutions: again, these are separate legal regimes on top of copyright, and we have great fragmentation in Europe; we saw that with Stelios also in the context of PSI. There are sub-regimes within the area of public information, and it is a nightmare. Stelios, about the tags as well — you also had the laundry tags.

Just very quickly: first of all, let me tell you that it is a very interesting linguistic but also computational problem to try and bridge the inherent feature of language, which is to be variant. There are many different ways to express the same thing — this is what we call language variation — and we observe it in many legal statements and legal documents. People essentially say the same thing, mean the same thing, but use different words. So what one has to do, when analyzing the documents, is try to get at, if you want, the logical form of each legal document: if the logical forms are identical, the two documents are identical, or, depending on whatever ontology you have, the documents can be made interoperable, or decided to be interoperable. So it is a very interesting research question. I fully agree with Peter and with Pro: it will not be just technology that solves the problem, or just the law. Of course, the law would do us a real favor if some of these problems were removed. Just one comment: technology is evolving at unimaginable speed. Not so recently — a couple of years ago — we managed to package engines, ship the engines to the Sanders library, and extract some information from the Sanders data without moving the dataset from its original place. So things can be done; there are many different solutions. In OpenMinTeD, what we are actually developing is a solution where you have a workflow that performs certain tasks; you can package it in what we call a Docker image, and you can dispatch the Docker image — or download it and use it on your personal machine.
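[To make the logical-form idea concrete, a toy Python sketch of reducing differently-worded consent clauses to one canonical code so they can be compared. The phrase table and codes are invented for illustration; a real system would use NLP over full documents, not string lookup.]

```python
# Toy normalization of variant legal wording to a canonical "logical form".

CANONICAL = {
    "may be used for any research purpose": "USE:general-research",
    "available for research of any kind": "USE:general-research",
    "restricted to studies of alcohol abuse": "USE:disease(alcohol abuse)",
    "only for investigations into alcohol abuse": "USE:disease(alcohol abuse)",
}

def logical_form(clause: str) -> str:
    # Same meaning, different words -> same canonical code.
    return CANONICAL.get(clause.strip().lower(), "USE:unknown-needs-review")

a = logical_form("Restricted to studies of alcohol abuse")
b = logical_form("Only for investigations into alcohol abuse")
print(a == b)   # True: the two clauses are interoperable
```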
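[And a rough illustration of the dispatch pattern just mentioned — shipping a packaged workflow to run where the data lives. The image name, paths, and helper are placeholders, not OpenMinTeD's actual tooling; only the standard Docker CLI is assumed.]

```python
import subprocess

def run_analysis_at_data(image: str, data_dir: str, results_dir: str) -> None:
    """Run a containerized workflow next to an immovable dataset."""
    subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{data_dir}:/data:ro",     # data mounted read-only, never leaves
            "-v", f"{results_dir}:/results",  # only derived results come back out
            image,
        ],
        check=True,
    )

# Hypothetical usage:
# run_analysis_at_data("example/variant-counter:1.0",
#                      "/secure/archive/dataset42", "/tmp/results")
```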
Now, the degree to which these solutions will work accurately still needs to be tested and proved, and the degree to which they will be taken up also needs to be tested. So I will cite Groucho Marx here. Groucho Marx said that money doesn't buy happiness — happiness comes from small things, like a small mansion, a small yacht, a small sports car, and so on. So I will say: technology is not everything.

That was very Marxist of you.

So technology will not solve everything, but I think it will help very, very much. In our case, for example, we are happy — or lucky — in the sense that we are not dealing with hundreds of thousands or millions of documents; we are talking about thousands of studies. So we can more or less manage: the concerns are similar, the problems are similar, the scope is similar, so we can classify things. But we face the other side of it: the people who should look into this are very scarce. For example, I was approached by an oncologist, a very prominent oncologist here in Barcelona, who leads a couple of studies on leukaemia. He said: I have an ethics committee in my institution, but when I approached them asking, will you help me check whether these applicants match the consent I have from my patients, the ethics committee said: we will look at the general case, but from there on you will have to manage by yourself. So now I have a very prominent medical doctor, very talented in oncology, who has to pore over legal statements trying to work out whether a request matches his patients' consent or not. He told me: initially I said no to almost everything, because I was scared of saying yes to something I shouldn't; little by little I got more confident and I am doing these things. I don't want this gentleman, or any similar gentleman or lady in the world, doing this. I want them to open the data as fast as possible, while being sure they will not be sued for opening something they shouldn't have. So in this case I hope technology can give him both: the confidence and the speed. And I think that with the much more diverse terms and copyrights, technology should also help, because there the nightmare is the opposite: the variability is much greater, although maybe you have more time to look into it. So I am a fan of technology for this.

I think it is time to pass the floor to you. Any pressing questions? Now that you have the opportunity — we have them trapped here.

Okay. I see a problem with all these concepts — these are things we are also about to handle, trying to handle. I think it is rather complicated, especially at the European level, because you have very different understandings of ethics and very different constructions of ethics committees. As long as you stick to a single country, I think it is possible to try to harmonize, in a way, all the limitations you can have in a consent. As soon as you go to the international level, I think it becomes almost impossible. We have this problem in Germany, because different states have different legislation with regard to health, so we already have this problem within Germany. So I don't think that is really a solution. It is a lot of work — and really, all my respect for your concept of coding these limitations.
But in my opinion this cannot be the way to reflect each and every speciality of limitation. Shouldn't we have a way to get past that somehow?

Well, these codes come not just from the Spanish or the UK experience. We looked at all the consents we ask our users for — not a lot of them give them to us, but we looked at quite a range — and we are talking about projects coming from Australia, from Japan, from China, from almost every country in the world except Africa, which is only now getting on the wagon. They share a big, interesting core of common elements. I will not say people are copying consents from one another, but it is happening more and more. So I think we have this common ground, and the idea is: if we cannot fully automate this — and I think that could be difficult — at least we can facilitate it as much as possible. As I said, I am not a legal expert at all — I am ignorant of the law — but my impression is that every law works the same way on this point: if the person has consented, then we are fine. We are talking about research data here, more often than not, not electronic health records, which are another kind of beast. But the community is working on that too, to see how we can bring electronic health records in as well, anonymized to some extent, or to the maximum extent that makes sense. I would say it is easier with electronic health records, because they are not as identifiable as the genome, the DNA, is. I concede that you can learn many things about people from more than electronic health records — we heard that from how much you tweet, and when, you can perhaps detect which person is behind an account in the Twitter cloud. But probably, if you share that kind of data in aggregate numbers, you are safer than in the case of DNA. So our experience is that although the laws are there, these consents sit, I will not say above them, but on top of this whole thing.

Do we have any other questions? I think you have been very intimidating to the audience: they see your wisdom and feel they cannot really ask. Only two minutes separate us from our lunch, and I think it would be very risky of me to stand between the audience and their lunch. So I would like to ask all of us to put our hands together and thank our speakers. You can catch them when they try to have lunch and ask your questions in private. Thank you.