This is about large-scale brain initiatives in terms of investment and manpower, but I wanted to start off by quickly reviewing how this all started. The Allen Institute was really one of the first efforts in this decade to make a significant commitment and a significant investment in doing large-scale neuroscience: in 2012 they announced their MindScope project, and we're very happy, of course, that as part of this meeting, on Saturday in the Australian node session, we'll have representatives of the Allen Institute talking about some of their latest atlasing work. You'll hear in this session about the Human Brain Project from Richard Frackowiak; it was awarded in 2013 and is the European initiative to build a large-scale collaborative infrastructure for neuroscience, medicine, and computing. We'll also hear from Walter Koroshetz about the US BRAIN Initiative, which came along shortly after the Human Brain Project was announced. And then we'll hear from our own INCF Japanese node about the Brain/MINDS project from Japan. These are the three main recently funded projects, but there are others, and I want to give a sense of that, because this is an important moment in history: around the world, every few months it seems, every year at least, new large-scale investments are being made by individual governments or groups of governments. In Korea there has been a significant multi-year commitment to funding the Korean Brain Research Institute, really aimed at multidisciplinary brain research. In China, as many of you may have heard, there is a project under development (not yet announced), a large-scale, maybe 15-year project being planned for financing, which in principle will have at least three components: basic neuroscience, brain disorders, and brain-like technology. So we're looking forward to hearing what happens there. And in Australia as well, a proposal was published for a $250 million Australian effort to map and model the brain; it has not yet been funded, but discussions are underway and hopefully we'll see something emerging. I also wanted to point out that there are different types of large-scale international collaborative projects. One that we're involved with is supported by One Mind for Research, a philanthropic organization in the United States, which has helped fund the development of the infrastructure for a large-scale traumatic brain injury project. What's important about this is that it is establishing a standardized infrastructure that's now being deployed around the world: not just within the European CENTER-TBI project, but also in the United States for TRACK-TBI, and it has now been launched in China, with 40 sites using the same standardized protocol, the standardized electronic case report forms (eCRFs), and a whole common infrastructure which allows this data to be combined with data from around the world. And we're looking at expanding that to India. So there are opportunities emerging, and being facilitated, to create other types of large-scale initiatives that are not necessarily at the scale of hundreds of millions or billions of dollars of investment, but are being deployed at very large scale to gather very large-scale standardized patient data.
I also wanted to briefly mention an initiative that is still in the planning stages, but is planned to happen around OHBM next year in Geneva: an initiative to bring together the various international brain projects to discuss how they can contribute to public health and what impact they can have on public health. That is jointly organized with the World Health Organization and OHBM. So, to summarize: we really see INCF moving into a new phase, where we have the resources to help seed collaborations between researchers around the world, and where we also hope to act as an organization that can facilitate interactions between the community and these large-scale brain initiatives in which such large investments are being made all around the world. We look forward to working with you in the community, but also with the leaders of these projects. And with that, I wanted to introduce our first speaker, Professor Richard Frackowiak. Richard has had a very rich career. He has recently retired, so I can make fun of him a little bit now: what is he going to do with himself? He is one of the most highly cited neuroscientists, and he has been a major pioneer in brain imaging. In the last stage of his career he led the neurology service at the CHUV university hospital in Lausanne, and he is a co-director of the Human Brain Project. We look forward to hearing from Richard about the medical informatics efforts in the Human Brain Project. (Turn the volume up, would you?)

Thanks very much. Hello, everyone. So I'm neither an informaticist, nor a physicist, nor an anatomist; I'm a neurologist, a breed called clinical scientist. And what I'm going to do today is... because this talk is a bit long; I usually talk for five minutes, but over time I've increased to 50. (It's happened again, has it? No, that's the wrong one. Is that all right? That's it for today. I wanted the one with the timing on it. Hang on a minute. That's my one up there, isn't it? Thank you very much, you're really kind. All right, let's go.) OK, so we're going to talk about one arm of the Human Brain Project. The Human Brain Project has three arms: a neuroscience arm, a medical science arm, and a technology arm, neuromorphic computing. Because it's so large, I'm going to concentrate on the one bit I know best, but I'll give you insights from that as to what the rest is doing and how we interact with each other as we go along. It's very important to say at the beginning that it's a big project because it embodies what is a relatively new culture in biological science and neuroscience. It moves from the PI, with his postdocs and doctoral students, fighting the guy across the corridor working on the same problem, to 17 institutions and 150 PIs in non-competitive, collaborative mode, working towards a single aim. So it's a different way of working, and its basis is the fact that no one brain, in fact no one computer, can hold all the information we need in order to solve the current problems in neuroscience. So you need lots of experts, and you'll see me rabbiting on about this as we go along. Now, there are a number of motivations, purely from the clinical neuroscience side, which have contributed to launching the Human Brain Project, and there are essentially two major ones. The first is the ageing population.
As the population ages, the number of people who become disabled by neurological degeneration increases in absolute terms. At the age of 80, about 20% of people are disabled and need help from someone else because of neurodegenerative disease of one sort or another, the commonest cause being the syndrome of dementia. Remember that word, syndrome: it means a number of symptoms and signs which come together and which a clinician understands. It says nothing about the cause. It says nothing about the underlying pathology. It says nothing very much about the treatment, though people often think it does. It's a way of classifying the way in which an abnormal brain manifests itself. The second reason is that there is another major group of brain diseases, which affect young people. That group of brain diseases is extremely badly understood; it has been treated using the stories of Mr. Freud for 120 years; we know virtually none of its biology: these are the psychiatric diseases. Some of these diseases are fatal, to the patient or to people who are with the patient. Some are so disabling that the sufferers never work, and so are continually supported by the state or by their families. So these are two major areas of brain disease which really need cracking, and we're getting nowhere. In psychiatric disease, one drug, haloperidol, discovered in 1958 by Paul Janssen of Belgium, is the basis of most of what has happened in psychiatric therapy of a biological type since then. Now, I'm exaggerating for effect, but I'm certainly telling the truth. So that's the first motivation. The second motivation is about how we get at the information we need in order to design good experiments and take knowledge forward. We are, as a discipline, extremely good at taking facts forward. We're even better at creating new methods; we have some amazing methods. We can see single calcium ions going through ion channels. We saw today examples of very rapid EM, and reconstruction from EM, of tiny brains, of big brains, of little brains, transparent brains and so on. We are each super, super experts in one little area, and we are governed by a way of being funded which means we have to know what we're going to find before we get the funds. We also work with a method which has been extremely successful in material science, the reductionist method, but which, in terms of understanding organic matter, how it's organized, and how it's organized so as to generate emergent properties such as consciousness, feelings and so on, is probably pretty useless. And even if we wanted to use the reductionist method, as we heard from our distinguished Chinese colleague today, it would take us at least 30 years to do one little piece of it. So there are clearly showstoppers there, which we have to crack through, and we have to understand how to proceed. Now, the principal aim of the Human Brain Project, and of its medical informatics platform, is to try and understand the rules which govern the relationship between DNA and emergent properties. This is an issue of spatial scale in the first instance, and of temporal scale in the second. Spatial scale will require exabyte computing and so on; temporal scale, added to that, will magnify the problem greatly, and we don't yet know how that will be done. So the first aim is to try and understand how quite a simple molecule, DNA, transforms into what we're doing today:
Highly social communication, with transfer of knowledge, making us, at least in evolutionary terms, one of the most successful species there is. Now, in thinking about how to break through our problems, the impact of informatics and computing science comes right to the forefront. And the most astonishing thing is how little medicine and medical science, and how little neuroscience, of all the disciplines, have actually used modern informatics and modern computer science. Some of you will have read how the evolution of the universe from 30,000 years after the Big Bang has now been modeled (in Nature, last year), and the end results look very similar to what we see using the Hubble telescope in scanning mode. We all know how the Apple telephone tells us what the weather will be like in a couple of hours' time; I'm of a generation for which, if you could tell what was going to happen in 10 seconds' time, that was already incredible, and the weekly forecast was always wrong, so you took the opposite to be correct. All this is due to the fact that computing has simply exploded over the last 20 years. Big data analyses are giving us objects which are now becoming the subjects of our hypotheses; they're becoming the hypotheses themselves. We're actually moving into a mode where the hypothesis doesn't come from the mind but from the data: the data themselves are generators of the most profound hypotheses we're beginning to discover. Now, this has led to a situation where we've got an enormous number of results as well as an enormous amount of data, and, coming from imaging, for me one of the principal issues is going to be how we understand the new knowledge that we generate. One of the ways this is going to happen is that we're going to use images as transmitters of knowledge, as well as simple illustrators of results. So it's very important, when you come from medicine, not to consider simply the radiological aspects of the brain, which you can now look at in ways that were impossible in normal subjects before 1973, when the first CT scanner came into use. These ways of looking at the human brain, diseased or normal, have greatly increased our powers of therapy and, above all, our powers of diagnosis. But they depend on pattern recognition by experts, and as such, the enormous amount of information there is simply sub-sampled in many ways. We've also heard today about how you can make brains transparent, find individual cells and follow their connections down to deeper parts of the brain: a sort of transparent Golgi. But you can also imagine images which convey the information in the structure of the rodent brain and, projecting it over time through to what the human brain now looks like, give information about evolutionary aspects of brain development. Now, the images on their own tap only into your visual recognition. But if you can add onto each voxel element in those images other information, coming from other aspects of neuroscience, or even from material science and developmental science and so on, you begin to create knowledge atlases which really can transform the way you think about how the brain is constructed. And all the little pieces of knowledge that we've been generating over the 30, 35 years we've had neuroscience as a discipline will come together into some sort of structural framework, which will depend on how each level of scale relates to, and determines, the next level of scale.
It sort of reduces the whole problem of how organic matter is constructed from something that seems impossible to understand to something where you say to yourself: well, there are certain rules about how DNA creates RNA, how RNA creates proteins, and how those proteins are distributed, which at each stage limit what is possible at the next level of organization. And it is that which we're going to try to tackle on the neuroscience side, with the high performance computing side, through simulation: to try and create a model, if you like, of the whole brain, a very rough model in the first instance, onto which all the pieces of information can be hung and understood in relation to each other. So, informatics and connectomics, which we heard quite a lot about earlier today: we heard about the structural side in radiology, and, in some of the work we heard about today, the tractography at the microscopic level; something about functional connectivity, but not an awful lot; and then there is effective connectivity, and the difference between the two is important. Functional connectivity means areas of the brain whose function is related in some simple way, one to the other, often correlated or anti-correlated or something similar, whereas effective connectivity means connection from some other, third element which drives the activity in the other elements to be correlated with each other. So the first has no causality involved in it, while the second has causality involved in it, and there are various advances in mathematical modeling, driven by the advances in computing power I was showing you earlier, which now allow us to solve really very complex, multi-dimensional questions of causality using highly complex data. (That was interesting, wasn't it? No, that's wrong as well. Sorry about this. Where did I go? There. Lovely.) So, how are we going to use all of this? First, let's start thinking about how informatics helps us to integrate data, and let me use just a few examples from imaging, which is my discipline. One of the ways in which the connectomics people are trying to discover how different parts of the brain are related to each other is ab initio: taking one area of the brain and asking, do I find, using this or that technique (often MRI techniques in the human brain), that it links to this part of the brain, or to that part? There are now methods for trying to see where one part of the brain links to lots of other parts of the brain, in simple terms. Another way of doing this is to try and get some information in there first, in order to lead the way. So here we see a little study performed by, in fact, one of my last PhD students, who said to himself: well, there's at least 30 years' worth of work in the literature, in the National Library of Medicine and the Public Library of Science, which tells us what people have found about the connections of the subthalamic nucleus. We're interested in the subthalamic nucleus because the treatment of Parkinson's can depend on putting electrodes very accurately into the subthalamic nucleus and switching those electrodes on, in order to stop trembling and to speed up movement.
Now, he simply used the MeSH indices for cortical and sub-cortical areas, and he looked, using algorithms that are readily available on the web, at how many papers he could find that correlated the subthalamic nucleus, in its various compartments, with all other areas of the brain. He came up with a large set of papers which could then be analyzed further to show, as you see on this diagram, an image which brings that information together: from the subthalamic nucleus, from its anterior and external parts, to all the other parts of the human brain to which connections have been reported in the past. And as you can see, there are very many. Some of them are afferent, in blue; some are efferent, in red; some appear to contain both afferent and efferent compartments. He was then able to lay these functional connections out in a spatial arrangement, putting the sub-cortical regions here and the subthalamic nucleus here, and he began to see that various compartments showed afferent fibers grouping together and efferent fibers grouping together, going out to the cortex. These could subsequently be correlated, again using the same papers, with the components associated with the afferent and efferent parts, and also with particular behaviors: the behaviors in blue here are primarily those associated with motor structures and motor memories; the red ones with limbic structures, to do with emotions; and the associative structures with motor-limbic projections. So, using this simple parcellation on the basis of past knowledge, all the little bits of knowledge scattered through a literature that would have taken him years to find and years to read through, he was able to look with MRI tractography at each of these connections and see how relevant they are in the human: how big they are, how small they are, whether they are present, and whether there are, in the human neurological literature, any indications as to what each of the individual connections might do. So this is one way in which one can summarize information that is already present, and there is an awful lot of information already present.
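Returning to that literature-mining step for a moment: the kind of MeSH co-occurrence counting described here can be sketched against PubMed's public E-utilities API. The endpoint and the [MeSH Terms] query syntax are standard PubMed; the seed term and the region list below are illustrative placeholders, not the terms used in the actual study.

```python
# Minimal sketch: count PubMed papers that co-mention the subthalamic
# nucleus and another brain region via their MeSH headings.
# Uses NCBI's public E-utilities (esearch); region list is illustrative.
import json
import time
import urllib.parse
import urllib.request

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_count(term: str) -> int:
    """Return the number of PubMed records matching a query term."""
    url = EUTILS + "?" + urllib.parse.urlencode(
        {"db": "pubmed", "term": term, "retmode": "json"})
    with urllib.request.urlopen(url) as resp:
        return int(json.load(resp)["esearchresult"]["count"])

seed = '"Subthalamic Nucleus"[MeSH Terms]'
regions = ["Motor Cortex", "Globus Pallidus", "Substantia Nigra",
           "Thalamus", "Prefrontal Cortex"]   # illustrative subset

for region in regions:
    query = f'{seed} AND "{region}"[MeSH Terms]'
    print(f"{region:20s} {pubmed_count(query):6d} co-mentioning papers")
    time.sleep(0.4)  # stay under NCBI's polite request-rate limit
```

The real study did far more than this (directionality, parcellation, cross-checking against tractography), but raw co-occurrence counts of this kind are the starting material.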
But in order to do these things, of course, you must also advance the relevant methodologies. So here is some more recent work from MRI, showing how to go beyond the classical approach: taking an image, usually an anatomical image (which means it's T1-weighted), looking at things like volume, thickness, area and so on in one set of subjects and in a control set of subjects, and then making statistical inferences and interpretations. That's what most people do at the present time. The aim is to bring quantitation into this whole subject, to take it away from pure radiology towards something more quantitative and related to the physical properties that underlie the brain. So here there are various sequences, as we call them, for collecting the imaging data, with the same machine, in the same subject, in the same 10 minutes: proton density, magnetization transfer and so on, related to biological aspects such as water diffusion, myelin integrity and iron content. And the point is to measure them in such a way that the measurements from each voxel can be compared not only with the same subject at another time, but with another subject in another laboratory; in other words, creating real quantitative images of these biologically relevant measurements, where interpretation then, of course, becomes much easier. So we need high performance computing in order to analyze these sorts of data, to bring these multiple variables together, and to interact with the information that we had a priori. The other thing that informatics, high performance computing and the mathematical advances they have driven have brought us is the sort of statistics we now need to deal with these highly complex issues of interactions, of changes in function in different contexts (which is a question that arose before), and so on. Take the whole area of machine learning, with machine learning as a statistical classifier and support vector machines as an exemplar. Just to show you: in the past, when a radiologist looked at the human brain in dementia, he was capable of saying it's atrophied or it's not atrophied. He was not capable of saying whether the atrophy was normal for that age or not. He was not capable of predicting whether the patient had dementia or didn't have dementia without that knowledge. He was not capable of making a differential diagnosis of the different causes of dementia. Here, by taking an exemplar group of people who had died, who had had MRI scans beforehand, and who either were completely normal (no changes of Alzheimer's disease, even though they were very old) or had only Alzheimer's disease, and using those to train a support vector machine, we've been able to improve correct classification of a patient's condition from an error rate of 35% in the best hospitals in the world to an error rate of around 5%, from one single scan, rather than a week's worth of work-up in the hospital. This was the clinical score, and this was the single-scan score. We're now in an extremely interesting situation where informatics is refining our diagnostic analyses as well. If we take the Alzheimer's Disease Neuroimaging Initiative data, the ADNI data, where scans are classified as Alzheimer's or normal controls using clinical knowledge, sometimes with genetics, and then look at the same scans using the support vector machine on a single scan, we see that there are inconsistencies in the results, with sensitivities of 75% and specificities of 85% for the clinical ADNI data set, which must be one of the greatest data sets ever created in clinical neuroscience. So we can begin to really explore what's happening in life, in the human, at 1-millimeter-cube spatial resolution, using techniques that draw on more and more refinements from informatics.
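To make the single-scan classification idea concrete, here is a minimal sketch of a linear support vector machine of the kind described, using scikit-learn. The feature matrix is synthetic stand-in data; the real work trained on scans from a post-mortem-confirmed cohort.

```python
# Minimal sketch of single-scan dementia classification with an SVM.
# X stands in for per-voxel grey-matter features (one scan per subject);
# y is the pathology-confirmed label. All data here are synthetic.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_per_group, n_voxels = 30, 5000         # tiny compared to a real scan
X = rng.normal(size=(2 * n_per_group, n_voxels))
y = np.repeat([0, 1], n_per_group)       # 0 = normal, 1 = Alzheimer's
X[y == 1, :200] += 0.4                   # fake atrophy signal in one region

# A linear kernel is the usual choice when voxels far outnumber subjects.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
acc = cross_val_score(clf, X, y, cv=5).mean()
print(f"cross-validated accuracy: {acc:.2f}")
```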
Now, I keep going on about these refinements, which brings me to my third motivation: the discovery of the causes of disease. Be it psychiatric disease or the dementias, we do not know what the causes are; mostly we've been practicing lamppost neuroscience. Let me give you an example of that in dementia, as I've mentioned already. If you take the pathology as the endpoint, we make in life a 35-40% error rate, in the best hospitals, with a week's work-up, in the diagnosis of Alzheimer's disease as a cause of dementia. And we don't even know the cause of Alzheimer's disease. We find things in the brain, and some of them are very easy to find. There's a thing called amyloid, which you hear about even in the popular press. The nice thing about amyloid is that you take a dye called Congo red, you pour it on the brain, you wash it off, and you can see the amyloid; you hardly need a microscope, anyone can do that. So amyloid becomes the center of everyone's focus of interest. It's probably a tombstone. We've tried to get rid of it; nothing happens. We've done it in animal models and in humans. We've killed a few humans; we've killed lots of animals. It doesn't seem to be primarily an amyloid problem. Ten years of work, fifteen for some; the theory's been around for a long time; ten billion, if not more, invested by pharma. We can't go on like this. Well, perhaps we can, because this gentleman was demented when he ran the world. Very interesting. Go to the social sciences, who do data mining of text the best, much better than we do. When they do data mining of the text of his speeches, it's quite clear that there are lots of incongruities that change over time as he becomes demented, as he says silly things, as he stops constructing sentences properly. Why aren't we doing similar things with some of our biological variables? We have EEGs. We have physiology. We have scanners. We have blood tests. We have genetic tests. We have deep exome screening, et cetera, et cetera. What do we do with them? We ask someone with experience to interpret them in the light of what the patient says. You know what the patient says when he's demented? He says he's fine, thank you very much. He doesn't know he's demented; that's how they present. It's the husband or the wife who says the patient is going off their rocker, and the patient says: well, I don't know why I'm here, she brought me. So the whole problem is a profound problem, and we do need new ways of tackling it. It boils down, in the end, to a confrontation of scientific paradigms. Many of us have been brought up in the Cartesian model, a nice top-down model where you have a mental theory. We think in four dimensions; I just remind you that a single brain scan has 100,000 dimensions, so there's no way we can conceive of it. But we conceive a theory of the cause of schizophrenia, or whatever interests us; we mathematically express a model of it; we confront it with relevant data; we tweak the parameters to optimize the model; and we end up with a model which we consider a fact. What we are now potentially capable of doing, using a bottom-up simulation approach, is to take multimodal, multivariate data, mine them to demonstrate structures of correlations or classifications, mathematically express those structures, explore them to generate hypotheses, and investigate those hypotheses using the classical method. Then we would have knowledge, and with knowledge, in the context of a relevant global theory, we would have a way forward. So how are we doing this with the medical informatics platform? The first thing is that we believe there is a very large amount of data out there. Every hospital has a very large amount of data, highly protected, protected for privacy, protected against corruption, standing in servers or data stores for the benefit of lawyers, with an estimated 8% benefit for patients: those who come back with the same illness, where their previous results are relevant to their treatment or follow-up. And we are spending a very large amount of money on that.
All the people involved, and the hardware involved, are doing nothing more with it. So somehow we have to federate those data; then, because they come from all sorts of areas (behavior, neuropsychology, brain imaging, genetics, proteomics and so on), we have to integrate them; and then we need to mine them, to do some causal modeling on the basis of previous knowledge, and to simulate, so as to give us a new catalog of disease definitions. Not a catalog based purely on symptoms and signs, because some people can't tell you their symptoms and signs, and there are some pretty lousy doctors around, some of whom can't write and some of whom are demented as well (including me, I was going to say). On the other hand, we have a vast amount of data about the biological features of these diseases that we are not using in any systematic fashion. So we want to bring those two together, to give us insights from these more systematized pieces of information: new drug targets, and cohorts for clinical trials which are much purer than what we have now. Just imagine you're a pharmacologist and you've got a drug which you think treats Alzheimer's disease. You go and find a group with Alzheimer's disease; I've already told you your error variance, there's already 35% error before you even start. Then you go and look for a control group, age-matched, everything matched, but you don't know whether they are compensating or not; in fact, 20% of them already have Alzheimer's changes. So the error variance is massive, and you're going to need a massive trial, which is fine if your drug is going to work; but if it doesn't work, you've lost a lot of money and your reputation, and when you go looking for another drug target you won't know why you're looking here or there or anywhere else. Hence these definitions, which we call biological signatures of disease. Now, what are the data sources that we have, and what are the challenges we've had to tackle? We have two sets of data. We have clinical data, which is of very poor quality and very, very large in volume, sitting there highly protected. And we have research data, which is of small volume and of very high quality in many cases, sitting there waiting for you to win your Nobel Prize, because you're protecting it. Thank God for the Allen Institute, which publishes its data as they come off the measurement rigs. So: we have hospital databases, neither complete, structured, standardized nor clean, protected for privacy and protected against corruption. We have research databases that are protected culturally; that's by us, because we're paranoid. And we have pharmaceutical databases, which are protected commercially. Let's think about each of those. You can deal with privacy absolutely by aggregating data. So, as a first strategy, we couldn't care less about individuals today: we're going to try and get knowledge about groups, and look at tendencies and trends across the population rather than at individual points. In that situation we can depersonalize and aggregate data, and then it is impossible to come back to the individual. In terms of consent, it's becoming more and more clear that the public wants effective clinical research and is agreeable to the notion of broad consent for research on things like samples of blood, genetics and so on,
as long as they know it can't come back to them personally. Once your blood has been taken, your cells have been taken, your scan has been done, people couldn't care less whether you use it for ideas you had before you took the samples or for ideas you have subsequently. That's been repeated in many societies and in many places, and it is becoming accepted. And then there's the management of ethics questions and so on, where we have a very strong network of local ethics committees, already in place, which gives a lot of value and credibility to science in the eyes of the population. So what's the solution? The solution is to keep the data where they are (someone has already mentioned this at some stage today): keep the data where they are, keep them under the control of the people who control them now, the hospitals, the scientists and so on, and see what you can do with that. Every hospital and every scientist has a database, and they hate to think that it might become corrupted or stolen, so they keep an archive off-site. We want another archive copy, in real time, of every hospital database. This is ambitious, by the way; we're going to start slowly. That archive will be pre-processed: denoised, standardized, normalized anatomically and normalized numerically. It doesn't matter, as long as there are lots and lots of data, whether it's all precisely structured, because the noise will disappear under the weight of the data. From that archive we will select, on the basis of queries, the data to be used in each research experiment. So the data will all remain on-site; the original database will be neither copied nor moved nor corruptible. The data will be selected on the basis of queries and transferred into a holding position, where they will be aggregated, encrypted and filtered for anonymization once more, and will then go, under secure transfer, to a unified portal where the experimenter will receive the result of his or her query. The data will be used to refine disease diagnosis, to visualize disease scientifically, and for further analysis of a public health or epidemiological type. When the answers are obtained, they will always be associated with provenance, an archive of everything that's been done to the data, where they came from and so on; that provenance history will be stored so that the analysis can be redone if required, to show that it's reproducible, and the data that came out of the hospital, even though aggregated, will be deleted. So this is the informatics platform as it has been conceived, and as it has now been put into practice at the CHUV in Lausanne, from A to Z. The team at the moment is writing the software to make that single line of algorithms scalable to multiple sites. It's been done in principle, in the sense that the original data have been put on four different servers and the system has been made to run across four different sites, with all the associated problems of moving data across very large distances and from lots of different areas.
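The query flow just described (a query fans out to hospital nodes, each node aggregates locally and returns only group-level results, the portal merges the answers and records provenance) might look, in toy form, like this. Every class name and threshold here is a hypothetical illustration of the shape of the protocol, not the platform's actual code.

```python
# Toy sketch of the federated-query idea: raw records never leave a
# hospital node; only aggregated, suppression-filtered counts come back.
# All names are hypothetical; a real deployment adds encryption, auditing.
from dataclasses import dataclass, field

MIN_GROUP_SIZE = 10  # suppress groups small enough to risk re-identification

@dataclass
class HospitalNode:
    name: str
    records: list            # stays inside the hospital firewall

    def run_query(self, predicate) -> dict:
        """Aggregate locally; return a count, never the records."""
        n = sum(1 for r in self.records if predicate(r))
        return {"site": self.name, "count": n if n >= MIN_GROUP_SIZE else 0}

@dataclass
class UnifiedPortal:
    nodes: list
    provenance: list = field(default_factory=list)

    def query(self, description: str, predicate) -> int:
        results = [node.run_query(predicate) for node in self.nodes]
        self.provenance.append({"query": description, "results": results})
        return sum(r["count"] for r in results)

# Usage: count patients over 75 with a dementia label, across sites.
sites = [HospitalNode("site_A", [{"age": 81, "dx": "dementia"}] * 30),
         HospitalNode("site_B", [{"age": 70, "dx": "stroke"}] * 40)]
portal = UnifiedPortal(sites)
total = portal.query("dementia, age > 75",
                     lambda r: r["age"] > 75 and r["dx"] == "dementia")
print(total, portal.provenance)
```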
As for data collection, so far we've got quite a lot of laboratory results and patients, as you can see: a lot of clinical data with a lot of textual diagnostic labels. We have all the data from ADNI; the Three-City database, a large general-population study in France; two further German projects currently in discussion, the LIFE project and the German FTLD network; and then a lot of clinical trial data, from CHUV patients and also from a drug company, Sanofi. When it was explained to them that the data from the placebo arms of their failed trials, which are therefore not contaminated by a chemical (though they can be contaminated by the state of mind of the placebo), have no commercial value and could be given readily, they gave us these data, and that is proof of concept that this can actually happen. So we have a large amount of data which we're playing with, and we're recruiting hospitals: the National Health Service of the United Kingdom is joining with its Institute of Psychiatry and Institute of Neurology; the Pitié-Salpêtrière hospital in Paris is joining; we're going to approach a hospital in Germany for its psychiatric population, which is massive; and the Niguarda Hospital in Milan, a large northern Italian hospital in a major university. Those five or six university hospitals will be the first network; when we can prove, in principle and in practice, that that network functions, we will open the whole thing up to commercial exploitation. So what will the researcher's perspective be? They'll sit at the unified portal, accredited; there will be security in place and an entry point. The researcher will see all the data as though on one server, though they are actually distributed throughout, just as I do when I use Google. Through the federation client at each hospital, the federation will send the query out to all the hospitals; they will pick out the relevant pieces of data and send them back to the unified portal, where the data can be visualized and analyzed, or diagnostics can be performed. The store on which all this depends sits in each hospital, within its firewall, controlled by its information safety officer or controlling officer or whatever you like. The store involves extracting information from the primary data: anonymizing, converting, creating raw data files with which the data federation query client will interact directly, never putting them into a database. We couldn't load all these data into a database system, because of the many files and the many formats, the need to integrate with legacy software, the privacy-related limitations I've talked about, and the fact that the data become owned by the database system when you load them in. So we're using a new concept, developed by my colleague Anastasia Ailamaki at the Ecole Polytechnique Fédérale de Lausanne, called NoDB, where the queries run in situ on the raw data files, directly over the files. This is a sort of data virtualization and harmonization in situ, performed by the queries, something which is in principle similar to what Google does, though quite different in many details. The data don't have to be moved or copied; large collections of files can be integrated; if one node falls down, the rest continues, just as one site on the web falling down doesn't mean the whole web falls down; and it can accommodate multiple data formats. Doctors won't necessarily have to change their way of working.
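NoDB itself builds positional maps and adaptive caches inside a proper query engine so that repeated queries over raw files approach loaded-database speed; the sketch below shows only the basic in-situ idea of filtering raw CSV files where they sit, with no load step. The file pattern and field names are made up.

```python
# Bare-bones illustration of in-situ querying: filter raw CSV files
# in place, with no database load step. NoDB proper adds positional
# maps and caching so repeated queries approach loaded-DB performance.
import csv
import glob

def query_in_situ(pattern: str, predicate, columns):
    """Stream matching rows out of raw files; the files are never modified."""
    for path in glob.glob(pattern):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if predicate(row):
                    yield {c: row[c] for c in columns}

# Hypothetical usage over exported, anonymized hospital files:
rows = query_in_situ("exports/labs_*.csv",
                     lambda r: r.get("test") == "CSF_amyloid",
                     columns=["patient_hash", "value"])
for r in rows:
    print(r)
```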
Scientists can continue to be paranoid if they wish, but they won't have access to the network if they don't give up their data. The contract will be: you join, you give up your data, and you get access to everyone else's data. A win-win situation. The patients will have doctors, and they will have a contract; the doctor will say to the patient: I'm going to send your data set, your vector, to the new way of diagnosing brain disease, and we'll get back some ideas as to what you're like. Do you agree? That's what they do now; doctors will continue to work like doctors. So the real issue now is whether, with all these external data, we can be comparable with database performance: the queries will need high performance while retaining the data formats, the files and the scripts, using this NoDB concept. Here are some preliminary results on execution time, for 1 to 150 queries over CSV or JSON data. As you can see, the loading into the database, the locating within the data store, and the time used for flattening are estimated in different contexts: these are classical contexts, these are slightly less classical, more advanced contexts, and this is NoDB. You can see there is a major, almost tenfold, improvement in speed, so we've got to a stage where we think we're in business. So what are we going to do with these data, in the last 10 minutes? The first thing we need to do is to try and get these disease signatures right, and let me get you to imagine what that means. Imagine the brain disease space: highly multi-dimensional, with about 300 diagnostic labels in neurology and about 250 in the DSM-5 in psychiatry, so there are at least that many dimensions, and probably a factor of 10 more, because there's a lot of lumping going on. So we need a lot of hospitals and a lot of databases involved, and there needs to be a real-time, continuous, iterative data mining procedure going on in the background. This will generate the new multi-dimensional diagnostic catalog, which is what the doctor will compare your disease vector with, in order to find where you lie in that disease space. This is not something for people who ask: will you just be studying Parkinson's disease? Of course we have to study all diseases.
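One very reduced way to picture "comparing your disease vector with the catalog" is nearest-centroid matching in a normalized feature space. The sketch below is purely illustrative: the features, the signatures and the distance measure are all invented here, and this is in no way the platform's actual algorithm.

```python
# Toy picture of a disease space: each catalog entry is a centroid in
# feature space; a patient's vector is matched to the nearest one.
# Features and numbers are invented, for illustration only.
import numpy as np

features = ["hippocampal_volume_z", "frontal_PET_z", "apoe4_dose"]
catalog = {                       # hypothetical biological signatures
    "Alzheimer-like": np.array([-2.0, -1.5, 1.0]),
    "vascular-like":  np.array([-0.5, -0.8, 0.0]),
    "normal-ageing":  np.array([-0.3,  0.0, 0.0]),
}

def place_in_disease_space(patient: np.ndarray) -> str:
    """Return the catalog signature nearest to the patient's vector."""
    return min(catalog, key=lambda k: np.linalg.norm(catalog[k] - patient))

print(place_in_disease_space(np.array([-1.8, -1.2, 2.0])))  # Alzheimer-like
```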
The trick here is that the messier the diagnostic labels, and the more diseases there are, the better: the further you draw them apart in the disease space, the bigger the contrasts you get, and the bigger the contrasts, the better. It's a concept that's quite tricky to get across. Then there's data visualization: standard things, graphs and so on, ideal for epidemiological research and for health services research, depending on individual questions over the whole set of data. And finally, classical research: the research that will arise from data mining of this type, on subsets of the diseases, generating objects which can then be related to this issue of where the disease has its greatest impact in space, at what level, and so on. This has been done very simply by a postdoc, in half a day, in the Blue Brain Project lab, again on the National Library of Medicine corpus, in this case all papers that mention both a genetic aspect and a brain disease. You can't see this well, but what it shows is that, with half a day's work, you can show that the diseases in these catalogs do cluster; they cluster in relation to the genetics, which are the little spots, and the clustering makes sense. The image itself doesn't make sense, because it's in two dimensions and it's representing a very multi-dimensional space, but it makes sense in that it shows that clusters occur. So the power of data mining in showing up these clusters, on which hypotheses can then be built, is very major. And we have some preliminary data for data integration: a whole set of brain imaging, clinical scales and measurements; 500,000 to a million SNPs on some of the patients; proteins in the CSF and the blood in some of the patients; MRI data, PET data, gene data, CSF protein data, clinical data, in over 5,000 controls and almost 1,000 patients with dementia (sorry, this should say dementia). We've used phenotype-led, that is dementia-type-led, semi-supervised clustering; we've used biologically led clustering; and we've used completely random-start data mining with high-dimensional feature learning, in order to try and cluster them. And this is the preliminary result. The first thing you see is that there are a lot of clusters. The red clusters are people who are clinically demented, so those have the syndrome of dementia. The big cluster in the middle looks like it's Alzheimer's disease, because to be in that cluster you have to have the MRI and PET pattern of loss of brain function and structure in the frontal, parietal and hippocampal regions, you have to have ApoE4 in your genotype, and you have to have the amyloid precursor protein gene in your genotype. The blue ones are all normals, and as you see, there are different groups of normals, which makes sense if you accept, and you have to accept, that there are normals who are normal but have Alzheimer's changes, therefore compensated normals; that there are normals who age rapidly; that there are normals who age slowly; and so on. We're also beginning to see associations of genes specifically with one group and not with another, and haplotypes with one group and not with another. Now, there's a big caveat here, which is that there are only 5,000 data points. To do real data mining you need a million, two million, three million: what you will get when you federate and integrate the whole set of data that are available to us through our socialized healthcare systems in Europe. So this is the way forward.
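The phenotype-led, semi-supervised idea (a handful of clinically labeled subjects guiding the grouping of a large unlabeled mass) can be sketched with scikit-learn's label propagation. The data here are synthetic two-cluster noise; the real analysis used multimodal imaging, genetic and CSF features.

```python
# Sketch of semi-supervised grouping: a few clinically labeled subjects
# (dementia vs normal) propagate labels through the unlabeled mass of
# multimodal feature vectors. Synthetic data, illustrative only.
import numpy as np
from sklearn.semi_supervised import LabelPropagation

rng = np.random.default_rng(1)
n, d = 600, 20                        # subjects x multimodal features
X = np.vstack([rng.normal(0.0, 1, (n // 2, d)),   # "normal"-like cloud
               rng.normal(1.5, 1, (n // 2, d))])  # "dementia"-like cloud
y = np.full(n, -1)                    # -1 marks unlabeled subjects
y[:10] = 0                            # ten clinically labeled normals
y[n // 2:n // 2 + 10] = 1             # ten clinically labeled dementias

model = LabelPropagation(kernel="knn", n_neighbors=10).fit(X, y)
labels = model.transduction_          # inferred labels for everyone
print("inferred dementia-like subjects:", int((labels == 1).sum()))
```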
Visualization can take a number of formats. This is another format, which concentrates on the structural side and associates it with different patterns. You can see how some diseases appear to have satellites, while other diseases appear to be very solitary, like this one, with just a number of different types associated with it. So there is information here, a lot of information, which bears looking at. And then, in this demo, you can see how you can begin to bring the clinical data together with that. These are categories from the DSM-5 catalog of diseases, and as we go through, you can find the categories and show how many you've got among the patients: there were 57,000 patients there, and you saw 9,200 with vascular disease. These are the major categories, these are sub-categories, and these are the smallest types of categories. You can type in the name of a disease and see how many cases of that disease are present: 154 dementias that year at CHUV, 57 of which were vascular dementias. You can start plotting the males and the females with the different categories found in the notes. One then has to re-cluster them according to the biology; we will need the biological information as well. Here you have the display over age (here is 70, here is 60), and here you begin to look at the associated clinical features: hypercholesterolemia associated with high blood pressure, associated with other things, in these patients. There's one very interesting one, where you see early Alzheimer's disease coming up with urinary tract infections. Fascinating: people with just early dementia, if they get an infection, usually decompensate with a bump, and then they come back to their compensated state. So these sorts of associations can be readily visualized, and I won't go into that any further. The critical things that remain to be done are how to extract the data, and what sort of features to extract. For something as complex as this, which has 100,000 different data points, should one use all those 100,000 points, making the computations more complex, or should one extract features such as individual areas? All sorts of issues will have to be sorted out for data integration to be successful and efficient. So the integrated view of the medical informatics platform is that we have work packages that deal with: federated data management, its acquisition and its federation for data query and data capture purposes; integration and operation with users; community outreach, hospital recruitment and ethics, which are critical to the workings of the platform itself; the medical intelligence tools for data categorization and for the mining, with the curation of the data; data workflows to relate data from one country and one hospital to another in terms of ontologies; and the scientific coordination and management of all of this. That's been going for 20 months now; we've got another 14 months in the ramp-up phase, and we hope to be there with the five hospitals at the end of that. What I've described to you is this one; there is the future computing and the future neuroscience to come, and all of these will manifest themselves as six open platforms, open to everyone, anyone, as long as you buy into the contract: I give, I get. So this is something to keep an eye on. We interact with the other major brain initiatives, and we interact with some of the major institutes, like the Allen. In the end, the final issue is how to construct a blueprint for the brain at all spatial scales by simulation: the ultimate connection set, the ultimate connectome, because it includes everything, not just structure, not just function, and so on.
It speaks of the ultimate connection set, from base pairs to cognition, and it will give us the link between the medical informatics platform and the neuroscience platform, which will be generating constraints and configurations at different levels of organization, be they structural or functional, in normal humans. We will be bringing in abnormal values from the disease states, supplanting the abnormal values into the normal model, seeing how the model then predicts what happens, and seeing whether that correlates with the clinical presentation: a sort of reverse engineering, in terms of validation. So I think I've come to the end. I thank you very much for listening, and I thank my two colleagues who generated this program, to which I contributed the medical side, Henry Markram and Karlheinz Meier; the group that runs the medical informatics platform; the imaging component, with Bogdan Draganski; the computing component, run by Anastasia Ailamaki; and the data handling and data analysis, run by Ferath Kherif at EPFL and the University of Lausanne. Thank you very much indeed.

(Question from the audience.) Go ahead. Thanks very much for this wonderful talk. I want to start by picking a bone with you on the consent aspect of patient data. We've actually done some work on that: we worked for several months with lawyers and ethicists, and we polled many different data sharing projects, to create the ultimate consent form. We can present you with text you can just copy-paste into your consent form, to make sure that in the future you'll be able to share your data publicly, for any retrospective and prospective uses. You can talk to me later and I can give you the link to the consent form. But I've got two questions related to what you were talking about at length.

(Who has done this? The INCF?) The INCF was helping with this work. (Then advertise yourself fully.) My name is Krzysztof Gorgolewski, I work at Stanford, and I worked with the INCF data sharing group. Okay, so, names aside: I agree that differential privacy is the way to go, and I don't see a different way of doing this, but there are some limitations, and I'm curious how you're addressing them. First of all, the aggregation process will limit what kinds of algorithms we'll be able to use; in fact, only a few machine learning approaches have been adapted to differential privacy, though there's continuing work on that. And the second, maybe more important question... but let me ask them one by one.

So, the answer to your first question: this is an ongoing research project which has high ambition to innovate, and I absolutely agree with your statement. We have a whole team working on data mining algorithms and how they must be adapted. The top priority is data privacy. Why?
One country, or two or three countries if not more around the world, have in the past killed members of their populations because their medical records stated either their religious belief or their racial origin. So there are some countries that are extremely, extremely worried about anything coming out. In fact, there's one country recently involved with a pilot who crashed his plane with 150 people aboard, and there are still lawyers in that country who say it was more important that his medical privacy be maintained than that his employers be told, to prevent 150 people dying. So these are very significant and serious issues, which we have to listen to and take care of. But there are things which need to be resolved. We are moving forward like a phalanx, not like a single spearman, so the data mining people have to solve these problems in parallel. We're working like the brain.

Okay, excellent. So my second question is more of a maintenance and sustainability question. In this approach you put a lot of demand on the hospital side: they have to maintain the infrastructure for aggregating the data and serving it to the next layer. Who's actually going to maintain this? Who's going to pay for it? How do you approach the hospitals?

So this is clearly a very significant issue at the beginning, and the principle underlying our strategy is as follows. The hospitals themselves should have responsibility for switching access on and off, in order to sustain their absolute responsibility for the integrity and the privacy of the data; they remain responsible for what they have, in exactly the same way as now. They will need to buy a server to serve the Human Brain Project, and with our help in the first instance, for the first five or six hospitals in an R&D phase, we will co-fund the installation of the initial software, which will come from us as an academic product, the best academic product. When we show proof of principle, this will be made available to entrepreneurs, who will have to transform it into industry-grade software, get certification from the FDA and everyone else, and then market that service and its installation in hospitals. So that will be a great wealth generator and a great job generator. Next, everything will need to be kept up to date, probably more than we are with our iPhones: there will have to be an up-to-date set of apps. So there will be a second business opportunity, an HBP app store, and everything will run as much as possible on the principle of apps which update as the software is developed on the academic side, transferred to industry standard, and then brought into the hospitals. So this is a process which is not just academia-dependent; it's academia-and-industry-dependent. The only people who have done that so far are Facebook, Google, Apple and so on, and there's no reason why not, given that they've done it; for once, we're really pissed off that we're not first. But there's one thing: I only talked for 40 minutes and you gave me 50. I looked on my computer; it was only 38 minutes and 56 seconds. The next speaker, unfortunately, was not able to attend.
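A side note on the differential privacy point raised in the questions: the simplest mechanism compatible with an aggregate-only federation is Laplace noise added to released counts. The sketch below is a generic illustration of that mechanism, not something the platform has stated it uses; the epsilon value is arbitrary.

```python
# Minimal sketch of a differentially private count release, the kind of
# mechanism that fits an aggregate-only federation. Epsilon trades
# privacy (smaller) against accuracy (larger); the value is arbitrary.
import numpy as np

def dp_count(true_count: int, epsilon: float = 0.5) -> int:
    """Release a count with Laplace(1/epsilon) noise. One individual's
    presence changes a count by at most 1, so the sensitivity is 1."""
    noise = np.random.default_rng().laplace(scale=1.0 / epsilon)
    return max(0, round(true_count + noise))

print(dp_count(154))  # e.g. a per-year dementia count, released with noise
```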