Well, I'd certainly echo the first speaker in saying that this is just an impossible task. Looking two years ahead in bioinformatics is almost impossible, but 20 years is quite mind-blowing. And so I've had a difficult few days worrying about what I was going to say here, so anyway I'll do what I can. What I'm going to do is first of all start with a little bit of crystal ball-gazing about where we might be in 2020. And central to everything is the data, the parts if you like that make up the systems we study, and the interactions between them. And we need all of that sort of information in databases.
And clearly that's all generic for the whole population, but we also need the information on individuals, possibly first as sort of representatives, but I suspect by 2020 we'll be talking about lots of individual data on individual people. So by 2020, I think that the parts dictionaries will be more or less complete. We'll have the key genomes, we'll have most of the SNPs on those, and most of the promoters and regulators I think will be identified on those genomes. We'll have found all the genes, I hope, by then and got them all annotated. We'll know what the proteins are, including all the splice variants that come out of them. We'll have structures, certainly I think, for most of the domains, perhaps not quite all of them; some of them may be very difficult to get. And we'll have structures of some of the complexes, though probably not all. We saw some of the complexes this morning, the transcription factor complexes. Those are going to be very difficult, and again the combinatorics inflates the number of structures that you need. But in terms of the basic structures I think we'll have most of those sorted out. And I think we'll know most of the small molecule composition of the cells that we're looking at. So all of those parts we'll have to play with; that will be the basic thing. So I thought, just as an exercise, we would look at how we should project these. And I'm sure you've all done this sort of thing, plotting the year against the amount of data. This is the cumulative number of base pairs in GenBank. And of course it's gone up immensely quickly over the last two years. And if one extrapolates that to 2020 you can see there's a huge variation, depending on how you extrapolate, in what you might end up with. You can end up with about 10 to the 15 base pairs, or you can end up with about 10 to the 21 base pairs.
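As an illustration of how such an extrapolation works, one can fit a straight line to log10(base pairs) against year and read the projection off the fit. The yearly totals below are invented for the sketch, not actual GenBank figures; which decade you fit changes the 2020 answer by orders of magnitude, which is exactly the point.

```python
import math

# Illustrative cumulative base-pair counts by year (invented, not GenBank).
history = {1995: 3.8e8, 1997: 1.2e9, 1999: 3.8e9, 2001: 1.4e10}

# Least-squares fit of log10(bases) vs year: an exponential-growth model.
years = sorted(history)
logs = [math.log10(history[y]) for y in years]
mean_year = sum(years) / len(years)
mean_log = sum(logs) / len(logs)
slope = (sum((y - mean_year) * (l - mean_log) for y, l in zip(years, logs))
         / sum((y - mean_year) ** 2 for y in years))

def project(year):
    """Projected cumulative base pairs if exponential growth continues."""
    return 10 ** (mean_log + slope * (year - mean_year))

projected_2020 = project(2020)   # around 10**15 for these invented inputs
```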
The total number of human base pairs, if everyone on the planet was sequenced, is that red triangle in the corner there. So potentially by 2020 we could have complete sequences of everybody on the planet if we wanted them. And I'm sure this isn't the way that it will happen, but nevertheless, if we just look for variations, we may well be able to have something like that available to us. The structures, the three-dimensional structures, have actually gone up much more slowly. At the moment we have about 16,000 structures in the PDB, and again one can extrapolate, and again it depends how you do it, but we end up with about 10,000 to 100,000 protein structures, which is a lot less than we will probably need. Now again this depends; we did try to factor in, you can see there's a little blip at about 2005, and that was supposed to be the structural genomics projects with their current estimate. And that little blip doesn't do very much for us actually, as you will see. But as with the genome sequences, Francis showed very nicely yesterday how in fact we achieved much more than we thought, and I hope that the same will be true of the structures. But it's very different in the structural world, because the technology isn't there. It's not just a case of scaling up, which I think it was for the genomes to some extent. A lot of the technology has still got to be developed, so it may well take longer. The other problem of course is that although we've got lots of structures in the PDB, what this shows is that the number of unique structures at the bottom is actually quite small, or the number of representative structures, shall we say, for the given families. But that's the bad news; the good news of course is that there actually aren't that many protein families.
It's clear that there are of the order of a few thousand probably at maximum, and that the huge variation that we see in biological function comes about by mixing and matching these domains together in some way. So last year most of the structures that were determined looked similar to one that had already been determined before. That's not to say that they don't have new information in them; they clearly do. They might have a drug that's bound, or something like that. But it's clearly the sort of parts that we have to play with. And in fact most of biological variation has come from individual proteins radiating out and doing multiple functions rather than evolving whole new protein folds. And this is actually a plot taken from one of John Moult's papers. Sorry, is that better? Yeah. It suggests that you can have a clever strategy for deciding which protein structures should be solved. And if one cleverly picks which ones you want to do, and actually it's not that clever when it comes down to it, it's quite straightforward, you find that you only need to determine about 10,000 structures to have reasonable models for most of the domains. I think the issue of how these domains all pack together is a different one, and I'm not sure that just by having the structures of the individual domains we'll be able to construct the whole thing. In fact that's one of the challenges that I think we face. So that's what I think of as the hard data. Once we've got that, it is wonderfully finite, which is very consoling, I've found it very consoling anyway, that there is a limit to how much of that data there is. But of course that's the beginning of everything, not the end. And hopefully, we already have a good taxonomy at the NCBI, and I think clearly by 2020 that taxonomy, with all the new sequences, will become really very stable and we'll be confident about it. Now I might be wrong about that, but I would hope so.
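The clever-but-straightforward target selection described here can be sketched as a greedy set cover: keep choosing the protein whose domains cover the most families that still lack a model. The proteins and family names below are invented for illustration.

```python
# Hypothetical candidates: which domain families each protein would give us.
candidates = {
    "P1": {"kinase", "SH2"},
    "P2": {"SH2", "SH3"},
    "P3": {"kinase", "TIM-barrel"},
    "P4": {"SH3"},
}

def pick_targets(candidates, wanted):
    """Greedily choose proteins until every wanted family has a model."""
    chosen, covered = [], set()
    while not wanted <= covered:
        best = max(candidates,
                   key=lambda p: len((candidates[p] & wanted) - covered))
        gain = (candidates[best] & wanted) - covered
        if not gain:
            break  # remaining families cannot be covered by any candidate
        chosen.append(best)
        covered |= gain
    return chosen

wanted = {"kinase", "SH2", "SH3", "TIM-barrel"}
targets = pick_targets(candidates, wanted)   # three proteins cover all four
```

Greedy cover is not optimal in general, but as the talk notes, in practice quite simple strategies already get close.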
It's also clear that we're going to build up our dictionary, if you like, of protein domain families. So we'll have on the shelf a book, or on the computer the equivalent, that tells you all about the individual domain families: what they do, what they look like, what their properties are, perhaps who they interact with, which networks they're involved in. All of those sorts of things will have been brought together. So we'll have a much better view, I think, of the world of proteins and how they perform the biological functions that they've got to do. And from that we can go on and do classifications, as we've done already, and many other people as well. At the moment we're just looking at a slice and extrapolating to the future; I think we'll find out whether that slice was a good representative or not. And as I say, the number of protein families that we can expect is really quite small. So I think we will have these parts really determined. But gene expression of course is a completely different thing. And it's different for various reasons. One is that at the moment we're still getting the data and we don't have it; there isn't 30 years of experience behind us. But the other is that it isn't hard data in the sense that a sequence is hard data, because it depends on what the conditions are and the environment. And so in a sense the expression data is just infinite. And you get out these sorts of plots, and then you've really got to start all of the information processing. So the next 20 years I think will see the collection of expression data. And my suspicion is that there's a lot more data held in industry than there is in academia at the moment. There's a reasonable amount in academia but it's not really put together. And there's been a real goal to try to bring together all the information that you need for a microarray experiment.
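Computationally, a domain-family dictionary of this kind is just a keyed store of structured records. A minimal sketch, with invented families and annotations, not entries from any real resource:

```python
# Toy domain-family dictionary; families and annotations are illustrative.
families = {
    "SH2": {"function": "binds phosphotyrosine peptides",
            "fold": "alpha+beta", "partners": ["kinase domains"]},
    "TIM-barrel": {"function": "diverse enzymatic activities",
                   "fold": "(beta/alpha)8 barrel", "partners": []},
}

def describe(family):
    """Pull together everything the dictionary knows about one family."""
    entry = families.get(family)
    if entry is None:
        return f"{family}: not yet characterised"
    return f"{family}: {entry['function']} ({entry['fold']} fold)"
```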
And I think there's been international agreement on this, in the MIAME convention, to really say what the data is that you want to hold, and what part of the ontology we need to do that. And at the EBI, Alvis Brazma is developing ArrayExpress, and there are similar developments here, I think. And so there are going to be databases; by 2020 I think we'll all have access, hopefully, to very large databases on gene expression. And that will be just great, because from that I think we'll develop the possibility of doing the modelling, the simulation and the prediction. And I think really it's this area that is going to be the area of most intellectual challenge computationally. I hope that by then we'll have hands on the data, if you like. We'll have sorted out how we store it, how we serve it, how we distribute it, how everybody accesses it in the way that they want to access it. But in terms of actually understanding, I think that's going to require the development of whole new ways of doing modelling. It's very strange: when one's asked to look forward to 2020, you immediately look back 20 years and think where you were then and what's changed over the last 20 years. When I entered this field the big goal was to solve the protein folding problem, from sequence to structure. That was the paradigm that really drove everybody. And really we aren't very much closer to that than we were 20 years ago. And I think it really does emphasise that basic understanding is very hard to get. Data by comparison is easier, but actually interpreting it is very difficult. And I think it's here that having good robust models will make an enormous difference. And there are different sorts of modelling that you can do. You can do molecular modelling, with one protein and another protein and how you dock them together. So that's very much at the molecular level.
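A MIAME-style standard boils down to a checklist that a submission either satisfies or does not. A minimal completeness check; the field names below paraphrase the idea rather than quote the official MIAME checklist:

```python
# Sketch of a MIAME-style completeness check. Field names are illustrative
# assumptions, not the official specification.
REQUIRED = {"experiment_design", "array_design", "samples",
            "hybridisations", "measurements", "normalisation"}

def missing_fields(record):
    """Return, sorted, the required annotation fields absent from a record."""
    return sorted(REQUIRED - record.keys())

submission = {"experiment_design": "time course",
              "array_design": "yeast 6k",
              "samples": ["wt_0min", "wt_30min"]}
gaps = missing_fields(submission)   # the three fields still to be supplied
```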
But I think increasingly the interest will be in pathway modelling, process modelling, and modelling whole cells, organisms and organs. And of course the wonderful thing about modelling is that you start with your data. You understand the data, and from that you can generate a hypothesis. From that you can build a model, and from that you can make a prediction. That's the route you want to go along. And we're in this state now where there's just a flood of data. So the next steps are going to be working towards the predictions. And to do this I think we'll need, as we were talking about, to thieve from related disciplines. I think there's absolutely no doubt that we will thieve and take what we can. And it's very powerful: if you look in the engineering department at their systems processing and the way that they handle their pathways, when we were trying to do something on pathways they had a linear programming algorithm that could find all our pathway distances just like that, whereas our algorithms took miles longer. So we can borrow all of these, I think, to inform how we build our models. So there are different sorts of networks that I think we're going to have. We're going to have pathways, and of course we all think of the metabolic pathways: linear pathways, or occasionally circular pathways. But we know that these aren't really pathways; they're really very, very complex networks. And so we can't just model pathways in isolation, we have to worry about the whole network. And we have to worry about it not necessarily to include it all, but to know what we can exclude, which bits we can effectively cross out when we're trying to model any one part of it. And that I think is going to be a challenge that many of the modelling packages will have to be able to handle.
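Finding pathway distances of this sort is, in graph terms, an all-pairs shortest-path problem. A small sketch on a toy metabolic network; the reactions are invented for illustration, and real work would use a proper LP solver or graph library:

```python
# Toy directed metabolic network: each edge is one enzymatic step.
edges = {("glucose", "G6P"), ("G6P", "F6P"), ("F6P", "pyruvate"),
         ("G6P", "pyruvate")}
nodes = sorted({n for e in edges for n in e})
INF = float("inf")

# Floyd-Warshall: all-pairs shortest path lengths (number of steps).
dist = {(a, b): (0 if a == b else 1 if (a, b) in edges else INF)
        for a in nodes for b in nodes}
for k in nodes:
    for a in nodes:
        for b in nodes:
            if dist[a, k] + dist[k, b] < dist[a, b]:
                dist[a, b] = dist[a, k] + dist[k, b]
```

Unreachable pairs stay at infinity, which is itself useful: it tells you which parts of the network you can safely cross out when modelling one piece of it.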
We've also got the signalling pathways that we're only just beginning to uncover, and I think the next 20 years will see the uncovering of a lot of those, and with that will come the models and the modelling and hopefully the prediction. And of course the big challenge is to include the time dimension as well, to see if you can actually model the dynamics. We want to start, I think, with qualitative models: in effect, all the models that many of you draw, when you draw this-interacts-with-this-interacts-with-this. Those at the moment are not accessible computationally, and they have to be made accessible; we have to develop mechanisms for that. We've been going through the same thing with enzymes. We know a lot about possible enzyme catalytic mechanisms, but the only way you can find them is in the literature, drawn as these mechanism diagrams with their arrows on, and we have to find a way to capture that information computationally. Because if you're going to try, for example, to look at an enzyme active site and then predict what its catalytic mechanism is, then you need a way to do that in the computer, because it's much more powerful if you can learn from the data that you've got. So I think that this developing of models is really going to be a great challenge. I think it's also important to point out that models at different levels have different uses. Sometimes you need very detailed information and sometimes you need very general information. For me the best analogy is that if you want to know how many miles you can go, you can work out the miles per gallon quite easily for your car knowing nothing about what's inside, and that's very useful. In a cell you can have the same sort of very high-level modelling, and I think we have to choose at which level we want to model which process, and it all depends what data we've got and what we know we can ignore.
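One way to make a drawn catalytic mechanism computable is to store each arrow as a structured record that programs can query. The schema below is an assumption of mine, not an established format, using the classic serine-protease catalytic triad as the worked example:

```python
# Illustrative schema for catalytic mechanism steps; the field names are
# assumptions. The example is the well-known serine-protease triad.
mechanism = [
    {"step": 1, "residue": "Ser195", "role": "nucleophile",
     "action": "attacks the substrate carbonyl carbon"},
    {"step": 2, "residue": "His57", "role": "general base",
     "action": "abstracts a proton from Ser195"},
    {"step": 3, "residue": "Asp102", "role": "electrostatic stabiliser",
     "action": "stabilises the protonated His57"},
]

def residues_with_role(mechanism, role):
    """Query a mechanism the way you would query any structured data."""
    return [s["residue"] for s in mechanism if s["role"] == role]
```

Once mechanisms are stored like this rather than as pictures, one can start to learn rules relating active-site residues to likely mechanisms, which is the prediction task described above.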
So it's the same as it always was in astronomy and physics and atmospheric climatology and everything: you choose your model according to your data and build out from there. Of course one of the great things about having the expression data is that potentially you can reverse engineer gene networks from microarray data, and this is again a slide from Alvis Brazma: you take the gene networks, you get the expression levels from the microarray experiment, and from that you can hope to work out some sort of network. That network may mean different things; it may mean, as we've said already, A interacts with B, or A inhibits B. We don't know what it means, but it's the beginning of ideas, hypotheses, that can lead us forward. This is one of the networks that Alvis generated from the yeast data, and you can see it's just horrible. What we've obviously got to do is to develop the tools that will take this sort of experiment and put it into some biologically sensible, visually sensible and predictive form, so that we can say, well, if we change this enzyme, or if we interfere, because of course that's what you want to do, you want to stop some enzymes in pathways: if we stop this one, what will it do to the whole network? So that we end up with something more like this, and more biologically linked.
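A very simple version of this reverse engineering is to connect genes whose expression profiles co-vary across conditions. The profiles below are invented, and real methods go well beyond thresholded pairwise correlation, but it shows where the hypothesised edges come from:

```python
# Illustrative expression profiles (genes x conditions), not real data.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 7.9],   # tracks geneA closely
    "geneC": [5.0, 1.0, 4.0, 2.0],   # unrelated
}

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def infer_edges(profiles, threshold=0.9):
    """Hypothesise an edge between strongly correlated gene pairs."""
    genes = sorted(profiles)
    return {(g, h) for i, g in enumerate(genes) for h in genes[i + 1:]
            if abs(pearson(profiles[g], profiles[h])) >= threshold}
```

The edges say only "A co-varies with B"; whether that means interaction or inhibition is exactly the open question raised above.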
Actually, I've not mentioned evolution yet, which is obviously a great challenge. We have all these sequences and we can really begin to address evolution, and one of the things we've looked at recently is pathway evolution. We really have no concept, I don't think, of how pathways have evolved. We've worried a lot, and are still worrying, about how enzymes evolve, or proteins evolve, and how they change their functions during evolution, but the next step is to go to the pathways and the networks and ask how they've evolved and how they've changed between different organisms. And you can begin to address questions like this when you develop the tools, but at the moment, when we were looking at EcoCyc and E. coli, actually getting the information out is a nightmare, for all the different things: linking all the bits together, from the pathways to the sequences to the structures. And that really brings up another issue, that the integration of data is going to be... I won't go into this actually because it's too long. So that's a little view of where we might be in 2020, just at the academic level. I haven't actually talked about the medical implications, the links to medicine, and I think that is important, but I couldn't really tackle everything, and I wasn't sure what to say about that, and I don't know very much about it. I think it is interesting to look at the hardware and what we'll need computationally, and again one thinks about where we've come from during the last 20 or 30 years. One started with small machines; then people tended to have a big machine, but it was for the whole university and you had one terminal and you used it; and then you had both the lab-based machines and the big, big supercomputers, the Crays. The last two or three years have seen the introduction of farms, and biology has really moved from small compute to vast compute, and many of us have got quite reasonable-sized farms
very quickly. The biology and biotechnology industry will pull; that will be a main usage of compute, and many of us will compete with the astronomers and the atmospheric people. We're going to see huge demand for large farms, but I think that in turn will be replaced, hopefully, by a more sensible grid system. I don't know if you're familiar with the grid concept; the best analogy is probably that you put in your plug and you get electricity, but you don't know whether it's coming from a nuclear power station or a coal power station; it's just centralised. If one can develop that sort of system, whereby there are compute systems all over a country, or for that matter the world, and you can access them, then I think that becomes very attractive. It's being pushed by the physicists, by the particle physics community in particular, and so it will happen, because they've got lots of money as far as I can see, and I think that we will benefit enormously from that. But I think what we need is actually rather different, because I think the critical thing for most of us is actually the data grid, not the compute. Well, the compute grid probably is important, but the data grid is even more important, and it's the sharing of the data, these very large expression databases, that we need to share across continents, across cities and everything. We don't want to have to store them in every single place, and how that will emerge I really don't know, but I'm sure that there will be progress towards this more grid-based technology. The other thing that I think is important, that perhaps we as bioinformatics people haven't spent enough time on, is preparing the data and presenting it in a way that is useful for the biologists, the wet-bench biologists. This is absolutely critical, I think. People aren't going to want all the data; it's just mushrooming, it's just vast, and any one person can only conceive or understand
a little fraction of that biology, and they want to see all the information that they need to look at their problem, and at the moment that's still quite difficult. So for example, say you want to understand all about aging: what are the proteins that are thought to be involved in aging, what are the pathways that are thought to be involved, what do you know about aging in the different model organisms, how do you tie it all together? That is just critical. These views of the data for the bench scientists are something that we've got to address. And of course the parts databases that I've talked about are really the core data, if you like, but without the linkages out to these other databases that are listed here, I think we just lose half of the value of the information that we've generated. The literature is at the top of that list because I think it is the top priority: all the information about biology is in the literature, and if we don't successfully mine that literature and get it into a computer-readable, transportable, you know, mineable form, then we'll lose things. And that's immensely powerful; I think that's where computers can really help. The ontologies, getting the defined grammars, are just critical for the biology: as we heard before, A phosphorylates B, C sticks to D and interferes with so-and-so. All those verbs, if you like, about what happens, we need to define those; they aren't really defined yet. The disease information, the mutation databases, the location where things happen, the therapeutics and the compounds databases, I think in terms of translating for human health, are just critical, and so we have to link through to those databases. And ultimately one of the really attractive challenges, I think, is to try to understand toxicology: why, given a drug that interacts with certain proteins or protein families, is it specific or nonspecific? And we can only do that if we link this molecular pathway data through to the
toxicology data. And that's really, I guess, also saying linking it through to the one that I missed off here, which I shouldn't have done: the phenotypes, which is sort of saying understanding the links between the genes and the phenotypes. Just one example: trying to map the OMIM mutations onto protein structures at the moment is not a straightforward procedure, and that just has to be fixed. And it's not just the mapping; once you've mapped it, you need to know what it means. So this slide shows that these mutations cause acute porphyria and they occur in the active site of the enzyme, so you can understand what's going on; you need that information. In contrast, in Waardenburg syndrome it's a mutation to a DNA-binding protein that occurs in the DNA-binding site, so again you can understand that. And the last one is thought to be due to mutations in an interface that break up the dimer, so you get a deficiency in antithrombin. This sort of information, in one sense it's annotation, but in another sense it's the thing that really captures the understanding about what's going on. And the other, as I mentioned before: linking from the DNA through to the proteins, in terms of linking the bioinformatics and the chemoinformatics. So I think what I'm really saying is that you think of the genome, but the genome is really at the core of everything, and from this it's the links to the other disciplines, not just data, it's the other disciplines: the people who have dealt with the small molecules for years and years. We want to tie that in, so that we can look at a protein family and say these are the proteins that bind drugs. If we look, for example, at the human genome at the moment, there are only 155 drug-target proteins that are human out of the whole lot; that's 0.4% of the genome. There are probably about 3,000 close homologs which may be druggable, and a few more distant homologs. Does that mean that they're the only ones, or does it mean that those represent the pathways? If we can
find other proteins in the pathways, those are the ones that we want to go for. So you can see that in order to do this sort of work you need the whole lot lined up all together, so you can go in and extract that information straight away. Okay, so what are the grand challenges? I think there is a huge challenge to develop the biological ontologies, and this is one of those challenges that's right at the interface of computer science and biology. A computer scientist can't do it at all, because they don't understand the biology; a biologist who doesn't understand the need for ontologies and what the restrictions are can't do it either. So actually people who are at this very interface are critical for doing that, and I think the GO ontology is a beautiful example of that. This developing of qualitative and quantitative models has to be a grand challenge. I'll come back now to my roots: if we could calculate free energy, then we could solve many, many problems. We could solve the protein folding problem, we could solve protein-ligand interactions, we could solve protein-protein interactions; we wouldn't need to go and do all the experiments, if only we could calculate free energy. Now I think this is a huge challenge, and I suspect it's not one solved by biologists, but it would really revolutionise our whole understanding of biology. Obviously we want to predict interaction networks, to save everybody having to go and do it by experiment, and we want to predict the effect of drug therapy, as a generic thing when you're trying to develop drugs, but also for individuals, so that you can give them diagnoses. And perhaps the biggest challenge in the long term is to design new molecules: to take what we've learnt from biology and put it to our advantage by designing whole new things that do what we want them to do, with specific functions. And of course to understand how we got here in the first place. I'm rambling on a bit, I'm afraid, but the immediate challenges then: if
we're going to get to this brave new world and model everything and predict everything, and, you know, understand life much better than we did before, we really have to get the foundations right. We've got to get the databases sorted out, and this just shows, in fact this is a list taken from Rolf Apweiler of SwissProt, the things that he wanted to do in the immediate future. The database should be complete and up to date. Then there's minimal redundancy, which sounds trivial, but anybody that's used these databases knows that it's a pain and it really has got to be sorted out. You want as much annotation as you can get, you want it all to be retrievable by programs, and you want high interoperability with other databases. If we just look at the protein structural database, the PDB: it actually started in 1973, so it's almost 30 years old, and it's been immensely useful to the world, to everybody. I mean, my life has been built on the PDB, practically, academically. But it's basically a flat-file format with almost no annotation, so after people have spent years and years determining structures, you don't even know which are the active-site residues if you look at a PDB file. The deposition still involves hand curation, the ontologies aren't complete, and there are almost no ontologies for the NMR part of it. In terms of actually using it to predict structure from sequence, or to predict function from sequence, there's almost nothing; you know, we've really not made big progress there. So some of the basics really still have to be addressed. The annotation, as far as Rolf Apweiler is concerned, is very much the bottleneck: providing the annotation. And so for proteins we want functions, post-translational modifications, domain sites, protein-protein interactions, pathways, diseases, sequence conflicts, variants, all of that. And some of that, well, most of that, is in the literature and not in a computer-readable form, so
that is a huge challenge. For the genome annotation, with Ensembl, clearly that's immensely powerful; Ewan, sitting in the audience, with many other people at the EBI and at the Sanger, they've brought together all of this information. But what's important, I think, and this is something that was developed with many people around the world, is the idea of distributing the annotation, where people who are experts can provide their own annotation, and I think the DAS system is one example of how that might be provided. I do think it's going to be increasingly important in some way to allow the community to contribute to these databases. Then there are reviews, obviously, where one can get annotation from. The ontologies, those are the areas where the real need is. So, diseases: when we started looking at the OMIM diseases, you get these syndromes and you've no idea what they are, and you just don't know where to look. It's really difficult for people who were trained as physicists and then have to try and understand what these medical conditions are, so you really need some sort of short, sharp summary that will tell you, and that can also link them back to the molecular data. And I guess for the medics they want it in reverse: they want to understand what the hell this complicated protein structure looks like and why it should cause this particular effect. So we've got to build that in. The ontologies I've talked about a little bit. Okay, so the priorities. I put this up because, well, Tony Blair, our Prime Minister, when he was campaigning for his election, said we have three priorities: the first is education, the second is education and the third is education. As far as I'm concerned, for databases, the first is data integration, the second is data integration and the third is data integration. It really is a challenge for us, and I think we ought to address it. And so within the EBI clearly we have a mission to bring the data together, and again there are many different technical solutions, and it's not
clear to me which is the correct one to go down, but I think we have to think in these ways now, not in five years' time. So then, just to summarise: I guess literature curation I would probably still put top of my list; then the standardised vocabularies, the ontologies; the interoperability; and having within the databases some quality assessment of what we have. And biologically, you know, I started doing this list and I could have gone on all night; there are so many different biological challenges that we can really address, and they are going to need new algorithms and new approaches, and I think the close integration between the computational and the experimental is what we need. And just to underline that, I'd like to point out the composition of my group. I have three biochemists, three biologists and three physicists, and they have become the norm rather than the exception. And what I do feel, with the physicists, and I've had quite a lot of people who converted from high-energy physics or whatever come to the group, is that it really does take time: it's certainly not a day's work, it's a year before they begin to talk the same language, and it's another six months before they really get their teeth into a project. It takes a lot of time, but it is immensely rewarding, and I think it's these people, who have the skills in the other fields, who come together and work together with the biologists; together we'll go forward. And I think that's... I've just got to acknowledge the people who've helped me when I've been struggling with what to say and provided me with some slides. Thank you.

I'm sure there are questions, and I want to suggest, for people who want to ask questions: the people in the back of the room have not been able to hear the questions, so please speak up, and say your name and your affiliation before your question, because some of us don't know everybody here. We'll start in front.

I enjoyed your talk very much and I just... David, who are you?
David Valle, from Johns Hopkins. I couldn't help adding another kind of prediction to your list, and that is that as we learn more about the contribution of various genes to susceptibilities and resistances to phenotypes, the physician will be faced with an individual patient who has all these individual strengths and weaknesses and a unique environmental history, and then the physician and the patient will ask: how do I deal with this — what's best for this particular patient at this particular time? At least right now it's sometimes hard enough to deal with a serum sodium, so dealing with that kind of information is mind-bending. We need to work in that area.

I absolutely agree. You'll notice that in my group I don't have a clinician — that's because I can't pay them enough, I think — but I do think there should be, and will be, clinicians, though I think that will be another branch. The whole health issue — how to keep patient records, and then how to link that up — is a huge thing, and I don't know how far down that line we'll be by 2020. I think it's going to be quite difficult.

Next question. I enjoyed your talk, and I agree that data integration and databases are extremely important. However, that was already said in 1998, when people prepared this report — actually, 90% of the bioinformatics part of this 1998 report is about databases and data integration. What is missing from the report is that when we wanted to assemble the human genome, we needed new algorithmic ideas — for example, new fragment assembly algorithms — and back in 1998 that fragment assembly didn't exist. So I wonder: what do you think are the major algorithmic challenges in bioinformatics these days — computational challenges? And by computational challenges I mean not things like "predict interaction networks"; that's not yet a computational challenge for me, it's a
biological problem. How does it translate — do we need to develop new mathematics, or are we okay with existing mathematics and should just refine existing methods? So what do you think the grand bottlenecks are, algorithmically and mathematically, in this area?

I think this is a very, very difficult question, and it's one that actually comes up very often from engineers and mathematicians, who say: give us the problem and we'll give you the solution. And honestly, I can't. I think there are speed issues: we've got huge databases, we're trying to bring all the data resources into families, and the clustering isn't fast enough, so we need to speed that clustering up — but there are clustering algorithms out there already, and it's really just a case of implementing them efficiently, I suspect. So I can't put my finger on a new algorithm that I think we need. Statistics is often very important when you're looking at matches: the power of BLAST and PSI-BLAST really rests in their statistics and in treating those properly, and in a sense that's not a new algorithm, it's a new way of assessing the matches — and those two programmes have probably done more for the field of proteins than any other. So I'm not sure how often we actually need new algorithms. I can't really help you, I'm afraid.

I had a question about... Can you say your name, please? Julio Licinio, from UCLA. I really enjoyed the presentation, and I had a question about phenotypic data. Do you think the database of the future will have that? Let's say, by the year 2020, could we have a person's entire DNA sequence plus the person's entire phenotype in a database that you could then try to model?
I think that raises all sorts of ELSI — I think you call them — issues. If the individual is willing to do that, and all the ELSI questions are dealt with... I think it's easiest to start with mice and E. coli, things like that. But yes, we have to have phenotypes, there's absolutely no doubt at all. I don't know whether I got it into the genotype-to-phenotype slide, but moving from one to the other is the key issue, and we can't do that in any sort of high-throughput way unless we have the phenotypes held in the database. And then there's the question of mapping: if you think about Drosophila, which has lots of phenotypes, how do you map those on to — I meant to talk a little about comparison and forgot — how do you map from Drosophila on to mouse, on to C. elegans, on to human? I suspect linking those phenotypes together will be easier at the molecular level than at the description level, but I really haven't thought very much about phenotype ontologies.

Yes — one thing from the ethical point of view. I think we see a big barrier when we think about the average person, but there are many people I've talked to — say, elderly people, or people with terminal illnesses — who would be more than willing to donate their DNA and their history if they're not going to be around much longer. People donate their bodies to science and make organ donations; they can also donate their DNA. So it's not out of the question or completely implausible.

I think it's critical that we do have these things. Like all the SNP studies — it's clear that those are going to have to be completely international, because you often don't have enough data in one country or one continent. Bringing all of that together is going to be crucial, but how much will be done by 2020 I really don't know.

Oliver Smithies, University of North Carolina. Janet, I
very much liked your comment on literature, and I just want to draw a little parallel. At one time we had just the Index Medicus, and that was titles; then we had Medline, and that was abstracts; and now, often enough, we have the ability to get to the whole paper. That sequence is absolutely essential for evaluating the worthwhileness of most of the data, because often enough the title claims things that even the abstract doesn't say, and the data are hopeless — you have to be able to follow that chain. So I would urge on you very much the proselytizing, if you like, of the idea that all of the data must be computer-available in the original form, and rapidly. But a much better search algorithm is needed, because you can have a search algorithm for words that you know are in there, and authors that you know are there, and it will still miss completely — and that's really unacceptable.

I think it's a real challenge. I know a group in Sheffield that does a lot of text processing and literature mining. They tried it on enzyme active sites, which biologically are rather well determined compared with most of the data that many of you deal with: all they were trying to do was extract which residues are catalytic, and I don't think it worked, because there are so many different ways of saying "this is a nucleophile", or "this might be, but it's not, because of so-and-so", that it's very difficult to mine. But I think we have to do something — though hand annotation and hand curation, as we all know, is why SwissProt is so valuable: the army of annotators who have worked on it for so many years.

My name is Arend Sidow, from Stanford University. I wanted to bring up something that is often only peripherally mentioned — in fact you just said "oh, I forgot to talk about it" — and that's comparative genomics, and related things in molecular evolution. I just wanted to emphasise that I think comparative genomics has a much more
important role to play in the future than what we currently consider it useful for. Right now we think that inferring functional regions in the genome is important, those kinds of things; but I think we have a much bigger issue, in that we have chosen to work on many, many model organisms, and it's going to be incredibly important to integrate the information from the disparate organisms we work on into a rational framework, so that we can actually make predictions about, say, human health status. So I wanted to emphasise that this is actually a very difficult problem that goes beyond just where we come from and the rather short-sighted utility we see in it currently; it's a fundamental framework that needs to be established in the future.

I couldn't agree more — I really, seriously, did forget a slide on exactly this that I meant to include. Coming back to ageing: we're just starting a project looking at ageing in Drosophila, C. elegans and mouse, hoping to extrapolate to humans, so there will be transcriptome data, mutants — you know, how do you make your mouse models for a particular disease — and it's all tied into this comparative genomics and how you can move across organisms. How that is handled is going to be very, very important, because much of what we know comes from many different organisms, and synthesising it is hard — different organisms do different things. The problem, for example, with the KEGG database — which, for those who don't know it, is a pathway database — is that they put all of the organisms in together to create the pathway, and then you can extract which bits come from E.
coli; but in fact it does obfuscate — it makes things more complicated when you're trying to use it. So I think we have to find really powerful ways of linking the different genomes together; it's critical. Kim?

Kurt Fischbeck, from the NIH. On bioinformatics and chemoinformatics: I wonder if you could say something about the prospects of having something useful to link up to in terms of small-molecule chemoinformatics. It's my understanding that the large pharmaceutical companies have ways of describing their chemical libraries, distributing them in chemical space and annotating them, but is there a prospect of a shared resource that would make that kind of information available — overcoming the proprietary barriers? You mentioned you have four chemists on staff, and I was wondering if you have any thoughts about how that could be done, scientifically as well as socially.

I think scientifically it's quite straightforward, actually, in the sense that many of the chemists within pharma have developed wonderful algorithms for doing very fast searches on small-molecule databases. The problem is that they've got all the data — and especially all the toxicology data — which they don't really use in the way one would perhaps like to use it academically. I think this is a really good area for academic-industry interaction. It's a really interesting problem to map protein space against metabolism space against drug space and put it all together, so there are lots of challenges. Academically we've been looking at some of these things — how do proteins distinguish guanine from adenine, for instance? There are all sorts of negative constraints that must be there in biology. And metabolism — not in plants, but in humans — is smallish; again, it's finite, and it's not that large. So actually understanding the interplay of these small and large molecules is critical, and I think we must find a way to work with the pharma
companies to develop this, because from it could come all sorts of things. The major problem with drug design, as everybody knows, isn't actually designing the lead; it's getting the compound to work in the body — to not be toxic, and all of those sorts of things — and at the moment that's very hit-and-miss. So the drug companies have a huge interest: if one could say "this is not a druggable protein", or "this is not going to work in humans because it will interact with all these other proteins", that would cut their costs enormously. I think it would be hugely to their advantage, and hugely to academic understanding as well, so I hope that happens. One of the things I'm trying to do at the EBI is to bring in some chemoinformatics and to get people interacting.

Hi — Mike Eisen, from Berkeley. With the literature you've touched on something near and dear to my heart — sorry, I shouldn't stand in front of the microphone. I have a question and a statement. The question is: do you see a path for us to go back to the hundreds of thousands of articles that have previously been published? There are obviously a lot of parts to it — getting it into digital form, annotated — but I just want to know whether there are any ideas for how to get into that literature and make it accessible. And the statement is that I want to point out that most of the literature we're publishing today is produced in machine-readable form that we could be using to annotate the genome and to do all sorts of other things — but almost all of us are signing away the rights to use it in such a fashion by the way we publish. I'd like to encourage people who appreciate the value the literature could have, for everything we want to do in the future, to stop and change the way we publish, so that all the literature can be available. But I also really do want to know if people have ideas — especially at the EBI or
the places like it — for how you go backwards in time; going forward in time, I think, is easy if we all act properly, but going backwards is a much more difficult problem.

I think actually we should look forward, and the key to looking forward is to develop the ontologies, because if we have the ontologies then we should, in some way, be able to abstract from the literature in the future. As for going backwards: we've been doing a study of enzyme catalysis, and I've had six summer students over the last three years ploughing through the literature to get the information out — it is the only way — and after they'd done it, a new PhD student started and had to go through all the literature again to make sure it was consistent. It is a nightmare. But if you're interested in the data, it's the only way to get there. What I hope is that from that we'll develop the ontology and the models — whatever the enzyme is, its active site is like this, with the little arrow diagrams — and that people will also, hopefully, submit a computational version of that to a database that we might set up on enzyme catalysis. One has to start somewhere, I guess.

Pavel was looking for algorithmic challenges — it seems to me there's an algorithmic challenge in trying to automate that process, so that you don't have to have a million PhD students.

There is a huge amount of work already in text mining — in the military, in all sorts of places — but I don't know; it should be easier for scientific literature, but as I say, I'm only familiar with this one project, and they had a lot of trouble with what I thought was a relatively straightforward task. I think the middle mic is next — other people might have a better answer to that; do any of you know about text mining? Because there is a lot of text
mining out there, which is rather difficult.

A statement and a comment — a statement about that last thing. Your name, please? I'm sorry: Douglas Crawford, from the University of Missouri. On a distributed effort: there are lots of people who have knowledge of their enzyme, their kinetics — they could just enter it into the database, and you'd have to somehow reward them for that, as a publication, but there are lots of people who might well do that. But that's not my question — I've talked about that too. Back to comparative genomics and annotation: the surprising thing we find is how difficult it is to do the annotation even when you try your best. We work on a fish, and if we want to link all our cDNAs up to KEGG, it's impossible. For example, simple enzymes that we know well in the TCA cycle aren't named as the simple enzyme — succinate dehydrogenase, say — so you come up with a protein that isn't called succinate dehydrogenase, it's something else. A simple tool would go far for comparative genomics: a way to easily map my gene into either a yeast pathway or into the KEGG database. That's not a difficult task, yet nobody seems to be doing it. It's a simple tool: I want to know, for protein X, where is it in yeast, where is it in KEGG?

Absolutely. We certainly went through the same process with our structural data in annotating KEGG: we had a very good computing person, and they basically hacked something together so that they could annotate the pathways with all of our structures — but it was a hack. I think — and this is one of the reasons I went back to the PDB and the basics — many of the things we need are actually fairly straightforward and fairly simple; the problem is that applying them over whole databases is rarely straightforward and never simple. It's hard.

We have time for one more. Rick? Rick Young, Whitehead Institute. With this discussion about data integration, I get the sense that people are not aware that the vast majority of what
you describe as challenges have been solved in yeast — that two of the most powerful sets of tools we have for eukaryotic research are two databases, the Yeast Proteome Database and the Stanford genome database, and they're fully annotated: for every gene there's an entry, there's literature, and there are connections over the internet to all that information. It's just an extraordinary resource, and as a model it might be used to create one for other organisms — and its deficiencies might be corrected by looking at that model and asking what could be improved.

I know YPD is very powerful — and perhaps you can correct me — but certainly when we've accessed it we can't dump it down, so we can't have access to the whole lot; it's just one thing at a time, and that really causes problems when you're trying to work over the whole database. YPD, which is the most powerful of these, was developed by a company, and it's accessible to academics but not fully downloadable. That is a real problem, because a full dump is just what you want — you don't want to go and repeat all of that work — but at the same time, when you're trying to build up protein families, you need views that cut right across. Maybe we'll have to find clever computational ways to pull the data out for a particular query, but if you're trying to build integrated resources it's very difficult when you can only do it a gene at a time — it really is. And I agree: it was mainly hand-annotated, as I understand it — YPD was — and the human genome may well need to be the same. It's a real challenge, actually, and again one of these interfaces that we need to resolve.

Thank you, Janet, for a very provocative talk and good answers to questions. It's now time for lunch.
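The simple cross-referencing tool Crawford asks for in the discussion above ("for protein X, where is it in yeast, where is it in KEGG?") reduces, at its core, to a join over downloadable identifier-link tables. Here is a minimal sketch in Python; the tab-separated gene-to-pathway record format mimics KEGG's much later REST `link` output, and the specific identifiers (`sce:YKL148C` as yeast succinate dehydrogenase, `path:sce00020` as the TCA cycle) are illustrative assumptions, not verified database content:

```python
from collections import defaultdict

# Example records in an assumed KEGG-link-style format: one
# "gene<TAB>pathway" pair per line. Identifiers are illustrative.
KEGG_LINK_RECORDS = """\
sce:YKL148C\tpath:sce00020
sce:YKL148C\tpath:sce00190
sce:YGR244C\tpath:sce00020
"""

def parse_links(text):
    """Parse tab-separated gene->pathway records into a lookup dict."""
    links = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        gene, pathway = line.split("\t")
        links[gene].append(pathway)
    return dict(links)

def pathways_for(gene, links):
    """Answer 'where is protein X?' for one gene identifier."""
    return links.get(gene, [])

if __name__ == "__main__":
    links = parse_links(KEGG_LINK_RECORDS)
    print(pathways_for("sce:YKL148C", links))
```

The point of the sketch is the one made in the exchange: the lookup itself is trivial once a whole-database dump exists; the hard part — then and now — is the name reconciliation needed to build the link table in the first place.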