So, I think most people would agree that the top statement, that is, the application of computing technologies to biological information, would be a reasonably high-level definition of what bioinformatics is. But when you start drilling down into what you mean by that, you end up getting a whole lot of different ideas, and these are just some of them; it's not comprehensive, and I'm sure you can add to it. In discussions with council members, my colleagues, and others, these are some of the ones that come up, especially for NHGRI: the development of new methods and algorithms, the use of existing methods, the development of laboratory information management systems, the processing or pre-processing of biological data, the development of analysis pipelines, using and developing different environments, and the curation, collation, and representation of biomedical data. I would also say that some people will include electronic health records in here, and a variety of other things. The key thing here is that this means many things to many people. So in the context of what I want to do here, I'd like to give you an understanding of at least the way we see this at NHGRI, from our perspective. It doesn't mean it's complete or comprehensive; it's our first attempt at trying to do this. That's why I've called this version 0.1: it's an attempt at a high-level view of what we do within the bioinformatics program, and of how we actually view that, based on that very broad definition I gave you a moment ago. So in the context of today, I'm going to cover two main areas. One is the actual grants portfolio itself — what I heard from council is, what is this made up of? — and I'm going to give you a high-level overview of that. I'm also going to describe some of the bioinformatics activities.
Often at council you hear about bioinformatics: it pops up in various grants, or we end up helping with certain grants, or there's something related to data. So here I'm trying to coalesce those activities and give you an understanding of what they look like from an overview perspective. The areas I'm going to cover are really the consulting services that we do specifically inside NHGRI, and also some trans-NIH activities, because obviously bioinformatics is a fairly broad topic. Looking first at the actual grants portfolio: here we have a matrix of a whole collection of program officers, and this program is managed by many people. The font sizes relate to the number and size of grants — for names in larger fonts, we have a larger number of grants and a larger portfolio. Quite a large number of people work on this, and it covers a wide range of skill sets, from those folks with very strong informatics skills in the computing areas right through to those in the biological areas who are implementers of bioinformatics tools. Now I want to talk a little bit about the types of bioinformatics grants. They really break into two categories, and these are the ones I'm going to describe today. The first is primary bioinformatics — grants that are primarily computational. Their focus is entirely computational, with very little laboratory work; there may be some grants that have a bit of that in them, but the focus is computational method development, algorithms, and the representation, curation, and presentation of data. I'm going to go through each of these subcategories as I go through the presentation. The second thing I want to talk about is component bioinformatics, and as with all things, categories are sometimes a little hard to enumerate, but this is our first attempt at doing this.
By component bioinformatics, I mean grants where the computational component is not the main focus. There's often a very strong laboratory or technology component, but what we're seeing is an increasing number of cases where informatics is included in these grants, which is obviously not surprising. So first I'm going to talk about the primary bioinformatics grants — those really focused around computational biology, where the main focus is computational — and the areas I'm going to look at first are the model organism databases, or MODs, and the enabling technologies, which I'll describe in the next few slides. This institute has had a long history of funding the MODs, and we've also had an increasing number of enabling technologies. I'll explain these in a minute, but essentially they are things that enable us to look at, move, manage, organize, and view data that is not necessarily related to a MOD — although it can include that — but is more broadly around biological information. All of these grants are very large-scale informatics grants, and from the perspective of NHGRI these are the U41s, P41s, U01s, and P01s. So what are these? Essentially, the MODs are the ones that we fund or partially fund at NHGRI: ZFIN, OMIM, FlyBase, SGD (the yeast database), MGI (the mouse database), WormBase, and RGD. All of these focus on highly curated data, integrating data sets — sets of genes, proteins, variants, alleles, phenotypes, and genetic disorders — for a particular model organism. And yes, I realize I'm suggesting OMIM is a model organism database for human, but I guess we could see it that way. The key thing here is that collectively they all do the same thing, as I'm saying in point one.
They also provide a common nomenclature for gene symbols and anatomical terms — that's the curation component — but they all also provide methodologies for search, analysis, and visualization of their data, to be able to actually explore it, and they've been working more on that as time goes on. Historically, in terms of NHGRI, most of these MODs have focused individually on a particular organism, but increasingly we're looking at integration of these data sets, because obviously they form a mosaic for looking at the human genome and human disease. The enabling bioinformatics technologies, as I mentioned before, are a broad collection of tools that we fund. This category has grown quite significantly over the last couple of years, which I think really shows the need for it. I think it can be encapsulated by the statement I'm making here: it's a way to enable easier access, particularly for those folks who are not computing specialists and don't know how to code. So there are tools in here that enable those biologists to better access this information, but also tools for the broader community, including informaticians, that allow some level of integration, the development of visualization or representation of that information, and the management of genomic data and tools. I think this reflects the needs inside the community, both for the biomedical scientists who don't have coding abilities and for the need to integrate these data and develop systems that allow us to view this kind of information in a cohesive sense.
This particular portfolio is managed, as shown here on the right, by a large number of program officers, whose names are listed there, and below that are the current grants we fund. Just a couple of examples. UniProt is a collection of protein information, curated and represented across all model organisms as well as human and other resources; it provides a highly curated set of proteins and has visualization tools into that set. Galaxy — many of you might have heard of it — is an analysis system for genomic data; it allows a lot of biologists and biomedical scientists who don't have coding experience to go in and create a set of analyses that they wouldn't otherwise be able to do without coding skills. HGNC, the HUGO Gene Nomenclature Committee, is about creating a consensus around the nomenclature of genes, so that what you call one gene is exactly the same as what somebody calls it on the other side of the pond. Bioconductor — I think that came up this morning — is a collection of R statistical tools; it originally started with microarray data, but it has moved on to SNP analysis and other next-gen sequencing data, and it has become a very large collection of tools used by many in the community, and it's getting bigger as we speak. I just wanted to highlight a few of these. They are examples of the kinds of technologies we support that are all very strongly computational — both useful as resources to the community and requiring a tremendous amount of informatics development underneath to supply them to the community. Moving on to another subsection of what we call primary bioinformatics grants, we have variation analysis and association.
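To make the nomenclature-consensus idea behind HGNC concrete, here is a minimal sketch of gene-symbol normalization. The alias table and function name are hypothetical, invented purely for illustration — real work would query HGNC's actual curated records rather than a hard-coded dictionary.

```python
# Hypothetical sketch of alias -> approved-symbol normalization.
# The mapping below is illustrative, not real HGNC data.
HGNC_ALIASES = {
    "P53": "TP53",    # common shorthand for tumor protein p53
    "TRP53": "TP53",  # historical mouse-style name
    "HER2": "ERBB2",  # clinical alias
}

def to_approved_symbol(symbol: str) -> str:
    """Map a gene symbol to its approved form, if an alias is known."""
    upper = symbol.upper()
    return HGNC_ALIASES.get(upper, upper)

print(to_approved_symbol("p53"))    # -> TP53
print(to_approved_symbol("BRCA1"))  # -> BRCA1 (already approved)
```

The design point is simply that every data set run through one shared mapping ends up speaking the same vocabulary, which is what makes cross-resource integration possible.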
This program is managed primarily by Lisa Brooks and Erin Ramos, and here we're looking at the development of methods for analyzing genetic variation and association studies. An example of a program here is 1000 Genomes, and the kinds of work this group looks at include discovery of genetic variation, linkage studies, population genetics, admixture, and variation databases. Quite broad, but it's quite a large area in our portfolio. Now we come to a collection of things I'm calling "related to other biological areas." Obviously, when we look at grants that come into NHGRI, we think about them from a biological perspective as well as a computing one. So there's a collection of grants that fall into some loose biological areas, the ones I've enumerated here: gene regulation, next-gen sequencing data analyses, genome annotation, clinical informatics, gene expression, networks, pathways and systems, and biomedical ontologies. I'm going to go through those very briefly. Most of these are investigator-initiated grants. They're smaller — usually R01s and R21s, some conference grants, and some SBIRs. This area is managed by the collection of program officers shown here on the right, and it covers a whole range of things we get in: genome assembly data and tools, base-calling tools, and comparative genomics tools. I've seen a lot of folks coming in using next-gen sequencing data to do comparative genomic analysis across different species and to look at strain variant information. We're seeing an increasing number of tools around privacy and encryption, and also modeling — there's quite a lot of modeling involved here as well. We also cover a fairly large range in what we would broadly call genome annotation, by which I mean gene prediction methods, genome annotation pipelines, phylogeny, and visualization resources like the UCSC Genome Browser.
The last category that is entirely computational is what we call the DACCs, the data analysis and coordination centers. These DACCs provide support services to specific projects; the most active ones right now are H3Africa and ENCODE, but we've also been involved with Common Fund projects, where we've also managed those — the example there was the Human Microbiome Project. The DACCs are involved in a collection of things, which I've enumerated here: the development of data pipelines, the development of metadata standards, and the storage of the raw data and the derived data. Sometimes — not always — the data is pushed through to EBI and NCBI, so there's all the work involved in submission to those resources. They also provide a portal, or website, for the project, to give easy access to data, tools, SOPs, and so on. This takes a significant portion of staff time. It's the stuff that happens underneath that is absolutely essential for a project to run, and it takes a considerable amount of the informatics staff's time to actually do it. In this case we provide some funding, or complete funding — for example for ENCODE — for these particular programs. So here I just want to give you a couple of summaries, and a little like Jeff Schloss this morning, I'm going to present them in two particular flavors. One is the list, in the areas I've just shown you. For the primary bioinformatics, the key thing is that it's quite a large component of the portfolio; I'm showing 2012 and 2013 numbers. For the component bioinformatics, it's about $7.6 million, going up to about $11 million in 2013. The totals are $103.7 million for 2012 and $100.6 million for 2013.
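The metadata-standards work the DACCs do can be pictured as completeness checks run over every record before it goes off to a repository. This is a hedged sketch under an assumed, made-up set of required fields — it is not any actual NCBI or EBI submission schema.

```python
# Illustrative completeness check of the kind a coordination center
# might run before submission. Field names are assumptions, not a
# real submission schema.
REQUIRED_FIELDS = {"sample_id", "organism", "platform", "library_strategy"}

def validate_record(record: dict) -> list:
    """Return the sorted list of required fields missing from a record."""
    return sorted(REQUIRED_FIELDS - record.keys())

record = {
    "sample_id": "SAMP0001",
    "organism": "Homo sapiens",
    "platform": "ILLUMINA",
}
print(validate_record(record))  # -> ['library_strategy']
```

Catching a missing field at this stage is cheap; catching it after thousands of records have been deposited is exactly the kind of hidden, time-consuming cleanup described above.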
Here's the same data represented in a slightly different way, showing that the majority of the portfolio really falls within primary bioinformatics, but we've got a reasonably growing piece in component bioinformatics, which isn't surprising to me given the kinds of things we're expecting to see in our organization. All right, moving along now to other bioinformatics staff activities — the consulting services that we do. What we do is provide technical advice on grants that include computing components. We provide an understanding of the connectivity with NCBI and EBI resources and people. I'm frequently asked things like, "I have some data and I've heard of this GEO — what do I need to do with GEO?" For those at council, I think about a year ago there was the issue with submission to SRA, and so a lot of work went into SRA submission. There are obviously new schemas that come out of that, so being able to translate the meaning of that to the program officers here, and understanding how to get data from particular programs into those resources, is part of what we do. That also includes dbGaP, and a fair chunk of time has also been put toward specific programs — for example, the Centers for Mendelian Genomics. A lot of work has been done, particularly by Chris Wellington, on data and metadata structure in the submissions to NCBI. Again, it's mostly because we have the skill sets to enable the other program officers to understand what is needed in these particular programs. All right, I'm now going to turn to the trans-NIH bioinformatics activities. These fall into a couple of different flavors. The first is interactions with NCBI. The second is projects with other ICs.
I'm not going to have a slide on that, but I'll just tell you this is the work that Heidi Sofia is doing with TCGA — helping other programs and providing informatics support, input, and advice on those kinds of programs. Then there's BISTI, a group within NIH that looks at biomedical computing, and some initiatives: the NSF-NIH Big Data Initiative, and a little bit about BD2K. Looking first at NCBI, we've done a tremendous amount of work with NCBI. Here we're tracking sequence data and metadata from various sequencing programs and projects at NHGRI and looking at their deposition into NCBI. We often work with the sequencing centres — they do the direct deposition — but we look at where that data is being deposited in NCBI, how it's represented, and, more importantly, whether we can actually find it after it's been submitted. Which leads to the second point: BioProject pages for the various NHGRI programs. BioProjects are a way of representing the scientific information and the various data types associated with particular projects. The first one we set up was for the Human Microbiome Project — a collaboration between myself, Lita Proctor, Chris Wellington, and NCBI to develop a page that shows you the different data that was generated for the project, the science, where to get the information, where to get the data, and how to find it easily. Chris Wellington and I have been working hard on the same sort of BioProject pages for other programs within NHGRI. The idea is to enable the community to find the data more easily, and hopefully to be able to use it. Another thing I've been working on with NCBI — I think you've heard about the cloud on a few occasions, and there are certainly council members here who are using it or actively supporting it.
I've been working with NCBI on data and tools and their representation in the cloud. They have a large chunk of this data already on Amazon and Google, and I've had quite a lot of discussions with Don Preuss and Jim Ostell about how they've been doing this and the value of that to biomedical data. And the last one, which has come up more recently — as you've seen from other council presentations — is the use of variant information. We're starting to work a lot more with ClinVar and dbSNP on the representation of that data and the flow of that data, from various resources and from awards that have been made by NHGRI, into these particular resources. The principle is the same: we want to get the data in there in an efficient way, represented correctly, so that it makes biological sense and is easy for the community to access. Those are our goals. These are all the hidden things that happen, but I think they're what makes the data extremely useful and usable to the community. The next thing we do is related to BISTI, the Biomedical Information Science and Technology Initiative. It's a consortium of representatives from each of the NIH institutes and centres. It was started back in 1999-2000, with a report written, I believe, by David Botstein and others for the advisory committee to the director about the needs of the computational sciences within NIH. That report embodied a collection of suggestions for how we might deal with computational issues at NIH. The consortium is primarily NIH program officers, though additional folks come to the meetings, and we focus on discussions related to biomedical computing. The focus of those discussions can be around the development of FOAs.
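The point about representing variant data consistently across sources can be sketched as building a canonical variant key, so that the same variant reported slightly differently by two submitters collapses to a single identifier. The key format below is an illustrative assumption for this sketch, not the scheme ClinVar or dbSNP actually use.

```python
# Hedged sketch: a canonical chrom-pos-ref-alt key so the same variant
# from different sources matches up. The format is an assumption for
# illustration only.
def variant_key(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build one canonical identifier for a variant, regardless of
    whether the source wrote 'chr17' or '17', upper or lower case."""
    return f"{chrom.removeprefix('chr')}-{pos}-{ref.upper()}-{alt.upper()}"

# The same variant reported two slightly different ways:
print(variant_key("chr17", 43094464, "g", "a"))  # -> 17-43094464-G-A
print(variant_key("17", 43094464, "G", "A"))     # -> 17-43094464-G-A
```

Normalization like this is the unglamorous step that makes "the data flows in and is represented correctly" actually true: without it, the same allele can sit in a resource under several spellings and never be found.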
For those of you who submit to NIH, there is a collection of program announcements specifically related to the development, or further development, of computational tools. But we also have presentations around certain technologies — for example, we've had natural language processing, I gave a presentation on the cloud, and we've had various information science technologies represented. So we get a better understanding of the kinds of technologies that will impact us when we try to deal with biomedical data and use it, and, as program officers inside NIH, we can also act as advocates across NIH to better educate our various colleagues. The trans-NIH activity I'm highlighting here is the NSF Big Data Initiative. This was done in 2012 as a joint solicitation between NIH and NSF, with the institutes I've enumerated up there — we were one of them. The solicitation was an attempt to bring these two agencies together. NSF has traditionally focused on pretty hardcore compute support structure, and NIH has not, so coming together was a very timely pursuit, because I think we both had a real need for it. The initiative that came out of this was called Core Techniques and Technologies for Advancing Big Data Science and Engineering, and the key take-home message was to make collections of large data and tools more easily accessible to the community, or to develop methods for doing exactly that. I think it was a very interesting first round. What was really interesting was that, in working with another agency involved in heavy-duty computing and merging some of our discussions, I think we got a better understanding of what that community has been building, and they got a better understanding of the kinds of needs we actually have. So I think it was a very helpful and opportune time to have a look at this.
The last thing I want to highlight is BD2K — I think you heard Eric Green discuss it this morning in his director's report. This is the Big Data to Knowledge initiative, and it breaks down into four primary areas: data sharing and standards; software, methods, and systems for analysis; training; and centers of excellence for big data analysis. The various program officers listed here, and also the analysts — the folks who have actually been working on this initiative — either co-chair some of these groups or are heavily involved in working with other groups in these areas. This is proving to be time-consuming, but I think it's very relevant to the mission of NHGRI, and I certainly think we need to be involved, both so that we better understand how the community is dealing with these issues and from the perspective of what we do within NHGRI. From that I get to my end — and here I realize I've spelt it the Australian way rather than the American way, so you'll have to take my acknowledgement for that. I'd like to thank our advisory group. We have a small group of council members with bioinformatics experience and skills ranging from development to the end user — I think it's very important that we get that broad perspective — and they provide advice on various bioinformatics issues that affect NHGRI. They've been extremely helpful in going through the presentation I gave you today and advising on what you might be interested in hearing. The council members represented here are Carlos Bustamante, Jill Mesirov, Bob Nussbaum, and Lucila Ohno-Machado. They've also kindly agreed, after this council, to look at additional topics as we move forward, to figure out what we need to bring to council. When we get to the conclusion of this presentation, it would be very interesting to hear from council what other things you would like to see.
I'd also like to add an additional acknowledgement to my fellow program officers, and also to Javel Goldner, who's been working hard on a lot of these spreadsheets. And I'd like to make a very special thank-you to Chris Wellington, who has been heavily involved in a lot of the analysis of the spreadsheets and has been trying to do bioinformatics in Excel — which, for those of you who know what that means, can be rather complex; "challenging" I think is the right word. So with that I'll end. I hope that's given you a general understanding of what we do. It's a very high-level overview, but it's my first attempt at figuring out how to present to you what our program is about, what's in it, and how we do what we do. So now I'd like to open it up to questions and see how we could improve our next round. Thank you. Thank you, Vivian. Questions, comments — just what you wanted: you said you wanted a good summary of our bioinformatics portfolio. Yes, and thank you, Vivian, that was very helpful, and of course as we start to understand it a little better, questions arise. I was struck by the trans-NIH things. You have BISTI, which started around 2000, then you got Big Data, and then we got Big Data to Knowledge, but they all sound similar to me in a way, and I just wanted to ask: is there some continuity, some learning, some linking among these trans-NIH initiatives? Yes — BISTI was started, as I said, as an attempt to coalesce the issues around computational biology, and what came out of that was certainly some program announcements, and in BD2K we're looking at furthering those, developing them, and leveraging what we've already learnt from BISTI, so there's an example of where we've taken that further.
Some of the things suggested in that original report that were put in place through BISTI are being extended into BD2K, and many of the program officers who are part of BISTI are also part of BD2K, so I think we've spent a lot of time learning and developing further from it. I think Mark's going to add something. BISTI in general was much more voluntary on the part of the institutes — some institutes participated in the program announcements, some didn't — and those were mostly focused on R01-type grants, in a sense investigator-initiated, although sometimes program announcements are considered solicitations. BD2K is much more actively programmed and much more directed at the specific issue of big data. Even though — and Jill can comment on this — the report from the Data and Informatics Working Group of the advisory committee to the director, on which BD2K was based, really sort of morphed over the course of the report from big data to all data, as it's being implemented the initial focus is really on large data sets. Yes, I would add too that if you look at the BD2K Centers of Excellence RFA that went out, you'll see great similarity between it and the BISTI NCBC centers. In our report we tried to point out some ways in which we thought that could be improved, and I think the main thing is that when David Botstein and, I think it was Larry Smarr, wasn't it, wrote the BISTI report — I don't know how many years ago — they foresaw exactly the same issues that are pressing on us now. So it's not surprising that there is some overlap in the ideas and the concerns, and I think both the BISTI and BD2K reports, as they're called, are just trying to point out that we're in an age where computation has to be taken seriously, and these are some ways in which that can be done — and, as Mark points out, it started out as the big data problem, but in fact it's a much broader all-data issue.
It's as much the complexity of the data we deal with as it is the amount. I don't think the problems have gone away; I think they're similar problems. What's becoming even more urgent is that with such large data sets you need the ability to access and analyze them, and you've got the folks who have computational skills, but what you need to do is get that ability into the hands of folks who don't. So what we're seeing, especially in some of our enabling technologies, is tool sets and systems that permit biomedical scientists without those kinds of skills to do exactly that: first, access the data in an environment that is supportive of large data sets and can actually deal with them, and second, use tools to do their analysis. That's something that was pressing before and is incredibly urgent now. Yes, and I would also just add what I think was a critical point in the report of the Data and Informatics Working Group: that being successful here is going to require considerable culture change at NIH. Culture change means taking computational issues seriously, actually spending money to support that seriousness, and then the whole issue of data sharing — making data accessible and so forth. My sense is that through BD2K, and through the appointment of an Associate Director for Data Science, there's a much more active approach at the NIH level than there has been in the past.
I think one of the things we see is an increasing number of grants where the focus isn't computational, but where they want to use those technologies to better do whatever they're going to do biologically, in the laboratory or in technology development. So I think it's going to be pretty important that we know how to deal with that — acknowledge that it exists and figure out how we support that community, how they get access to the tools, and how we fund those kinds of things. So — I don't think this came up earlier, but if it did, excuse me, I missed it — how is all this interacting with the administration-wide directive to come up with plans for making publicly funded data available? I assume there's a very direct interaction with that administration-wide effort. You're talking about the OSTP memo? Yes. Definitely — within the context of BD2K that's becoming incredibly apparent. You want to make all the data accessible and findable, and some of the examples I gave before for NCBI — making sure the project page is there so the data is findable and locatable — fit within that context. Do you want to add anything to that, Mark? I was just going to say that there's a group at NIH that has developed our plan to respond to the OSTP memo, and on that planning group are several members of the BD2K initiative; NHGRI is sitting on it, as are several other institutes that have major stakes, history, and experience in developing data sharing plans, so that we can try to coordinate across the board. So we're definitely keeping track of it and very connected to what's happening, so that we can talk to each other, basically.
Many of these algorithm-development and software tools are a major investment for NHGRI, especially the extramural projects, and in reading some of the reviews there was concern that there wasn't enough usage of these tools. I wasn't sure whether all the onus was on the investigators to market them and get people to use them, or whether there's some sort of partnership with the institute in showcasing your investments and getting people to use the tools — because obviously something at a university is only going to get so much traffic for people to know it's there, whereas NHGRI probably gets a lot more hits in a day. So is that an issue, or is it something you've thought about? Yes, and I can answer that in a couple of ways. We have a U41 for community genomics resources, and for the enabling technologies I described today, one of the key issues is high utility to the community — so they have to be able to demonstrate that, reach out to the communities, and really work hard at it. It's also something we look at very strongly in these grants' annual reports and so on. But there is also a need to support folks who develop tools that aren't necessarily going to be pushed out to the community just yet — they're developing new methodologies which they're simply testing out to figure out whether they work — and I think we need to support those kinds of things too. So it's a balance between those issues, and I think that's really important. Jill?
So I think Tony was making a somewhat different point: that yes, the onus right now is on the tool developers, the system developers, the database developers to do all their own marketing and make whatever connections have to be made, and I think it would be an additional way to support these groups that NHGRI is funding to also help with that endeavor. I'm not suggesting — there have in the past been projects where institutes have said, if you're responding to this RFA then you must show us that you're using this particular other project that we fund — I'm not saying that. But I think you're right that there probably are some good ways NHGRI could help with the marketing and visibility of some of the good tools developed with NHGRI funds. One of the discussions inside BD2K — I think you heard about it this morning — was the data catalog; there's also a discussion of a software catalog, the ability to locate and find software developed with NIH funding. The idea behind that is that the software isn't stuck on a PI's website but is easily findable, so people can find the information in a way that helps them decide whether they can or want to use the tool, contact the developer, and actually do that. So there are ways this is being addressed in more recent efforts as well. And I would just note that the plans for the data catalog and software catalog are a direct response to the first recommendation of the Data and Informatics Working Group, which was exactly that, for the purpose of getting much more widespread usage of existing and new data sets and software. But it's also a way, potentially, if it's done right, for data generators and software developers to get more credit, because they will be able to cite usage, and we'll be able to start collecting usage information through the data catalog.
I just sent out the link to Gene Space to my lab to advertise it, but for the question, I wanted to explore this interaction with NSF, because, unlike probably most of the council members, we are funded equally by NSF and NIH. We have significant compute capacity, but when we are overloaded we go to Big Iron in Texas, and they want to know which NSF grant I'm going to assign the work to. Sometimes I have one that I can point to, but boy, I'd like to run some of my NIH project mapping and algorithm testing on some of their new Big Iron. How does that interaction look? There are resources I can get there that are tremendous in terms of the power they have, and it's not so seamless, actually, in terms of crossing agencies. I agree. In terms of what I showed up here, it didn't have anything to do with that, but what it did enable me to do is have conversations with NSF program officers to discuss exactly this. There are things in the works at the moment: we're going to meet with various supercomputing facilities around the country and also cloud providers, both academic and commercial. We're interested in figuring out how we support the Big Iron that's needed for doing big data analysis. So, for example, if I understand you correctly, you don't want to set up a Big Iron facility; you want to use something that exists, and if it's funded by NSF, how do you get access to it, and at what cost? That is exactly the kind of thing we're looking at, because with the big data initiatives, the issue is that you need something like that to work on, and we don't want to fund something new, we want to reuse things. So I think those conversations are occurring now, and we want to figure out what technologies we have to look at. 
We have tried to interact with the supercomputing centers in the past. It hasn't always been successful, but most recently it has been, and it's been very productive. In fact, we can do a lot of things with them in just a fraction of the time we can on our own machines. In the past it was always some interface issue, we couldn't find the right person, but some of these places actually do have the right people now that you can work with to get what you need done. So in the conversations I've had very recently with a number of supercomputing centers, they were extremely excited to be able to talk to us for exactly the reason you've just said. They want to figure out how we work together: is there a way to deal with this, can we leverage that, what's the right way to do it? So the timing for the conversation is now, and I think the large data sets that are now available push that conversation further down the track. So thank you for bringing that up. You just gave us a two-year snapshot of the funding, and it looked largely like it was static year to year. I'm curious about the overall balance of methodology support versus tool support, or making software available to people. Do you feel like you've got the balance about right, and will you keep it about the same? Development of new methods, algorithms, and approaches versus software that exists and facilitating access to that software by biomedical investigators. Do we think we've got the balance right? Yes. It's a constant challenge. I'm not sure if the balance is right yet, but we're aware that we have to support both, so I don't have a better answer than that, other than that we're aware of it and we need to look at it a bit more deeply. I know that it's in the back of my mind at all times: what are we supporting for development of new methods versus that which is already out there. 
Then there's the issue, too, that you're constantly supporting these resources and you don't have additional funds for new things. That's another aspect of it as well. Not a good answer, but it's something we're certainly considering and thinking heavily about. If you have any suggestions, they would be very helpful to us. I raised it just because I have a concern: the tendency often is to take funds away from developing new approaches and thinking about the problems, and just to put them into the tools. I think that's a risk, but it doesn't look like it's a problem just yet. We're very aware of it, absolutely. Collins. So I want to thank you for putting that together. I think for me it's incredibly useful to get a sense of this, and just to echo the point, it's a really tough problem, in that you're supporting things that run basically from almost the command-line tools that physicists use all the way through to what could be EHRs and interactions with things like Epic, which often costs hundreds of millions of dollars to install in the first place, right? 
So I guess my question, to follow up on your question, is this: if we think about what the sweet spot is in terms of what we envision for the bioinformatics program, do you want to get tools to the point that they become commercialized, which in bioinformatics is not often very easy? Or do you want to create tools that are good enough that a good set of analysts could take them on, say, the way GATK has been produced? Or is it more to develop the ideas, with the thought that, because the science is moving so quickly, we don't really need to be thinking about a software development cycle like MS Word, where we're going to have programs that we're going to be supporting for 10 or 15 years? So I'm just curious about your big-picture thoughts in that regard, and others', because I don't think it's an easy thing to sort out. There are a lot of technical issues, and I think you need to think about it across that entire spectrum; that's one. Two, you need to have a balance. So, as you were saying before, you need development of those tools, and you need to get them out, particularly with so many biomedical scientists not coding, and they don't necessarily need to. You need tools that you can, the term we use is "hardening," right, get out to the community, so we need initiatives that deal with that for current tools. But we also need to leave space for new tools that think about things in new and interesting ways, to get them into the hands of the biomedical scientist. You could use the same old stuff, but you also need to think about new, innovative ways of delivering that information based on the science that you're trying to do. I'd say that although data integration has long been one of the key things, it's even more imperative now to be able to do it. I think the difference that we're seeing with some of the data integration pieces is that in the past people would try to make interoperability at the 
base level of the data and try to make everything match as you look across different data sets. We're now realizing you can't do that, because the volume of the data is too huge, so you essentially have to take a data science, mining approach as one way to think about it. So I think we have to cover that whole spectrum, and we need to think about how we do that in the context of balancing our priorities. And then we absolutely need to talk to the community to find out what's needed out there as well. I wonder if we're reaching a stage in bioinformatics, and particularly with genomic data, similar to what we see in the development of sequencing technologies, where the role of the funded entities is to get it to a point where it can then take on a life outside of traditional NHGRI funding or the universities. Because, truth be told, universities aren't particularly good at the engineering side or the software development side. We can take it to a certain point, but you'd rather let people who do that really, really well take it on. In bioinformatics there just hasn't been a lot of money to be made, but you can imagine that now, potentially, we're reaching a point where there could be. I'm just curious what other people think. In terms of that investment, we wouldn't fund somebody to build a sequencer, right? We'd fund them to develop the technology to get it to that point. And I think it's always kind of strange that the ones that do build a sequencer often have really crappy bioinformatics to support it, which then comes back to us to develop. So anyway, Vivian, do you want to add to your answer by talking a little bit about what BD2K offers as an opportunity to address these kinds of questions, rather than just talking about NHGRI? 
Sure. Do you want to take the lead on that, since you're the BD2K czar? I was thinking in part about the software development component of BD2K: focusing on underserved areas of software development, the opportunity for new tool development in areas that historically have not been well supported, and also the role of the centers in taking the tools they develop further than perhaps groups have in the past. Sure, so to add to Mark's point in the context of BD2K, in terms of the software, let me step back a little. When BISTI was developed, the two main program announcements covered two cases: if you're developing software that's brand new, there was a collection of R01s, R21s, etc.; and if you've already developed something and you want to maintain it or develop it further, that was the other set of FOAs available. In BD2K we realized BISTI still does that, and it's great, but we want to grow it. There are very clearly underserved areas that we really need to look at, things like compression, which don't really impact the biology in the sense of the actual biological question, but also pre-processing and filtering of data. For example, there are some grants that we now fund that look at filtering very large data sets before you do an assembly. Now, where would they have gotten that funded in the past? It's not really methods development, because it kind of gets applied to biological data; it's something that might get thrown over to NSF, and so this is where the line is blurry. So there's an example where, with BD2K, we're trying to capture these things, which I guess sit more on the side of computing that very strongly supports biology. So if we have a continuum with computing here and biology here, we're moving to the left a little bit, which is a good thing, because you need these computing support systems, in software and hardware, to do the work, which I think you were just pointing out before. 
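To make the pre-processing point concrete: "filtering a very large data set before you do an assembly" can be as simple as dropping short or low-quality reads before the assembler ever sees them. The sketch below is purely illustrative, not any funded tool; the function names, thresholds, and data layout are all invented for the example.

```python
# Minimal, hypothetical read-filtering pass of the kind discussed above:
# keep only reads that are long enough and have a high enough average
# Phred quality. Real pipelines would stream FASTQ files; here each read
# is a (sequence, list-of-quality-scores) pair.

def mean_quality(qualities):
    """Average Phred score of one read."""
    return sum(qualities) / len(qualities)

def filter_reads(reads, min_len=50, min_q=20):
    """Return only the reads passing both length and quality thresholds."""
    return [
        (seq, quals)
        for seq, quals in reads
        if len(seq) >= min_len and mean_quality(quals) >= min_q
    ]

reads = [
    ("ACGT" * 20, [30] * 80),  # long and high quality -> kept
    ("ACGT" * 5,  [30] * 20),  # too short            -> dropped
    ("ACGT" * 20, [10] * 80),  # low average quality  -> dropped
]
print(len(filter_reads(reads)))  # 1
```

The point of the example is where it sits on the continuum described above: nothing here answers a biological question, yet without this kind of computing-side step the downstream assembly of a very large data set may be impractical.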
Can I make a comment also? I think as we talk about genomic medicine in application and clinical practice, the integration with electronic health records and other areas has traditionally been separate from computational biology and the other aspects, but it also ties into the privacy of the human genome and everything else. I think it's a point where medicine clearly meets biology, and computation is right in the middle of it, but there is a tremendous lack of applications, tools, and even environments to support that data in a HIPAA-compliant manner. One of the things we're noticing a lot is that you've got a biological scientist who has a tremendous amount of data and may actually know how to use it, but they don't have access to Big Iron, right? They don't have compute systems locally, so what do they do? This is where you were describing before about moving to supercompute facilities. A trend that I'm noticing is people using AWS, which is Amazon. Why? Because you can quickly set up a VM on it, run your analysis, analyze a tremendous amount of data for very little cost, bring the results down, and you don't have to incur any local computing costs. So we're seeing people leverage these technologies to do their science, and we need to think about the impact of those things; I think it's something we should look at. No, I totally agree. In fact, one of the things we've done in 1000 Genomes is that we've partnered with AWS, and even with an AWS reseller: we have a collaboration with DNAnexus, where they've ported a bunch of our pipelines, and they can produce call sets incredibly quickly, which we think is great. And again, it gets to that sweet spot where, if they want to repackage it and make it available to other people, then from our perspective that's great: it gets our tools out there, and we don't have to worry about building a system to do that, right? So we can then focus on 
what I consider we're good at, which is the methods development part and getting an implementation done, and not the distribution channels and all the things that are much more what companies should do, right? Part of the problem we run into, too, is that when people go into review with grants that are resources, they often get dinged on the innovation component, right? Because it's a resource, it's not as cool and innovative as something else, but it's an absolutely essential need for the community. So the current system supports the development of these technologies and resources up to a point. The question becomes how you move that potentially into a commercial market that is viable in terms of a business plan but at the same time is respectful of what NIH needs to do, and I think that's a nexus we need to figure out how to handle. I think there are going to be some issues here with public-private partnerships, which we're going to have to look into, simply because people are using supercompute facilities, clouds, or whatever; those are things we're going to have to consider. So we've reached the point where we're criticizing peer review, and I'm going to save us from sailing over the abyss. I will give Jeff the last comment, not about peer review. Well, maybe it is. It's that I've often heard that while people don't mind paying to buy a sequencing machine and buy kits, they expect their bioinformatics tools to be free, and that plays into whether there is a market. Go ahead, Mark. On that detail, commenting on the slide Vivian showed about the NHGRI staff who've been involved in BD2K, I don't think I saw Betty Graham's name on there, and I just wanted to make sure that Betty got credit. Betty is one of the co-leads of the training component, surprise surprise, and has been doing a lot of work on that. It was a test; sorry, Betty. All right, thank you, Vivian, wonderful presentation. All 
right, we're going on to the last presentation of the day, which is by Heather Jenkins. She's the co-director of the NHGRI training and career development program, and she's going to present NHGRI's proposed plans to expand our current training program. Because we're getting near the end of our time and we have a hard stop to get you downtown, I'm going to remind the council that we have a closed session, a period of time in closed session, to discuss the training component as well.