I've seen Rob give presentations like that several times, with the moving ball through the PCA cloud, and I've been trying to come up with a good name for that. At first I thought it could be just "ping pong ball in a hailstorm." But I actually realized that the cloud is all the topics that Rob wants to cover, and the ping pong ball is just touching each one of them as he proceeds forward.

Okay, I have a very, very gentle commentary for everyone, which is: I respect and admire you all tremendously, but I think that when you've been talking about your charge, which is to tell us a little bit about the gaps, you've been in a sense a bit too polite. I mean this sincerely. When Lita and I talked a lot about what we wanted to accomplish with this meeting, we really had in mind that one of the things you would do, and it's certainly happened, people are definitely touching on this, is school us in, really, no kidding, what are the obstacles to you getting your science done. What if you could just spend a little bit of time really letting us know what the important things are? I say this respectfully. Curtis, when he presented, said, "Well, let's talk about the gaps in our knowledge," and other people have talked about the gaps in that context, and that's great. I think it's just that we're all getting to know each other in this new field and we're behaving in a very polite way. But I strongly encourage people, if you have time, to dedicate some thoughts in your slides to being very direct about what types of things are preventing you from accomplishing your research when you're starting to talk about gaps. On that subject, I will do what I usually do in most of my presentations, which is to show almost no data whatsoever, and the entirety of the talk will be about some of the gaps or limitations or at
least concerns that I have about the field as it goes forward.

So, to introduce myself, I just want to say that I am the PI of the Data Analysis and Coordination Center (DACC), which was jump-started at the beginning of the HMP. We do have funding for another year, so we're around, and we're trying to do the best that we can to help meet the needs of the microbiome community, which is obviously getting a lot larger. We supply lots of resources. We supply lots of training. We also have a computational infrastructure that we really would like to encourage people to use. We've developed virtual machine-based software that can be run out in the cloud. And that's it. That's my presentation on the DACC. I'm not really going to talk about that much more. When I do talk about other issues, feel free to editorialize with me later on about what might be issues that the DACC could take on in our last year as we go forward.

Another way to introduce myself is to say that I was involved in this fabulous constellation of working groups that were associated with the Human Microbiome Project. You can just kind of see them here, but there were a number of different groups that got together for data analysis, and it's a pleasure to be presenting after Rob Knight and Curtis Huttenhower, who contributed enormously to the data analysis of the HMP. One of the things that we did for the HMP in our data analysis was to find different ways to start examining the community composition of the healthy human subjects that were studied. This is by way of saying I'm going to look at a different community composition. I would like to talk a little bit about the community composition of all the scientists in this room, so that's a new meaning for the term "microbiome community composition." To do that, I first would like to make a point. This may look like I'm showing you data, but I'm not actually doing that.
This is a slide that demonstrates the size, in terabytes, of the sequence that was generated by the HMP, and the only point that I really want to make from it is that, just by talking to two people at this meeting, they've already warned me that there will be some very much larger bubbles coming. This was around four terabytes. They've already told me that there are going to be some data sets coming along that are much larger, but we can also expect that there will be lots of small data sets, as described by a lot of the presenters here, and there's probably going to be a great diversity of them.

So, backing up, let me talk a little bit about two extremes of the NIH approach for how they get science done. One is exemplified by Francis Collins's great legacy of establishing these really large consortia. Some characteristics of these consortia: they tend not to be as hypothesis-driven in comparison to an R01 study; they tend to be very, very large engineering projects, like completing the genome of some macro species; and the source of funding tends to be from one institute, so it's much easier to obtain a kind of top-down control over the way people go about things like collecting samples or making the data available, or just getting very simple things done, like the way all the DNA would be prepped in order to do a large sequencing effort. So that's one approach, the consortium style of NIH research.

All right, here's a word cloud of some microbiome text, and let's just say what we do is really diverse. We do a lot of different things, and that's the only thing I was trying to demonstrate with this. And this, you obviously can't read it, and I'd be happy to give this to people, but this is a table of the current RFAs that have now emerged, all of which are associated with microbiome work. There are three or four that are out of date; there are 20 that are out there to support microbiome work
and the good news, as Francis mentioned, is that there are lots of different institutes that are taking this up. Okay, so this is to contrast with how consortium-style science might be done. This is a table just to say that there are a lot of different diseases associated with this. Curtis talked about this and others talked about it, so I don't really have to go into that. And then, when I did a quick survey of all the session speakers and made a word cloud from the different departments they came from, there were lots of different disciplines.

So if I'm going to characterize the microbiome community, one of the things that I'm going to say is that we are very diverse in research areas, and there are implications to that. We're supported across lots of different institutes. We tend to be hypothesis-driven, which is great. We're studying many diseases in many, many systems. We're multidisciplinary. And some of you are big data generators, and then there's also going to be a diversity of, let's say, smallish data generators. So that's our characterization of the community, and there are implications to that. Don't get me wrong, I love you all enormously. I'm not working up to some criticism that you're somehow doing it wrong. I celebrate the fact that you are coming from many different disciplines, you have many different levels of expertise, you're asking many different questions, and you're hypothesis-driven. That's great.

Okay, so contrast that with the consortium style. That's one way that we could go about it, or we could talk about this diverse population. One of the things that was attractive about the consortium style is that there were common processes behind where a sample came from. There were common approaches for the way that we dealt with data. It would all get released. There was metadata, and it was pretty easy to make available to the community for
these. When you did the cow genome, it was obvious there was one straightforward protocol for identifying that DNA and how it was getting produced. There was also a centralized computing infrastructure, so all the members of a consortium were beneficiaries of the large genome centers or other centers that supplied computing equipment to help them get their job done. So that was really great.

Okay, we're not so much like that. This group of people is not so much like that. You don't have one computing infrastructure that you count on; you don't have one protocol that you can count on for the way that you go about isolating DNA. And there are some implications to that. So I'm going to humbly ask if there are ways in which we could be a consortium of sorts that is coordinated in certain ways. The only reason that you'd want to belong to a consortium is if it was a benefit to you, so let me try to make an argument for how it might be a benefit to you.

If we, for example, engaged in some type of protocol standardization, there might be some increased certainty about data. There might be increased usability and also, more importantly, reusability. If we've got lots of R01 researchers out there and all of our budgets are really tight, wouldn't it be nice if we could count on being able to combine the data from other experiments to increase the ultimate power of what we're working with? Rob just gave exquisite examples of how it would be nice to be able to reuse data. Wouldn't it be nice if we, as a large consortium, had a streamlined IRB process? Wouldn't it be nice, as a large consortium, if you had somewhere to go when you had computational needs? Also, one of the consequences of you coming from different disciplines is that you're going to have different levels of expertise in knowing how to generate lots of those beautiful figures that you've seen already. I would like it if everybody in the room knew exactly how to do that, and I would hazard that you don't. Okay, and wouldn't
it be nice if there were some common training?

So, this was already said by Rob very well. Rob speaks very fast and he skated over some difficult issues, but if you want to get all the stool samples that are out there, at some stage you're going to be presented with a thing called the Short Read Archive (SRA). This is an example of when you go to the Short Read Archive and you type in something like "human microbiome stool," and you get these entries, and it says "results 1 to 20" out of 746 pages. There are a lot of problems associated with just trying to retrieve data from SRA. It's not that I want to kick SRA in the shins; it's that they have a very difficult job and they're being supplied with data that is incomplete.

Okay, so there's that list, and there are many ways in which you have a type of uncertainty about that list. You don't know a whole lot about its origin. You don't know where those samples were prepped. You don't know what patients they came from. You don't know how the library was made. If there were primers used, you don't necessarily know what those primers were. There are other things, though. You don't know if there is some generous researcher out there who collected those samples and put them in a refrigerator and wants them to be available. Same topic: they may have been recruiting subjects or volunteers for a study, and those people may still be available for further downstream studies, and we don't know that. When you're going to this list, you don't necessarily know if the data has a publication behind it, or if that publication was cited by other downstream publications.

Then this is a big one that concerns me: you simply don't know the quality of the data from one study to another when you're presented with that list. You're presented with a list; it came from stool, but it doesn't tell you if there were any types of errors along the way in generating it. It's very difficult to pull out things like patient
phenotype. And this is the one that really startles me a lot, and this is true right now: there are lots of data sets in SRA that were associated with a study, but you don't know which data set was associated with a disease. That's astonishing to me. So there are issues in terms of data uncertainty, and I just want to call your attention to it.

Okay, so I'm arguing that there are some benefits to increased coordination, and I'll talk a little bit about them. I think that with improved submission standards and with provenance, it would be possible to learn more about your data. If you wanted to know if a sample was available, I would say that we might be able to do something like create an investigator registry to track things like biosamples. That registry might also tell you about volunteers. With respect to publications, I'm going to show you an example of being able to track investigator publications. There are also ways that we could do large-scale QC on those different data, so you'd have information about quality. We could get smarter about improving the way that people are submitting data to dbGaP and deal with this bogey of being able to track things like disease phenotype. So I'm arguing to you that maybe, if we start pulling together and operating as a group, these things may happen. With the existing data that's out there, we have a fighting chance of that happening.

So you can imagine me: I stared the data management process for the HMP in its ugly face, and that was four centers that were getting together and trying to herd data, and it was difficult enough. As I imagine this diversity, this wonderful, beautiful, tremendous rainbow diversity of researchers out there who are going to be doing all these different types of studies, it really worries me if we aren't trying to be a bit more coordinated about this. I just think that there are certain benefits to that. Has my accent changed to sound more like Rachel Maddow? I
really feel like I'm channeling her right now.

So, with respect to sequence submission, I humbly suggest to you that it might be useful to have a data coordinating center that was helping as a submission broker. There may be many different models for that coordination center. For sequence submission, we could just give you improved submission tools; we could be giving you better submission tools for sequence and metadata standards and things like provenance. I was working on an effort to get letters of support from different journals that were actually quite interested in enforcing things like metadata standards and having people submit them at the same time as when they were getting their paper published. We could also require investigators to describe things like sample and volunteer availability. So let me put out there that there are lots of NIH staff here who are also putting out their RFAs. You may want to consider requiring that people are making this happen, or at least describing whether that's happening. That, to me, is a type of recommendation that I'd like to see emerge out of this meeting. The journals could also play a role in capturing protocols, and the nice thing about that is that this could all be described as supplemental data that would be hosted at journals, which I think would be a good thing, if you had a sort of distribution of where that data resided in addition to being at a coordination center.

Okay, now, when I say the term "investigator registry," I don't want you to think that means I'm going to slap a radio collar on you and then require you to do a bunch of things. I shouldn't really use the term "investigator registry"; I would like to maybe say a "consortium participant list" or something. And I would argue that it'd be useful to be tracking PIs based on maybe the literature that's out there or the grant funding that's out there, and contacting them and asking them: do
they have volunteers, or do they have biological samples? Do they have publications? Do they have protocols? And I think that the investigator registry could also play a role in helping with coordination of IRB approval.

So I'll just make a point that this is something called the NIH RePORTER, which is actually a very nice system. It's obvious that a lot of resources have gone into it. I've contacted them and asked them if it's possible for an independent entity, like a coordinating center or somebody else, to be adding more data to this, and they said yes, there are ways. It didn't seem like the greatest model in the world, but they said it would be possible. So you can imagine that this could basically be a good use of the taxpayers' money. Here's this great system, and we could use it as something of an investigator registry. We could attach keywords to investigators and say, is this person a member of the consortium or not, and then you could pull down all the people that were and get a lot of information. So that's just one thing that's out there.

I think that there are ways to capture names of investigators. There's a brilliant young man named Ilya, who worked with Larry Forney, and he's developed methods for culling information from PubMed by capturing data from abstracts and doing a little bit of text mining, and essentially creating networks: these investigators are associated with these investigators. Depending on what keywords you use, you might be able to pull out all the investigators or all the publications that are associated with a vaginal microbiome study. And I think that this system, which is called co-pub net, could be capitalized on to start seeding the investigator registry. This is a network that comes out of it, of different people's names that are associated with a particular research project. I'll also make the point that there are plenty of people out there who have
actually written software for things like volunteer registries. There are some great efforts out there, and we could be using software like this to the benefit of this consortium if we wanted to. So that's out there.

I'd also like to point out this effort. This is an effort by the CTSA program, the Clinical and Translational Science Award program, and it's called IRBshare. IRBshare is one creative solution to the way that we might be dealing with IRBs. There are probably people out in the audience who have a lot more experience with IRBs than I do, so I don't want to insult anybody. I'm sure that there are plenty of policy obstacles, and there might be some details of this that I'm not getting absolutely right. I'm just suggesting to people that maybe we could be a bit more creative about things.

First I should say that this is strictly for multi-institutional studies, when there's just one study and multiple institutions. In that scenario, normally each investigator has to go through their own IRB. What they do here at IRBshare is, if there's one IRB and they've gone through a process, they can submit all their documents to this. It's an actual server where they get all these documents, and then they've got different people who sit on a centralized IRB and go through the review process. So the common institutions can submit review documents, and they sort of split up the review process. The nice thing is that it promotes consistency and compliance and kind of eases the process of IRB review, because you can see, hey, this is how another institution handled it. So I'm just offering to people that there are things that maybe, as a consortium, we could do.

Again, you could be presented with this list, and there's a really big issue, and Rob talked about this, that you could have lots of samples and you just don't know that
much about the metadata behind them. You don't know that much about the quality or many other things. We went through an exercise with all the demonstration projects for the HMP where Steve Sherry at dbGaP looked at patient variables. That's what these are: different patient variables that were collected and submitted to dbGaP. He colored a column if a particular study had a variable that was in common with some other studies, and the thing that was really astonishing is that even things like age or height or whether people were smoking, all those variables were tracked in a different way. So it was very hard to retrieve, say, "give me all the studies that were associated with people who had taken a certain antibiotic," and I found that surprising. Rob already mentioned it: we have a standardized system for describing metadata, and I think it should be exploited.

We also have something called PhenX. PhenX is a study sponsored by the NIH that's meant to deal with the process of converging on common variables across studies. The way that they do this is they establish a working group. The working group looks at a small number of measures, a workable set of measures, just like I was showing you with that previous slide from Steve Sherry, and they get input from the research community. They review that data, and then eventually they make a final set of measures. They also have published a tool called PhenX, which assists people with the process of marking up their data dictionary from their clinical studies. It's actual software that helps you do this and helps you make sure that it's marked up the proper way, and then it helps you create the submission documents to dbGaP. This is another thing that, as a group, as a consortium, we could be considering.

Okay, so imagine if you're a member of this registry. Imagine if you decided, "Okay, I'm going to drink the Kool-Aid, I belong to this consortium." We could do things like help you with management of
IRB forms. If we knew that party A is searching for subjects with these variables, and you gave us your IRB and told us what variables you were interested in, we could alert you to the cases where there were study participants who would like to participate in your study. Or we could help you with tracking publications. We could help you manage stuff into SRA. This is the sort of scenario we could be working towards.

Last thing I just want to mention very quickly, and Rob already talked about this. This is a slide that was created by a group of people who were trying to standardize assays for a completely different domain. The point is that there's a lot of give and take, but they established working groups where eventually they made harmonized protocols, which does not mean identical protocols. Harmonized protocols meant lots of labs could be generating data such that the resulting data was comparable between studies. I would argue that we should establish some working groups that are really setting about the business of doing this. Rob gave some great examples of how, for specific domains, like protocols having to do with stool or protocols having to do with different body sites, we could establish some harmonized protocols.

Okay, so I'm just going to skate forward really quickly. I just want to say that at one point we performed an email poll asking the community what type of analysis methods they needed. The most popular response was metagenomic assembly, so we held a two-hour webinar on assembly given by two luminaries in the field, and 109 people sat in on that seminar to listen. It was a fantastic success. I just want to bring that up; that's one of the things that the DACC is doing, but it gives you an idea of the hunger that's out there for training, and there needs to be more of it.

Okay, so very quickly, if I were going to identify gaps, I would say we have gaps in
training. You're experts coming from lots of different fields, and you may not know how to do some of the statistical analyses that you saw presented today, or you may not have the computational equipment to be able to do it. So we need more training. I would argue, and I shouldn't use the term "PI registry," that we should have some type of consortium centralization site that is tracking lots of this information. There are lots of different types of QC that we could be doing on all the data that's in SRA. Another big gap that we have, and I've only heard one person really mention it so far, is that there aren't a lot of resources out there to help people do processing or add value to the data that exists at SRA. We clearly could be going a long way with harmonization of protocols. Imagine if all the data that you were looking at was stamped with a bunch of standardized protocols that it was derived from, and you could with confidence make some of the PCA comparisons that Rob was showing. It would simply be adding power to your own data. There are gaps in terms of how people go about submitting data to the different repositories that are out there, and I'd really like to see some progress there.

Okay, so I don't have an attribution slide. I just want to give thanks to all of you and everybody in the audience for hearing me out, and we'll just take it from there. And there are the names. So I think we have time for one or two questions. Don't be shy, don't be shy. Okay, there you go.

One of the things I thought was missing there was the ability to get help in analysis of data. When you've generated a data set and have never seen one like it before, it would be very helpful if there were people who'd say, "Yeah, we could Skype with you one day and help you go through the data and analyze it." We just had a case where we hadn't done arrays for a long time, and we got some IT people who showed us ways to analyze the arrays we never would have thought of, because we hadn't
done that for a while. So I think setting up some type of technology analysis set of people who volunteered to do that would be very good. So thank you.

I'm sorry, I obviously didn't emphasize that enough. I agree, I agree strongly. I think that we could literally have help desks where people were able to contact some place and get assistance. At the DACC website, and I'm not saying it has to be us, but at the DACC website we have things called walkthroughs, which are step-by-step processes that tell users how to perform analyses. I think at meetings like this we need to have breakout meetings where, you know, it's the latest cool publication that's out there, you're just being walked through it, and people are describing how they generated the information in figure two. I think all of those things should be happening. So I agree emphatically, there should be more training.

I also wanted to say something about the standardization. I think it's very important. I was involved with the cytokine and interferon society, where we were worried that people were reporting cytokine levels in all different papers and using all different kits.

You bet.

So what did it mean when you used a Bio-Rad kit or another one? And so we went to the companies and we went to the journals and said they all have to be standardized against WHO standards, if they're available, and that way you could interpret data between papers and have it mean something.

I'm totally with you. My concern is we have this wonderful, we have a Cambrian explosion of RFAs that are coming out from the different ICs, the different institutes here at NIH, and I'm just really hoping that they each, as individuals, start to understand that it'd be nice to standardize across the entire NIH, for exactly the reasons you're saying. One more question.

I may be missing something, so maybe you can bring me up to speed, but my experience in trying to find a control spike to put into genomic or, rather, 16S or metagenomic data sets is that
there is one available from the BEI and the HMP.

Yeah.

But in practice, when you request it, although it's free, you're told very specifically you can only have one, and that's got to last you for the year.

Right.

And so if you actually start using the sample in all of your library preps, in my experience it wasn't really a practical solution to what I think is a very important problem. So I wonder if you could comment on the importance of a control spike that people should be putting into these experiments, and then also on what it means that we can only request it once a year.

So I'm with you a hundred percent. I don't have firsthand experience to be able to account for what sounds like a ridiculous policy associated with a reagent, but I will certainly use this bully pulpit to say I think just a small amount of resources should be put towards some type of working group that's generating reagents like that, and they absolutely have to be made available without any encumbrance whatsoever, preferably with some very, very nice publications to go along with them, to tell people about the value and what they could be doing with reagents of that kind. So I can't quite account for what sounds like an odd story. Yeah, go right ahead.

So this is Maria from NIAID, and we're the ones who actually sponsor BEI, so I would be very interested in talking to you if there is an issue with the amount of reagent you get. We can talk offline about it, because I think that maybe it was set up that way and maybe we can change things. We can be flexible with things like that. So I don't know what the issue is, but we can talk about it.

Okay, I think we're now moving on to a question-and-answer period, and I'm saying this very sincerely: I think that there's a way in which we're all getting to know each other. We know that there are a lot of different people from a lot of different backgrounds in the
room, but I really do strongly want to encourage people to speak frankly about the concerns you have, for us to be able to go forward and just succeed at our job. And so, Lita, I think...

Yeah. Oh, well, let's thank... we should thank Owen and Rob for two very good talks. Thank you.