So all the other exciting talks you're going to hear today are about what has come out of the modENCODE project and the kinds of interesting scientific discoveries it has produced. My job is to represent the Data Coordination Consortium, the data-herding part of the operation, and to tell you how to get at the data. So this is a talk about the data, not about the scientific results from it.

The project started out with about ten data production groups, but it has grown to involve something more like fifty labs, and it's funded by the NHGRI. The idea from the outset was that the data would be uniformly annotated and curated in a nice way, and therefore usable in the long term. Thinking about what Eric said about the human genome, in some ways it's quite simple to represent a human genome: it's one nice simple flat file with As, Cs, Gs, Ts, and Ns in it, and representing features on that is pretty easy as well, because they're just start and end points on chromosomes. But the type and range of data you're going to see later today are altogether more complicated, and although these data can end up in well-defined public repositories, the idea was that there would be one highly curated, coherent whole: one location where you could find the data, and it would be of a uniform standard. That was the purpose of the Data Coordination Centre.

Some of the people within the Data Coordination Centre were known as wranglers, and they were assigned to the different data production groups. They helped each group to submit their data, and importantly the metadata relating to all of the experiments, to the DCC. The metadata are the bits of information about the experiment: whether it was done on a fly or a worm, at what developmental stage, at what temperature, and under what other conditions. In order to take these very complex data sets and be able to use them truly in the longer term, it's important to have very good metadata collected in a very rigorous way, even though deriving it can be a kind of painful process.

So the DCC's job was to accept the data and the metadata, perform quality control checks and vet it, and then release it to the public. On release it ends up in a faceted browser, which allows you to find data sets of interest and which I'll tell you about in a minute; in a data warehouse called modMine; and it's also being made available through two different so-called cloud computing environments.
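To give a concrete feel for the kind of per-experiment metadata the wranglers collected, here is a minimal illustrative sketch in Python. The field names and values are hypothetical, chosen to echo the examples above, not the DCC's actual submission schema.

```python
# Hypothetical sketch of the kind of metadata attached to one modENCODE
# submission; the field names are illustrative, not the DCC's real schema.
submission_metadata = {
    "submission_id": "modENCODE_0001",          # made-up identifier
    "organism": "Drosophila melanogaster",
    "assay": "ChIP-seq",
    "target": "example transcription factor",   # e.g. the antibody target
    "developmental_stage": "third instar larva",
    "temperature_celsius": 25,
    "cell_line_or_tissue": "whole animal",
    "protocol_refs": ["growth", "chromatin prep", "ChIP", "sequencing"],
}

# Rigorous, uniform metadata like this is what lets you later ask questions
# such as "give me every ChIP-seq data set done on larvae at 25 degrees".
print(submission_metadata["assay"], submission_metadata["developmental_stage"])
```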
In terms of volume, so far 2,300 data sets have been through that QC process and been released, and somewhat to everybody's surprise the final number is going to be more like 3,700. As everybody moved towards the end of the project and had to clear out their data sets, we suddenly realized there were 1,400 more that needed to go through the pipeline. And as the project progressed, data sets tended to get larger, so the currently released data volume is about six terabytes, but we're expecting 20 to 25 terabytes in total. That's really more than you can comfortably carry around on a laptop, so we're in the post-laptop era. I remember meeting somebody at a conference, back in the days when disk drives were small, who had the whole human genome on their laptop, and it seemed very novel. It's also a nuisance to download that volume of data, so inevitably you're going to start moving computation to the data rather than the other way around. That will happen both through Amazon freely hosting the data and through a Bionimbus cloud that I'll tell you about as well.

So the data will be available in a number of different ways: as raw file depositions to the repositories GEO and SRA, which if you like is the crude data; through the model organism databases for the worm and fly communities, which will take the refined data from the modENCODE project, although that will take them some time; and then everything is going to be available on the cloud, where you'll be able to take computation to the data.

The front page, the portal that Elise mentioned, is modencode.org, and it has a number of different tools which I'll briefly skate through, starting with the data set search. The data set search is a so-called faceted browser, which means that, without knowing what data we've produced, you can explore to discover what there is. I don't know if you can read this, but for instance there's a list of the organisms here; if we click on melanogaster, the number of data sets that came from Drosophila is 792 (this is a bit out of date, of course), and if we were to select something else, for instance transcription factor binding sites, that number drops further, so we're discovering that there are 87 data sets related to Drosophila and transcription factor binding sites. As you make any selection, the other numbers adjust: whereas the number of Drosophila data sets started at 792, it dropped as soon as the selection for transcription factor binding sites was made. There are many different categories you can facet by, so if you want to know the output of a particular lab you could open up the principal investigator facet, or facet by cell line, tissue, temperature, or developmental stage. This is a good way of exploring what is available, and in the process it brings up a list of the candidate data sets; you can tick them off and collect them into a little shopping basket. Having done that, you can do various other things: launch GBrowse to look at a genome browser view, view them in the modMine warehouse, download them, get a list of the download URLs if you want to do it in an automated fashion, or find out where the data are on the cloud.
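As a rough sketch of that "automated fashion", a list of download URLs exported from the shopping basket could be fetched with a few lines of Python. The file name urls.txt and the destination directory are placeholders, not actual modENCODE paths.

```python
# Minimal sketch: fetch every file named in a list of download URLs
# (e.g. one exported from the faceted browser's shopping basket).
# "urls.txt" and the destination directory are hypothetical placeholders.
import os
import urllib.request

os.makedirs("modencode_downloads", exist_ok=True)

with open("urls.txt") as handle:
    for line in handle:
        url = line.strip()
        if not url:
            continue
        filename = os.path.join("modencode_downloads", url.rsplit("/", 1)[-1])
        print("fetching", url)
        urllib.request.urlretrieve(url, filename)  # simple, single-threaded download
```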
Another way of accessing the data is plain old-fashioned FTP, where there's a big list of directories arranged by organism and two files you should pay attention to: the manifest, which says what there is in the entire project, everything that was produced, and a metadata file that carries common metadata across all experiments. If you want to find out roughly what the entire project did, that's perhaps the file to look at. You can then burrow down, as you'd expect, through C. elegans transcription factors, into the parent directory for the ChIP-seq experiments, to get right down to the data files covering the entire genome for that particular set of experiments.

Back up to the top: if you wanted to go straight into a genome browser, viewing the data laid out across the genome, you could have done that through the faceted browser and then launching GBrowse, but sometimes it's useful to go directly into GBrowse. There again you can select tracks in much the same way as I just described: you could say you're interested in chromatin structure out of all the tracks, and clicking on that would allow you, for instance, to choose histone modifications, and within that ChIP-seq experiments, and within that you can select individual experiments, mouse over them to find out more details, and finally bring up GBrowse, where you can see the various data tracks laid out across the genome. And because sometimes you want to compare and contrast particular types of data, you can save track combinations, since it can take a while to set up the correct set of tracks.

Now on to modMine, which is the data warehouse. modMine is built on a platform called InterMine, which makes it easy to integrate a large variety of different types of data and to perform flexible querying, and for the modENCODE project a number of different tools have been integrated into it, some of which I'll go through quite quickly. For instance, there's a simple search: if you're interested in ChIP-seq data sets carried out on larvae, you could type in "ChIP-seq" and "larvae", but equally well you could type in lab head names or other items of interest. This brings up a list of submissions, and were you to select one of them, you can find out quite a lot of detail (which you won't be able to read here, I'm afraid) about how and when the experiment was carried out and how much data are available. There are a couple of things of note. One is off the bottom of the screen, which I'll come to next, but first this green box, which allows you immediately to download the data in various common formats, such as tab-delimited or GFF, or the sequences associated with the annotated features; or to find lists of genes or other things that overlap the binding sites that have been defined; or to find genes that are near the binding sites, within parameters you define. These are things people often want to do with these types of data. Lower down the submission page is a big blue stripy area which is kind of important: this is what the pain and suffering of collecting all the detailed metadata was about. In principle you can start at the top and find out how the strain was grown; because this was a ChIP experiment, how the chromatin preps were done and how the ChIP was carried out; how the hybridizations were done, because this was a ChIP-chip experiment; and on through how the images were collected, how the data were normalized, and how the enriched regions were extracted from those data.
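For programmatic access to the same warehouse, InterMine-based sites expose web services with client libraries; below is a minimal sketch using the intermine Python client. The service URL, class name, and field names are assumptions for illustration, not a verified modMine query.

```python
# Minimal sketch of querying an InterMine-based warehouse such as modMine
# programmatically (pip install intermine). The service URL and the exact
# class/field names below are assumptions for illustration only.
from intermine.webservice import Service

service = Service("http://intermine.modencode.org/query/service")  # assumed URL

# Ask for submissions, constrained by an assumed "experimentType" field.
query = service.new_query("Submission")
query.add_view("DCCid", "title", "experimentType")
query.add_constraint("experimentType", "CONTAINS", "ChIP-seq")

for row in query.rows(size=10):
    print(row["DCCid"], row["title"])
```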
The idea here is that if you end up using a particular modENCODE data set, you can work out in great detail how it was in fact generated. Each of these links takes you to a corresponding wiki page with a detailed experimental protocol for how that step was carried out. So the hope, and the idea, is that in years to come people can look at the data, scratch their heads, wonder how it was really produced, and backtrack through all of this in order to work out exactly what was done.

Another feature of modMine is that it works with lists. Experiments nowadays often produce lists of genes, and you can upload lists of genes into modMine: say, for instance, that you're loading a D. melanogaster gene set. This then makes a number of tools available to you. For flies (not for worms), there's a gene expression data set generated across a developmental time course and a series of cell lines, and this lets you quickly have a look at how, for instance, your favourite gene set varies or co-varies across the developmental time course. There are other widgets, little tools, for instance for GO term enrichment or publication enrichment. Publication enrichment means that if you have a set of genes you're interested in, and there's a publication that cites that particular set of genes unexpectedly frequently, it will come up at the top of the list. That's a useful thing, because you want to know about the papers citing the set of genes you've, for instance, just pulled out in an experiment.

Another tool we've recently introduced is the genome region search. You won't be able to read this, I'm afraid, but I'll describe what it is: you upload a set of regions you're interested in, for instance chromosome 2, positions one million to one million and a few thousand, and then select the types of modENCODE data sets you'd like to extract features from, and for as many chromosome regions as you provide it will give you the overlapping set of annotations. So this is a way of pulling out slices of modENCODE data corresponding to just the particular regions of the genome you're interested in.

There are a number of other features I don't have time to tell you about. For instance, we have template queries, or template searches, whereby commonly asked questions or common tasks are available as little web pages into which you can put a gene, or select a list of genes, and carry out searches with those. There are viewers for the fly chromatin states and a link to the Park lab viewer for those data, and also the interactive regulatory maps: the Science papers had fly and worm regulatory transcription factor and microRNA maps, and we made dynamic versions of those, so you can, for instance, search for a particular gene and highlight its regulatory targets.
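The genome region search described above is essentially an interval-overlap query. Here is a small self-contained Python sketch of the same idea over a BED-like feature list; the coordinates and feature names are made up for illustration and are not modENCODE data.

```python
# Small sketch of the kind of overlap query the genome region search performs:
# given query regions, report annotated features that intersect them.
# Features and coordinates here are invented purely for illustration.
features = [
    # (chromosome, start, end, name) in 0-based, half-open coordinates
    ("chr2", 1_000_200, 1_000_650, "TF_binding_site_A"),
    ("chr2", 1_002_000, 1_002_400, "TF_binding_site_B"),
    ("chrX",   500_000,   500_800, "TF_binding_site_C"),
]

def overlapping(chrom, start, end, feats):
    """Return features on `chrom` that overlap the interval [start, end)."""
    return [f for f in feats
            if f[0] == chrom and f[1] < end and start < f[2]]

# Query: chromosome 2, positions one million to one million and a few thousand.
for feat in overlapping("chr2", 1_000_000, 1_003_000, features):
    print(feat)
```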
Finally, I want to turn quickly to the cloud data. Amazon, as I mentioned before, will host data sets that it believes are going to be useful to the public, and Lincoln Stein has managed to persuade them that modENCODE data will be useful, so they'll do that for free. If you were to google "Amazon modENCODE data", then within the top two or three hits you'll get this link here, which takes you to the Amazon Web Services public data sets: browse by category, academic data, under biology, to the modENCODE, or model organism encyclopedia of DNA elements, data. At the moment that's about five terabytes loaded; as I said, it's going to go up with time.

The way things work is this: it's pay as you go, and the reason is that modENCODE funding is for a limited amount of time, so there's no way for us to pay to provide the data to you for the long-term future. But you can mount an Amazon Machine Image, which is basically a computer with all of the data and all of the tools I've just described on it, and that allows you to use all of those for as long as you like and then shut down your image. If you'd like to do calculations, you can buy time on a thousand large machines and carry out big computations. You have the choice of mounting everything, tools and all, or the GBrowse data, or just the data. The advantage of this is that we can make everything available in the long term without worrying about the system administration issues of keeping our own machines up to date; Amazon takes care of all of that. Now, if you don't want to pay, for the time being anyway, bionimbus.org has a similar environment and similar data, and as long as you register with them you can get access to the data through Bionimbus, and they'll provide computational resources as well. So for the time being the most attractive option is probably to go there, but in due course that will probably cease to exist, and then Amazon will be the long-term home of the data.

So I've told you briefly about the modENCODE project and how the data got in; the front-end portal; faceted browsing; finding data on the FTP site; GBrowse; the modMine data warehouse and the various tools available in it; and the fact that all the data will end up in the cloud. Not everything is on Amazon yet, but over the last year, as we ramp everything down, all the tools I described are going to be made available as an Amazon Machine Image, and that will be that. The last thing I should mention is that we have a help email address, help@modencode.org, and we like receiving emails; we usually respond within 24 hours, not always, but usually. In the future there is likely to be a joint modENCODE/ENCODE data coordination center run by Mike Cherry, and I believe they will probably keep responding to those emails as much as they can in the longer term. I talk too fast, so I'm happy to take a quick question; actually, technically I think I could have 13 minutes of questions, but I'm sure you won't do that.

Question: I'm a basic scientist, not a modENCODE person, and I'm curious, with these different opportunities, whether there's a simple way to load in one's own personal data. One thing we are often doing is comparing against the modENCODE data, so I'm wondering what the interface is between those kinds of data sets.

Answer: It depends exactly what you want to do, but for instance if you were to go to GBrowse, where you're viewing the data (and of course human eyes and brains are among the best image processors out there), it's possible to upload your own tracks of data, and a variety of very standard formats are accepted, like those the UCSC genome browser takes, so BED and GFF and things like that. It shouldn't be a problem, and if you have any difficulties, email help@modencode.org.
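As an illustration of the custom-track upload just mentioned, here is a tiny Python sketch that writes a BED file of the sort genome browsers accept; the track name and intervals are invented for illustration.

```python
# Tiny sketch: write a custom track in BED format, the kind of file that
# genome browsers such as GBrowse or the UCSC browser accept as an upload.
# The track name and intervals below are made up for illustration.
my_regions = [
    ("chr2L", 10_000, 10_500, "my_peak_1"),
    ("chr2L", 22_300, 22_900, "my_peak_2"),
    ("chr3R",  5_150,  5_600, "my_peak_3"),
]

with open("my_track.bed", "w") as out:
    out.write('track name="my_experiment" description="regions to compare with modENCODE"\n')
    for chrom, start, end, name in my_regions:
        # BED is 0-based, half-open: chrom, start, end, name
        out.write(f"{chrom}\t{start}\t{end}\t{name}\n")
```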
Question: I sort of have the opposite question. Is there a way to incorporate analyses that people are going to do from this moment forward? Not just by uploading data, but also capturing the results of the analyses?

Answer: Some of the data sets that the consortium has vetted and released are in fact analysis or re-analysis data sets, so yes, in theory. Right now, as we stand here, it's difficult to do, because we're ramping down and we have a very large backlog of submissions to get through; as I said, we were all kind of surprised, so it's hard at this juncture to take on more data sets. So what you're asking is, in the long-term future, two years from now, how would one go about doing that?

Question: Well, the hope is that this massive amount of data will be used by the fly, worm, and other communities, and so it would be really great to be capturing that in a centralized site, which I'm hoping will be taken over by ENCODE. I was thinking not about what you have left to do in the next year or half year.

Answer: I see what you're saying. I guess there are two kinds of answers to that. One is that, because we'll put everything up on the Amazon Machine Image (and Nicole, correct me if I'm wrong), in principle that will include the entire submission and processing pipelines, so anybody who wanted to could rehydrate the entire data processing system and load in new submissions. But I think something slightly different will probably happen in practice, which is that the distillate, the conclusions from the experiments, though maybe not all the detailed data supporting them, will end up in WormBase and FlyBase, which are mandated in the long term to curate and provide data to their communities. We already work closely with them and are already passing off data, and FlyBase has done a really nice job of, for instance, displaying the expression data. So I think that may be a way forward, and I know FlyBase has been looking at, for instance, using InterMine as a way of managing this scale of data, because, up until now (I'm not actually sure what the relative volumes of data are), I believe the modENCODE data sets really are quite big, and so how to deal with them may be causing some head-scratching in the model organism databases as well.