[Transcript garbled: introductory remarks on data and computation at the EMBL-EBI, the questions the speaker wants to address about big data in the life sciences, and why this matters.]
[Transcript garbled.] ...a lot of data doesn't really fit in the kinds of databases that we have at the EBI. The big data that I'm describing is frequently supported by funder mandates. There are long associations with journals that require deposition in the databases. There's metadata that describes them, and international standards that keep track of the formats that we use. Many of these databases are run in international collaborations with partners in the US and Japan, and in Europe, for example with the Swiss Institute of Bioinformatics. [The remainder of this passage is garbled in the transcript.]
[Transcript garbled.] What does it mean to hold data here? The data can be accessed in several ways: by FTP, via web pages, and via web services. The reference protein database, UniProt, has about 1.2 million entries; PDBe is the protein structure database. [A garbled passage follows, mentioning metadata and some 26 million records.]
[Transcript garbled; a passage mentioning about 1.1 million items.] UK PubMed Central is a more recent addition. This is a full text database, so this is not just abstracts, this is the full text: either scanned and OCRed content for the older stuff, some PDFs, and the majority of the newer stuff is all in XML according to the NLM DTD, which is public again, and a lot of newer publishers just use the NLM DTD now rather than creating their own. There's a website, and we're soon to release a web service on this, which I hope people will use; I can give you details of that in about two weeks, I hope. In fact, it's a deliverable for next Wednesday, so we'll see. It's very soon. The UK PubMed Central website and web service is supplemented by all the abstracts in CiteXplore, so when you use the UK PubMed Central website you can search all the abstracts and all the full text. All the full text articles are represented as abstracts, so essentially we have a database of abstracts for which we have some in full text, basically. That grows at a slower rate, but it's about 150,000 articles a year. A little bit more about UK PMC: it's built in collaboration with PubMed Central in the United States, developed at the National Library of Medicine, and there's another node in Canada.
The European Bioinformatics Institute is now leading the project, and we do this in collaboration with the British Library and the University of Manchester. It's supported by 18 funders that have mandates that say: if you are funded by us, you must deposit your articles in UK PubMed Central. The total spend of those funders is 2 billion GBP. They're led by the Wellcome Trust and include all the major life science funders in the UK: the BBSRC, the MRC, Cancer Research UK, and so on. The European funders are Telethon Italy and the Austrian Science Fund. As I mentioned, it's a life science repository. It has a manuscript submission service, i.e. it has a self-archiving route for grant holders. It has a database of grant information; we have details of about 18,000 PIs, that's the grant holders. It has a grant reporting and funder analysis tool, so the PIs can come into the database and claim articles that are theirs and link them to the funding that they got for that article. Likewise, the funders can come in and ask: how many articles have we funded, which are the most highly cited articles that we've funded in this database, things like this. In terms of usage, we're up to about 8 million requests a month, and there's about 40,000 IPs every day. So, how are we doing on the open access stakes? Well, I was quite surprised to learn, actually. I just hadn't looked at this figure for a while, at the bottom: now 20% of the articles in UK PMC are open access. A few years ago, this wasn't the case; it was only about 10%, and it really shows that open access is increasing in the life science research space. We get these open access articles for the most part via the usual publication routes that scientists use. So, for example, if a scientist publishes an article in PLoS Biology, part of the deal with that publisher is that that article, because it's open access, goes straight into PMC, and from PMC it gets circulated to all the nodes, UK PMC and PMC Canada.
So, that's why we've got so many OA articles, for the most part. Only about 15% of the OA articles come through the self-archiving route, so 85% come via publishers. And actually, if you look at it year on year, I think about 40% of the articles that have a publication date of 2010 were open access. So, there's still a long way to go, but things are improving slowly. So, how do we make the connections between all this data? Well, two main ways. Firstly, links. When people submit sequences to the sequence databases, there's metadata associated with that sequence. And one element of the metadata is usually a publication describing the data they've just submitted. So, what we can do is turn those links around, and from the literature link back to the database. We can do this for several database types. That's why I call them links by the author, author-made links. But also, in cases like UniProt where the database has got a lot of curation on top of it, a lot of the job of the curators is to look through the literature at quite a detailed level. You're looking at figure legends and things like this to link that UniProt record to the literature. So, we also reverse those links. And this is expensive, of course, because it requires people to do it. And it's quite slow. I know the UniProt curators look at many, many more articles than they actually add to the database. They're only adding two or three articles a week, I think, per curator, but they're looking at 50 a pop. But it's very high quality: you know they're going to be good and relevant and true. We also do text mining. This is done by algorithms, based on dictionaries or terminologies: you basically look in the full text for those terms. The good news about this is that it's very fast and you can get through millions of articles. The bad news is that the quality is variable.
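The "turn those links around" idea amounts to inverting an index: each database record carries the PubMed IDs of the papers in its submission metadata, and we flip that so each paper points back at its records. Here is a minimal sketch in Python; all accession and PubMed identifiers below are invented for illustration.

```python
def invert_links(records):
    """Invert author-made links.

    records maps a database accession to the PubMed IDs of the
    publications listed in its submission metadata; the result maps
    each PubMed ID back to the accessions that reference it.
    """
    by_paper = {}
    for accession, pmids in records.items():
        for pmid in pmids:
            by_paper.setdefault(pmid, []).append(accession)
    return by_paper

# Hypothetical submissions: each sequence record names the paper(s)
# that describe it.
records = {
    "X51234": ["9000001"],
    "Y00001": ["9000001", "9000002"],
}

# The inverted index lets the literature link back to the data.
links = invert_links(records)
print(links["9000001"])  # ['X51234', 'Y00001']
```

The same inversion works for curated links like UniProt's; the only difference is how expensively the forward links were made.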
It depends on how good the algorithm is and how up-to-date the dictionary is, and so on. And you can get quite a lot of false positives sometimes, which annoy scientists no end. The other good news is that because it's post-publication, you allow a layer of computation that might be able to find new associations that the individual authors, when they were writing their paper, weren't thinking about. In terms of the links we make, here's a list of the data types we link to: proteins, nucleotides, OMIM (Online Mendelian Inheritance in Man, a very nice text-based database on genetic diseases), chemicals, protein structures, clinical reviews, protein families, protein-protein interactions, and the list goes on. We're soon adding links to our gene expression experiments. The winner in terms of the number of articles linked is UniProt, because of all those curators working all that time: there are 800,000 articles in our space that have been linked to UniProt. And the numbers go down to things like protein-protein interactions, where a lot of scientific work is needed to define a protein-protein interaction unequivocally, so there aren't many papers there; I think there are about 5,000 papers in that column. In terms of the text mining, we currently mine six semantic types: gene and protein names; Gene Ontology terms (an ontology to describe cellular and molecular biology); organisms or species; diseases; accession numbers, or persistent identifiers; and chemicals, again from an ontology, called ChEBI. And as you can see from the profile, the bottom line is the numbers are quite big. On the far right there's the total number of annotations: there are 15 million gene and protein annotations across the whole data set. But actually that covers only a quarter of a million unique terms. The unique terms figure is the number of terms in the dictionary that you're using.
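The dictionary-based mining described here, and the distinction between total annotations and unique terms, can be sketched in a few lines of Python. The two tiny dictionaries and the sentence below are invented for illustration.

```python
import re

def mine(text, dictionary):
    """Scan text for dictionary terms (whole-word, case-insensitive).

    Returns the total number of annotations and the set of unique
    dictionary terms that were actually found.
    """
    annotations = []
    for term in dictionary:
        pattern = r"\b" + re.escape(term) + r"\b"
        annotations.extend(term for _ in re.finditer(pattern, text, re.IGNORECASE))
    return len(annotations), set(annotations)

text = "MSH2 repairs DNA mismatches; mutations in MSH2 cause colon cancer."

# MSH2 occurs twice: two annotations, but only one unique term.
print(mine(text, ["MSH2", "MutS"]))  # (2, {'MSH2'})
print(mine(text, ["DNA", "RNA"]))    # (1, {'DNA'})
```

Counting every occurrence gives the "annotations" figure, while the size of the set corresponds to the "unique terms" figure; a dictionary full of generic terms inflates the former, echoing the false-positive problem just mentioned.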
In terms of the number of articles touched, well, there are 2.2 million articles in the data set, and overall some of these types are touching 1.8 million articles. So you can see it applies quite nicely across the whole data set. Although not without challenges, as you can see at the bottom: for the chemicals here, the dictionary size is quite small, only 76,000 terms, but the number of articles annotated is enormous and the number of annotations is the winner, 22 million. And that's because we run into these difficulties where you're crossing domains. Whereas the gene and protein dictionary is very relevant for biology, the chemical dictionary is slightly off. Obviously there's a lot of commonality between chemistry and biology, but in the world of chemistry the term RNA or DNA is quite an interesting, novel term, because they're talking mostly about small molecules, whereas in biology DNA is a rubbish term, pretty much. So a lot of these large numbers of annotations will be for things like DNA and RNA. And I know that OpenAIRE is very interested in cross-disciplinary integration, and that is one of the interesting areas of research. So here's a case study from biology about how we then use those links, or how researchers might use those links, when they're trying to do their science. How many biologists do we have in the audience? By training, no matter how long ago? Thirty years ago. Thirty years ago? Oh well, we were teaching evolution then, I think. There are some basic truths in biology that help link us all together, and one of them is that in evolution we all derive from a common ancestor, from the primeval slime 3.9 billion years ago. And since then we have diverged, depending on where we live and how we live. Basically the eukaryotes, of which we animals over there on the right are one group, are related to bacteria (Aquifex, down the bottom left), but only very distantly. And of course we are far more closely related to things like chimpanzees and mice and rats.
So basically, the more similar we are as organisms, the more similar our DNA is. However, because we are all related on some level, there are some elements of DNA that are the same across all spheres of life. And on this basis we can do a lot of computation with our DNA and our databases, to see where the similarities are between DNA from different organisms. So I'm going to tell you quite an old story that illustrates the value of not only keeping the data archived, but keeping it archived in a way that it can be computed on: it can be reused by others coming after you, doing their own experiments. So this is quite a long time ago now. There were some researchers in the United States who were looking at human colon cancer, and they discovered a gene that was implicated in human colon cancer, and they sequenced it. This is represented by this "query" line at the top, which is repeated three times. What they did is they took this sequence and searched the sequence database with it using an algorithm called BLAST, a sequence similarity algorithm, which basically asks: are there any sequences in the database that are similar to mine? And lo and behold, there was a similar sequence that came up in the database. That's the "subject", which is the second, fourth and sixth line. What BLAST does is align these sequences so that you can see the areas of similarity, and here the similarity is indicated by a repeated letter in the middle line. So you can see they're not identical, but there is a lot of similarity between the two sequences. The researchers got very excited about this, because this is an E. coli gene, and E. coli is only very distantly related to humans, but it is related on some level. And when they looked at the record for the E. coli gene, in the metadata there was a paper associated with that gene, so they clicked on the link to the paper, and lo and behold, that paper described the role of that gene in DNA repair.
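The middle line of a BLAST alignment, where a repeated letter marks each position at which query and subject agree, can be reproduced in a couple of lines of Python. The two sequences below are invented for illustration.

```python
def match_line(query, subject):
    """Build the BLAST-style middle line of a pairwise alignment:
    repeat the letter where the two sequences agree, blank otherwise."""
    return "".join(q if q == s else " " for q, s in zip(query, subject))

query   = "ATGGCGTACGATCTG"
subject = "ATGGCTTACGAACTG"

print(query)                       # ATGGCGTACGATCTG
print(match_line(query, subject))  # ATGGC TACGA CTG
print(subject)                     # ATGGCTTACGAACTG
```

Real BLAST also handles insertions and deletions with gaps and reports significance statistics such as the E-value; this sketch only shows where the visual "similarity" line comes from.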
So immediately, by doing not very much work, and actually leveraging the work of others that had gone before, they found out that the gene they'd isolated, associated with human colon cancer, likely had a mechanism that involved DNA repair. I.e. when it's mutated and it goes wrong, the DNA is not repaired properly, so cancer ensues: the cell processes and cell division go wrong, basically. So that was their hypothesis. Of course they had to go back and prove this with experiments, they couldn't just believe it, but this was a very strong indicator and gave them real pointers to do more experiments. Now, this paper was published a while ago, in 1993, in Cell, and here it is in UK PubMed Central. I don't know if this counts as an enhanced publication or not, I'm not sure, but here's the abstract. Text mining has been applied to the abstract here: the Gene Ontology dictionary has picked out the term mismatch repair, and the diseases dictionary has picked out hereditary non-polyposis colon cancer and polyposis colon cancer. Genes and proteins: MutS and MSH2, that's the names of the genes, the E. coli and human genes, and so on. We've got some funding information at the bottom; this is all funded by the NIH, various institutes at the NIH. At the top we've got the value-added stuff that UK PMC has done, in the tabs. We've got a citations tab that shows you the number of articles citing this paper, and the articles themselves. Bioentities shows you the terms that we've mined from the full text, if available, and the databases we've linked to. Related articles is a natural language processing algorithm that matches abstracts that are similar; that's a service from PubMed. So the PubMed ID, the identifier of this article, is at the top, 8252616, with some magic syntax that is secret, which we hope to make less secret in the future. You can ask how many articles cite this one using that identifier. What makes it unique in our space is the suffix "med".
And here they are. There are actually 849 articles that have cited this one in our information space, and for two of them here, this little multicoloured flag signifies that we've got that article available in full text. One's in PLoS ONE and one's in the Journal of Gynaecological Oncology. So all these papers have cited this one since '93. That's quite highly cited, actually; 800-odd is pretty good. Most papers get cited fewer than 50 times; actually, half of PubMed has never been cited at all. So it's a famous paper. Actually, I should say, if you use UK PMC there's a times-cited sort order there now. If you're looking at any list of interest, you can sort by times cited. And actually this is what Mark Walport of the Wellcome Trust did when we implemented that: the first thing he did was look for articles funded by the Wellcome Trust and sort them by times cited, to see which was the best, so to speak. So, one of these papers that cited the original one actually went on to elucidate the structure of that E. coli gene's protein. This is the protein structure here, and it's a very pretty picture showing the topology of all the domains of the protein. And you can see there's a DNA helix in little dots around there. So this is showing where the protein binds the DNA helix, presumably to do its job in DNA repair. So I'd like to read more about that. In this database, again, the metadata links to the literature, to the paper that describes this structure. Unfortunately it was published a few years ago and it's at nature.com. I can't read it, because I'm not at an institute that subscribes to Nature, or I'm at home. I suspect actually that if this paper were published now it would be available, because it's funded by the NIH, and they also have a mandate that papers should be open access, or available at least in UK PMC. Right, so I'm stuck. So what do I do?
Well, I look up this structure in the protein structure database, and again, just as there's the BLAST algorithm for DNA, there's also an algorithm that will find you similar structures, because the data is kept in a format that can be computed on. I look down this list and I find a protein structure here; here's its identifier, 1EWQ. I don't expect you to read that. These are similar ones. I then use this term to search UK PMC, because we've text mined the content. Here's a list of articles that contain the term 1EWQ, and that's a reasonably clean search. So here are some articles that describe that similar structure, but these are all available in PubMed Central, so you can read the papers. One of these papers has some beautiful figures. I don't really understand what they mean, but I think what they're doing is this: the research has now moved on in time, from identification of the gene, to the protein structure, to then finding out how that protein structure actually does its job, and this is single-molecule analysis of the dynamics of that DNA-protein interaction. So things really have moved on in the last 20 years. As you can see, surrounding this figure (this beautiful figure, for which you cannot click and get the data, by the way; it's just a beautiful figure) there are two supplemental data files described, and these are linked to the article, but only in a narrative kind of way. So this is the modern article, if you like. So: challenges and future directions.
One thing we have to accept, and learn from the physicists on, perhaps, is data-driven science, with huge projects that produce core resources where the resource is really not the paper but the data and the database. This one is on cancer genomes, and I don't know how many authors there are there, but I don't really know what they've all done in contribution to that paper, and I know that these kinds of huge author lists really do distort citation counts and things like H-index measures. Really, this paper is a landmark paper, but it's not really describing any biology: the key is the data, and the reuse is post-publication, by other people. And really, the amount of data being produced now is prompting places like the EBI, and others, to make quite hard decisions about what we do about keeping complete data sets. Should we do it, and should we incur that cost, or is there an editorial process that we need to apply for long-term archiving? So, here's that slide from before, and I haven't really talked about unstructured data at all.
I think there are future possibilities for unstructured data. One is clear: although the research article links to the data, it does so in a narrative kind of way, not in a structured way, because the data objects themselves don't link back to the paper; they're just free-floating files in a repository. The question is: how could we structure those links better, and should we? And furthermore, are there ways we can reuse this unstructured data? Potentially we could have curated data sets, if we see similarities; or maybe they could be used as a seed point to start new thematic databases, where there's a new data type coming through. We can see that kind of requirement much more easily if we know what data there is in that unstructured data. I'm aware I've got one more slide; I was interrupted by all the people coming online. So, I think one thing we must do is analyse what's in this unstructured data that we have associated with articles right now. There are about 200,000 articles, about 10% of the articles in UK PMC overall, that have unstructured data associated with them; it's about one in three articles now submitted with data attached. And that's what it looks like: it's a big old mess, as the bottom line shows, dominated by PDFs, largely, I think, because publishers use PDFs. But what's actually in those files is unclear. I don't think they're all beautiful data sets, put it that way; I think there's a lot of rubbish in them as well, that just didn't fit in the article for some reason. And finally, I think OpenAIRE made a really important point: that usability is really, really important. And I think one of the key things we must do, whatever it is, is to make sure that we apply the solutions in the context of the science that people do. So here is that E. coli gene: here's the gene information, here it is on the genome, and down the side here you've got information on gene expression, on proteins, on protein structure, and on the literature. This is a thin slice across the top, and you can dig deeper than that if you wish. That's it. I'll just leave you with that URL, if you want to have a look... and it's just faded away for some reason. Okay, that's it.