Alright, so we'll start off. This is our last lecture for the day. It's always a little difficult to place this one, because sometimes you want to put it before the lab and sometimes after, but we're putting it after the lab. This is really about databases. You've already made use of databases, spectral databases, to do your own analyses with XCMS, with Bayesil, and with GC-AutoFit. But there are other types of databases that are really critical for metabolomics. So we'll introduce some of the concepts behind database models, particularly as they relate to both bioinformatics and chemoinformatics, and the different types of metabolomic databases for different purposes. There are a variety of public NMR databases, public MS spectral databases, pathway databases, and what I call metabolomic databases, which I think are distinct from purely chemical databases. Traditionally, bioinformatics and the field of chemical informatics, or chemoinformatics, were like two silos, two solitudes standing apart. They developed independently, even though both are about informatics and both are about biological, chemical, or biochemical entities. Chemoinformatics was established in the 1960s. It was a much earlier field, pushed by a variety of chemists, organic chemists, natural product chemists. It was initially designed for the specific needs of organic chemists, but evolved to become particularly useful for medicinal and pharmaceutical chemists. In the 60s, the concept of software, which was largely governed by IBM, was that everything was user-pay, everything was limited, with very little public access. There were lots of companies: MDL, Beilstein, Sigma-Aldrich, and the American Chemical Society, which also started the CAS system. All of these are still profit-generating organizations that control a lot of the informatics data for chemistry. Bioinformatics really got into its prime in the late 1980s and early 1990s.
Part of its timing was that it was coming online just as the web and the internet were being developed, and that was pushing more of an open access model. Bioinformatics was typically designed to support the needs of molecular biologists, primarily people doing genetics and genomics. Because of the timing, and because of the needs of the Human Genome Project, it was largely funded by the NCBI, the EBI, the NIH, and Genome Canada. The fact that these fields have different histories has been a challenge. What's really been happening over the last while is that the chemoinformatics model is becoming more like the bioinformatics one. We and others have been really pushing for open access, web-based resources and software for metabolomics, because metabolomics sits in between that chemoinformatics-bioinformatics space. We still have, obviously, lots of user-pay systems. A lot of the vendors for mass spec and NMR have their own tools and equipment, but there is a growing body of open access, web-based tools. I think we can learn a lot, and given the success of bioinformatics, that open access model, how quickly things have developed, and how much information has been shared is really, really important for the success of any omics field. So what do you do with databases, and why do we create them? Libraries used to be our primary resource, but now we're moving to a database concept to consolidate information and to link it. We obviously use databases for rapid retrieval. We also need databases for reference values, reference concentrations, reference spectra, sequences, images, and data. A lot of people use databases to help test or train predictive algorithms. You need training data, big data, and the success of machine learning is largely because we have large databases for testing and training. There's also a need to do similarity searching. If you've ever done bioinformatics, you've done sequence similarity.
If you've ever done structural biology, there's structural similarity; in metabolomics we're seeing spectral similarity; and in many other areas we're seeing image comparison and text mining. All of these are similarity searching techniques that are done over and over again, and they can only be done with electronic databases. The other thing we use databases for is to help with prediction, taking something and then learning from it or enhancing it. Without training data, you can't predict structure, function, property, or activity relationships, and that's another key role of databases. Most databases actually start out as hobbies, and I can speak to that, having started DrugBank and the Human Metabolome Database, among other databases. All of them started out as almost hobbies, done as short-term summer projects. But as either the need grows or you recognize challenges in managing the database, you often start turning to a curated relational database, where you try to improve its performance. The database obviously gets bigger to handle, bigger to manage. The largest databases are actually open deposition data banks. GenBank is an open access archival database. The Protein Data Bank is an open access archival database. MetaboLights is another open access deposition database. In many cases, these databases are not only relational but distributed over many nodes, spread across continents or countries, to make sure there's some level of redundancy. As you increase the size of a database, you obviously increase the depth and increase the coverage. There is a distinction sometimes between the open access archival databases and the curated ones. In the case of MetaboLights, which we'll talk about a little later, it will take anything that anyone wants to deposit in it. So if there's a bias in one area, then that's where the data is reflected.
In terms of curated databases, someone may simply say, well, I just want to focus on this area. Or they may say, I want broad, equal coverage in all areas. In that regard, a curated database may be less biased or substantially more biased. Well-curated databases sometimes have a lot more depth, because they have professional curators working daily, adding new information to the database, whereas for archival data it's simply a matter of how much work you want to put into depositing your archival dataset. As you move towards open access and open deposition, the costs increase many fold, which is why only the very largest organizations can actually maintain these open deposition resources. As you move up in size and move towards open deposition and open access, you have to work on standardization, you have to make sure things are much more automated, and you have to enhance the querying capabilities. Now, databases used to be the primary problem with metabolomics. That was the bottleneck. There were databases and search tools for genomics and proteomics, but historically, metabolomics didn't have those. This was the issue particularly about 10 years ago, when we and a few others were starting to look at the state of affairs for metabolomics, and almost all the metabolomic and metabolite data was actually in textbooks. It still is, actually. This is because there's been more than 100 years of clinical chemistry and more than 75 years of classical biochemistry that's been accumulated, and the medium for putting that together was textbooks and written journals. So even today, I would say metabolomics lags about 10, 15, maybe even 20 years behind the other fields. The other part that has been a challenge for us, and also for others, is having to appeal to different users. Metabolomics researchers have very distinct needs. They want information about spectra. They want information about detailed chemical structures.
Analytical chemists really want to know something about solubility, stability, melting points. Plant chemists want phytochemicals; they don't want mammalian compounds. Clinical chemists want normal ranges for blood and urine. Physicians want to know which ones are good biomarkers and which ones aren't. Drug researchers want to know about drug metabolites, but they don't care about plant metabolites. NMR people want NMR spectra only. Mass spec people want mass spectral data only. And bioinformaticians have whole ranges of querying challenges that most databases don't address. So this makes it particularly challenging for metabolomics. Partly to deal with those ranges of needs, there is not just one type of metabolomics database but many types. There are now NMR spectral databases. There are mass spectral databases. There are general compound databases. There are general pathway databases. And last, not least, are the ones that we call comprehensive metabolomic databases, which try to be all things to all people, which they really aren't. So I'll talk about the first set, which are the NMR spectral databases. This is just an awareness thing; we're not going to dive into them too much. But there are three, four, five that are of note. SDBS is the equivalent of NIST in Japan, and it maintains a really quite impressive spectral collection of many, many different compounds. There's NMRShiftDB, which was developed by Christoph Steinbeck, who is also the one who developed MetaboLights and has worked at the EBI for many years. And then MMCD and the BMRB, both maintained in Madison, Wisconsin, and developed by John Markley, who's actually a structural biologist but then shifted over to do metabolomics as well. So the SDBS is part of the AIST, which is like the NIST of Japan. It's been around for almost 45 years.
It includes NMR spectra, proton NMR, carbon NMR; it also has MS spectra and infrared spectra, and it covers lots of compounds. It has a variety of standard search tools, but like many resources, it wasn't designed to be just about metabolites; it was designed to be about chemistry and chemical standards. That means the compounds may not necessarily be natural; many are actually synthetic. Another NMR resource, as I mentioned, is the BioMagResBank (BMRB), and this actually grew out of work that was done for structural biology. It is still maintained, and it's closely aligned with the worldwide Protein Data Bank. But about 10 years ago, they decided to try and create a resource for metabolites, and within about a year and a half of starting it, they found that this was actually their major download source. So even though the database was designed for proteins, their biggest hit was actually creating a metabolite resource for NMR spectra. They collected NMR spectra for about 900 reference compounds. They've assigned them all. They've got proton, carbon, 1D, and two-dimensional NMR. They've got the capacity to search by synonyms, InChI keys, and SMILES strings. The initial focus was on plant metabolites, but they're also including the mammalian ones. Then there's NMRShiftDB, and its reincarnation NMRShiftDB 2, still maintained in Germany, and like SDBS, it's also primarily chemistry oriented. So it's organic synthetic compounds, mostly 1D proton and carbon, but on the order of 40,000 compounds. So it's quite extensive. As I mentioned, it was started by Christoph Steinbeck, and it's mostly organic compounds. They actually have online chemical shift prediction, and you can search by a variety of techniques, including chemical shifts, and there are the chemical shift assignments. The problem with at least NMRShiftDB, NMRShiftDB 2, and the SDBS is that most of the spectra are collected in organic solvents.
And in metabolomics, we do everything in water, and that actually changes the chemical shifts and also the spectral coupling patterns. So you're not going to get great matches between the observed spectra and the ones you're actually collecting. So that's a little vignette about NMR spectral databases. There are also mass spectral databases, and I've mentioned these already. I've talked about the NIST database. We've talked a bit about METLIN. I briefly mentioned the Golm database and MassBank. The Golm Metabolome Database is in Germany. It's always expanding, so these numbers might be a little out of date, but they have about 1,400 formally identified compounds. They have the mass spectral data and the retention indices. And these are not just random compounds; these are things that are found in living systems, so they're useful for metabolomics. There is a strong focus on plant metabolites. You can query by both name and MS spectra, and the data are in formats that are compatible with NIST and AMDIS. So these are just some screenshots of the Golm database, and it's always evolving. You can search the Golm database relatively easily with just the menus on the side, query things, and explore it fairly easily. Now, if you look at other mass spectral databases, this covers a few of the major ones, and the numbers are interesting. Right now, the largest mass spectral database in terms of total spectra is MassBank of North America. How many people have heard of MoNA? Okay, one. Anyway, it hasn't actually been published, and that's one reason why it's probably not that visible to people. But it covers hundreds of thousands of spectra, many of which are actually predicted spectra, not actual measured spectra. There's METLIN, which has about 68,000 spectra now that have been experimentally collected, and then they've also used CFM-ID to predict a lot of other spectra, sort of like MoNA. That covers about 13,000 compounds.
mzCloud has lots and lots of spectra, but for a very small number of compounds. That's because people are collecting spectra for the same compound on different instruments and under different conditions. NIST has hundreds of thousands of spectra, but a relatively small number of compounds. MassBank of Japan covers about 11,000 compounds. And there are commercial ones, like the Wiley database, with about 10,000 spectra for 4,500 compounds. GNPS has 9,000 spectra; ReSpect, about 9,000 spectra. When you add it all up in terms of actual spectra, not predicted but experimentally collected, and get rid of the redundant ones, the duplicates, there are actually only about 20,000 compounds that have actual MS spectra that are relevant to biology, if you want. So again, I think it's a sobering statistic, because everyone thinks, oh, I can see hundreds of thousands of spectra, we must be able to find a match somewhere. And in reality, no. It's not that many compounds, and this includes a lot of plant compounds that are exotic and strange, so you'll never find them in mice or humans. So the real size is quite a bit smaller than we all think. METLIN, I think, is worth a closer look, given that you guys have been using XCMS and XCMS Online; it's very closely connected to XCMS. It's maintained by Gary Siuzdak. And the list keeps on growing; I guess they've now added another 700,000 recently. So it's getting close to a million compounds, with about 60,000 spectra, and as I said, these are experimentally acquired ones, and then they started using CFM-ID to predict their MS spectra. A lot of the experimental spectra in METLIN are actually of peptides, dipeptides, and tripeptides, and with those it's easy to get to about 8,000-plus spectra. So the actual number of non-peptides is about 4,600. So again, it's a smallish number.
The issue with METLIN, as I pointed out before, as they continue to expand their metabolite set, is that these aren't all metabolites, and increasingly a lot of them have actually never left the lab. So they aren't really things that you're going to see in animals, plants, or microbes. And the other issue, again, is the sourcing. Is this only found in this rare microbe, or only in this rare alga, or only in this rare plant, or is it common to many, many animals? That's not clear. With METLIN, you can do some really nice searching. You can enter masses and choose charges. You can find metabolites. They provide structures for everything. They provide InChI keys and other vital statistics. And if there is a spectrum, you can click on that, and the data is viewable in various formats. MassBank, which was originally developed in Japan, has moved over to both Europe and the U.S., in part because of limited funding for MassBank. It's really nicely designed. It has a whole range of ESI, QTOF, triple quad, and ion trap mass spec data. When it sort of got defunded, it had about 41,000 spectra covering about 15,000 compounds, and a lot of them were contributed by many people in the area of metabolomics. So, again, these are higher quality and more relevant to metabolism and metabolomics. Last I looked, the database is still maintained, but in large part most of it has migrated over to MoNA, the MassBank of North America, or to the European MassBank. But many of the structures are still there. Many of the concepts, which are really strong, are still there. It allows you to query things. You can do peak searches, typing in lists and extracting compound matches. So it's, again, user-friendly and designed for higher-end mass spectral querying to see if you can match things. So those are the mass spectral databases, or at least the main ones, and those are some of the NMR spectral databases.
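The peak-search-and-match idea these spectral libraries support can be sketched in a few lines. This is a generic bin-and-cosine score, a common illustrative approach, not the actual scoring algorithm used by NIST, METLIN, or MassBank; the spectra, bin width, and compound names below are made up for the example.

```python
import math
from collections import defaultdict

def bin_spectrum(peaks, bin_width=0.5):
    """Collapse (m/z, intensity) peaks into fixed-width m/z bins."""
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[round(mz / bin_width)] += intensity
    return binned

def cosine_score(spec_a, spec_b, bin_width=0.5):
    """Cosine similarity between two spectra: 1.0 = identical, 0.0 = no shared peaks."""
    a = bin_spectrum(spec_a, bin_width)
    b = bin_spectrum(spec_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Illustrative query: rank library entries by similarity to an observed spectrum.
observed = [(91.05, 40.0), (144.10, 100.0), (172.09, 25.0)]
library = {
    "compound A": [(91.06, 38.0), (144.10, 95.0), (172.10, 24.0)],
    "compound B": [(77.04, 60.0), (105.03, 100.0)],
}
best = max(library, key=lambda name: cosine_score(observed, library[name]))
print(best)  # → compound A
```

Real library search engines add refinements on top of this (intensity weighting, precursor filtering, noise thresholds), but the core operation is the same repeated vector comparison the lecture describes.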
They're all important for annotation and identification. And, again, I'm primarily emphasizing the free, open access ones; there are excellent ones that are commercial. I've mentioned other compound databases. We talked about PubChem. We've briefly talked about ChemSpider. We can also talk about ChEBI. I mean, this is the number that we had, I think, last year; it's probably up around 50,000 now. ChEBI stands for Chemical Entities of Biological Interest. It's a database that was started by Christoph Steinbeck, and it draws from a variety of sources: compounds in KEGG, LipidMaps, DrugBank, the HMDB, and various patents. The focus of ChEBI historically has been on names, getting full names, proper names, all the synonyms, and also developing a bit of an ontology for chemical structures. And that ontology has been manually curated for more than 10 years. I've mentioned PubChem as well. The number of compounds there is well over 80 million, probably approaching 90 million different compounds. For something to be in PubChem, it has to have less than 1,000 atoms. So, in some respects, you can have fairly large peptides or proteins listed in PubChem. It is an archival resource, so you can deposit chemicals into PubChem if you wish. And that's part of the challenge, because it's possible for someone just to create a whole bunch of exotic chemicals and say, I'm depositing them. You don't actually have to synthesize them. You don't actually have to have them in your possession. So that's one of the issues with PubChem. They have things broken down into substances and chemical identifiers, but there's a huge body of information that they're collecting and still haven't fully released to the public on these chemicals. So it's quite astonishing. It's not just about names and synonyms, but lots of information on properties and bio-activities, and then links all through the literature to what's in PubMed and to other NCBI databases.
And the search features in PubChem are also really quite excellent. ChemSpider is the British answer to PubChem. It was developed by Antony Williams, who's now at the US EPA. When he left, it sort of stopped growing, but it still has about 50 million compounds, from an actually larger number of data sources than PubChem. And in fact, if you compare ChemSpider and PubChem, you won't find that one is a subset of the other; there are distinctions. Like PubChem, you can search by InChI, structure, and registry number, but, at least the last time I checked, you can't search by formula or mass. It's really great in terms of names and synonyms. It also has links to pharmacology and to spectra, which you don't generally find in things like PubChem. There are issues, though: since Antony Williams left, ChemSpider has become much more closed and limited access. So some of the things that you used to be able to do with it, you can't. Ligand Expo is essentially the small molecules that you find in the Protein Data Bank. Why is this of interest? Well, the PDB is the data bank for all the protein and nucleic acid structures. So when a molecule is bound to a protein, you have information about its target. You can link metabolites to their targets. You can link drugs to their targets. And what's more, rather than just having a two-dimensional picture, you actually have a three-dimensional picture, an authentic three-dimensional picture, of metabolites. That's actually quite valuable and relatively rare, and it used to be very, very difficult to try and get chemical information out of the Protein Data Bank. That's become much, much easier with Ligand Expo. There are other kinds of compound databases. KNApSAcK is something that I've mentioned again, maintained in Japan, and what's really, really nice about this one is its connection to plant metabolites and species information. 3DMET is about the 3D structures of natural metabolites, about 8,500 of them.
And this is diving into a few of them. LipidMaps, which I've mentioned before, was developed by Shankar Subramaniam and has about 30,000 lipids covering many major lipid classes, about 25 of them, including some that are exclusively in plants, or in microbes, or in insects. So it's quite diverse; it's not restricted to animals or humans. But there's another database, which many of you don't know about, called MyCompoundID. And this is not just 8,000 or 50,000 or 30,000; it's 11 million compounds. And these are metabolites, not just exotic synthetic compounds. Now, it doesn't have the structures for these compounds. But what it did, and I think what was really innovative about it, is that it went through about 75 or 76 different metabolic transformations: sulfation, glucuronidation, hydroxylation, methylation, demethylation, and so on. It took all of the endogenous metabolites in the HMDB, about 8,000 at the time, and then transformed them systematically. For the first pass of transformations, about 375,000 compounds were generated. And then those were transformed computationally again to generate about 10 million more. So in all, that's 8,000 plus 375,000 plus 10.5 million, and you get 11 million potentially feasible metabolites, and especially their masses. So this is the MyCompoundID website. It is searchable, as are the transformations that have been performed. So if you have a mass, now, it's not going to give you the MS/MS spectrum, but if you have a mass or a chemical formula, then you can search through this. It won't necessarily tell you an actual structure, but it will give you a match and then a potential chemical formula for that particular compound. Now, if there is a structure coming from, say, the HMDB, that is shown.
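The two-pass transformation idea just described can be sketched quite compactly: apply biotransformation mass shifts to a set of seed masses, collect the products, and then match an observed mass against the predicted set. The monoisotopic shift values below are standard deltas, but the five-reaction set, the single glucose seed, and the tolerance are purely illustrative, not MyCompoundID's actual 76-reaction library.

```python
# Monoisotopic mass shifts for a few common biotransformations (Da).
TRANSFORMS = {
    "hydroxylation":   +15.99491,   # +O
    "methylation":     +14.01565,   # +CH2
    "demethylation":   -14.01565,   # -CH2
    "sulfation":       +79.95682,   # +SO3
    "glucuronidation": +176.03209,  # +C6H8O6
}

def expand(seed_masses, passes=2):
    """Enumerate predicted product masses over N rounds of transformations."""
    seen = {round(m, 5) for m in seed_masses}
    frontier = set(seen)
    products = set()
    for _ in range(passes):
        # Apply every shift to everything produced in the previous round,
        # dropping masses we've already generated (including the seeds).
        frontier = {round(m + d, 5)
                    for m in frontier for d in TRANSFORMS.values()} - seen
        seen |= frontier
        products |= frontier
    return products

def match(observed_mass, candidates, tol_da=0.005):
    """Return predicted masses within a small Dalton window of the query."""
    return sorted(m for m in candidates if abs(m - observed_mass) <= tol_da)

predicted = expand([180.06339])     # glucose as the only seed compound
print(match(196.05830, predicted))  # → [196.0583], the hydroxylated product
```

With the full HMDB seed set and all 76 reactions, the same double expansion is what blows 8,000 compounds up into millions of candidate masses, which is why the database can return a formula-level match without ever having a structure or spectrum in hand.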
But this concept of metabolites of metabolites is something that I think is increasingly happening now, where people are computationally transforming metabolites, doing it in silico, because obviously you can't synthesize 11 million compounds, nor can you collect the spectra for 11 million compounds. So this idea of running computers to generate feasible metabolites is becoming what we call in silico metabolomics. So those are examples of chemical databases. Another really important category of metabolomic databases is pathway databases. To some extent, these probably started before metabolomics. The best known is KEGG, the Kyoto Encyclopedia of Genes and Genomes. BioCyc and MetaCyc are another set of resources, developed by Peter Karp just outside of Stanford. Then there's Reactome, which I think was initiated by Lincoln Stein here and then picked up by the EBI. And then another resource that most of you haven't heard of, called the Small Molecule Pathway Database, or SMPDB. There are other pathway resources, but the whole point of pathway databases is really to link metabolites with other parts of biology: linking the genes, the proteins, signaling processes, diseases. Essentially, that's systems biology. If you want to be able to integrate all these omics together, often the best way of doing it is through pathway databases. Some of the better pathway databases have tools to allow visualization, or gene mapping, or metabolite mapping, or protein mapping. Resources like KEGG and Reactome cover a number of major species, many model species, as well as thousands of microbes. So, as I said, the grandfather of all pathway databases is KEGG. I think everyone's probably taken a look. Who hasn't looked at KEGG? A couple. So maybe by this afternoon, you'll have a chance to look at the KEGG website, which is there. It is sort of the number one resource for pathways.
A lot of the focus in KEGG is on catabolism and anabolism, building up and breaking down. In that regard, it's a very narrow view of what metabolism is. It doesn't have most of the signaling pathways, which is really what most metabolites do, and this has, I think, been very limiting for the whole field of metabolomics. So as good as KEGG is, it was designed not as a metabolomics database, but as a metabolism, catabolism, and anabolism resource. The challenge, I think, for the metabolomics community is to try and extend this pathway information to include, you know, how is it that phenylalanine is toxic for infants with PKU? That's not shown in KEGG. How is the Warburg effect, which is the major metabolic process for cancer, shown? It's not in KEGG. Many of the signaling processes for cyclic AMP are not in KEGG. And you can run down literally thousands of metabolites in terms of their processes or pathways that just are not found in KEGG. And that's because it's beyond the original scope of KEGG. Nevertheless, KEGG has a lot of compounds. There are almost 20,000 compounds, almost 10,000 drugs or drug variants, a huge resource on glycans, and altogether about 460 different pathways or pathway types. It covers many, many different species. And, you know, the total number of compounds in KEGG that would be, sort of, human is maybe about 2,000, and that's because it's just focused on catabolism and anabolism. So I'm going to talk about this Small Molecule Pathway Database. This is a resource that we've been developing at the University of Alberta for the last five or six years. The intent for SMPDB was to try and create a resource that was more targeted to metabolomics, to cover not just catabolism and anabolism, but to also look at drug pathways (you know, how does aspirin work?) and disease pathways, like the Warburg effect or PKU.
And then to look at the regular metabolic pathways and a bunch of other signaling pathways. The other thing we realized is that KEGG was developed using technology which was quite advanced in the 1990s but is rather primitive today. With KEGG, you can't tell that the TCA cycle actually happens in the mitochondria. And most of the people that I teach have no idea that the TCA cycle happens in the mitochondria. That's because it's not shown in KEGG. Without this context of understanding where certain key reactions happen, whether they're in certain organelles, whether they're outside the cell, whether they're between membranes, people essentially lose context. They lose biochemical context. Likewise, many people don't realize that most proteins have a quaternary structure, that many form clusters or complexes, and that's not typically depicted. I think a challenge also with things like KEGG is that you can't map gene, protein, or metabolite expression changes onto KEGG maps. And so the idea here was to allow that mapping, and then to convert lists of genes, lists of proteins, or lists of compounds into pathways or disease diagnoses. So that was another novel concept of SMPDB. This is an example of what an SMPDB pathway looks like. It's a little more colorful than KEGG. You can see that there are mitochondria and peroxisomes. You can identify not only how the metabolism affects other parts of the cell, but also the brain; and I think there's a liver there, and I can't make out some of the others. You can identify what happens when a gene is knocked out; there's an X through it, showing how it affects a certain metabolite. All of the compounds are visually depicted, so you can see the structures. The proteins, if they're dimers or trimers, are depicted as such. The organelles are depicted as such. The membrane is shown. If a reaction is happening in the mitochondria, it is shown to be in the mitochondria, and so on.
Everything is hyperlinked. So by clicking on things, you can get links to the Human Metabolome Database, or to UniProt if it's a protein. All of the pathways have descriptions of what they are. So this one is about alanine metabolism, and it explains what's going on. Now, the focus in SMPDB is on human metabolism. So it's not for E. coli; it's not for frogs or plants. But in fact, there's quite a large number of pathways being generated in SMPDB for these other organisms, and those will be released sometime in the near future. You can map metabolites. So if you've got a list of your metabolites, either simply listed or with concentrations, you can map those onto the image to see how things are colored. They're highlighted in red if they are depicted. So you can see if there are a lot of metabolites in a particular pathway that are being affected or that have been measured through a metabolomic experiment. If you're able to do quantitative metabolomics, you can put in actual concentrations and see how they're linked and how they have perhaps changed, potentially looking at how flux alters, or trying to understand how the effect of knockouts or knock-ins affects overall expression. Now, all of these pathways in SMPDB are rendered in the same way as a Google map. You can view and scroll and zoom, and it uses those same types of graphics. The mechanism we use to generate the pathways is another tool called PathWhiz. It's a subsection of SMPDB, and it's a web server that allows you to draw pathways. In fact, we've had thousands of pathways drawn by different people. So you too can make some really colorful, interesting pathways for your favorite pathway or organism. And they aren't just pictures; they're actually machine readable. They're compatible with BioPAX. They're compatible with Systems Biology Markup Language.
They're compatible with Systems Biology Graphical Notation. They can be read or saved as scalable vector graphics (SVG) and PNG files. And like SMPDB, this allows you to do Google Maps-style viewing and editing. You don't have to download anything; you can just do it on the web. You can also convert the pathways, so if you don't like the color scheme, you can turn them into what look like KEGG-style diagrams. You can also change the color scheme, so you can have a white background or a blue background or whatever you wish. There are several videos on how to construct pathways in PathWhiz. There's a variety of editing functions to click, drag, and drop things. For labeling, it does a lot of automatic labeling and automatic text mining. It's tightly linked to UniProt and the HMDB and other database resources, so you don't really have to do a lot of typing. You don't have to draw structures. You don't have to create organelles. You just have a menu that you drag and drop from to create the pathways. Not only can you draw organelles, you can also have organs. And there's increasingly more emphasis on getting transporters and transport mechanisms depicted, which is not generally shown in most pathway databases. So after you've been drawing for a little while, this is what you can generate through the PathWhiz system. You can look at specific components of, in this case, a fairly complex metabolic reaction happening in, I guess, the ER and the mitochondria and some other organelles, as well as what looks like the kidney. So again, this is a way of trying to add more biological context to PathWhiz, but also to try and encourage more people to contribute to it. With KEGG or MetaCyc, the only way you could do that was to actually be physically in those labs to draw pathways according to their specific principles. But this is a community-based project, so anyone can contribute to PathWhiz. Yes?
In principle, I mean, you can draw arrows in any way you want. And you can draw compounds so that they show up twice if you wish. So in principle, yeah, you should be able to show feedback. It's also still under development, in the sense that if people have specific requests, if it doesn't seem to be compatible with what someone wants, we're pretty good about adding those new functions. So there's been a lot of recent work in plant metabolism; they needed a whole bunch of diagrams for chloroplasts and thylakoid membranes and details like that, and so those have been generated. So the last part of this presentation is to look at what I call comprehensive metabolomic databases. We've talked about pathway databases, we've talked about spectral databases, we've talked about chemical compound databases, and a comprehensive metabolomic database essentially tries to be all of those at once. And arguably KEGG is a comprehensive metabolomic database. It covers pathways; it has metabolite information. It may not have spectra. EcoCyc and HumanCyc, the Cyc databases, also have many of those components. I've mentioned MetaboLights, which is the archival database. So to be called a comprehensive metabolomic database, it has to have at least a thousand compounds. Many of them are organism-specific; some aren't, like KEGG or MetaboLights. They need to be continuously updated, and they have to have certain combinations of data. They have to have at least chemical information and pathway data, or they have to have chemical data and spectral data and biological concentration or function data. Or they need to have chemical, pathway, and spectral data, or they need to cover everything. So MetaboLights, which I've mentioned a few times: the intent of this resource is to become the GenBank for metabolomics. So if you've done sequencing, you have to deposit your sequences into GenBank.
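The qualifying combinations just listed can be written out as a small predicate, which makes the boolean logic easier to see. This is only a paraphrase of the criteria as stated in the lecture, not an official definition from any standards body:

```python
# Sketch of the "comprehensive metabolomic database" criteria described above:
# at least 1,000 compounds, plus one of the stated data-type combinations.
# A paraphrase of the lecture's informal rules, not an official definition.
def is_comprehensive(n_compounds, has_chemical, has_pathway,
                     has_spectral, has_biological):
    if n_compounds < 1000:
        return False
    return (
        (has_chemical and has_pathway) or
        (has_chemical and has_spectral and has_biological) or
        (has_chemical and has_pathway and has_spectral)
    )

# A spectra-only archive with no pathway or biological data does not qualify:
print(is_comprehensive(5000, True, False, True, False))  # False
# Chemical plus pathway data is enough, if the compound count is there:
print(is_comprehensive(5000, True, True, False, False))  # True
```

Note that the "chemical + pathway + spectral" clause is logically covered by the first one; it is kept here only to mirror the list as spoken.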
If you're doing metabolomics, you should be depositing your data into MetaboLights. It's operated and maintained by the EBI; Christoph Steinbeck is now splitting his time between the EBI and the University of Jena in Germany. You can upload your data: you can upload experiments, spectra, compound lists, and so on. You can search through it, and many of the compounds in MetaboLights are also linked to ChEBI. And in terms of reporting your metabolites at identification levels one, two, three, and four, it complies with the Metabolomics Standards Initiative. So I'm here to actively encourage you to deposit your metabolomic data in MetaboLights. It's not required for everyone, but more and more, if you're publishing in metabolomics journals, there's an expectation that you're going to make your data public. So either you put it in MetaboLights, or you create your own data resource so that it is public. And given that MetaboLights will be around for a long, long time, it's probably safer to put it there than to create your own little website. Now at the University of Alberta, where I'm at, we've been making metabolomic databases since 2005, so long before MetaboLights appeared. And it grew from activities in the Human Metabolome Project, which started a long time ago, in 2005. And from that, we've created the Human Metabolome Database, which gets about seven or eight million hits a year; DrugBank, which gets about 15 million hits a year; and the Yeast Metabolome Database. And then, with Augustin Scalbert, who just came here, we've created the Phenol-Explorer database, which covers polyphenols. We have an E. coli metabolome database, a food database, a bovine metabolome database, a toxic exposome database, the Small Molecule Pathway Database, and the list grows. So these grew from a project that was launched in 2005 called the Human Metabolome Project. It was sort of the Canadian equivalent of the Human Genome Project, but no one heard of it.
But like the Human Genome Project, it was intended to identify and quantify all the metabolites in humans, in urine and cerebrospinal fluid and blood, and also in tissues, and to do both the high-throughput experiments and the literature surveys to get known concentrations and their associations with given diseases. And the idea was to make all that data freely available and electronically accessible. And so that led to the Human Metabolome Database. That led to DrugBank, which led to the food database, which led to the toxic exposome database. And the idea with this project was also to develop certain kinds of metabolomics technologies and software to improve metabolomics coverage and throughput. Interestingly, when we started the project, or before we submitted the proposal, we went through a complete survey of all the human metabolites that we could find in KEGG and HumanCyc. And we came up with 690. And based on our naive understanding at the time, we thought that maybe there'd be a thousand if we could do some thorough studies. So in 2006, we released the Human Metabolome Database. At the time it was, you know, three times bigger than anything else: 2,180 metabolites. By the time the project wrapped up its official funding in 2009, we had 6,400 metabolites. By 2013, we were up to 37,000. This year, right now, we're around 42,000. By this summer, we expect to have well over half a million. So it's grown a lot, but it's also changed our perception of what the metabolome is. And I think it's important to understand that it's not just about endogenous compounds; it's also about exogenous ones. And I've shown this picture already in terms of the known concentrations, and these other databases like the toxic exposome database, DrugBank, FooDB, and HMDB. This is as it stands today, but it will be changing quite radically in the next couple of months. So each of these resources has its own web page.
You're certainly welcome to look at and explore them. Some of you may already have, and some of them you may never have known about before, but they all have their own unique strengths and roles for certain applications. So right now, the Human Metabolome Database has these 42,000 compounds, along with information on about 170 gut-specific microbes. We have lots of information on normal and abnormal concentrations, disease links, and reference spectra. You can search by sequence, meaning protein sequence. You can search by spectra. You can search by structure. You can search by pathways. You can browse through different biofluids like saliva or urine or blood. You can do very detailed text searches. And all of the data is freely downloadable. So this is what it looks like inside, whether it's pictures of chemicals or links to pathways or links to interactive spectra. You have what's called a data browser. The browsers have data fields, and the data fields are somewhat encyclopedic; it's not just simply lists. There's improved spectral searching, which I think NAMA helped design, and this allows not only MS searches but also MS/MS searches. You can also do spectral searches for NMR. You can search for similar structures by drawing them or by pasting SMILES strings or InChI strings into the palette, and it'll instantly generate a structure and then find the most similar structures. There's also a large collection of reference concentrations, with normal and abnormal metabolite concentrations covering about 15 different biofluids and 5,000-plus metabolites. This was designed originally for clinicians and physicians, and it continues to grow. And in fact, this seems to be one of the more valuable contributions to HMDB and to metabolomics in general. So even though the Human Metabolome Project was focused on human metabolism, the resource that actually seemed to have the biggest impact was this thing called DrugBank.
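As an aside on the spectral searching just described: matching a query MS or MS/MS spectrum against a reference library typically comes down to a cosine (dot-product) score over binned peaks. A minimal sketch with made-up peak lists and simple integer-m/z binning; real search engines weight intensities and handle m/z tolerance far more carefully:

```python
# Sketch of MS/MS spectral matching via cosine similarity.
# Spectra are dicts mapping an integer m/z bin to a peak intensity;
# the peak lists below are invented purely for illustration.
import math

def cosine_score(spec_a, spec_b):
    """Cosine similarity between two binned spectra (1.0 = identical shape)."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    return dot / (norm_a * norm_b)

query     = {89: 100.0, 45: 30.0, 71: 10.0}
reference = {89: 95.0, 45: 25.0, 60: 5.0}
print(round(cosine_score(query, reference), 3))  # 0.994
```

Unmatched peaks still contribute to the norms, so extra or missing peaks pull the score below 1.0; a library search would compute this score against every reference spectrum and rank the hits.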
And that was because we realized that people, deliberately or not so deliberately, take drugs, and many of the compounds that show up as unknowns are either drugs or drug metabolites. And while any given individual might have one or two or three drugs in their system, if you look at a population of 1,000 or 2,000 people, you're probably looking at several hundred different drugs. And if you look at a population of 100,000 people, basically you're going to see every drug that's approved, or even not approved. So we decided to start building out this resource to cover both the drugs and the drug targets, that is, the proteins that drugs act on; the drug mechanisms; the absorption, distribution, metabolism, excretion, and toxicity; the mechanism of action; the pharmacokinetic data; the drug-metabolizing enzymes; the drug metabolites; and the drug transporters; and then to fill that out with the known spectra along with the drug targets. And it was the drug target data that actually got most people interested in it, because before this, most people actually didn't know what the drug targets were, or you had to spend a long time reading and reading and reading. And it was also, I think, through this that many people started realizing that it's not one drug equals one target; it's one drug equals about 10 targets. DrugBank was used to develop all of the drug pages in Wikipedia, and it is linked back to essentially all the drug pages in Wikipedia. It's designed to do sequence and spectral and structure searching, as well as text searching, and it also is freely downloadable. So there are lots of images, whether for proteins or small molecules, and lots of different browsing, searching, and querying options, from chemical structures to different drug categories to sequence searching and data extraction.
Over the coming couple of months, DrugBank is going to be enhanced quite a bit. It'll be getting pharmacometabolomic, pharmacogenomic, pharmacoproteomic, and pharmacotranscriptomic data. It'll also have predicted spectra for all the compounds that we can't get experimental spectra for, and a much, much larger collection of experimental drugs. Another area that we've mentioned, and that Dr. Scalbert is known for, is the study of the exposome: exposures. And the conference that we're going to have after this meeting on Thursday will feature Dr. Scalbert and several other people talking about what's called the exposome and its impact on health. And I think there are some striking statistics and numbers that are relevant, but the bottom line is that only about 10% of cancer is of genetic origin; 90% is due to some exposure. And it could be a chemical exposure; it could be a bacterial or viral exposure. And the intent of this database, and of others being developed like Exposome-Explorer, is to track which compounds we are exposed to and which compounds are actually found in the human body. So you can include drugs and pesticides and herbicides and endocrine disruptors and PCBs and solvents. And this database was designed to fill in information about mechanisms, binding constant data, and target information for these toxic compounds, and, just like with DrugBank, to have the actual targets identified, and then to link how these compounds change levels of gene expression or protein expression. And like all of our other resources, to provide reference spectra so people can consistently and easily identify these compounds. So the exposome is something that's continuing to grow in interest and concern, given that, in fact, most human disease is ultimately caused by chemical or biological exposures of some kind.
So it'll have at least lethal dose (LD50) values, which were gathered for animals; obviously we don't generally have them for humans. The best we can typically do is to get LD50 data from animal studies, and as much as we can get is in T3DB. Another resource is called FooDB, or the food constituent database. This has been worked on for a number of years, again with Dr. Scalbert. Right now it's at about 30,000 compounds. And the interest here is to find out which compounds are in which foods. If you look at your cereal box, you might think there are about 12 compounds in your cereal, because they'll list vitamins and maybe a few minerals. In reality, there are probably about 30,000 compounds in your cereal. Many of them are natural; some of them are added. A lot of these things, whether it's the food you ate at lunch or the coffee you're having now or the drinks you're drinking, have a variety of compounds responsible for flavor and aroma and color, and many of them have significant effects on human health. Now, I have here that the average plant food contains more than 3,000 compounds, and it's probably closer to 20,000. The food database is still under development, being refined to get to the same standard of quality as our other resources. There are about 900 different food items listed in FooDB. Most of these are primary foods, in the sense that they're the fruits and vegetables. This is not the database to look up the composition of lasagna, but it is the database to look up the composition of tomatoes or wheat flour or hamburger. So those are the primary constituents of a recipe; these are the foods that we eat. Another one that's closely related to food is yeast, because we use yeast not only to make bread but to make wine and beer. And so this is one that we've been working on, and it covers, I think it's actually getting up to, about 3,000 yeast metabolites, even larger now, I think.
Under many different substrates, it also includes a lot of different wine and beer compounds. It has a lot of the protein, enzyme, and metabolite associations, and lots of reference spectra. Now, I think the number of reactions and pathways has grown much, much larger than that, thanks to SMPDB and PathWhiz. So the yeast metabolome database has grown, and another one that's grown considerably is the E. coli metabolome database, which is intended to cover another model organism, E. coli. And we're working on an Arabidopsis metabolome database. So between human, E. coli, yeast, and Arabidopsis, we're covering all major kingdoms of life. And in all these cases, we're trying to get more complete coverage of the metabolites, the reactions, the pathways, and the reference spectra. Hopefully these will be linked into MetaboAnalyst in the coming months so that people can do more extensive pathway analysis. In terms of trying to understand what these databases have, this is sort of a comparison of what you'll typically see. The Human Metabolome Database, or you could use YMDB or ECMDB or whatever, tries to cover information on nomenclature, references, mass spectra, NMR spectra, pathways, structures, detailed descriptions, chemical properties, and physical data. KEGG, PubChem, ChEBI, STBS, and Reactome specialize in certain areas, but not all of these. And so in terms of a comprehensive model for metabolomics, I think the Human Metabolome Database model is a pretty good one. It still has a ways to go, and, as I said, there's still lots to add with regard to the number of compounds and pathways that are ultimately depicted. And as these databases grow in size, it also becomes a little more difficult to maintain the same level of quality all the way through.
But I think it's through these comprehensive databases, whether it's HMDB or KEGG or ChEBI or Reactome or the Cyc databases, that we learn what these metabolites are. It's how we relate what we're measuring through metabolomics to biology, to other systems, to the genome, the proteome, the transcriptome. And so we were encouraging you, partly through these compound annotation exercises, to then use the database information to understand why the spectra you looked at for EoE, or the spectra you looked at for endometrial cancer, were displaying certain features. What does it mean for cancer? What does it mean for eosinophilic diseases? Or, in the case of the untargeted work we just did, what does it mean for kidney disorders? And it's through these databases, and also still through the literature, like with PubMed, that you can interpret these long lists of data. Now, it's not just through databases alone; it's also through statistics. And we'll learn more tomorrow about how to identify the most significantly altered metabolites in a given disorder or condition, and then relate that to the biology or to the pathways. So I think we're finishing a little early, and probably people are happy to know that. But if you are interested in doing some work in this optional time afterwards, you're certainly free to finish some of the spectral identification, and then to start thinking about, okay, what do these metabolites really mean? How does this relate to biology? Because metabolomics isn't just an exercise in compound identification. In fact, that should be the trivial thing, the part where you just press a button. It's what comes afterwards where the more exciting parts are.
And again, I also encourage you to start looking through some of these databases and links; all of them are given in your books or on these slides. And if you haven't seen these tools before, it's probably a good idea to look through them, or to spend the next 15 or 20 minutes browsing. As I say, the best way of learning is by doing. Okay, so I think that wraps things up for now. Any questions about what I've just talked about? Yes. Yes. Yeah, I think this is a real concern, and not just for you and me; NCBI and NIH are starting to worry about this quite a bit. CIHR is worrying about it. So many of the major funding agencies are worrying about it. There have been a number of databases that have just vanished, which is almost the equivalent of a library burning down. And it's because individual labs can't maintain them due to funding restrictions. And I don't know if we actually have a good solution to that. There is one data resource on the planet that is funded for eternity, and that's GenBank; it's been funded through an act of Congress. So as long as the US exists, it will always be funded, and the fact that it's an act of Congress makes it uniquely stable. But resources like ChEBI or the Protein Data Bank depend on cycles of grants, and if they're unsuccessful in a grant, they could disappear; the same goes for the BioMagResBank. One suggestion has been that much of the resources we currently put into libraries, which were also intended to be around forever, could and should go into databases. And the databases, whether they're local to a university or to a country, should be able to access funds currently used for librarians or libraries, to keep them always available, always online, and at least curated or updated for current operating systems. Maybe new data isn't added, but just as every library has a book bindery to make sure the books are preserved,
there should be someone trying to maintain a database after someone retires or loses their funding, so it remains visible to all. I've been trying to pitch this for a number of years; it doesn't seem to get any traction with the librarian community. I think so, but in fact it could be an opportunity for much more funding, because most science is actually done through databases now. It's just another case of two solitudes, where librarians see themselves as the preservers of knowledge when in fact most of the knowledge is sitting out in these data resources, which are still very transient and blowing away with the disappearing funds. Yes. Yeah, often it's variable, and it differs between archival databases, where data is simply deposited and you have to trust that people are entering it honestly and completely, and curated databases, where there are teams of curators and you have to hope that they are sufficiently well trained and diligent to ensure that the data is of high quality. In many curated databases, several people enter data, so one will enter it and then a second person validates or checks it. There are usually protocols to run through the data to ensure that it's consistent. Some of the more elaborate databases will have tools to perform automated validation, sanity checking, and cross-checking, but in the end it's ultimately up to humans, expert curators, to have read or checked almost every entry to ensure that it's of high quality. No, I don't think so, although there is a tool that mines for HMDB; it's called DataWrangler, which does, I think, mine Wikipedia articles or link to them.
It also uses another tool called PolySearch to extract some of that information, I think. I mean, CAS numbers are funny things; technically, once you have more than, I think, 5,000 CAS numbers, the American Chemical Society actually starts chasing after you to collect money or to sue you. Any other questions? Okay.