So, this is module 3, not module 4, but it's the fourth lecture today, so that's what we're going to do. We've got about an hour, and we're going to talk about databases. This is kind of a lead-in from what you were just doing, where you're converting spectra to lists. But what do you do with those lists? How do you actually identify these compounds or peaks, and what are they associated with? So we're going to talk about the different types of databases that are available and the different types of database models. We're going to look at some of the publicly available NMR and MS databases. We're also going to talk about pathway databases, as well as comprehensive metabolomic databases. These are the resources that you could use to help analyze and interpret the data that you may have just collected in the last hour or two. They're also databases that are linked into MetaboAnalyst, where they're used to help in the interpretation. Whether you want to do it first or second, or as an integrated process, is sort of up to you and up to the particular problem you're working with.
What we're doing in metabolomics is kind of a mix of both bioinformatics and something called cheminformatics. Bioinformatics is about biological data. Cheminformatics is about chemical data, like metabolites, but because metabolites are also connected to other parts of biology, the proteome, the genome, the transcriptome, it connects pretty well to bioinformatics as well. Historically, the two have been like silos, very different. Cheminformatics is a much older field that was developed in the 1960s. It was originally intended for organic chemists, and it was structured as a user-pay, limited public access model. Chemical Abstracts Service at the American Chemical Society makes tens of millions of dollars a year selling cheminformatics data. Many other large companies also make tens of millions of dollars selling chemical data.
Now, bioinformatics is a newer field. It was designed for the needs of biologists, molecular biologists, and geneticists. Instead of being a user-pay model, it was open access, and it has now become largely a web-based model. Most of the software and tools developed by bioinformaticians have been funded by government agencies, not by private agencies. Karen actually gave a really nice little presentation a few days ago about how cheminformatics has sort of blended into or become part of bioinformatics, and about the challenges that many of the private companies have put up to prevent chemical data from becoming public. So it's still a challenge, and PubChem has had to fight many battles. Other data providers have had to fight battles. We've even had to deal with Bayer complaining that we used the word aspirin in our database. I think most are starting to realize the benefits of working with public open access models. But it's still one reason why cheminformatics, and even metabolomics, is lagging behind.
We use databases for lots of different things. Many of us create our own little databases to help consolidate data. We use them to help retrieve information. Many databases are intended, especially in the world of clinical work, to hold reference values or reference data: reference sequences in the world of GenBank, reference images for analysis of medical or histological conditions. A lot of databases are used in machine learning to help with training and testing. Likewise, similarity searching. We do this all the time when we're looking on Google. Many of you will notice that Google has automatic spell checks and will ask, did you mean this? Or you can do image searches and it'll return things that are close to, or exactly, the image you wanted. So we can search by image, spectra, structure, sequence. There's also a lot of utility in databases for prediction.
And a lot of what we do in bioinformatics is about prediction. We like to be able to take some of the information we've learned, some of the trends we've seen, and say, well, next time I see this, this is what I predict will happen. Of course, prediction has been part of bioinformatics for many years: structure prediction, function prediction, property prediction, phylogenetic prediction, activity prediction. It's also been part of cheminformatics and drug discovery and drug design for many years. So databases lie at the heart of many things we do in bioinformatics, in the omics sciences, and in the biological sciences.
There's a certain progression for databases. Most databases, and I've been involved in building lots of databases, actually started out for me as hobby databases. I was just trying to catalog things to help me teach classes, sometimes just to remember things. I think that's the way a lot of other databases actually started, including some of the first sequence databases. Over time, the database might grow to be more than just a hobby. It starts to become curated. If you have enough know-how, you can make it into a relational database. You try to make it relatively unique and non-redundant. You go from what was once limited depth and limited coverage to something with greater depth and greater coverage.
The final evolution of many databases is that they go from that hobby database to a fully archived, open-deposition, often redundant database. This is what GenBank is. This is what the Protein Data Bank is. This is what MetaboLights is. These are relatively large databases, measured in gigabytes now. Users deposit their data. They're not so much curator-driven, where curators enter the data by hand. In some cases, people realize that archiving is too difficult and they go back to curated databases. Not only is archiving difficult, it's also very costly.
Very large publicly funded centers can generally maintain archived databases, and most laboratory-developed databases end up being curated databases. An example of the curated databases we've talked about is the Human Metabolome Database. It is not an archival database. It's not maintained by a large public body. It's maintained by people in my lab who work very hard at it. As you move from the simple hobby databases to these larger-scale archival or even curated databases, there's more need for data standardization. There's more need for automation to make updates consistent and to extend them across all entries. Then you also have to have much greater querying capabilities, because typically the user base increases and queries become more exotic each time. I'll talk about databases that probably cover all three types over the next little while.
I'll go back to the reason for databases, because of the problem that I identified early this morning: genomics and proteomics had already moved well past metabolomics, in part because they not only had searching tools like BLAST and Mascot, they also had databases like GenBank and UniProt and Swiss-Prot, which allowed people to query with their data and get answers. For a long time metabolomics had neither the query tools, the analysis tools, nor the databases to do that. That's where we, and Jeff, and a number of others have been involved. The initial problem 15 years ago was that most of the data in metabolomics was in textbooks and in journals, on paper. It still is, and there's a lot of it, because people have been doing pure metabolite analysis and pure biochemistry longer than they've been doing gene sequencing or protein sequencing. There's a lot of material in there, a lot of structures, but because so much of it is on paper as opposed to in electronic resources, metabolomics has generally lagged behind genomics and proteomics.
The other thing about metabolomics, as you guys have learned just from talking about your specialties, maybe talking amongst yourselves, is that people are in every area. We've got people doing plant and microbiome work, clinical work, drug work. We've got people doing NMR, others doing MS, others doing GC-MS. Some people are more bioinformaticians, others are more pragmatic chemists, and we've got clinicians. This just represents the typical collection you'll see for people doing metabolomics. If we were doing genomics, the audience would be somewhat more uniform.
So I'm going to talk about some of the databases in, I guess, three or four different categories. There are spectral databases, the NMR and MS ones. Many of these databases are for small molecules, not specifically metabolites. Then there are basically chemical property databases or compound databases. Those cover things like compound names and physical properties. Probably the most important databases we'll talk about are the pathway databases. And I guess I'll emphasize this point at least one or two more times: metabolomics still stumbles, still fails as a science, at least to my mind, because we're unable to do the proper interpretation of the data.
I think we've given you some examples of automated or nearly automated spectral conversion. Upload an NMR spectrum and a few minutes later you've got everything read out. Upload a GC-MS spectrum, same thing a few minutes later. If you go to these kit-based MS systems, you load your sample and an hour or two later you've got your results: really fast, really automated, high throughput. But when we come up with our lists, we do an abysmal job at interpreting our data. We don't match or meet the standards that many proteomics and transcriptomics and genomics people routinely achieve. And that's because in metabolomics, we're obsessed with catabolism and anabolism, the assembly and breakdown of compounds. And that's because most of the pathway databases we use, i.e.
KEGG, just cover catabolism and anabolism. We're missing, I'd say, 80% of the letters in the alphabet, if that's what we're looking at. If you could only see eight or ten letters of the alphabet, and if that was the only thing you could write with and the only thing you could read, you'd be missing most of the literature. That's basically what we're doing with our very limited pathway databases. It's in part because we don't include the signaling role that molecules, especially small molecules, have. We don't include the disease component that small molecules participate in. Those are not depicted in any of the databases that we typically use. The other database of importance is the comprehensive metabolomic database. Many of these are organism-specific, and they're particularly useful in metabolomics data analysis. So I'll talk about each of these databases in more detail.
So, the NMR databases. There are a number of them. These are examples, some screenshots. Most of them are ones you've never heard of, even if you are in NMR. There's the Spectral Database for Organic Compounds (SDBS) in Japan, maintained by AIST, as opposed to the NIST database. It's actually been around for a very, very long time, for more than 40 years. It now has more than 25,000 spectra from mass spec, tens of thousands of proton and carbon NMR spectra, and FTIR spectra as well. The numbers are probably a little different from what you have in your books or references, but these are the latest numbers. It has lots of search tools. Although most of the compounds are not metabolites, they are generally useful.
On the other hand, there's a resource called the BioMagResBank. This one is specifically about biomolecules, and molecules that are metabolites. The BioMagResBank is better known in the world of structural biology because it maintains reference data for protein NMR, thousands of protein NMR assignments.
But when they started offering this data for metabolomics, it only took about a year or two for the metabolomics downloads to exceed all of their protein downloads. So it's very popular, and it's actually quite well done. They've been dialing back on some of the spectra and metabolites that they have. There are now 387 compounds. I think your notes might say 680, but they've removed a lot recently. They have a lot of spectra for these compounds, both 1D proton and 2D carbon-proton. You can do a fair bit of standard searching. The compounds they chose were primarily plant metabolites. But plants do share some similarity with humans and other mammals, so probably about half of them would be considered mammalian metabolites. They've done a really nice job of assigning all of the spectra, which makes it quite useful.
There's another open source database called NMRShiftDB. There are about 50,000 spectra for about 40-some thousand compounds. It's maintained in Germany, and Chris Steinbeck was the one who originally developed it. Chris recently left EBI and is now back in Germany, focusing more on cheminformatics. Like SDBS, it is not restricted to metabolites. It includes many chemical compounds. It has some nice tools for chemical shift prediction, which are similar to mass spectral prediction. And you can search by a variety of mechanisms and names. So it is an impressive site, but it doesn't cover a lot of metabolites.
So those are the NMR databases. Of course, if you guys used Bayesil today, or if you've used Chenomx, those also have their own spectral databases. Likewise, HMDB actually has a much larger NMR spectral database than most of the others. But it is more strictly a comprehensive metabolomic database, so I won't talk about it until later.
Now, there are mass spectral databases. We've mentioned a few of these already. We've talked about the NIST database. We've talked about METLIN.
I've mentioned Golm and MassBank briefly, so let me go into a little more detail. The Golm Metabolome Database is one of the oldest publicly accessible databases for metabolomics. It's been around since the late 1990s, and was written about, I think, in 2003 or 2004. It's a GC-MS database. It has more than 2,000 compounds in its archives, and about 11,000 spectra that are linked to those compounds. It's compatible with NIST and AMDIS. Again, just like the BioMagResBank, there is a focus on plant metabolites, but they're continuing to expand it. You can both analyze and submit data to it.
So this is the Golm database website. How many people have ever heard of it? One, two, OK. That's better than most years. But it is a pretty impressive database. It's maintained by, I think, the Max Planck Institute and is continuously updated. I've been following it for many years and I'm always impressed by the fact that they're always adding to it. You can search through it. It's not the most user-friendly resource, but it is very searchable. You can look through different spectra and compare them. It has a lot of retention time data. If you're looking for a free resource on GC-MS, this is a good one.
I've talked about some of the other ones: METLIN, NIST, MassBank. There are a number of commercial providers. GNPS has actually grown quite a bit beyond what I have here. It's hard to track right now. There's MassBank of North America, MoNA. Has anyone heard of MoNA? One, two, three, OK. This one is a very large resource. It's open. METLIN, again, how many people have heard of METLIN? Everyone should have heard of METLIN by now. And then NIST. In terms of spectral size, MoNA has some of the largest numbers, more than 230,000. It also has a lot of compounds, but a lot of the compounds are theoretical compounds with theoretical spectra. So the true number is actually about 12,000 compounds with authentic spectra. METLIN's around 13,000. NIST's around 9,000 unique spectra.
Or rather, unique compounds. MassBank has around 11,000 or 12,000 compounds. Wiley produces its own, and there's GNPS as well. When you look at what's available in terms of MS spectra, at least tandem MS spectra, there are only about 20,000 compounds that actually have their spectra available. So it's a tiny number relative to the number of compounds that have GC-MS spectra. And it's not much more than the number of compounds for which NMR spectra are available.
For METLIN, I guess it's 240,000, with 68,000 high resolution, and 13,000 compounds with MS/MS spectra. More than two-thirds of those are actually peptides, dipeptides and tripeptides. And most metabolites aren't tripeptides. It was just an easy thing to collect, and so they did it. With METLIN's search, you can search by mass. You can search by various adducts. You can get hits and results from that, as you would with other types of searches in other databases. The structures are provided. Links to the spectra are also provided. It's not downloadable. It's quite protected because, in fact, it's used commercially by a number of companies now. So it's become ever harder to actually access the METLIN database. But it is a fairly rich resource.
In contrast to METLIN, there's MassBank, which is totally open. It was originally developed in Japan, and now there are branches not only in Japan, but also in Europe and in North America. So this is MassBank of North America. It has lots of different types of spectra from different instruments. It covers about 15,000 compounds. Not all of them are unique. Some of them overlap, so there might be a slight exaggeration. The Japanese site has been defunded, and it's only just maintained. Recently the website went down and hasn't come back up. So if you want to access MassBank as it was in Japan, you have to go to the MassBank Europe website. But it has the capacity for people to do peptide and peak searches, rather than just straight peak searches, for MS and MS/MS data.
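As a rough illustration of the search-by-mass-and-adduct idea described above for METLIN, here is a minimal sketch in Python. It is not METLIN's actual implementation or API. The three-compound table, the function name `search_by_mz`, and the 10 ppm default tolerance are assumptions made for the example; the adduct shifts are standard monoisotopic values for positive-mode ions.

```python
# Hypothetical sketch of a mass search with adducts; not any database's real API.

# Monoisotopic m/z shifts (Da) for a few common singly charged positive adducts.
ADDUCTS = {
    "[M+H]+": 1.007276,
    "[M+NH4]+": 18.033823,
    "[M+Na]+": 22.989218,
    "[M+K]+": 38.963158,
}

# Toy "database": compound name -> neutral monoisotopic mass (Da).
COMPOUNDS = {
    "Glucose": 180.063388,
    "Citric acid": 192.027003,
    "Caffeine": 194.080376,
}


def search_by_mz(observed_mz, compounds=COMPOUNDS, adducts=ADDUCTS, ppm_tol=10.0):
    """Return (compound, adduct, theoretical m/z, ppm error) for every
    compound/adduct pair whose theoretical m/z is within ppm_tol of observed_mz."""
    hits = []
    for name, mono_mass in compounds.items():
        for adduct, shift in adducts.items():
            theoretical_mz = mono_mass + shift
            ppm_error = (observed_mz - theoretical_mz) / theoretical_mz * 1e6
            if abs(ppm_error) <= ppm_tol:
                hits.append((name, adduct, round(theoretical_mz, 4), round(ppm_error, 2)))
    # Best (smallest absolute ppm error) match first.
    return sorted(hits, key=lambda hit: abs(hit[3]))


if __name__ == "__main__":
    for hit in search_by_mz(195.0877):
        print(hit)
```

With this toy table, an observed peak at m/z 195.0877 matches only caffeine as the protonated ion, and a peak at m/z 203.0526 matches only glucose as the sodium adduct. Against a real database with hundreds of thousands of entries, the same tolerance would typically return many candidates, which is why mass alone rarely identifies a compound.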
MassBank is actually quite nicely designed. Now, MassBank of North America is an extension of MassBank of Japan, and it has many, many more predicted spectra and, in essence, predicted compounds. So it has a different philosophy than the others. MassBank Europe also has a lot of data on contaminants, water contaminants, and that's sort of the exposome, kind of a new direction for metabolomics.
In addition to these spectral databases, which cover NMR, GC-MS, LC-MS, and MS/MS, there are also compound databases. Some of the better known ones are things like PubChem, which all of you should know about. ChemSpider, how many people have heard of ChemSpider? Okay, a good chunk. How many people have heard of ChEBI? Not so many. How many have ever heard of Ligand Expo? One. Okay, so Ligand Expo is a collection of compounds from crystal structures in the Protein Data Bank. These include drugs as well as metabolites and cofactors. So it's an interesting resource.
ChEBI has got about 55,000 compounds right now. It was originally developed primarily as an ontology database, just to give definitions. A lot of the compounds are sort of borrowed from KEGG, DrugBank, LIPID MAPS. You can get synonyms. They do a lot of good work on synonyms and proper naming. There's also information about the structure, and it's quite searchable.
The most popular chemical database is PubChem. There are now 94 million compounds in PubChem, and about 230 million substances. Basically, to qualify, it has to be a compound with fewer than 1,000 atoms. So that can include some pretty large molecules, almost proteins. It has lots of contributors, and it has various ways of classifying the different compounds. There's quite a bit of information visible now in PubChem. It's growing all the time. There are actually huge amounts of information that are hidden in PubChem, so if you know people at PubChem, they can sometimes release that information to you.
It's exceptionally well linked, and the information that they're collecting is getting better all the time. However, as I said, probably less than 0.1% of the compounds in PubChem are related to biological entities. So 99.9% of the compounds in PubChem have nothing to do with biology, have never left the lab, and will never ever be found in any organism. So if you use PubChem to identify compounds in a metabolomics experiment, you have a 99.9% chance of being wrong.
ChemSpider is a similar model. It's maintained by the Royal Society of Chemistry. It has a slightly different model: not quite as many compounds as PubChem, but more data sources. It has some interesting links and also some interesting connections to various spectra. How public and how useful it is kind of ebbs and flows depending on who's in charge of it. But again, just like PubChem, 99.8% of the compounds are not biological. They are not found in any living organism. Most of the compounds in both PubChem and ChemSpider, in fact, were probably never synthesized. A lot of them are in virtual screening libraries where people think they made them, but they're not sure. So they've just uploaded what they think the structures are. This is how you can get millions and millions of structures. So it is really not a reliable indicator of the true chemical space that is out there. The best estimate we have of the number of chemicals that are actually produced, manufactured, and available at more than picogram levels is about 200,000. That's the total chemical inventory globally. And of those, only a small number, probably less than a few percent, are actually biologically relevant. Many others are things like dyes and food additives and things like that.
Ligand Expo, this is one that no one's heard of. This is the Protein Data Bank extraction of the small molecules in proteins, in protein binding sites.
Unlike other databases, which just give you sort of two-dimensional representations, these are the actual 3D structures of the molecules. And they are available in a variety of formats.
There are other kinds of compound databases that are not as well known. 3DMET, probably no one's heard of. KNApSAcK, has anyone heard of KNApSAcK? Okay, only Karen. LIPID MAPS, how many people have heard of LIPID MAPS? A few of you? Probably no one's heard of MyCompoundID. So 3DMET is a 3D structure database. It's a little bit like Ligand Expo: these are either crystal structures or modeled structures of natural compounds, mostly from plants. KNApSAcK is a really nice one because it links metabolites, plant metabolites, to species data. So if you know what plant you're analyzing, go to KNApSAcK, because it'll probably give you a list of the compounds found in that plant. MyCompoundID is something I'll talk about a little bit more, but it was developed in collaboration between ourselves and a lab at the University of Alberta headed up by Liang Li. LIPID MAPS has about 30,000 lipids and it's been maintained at UCSD for a number of years. It established the nomenclature, sort of the official nomenclature, for lipids. But most of the LIPID MAPS data is actually derived from another lipid database in Japan. So I think it's really primarily a nomenclature database.
MyCompoundID was developed as a resource to address the dark matter of the metabolome, those 90-some percent of the peaks that you can't identify. The assumption here is that the 90% of the compounds that we can't identify are metabolized metabolites. So you eat something, it goes through your gut and the microbiome, and it's transformed into something exotic.
And so what they've done in MyCompoundID is codify about 75 metabolic transformations, phase one and phase two, as well as some of the fragmentation events that can happen with metabolites just through decomposition or even mass analysis. At the time, it took about 8,000 human metabolites from HMDB and then essentially did formula generation. It didn't do structure generation, it did formula generation, and calculated molecular weights from that. That gives roughly a 40-fold, almost 50-fold, increase from the number of starting metabolites to the number of first-pass metabolites, and then from those 370,000, another 30-fold increase to get what are called second-pass metabolites. So there are 11 million metabolites, of which 99% are theoretical. Structures are not there, but masses are. This is a website that's available through MyCompoundID, and people can search masses with it to see if they can get potential hits. In that regard, it doesn't give you a structure, but it gives you a potential match and says this could be a metabolite that's related to the starting compound and has gone through some of these transformations. So if you upload data to MyCompoundID, you can perform searches, and it can produce potential structures if the structures exist or have been generated. As I say, most of the theoretical ones are actually just masses.
So I'm going to get into the part that I think is the most relevant, not only for today but also for tomorrow. These are the pathway databases. This is what connects metabolomics to proteomics to genomics to biology. This is ultimately how you interpret things. There are a number of databases out there that are pathway-specific. KEGG, how many people have heard of KEGG? Who has not heard of KEGG? Okay. How many people have heard of Reactome? A few. How many have heard of BioCyc? A couple. And then the Small Molecule Pathway Database? A few. Okay, so again, I think everyone's familiar with KEGG.
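Going back to the MyCompoundID expansion described above, the idea of applying transformations to masses rather than structures can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the six transformations below are a small hand-picked set (the real resource is said to codify about 75), the function names `expand` and `match` are hypothetical, and caffeine stands in for the roughly 8,000 HMDB starting compounds.

```python
# Hypothetical sketch of mass-only metabolite expansion; not MyCompoundID's code.

# A few common biotransformations and their monoisotopic mass changes (Da).
TRANSFORMS = {
    "hydroxylation (+O)": 15.994915,
    "methylation (+CH2)": 14.015650,
    "demethylation (-CH2)": -14.015650,
    "acetylation (+C2H2O)": 42.010565,
    "sulfation (+SO3)": 79.956815,
    "glucuronidation (+C6H8O6)": 176.032088,
}


def expand(entries, transforms=TRANSFORMS):
    """Apply every transformation once to every (label, mass) entry.
    Only masses are generated; no structures are produced."""
    return [
        (f"{label} -> {name}", mass + delta)
        for label, mass in entries
        for name, delta in transforms.items()
    ]


def match(observed_mass, candidates, tol_da=0.005):
    """Return the labels of candidates whose mass is within tol_da of observed_mass."""
    return [label for label, mass in candidates if abs(mass - observed_mass) <= tol_da]


start = [("Caffeine", 194.080376)]   # neutral monoisotopic mass
first_pass = expand(start)           # one transformation applied
second_pass = expand(first_pass)     # two transformations applied

if __name__ == "__main__":
    print(len(first_pass), len(second_pass))
    print(match(386.1074, second_pass))
```

Here one starting compound gives 6 first-pass and 36 second-pass candidates, so the growth per pass equals the number of transformations. The figures quoted in the lecture are consistent with that kind of growth: 370,000 / 8,000 is about 46 and 11 million / 370,000 is about 30. Note that this sketch does not merge duplicates, so the same two transformations applied in a different order appear as separate candidates.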
These pathway databases are important resources. They do provide that linkage. This actually gives people doing metabolomics a slight leg up on the people doing exclusively genomics or exclusively proteomics. Many of the pathway databases cover multiple species. Many of them provide visualization tools to do some kind of mapping, which also assists with interpretation.
KEGG is the best-known pathway database. It's arguably the oldest one, it's maintained in Japan, and it has been around for getting on, I think, 20 years. The pathway wiring diagrams are very compact and quite informative. They use essentially the same structure for all pathways and then map those pathways to all the thousands of organisms that have been either characterized or sequenced. KEGG also contains compound data. It also contains drug data. But it's a little surprising. While many of the compound databases like KNApSAcK or HMDB have 50,000 to 100,000 compounds, KEGG only has 18,000 metabolites. It has a lot of drugs. Many of these drugs are not really drugs. There are only about 2,500 known drugs, so the other 9,000 are sort of exotic herbal medicine phytochemicals that no one really calls drugs. And then they have a big focus on glycans, which used to be of interest to the people at KEGG, but that number has been steadily shrinking.
The total number of pathways in KEGG is just a little over 500. And it's trying to cover all the pathways corresponding to all phyla and kingdoms of life: plants, insects, animals, mammals, microbes. There are some pretty exotic pathways. The bottom line is that the number of pathways in mammals is about 100. If you're a human, you're a mammal, so there are only 100 pathways in KEGG covering your biology. And all of those pathways are about catabolism and anabolism. None of the KEGG pathways have anything to do with signaling. None of the KEGG pathways have anything to do with disease or pathological conditions.
And so if you limit your interpretation of metabolism to KEGG pathways, you get almost nothing. Nothing useful. The other thing that's been problematic with the KEGG pathways, and which led to things like Reactome and others appearing, is that there's very little biological context. If you are a naive user of the KEGG pathways, you'll click on a generic pathway. You might think that this is a human pathway, but it may not be labeled as such. And so you'll see this description of photosynthesis. It's hard for me not to say, I didn't know humans could photosynthesize. Most people who have worked with KEGG are not aware that the TCA cycle, or citric acid cycle, happens in the mitochondria, because in KEGG none of that information is shown. The TCA cycle does not happen outside of the mitochondria. So there's no organelle context. There's no information about the different metabolic, catabolic, and anabolic pathways that are followed by the heart or the lung or the liver, which are all very different. Without the biological context, without the organelle context, a lot of people are misinterpreting or misunderstanding biology and biochemistry.
That was one reason why we started working on the Small Molecule Pathway Database: to go beyond simply anabolism or catabolism to things like signaling and disease. The Small Molecule Pathway Database has a lot of pathways. I don't know, Karen would have a better idea of exactly how many, but the stats don't work on this one anymore. There are about 400 drug pathways, 200 disease pathways, 220 catabolism and anabolism pathways, and then 40 other pathways that we can't really identify for sure or classify easily. What's done with the Small Molecule Pathway Database is to draw metabolic pathways, but to show their context, to describe where they are. Are these pathways that happen in the mitochondria? Are they pathways that happen in the cytoplasm? Are they pathways that connect different tissues to different cells, to different organs?
We also wanted to be able to depict the molecules, because right now with KEGG a molecule is simply a dot and a protein is simply a square. So we wanted to be able to show that information. We wanted to be able to show the membrane structure and the nucleus. We also wanted, with SMPDB, to map gene chip, protein expression, and metabolomic data onto the same pathways. And it has tools that allow you to take lists of chemicals, proteins, metabolites, and genes to identify pathways and, in some cases, to perform disease diagnoses.
So this is what an SMPDB pathway looks like. You can see the pink and red thing, that's the mitochondria. You can see other types of organelles depicted. Circles are the proteins. The white boxes contain the structures of the metabolites. You can see in some cases the metabolite being produced. I guess this is for PKU: the aberrant molecule that's produced and the effects that it has on the brain and on other tissues. You can also see where some of the cofactors appear. All of the pathways in SMPDB have text explanations, explaining what the pathway is. All the compounds are hyperlinked to HMDB, and all the proteins are hyperlinked to UniProt. So again, it's trying to give you a biological context, a chemical context, and a pathway context through a text explanation.
You can type in lists of metabolites and the metabolites will then be highlighted on the pathways. You can play around with checking things on or off to see how many are visible. It has sort of the Google Maps viewing tools, so you can zoom in and zoom out and scroll left and right. You can also enter data related to concentrations, and the concentration data is then mapped to the colors on the metabolites. Again, you can see organs and tissues depicted within the pathways. There are different views.
Yes? I have struggled with this window previously. Do you have to have absolute quantification to do this, or can you use fold changes?
You can use relative quantitation for this. You can choose any number. It will give you sort of a color or gradient color scheme. You can also change the depiction. Some of the views in SMPDB are quite large and sometimes complicated, so you can switch from a color view to a black and white view to a KEGG-like view. Also, the pathways are saved in different formats. They can be saved in BioPAX format, SBML format, SBGN format, and a pseudo-KEGG format, and they can also be saved as SVG or PNG images.
All the pathways that are in SMPDB are actually generated through a web server called PathWhiz. So, PathWhiz for pathways. It basically gives you a vehicle, accessible anywhere, for producing machine-readable pathways. We've had a few people contribute to PathWhiz. It's actually an open access system, so if people want to draw pathways, they can be added in. They have to be pretty good in order to meet the requirements. But as a tool, it allows you to generate machine-readable BioPAX, SBML, and SBGN models and to generate different pictures, from full color to black and white, from KEGG-like to SMPDB-type, and it also supports viewing. Karen is the local expert in PathWhiz and also in SMPDB, and has been working on these pathways for more than a year now, a year and a half, most of her life.
These are some examples of the tools, the pull-down menus used to generate or edit reactions and enzymes. You don't have to draw an enzyme. You don't have to draw a structure. Many of these things are available through links to HMDB or to UniProt. You just have to be able to click and drag. So again, this is an online web-based tool for drawing, and you can pull down these different modules and images. You do have to spend a little bit of time sketching out what your pathway should look like, because if you don't, it'll start looking pretty ugly. But just like any drawing tool, it allows you to expand, shift, and rotate things to fit your screen.
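To make the answer about relative quantitation concrete, here is a minimal Python sketch of how fold changes could be turned into a gradient color for highlighting metabolites on a pathway. The blue-white-red scale, the log2 cap of 3, the function name `fold_change_to_hex`, and the example values are all assumptions for illustration; the lecture does not describe SMPDB's actual color scheme.

```python
# Hypothetical sketch: map relative fold changes to a blue-white-red gradient.
import math


def fold_change_to_hex(fold_change, max_abs_log2=3.0):
    """Map a positive fold change to a hex color.
    1.0 (no change) -> white, increases -> red, decreases -> blue.
    Changes beyond 2**max_abs_log2 in either direction are clipped."""
    if fold_change <= 0:
        raise ValueError("fold change must be positive")
    # Scale log2 fold change into [-1, 1].
    scaled = max(-1.0, min(1.0, math.log2(fold_change) / max_abs_log2))
    if scaled >= 0:
        fade = round(255 * (1 - scaled))
        r, g, b = 255, fade, fade      # white fading to pure red
    else:
        fade = round(255 * (1 + scaled))
        r, g, b = fade, fade, 255      # white fading to pure blue
    return f"#{r:02x}{g:02x}{b:02x}"


# Example input: metabolite -> fold change relative to control (made-up values).
fold_changes = {"Phenylalanine": 8.0, "Tyrosine": 0.5, "Citrate": 1.0}
colors = {name: fold_change_to_hex(fc) for name, fc in fold_changes.items()}

if __name__ == "__main__":
    for name, color in colors.items():
        print(name, color)
```

Working on a log2 scale means that a doubling and a halving are placed symmetrically around white, which is the usual reason fold changes rather than raw ratios are colored this way.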
There's a whole collection of icons for livers and endoplasmic reticulum and transporters. And so pathways with cofactors, transport pathways, disease pathways, signaling pathways, protein and metabolite pathways can all be depicted. You can also have zoom boxes that are associated with activities in particular regions of a cell, with the ribosome, or rather the mitochondria or endoplasmic reticulum. So the intent of SMPDB, and someday I hope SMPDB will be incorporated into MetaboAnalyst, is to try and capture those things in metabolism that are much more relevant to biology. As I said, you know, 80% of what's really relevant in metabolomics is not captured by KEGG. That 80% relates to disease pathways, pathological pathways, and signaling pathways. Most metabolites play some kind of signaling role. And the classic example is the Warburg effect, which is fundamental to all of cancer, yet there is no Warburg pathway in KEGG. And the immune response generated by the Warburg pathway is well known, but again it's not depicted, and it's all directed by small molecules. So these are things that we miss over and over and over again when we try and interpret our metabolomic data using just catabolic or anabolic pathways. The last set of databases I'll talk about are the comprehensive metabolite or metabolomic databases. So I'll include KEGG in that because it links chemicals with pathways. There's EcoCyc, HumanCyc; the Cyc databases also cover metabolites. To pass muster, they have to have at least 1,000 compounds. Many of them are organism-specific. They have to be updated continuously. And they have to contain more than just chemical data, or more than just pathway data, or more than just spectral data, or more than just biological data. So they need to combine two or more elements together. So MetaboLights, which has been running for a few years, is technically the GenBank for metabolomics. This is an archival database.
How many people have deposited data in GenBank, or rather in MetaboLights? Okay, one. How many people have deposited data in GenBank? None. Anyways, this is a problem in the sense that ideally, when you prepare data in metabolomics, you should be sending it to some archival resource. A fair bit of money has gone into establishing MetaboLights. It's maintained by the European Bioinformatics Institute in Cambridge. And you can upload your chemical data. You can upload your experimental data, your methods and processing, your compound data, your lists of data. It takes just about everything. It takes and combines that data, allows you to do querying, and is linked to ChEBI. It complies with the Metabolomics Standards Initiative. And so I certainly encourage people who are collecting metabolomic data to try depositing it in MetaboLights. As most of you know, we've also been involved in maintaining metabolomics data and databases. I guess technically we don't have the first comprehensive one. There's the Human Metabolome Database, which appeared in 2006. DrugBank is another database, actually many times more popular than the Human Metabolome Database. That links drugs to drug targets, and so it's used by many pharmaceutical research companies. There's a yeast metabolome database. There's a phenol-polyphenol database for foods. The E. coli Metabolome Database. The food database. ContaminantDB, SMPDB. The toxic metabolome, or Toxic Exposome Database. There's also now, more recently, a new fecal metabolome database. So for those of you working on microbial studies, that just came out a few weeks ago. So these are all maintained by people like Karen and like Manoj and Mark, who contributed to that work. And the Human Metabolome Database was actually started in about 2005. It was funded originally by Genome Canada as a project, the Human Metabolome Project. And we were mandated to identify and quantify all the metabolites we could in common biofluids like blood and urine and cerebrospinal fluid.
And to make that data freely available. That expanded from just human endogenous metabolites to drugs and foods and toxic exposome data. And it included experimental work that we've done in our lab as well as data that's in the literature and has been compiled by other people. When we first started looking at this, when we looked at the databases that were online, HumanCyc and KEGG, the total number of compounds that they listed was 690. So that was the entire extent of the human metabolome. The first release of the Human Metabolome Database was in 2006, and we tripled it in size. And it was big news, because people didn't think the human metabolome was going to be that large. Then a couple of years later it expanded to 6,400. A few years after that it was 37,000. In 2017 it was 42,000. And then as of 2018 it's 114,000 compounds. And as I said, it looks like it'll be well over 200,000 or 300,000 compounds. So it's grown quickly, both as our knowledge has grown and as our understanding of what is in the body and what constitutes the metabolome has changed. This is a little dated and I probably should have updated it. But these are some numbers that I showed earlier about the size. The total number of endogenous metabolites now in the Human Metabolome Database is around 65,000 to 70,000, and the exogenous metabolites about 30,000 to 40,000. So these resources are maintained on a bunch of different websites, and their URLs are given here. With the Human Metabolome Database, as I said, it's now up to 114,000 compounds. About 200 microbial metabolites that are unique, and about 1,100 that are associated with the gut metabolome, which includes both endogenous metabolites as well as microbial metabolites. There's lots of information on diseases, lots of spectra. More recently we've generated a lot of CFM-ID spectra. So there's a couple hundred thousand spectra, MS and GC-MS. With HMDB you can do searches against sequence, protein sequence, gene sequence. You can do spectral searches.
You can browse, you can search by pathways and disease, biofluid concentrations, and so on. I think most people have probably seen some part of the HMDB. It links to the Small Molecule Pathway Database. It has links to various spectra that have been either collected in the lab or predicted elsewhere. Right now there are a little over 100 data fields in any given metabolite entry. And the spectra can be searched through standard search tools. There's been a very recent upgrade to the spectral searching, with a number of improvements for it: tandem mass spectra as well as paradigm searches with various rankings. You can do NMR spectral searches. You can do a variety of structure searches to look for similar or identical structures. You can look through a variety of biofluids and look for different diseases. There are several hundred inborn errors of metabolism that are also tracked. There are at least, I guess, close to two dozen biofluids or excreta that are measured. Lots of stuff specifically for clinicians and physicians and clinical chemists. DrugBank, as I mentioned before, is actually quite a bit more popular than HMDB, in part because of its utility in repurposing drugs. So lots of drug companies make use of it, and a lot of new drug treatments have been developed through this particular database, just simply finding what was largely already in the literature but being able to explore it in more detail. It has lots of information about drug-drug interactions, drug-food interactions, drug transporters, drug metabolism, mechanism of action, pharmacological, pharmacokinetic, and pharmacogenomic data, and absorption, distribution, and metabolism as well. There are protein drugs, so it also has information on their structures, as well as on the small molecule drugs. Many of the same tools in HMDB are also available through DrugBank, including chemical queries and category browsing.
Both HMDB and DrugBank now have a more formal ontology that allows you to look for things based on function, biological role, industrial role, health effects, and other related properties. Lots of information on gene and protein sequences, and lots of ways to extract data in a more detailed way. The Toxic Exposome Database: this is about the toxic compounds that are found everywhere. It covers things like pesticides and herbicides and endocrine disruptors, solvents, PCBs, furans, carcinogens, and it's very much like a drug database, but the compounds aren't drugs, or most of them aren't. It has lots of information on chemical genomic data. The food database is about the compounds that are in our food. About 700 foods are covered in there, along with information about the flavor and aroma and color that some of these food constituents are associated with, and their effects on human health. So it's a little more complicated than what you'll find on your cereal box. The yeast metabolome. Yeast is used to make bread. It's also used to make wine and beer and many other things. So there are more than 2,000 metabolites in the yeast metabolome database. Lots of detailed information about proteins and enzymes, but also about the products that are produced through fermentation, which are kind of unique to yeast. It's structured very similarly to the Human Metabolome Database. Same thing with the E. coli Metabolome Database. This is a bacterial one. And again, along with the metabolites that are there, there are lots of reactions, pathways, information on transport processes, and lots of spectral data for the compounds. So what I'm trying to highlight is that there are databases for specific needs, and if you simply turn only to PubChem, you miss the point. If you know something about your system, and most of us do, there's no reason why you should be looking at just generic chemical databases.
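To make that concrete, here is a minimal Python sketch of what happens when a mass search is restricted to compounds known in your organism. It is a toy under stated assumptions: the `Compound` class, the helper names, the 10 ppm tolerance, and the tiny in-memory library (including the two made-up "synthetic" isomers) are all hypothetical, and no real database or API is being queried.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    monoisotopic_mass: float
    found_in: frozenset = frozenset()  # organisms where it has been reported


def mass_matches(compounds, observed_mass, ppm=10.0):
    """Return every compound whose mass is within a ppm tolerance of the observed mass."""
    tolerance = observed_mass * ppm / 1e6
    return [c for c in compounds
            if abs(c.monoisotopic_mass - observed_mass) <= tolerance]


def restrict_to_organism(candidates, organism):
    """Keep only the candidates that have been reported in the given organism."""
    return [c for c in candidates if organism in c.found_in]


# Invented toy library standing in for a generic chemical database.
library = [
    Compound("glucose", 180.06339, frozenset({"human", "yeast"})),
    Compound("synthetic_isomer_A", 180.06339),  # made-up entry, never seen in biology
    Compound("synthetic_isomer_B", 180.06339),  # made-up entry, never seen in biology
    Compound("alanine", 89.04768, frozenset({"human"})),
]

generic_hits = mass_matches(library, 180.0634)
human_hits = restrict_to_organism(generic_hits, "human")
```

In this toy case the generic search returns three candidates for one observed mass, and the organism filter cuts that down to one. With a real generic database the list of same-mass candidates is far longer, which is the whole point.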
You should be trying to turn to organism-specific, environment-specific, or purpose-specific databases, because these will have the information that will allow you to interpret the results you get, but also limit the results to what you were attempting to see or find. As I said, 99.9% of the hits you'll get with a PubChem search will be wrong, whereas if you're searching through an organism-specific database, at least it tells you if it's there, or if it's been found in that particular organism. So this is just sort of a comparison between different databases and the types of things that you'll tend to look for, whether it's information on nomenclature, references or links, the types of spectra, links to pathways, structures, descriptions, definitions, chemical properties, and physical properties. That's what you typically need in a comprehensive database to be able to interpret things. So I guess it's time to wrap up, because it is now exactly 4:59. So you can continue if you'd like. You can come back to the lab to play around with some of the interpretation or data analysis or the spectra-to-list studies you did, or alternatively, if you're tired and want to enjoy the nice day outside, you're done.