 for seminars is really going to be about how to interpret your data. We're going to talk about databases for compound identification. But then we're also going to learn about statistical methods and, tomorrow, MetaboAnalyst, to interpret your lists of compounds and concentrations. So for this module, we're going to be talking about the different types of metabolomic databases. We'll look at some of the NMR and MS spectral databases. We'll also talk about pathway databases. And we'll learn a little bit more about what we call comprehensive metabolomic databases, which are distinct from spectral databases or pathway databases or compound databases. So again, what you did in your lab was basically to get lists. So from spectral lists, the next part is how do you go from lists to biology, lists to pathways, lists to interpretation. So historically, there have been two solitudes, the same thing, I guess, as in Quebec. You've got cheminformatics and bioinformatics, and they've evolved quite differently. The way you can think about them is that cheminformatics has been around since the 1960s. It was developed largely for the pharmaceutical industry and for medicinal and organic chemists. In the 60s, computers and software were very expensive and were a way of making a lot of money. And so in fact, the American Chemical Society and others have used that model for many, many years and still do. So most chemical information is not free. Most of it is very expensive, and public access to most of it is very limited. It has supported large companies like Beilstein and Sigma, established names; MDL is another group, and CAS. On the other hand, bioinformatics, which came to the fore in the 1990s, was not designed for chemists, but for molecular biologists. And by the 1990s, the web had started to appear.
And there were very strong pressures on the US government, in particular, to make everything open access, which they did through an act of Congress that established GenBank. So the result is that in bioinformatics, everything is web-based, open access, and free. And it's largely funded by federal government agencies: NCBI, Genome Canada, EBI, NIH, and so on. So those two things have made it a little difficult for metabolomics, because metabolomics falls in the cheminformatics world, yet it's trying to interface with the world of genomics and proteomics, where everything is free and open access. So that's putting things a little bit in context, and it partly explains why metabolomics is lagging a little bit. What do we use databases for? Well, databases, whether they're public or private, are basically a way of consolidating information. But when you build a web-based database, you can actually link it through a variety of techniques, software and hardware methods. In databases, you can do lots of information retrieval. You can do query matching. Everyone's done that in some way or another. Databases also contain lots of reference values, reference data, reference images, reference spectra. And this is how some of the software you guys have just been using works. So these are actually software tools based on large databases. We also use lots of databases for training and testing things. And this is how a lot of the tools that I've mentioned, the prediction tools, have been tested and trained. Primarily, in the world of cheminformatics and bioinformatics and metabolomics, we use databases a lot for both similarity searching and prediction. So that's structure, sequence, and spectral prediction, as well as searching. Databases also evolve over time. And with the databases I've built, and probably some that you have been involved in, often they just start as a local consolidation of information you wanted for yourself or your lab. So it starts off as a hobby database.
In fact, GenBank started off as a hobby database. It was kept by a group, I guess in California, who were just trying to keep a small number of gene sequences related to oncogenes and viral genes. Same thing with the protein sequence databases in the 60s. Those were also sort of hobbies. But eventually people start asking for them, and they become a little bit more than a hobby, so they start becoming curated and a little more sophisticated. They might become relational. And eventually people say, well, I'd like to contribute to the database. That certainly happened with GenBank and obviously the Protein Data Bank. It's no longer just something that's local to a small community; it's open to everyone. So with open deposition, the result is that many people are doing the same thing at the same time, so it gets a little bit redundant. Often, because the database is so large, it also has to become distributed. In the case of GenBank, it's distributed across three different countries: Japan, the US, and Great Britain. So there's a trend as well. A hobby database is small, with limited depth. As you get to larger databases, coverage gets greater, and the depth may get greater. And then as you get to an archival, open deposition database, you get very extensive coverage, or breadth, but not a lot of depth, because you've got everyone just racing to put stuff in. There are only a few truly open deposition, well-archived databases, and most of them have to be run by government agencies, GenBank being one, PDB being another, and UniProt being a third, just because the costs and resources are absolutely enormous. As you move up in terms of the scale and scope of databases, you also need to become more standardized. There's greater dependence on automation, and the expectations for querying capabilities get greater and greater. So as I say, it's a challenge maintaining databases. It's easy to start a database.
Most databases kind of sit in the middle, as curated, non-redundant databases, and that's where I would say a lot of the metabolomics databases are at. So I mentioned this issue when we saw this picture before, about how there's software already that can take an unknown gene sequence or an unknown protein mass spectrum and give you information about identification, possible concentrations, and so on. For metabolomics, for the longest time, there really wasn't that. So over the last 10 years, there's been a real concerted effort to do that. Prior to about 2007, most of the data for metabolomics was in textbooks. Basically it's been lagging behind genomics and proteomics by about 15 to 20 years. And metabolomics covers such a broad area. You can see that just in terms of the people here, who are looking at all kinds of areas of science: people who are chemists, people who are software specialists, people who are working in clinical work, drug research, plants, parasites. We've got a whole range of people. And so to try and create a database that meets all those needs is quite challenging. The result is that it's led to a bunch of different types of databases that are used in metabolomics. So there are the spectral databases, which are crucial. And you guys have seen how these spectral databases are being used in the exercise that you just did. There are compound databases. These are things that allow you to associate a compound name or a formula or a mass with a compound, so they allow you to identify compounds. The pathway databases allow you to go from the compound list to some biological interpretation. And the comprehensive metabolomic databases combine the spectral databases with the compound databases and the pathway databases and a whole lot of other things, along with descriptions and details. Usually the focus is on a specific organism.
But there can also be ones that are fairly comprehensive, covering multiple organisms. So I'm going to talk about each of these databases, but these are the ones that allow you to go from your lists of compounds and concentrations to getting some more information about them. So the NMR spectral databases: there are several that are around. We were looking at sort of defined biofluids, but there may be situations where you're looking at very unusual mixtures that are not blood or serum or urine or something like that. So you're having to try and match NMR spectra to something a little more exotic. These are examples of databases that have reference spectra from pure compounds. One is in Japan, maintained by the Japanese Institute of Standards. It's been around since the 1970s. It includes not only NMR spectra, so carbon NMR and proton NMR, but also mass spectra and FT-IR spectra. You can do a lot of spectral searching with this. However, most of the compounds in this site are not metabolites. There are some, but the vast majority are not. Another one is maintained by the BioMagResBank. This is a database that was originally established for protein NMR, but it has expanded to metabolomics. You can go to the metabolomics section there and find spectra for almost 1,000 compounds. There's an average of five or six spectra for each compound, so it translates to something on the order of five or six thousand spectra in total. You can search by names and synonyms, InChI keys, and SMILES strings. Originally it was designed for plants, but it's been extended to include mammalian metabolites. So in that regard, the BioMagResBank has some unique compounds that would be of interest in areas of botany, food, and sometimes fungal studies. They've also assigned all of these reference spectra. Another database is called NMRShiftDB, now in version two.
This is a community database where people submit proton and carbon NMR spectra, and it covers a lot of compounds. It's up to 40,000. So it's bigger than the BioMagResBank, it's bigger than the Japanese spectral database, and it keeps growing. But it's also one that's largely for compounds that are not metabolites. Interestingly, NMRShiftDB was started by Chris Steinbeck. Some of you might know Chris, who's been involved in leading ChEBI. He has also established the database called MetaboLights. NMRShiftDB is not, as I said, limited to metabolites. There are lots of organic compounds, so I'd say metabolites might represent 1% of the total. It also supports chemical shift prediction. So if you paste a compound in, it'll predict the likely chemical shifts, which is very useful. You can search by name and structure. It also stores the files as JCAMP files, which is the NMR standard. And it includes chemical shift assignments that people have made. Another database, maintained by the people who developed the BioMagResBank, has a bunch of spectra, but they've also added a whole bunch of literature-derived spectra. It consists of about 20,000 compounds, but only about 2,000 have spectra. So, like the BioMagResBank, it supports searches by names and chemical shifts. You can also do mass spec searches. It has data on chemical formulas, synonyms, species associations, references, and other public links. So those are some examples of spectral databases for NMR. We've also looked at, and I've shown this slide before, the MS spectral databases. So the NIST one is one. METLIN is one. We've talked about the Golm one. So in Bayesil, what you're saying is that you have incorporated the data from these other databases? Or in Bayesil, when you pick from a library, can you pick these different libraries? Yeah, so in Bayesil, actually, you have no library choice. You choose a biofluid, and it has its own specific library.
So those spectral databases I'm talking about are really intended to deal with unknowns that didn't match anything, say, in Bayesil. You might have a list of chemical shifts and say, okay, here's something way up at 8.9 ppm. It's not fit. What in the libraries has peaks at 8.9 ppm? And you could search through those, and it might suggest a half dozen compounds that have chemical shifts way up in that range. So no, with Bayesil, you can't incorporate them, because it's very strictly controlled by the protocol. And in many cases, the spectra that are collected in these databases were not collected with the same protocol or at the same pH, or they were collected in organic solvents, which can also change the chemical shifts a little bit. So let's say you run your sample through Bayesil, and you expect lots of your stuff to come out; you would get something that might be unknown if it's not in that library? That's what you'd typically do. But as I said, I think the real intent of these libraries is to deal with things that aren't serum or cerebrospinal fluid, which Bayesil deals with very well. So if you're looking at, you know, pine sap, or some jellyfish extract. And so is there a place for you to upload your files and do a similar thing? No. You can only do peak searches. So you have to type in your peaks. You have to type in the chemical shifts that you've manually measured. So Bayesil is quite unique. It's the only tool out there, really, that does automatic assignment and quantitation. But it's just limited. If Bayesil were to work for everything, it would probably take everyone in this room working steadily for five years on every fluid to try and cover that. That's right. Bayesil is being modified so that people can create their own libraries and do other fits with it, but it's not really online, or those tools aren't really publicly available yet. Next year.
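To make the peak search concrete, here is a minimal sketch of the idea: you type in a chemical shift you measured by hand, and the library returns every compound with a peak near it. The library contents, compound names, and the `search_by_shift` function are all hypothetical placeholders, not data or an API from any of the databases mentioned; real reference shifts also depend on pH and solvent, as noted above.

```python
# Hypothetical reference library: compound name -> list of 1H chemical shifts (ppm).
# Values are illustrative placeholders, not curated reference data.
REFERENCE_SHIFTS = {
    "compound_A": [1.47, 3.77],
    "compound_B": [8.45],
    "compound_C": [4.43, 8.08, 8.83, 9.12],
    "compound_D": [7.30, 8.90],
}


def search_by_shift(query_ppm, tolerance_ppm=0.05, library=None):
    """Return (compound, matched_shift) pairs with a peak within the tolerance.

    Hits are ordered by how close the library peak is to the query shift.
    """
    if library is None:
        library = REFERENCE_SHIFTS
    hits = []
    for name, shifts in library.items():
        for shift in shifts:
            if abs(shift - query_ppm) <= tolerance_ppm:
                hits.append((name, shift))
    return sorted(hits, key=lambda hit: abs(hit[1] - query_ppm))


if __name__ == "__main__":
    # "What in the library has peaks at 8.9 ppm?"
    for name, shift in search_by_shift(8.9, tolerance_ppm=0.1):
        print(f"{name}: peak at {shift} ppm")
```

A wider tolerance is a reasonable choice when the reference spectra were collected under different conditions than your sample, at the cost of more false candidates.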
So I think, I mean, these are very good points that you've made, but the field is still evolving. The software isn't mature. I mean, you saw even XCMS, which gets, you know, thousands of users, didn't work for everything and still couldn't handle most browsers. And this is because a lot of these things are run internally in people's own labs, whereas in the case of GenBank, it's run by a huge government agency with a staff of thousands. So it's a real challenge in these other fields, where it's basically the generosity of a bunch of scientists saying, you know, here's what we've got. So, you know, Gary Siuzdak is spending lots of money, probably personal money, just to keep XCMS Online working. We've spent a lot of money and time trying to make sure that Bayesil and HMDB all work for people. So these are sort of, again, just our gifts to the community, but they're not as sustainable or easily maintained as, you know, UniProt or PDB or GenBank. Anyways, getting back to the mass spec databases, I'll briefly talk about a few of them. The Golm database I mentioned before has a lot of GC-MS data. What's nice about it is that it has mass spec and retention index data. And it's done this for about 1,450 metabolites. A lot of them are plant metabolites, but many plant metabolites are also human and mammalian and fungal and bacterial metabolites as well. There are lots of spectra, and all the GC-MS data is compatible with the NIST format and the AMDIS software. So it is possible to extract and mine the Golm data and even to search it. It has a nice-looking website, and if you want to explore, I'd encourage you to do so later this evening. You can do searches with it. The tools and facilities are there. And this is an example of some of the things that you can get from that. METLIN is the database that you guys will have interacted with, especially if you've gone to that site that is in the public share, or if you were successful in getting your data processed.
So METLIN has almost a quarter million compounds. And as I said, there are a lot of peptides, and then a lot of plant compounds, a lot of drugs, drug metabolites, a lot of exotic toxins, a lot of natural products. So it's very extensive. They have a lot of MS/MS spectra. These represent spectra from about 12,000 compounds, of which close to 8,500 are peptides. So that means there are only about 4,000 of what I'd call organic compounds, or non-peptide metabolites. That's still a very large collection, but it only represents maybe 5% of the metabolome, at least in terms of the human metabolome. And it's only about 1/60th of the number of metabolites that are actually currently listed in METLIN. So it underscores the point that it's very, very difficult, and will be very difficult over the coming years and decades, to actually get authentic mass spectra for all the compounds that we know about. So that's a challenge. The METLIN effort has been going on for 10 years and has had a lot of support from many, many companies and resources, which is one reason why they've been able to collect so many authentic spectra. But most of those spectra, as I said, are of peptides. And so that's sort of the very, very low-hanging fruit. And most hits that you get, or most metabolites in mixtures, are not peptides. So with METLIN, if you've got peaks that you've been working with, you can enter a mass, you can choose whether it's positive, negative, or neutral, and you can choose the type of adducts. Once you've chosen those things, you can just press submit, and it will list the hits. And you've got information about the compound, the name, and other information. You can do not only a single parent ion MS search, you can also do MS/MS searches. The data can be in mzXML, mzML, or mzData format. And so if you submit those, then you can also do an MS/MS search. So that's similar to the CFM-ID website that I mentioned before.
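The parent ion mass search described above boils down to a small piece of arithmetic: remove the adduct's mass shift from the observed m/z to get a neutral mass, then keep every compound whose monoisotopic mass lies within a ppm tolerance. The sketch below illustrates that logic only. The three-compound table and the `search_by_mass` function are hypothetical stand-ins, not METLIN's actual interface, and it assumes singly charged ions.

```python
# Mass shift (Da) that each singly charged adduct adds to the neutral mass M.
ADDUCT_SHIFTS = {
    "[M+H]+": 1.007276,    # gain of a proton
    "[M+Na]+": 22.989218,  # gain of a sodium ion
    "[M-H]-": -1.007276,   # loss of a proton
}

# Tiny illustrative compound table: name -> monoisotopic neutral mass (Da).
COMPOUND_MASSES = {
    "alanine": 89.047678,       # C3H7NO2
    "glucose": 180.063388,      # C6H12O6
    "citric acid": 192.027003,  # C6H8O7
}


def search_by_mass(mz, adduct, tolerance_ppm=10.0):
    """Return (name, error_ppm) for compounds matching the observed m/z."""
    neutral = mz - ADDUCT_SHIFTS[adduct]
    hits = []
    for name, mass in COMPOUND_MASSES.items():
        error_ppm = (neutral - mass) / mass * 1e6
        if abs(error_ppm) <= tolerance_ppm:
            hits.append((name, round(error_ppm, 2)))
    # Best (smallest absolute error) match first.
    return sorted(hits, key=lambda hit: abs(hit[1]))


if __name__ == "__main__":
    print(search_by_mass(203.0526, "[M+Na]+"))
    print(search_by_mass(191.0197, "[M-H]-"))
```

With a database of a quarter million compounds, a single mass will usually return many candidates, which is why the MS/MS search is the more discriminating of the two.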
So both METLIN and CFM-ID support this MS/MS search. METLIN searches against its database of measured standards, whereas CFM-ID searches against both measured spectra and predicted spectra. MassBank is a Japanese effort which is separate from AIST and also from METLIN. It consolidates many, many different spectra from many contributing labs, most from Japan, but a number from Europe and the US. And so it collects spectra not from a single type of spectrometer, but from QTOFs, triple quads, GC, FT-ICR, ion trap. It's got a collection of about 15,000 compounds. So in that regard, it's larger than METLIN in terms of the total number of compounds with mass spectra. But I think it's smaller than METLIN in terms of the total number of spectra, yeah. So METLIN has 67,000, MassBank has 41,000. But the compound diversity in MassBank is much greater and much more impressive. So in some respects, it's probably a more useful resource than METLIN. MassBank has a nice interface. You can browse the data and search the stats. You can also do peak searches. So again, it allows you to do both parent ion mass and MS/MS searches. So those are examples of spectral databases for mass spectrometry and for NMR. I also mentioned compound databases. So let's say you've now identified your compounds. What are they? What do they mean? Alanine, yeah, it's an amino acid, but what's it used for? So there are resources that have additional information, which might be the physical properties of the compounds, additional links, literature, sections, other items. One database is called ChEBI, the Chemical Entities of Biological Interest. There are about 45,000 compounds right now. Most of the material in ChEBI is derived from what's already there in the literature or in other databases. So they've mined KEGG, they've mined a database called LIPID MAPS, they've mined DrugBank, and they've put that stuff into ChEBI. So there's a real focus on names and synonyms and ontologies.
There's additional information about structure, and it's very searchable. PubChem is certainly the database I think all of us have heard of. So, millions and millions of compounds and substances. There are limits on how big the compounds can be. It's collected from many, many different vendors and depositors. Just like ChEBI, it has synonyms. It also has chemical property data. It has the structures and pictures, SMILES and InChI strings. Some of them have bioactivity data. They also have links to NCBI databases and PubMed. And in fact, that's quite useful if you're wanting to do some literature searches and find out more about how a compound may affect biology or disease or whatever. You can search by name, by formula, by structure, by physical properties. ChemSpider is a European version of PubChem. It's run independently. It has almost as many compounds, from a larger number of data sources. It has a lot of calculated properties. They've linked it to Wikipedia articles. They have spectra for some compounds. They have pharmacology links. They've linked it to MeSH headings as well, which I think is now also done in PubChem. So ChemSpider is trying to differentiate itself from PubChem and has, I think, a lot of really useful resources that most people may not be aware of. How many of you have heard of ChemSpider? One, two, three, four, five, six, seven, eight. Okay, so maybe half. Anyways, just so that you're aware. Ligand Expo. This is the collection of small molecules in the Protein Data Bank. The reason why this is really important is because it actually links chemicals to proteins in a physical way. If it's crystallized, you actually see that compound bound to the active site of the enzyme or transporter. And so it links chemicals and metabolites and drugs to their targets, which is really important biology. You get not just a 2D picture, you actually get the 3D structure of these molecules as well.
And it's searchable, like all the other ones, through names and formulas and InChIs. It's a small database, a few thousand compounds really, but still very useful. There are other databases, less well-known. I've mentioned a couple of them in passing, but for instance, 3DMET has the 3D structures of about 8,000 or 9,000 natural metabolites. KNApSAcK has about 50,000 plant metabolites. MyCompoundID is a resource developed at the University of Alberta, which is a database of metabolites of metabolites. I mentioned that before; these are the transformation products, the ones made in the liver and by the microflora, and it was developed by Liang Li and Guohui Lin. So there are almost 11 million compounds in this database. And when people search against it, they find a lot of hits, especially in mass spectrometry. And then there's another one called LIPID MAPS, which is maintained at UCSD, and it contains about 30,000 different lipids, covering lipids you'll find in plants, animals, microbes, and so on. And it's helped establish a lot of the nomenclature for lipids and lipidomics. I'll highlight MyCompoundID a little bit more. As I said, it takes the metabolites of metabolites. It has 76 transformations, or fragmentation or modification sets. It's taken about 8,000 metabolites from the Human Metabolome Database and transformed them based on their chemistry. That first transformation generated about 375,000 compounds. Those modified metabolites are then passed through a second phase of transformation, and that generates another 10.5 million, which is why it comes up with almost 11 million metabolites. It's actually a web server, so it's searchable. You can submit queries based on the high resolution mass spec of the parent ion mass, and it will come up with a list of possible or potential hits. It also lists all the types of transformations that it supports, and these are examples of some of the output that you can get.
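The two-pass expansion described above can be sketched as follows. This is only an illustration of the idea, not the MyCompoundID implementation: it uses three common biotransformations with their standard monoisotopic mass shifts instead of the 76 mentioned, and it applies every transformation to every compound blindly, whereas the real resource transforms compounds based on their chemistry. The function names are hypothetical.

```python
# Three common biotransformations and their monoisotopic mass shifts (Da).
# The resource described above uses 76; these are just examples.
TRANSFORMATIONS = {
    "hydroxylation": 15.994915,     # +O
    "methylation": 14.015650,       # +CH2
    "glucuronidation": 176.032088,  # +C6H8O6
}


def transform_once(compounds):
    """Apply every transformation to every compound (no chemistry checks)."""
    products = {}
    for name, mass in compounds.items():
        for label, delta in TRANSFORMATIONS.items():
            products[f"{name} + {label}"] = round(mass + delta, 6)
    return products


def find_candidates(query_mass, compounds, tolerance_da=0.005):
    """Return the names of compounds whose mass is within the tolerance."""
    return sorted(
        name for name, mass in compounds.items()
        if abs(mass - query_mass) <= tolerance_da
    )


if __name__ == "__main__":
    parents = {"alanine": 89.047678}          # monoisotopic mass of C3H7NO2
    first_pass = transform_once(parents)      # metabolites of metabolites
    second_pass = transform_once(first_pass)  # second round of transformation
    print(len(first_pass), len(second_pass))
    print(find_candidates(105.0426, first_pass))
```

Each pass multiplies the number of entries, which is how a few thousand parent metabolites can grow into millions of candidate masses after two rounds.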
So it's not unlike METLIN, but in many regards it's much more extensive than METLIN in terms of possible hits. So I mentioned that knockout mouse study, where I said the first two hits that you saw were not the metabolites and METLIN was wrong. I suspect that if you tried searching against MyCompoundID, you might find something that's more closely matched. Okay, so these are examples of different databases. There's no correct database. You've probably detected some bias I have, some of it probably because of experience and what we've learned over the years. But each can serve its own purpose, and each can be quite useful depending on the application you have. And so don't discard any; make sure you can use all of them, or are at least aware of all of them. So these are potentially useful for, again, identifying compounds and learning a little bit more about the compounds, but what do you want to know? How do you want to deal with them if you want to understand their biology? So there are a few resources out there that are useful for understanding biology and metabolism. The most famous, most widely used is the KEGG database. But there are several others that are less widely used or less well-known. How many people have heard of the Cyc databases, BioCyc or MetaCyc? One, two, three, four, five, six, seven. How many people have heard of Reactome? A few more, double hands, sleeping all over. How many have ever heard of SMPDB? That's the people in my lab. Yeah. So anyways, the point about pathway databases is that they're a really rich source of biological information. They relate the metabolites to the genes, the proteins, diseases, signaling events, and processes. So it's not just simply targeting information, it's biology. The better pathway databases provide tools to support visualization and gene and metabolite mapping. A lot of them cover many species. KEGG, literally hundreds. The Cyc databases, a few dozen. Reactome.
Initially it was primarily human, but it's expanded again to dozens of other species. So KEGG is largely considered the gold standard, and it's really a great resource. Its access has become a little more restricted, and support is a little lighter because of funding limitations. But I think most everyone here has seen a KEGG pathway, and each pathway is often linked to a little metabolite card. KEGG has about 17,000 compounds that are metabolites. It includes a separate database of about 10,000 drugs. Now, these drugs are variations, so they're the salt forms of the same drug. So in truth, there are probably only about 1,600 or 1,700 different drugs, but it always depends on how you count them. It's very rich in terms of glycan information. That's really unique to KEGG. And with all the different pathways from all the different plants, animals, microbes, and other organisms, there are something like 460 pathways. So it's very extensive. I mentioned SMPDB, the Small Molecule Pathway Database. I'm bringing this up because not that many people have heard about it. It was designed in our group, partly because we couldn't find the information we were looking for in KEGG, and we couldn't find it in MetaCyc. There seemed to be lots of missing data. So we started filling that in and creating our own pathways. And because drawing pathways is difficult, we also developed some software to facilitate that. So right now, SMPDB has 900 pathways. It has drug pathways, disease pathways, and metabolic pathways. These are all for humans. But most of the data would translate very nicely to mice and rats, usually with just a slightly different name, and probably the only pathway that's different between rats, mice, and humans is the ascorbic acid pathway. What's different in SMPDB compared to, say, KEGG and Reactome and MetaCyc is the fact that it has depictions of cell compartments, of organelles, of organs, and of protein locations in membranes or other organelles.
It also displays the quaternary structure of the proteins. So a protein is not just a dot, as it is in KEGG. It's not necessarily a 3D structure, but a depiction of whether it's a dimer, a tetramer, and so on. You can map gene chip or RNA-seq data, or metabolomic data, to these pathways. And you can convert gene, protein, and chemical lists into a pathway or disease diagnosis. So this is what a pathway looks like in SMPDB. It's a little richer, I think, than what you might see in KEGG. You can see the examples of the organs that are affected. In this case, it's showing, I guess, phenylketonuria, a disease pathway. And you can see the metabolites; they're drawn in detail. You can see the signaling processes, the positions of the metabolites and proteins, the cofactors associated with them, and the different components that are involved in the process and where they sit. Each pathway has a description, a little biography card. Each metabolite is linked to, in this case, the Human Metabolome Database. Each protein is linked to UniProt. You can map metabolites to the pathway, and so you can color them or highlight them. You can also provide concentration data, as you've measured it, and it'll be displayed in terms of the colors of the metabolites. Then you can also toggle the color off and on. These are very color-intensive, but you can turn the pathway into a black and white picture so that the colors of the metabolites are more visible. We're also modifying it so you can convert the pathways to look more like KEGG-like pathways, because people still like the KEGG wiring diagrams. And so you can flip back and forth between the two. Originally, all of these pathways had to be drawn on PowerPoint slides, and it took hours or even days, and then they had to be image mapped, which took even longer. So what we decided to do was develop some software called PathWhiz. It's online now, and it's a web server.
So you can generate pathways that look exactly like the SMPDB pathways, and it has all kinds of tools that allow you to draw these things fairly quickly. It also allows you to convert all the pathways into BioPAX, SBML, and SBGN, and you can save them as SVG or PNG files. It uses a Google Maps-type viewer, so that's a sort of sophisticated JavaScript viewer, and it allows you to interactively manipulate things. It has a pull-down menu, and you can select things. There's a video that I think is about to go online that'll explain how to do it in more detail, but you can add reactions and enzymes and elements and transport. It's sort of drag and drop, and you can adjust things with pivot points. Again, it's like using a standard drawing tool, if you've ever used drawing tools from Adobe or Paint. And you can customize things. You can also draw in organs, organelles, cofactors. You can choose your own pictures or specific organs. And once you've lined everything up the way you like it and adjusted things, then you can just render it, and there's your pathway. As I say, the pathway can be generated in a colored format or a black and white format or a KEGG-like format, and it can be saved in all the standard pathway formats. A description just came out a couple of weeks ago in Nucleic Acids Research, and you're certainly welcome to use it if you'd like. So those are the pathway databases, and what I'm going to switch to now is what I'll call the comprehensive metabolome databases. These are the ones that combine everything together. So we've talked about spectral databases, compound databases, and pathway databases. If you combine them all together, you get what I call a comprehensive metabolome database. So KEGG qualifies, sort of, as a comprehensive metabolome database. EcoCyc is another one that's quite comprehensive, very good on E. coli metabolites. HumanCyc too.
So to be a comprehensive metabolome database, it has to have at least 1,000 metabolites. Most of them are organism-specific. They have to be continuously updated, and they have to contain a combination of either chemical data and pathway data; chemical, spectral, and biological data; chemical, pathway, or spectral data; or everything. So some of them are different combinations. KEGG and EcoCyc are chemical and pathway data sets. The Human Metabolome Database would be chemical, pathway, spectral, and biological data. And MetaboLights would be chemical, spectral, and biological data. So MetaboLights is making an effort to become the GenBank for metabolomics. It's maintained out of the EBI in Chris Steinbeck's group. How many of you have heard of MetaboLights? Two, three, four, five, six, seven. How many have deposited data into MetaboLights? Okay. So at least a few of you have heard of it. Now, today, everyone will have heard about it. The point is that it is intended for upload, and in fact, there's likely going to be a requirement from many of the journals that anything you publish in the field of metabolomics will have to be uploaded either to MetaboLights or to the Metabolomics Workbench at UCSD. So it's going to be essential for publication. And the idea is that you're going to be able to upload the experimental data, the spectra, the compound lists, and, if you've got it, the biological metadata about the process and sample collection. Right now, it's the fastest growing database at the EBI, at least in terms of the amount and rate of data being uploaded, but it's also the smallest database at the EBI. Anyways, it supports a variety of search options. It's linked to ChEBI, and as I say, if you want to get spectral data, this is a great place to grab some. And it's compliant with the Metabolomics Standards Initiative, that's MSI. So it's important that people know about this, but it's also important that you guys start thinking about depositing your data there.
If we don't, as a community, not only will these databases disappear, but it'll be increasingly challenging for metabolomics to be viewed as a legitimate science. So MetaboLights is at the EBI. The other major resource for metabolomics databases is actually the University of Alberta. My lab, or my group, has developed, I guess, more than a dozen different databases. You'll hear about the Human Metabolome Database, DrugBank, the Yeast Metabolome Database. We've worked on Phenol-Explorer, which is a plant compound database, the E. coli Metabolome Database, the food database, the cow metabolome database, the Toxic Exposome Database, SMPDB, and then the serum, blood, urine, and CSF metabolome databases. A lot of these databases grew out of a project that started in 2005, which is called the Human Metabolome Project. How many people have heard of the Human Metabolome Project? Okay, that's better than it used to be. Anyways, it was sort of the project that no one had ever heard of. It was sort of the equivalent of the Human Genome Project, but it was intended to essentially identify and quantify metabolites in the human body. And a lot of that was to try and get material that had already been collected, because people have been studying humans for centuries, but a lot of it was stuck in journals and papers. And what we were tasked to do was not only try and grab that data, but also measure it ourselves, confirm it, validate it, extend it, and then make all that data freely accessible. And that's what we've been doing and have done. We also developed a lot of tools and technologies, software, analytical techniques, and protocols that helped improve metabolome coverage and throughput. The interesting thing is that when we started with the project in 2004, the literature review that we did at the time suggested there were about 690 metabolites in the human body.
So that was what was listed in KEGG, that was what was listed in HumanCyc, and from cross-checking, they were pretty much convinced they'd covered everything. So in 2006, we were pretty excited when we actually had more than 690, having identified or compiled almost 2,200 metabolites. We actually had a press conference about it. And it got coverage in Nature and Science and everything else. But later on, we discovered there were lots of missing metabolites. And so by 2009, we were up to about 6,400. By 2013, it was 37,000. Today, it's about 42,000 compounds. But really, we think there are probably well over 100,000 metabolites that are detectable in the human body. So it's grown pretty rapidly. And in fact, whereas we used to depict the human metabolome as being much smaller than the genome, it's now inverted. The metabolome is much, much larger in terms of diversity than the human genome. We've seen this picture already, although it's a little distorted for whatever reason. But again, there's a whole range of different metabolomes, from exogenous to endogenous. And these are housed in different resources: DrugBank for drugs, T3DB for pollutants and toxins, HMDB for most everything, and then FooDB, restricted largely to food metabolites. So the HMDB, as I said, has about 42,000 metabolites. We've been able to identify 170 that are really unique microbial metabolites. In reality, our microbes probably produce about 2,000 metabolites in our bodies, but 95% of them overlap with other metabolites that are already in our bodies, so we can't tell whether they're of microbial origin or not. It has a lot of data on abnormal and normal concentrations, lots and lots of disease links, and lots of spectral data that we've collected, some of which has also been mined from other resources. Because it links metabolites to genes and proteins, you can do sequence searches. Because it has spectral data, you can do spectral searches. And because there's a lot of data, it has a lot of browsing tools.
It has pathways and pathway search tools, structure searching tools, you can browse by different biofluids, and there's very extensive text searching and also full data downloads. So these are just some screenshots of what's inside the HMDB: shots of the spectral viewing tools, browsing tools for metabolites, details, pathways, data fields. There are more than a hundred different data fields. There's a variety of spectral searching tools, not really as good as what you'll find in CFM-ID, but we're working on improving those and making them a little more extensive and robust. So you can do MS as well as tandem MS searching. You can also do spectral searching by NMR, typing in peaks. You can draw a structure in there with a Java applet, and it'll look for similar or identical structures and list their level of similarity. You can also look through different biofluids and look at all the different diseases and the abnormal and normal concentrations. And it was really designed for clinical applications. What's actually much more popular than HMDB is the DrugBank database. And the reason why it's much more popular is because it's the first database to link drugs to their targets. So in fact, it's used a lot in drug discovery and drug development now. It's a small database; there are only about 1,500 or 1,600 compounds in it, at least in terms of drugs. There are a lot of drugs under development that are also listed in the database. But it has a lot more biology, and this is because people have studied drugs to death. So it has a lot of stuff about absorption, distribution, metabolism, toxicity, mechanism of action, pharmacokinetics. It has a lot of drug metabolites, a lot of transport data, and lots and lots of data on drug-drug and drug-food interactions. But it also allows you to do all the sequence and spectral searches that HMDB has.
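As a rough illustration of the peak-based NMR searching mentioned above, here is a minimal sketch of how typed-in peaks could be matched against a reference library. This is not how HMDB does it, since the talk doesn't describe the actual algorithm. The function names, the 0.02 ppm tolerance, and the library entries (`compound_A` and so on, with placeholder peak positions) are all assumptions made up for the example.

```python
from typing import Dict, List, Tuple


def score_peak_match(query: List[float], reference: List[float], tol: float = 0.02) -> float:
    """Fraction of reference peaks that have a query peak within +/- tol ppm."""
    if not reference:
        return 0.0
    matched = sum(1 for r in reference if any(abs(r - q) <= tol for q in query))
    return matched / len(reference)


def search_library(
    query: List[float], library: Dict[str, List[float]], tol: float = 0.02
) -> List[Tuple[str, float]]:
    """Score every library entry and return (name, score), best match first."""
    scores = [(name, score_peak_match(query, peaks, tol)) for name, peaks in library.items()]
    # Sort by descending score, then by name so ties come out in a stable order.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


# Hypothetical reference library: placeholder chemical shifts in ppm, not real data.
library = {
    "compound_A": [1.20, 3.50, 7.10],
    "compound_B": [1.20, 2.40],
    "compound_C": [5.00, 6.00],
}

# Peaks a user might type into the search box (also placeholders).
query = [1.21, 3.49, 7.10]

hits = search_library(query, library)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

A real search tool would also have to deal with things like peak intensities, multiplets, and shifts that move with pH, which this sketch ignores.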
So it too has a lot of views, a lot of images, and it's formatted very similarly to how HMDB is formatted. You can look through drugs by categories, you can draw structures, you can search drug targets by sequence, and then you can extract data through the data extractor, which is essentially a very sophisticated MySQL query system which allows you to put together queries without knowing SQL. Then there's the Toxic Exposome Database, and this is the one, as I say, which is trying to address this emerging area of environmental metabolomics, or the exposome. 92% of all deaths in North America can be attributed to exposures to either small molecules or microbes. That's, I think, quite striking, because most of us are inculcated with the idea that we die from genetic failures or something like that. And this is not new. It's just that no one's really talked about it, but there are many papers that have highlighted this, and this is what epidemiologists always bring up. And so being aware of these compounds that ultimately kill us is, I think, useful to know, and also useful to try and understand so we're not exposed to them. And it's not like it's Agent Orange or cyanide that's doing it. In fact, many of these compounds that are in the Toxic Exposome Database are produced in your body. These are things like uremic toxins and oncometabolites that have rather significant roles in the development of disease. So this database is structured very much like DrugBank and has a lot of the same type of information, but obviously not everything in it is about drugs. But the way that toxicologists speak is very much the way that pharmacists and pharmacologists speak. Foods: we eat other metabolomes, and so trying to capture that information is something we've been working on for a long time. 10 very years here has been working on it longer than he wishes to, but it's probably still going to be worked on for the next few months. So a beta release is out there.
People can look at it. It's still being added to, but it's one that covers a lot of information about what we eat, including some of the food additives that are there, and information about flavor, aroma, and color, as well as effects on human health. So there's lots of data on humans, but obviously people study model organisms. And so we've been trying to extend this to other model organisms like yeast. So there's a Yeast Metabolome Database. It's also relevant because yeast is used in a lot of food production, whether it's wine or beer. So it was quite interesting to explore that. We enjoyed trying lots of different wines under the excuse of measuring the yeast metabolome, but anyways, it's fairly extensive and covers a lot. Another one is obviously E. coli, so that's another model organism. And in this case, it's been studied much more extensively than yeast, and we're actually doing another update on it in the next couple of months. So this database, when it's done by the end of summer, will actually have pathways for every single metabolite. That represents hundreds, if not thousands, of pathways that are being prepared and drawn. Right now we have 125, but as I say, this will have pathways for every single metabolite, which will make it the first organism to actually have that done. So you can look at different databases and you can compare them based on what they have. Some are more extensive than others. Some were developed for specific needs or niches. This is not intended to be exhaustive or to cover all databases, but it's useful to look at what's there and to be aware of what's there. So I think we're coming up to the end of our time. I don't know if people have any questions with respect to what I've covered so far in terms of databases or how to help interpret your data.
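To make the idea of comparing databases by what they contain a bit more concrete, here is a small sketch that tabulates the data-type coverage stated earlier in the talk for KEGG, EcoCyc, HMDB, and MetaboLights, and checks the "comprehensive metabolome database" criteria. The function names are made up for this example, and reading "a combination" as "chemical data plus at least one other data type" is an interpretation, not something stated outright. The only metabolite count used is the roughly 42,000 quoted for HMDB; the small example database is hypothetical.

```python
DATA_TYPES = ["chemical", "pathway", "spectral", "biological"]

# Coverage as described in the talk; nothing here comes from the databases themselves.
coverage = {
    "KEGG": {"chemical", "pathway"},
    "EcoCyc": {"chemical", "pathway"},
    "HMDB": {"chemical", "pathway", "spectral", "biological"},
    "MetaboLights": {"chemical", "spectral", "biological"},
}


def is_comprehensive(n_metabolites: int, continuously_updated: bool, data_types: set) -> bool:
    """Check the three criteria from the talk (assumed reading of 'combination':
    chemical data plus at least one other data type)."""
    other_types = (data_types & set(DATA_TYPES)) - {"chemical"}
    has_combination = "chemical" in data_types and len(other_types) >= 1
    return n_metabolites >= 1000 and continuously_updated and has_combination


def coverage_table(cov: dict) -> str:
    """Render a plain-text table with one row per database, one column per data type."""
    header = "database".ljust(14) + "".join(t.ljust(12) for t in DATA_TYPES)
    rows = [header]
    for name in sorted(cov):
        cells = "".join(("yes" if t in cov[name] else "-").ljust(12) for t in DATA_TYPES)
        rows.append(name.ljust(14) + cells)
    return "\n".join(rows)


print(coverage_table(coverage))

# HMDB with the ~42,000 compounds quoted in the talk.
print(is_comprehensive(42000, True, coverage["HMDB"]))
# A hypothetical small database with 500 metabolites does not qualify.
print(is_comprehensive(500, True, {"chemical", "pathway"}))

# Data types that HMDB and MetaboLights both cover, per the talk.
shared = coverage["HMDB"] & coverage["MetaboLights"]
```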