Okay, so just with respect to what you guys have done: this last lab was essentially a taste or a flavor of what's going on with... actually, could you guys be quiet? So the idea was for you guys to have an opportunity to use different metabolomics software tools. We probably would have liked to have had you try the XCMS system, because that gives you an opportunity to try what most of you probably do, which is untargeted LC-MS. But what we really wanted to show as well is that for many of you there are alternative techniques, NMR and GC-MS, which are quite powerful, quite automatable and quite fast. Now, some of the difficulties you guys have had are because we just hit these servers probably harder than they've ever been hit before, with 25 or so of you going at once. But on a normal day, at a normal time, these things would typically respond in a minute or less and get you really, I think, useful data, and it's quantitative data.
And one of the points I really want to make here, and if this is the only take-home message you get, it's this: whether you're looking at humans or rats or mice or plants or microbes, within a given species or even within a given genus or phylum, metabolism is very, very highly conserved. It's unlikely that you are going to find completely novel compounds as part of a metabolomics experiment. Where things differ is primarily in their concentrations. And that's okay if you're just measuring differences in concentrations between cohort A and cohort B, or sample X and sample Y. Fundamentally, most medical diagnoses, as an example, are made on the basis of concentrations. It's not a question of whether the compound is here and not seen elsewhere. In fact, the compound is always there. It's just, is it way too high or way too low? So you can think of it as a ternary index: high, medium (or average), or low.
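A minimal sketch of that three-level index, in Python. The metabolite names, reference ranges and sample values are hypothetical placeholders (arbitrary units), not real clinical values. The sketch also counts how many distinct high/normal/low profiles a panel of n metabolites can take, which is 3 to the power n.

```python
def classify(concentration, low, high):
    """Place a measured concentration on a three-level index
    relative to a reference range [low, high]."""
    if concentration < low:
        return "low"
    if concentration > high:
        return "high"
    return "normal"


def count_profiles(n_metabolites, levels=3):
    """Number of distinct profiles when each metabolite takes one of `levels` states."""
    return levels ** n_metabolites


# Hypothetical reference ranges and measurements (arbitrary units, illustration only).
reference_ranges = {
    "metabolite_A": (10.0, 50.0),
    "metabolite_B": (1.0, 5.0),
    "metabolite_C": (100.0, 300.0),
}
sample = {"metabolite_A": 72.0, "metabolite_B": 3.2, "metabolite_C": 40.0}

# One label per metabolite: this dictionary is the qualitative "profile".
profile = {
    name: classify(sample[name], *reference_ranges[name])
    for name in reference_ranges
}
print(profile)
print(count_profiles(100))  # a 48-digit integer
```

The point of the design is that the classification only works when reference ranges exist, which is why quantitative data matters so much here.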
So just in that qualitative state, if you're measuring 100 metabolites, you have three to the power of 100 possibilities to define a phenotype, a disease, a condition, whatever. That's a big, big number, roughly 5 × 10^47. So only working with 100 metabolites, and knowing whether each one is too high, too low or average, gives you tremendous diagnostic, prognostic and phenotypic power. Now, in many other fields, we don't have an idea of what high, medium and low is. But in metabolomics, certainly with humans, for many mammals, even for a lot of plants and for many microbes, we do. We know what those values are. So this gets back to the point that if we've got good quantitation, if we focus on getting those concentrations, you can do a lot. You can publish a lot. You can patent a lot. You can explain a lot. And all of those things are what we try to do, either in science or in industry. So even though you're shrinking down to the smallish number of compounds that GC-MS and NMR provide, it still gives you an awful lot of information.
Now, with LC-MS, obviously, the advantage and the appeal is that you're measuring potentially thousands of features, but you're not getting absolute concentrations, and in many cases we don't know what the features are. So, as I say, be happy if you can get some concentrations, strive to get some concentration data, and don't be afraid to just work on targeted studies.
Now, the other thing we wanted to highlight in this exercise was the idea of complementarity: using three platforms instead of one platform is helpful, and what you get in one platform can inform what you should see in another platform. So if you're seeing a whole bunch of compounds by NMR, you should probably see a lot of the same compounds by GC-MS. And if you aren't, then there's something wrong with your library, or something wrong with your method. Same sort of thing the other way.
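As a rough sketch of that cross-platform check, the compound lists from two platforms can be compared as sets. The compound lists below are made up for illustration, and `compare_platforms` is a hypothetical helper, not part of any of the tools used in the lab.

```python
def compare_platforms(first, second):
    """Compare two lists of identified compound names.

    Names are lower-cased and stripped so trivial formatting
    differences do not count as disagreements.
    """
    a = {name.strip().lower() for name in first}
    b = {name.strip().lower() for name in second}
    return {
        "shared": sorted(a & b),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
    }


# Hypothetical identifications from the same sample on two platforms.
nmr_hits = ["Alanine", "Lactate", "Glucose", "Citrate"]
gcms_hits = ["alanine", "lactate", "glucose", "succinate"]

report = compare_platforms(nmr_hits, gcms_hits)
print(report)
```

Anything that lands in `only_first` or `only_second` is a prompt to go back and check the library, the sample preparation, or the method, rather than proof that something is wrong.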
Say you've got very highly abundant compounds in an LC-MS study, but you can't see them by NMR, or you're seeing things by NMR, but you're not seeing them in LC-MS. Something's wrong with your library or protocol. And so these are things where, by looking at what you get from multiple platforms, you can, A, understand shortcomings with your sample preparation, and B, also get broader coverage, because there are some things that you won't be able to see just because of the chromatography step or the isolation step or the separation step. But if you're not doing any chromatography or any separation, you know, it's direct injection mass spec or direct injection NMR or direct injection GC-MS, you should see these same sorts of things. So that's another thing to remember and another point.
The third thing I wanted to bring up with this exercise was the possibility of automation. So you guys are trying some of the very first automated tools for metabolomics. Now, they weren't perfect, unfortunately, but they're getting better. And I think you're going to see more of this happening. And if there are some of you who like coding or want to do that, this is something you could think about. How do I make it better? How do I make it faster? What are some of the other things? Because every field in analytical chemistry, or even general biochemistry and microbiology, trends towards making things simpler, faster and easier. And with computers, now you can say automatic. So as scientists, we tend to revel in the idea of, you know, I want to spend six months of my life figuring out compounds and processing data. And it's fun, at least the first time. If you have to do it two and three and four more times, then it gets really, really boring. And so the idea here is that if we can automate it, then you can quickly move through to the next phase, which is what we're going to talk about mostly tomorrow. And that's why we wanted you guys to at least run through the exercise to get your data sets.
And later this evening or afternoon, you'll be able to try and get a more complete set. We'll have other data sets that you can use to try out MetaboAnalyst. Now, Ann mentioned the MetaboAnalyst material on GitHub. I think it's about 200 pages, and this is a very, very complete tutorial. So you can download it. But my own recommendation might be to have about five printed copies that people could split and share. It's a lot to print. It will eventually show up in a Current Protocols chapter, so you guys have a special preview of the full set. Jeff spent a couple of months of his life and I spent about a month of my life just trying to put that together. So I think you'll find it quite useful for tomorrow's application. So perhaps you could take a look through it, make a decision, and let Ann know if you guys would like a printed copy.
Okay. So that's the preamble. We're going to dive into databases. And this is important. We've seen software. Now we're going to talk about databases, and often the two are integrated in metabolomics. We're going to talk about how databases have evolved in the field of metabolomics and in bioinformatics. I'm going to look at some of the different types of metabolomic databases. I've talked a little bit about them already. We'll talk about some of the different NMR and MS databases. We'll also talk about pathway databases and another thing we call comprehensive metabolomic databases, which are a little different.
So historically, the first informatics field to emerge in biochemistry, if you want, was actually cheminformatics. It's a much older field than bioinformatics, and for a long time the two basically moved in separate universes. In the case of cheminformatics, the first software tools started appearing in the 1960s. A lot of it was designed specifically for organic chemists. And in the 60s the model for software was a for-profit model.
Companies and organizations, CAS and the American Chemical Society as an example, were established and made hundreds of millions of dollars by developing software and/or databases that they sold to pharmaceutical companies, chemical manufacturers, libraries and so on. There are still large companies, MDL, Beilstein, Sigma, that were established purely for cheminformatics and databases. Bioinformatics, which is where you guys have mostly come from, really started in the late 80s and early 90s. It was designed for molecular biologists, and it emerged at the time when the internet was emerging. So people pushed for a web-based model with open-access, free software. There still are some, but in the early days there were quite a few bioinformatics companies. They all went belly up, largely because of this free model and largely because your tax dollars have been subsidizing this, whether it's through NCBI or EBI or NIH. Interestingly, Canada puts nothing into it, and we largely benefit from the largesse and generosity of American or European taxpayers. GenBank is permanently funded by an act of Congress in the US, so it will only disappear if the US disappears. Most other databases are funded grant to grant. So one of the earliest databases, the PDB, the Protein Data Bank, has to apply for funding basically every year. They have teams of people applying for dozens of grants just to keep it going for you. So it's insane, actually, that something that important is basically hanging on a threadbare rope.
So what do you do with databases? Really, databases evolved to help consolidate data, and when they became computer databases they helped with linking. With linkage comes faster retrieval and faster querying. Many databases are also the home for reference values, reference images, reference data and reference sequences. That's the case with GenBank and PDB, and with some of the clinical chemistry databases. People also use a lot of these databases to train algorithms.
It's how the spectral tools and spectral prediction tools are trained. It's how chemical comparison algorithms are trained. It's how many other things in cheminformatics are developed. Similarity searching is a newer phenomenon, because before it used to be exact matching, but now it's similarity. So this is how you can look for similar images on Google Images, and you can look for similar structures, and similar sequences with BLAST or other tools, and now you've got spectral similarity searches, and you guys saw how that works. And then obviously through databases we can do things like prediction, because we have been able to train algorithms. So databases are fundamental. In that sort of pyramid of life, databases lie at the base, and everything above that relates to algorithms that use those databases. So they're extremely important, and they're also hard to make.
Typically databases start off as a hobby, because someone realizes this isn't working, I'll gather a little set and see if it helps the field. That's how we started DrugBank. It was a hobby project for a summer student. It's how we started HMDB, sort of a hobby project. First versions are typically flat files, kept in Excel spreadsheets, but then you start realizing people want the data, so you start making it a little more accessible. You might make it a relational database so people can query it, start archiving it, start putting it on the web, and you start adding more information because people want it. Most databases end there. There are a few others, GenBank and PDB being examples, which are archival, open-deposition databases. These literally have to have teams of hundreds of people who maintain and support the archival process. So, MetaboLights: has anyone ever deposited data in MetaboLights? Anyone? Okay. Well, we'll have to have you guys do all of that. That's the current metabolomics deposition database. Has anyone deposited data in GenBank? Okay. Great. Any data in PDB?
Okay. So these are, again, archival databases that require literally millions of dollars a year just to support that deposition process. So as you go down through the styles of database, you increase the size of the community. The open archival databases are the most heavily used, and the curated ones are somewhat less used, but that's where most databases end up. As you evolve from the hobby database to the fully open-access archival database, you have to push towards standardization. You have to have increased automation. You have to improve the querying capabilities, and then you also have to spend more money on curators to help ensure the data is high quality.
So we've seen this slide before, and this was the issue of how you proceed from the raw data to get actual answers. You guys saw examples of tools like Bayesil and GC-AutoFit, which are maybe the equivalent of Mascot or some of the other tools, but underlying every one of those tools is a database. Underlying Bayesil is a collection of 200 spectra. Underlying GC-AutoFit is a collection of 150 or 200 spectra. Any deconvolution software, any kind of semi-automatic or automatic tool, will depend on some kind of database.
One of the challenges when we started in the field of metabolomics, before it had a name, which was, I guess, around 1999 or 2000, was that most of the data was in textbooks, and most of the data still is. Certainly in print journals, people have been doing analytical chemistry of biofluids for 100 years now. There's 75 years of classic biochemistry, which is incredibly rich but completely ignored, because you can't find it on PubMed. In terms of moving that data into electronic records and into databases, we are still about 20 years behind proteomics and genomics. The other challenge that I think is not appreciated is that metabolomics, in fact, has a role in many different communities.
Some of you, most of you, might label yourselves as metabolomics researchers, but there are also people who are pure analytical chemists, and they just want to improve the analytical technology. There are people who are plant chemists and natural product chemists, and they're very interested in natural product characterization. Drug researchers have a totally different perspective on metabolites and metabolism and the types of information they want to see in a database. Clinical chemists sometimes have a totally opposite view. There are people who use different techniques. People who do NMR want NMR spectra, not mass spectra. Others who are MS specialists want MS data, but not NMR. All of these communities have very, very different demands. Within the bioinformatics community, you also have people who want to see it structured in very different ways, with very different required standards. So you get pulled in many directions.
The result is that there are many types of databases in metabolomics. There are the spectral databases: the NMR spectral databases, and the MS ones, MS/MS, tandem MS, EI-MS. There are compound databases, some of which we've already talked about. Then, bringing a lot of this together, are the pathway databases, which connect chemicals to genes and to proteins and to physiology and to the rest of biology. And then there are what we call the comprehensive metabolomic databases, which try to combine all of the elements of the spectral databases with the pathway databases, with the compound databases, with sort of encyclopedias. Those are, I guess, a rarer breed, but we'll talk about those.
So I'll talk first about the NMR spectral databases. There are several. Bayesil is one tool that has its own spectral database, but there are others. One was established in Japan: SDBS, the spectral database from Japan. There's another one called NMRShiftDB.
And then the University of Wisconsin in Madison maintains two different NMR spectral databases. So SDBS has been around forever. It's produced by the Japanese institute for standards, AIST. It has quite a number of spectra, MS spectra as well as NMR spectra as well as FTIR, and there's a variety of search tools. It's somewhat like the NIST database, which you can buy. Most of the compounds aren't metabolites. These are essentially hazardous compounds or items of industrial interest. Obviously some can end up in humans, but they're just not likely to be found in most metabolomic studies. There's also a limit on what you can download, so it's somewhat less accessible than other databases.
The BioMagResBank was originally established to archive NMR data from proteins and other macromolecules, but a number of years ago they shifted to include metabolites. And in fact, that became their most popular download. So it's sort of like the PDB, which was established for proteins: they started to put in small molecules and suddenly everyone uses the PDB. This is essentially what happened with the BioMagResBank. Right now they have about 900 metabolites, and they have collected quite extensive NMR spectra. They're linked and described in fairly detailed ways. Most of the metabolites they collected were for a plant, Arabidopsis. They have completed the assignments, and they have also converted them to a format called NMR-STAR, and they're pushing it towards, I think, nmrML. The assignments, I think, are quite valuable. It's a lot of work doing assignments.
Another database, which was originally started by Chris Steinbeck, who is heading the metabolomics initiative at EBI, is called NMRShiftDB. He started it and then moved on, but the database itself has continued and now has over 50,000 spectra from 40,000-plus structures.
So it's quite a bit larger than the BioMagResBank, but again, probably only 1 to 2% of these compounds are actually metabolites that will ever be found in a natural environment. The other 98% have never left the lab. NMRShiftDB also has a tool that allows you to predict NMR spectra, so you can actually upload a compound and it will predict its NMR spectrum for free. The other issue with NMRShiftDB is that most of these spectra were collected in chloroform or deuterated chloroform, whereas most metabolites of interest are dissolved in water. The NMR chemical shifts for things in water are very, very different from those in chloroform. So what you would get from NMRShiftDB, you wouldn't be able to use in most NMR metabolomics experiments. I mention it partly because it still is a large resource, and it is a resource that helps people with assignments, and because it has that chemical shift prediction tool it potentially gives you some guidance.
MMCD is a database again developed at Wisconsin. They've collected a lot of the NMR spectra that they had from the BioMagResBank, and they also supplemented it with a lot of mass spec data. So there are about 2,000 metabolites with mass spec data in this database. It's not well known, but it is a resource that you can potentially use to retrieve data or to supplement or create your own library. You can search by peaks and chemical shifts. You can look up chemical formulas, names and synonyms. There's information about the species where the metabolite was found, plus references and links to images.
So those are the NMR databases. We don't have too many NMR people here, but hopefully after today you guys might be convinced that NMR is actually a worthwhile tool. For mass spec, again, I think we've already talked about NIST. We've talked about METLIN, MassBank and the Golm database. The list is growing, and in fact MoNA has become a very popular resource. The Golm Metabolome Database is one of the original metabolomics mass spectral databases.
It includes MS data as well as retention index data. It also includes mass spectral tags, or MSTs. So they have both identified metabolites and unidentified but consistently observed compounds for which they have the spectra. Right now there are about 1,500 identified metabolites. The spectra are compatible with NIST and AMDIS, so you can load them into your NIST and AMDIS software. The focus with Golm, and with the people who developed it, has been primarily on plant metabolites. This is just a screenshot of the Golm database and the website. A really obscure web address. Anyway, you can search through it, and they are making periodic improvements to the database. In the early days it was really hard to work with, but I think they've gotten much better tools.
One of the things to remind you about, and I talked about this a little bit before, is the current state of spectral databases, especially for LC-MS. This lists, as of I guess about eight months ago, the number of spectra that were in these different databases. So I've mentioned MoNA, and you guys know about METLIN. There are several others: mzCloud, and I've mentioned NIST and MassBank. There are the commercial ones, like Wiley, and there's another couple of free ones like ReSpect and GNPS. So when you look at the number of spectra, it's really impressive. mzCloud has 400,000 spectra. MoNA had 236,000, and NIST 230,000 to 250,000 spectra. Then when you look at the number of compounds, it's not so impressive. What's basically happening is that there are 20 different models of mass spectrometer out there, and everyone's collecting the same mass spectrum on the same compounds and reporting it 20 different times, and sometimes at three or four different collision energies. So it's easy to go from one compound to about 30 to 50 spectra for that same compound. The net result is that when you try to merge all the compounds that are really in these databases, there are about 20,000, even though it adds up to about a million spectra.
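The redundancy described here can be sketched in a few lines of Python. The toy records and the `compound_key` field are assumptions for illustration (in practice a structure identifier would be needed to merge across databases, and the lecture does not say which one was used). The last line plugs in the round numbers quoted above.

```python
from collections import Counter


def redundancy(records):
    """Summarize spectral redundancy.

    `records` is an iterable of (database, compound_key) pairs,
    one pair per spectrum. Returns (spectra, compounds, fold).
    """
    spectra_per_compound = Counter(key for _database, key in records)
    n_spectra = sum(spectra_per_compound.values())
    n_compounds = len(spectra_per_compound)
    return n_spectra, n_compounds, n_spectra / n_compounds


# Toy example: six spectra spread over three databases, but only two compounds.
records = [
    ("db1", "compound-A"), ("db1", "compound-A"), ("db2", "compound-A"),
    ("db2", "compound-B"), ("db3", "compound-B"), ("db3", "compound-A"),
]
n_spectra, n_compounds, fold = redundancy(records)
print(n_spectra, n_compounds, fold)

# Round numbers from the lecture: about a million spectra, about 20,000 compounds.
lecture_fold = 1_000_000 / 20_000
print(lecture_fold)
```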
So there's essentially a 50-fold redundancy in spectra. And when you think about the number of unique compounds, therefore, it's really tiny. And many of these things are all over the place. So, you know, there are natural products, most of which you won't find in mammals but you might find in plants. Lots of poisons, something you won't find in plants but you might find in certain mammals. A large number of compounds, for instance in METLIN, are peptides: 8,000 of them. So the true number of small molecules in METLIN is about 4,000 or 5,000. For MoNA, they list 69,000, but 60,000 of those are lipids, which have theoretical spectra. Now METLIN has updated, and they've just started grabbing everything in terms of compounds. So the latest version lists something like 240,000 compounds from every organism, every system, every industry. So it's a large collection of chemicals, but it's not organized according to species. As for spectra, they have collected high-resolution MS/MS on about 13,000 compounds. METLIN still has some really high-quality Q-TOF spectra, and it still probably is the go-to place to get reference MS spectra. And it allows you to do a lot of searches. You can enter masses, positive, negative or neutral charge, and you can select which types of metabolites and which types of adducts you want, or want to avoid. With these sorts of searches, you can plug away and it will come out with lists of compounds that match to different degrees. You can also do MS/MS searches, as well as the simple parent ion mass searches. It's compatible with many different formats, from mzXML to mzData.
MassBank is another resource, which was originally started in Japan. Funding for that was pulled. MassBank then moved to Europe and is maintained in Switzerland as a sort of branch of the NORMAN group. And then it moved to UC Davis as MassBank of North America, or MoNA.
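The simple parent ion mass search described above for METLIN can be sketched roughly as follows. This is not METLIN's code or API: the three-compound library, the function name `search` and the 10 ppm default are assumptions for illustration. The adduct mass shifts and monoisotopic masses are standard values, rounded.

```python
# Mass shift added to the neutral monoisotopic mass to give the observed m/z
# (singly charged ions only).
ADDUCT_SHIFT = {
    "[M+H]+": 1.007276,
    "[M+Na]+": 22.989218,
    "[M-H]-": -1.007276,
}

# Tiny illustrative library: name -> neutral monoisotopic mass in Da.
LIBRARY = {
    "alanine": 89.04768,
    "glucose": 180.06339,
    "citric acid": 192.02700,
}


def search(mz, adducts, ppm=10.0):
    """Return (name, adduct, ppm_error) for library compounds whose neutral
    mass matches the observed m/z under any of the allowed adducts."""
    hits = []
    for adduct in adducts:
        neutral = mz - ADDUCT_SHIFT[adduct]  # undo the adduct
        for name, mass in LIBRARY.items():
            # Error is computed on the neutral mass here; computing it on the
            # m/z is the other common convention and gives similar values.
            error = (neutral - mass) / mass * 1e6
            if abs(error) <= ppm:
                hits.append((name, adduct, round(error, 1)))
    return sorted(hits, key=lambda hit: abs(hit[2]))


# An observed positive-mode ion, searched with two allowed adducts.
hits = search(203.0526, ["[M+H]+", "[M+Na]+"])
print(hits)
```

Restricting the adduct list is what the "want, or want to avoid" option amounts to: each extra adduct you allow widens the net and adds candidate matches.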
The MassBank concept is great, and in fact they did an amazing job of collecting mass spectra from a large number of different compounds, again not all metabolites, consolidated from many different countries. And it was really unfortunate when the funding was essentially pulled. So the original database is online, but it's not being updated. It has a nice web interface. It allows you to do a variety of peak searches. You can retrieve spectral data from it. A lot of that concept is now in MoNA and also in the European version of MassBank.
So those are examples of spectral databases. I haven't covered every one that there is, and there are some that are emerging. But I think there are also caveats: some are simply compound databases, while others are truly designed to be metabolite databases. Some contain excellent search tools, some do not. So the quality varies a fair bit. But I've tried to highlight, I think, some of the better ones.
Compound databases. We've talked about PubChem. We've talked about ChEBI. We've talked about ChemSpider. Ligand Expo is what the Protein Data Bank has spun off in terms of its small molecules. How many people use ChEBI? Okay, not many. Anyway, it's up to about 45,000 compounds, and it's growing by maybe a thousand compounds every two months or so. A lot of the compounds are pulled from other databases, so they got a lot from KEGG and from LIPID MAPS and DrugBank. The original focus of ChEBI was on ontology and naming, but they've expanded it to include Wikipedia entries to describe things. It's searchable by name and formula. It is about compounds that are biologically interesting, so it covers everything, every domain of life science. It could be in the area of toxicology, or exotic strange mushroom compounds, or compounds from strange sponges and things like that. All those compounds will end up there. Again, it's not organism specific or domain specific, but it does support searches.
PubChem had a massive update in December, and I probably should have updated this slide, but it's up to 83 million compounds and I think about 120 million substances. There's a restriction that a compound has to have fewer than a thousand atoms. They've taken data from many, many different groups and consolidated it. The database that's visible is apparently only a fraction of all the data they actually have. They spend a lot of their time negotiating to get some of the data released. How many people have looked at PubChem? Almost everyone. It is, I think, a pretty amazing resource in terms of finally releasing a lot of the data that had been kept hidden by CAS and other commercial organizations. This has really helped a lot. They've done a lot more linking over the last couple of years, so that it links nicely into PubMed and links to other databases on bioactivity. The trend is to try to basically be the database that swallows every other database in the NCBI. They probably will do it, in large part because I think the higher-ups realize how important chemistry is and how rapidly and widely accessed PubChem is. It's going through the roof in terms of access compared to a lot of other databases NCBI has.
ChemSpider: how many people have used or heard of ChemSpider? This was for a long time maintained by a fellow named Anthony Williams. He left a year ago to take over work on cheminformatics for the EPA. ChemSpider is maintained by the Royal Society of Chemistry, and there are things that are visible and things that are not visible or not downloadable. It was sort of the issues with the Royal Society telling people what they could and couldn't use that led to Tony Williams leaving. As a result, I think it lost a lot of the wind from its sails, but it's still there. It's still got a lot of information and it's still, I think, quite useful. Like PubChem, probably only one to two percent of the compounds in it have ever left the lab.
So it's obviously not the best resource for searching for metabolites. Has anyone heard of Ligand Expo? If someone had, that would have been a first, because I think most people don't know about this resource, but it is very biologically relevant. Every small molecule bound to a protein in a structure that they have solved by NMR or crystallography is in Ligand Expo. Therefore, every one of these compounds is biologically relevant. Now, not all of them are going to be found in humans. Some of them represent experimental drugs that have really never left the lab, but there's a lot of biological information, and this is where the small molecule is connected to the protein, and this is fundamental. How do we integrate proteomics with metabolomics? Well, it's right here. It's in Ligand Expo, and unfortunately it's not mined. Obviously almost no one knows about it, but clearly it is a useful resource that would really help. The other thing is that it gives you the three-dimensional structure of these compounds. Most of us are used to looking at two-dimensional structures, and the three-dimensional structure has a lot to do with how a compound looks by NMR or how it fragments by MS, how it binds to proteins, and its role or function.
There are other kinds of compound databases, some of which you may have heard of, some of which you probably haven't. So how many have heard of 3DMET? How many have heard of KNApSAcK? Only the plant person. How many people have heard of MyCompoundID? All the folks from TMIC. And then how many people have heard of LIPID MAPS? A few more. So 3DMET is a collection of 3D structures of about 8,000 or 9,000 natural metabolites. So a smallish number, but it's still a collection of natural products, so these are useful molecules. KNApSAcK is an amazing resource. It's a collection of plant metabolites that are linked to species. And again, in metabolomics, we really need that information. You know, why hunt for compounds that cannot or never will appear in an organism?
Rather, you should be looking for compounds that should be in that organism. And so that linkage to species information is actually what makes KNApSAcK unique and really valuable. MyCompoundID is something that I'll explain in a moment, but if you're hunting for unknowns, this is the place to go. And LIPID MAPS was established at UCSD. It's about 10 years old now, and it's really the major hub for lipids. It's no longer expanding. It's been sort of static, but it covered a lot, including a lot of the nomenclature, and it essentially established the nomenclature for lipids that everyone is now going to be using.
So MyCompoundID is sort of multiple resources all piled into one. It was published, I think, in 2011 initially, and then there have been subsequent publications and updates for it. The neat thing about MyCompoundID is that it takes metabolites and asks what sort of transformations they can go through. It identified about 75 transformations or modifications that can occur. And what it did was take those modifications and apply them to the roughly 8,000 metabolites that were in HMDB at the time. So it did this first-pass metabolism with the 76 transformations, starting from 8,000 metabolites. And out of that, because of the different modification sites, some compounds have multiple sites and things like that, it generated almost 400,000 of what I guess you'd call molecular formulas. Then it did a second-pass metabolism, starting with the 375,000 and using the same 76 transformations. So you've got almost 11 million molecular formulas. Now, MyCompoundID doesn't generate structures. It generates m/z values or molecular weights. But that's still adequate for doing mass searches. So it can at least give you: here's the parent compound, and here's modification X and modification Y. You don't have an actual structure, so this is suggestive of what combination is likely there. So the website is pretty simple: MyCompoundID.org.
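The two-pass idea can be illustrated with a small sketch. This is not MyCompoundID's implementation or its reaction list: the three transformations below are just common examples with standard mass shifts, the single parent compound is arbitrary, and `expand` and `lookup` are hypothetical names. Like the real resource, the sketch tracks masses only, not structures.

```python
from itertools import product

# Illustrative transformations and their monoisotopic mass shifts in Da.
# "none" lets a pass leave the compound unchanged.
TRANSFORMS = {
    "none": 0.0,
    "+O (hydroxylation)": 15.994915,
    "+CH2 (methylation)": 14.015650,
    "+C2H2O (acetylation)": 42.010565,
}


def expand(parents, passes=2):
    """Apply up to `passes` transformations to every parent mass.

    Returns {mass rounded to 4 decimals: (parent, step1, step2, ...)}.
    Different orderings that give the same mass are stored once,
    which is why the table grows more slowly than the raw combinations.
    """
    table = {}
    for name, mass in parents.items():
        for steps in product(TRANSFORMS, repeat=passes):
            total = mass + sum(TRANSFORMS[step] for step in steps)
            table.setdefault(round(total, 4), (name,) + steps)
    return table


def lookup(mass, table, ppm=10.0):
    """Return table entries within `ppm` of a measured neutral mass."""
    return [
        (m, path) for m, path in sorted(table.items())
        if abs(m - mass) / m * 1e6 <= ppm
    ]


table = expand({"alanine": 89.04768})
matches = lookup(119.05824, table)
print(len(table))
print(matches)
```

A match here says only "parent plus these modifications has the right mass", which is why the lecture treats such hits as suggestive and in need of confirmation.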
And the offerings on it are increasing almost monthly in terms of the things that you can do. This is just an example of what you can select and choose: the sorts of searches, the types of reactions that are allowed and supported, and these are the compounds. The interesting thing is that when people have used MyCompoundID, they can go from an identification rate of 5% or 10% among features up to 50%. Now, it's only at the m/z level, so that's basically a level 3 identification, and you want to confirm it with other information. But at least it's giving you a hint of what it possibly could be.
Okay, so those are some examples. We talked about NMR databases, we talked about some mass spec databases, we talked about some of the compound databases, and then we talked about some more specialized databases that offer information that potentially allows you to search for or find, in the case of MyCompoundID, unknown compounds, or, in the case of KNApSAcK, compounds associated with very specific plant species.
I'm going to switch gears, and we're going to talk about pathway databases. I think pathway databases are again the link between the genome, the proteome, the metabolome and physiology. So this is at some level why metabolomics often sits at the top of that informatics pyramid, because we know all about these pathway databases. We have to. But talk to someone in the field of genomics or transcriptomics or proteomics, and they barely know about KEGG. They've never heard of Reactome. They don't know what BioCyc or MetaCyc is. But if you're doing metabolomics, I presume everyone has heard of KEGG? If you haven't, you have to leave the room. Some of these others are less well known. Reactome: how many people have heard of or used Reactome? In fact it is now part of EBI, but it was actually started by Lincoln Stein, who is heading up the organization here and was formerly, I guess, Ann's boss of bosses.
And then BioCyc and MetaCyc were established by Peter Karp and have been around for a while. How many people have used or seen the Cyc databases? And then has anyone heard of the Small Molecule Pathway Database, or SMPDB? So pathway databases, as I say, relate genes to proteins. Some can relate to disease. Some provide a little bit of information on signaling events and processes. They are visual tools. And this is something that also makes them very appealing, because you get tired of looking at spectra and get tired of looking at chemical structures. In the case of KEGG and MetaCyc and Reactome, they actually cover multiple species as well. And so this inclusion of species information is also what makes pathway databases particularly valuable, just as it does for KNApSAcK. So of course the best known pathway database is KEGG. And I think the simple wiring diagrams, the hyperlinkable information, and the fact that it covers so many different species and is very, very comprehensive make it probably the most appealing pathway database. Right now it has almost 20,000 compounds. They list 10,000 drugs. And given that there's only about 1,600 approved drugs, I don't know where they're getting all their drugs from. How many of these are illegal? But I think a lot of them are salts or suspected early-stage drugs. There's a huge resource on glycans, which I don't know how many people have ever used. And then together with the different collections, it's about 450, 460 pathways. Now for KEGG, a complete description of human metabolism represents about 100 pathways. And then the exotic microbial metabolism represents maybe another 50 to 100 pathways. Plant pathways, maybe 120, I guess. So it's not as if you'll find 460 pathways for humans or mice, or 460 pathways for plants. And I think you also have to remember that the pathways in KEGG are very, very generic.
So generic that they can't even depict any cellular or subcellular locations. So most people I have taught don't know that the citric acid cycle takes place in the mitochondria, because KEGG doesn't show it as happening that way. And so what's in KEGG must be true. So again, this idea of where metabolism occurs at the subcellular level is not captured or capturable by KEGG because it is so generic. The other point obviously is you have to know what pathway you're looking at. Is it the human one or the generic one? And so again I'll see people drawing the ascorbic acid synthesis pathway for humans, and humans can't make it. But it's so generic that most other organisms have it. Or they'll draw the photosynthetic pathway for humans. So again you have to be careful when you are using or searching KEGG to have some biological context. Okay, that's good. Of course, you've just done that. So if you're doing proteomics you would look more at differential proteins, right? But you're saying that with metabolomics you look more at different concentrations of metabolites and then try it. But how would you do that enrichment stuff with what's associated with them? You can't just go straight to GO, for example, and get their subcellular location. No, there are very few metabolites that have any kind of GO annotation. There's a lot of effort obviously in other databases where people are trying to identify the localization of different metabolites. So HMDB has lots of information on localization, either at the physiological or the subcellular level. In metabolomics, yes, we're interested in concentrations; obviously we want to know what's there. But because we have information about pathways, we can associate perturbations in abundance with changes in gene expression or protein expression that enhance or reduce production of certain metabolites. We can associate pathways with binding events or signaling events, which also allow us to understand what's precipitating the change.
So the pathways still allow us to do that interpretation. We can also monitor things like flux, and whether things are increasing, or bottlenecking and piling up at the back, or flowing through too quickly. And again that allows us, via pathways, to identify where the blocks might be. But ideally you can integrate with other omics information, like what you either find from other studies or can remember. Yeah, so metabolomics really integrates exquisitely with proteomic data. It integrates exquisitely with transcriptomic data. It's being used widely with GWAS data. Metabolomics allows you to do a finer phenotyping. So if you have a plant that's drought resistant, okay, that's a phenotype. If you look at it metabolically you can say, oh, this plant has a proline-based drought resistance and this one has a sarcosine-based drought resistance. So now, yes, both survived the drought, but metabolomics shows these two plants have fundamentally different mechanisms. Okay, what are the genes that regulate sarcosine? What are the genes that regulate proline? What are the proteins involved? What are the pathways? What are the signaling processes that allow this to happen? And why do they use those things? And there are physiological reasons. Proline is very, very soluble. Sarcosine is another stress-response metabolite that's used across kingdoms, actually. So anyways, I think this is often where these pathway connections help, and because people in metabolomics have to be so aware of the pathways, it helps them with that integration. So it's the anti-organism, right? Yeah. I mean, you know, the organism then. Yeah. Okay, I'm going to talk about the Small Molecule Pathway Database, partly because it's something that we've been working on for a long time, but it was partly to address the shortfalls that we saw with KEGG.
So the fact that you don't have a pathway that shows the citric acid cycle happening in the mitochondria, the fact that you can't depict mitochondria, the fact that you can't depict the periplasmic space in microbial cells, the fact that you can't depict organs or other physiological events in metabolism. So we also wanted to move away from the pure catabolism, anabolism picture of metabolism to say, how can you use pathways to relate to disease? How can you use pathways to relate to drug action? How do you use pathways to relate to signaling? So currently, this is what SMPDB is. These are focused on human pathways but they would apply to all mammals. So it has a large number of drug pathways, drug action pathways. There's a large number of disease pathways, many of them reflecting metabolic disorders, of which there are hundreds. Then there's about 200 plus metabolic pathways, which is actually about twice the number that KEGG has. And then there's about 40 other signaling pathways. Ideally, there should be about 4,000 signaling pathways. It's just really hard to find that information. And that's, as I say, what I think most of metabolism is about. That's what most of metabolomics is about. What we try to do with SMPDB is to allow the visualization and depiction of cell compartments, organelles, protein locations, and tertiary and quaternary structures. You can map transcriptomic data, proteomic data, and metabolomic data with these tools. And in essence, the idea is to be able to convert protein or chemical lists into pathways or even disease diagnoses. So this is what a pathway looks like in SMPDB. So it's a little more colorful than a KEGG map. But you can see there's the mitochondria. There's that pink blob up in the corner. There's peroxisomes that are drawn. There are organs that are affected by the metabolic problem here. And I think this is PKU. There's the metabolite that's being perturbed, with a big star or explosion in the left corner.
And then you can see pathways. You can see cofactors. You can see the proteins, which are in green, and whether they're dimeric, trimeric, tetrameric. So that's what a typical pathway looks like in SMPDB. Every pathway has a description. Every protein has a link. Every metabolite has a link. Proteins link to UniProt. Metabolites link to HMDB. It's all viewable the same way that Google Maps is viewable or browsable, so you can zoom in. So it has controls to go left, right, zoom in, zoom out. You can also do some metabolite mapping. So you can click on the list of metabolites and turn them on and off to highlight them, turn them red or green, or provide lists that you feed in. These are the perturbed lists that you identify in a metabolomics experiment. And it then will list or show the pathways that have been modified or seem to be affected. And then those are highlighted in the pathway set. So you can see if that makes sense or if it has something related to that. SMPDB also allows you to map metabolite and gene concentrations sort of in a qualitative way. So it colors them as dark red, orange, yellow, green, so the red-green coloring scheme. You can turn the background off and on, especially with something like this where it's very color rich. You can turn the background off so it's black and white but still see some of the color schemes. So how do you generate pathways for SMPDB? So we used to draw them all by hand using PowerPoint slides. That didn't work too well. Went through lots of summer students, a large graveyard of summer students. So we developed a tool called PathWhiz. So this is an online tool that allows you to generate SMPDB pathways. And we've had one person actually working on this for almost a year. And he's used PathWhiz to generate many SMPDB pathways. But he's also generated hundreds of pathways for E. coli, hundreds of pathways for yeast, and hundreds of pathways for Arabidopsis.
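The metabolite-mapping step mentioned above (feed in a list of perturbed metabolites and get back the pathways that seem to be affected) can be sketched as a simple set overlap. This is a minimal sketch, not how SMPDB implements it: the `PATHWAYS` table is a tiny made-up example with abbreviated member lists, and `affected_pathways` is a hypothetical helper name.

```python
# Toy pathway table: pathway name -> set of member metabolites (lowercase).
# Abbreviated and for illustration only; not taken from any real database.
PATHWAYS = {
    "Citric acid cycle": {"citrate", "isocitrate", "succinate", "fumarate", "malate"},
    "Phenylalanine metabolism": {"phenylalanine", "tyrosine", "phenylpyruvate"},
    "Urea cycle": {"ornithine", "citrulline", "arginine", "urea", "fumarate"},
}


def affected_pathways(perturbed, pathways=PATHWAYS):
    """Rank pathways by how many of the perturbed metabolites they contain.

    Returns a list of (pathway name, matched metabolites, fraction of pathway hit),
    with the most-hit pathways first and ties broken alphabetically.
    """
    query = {name.lower() for name in perturbed}
    hits = []
    for pathway, members in pathways.items():
        overlap = query & members
        if overlap:
            hits.append((pathway, sorted(overlap), len(overlap) / len(members)))
    return sorted(hits, key=lambda hit: (-len(hit[1]), hit[0]))
```

For a perturbed list such as `["Phenylalanine", "phenylpyruvate", "fumarate"]`, phenylalanine metabolism comes out on top with two matches. A raw overlap count like this ignores pathway size and background frequency, so a real analysis would normally add a statistical enrichment test on top of it.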
As a web server, it's sort of like Bayesil and other things we've shown you. It's interactive, but it produces machine-readable pathways. So it's not PowerPoint anymore. So you can save things into BioPAX. You can save it into SBML or SBGN. You can save them as SVG or PNG images. And as you're building your map or pathway, you can zoom around like in Google Maps. It's gone through a few iterations. It's not the easiest thing to use, but certainly with a few people working on it, it has iteratively improved a lot. This is just illustrating how you would sort of build pathways. There are different templates that you can use, and you basically think in terms of reactions. The best way to build a pathway is actually to sketch it out on paper first. And then after that, you can start building these things in. But it has a huge database. So if you type in names of proteins, it'll instantly grab the protein and know that it's a dimer or a trimer. It might know the cofactors. It has a huge database of chemicals. So if you type in a chemical, it'll instantly draw the structure for you. So you don't have to draw structures. You don't have to look through UniProt or PubChem or anything like that. All the data is accessible. You just have to provide the appropriate names and it'll create reactions. You move things around. So if anyone's used some of these palette drawing programs or painting programs, this is largely what you do. And you can position things with different arrows and render things in different ways. So it's done on a white background, with this large collection of things that you can drag and drop: organs, organelles, membranes, other things that are all viewable. And if you don't like one way, you can shift to another. It's a community drawing process. So you can make your pathways open access. Someone can go in, but they can't necessarily damage what you've done. They would end up creating yet another one, but they can use what you've done and build on it.
So that way your art isn't destroyed, but your art can be built upon. So people can add other information. One of the things that most people almost always forget to put in are transport processes in pathways. No one ever shows, certainly not KEGG, any of the membrane transport proteins that allow metabolites to get in and out. Again, this is just an example of a pathway generated via PathWhiz, showing these zoom boxes where you can look into, say, the endoplasmic reticulum or the mitochondria in the different layers, and then linking up, in this case, to the kidney. It's not ideal for printing. I mean, you obviously can't read these things. So we've been converting a lot of the pathways to, if you will, KEGG-like maps. So it's trying to make them very compact so that you can print them off. This view is ideal in much the same way that Google Maps is useful: it allows you to scroll interactively. Anyways, I think PathWhiz enables pathway database creation. And now you can propagate these pathways. So if you've drawn it for E. coli, you can propagate it to Pseudomonas or Shigella. If you've drawn it for Arabidopsis, you can propagate that set of pathways to spruce trees or whatever. The other thing, as I say, is that it makes these things machine-readable. And that's been a problem. A lot of the KEGG data is written in a language called KGML, which only KEGG uses. So it's sort of difficult, I guess, to reuse the KEGG resources. But SBML, SBGN, these are all community standards. And so that's what SMPDB and PathWhiz try to use. So in the last part, I guess, we'll talk about the comprehensive metabolite databases. And these are sometimes multi-species, sometimes they're single-species. Some of them contain pathway information, some of them contain metabolite information. So KEGG is recognized as a pathway database, but in some respects it's also a comprehensive metabolomic database.
It doesn't have spectral data, but it certainly has a lot of other components. So the definition of a comprehensive metabolite database is, first, it has to contain more than 1,000 compounds. Generally, it has to be organism-specific, or at least associate information about a metabolite with organisms. Ideally, there should be some continuous updating, or at least annual updating. And minimally, it has to contain two or more types of information: chemical and pathway data, or chemical, spectral, and biological data, or chemical, pathway, and spectral data, or all four of these. So a database that, as I said, I don't think you guys have heard of, but has anyone seen MetaboLights? Just one. So this is maintained by the EBI, and it was started by Chris Steinbeck. And it is intended to be the GenBank for metabolomics. So if you're collecting raw metabolomic data, this is where you're supposed to deposit it. Many journals are starting to require you to deposit your data into MetaboLights before you can publish. So you can upload your spectra. You can upload your compounds and compound lists. You can upload your protocols and metadata. The net effect is that with people uploading data, they're starting to get large collections of spectra, large collections of metabolites, associations with specific diseases, and associations with specific organs, tissues, and functions. It is done under the Metabolomics Standards Initiative, so it follows the standards that are required. It also compels you to follow the standards, which is also good. It has a tough interface for uploading. It takes a long time, but there are tools appearing that make uploading much easier. Then there's a collection of databases that we've been building for the last 10 years. And so I think it's probably worthwhile talking about them because a lot of people use them. The Human Metabolome Database, which was started in 2006.
DrugBank, which was started in 2005. The Yeast Metabolome Database. The E. coli Metabolome Database. Phenol-Explorer, which is a polyphenol database. The Toxic Exposome Database. The food database. The Small Molecule Pathway Database. And then specialized databases on cerebrospinal fluid, serum, urine, saliva, and so on. Most of these databases started in a project called the Human Metabolome Project, which was started in 2005. No one's ever heard of it. Everyone's probably heard of the Human Genome Project. But this is one that was sponsored by Genome Canada. And it allowed us to start working on both assembling these databases and also performing a lot of experiments to analyze what's in blood, what's in urine, what's in the cerebrospinal fluid, and saliva. So a lot of experimental work was done to complement that. What we were required to do when we started the project is we had to make all of our data freely available. And we still do. So we made the Human Metabolome Database freely available, DrugBank, the food database, T3DB, and so on, all of them available. And then a lot of the technologies we've developed have also been freely available, which is one of the things you guys have been using today. When we started the project, we wrote up a paper, a proposal, and we said, based on what we could gather in 2004, we figured there were about 690 known compounds in the human body. Boy, were we wrong. Anyways, this was the complete list of compounds that were available in KEGG and HumanCyc. So these are, you know, 10 years of these guys compiling data, and so, okay, that must be how big it is. Seemed small to us, but that was all the information. So when we released our first version of the Human Metabolome Database, we found out, you know, how wrong they were. They were off by at least a factor of three or four. Three years later, with more measurements and more literature surveys, we were up to 6,400. By 2013, 37,000; today, it's about 42,000.
And I think in terms of detectability, it's something on the order of, I don't know, 6,000 to 100,000 compounds that probably should be detectable. Could be more. Could be a lot more. And then as improvements happen, and as we think about theoretical metabolites, things that are transitional or temporary or very, very low abundance, we are looking at something on the order of probably close to 2 million. We've seen this picture already, but this is how it breaks down. The human metabolome is composed of many metabolomes. Some of them are endogenous. Some of them are exogenous. We eat other metabolomes, so therefore, they become part of us. So we are part plant. We are part microbe. We are part mammalian. We're part synthetic chemicals. All of those are in our bodies, our skin, our tissues. And I think that's important to remember, because when we first did this, all the compounds that were listed in KEGG and HumanCyc were just endogenous metabolites. And as we started realizing that we are composed of what we eat and breathe, I think that broader view has really changed what our perception of the metabolome is. So some of the databases I think that are important are obviously the Human Metabolome Database, the Toxic Exposome Database, DrugBank, and the food database. So I'll talk about these a little bit more. So HMDB right now is about 42,000. We have about 170 gut microbial metabolites. In reality, there's at least 2,000 or 3,000 microbial metabolites in our body. The problem is that those 2,500 or 3,000 are identical to what our body produces. So we can't tell whether they're of microbial origin or of endogenous origin. So these 170 are the ones that are not made or cannot be made by humans. What we list are the low, medium, and high concentrations of metabolites. So this is why I say concentrations are really important. We link a lot of that to diseases, about 700.
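The low, medium, high idea can be sketched as a lookup against reference ranges. This is a minimal sketch under loud assumptions: the ranges in `REFERENCE_RANGES` are placeholder numbers chosen for illustration, not values taken from HMDB, and `classify` and `profile` are hypothetical helper names.

```python
# Placeholder reference ranges in micromolar: metabolite -> (lower bound, upper bound).
# These numbers are illustrative assumptions, NOT values copied from any database.
REFERENCE_RANGES = {
    "glucose": (3900.0, 6100.0),
    "creatinine": (50.0, 110.0),
    "lactate": (500.0, 2200.0),
}


def classify(metabolite, concentration):
    """Return 'low', 'normal', or 'high' relative to the reference range."""
    lower, upper = REFERENCE_RANGES[metabolite]
    if concentration < lower:
        return "low"
    if concentration > upper:
        return "high"
    return "normal"


def profile(measurements):
    """Turn {metabolite: concentration} into a qualitative {metabolite: call} profile."""
    return {name: classify(name, value) for name, value in measurements.items()}
```

A sample then reduces to a pattern of calls, for example `profile({"glucose": 9000.0, "creatinine": 80.0})` gives high glucose and normal creatinine. This only works with absolute concentrations, which is the reason for stressing quantitation.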
Right now we've got about 2,200 NMR spectra, 7,600 MS spectra, 2,000 GC-MS spectra. We try to make this a resource that links metabolites to other omics. So you can do sequence searches. So there would be an enzyme or transporter that acts on this metabolite. It has a sequence. So you can search for those things, and you can look for similarity and homology. You can search by spectra. There's lots of browsing, and different views that you can look at. Lots of pathways, 900 plus now in the database. You can search by molecular formula or molecular weight, but you can also search by structure and structure similarity. Very advanced text search tools, and the whole thing is downloadable. So these are just some screenshots of the HMDB, showing some of the spectral viewing tools, pathway viewing tools, molecular browser. On average there are 100 plus, I think it's more than 102 now, data fields in the HMDB. So it's not just a database. It really is designed to be an encyclopedia. There are hundreds, if not thousands, of words written about each compound. Some of it's paragraphs, some of it's in terms of an ontology or taxonomy, some of it's in terms of physical properties or references, but it's a lot of data. It supports general MS searching, just like METLIN. It also does MS/MS searching, just like using CFM-ID. It does NMR spectral searching, just like MMCD or BioMagResBank. It supports structure searching, so you can use an applet, draw something, and look for similar structures via the Tanimoto index. It also has a biofluids database, and so concentration data for many different diseases, abnormal and normal, for urine, cerebrospinal fluid, blood, saliva, fecal water; it goes on and on. At least 15 different biofluids. Again, this is designed primarily for physicians and clinical chemists. The most popular database we developed was actually a drug database.
And the reason why it became so popular was because it was the first to link drugs to their targets. So amazingly, all those drug companies had never tried to do that. And the result is that they've been using DrugBank for a lot of their drug discovery efforts, and a lot of drug repurposing efforts. And in fact, it's been quite successful having that information. We didn't really intend it to be that, or to be used for that. We were just trying to track the drug metabolome, and we were just trying to link information about what genes and proteins a drug acts on or is metabolized by. So sometimes, you know, an innocent project turns into something a little bigger than you expect, but it has been an interesting ride. So in this case, we had to deal not only with the needs of people from drugs and metabolomics, we had to deal with, you know, pharmaceutical and medicinal chemists and pharmacologists and pharmacists, and with drug targeting, drug metabolism, mechanisms of action, absorption, distribution, metabolism, pharmacokinetics. So that data is in here. Of course, it has a lot of information on drug transporters. It has other information on drug metabolites. Lots of information on drug-drug interactions, drug-food interactions. But like HMDB, it has tools for searching by sequence, text, spectra, masses. You can do structure searches. And not unlike HMDB, it has the same sort of tools, hundreds of data fields, lots of different viewing tools. You can query by chemistry. You can query by different categories of drugs. You can search by sequence. You can extract data through very complex MySQL searches, which are generated via just pull-down menus, so you don't have to know MySQL as a language. Another one is called the Toxic Exposome Database. And again, this grew out of the work on drugs because not all drugs are safe.
But also the fact that we were seeing, and others were seeing, all kinds of contaminants in blood and urine: pesticides and herbicides and endocrine disruptors and certain solvents and exposures. So this also pushed us to develop this particular database. It's smaller than DrugBank. It's smaller than HMDB. It's only got about 3,600 compounds. But like DrugBank, it has the targets that these compounds act on. So it explains why they're toxic. It also links a lot of this to gene expression and gene changes. So it's again linking metabolite or chemical to protein to gene and gene expression. And then there are pathways as well. So it tries to combine all of these things in a systems view. Question? So is it curation from the literature or did you guys actually... Well, some of these we found, but there's no point typically trying to do or redo some of the stuff that's been done. Most of it's in the literature. Huge amounts of effort are made by the CDC and EPA, and the NHANES studies cost tens of millions of dollars a year, and they publish the data all the time. But it only covers about 200 compounds. But yeah, there's lots of literature on what they're finding in very sick people as well as healthy people. And it's sort of scary. But it is part of our metabolome. The other one that I think we all ignored or wanted to ignore was the fact that we are what we eat. And this represents a huge part of our metabolome, and arguably more metabolites in our body are derived from plants than just about anything else. So we are more plant than animal in many respects. Just because we're omnivores, we eat plant foods. Most of our calories come from plant derivatives. Some of it's hard to trace. There's also a lot of food additives, thousands of them, that represent the dyes that are in the food, preservatives, emulsifiers, surfactants. It's amazing. So right now the food database, our food constituent database, lists about 30,000 metabolites.
The average plant minimally contains about 3,000 different metabolites. Obviously some plants are much more complicated than that. It's just that there are so many varieties, and that sort of leads to that large proliferation of secondary metabolites. This database hasn't been published. It's sort of an underground, virally spread database. We're still fixing it. We may be fixing it for the next 20 years before it's really ready to publish, but it is quite comprehensive and it's modeled after all the other databases. Some people are working on yeast. The Yeast Metabolome Database is online, and has been for a while. We're doing a massive update. There were about 1,000 metabolites; with different growth substrates, I think this year we're up to 10,000. So the yeast metabolome is pretty big, pretty complex. The other thing to remember is that yeast is important for wine and beer and for bread. And the substrates that yeast grow on produce all kinds of exotic compounds, which technically are part of the yeast metabolome and yeast metabolism. So lots of pathways, lots of reactions. We've redone all of the pathways. There are 66 yeast metabolism pathways from KEGG. Now we have well over 100 that we've redrawn through SMPDB and PathWhiz. So it's going to be released later this fall. The E. coli Metabolome Database. I did a very recent update of that and it's much more extensive. Large numbers of pathways, much more information on reactions. The reason why we focused on E. coli is that we want to try and link metabolomics to metagenomics in the microbiome. E. coli is the classic microbe. It doesn't cover all of microbial metabolism, but a big chunk of it. And we're using the E. coli metabolome to extrapolate the metabolomes of many, many other microbes.
And this is the idea of this transfer tool, microbes to metabolites: taking the genomic information, taking the machine-readable pathway information, and just cranking it through, one after the other, for thousands of microbial metabolomes. And that way I think we can start relating the chemistry of the microbiome to the genetics of the microbiome. And really, from the big-picture point of view, microbes are just chemical factories. They are there to process other foodstuffs and to transform them into either things that are more edible or consumable, or better waste products. I think viewing the microbiome as purely a collection of sequences is far too narrow. Many of the sequences are for microbes that are long dead or not functional. And you can get a much better functional view of the microbiome by looking at the metabolome. So I think to wrap up, I guess we're getting close to time, and this is just sort of a comparison of the different types of databases and what each of them has. The comprehensive databases generally check off in terms of having lots of spectra, lots of pathways, lots of structures, lots of descriptions, lots of property and physiological data. So HMDB is an example. The Cyc databases and KEGG are sort of in that realm. And there are others that are more associated with compounds or spectral information. Others are more associated purely with pathways. So I tried to give you a perspective on the different types of databases, but I think the key message is that in metabolomics we have very specific needs, and trying to fit your metabolomic need into a database that was developed for a different task or for a different organism will just sort of mean you're chasing your tail. So I think it's important for people to know the organism they're studying, and to try and find the resources that are specific to that organism, or try and create their own resource for that specific organism.
I'm very cautious, and would urge people to be cautious, about using ChemSpider and PubChem. They're great databases, but they were never designed to be used for metabolomics. And the people who developed them are usually shocked to find that people are using them for metabolomics. I think it's also important to understand that there are other tools out there that can help you. It's not just these comprehensive databases. There's useful information in a number of databases that I've brought up that are somewhat obscure, and I was, again, a little surprised that some people hadn't heard of some of these databases. I think the other message that's really important here is that too often in our interpretation of metabolomic data, we tend to look at it as catabolism and anabolism. We describe our findings in terms of amino acid biosynthesis or in terms of lipid catabolism. And the fact is that many, many metabolites play a vital role in signaling, in homeostasis, in disease control, in immune functions or the immune system. And as we start scraping below the surface, I think people are realizing how important metabolites are at just about every level and in every disease process. Metabolites emerged or became essentially the first lines of defense for cells about a billion and a half years ago. Metabolites were used to guard against moisture loss, UV radiation damage, attacks from other organisms. Metabolites are your first line of defense in your body. In that regard, they're going to change before anything else, whether it's your white blood cells or your complement systems, to fight off infections or to fight off viruses or to fight off tumors. I think we're realizing, as an example in the case of cancer, here at OICR, that more and more the classic genetic disease, cancer, is being recognized as a fundamentally metabolic disease. And that almost all the drivers for cancer, as an example, are checkpoints for metabolism.
And we're now recognizing that there are oncometabolites, metabolites that cause cancer or drive cancer. And the list is growing. The biggest one actually is glucose. And if you look at the data, the consumption of glucose or sugar is very closely tied with the frequency of cancer in many societies. And people with glucose control problems, i.e. diabetics, have some of the highest rates of cancer. And some of the most effective control mechanisms for cancer are diabetic drugs. So metformin is sort of the new wonder drug for cancer. Statins are the other wonder drug for cancer. And people who are on metformin and statins seem to be almost cancer-free. But this is just an example of the underlying epidemiology that people couldn't understand until the realization that in fact cancer is far more a metabolic disease than a genetic disease. And the pathways for that aren't anywhere in KEGG. They aren't in most cases anywhere in textbooks either. And yet there's hundreds of people doing metabolomics in cancer, but very rarely making those connections. I think we're seeing as well the importance of metabolites in signaling in most immune and disease responses. Stress responses for plants are classically chemical, but they're mirrored in microbes and they're mirrored in humans. So that conservation is also quite compelling when we look across species and across kingdoms. So I think there's a lot to be learned in metabolism, and this conservation of metabolism and chemistry between organisms makes, I think, these broader sweeping statements about how organisms evolve and how organisms defend themselves a little easier to make. The shared metabolome between humans, mice and cows is like 99%. The genetic conservation is, you know, 60%. Shared metabolism between yeast and humans is probably 80%. The shared conservation at the gene and sequence level is perhaps 30%.
So I think those are important things to remember, and, again, important lessons can be learned by looking at model systems. So I think those are my parting words for today. I don't know if anyone has questions or comments on what they saw today. You're also welcome to carry on with finishing up some of the exercises. And Jeff and I will sit around. Jeff has some comments. Maybe I should hurry right now while it's still online. It might go offline in the next 10 minutes. Okay, any questions? Yes. I don't know if this is the right way to phrase the question. Do the databases have, let's say, an annotation that says that, well, this could have arisen from processing, let's say, by the commensal microbiome? Is there any easy way to figure that out? Not really. Except in some cases by reading. So an example is what we eat. Many plants produce polyphenols. And hippuric acid is something that shows up in urine in great abundance. And hippuric acid, most of it comes from polyphenols. And there are pathways that illustrate how many polyphenols can progress through about three or four transformations to become hippuric acid. There's a database called Phenol-Explorer, which we maintain. And it has a lot of the polyphenols, but it also has some of the polyphenol metabolites. And then there's another database that's going to be released, I hope, soon, called Exposome-Explorer, which lists a lot of the biotransformation products, or markers, and those often represent biotransformations that happen partly from the microbes, sometimes from phase one or phase two liver activity. We're working on a software tool called BioTransformer. And the idea is not unlike MyCompoundID, but this one will use machine learning to recognize structures and identify whether or not they are substrates for about 300 different enzymes.
And then if it identifies a compound as a possible substrate, it then identifies where the reacting points would be, so the sites of metabolism. And then it performs an in-silico reaction and generates the structure. And then it can iterate through these things. We've done these runs, and this is how we come up with this number of about 2 million compounds that we think are metabolically feasible, based on what we know of endogenous metabolites. So with that we can give you the source of the enzyme, the reaction, the starting product, the resulting product. But it's not quite ready yet. Any other questions? All right. I know in the afternoon people are tired. So, we'll take a break for now.
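As a rough illustration of the biotransformation loop described in the last answer (check whether a compound is a substrate, find the sites of metabolism, perform the in-silico reaction, then iterate on the products), here is a minimal Python sketch. Everything in it is an assumption made for illustration: structures are toy strings rather than real chemical structures, substrate recognition is plain substring matching rather than the machine learning mentioned in the talk, and the `Rule`, `Reaction`, `react` and `enumerate_metabolites` names and the two example enzymes are hypothetical. It is not the actual tool.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """Hypothetical enzyme rule: a toy 'site of metabolism' and what it becomes."""
    enzyme: str
    pattern: str       # substring standing in for a reactive group
    replacement: str   # substring standing in for the transformed group


@dataclass(frozen=True)
class Reaction:
    """One predicted step: which enzyme turned which substrate into which product."""
    enzyme: str
    substrate: str
    product: str


def find_sites(structure: str, pattern: str) -> List[int]:
    """Return every position where the toy reactive group occurs."""
    sites = []
    i = structure.find(pattern)
    while i != -1:
        sites.append(i)
        i = structure.find(pattern, i + 1)
    return sites


def react(structure: str, rule: Rule) -> List[Reaction]:
    """Apply one rule at each matching site; an empty list means 'not a substrate'."""
    reactions = []
    for site in find_sites(structure, rule.pattern):
        product = (structure[:site] + rule.replacement
                   + structure[site + len(rule.pattern):])
        reactions.append(Reaction(rule.enzyme, structure, product))
    return reactions


def enumerate_metabolites(start: str, rules: List[Rule], max_steps: int) -> List[Reaction]:
    """Breadth-first iteration: products of one round become substrates of the next."""
    seen = {start}
    frontier = [start]
    log: List[Reaction] = []
    for _ in range(max_steps):
        next_frontier = []
        for compound in frontier:
            for rule in rules:
                for rxn in react(compound, rule):
                    log.append(rxn)
                    if rxn.product not in seen:
                        seen.add(rxn.product)
                        next_frontier.append(rxn.product)
        frontier = next_frontier
    return log


if __name__ == "__main__":
    # Two made-up rules acting on a made-up methoxylated compound.
    rules = [
        Rule("O-demethylase", "OMe", "OH"),
        Rule("sulfotransferase", "OH", "OSO3H"),
    ]
    for rxn in enumerate_metabolites("Ar-OMe", rules, max_steps=3):
        print(f"{rxn.enzyme}: {rxn.substrate} -> {rxn.product}")
```

Each logged `Reaction` carries the enzyme, the starting compound and the resulting compound, which mirrors the kind of output the answer describes. Capping the number of rounds with `max_steps` is a design choice of this sketch to keep the enumeration finite.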