Okay. Standard slide. So we're going to talk about databases. I'll introduce a few databases. You'll need these databases to do some of the assignments or the tutorial that we've given you. And this is essentially where we are now that we've gone past the annotation stage. How do we understand what we've just identified? So we could have done quantitative metabolomics. We could have done targeted. We could have done untargeted. Now we have our list. These are the interesting metabolites. And so what do we do next? That's largely what we're going to talk about with this discussion on databases, and what we're going to talk about mostly tomorrow as well. So this picture is what I call two solitudes. I think people have heard about this between Canada and Quebec. Yes, yes. Rest of Canada and Quebec. But it's also true of bioinformatics and cheminformatics. Most of us call ourselves bioinformaticians. Bioinformatics was a field that really mostly started in the early 1990s. It was designed to deal with the needs of molecular biology and genome sequencing. And even though there were some early companies, it very quickly became a web-based, open access model. That's why we've got the Creative Commons license. That's why everything that we seem to get or work with is free. And we have great sites like NCBI that's funded by NIH. Resources that are paid for by the public to be used by the public. Cheminformatics started in the 1960s. So 30 years before bioinformatics sort of got going. And it was largely to address the needs of organic chemists. And the model then, which is the model that's common in most groups, is user pay. That's how IBM got to be so big. Computers were things you bought, software was stuff you bought. You paid money. What's more is that companies evolved around this software to create it, to distribute it, to sell it, to add to it.
Things like CAS, Beilstein, MDL. So these are different models. And this is one reason why you pay money for the NIST database. This is why you pay money for the Chenomx software. And it's why companies still exist in the world of cheminformatics and why there are basically no bioinformatics companies. So that's the way things have evolved. What we're trying to do now, I think, is bring those two solitudes together. And that's what metabolomics is trying to do. It tries to use a lot of the techniques, tools and philosophy of bioinformatics, and it's trying to bring that into the domain of cheminformatics. So the data in PubChem was actually a direct challenge to CAS. And in fact, CAS was suing PubChem for making data public. And it's why you can't put CAS numbers on PubChem sets. Anyways, databases are sort of the core that brings both cheminformatics and bioinformatics together. And so when we think about databases, we have to ask, what are they for? Typically databases are a way of consolidating information. And not only do they consolidate information, they allow you to retrieve information very readily, because this is what computers can do so nicely. We can also use databases for reference values. This is something that is very important in things like clinical chemistry and human metabolism. We also use reference images for histology and pathology, and reference sequences from GenBank. That's very important. We use data in the databases to test prediction algorithms. So testing is a great thing. Databases are helpful for that. We also use databases to do things like searching and prediction. And these are fundamental to all of bioinformatics and to cheminformatics. So you can search for images, you can search for spectra, structure, sequence, text. These are all things that we do in bioinformatics. These are all things we also need to do in cheminformatics. Prediction is another thing. We talked about predicting spectra.
We could also predict things like structure and function and relationships. But this is all enabled by databases. So databases in an emerging field like metabolomics, or even in the early days of bioinformatics and molecular biology, always started off as hobby databases. Someone just kept a record. They just liked to collect things, like stamp collectors. So the very first sequence databases were made by Russell Doolittle and Margaret Dayhoff. They just collected sequences themselves and typed things in. I've done hobby databases as well. And as those hobby databases build, eventually either people say, can I use it? Can I borrow it? Or you realize there's some utility. And so you go from a hobby database to something that's maybe a curated database, one where you start adding elements to it. You expand the coverage. You might start making it into a relational database. You start bringing in programmers who know something about relational database management. And that's the way some of the databases, again the sequence databases, went. But then eventually people said, well, can I deposit information in this database? I'm collecting this data, too. And so it becomes an open deposition system, an archival system. Archives are things that anyone can bring material to. Museums are not archives; they are things that a curator curates. So GenBank is an archival database now. It grew from this hobby database that Russell Doolittle and Margaret Dayhoff put together to become a curated database, which then eventually grew to become an archival database. So people can deposit stuff over and over and over again. They have to accept it. You can deposit junk. There is junk in GenBank. But hopefully it's sorted and you try and clean it up. But it's open access. And that's obviously made it tremendously popular. Now once you go to an archival system, you typically don't have enough people or resources to annotate it and to curate it.
So there's sometimes a loss of information, but greater coverage. So modest depth, more coverage. Whereas the extensively curated database, just like an extensively curated museum, has lots of information about every item in its collection. Cost and size of community obviously grow as things become more archival. With a hobby database, you can enter data any way you want. But once you start going to an archival database, you need to standardize. You need to make things far more automatic. You need to have much more powerful querying capabilities. And this is the evolution that you've seen, again, with GenBank. This is maybe the third time we've seen this slide. Historically the reason why these fields have evolved so quickly is not only the databases, but also the querying tools that allowed you to search this data and extract this information. Metabolomics historically didn't have those tools. We've talked about some of the tools already that allow you to do that, but the challenge was still trying to get those databases. In the case of metabolomics, the databases weren't databases. They were tables in textbooks and numbers in papers. And they were accumulated over 100, almost 120 years. And as I said, there are compounds that are being "discovered" today that in fact were described in the 1930s. Because we lagged in getting stuff into electronic databases, metabolomics is about 20 years behind genomics and proteomics. There's also a different challenge, which has to do with the fact that metabolomics is not just about chemistry or analytical chemistry. It's also about people who do plant work, people who do clinical work. It's about people who work on drugs. It's about people who are doing work in NMR and mass spec. It's people who do bioinformatics and systems biology. There are people who are also into the idea of standards.
So we're trying to create a database that actually has appeal for a broader audience, maybe not as large, but broader in width than what GenBank or even the PDB has. So there are a bunch of databases for metabolomics that try and cover these bases. There are some specialty databases for things like spectral data. There are some specialty databases for mass spec data. There are some databases that are only about compounds. Others are about pathways. And then others try and bring all of these together, and we call those comprehensive databases. So I'm going to talk about them. Some of them we've already mentioned a little bit. So there are spectral databases. There's one that's maintained in Japan called SDBS, and it maintains NMR, mass spec, GC-MS, IR. There's another database called NMRShiftDB. It's strictly NMR data. The Madison Metabolomics Consortium Database, again purely NMR. BioMagResBank. And then we've also talked about the HMDB before, and the Chenomx database. So I'm showing you databases that some people use or have used. We use them sometimes in our lab. This is SDBS. You can search by names and by formulas, and you can retrieve very specific spectra that have been collected under very standardized conditions. You can download these spectra, though not en masse. You're only allowed, I think, 50 per day. But these are ones that you can keep and reference and use. It's been around a long, long time. Tens of thousands of mass spectra, tens of thousands of NMR spectra for lots of compounds, lots of search tools, but not all of these compounds are metabolites. In fact, relatively few are. BioMagResBank. Again, I've mentioned this before with the BMRB peak search. BioMagResBank was established about 20 years ago for protein NMR. But about five or six years ago, they shifted into metabolite NMR. And this has actually proven to be one of their most popular resources.
So they were quite surprised, but they continue to offer it because it's so heavily used. They have reference spectra and reference compounds, and you can search through them. There are about 900 reference metabolites, and they've collected about five or six spectra for each of them. So that's very extensive. You can search them. A lot of compounds are from plants. So, you know, these are not all obviously human or mammalian metabolites. They've also assigned them. So this is very similar to what's in the Human Metabolome Database, but the focus in the human one is on human metabolites. NMRShiftDB has lots of spectra, more than anyone else as far as I know, proton and carbon. And it's an archival database. People can deposit NMR spectra of any compound they want. They're assigned. So each peak in an NMR spectrum is assigned to a specific atom in the structure of the molecule. That way you know exactly what each signal is. NMRShiftDB was actually developed by the person who runs ChEBI now, Chris Steinbeck. Again, it's not metabolite specific. It's anything. So pretty exotic compounds. What's nice about it is you can actually do chemical shift prediction. It's the only free tool I know of that allows you to do chemical shift prediction. And search. And as I said, you've got those assignments, but these are assignments in organic solvents, and most of what we do in metabolomics is in water. Here's the Madison Metabolomics Consortium Database. It has lots of spectra. These are actually all taken from the BioMagResBank. And then it has a whole bunch of literature-derived spectra. So you can search by shifts and peaks, and you can also get information on chemical formulas. So those are some databases. And as I said, we've talked about the Human Metabolome Database. That's another one that does include NMR. But I want you to be aware of some of these other ones so that you can use them. MS spectral databases: we've actually talked about all of these before.
So it's the same picture. But we talked more about NIST and probably METLIN, so I'll just mention MassBank. This is a resource that's maintained in Japan, and it's really nicely done. It has some really nice search tools and a nice-looking website. Nice facilities for peak searches, for spectral viewing and for structure viewing. So nicely maintained, easily searchable. A range of spectra for different types of instruments: ion traps, FT-ICR, TOF, triple quads, Q-TOFs. About 30,000 spectra from 14,000 compounds. And it's archival. That means many groups deposit into it from all over the world. So in addition to the mass spec and NMR databases, there are the compound databases. We've talked about ChemSpider. We've talked about PubChem. We've talked about ChEBI. And there are also structures, three-dimensional structures, that you find in the Protein Data Bank. There are about 5,000 or 6,000 small molecule structures. And when I asked people if they'd heard of ChEBI, most of you hadn't. So ChEBI, it's spelled with a CH. About 28,000 compounds. A lot of the compound data is borrowed from KEGG and DrugBank, among others. And there's a lot of focus on naming and ontology. They're really careful about the names, really careful with the structures. Yes. Say that again. This is their ranking. It means that it's gone through several validators. People who've checked the structure, checked the name, ensured its stereochemistry is correct. And originally, they made a mistake last year. They uploaded a whole bunch of unverified compounds and it just messed everything up. They got so embarrassed by it that they took all of them out, and they started putting a ranking on their compounds. So it's intended to be a small, heavily curated database. Michelle. Yeah, they do. And I think what's happening is that the three-star compounds are the annotated ones, and then there's a whole bunch of other ones that come primarily from ChEMBL, which are not so heavily curated.
These are maybe called one star. But the ones that they tend to push are these three-star compounds, because they're obsessed about ontology. What does three star mean? Essentially that it's a high quality annotation, just like a three-star movie or a three-star restaurant. Yeah, it probably is. Somewhere. So we've seen how you could search by name and formula and structure. PubChem. Actually, I couldn't find the most recent numbers, but this is from last year. It's millions and millions and millions of compounds and substances. In order to get into PubChem, it basically has to be a small molecule, although something up to 1000 atoms could correspond to a small protein. They take a lot of data from all kinds of resources, and people have happily deposited their data into it. Substances are sometimes duplicates. The CID, which is the compound identifier, marks a unique compound. So the real number of compounds is 31 million, not 75 or 80 million, whatever anyone is claiming. So look at the compound ID to find out how many unique structures, unique compounds, there are. They have information about names and properties and structure and SMILES and bioactivity and links. And it's always expanding. It's an amazing enterprise because it's a tiny number of people actually doing the work. But they've created some wonderful automation, and it allows them to process so many compounds. You can search by name and molecular weight and formula and structure. And I think we saw some examples. ChemSpider. It's essentially a partition coefficient. So it's a calculated or experimental log of how something will dissolve in water versus, it could be octanol or it could be chloroform. So it's whether it partitions into more hydrophobic or hydrophilic solvents. So long, Francis. Thanks for coming. So ChemSpider is an attempt to try and improve on PubChem.
So it's more of a private effort, but it's public in the sense that anyone can access it. What they've decided to focus on is correcting some of the errors or problems in PubChem with respect to name, nomenclature, and even structure. PubChem is an archival resource. If someone deposits incorrect data, they're not going to see it or know it. ChemSpider is trying to get a sort of group annotation, where people try and identify incorrect names and incorrect structures and correct them. They're trying to use some of the tools that people at ChEBI have developed. Of course, the folks at PubChem are really interested in fixing things too, so they're trying to develop tools. So it's almost a community effort to try and make everything more correct, more complete. ChemSpider also includes other things like articles and spectra, which you don't typically find in PubChem. But those are sourced from other sites. So they take a lot of stuff from BMRB, DrugBank and HMDB, and that shows up in ChemSpider. Ligand Expo: this is the small molecules part of the PDB. So these are the structures that help crystallize, or antagonize, or activate proteins in protein structures. What is so nice about Ligand Expo, and many people barely touch this and know almost nothing about it, is that it actually links metabolites and drugs to their targets. That's a really useful bit of information if you want to figure out which drugs act where, and how, and when. And a lot of the compounds they have are actually experimental drugs, things that could eventually be used or might suggest new therapies or new approaches. Instead of just drawing a two-dimensional picture, you actually get the three-dimensional structure of the small molecule. Small molecules are not cardboard cutouts. They have shapes, they do move, and they have flexibility. And you can search in various ways.
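As an aside on the PubChem point made earlier, that several deposited substances can collapse to one unique compound identifier, the counting could be sketched roughly as follows. This is only an illustration under stated assumptions: the record layout, the depositor names and all of the ID values are hypothetical placeholders, not real PubChem data, and nothing here calls PubChem itself.

```python
from collections import defaultdict

# Hypothetical deposited substance records: (substance_id, compound_id, depositor).
# All values are made-up placeholders for illustration.
substances = [
    (1001, 11, "depositor_a"),
    (1002, 11, "depositor_b"),  # same structure deposited by a second group
    (1003, 22, "depositor_a"),
    (1004, 11, "depositor_c"),  # and by a third
    (1005, 33, "depositor_b"),
]


def group_by_compound(records):
    """Group substance IDs under the compound ID they resolve to."""
    by_compound = defaultdict(list)
    for substance_id, compound_id, _depositor in records:
        by_compound[compound_id].append(substance_id)
    return dict(by_compound)


compounds = group_by_compound(substances)
print("substance records:", len(substances))
print("unique compounds: ", len(compounds))
```

The number worth quoting is the second one, the count of distinct compound IDs, not the raw count of deposited records.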
So I've mentioned, and you guys may have seen these things, SMILES, which is a format for entering structure. It's a letter code of about seven, 10, 20 letters that describes information about the structure. And then there's the International Chemical Identifier, the InChI, which is a standard way of presenting all chemicals. And what they're going to be advocating over the next year or two is that every compound, when you produce your annotation list, should be given as the InChI identifier and the concentration. So it's not a KEGG ID. It's not its regular name. It's not the HMDB ID. It's the InChI identifier and the concentration. We're just saying that's for when you annotate, the process we're talking about where you go from spectra to list. So there are two columns: identifier, concentration. So there are other types of compound DBs. There's one called 3DMet; KNApSAcK, which is mostly plants; LIPID MAPS. How many people have heard of LIPID MAPS? I looked, and some people claim 60,000. I found 30,000. And there's the ZINC database, which is purchasable compounds that people use to screen. This is about two and a half million compounds. And this is information about these ones. What's important about these two is that these are metabolite databases. So no false positives here. But some of them are plant, some of them are microbial. So again, humans don't photosynthesize. So don't look for plant metabolites in our blood. There's ZINC, which is two and a half million; LIPID MAPS, 30,000. Now LIPID MAPS is popular, but it's all lipids from all organisms. So there are insect lipids that you'll only find in insects. There are plant lipids you'll only find in certain types of plants. There are mammalian lipids you'll find only in mammals. And they don't tell you anything about the source. So that can be confusing. So again, if you search by chemical formula or mass, you get a hit. And now you're claiming that humans are full of insect lipids.
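The two-column annotation list described above, identifier and concentration, could be written out along these lines. This is a minimal sketch, not a prescribed format: the column names, the tab-separated layout, the micromolar unit and the concentration values are all assumptions made for illustration. Only the two InChI strings (ethanol and acetic acid) are real identifiers.

```python
import csv
import io

# Hypothetical measurements: (InChI identifier, concentration in micromolar).
# The concentrations are invented for illustration.
annotations = [
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", 12.5),      # ethanol
    ("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)", 40.0),  # acetic acid
]


def write_annotation_list(rows, handle):
    """Write a two-column table: InChI identifier, concentration."""
    # Tab-separated because InChI strings themselves contain commas.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["inchi", "concentration_uM"])  # assumed column names
    for inchi, concentration in rows:
        if not inchi.startswith("InChI="):
            raise ValueError(f"not an InChI identifier: {inchi!r}")
        writer.writerow([inchi, concentration])


buffer = io.StringIO()
write_annotation_list(annotations, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Keying the list on a structure-derived identifier rather than a database-specific ID or a common name means the same table can be matched against any of the compound databases mentioned here.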
You know, watch what you're doing. Okay, pathway databases. I think everyone's heard of KEGG. That's the best known, best curated pathway database. There's also BioCyc and MetaCyc and the Cyc databases. How many people have heard of the Cyc databases? And then there's Reactome, which is maintained here in Toronto. How many people have heard of Reactome? And then there's another resource called the Small Molecule Pathway Database. How many have heard of that? So yeah, exactly. Pathway databases are the ones that we sort of target and say, okay, that's the end of the road in metabolomics. If we can take our metabolites and map them to the pathway, we're done, and we know everything about the biology. Well, that's maybe not true, but it's sort of what we like to say we're doing. Anyways, the point is that the pathways relate metabolites to genes, they relate them to the proteins, they relate them to the processes and to diseases. And so that really does tie those lists to biology. Some of these pathway resources also give us tools to do visualization, and we can see and map genes and metabolites. Some of them cover multiple species, like KEGG and Reactome. So KEGG is the Kyoto Encyclopedia of Genes and Genomes. It's been around for almost two decades. I think everyone's sort of familiar with how you can click on various nodes, and it gives you little information cards. It's not as big as ChEBI. There are only 16,000 compounds or 30,000. They have a lot of drugs, essentially six times more drugs than exist. So everyone's puzzled about how they got such a large number. But I think what they've done is they've taken the different salt forms of the same drug, and that's how you can create lots of different drugs. There's a huge glycan resource, which almost no one knows about and no one uses. And they have about 400 pathways, metabolic pathways. Now, the metabolic pathways are for different organisms. So for plants and insects and microbes and mammals.
So the mammalian pathways that they have number about 80. You're not going to find 420 pathways for humans in here. It covers all of the metabolic pathways for the diversity of life. So since no one's heard of the Small Molecule Pathway Database, I'm going to talk about it, partly because people like Bill have put their entire lives into it. This is a resource that was developed primarily for human metabolomics, to try and add a little bit more color to metabolism over what you might see in KEGG or other databases. But it was also designed from the perspective of what we as a metabolomics community need to do. Because KEGG started long before metabolomics was even a word, same with the EcoCyc and MetaCyc databases, and same with Reactome. So we wanted to try and create a database that could actually work with the data that metabolomics researchers develop or have. So this has 450 pathways. It's actually more pathways than KEGG. But it's not the case that there are more metabolic pathways. We have slightly more than what KEGG has for humans, about 90. But there are lots of drug pathways, and lots of disease pathways, and also some signaling pathways. And this is not the finished set. You could add zeros to each of these, not to this one, but to these other three, and then that might be reasonably complete. Still, this is the largest collection of human pathway information in the world. It has a variety of searching and browsing capabilities. It has tools for mapping metabolites. And it captures information that you don't see in other databases, like the organelles where the metabolism happens, or the substructures where the metabolism happens, or the organs where the action happens. So this is an example. We're looking at a pathway, a disease pathway for phenylketonuria. It's identifying some of the organs. Somewhere in here, I guess, is the brain, which is ultimately affected in phenylketonuria.
So one of the things that is different from other pathway databases is that the structures of the metabolites are visible in the pathway. They're not just little dots that are ignored. So you can actually see some of the structural transformations visually in the pathway. By clicking on the metabolite, you're immediately linked to the HMDB pages, which provide about 100 data fields, I guess, of information on each metabolite: descriptions, chemical and physical properties, disease associations. You can see, you know, here's the liver, here's the lipid membrane, here are certain organelles within the cell where some of this stuff happens. Here are the enzymes, here are some of the cofactors, here are some of the ones that are downregulated, the perturbed metabolites in phenylketonuria. You can also click on the proteins, and that takes you to UniProt descriptions of those. You can see arrows and pathways. You can also see which ones are activated, which enzymes, which metabolites are present. You can click on them and highlight them in various ways. You can map metabolites. So by clicking on the checkboxes for the list of metabolites that you generated today, you can see what's there and what might have been perturbed. You can also enter those lists of metabolites in a query tool within SMPDB. So type in your list of compounds that you measured today, and see what's there, or what's perturbed, or perhaps what's affected. You can even put in concentration data, relative or absolute, and you can color code the metabolites in these pathways, ranging from dark red to orange to yellow in terms of the intensity or quantity of these. So you can see what's upregulated or downregulated relative to other compounds. So there is a picture of the mitochondria and where some of this metabolism is taking place.
So again, that's information that's not normally depicted in pathways. In fact, metabolism does take place in specific organelles and organs, and that's not usually captured. So that's the Small Molecule Pathway Database, or SMPDB. So each of these databases, spectral databases, compound databases, pathway databases, covers a piece of metabolomics. Ideally, what you'd like is to try and bring all of that together. KEGG does have pathways, and it does have short descriptors about compounds. MMCD does have spectra, and it does have descriptions of compounds. So these are starting to be somewhat comprehensive in terms of covering more than just spectra. These are examples of the first really comprehensive databases that tried to combine pathways with spectra, with compounds, with descriptions, with disease and all kinds of things. So these are almost more like encyclopedias. That's the Human Metabolome Database and DrugBank. There are two other databases that we've been working on that have been released recently. One is the Yeast Metabolome Database, YMDB. And another one is the E. coli Metabolome Database. I know there are a number of people working on yeast and bacteria. And yes, there's a Cyc database on yeast, and yes, there's a Cyc database on E. coli. But they only cover about half of the metabolites that are covered by these databases. And again, this is because the perspective in metabolomics is much broader, and the field is probably moving much faster than some of the other databases can keep up with. Another perspective is that yeast is not just an experimental organism. It makes wine and beer. And in fact, the value-added components in wine and beer are critically generated by certain types of yeast. So this is of considerable interest to the food and beverage industry. So in terms of the Human Metabolome Database, right now there are just over 8500 metabolites.
120 of those are confirmed to be microbial metabolites. The database has normal and abnormal concentrations for thousands of compounds. It has links to hundreds of diseases. And it has thousands of NMR, MS/MS and GC-MS spectra, plus sequence and spectral search tools. We've talked about those: browsing tools, pathway tools, structure tools, specific information about different biofluids, text search and data downloads. No, that wasn't in. So if you want to try, you can write a name in or you can draw it. So the Human Metabolome Database grew out of a project that was funded about five or six years ago through Genome Canada. That was the Human Metabolome Project, sort of a parallel to the Human Genome Project. It didn't attract nearly the same attention and didn't have the same budget. But what it did do was create the Human Metabolome Database, as well as a bunch of tools and other resources to associate metabolites with diseases and with certain concentrations, and to get those lists that we can associate with certain biofluids in the human. But then we also started adding things. Beyond the human metabolome, we realized that there are also drugs, we realized there's also food, we realized there are also toxins and bacteria. So these things have been made freely available over the years, and they say there are now millions of users that access them. The project was also designed to develop some technologies and software. So some of the software you guys have used, and some of the stuff you will use tomorrow, by people like Jeff, came out of this project. And again, it's been made public. So this is the bioinformatics solitude, which says let's make everything freely available. So in the human, there are many metabolites from different metabolomes. We've seen this before: HMDB, DrugBank, FooDB, metabolites, toxins. They range from high to low concentrations, and the numbers total a little over 35,000 or 40,000. These are the access points for these databases for the human metabolome.
And then I gave you those links for the yeast and E. coli metabolomes. You'll have a chance, I hope, to look through these databases and see some of the links, to see what's available, to look at the spectra, to look at the pathways, to look at the browsing tools and search tools, and to browse at different levels. These are the types of data fields you'll find. Some are better annotated than others. You can perform spectral searching, and we talked about that before. The tutorial that we've given you covers spectral searching. The pathway tools: you can search through the SMPDB pathways, which are all linked into HMDB. Yeah, question about the pathway tools. Yes. So is this pathway tool purely for metabolomics pathways only? Are there any pathway tools where you can connect metabolomics with proteomics and genetics, I mean genomics? Is that part of it? Well, I think everyone would like to have something that connects all of these. I mean, this does a pretty good job, as does KEGG, because remember, these are metabolites, but these green things, those are all proteins. So we can color in, you know, information about the metabolite levels. I don't think we've got a tool for coloring in intensities of proteins, but that wouldn't be hard to add. And going from proteins to genes is kind of trivial. So that's one way of seeing or visualizing those connections. So it can be done, or is sort of being done. Ingenuity, do they have that? I think the Ingenuity pathways may have that. I don't think they have the metabolic focus, but they certainly have the genes and proteins and some metabolites. Then there's the biofluid database. So you guys looked at cerebrospinal fluid. We've characterized cerebrospinal fluid. There's serum, we've characterized that. There's urine, we're just finishing that one up. And then there's saliva, which is also being done.
And so those are reference concentrations for those different biofluids, and also for different diseases. This is particularly of interest to people like clinicians. DrugBank. Most of us, I hope, are not taking every available drug, at least those of us who are not hypochondriacs. But this is something that, again, was designed to deal with drugs and drug targets. And that's actually the reason why it became so popular: it was the first database to link drugs to their targets. And so this has actually been used to discover or repurpose a number of drugs for diseases that previously had not been treatable. So this one's been cool in that it was a database project, but it's actually saved lives. And in this case, we've used the same concept, the same model that we use for HMDB, for DrugBank. There are query tools that allow you to draw structures, look for similar structures, look for certain classes of drugs and drug types. You can search by sequence. You can search by sequence in HMDB as well. In DrugBank this associates the enzymes or genes or proteins or receptors with the drug; in HMDB, with the metabolite. And then there are more complicated SQL searches. But since most people don't know how to use SQL, this gives you an interface where you can AND and OR certain fields, so that you can extract things out and construct more specific queries. So this is just a quick comparison between what's possible with some of these very complete databases and the ones that are more oriented towards very specialized areas. Each, you know, has certain strengths. But as I say, HMDB, DrugBank and a few others have really been designed to try and cover all bases, because they were designed from the perspective of what a metabolomics user needs. And I think, you know, as time evolves, there'll be other databases which will be better than these ones, which will be more community efforts.
And hopefully funding that comes through over the next few years from the European Union and NIH will allow these sorts of comprehensive resources to be run by the NIH rather than by our little lab.
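The AND/OR search interface mentioned above for DrugBank and HMDB, where you combine fields without writing SQL yourself, might work roughly like the sketch below. This is an assumption about the general idea, not a description of how those sites are actually built. The table name, column names and rows are hypothetical, and an in-memory SQLite database stands in for the real back end.

```python
import sqlite3

# Hypothetical searchable columns; restricting to a fixed set keeps
# user-chosen field names from being pasted into SQL unchecked.
ALLOWED_FIELDS = {"name", "category", "target"}


def build_query(conditions, joiner="AND"):
    """Turn [(field, value), ...] plus AND/OR into parameterized SQL."""
    if joiner not in ("AND", "OR"):
        raise ValueError("joiner must be 'AND' or 'OR'")
    if not conditions:
        raise ValueError("at least one condition is required")
    clauses, params = [], []
    for field, value in conditions:
        if field not in ALLOWED_FIELDS:
            raise ValueError(f"unknown field: {field!r}")
        clauses.append(f"{field} = ?")
        params.append(value)
    where = f" {joiner} ".join(clauses)
    return f"SELECT name FROM drugs WHERE {where} ORDER BY name", params


# Tiny made-up table so the example runs end to end.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE drugs (name TEXT, category TEXT, target TEXT)")
conn.executemany(
    "INSERT INTO drugs VALUES (?, ?, ?)",
    [
        ("drug_a", "approved", "protein_x"),
        ("drug_b", "experimental", "protein_x"),
        ("drug_c", "approved", "protein_y"),
    ],
)

conditions = [("category", "approved"), ("target", "protein_x")]

sql, params = build_query(conditions, "AND")
and_hits = [row[0] for row in conn.execute(sql, params)]

sql, params = build_query(conditions, "OR")
or_hits = [row[0] for row in conn.execute(sql, params)]

print("AND:", and_hits)
print("OR: ", or_hits)
```

The user only picks fields, values and a connective; the values travel as query parameters, so the form never has to expose SQL itself.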