So, again, welcome back everyone. We're going to be talking about databases for chemical, spectral, and biological data. The lab that you went through was mostly just to give you a flavor of things. What you'll find is that in a number of the labs, we'll give you a few slides as pointers, and then we'll let you go at it and work on your own. You don't need to move at the pace that we move when we're just giving you the quick overview. And of course, the intent is that you can download the slides and use them as a guide; if you've got a separate terminal, or if you've printed them off, you can use them to get you through the process. This time, unfortunately, we had a few minor changes to the layout, so that may have caused a little bit of confusion. Being in these times of COVID is obviously a little tougher than we've all been used to. If you didn't get things to work in the hour or so that we had, there is time after this. We've set aside an hour to an hour and a half of additional lab work where you can try your hand at a few of the other data sets we have, and that is an opportunity for you to explore things a little more closely. So this is module three instead of module two. What we're going to be learning about are the different databases and database models. We're learning about a range of metabolomic databases which are used by various programs and groups to help with compound annotation. That's a common theme for today, where we've been learning how to use NMR, GC-MS, or LC-MS to assign annotations or names to those peaks and to get at least something about their concentrations or relative concentrations. Tomorrow, as I said, we'll be focusing on interpreting that data. But I'm also going to highlight some databases, particularly things called pathway databases, which are linked to tools like MetaboAnalyst but on their own also help a lot with the interpretation.
So I've mentioned two terms before, the words bioinformatics and cheminformatics. Almost all of you have heard of bioinformatics, and this is what CBW is named after: doing computational work for biological substances such as proteins, DNA, and RNA. Cheminformatics is a separate field that developed specifically for chemical compounds and for chemistry, and it followed its own path. Those two fields have only recently started interacting. Cheminformatics is actually the older field; it's been around since the 1960s. It was developed specifically for organic chemists, and back in the 1960s most software and most computer systems were commercial. So historically, the way to access chemical data was to buy it from companies. There are large organizations, Chemical Abstracts Service being one of the most significant ones, along with Sigma, MDL, and Beilstein. These are all organizations that have been making millions of dollars a year by selling chemical information. Bioinformatics is a newer field. It really picked up steam in the early 90s. It was specifically targeted at people in molecular biology, protein chemistry, and genetics, and instead of commercial software, the idea was to have web-based open access, which is what this course is about. In fact, most bioinformatics software has been funded by federal governments. Genome Canada, CIHR, the NIH, NCBI, and EBI have all put in hundreds of millions of dollars to help with the development and maintenance of bioinformatics resources and databases. So there are two different cultures, and I think that's important for understanding some of the challenges we're still facing in metabolomics, which tries to blend cheminformatics with bioinformatics. Now, key to almost all of these techniques, cheminformatics, bioinformatics, whatever, is a database.
So databases help consolidate information, they link information, and they help with data retrieval and data queries. Databases are supposed to contain reference data, so they can contain numbers or images. Obviously in bioinformatics, they include sequences. Many people use databases as a collection to help train and test predictive algorithms. They're also used frequently for things like similarity searching, whether it's looking for similar images, which probably all of you have done on Google, or similar spectra, similar structures, similar sequences, or similar text. As we'll show you, you can do this in many resources. It's common in bioinformatics, and it's common in cheminformatics. The other thing that databases do is not only assist with the training of prediction algorithms, but also facilitate prediction itself. In bioinformatics, people use them to predict things like phylogeny or relationships or properties, and we do the same thing in cheminformatics to predict structures, function, spectra, and relationships. So the core of a lot of the informatics world is based on databases. I've been involved in the creation of many, many databases over my lifetime, and this is how many of them evolve. In many cases I start making a database on my computer, and it's done in Excel as a spreadsheet, or even just as a Word file. It's a hobby, and then you realize that some people may actually need this. So you start expanding the database. In my case, I usually bring on some graduate students or hire some summer students, and we start adding to the database. We make it a little more extensive. We make it relational. We put it into a warehouse. And then if the database seems to be really taking off, we may end up creating not only a curated database but an archival database. That's a database where other people can deposit data. It means there's open deposition, whereas a curated database is still an internal database.
As you move from the hobby database to the archival database, from a tiny collection of sequences to something like GenBank, or from a tiny collection of chemical structures to something as large as PubChem, there's a progression in terms of coverage and depth. Some databases tend to be very deep but not very broad. Others can be very broad but not very deep. And generally, the larger they get, the less depth archival databases have. As the size of the database increases, so too does the size of the user community, and so do the costs of maintaining the database. So archival databases like GenBank, UniProt, and many others are incredibly expensive to maintain. The work in my lab has been creating curated databases. Those are a little cheaper to maintain, but they still cost a lot in terms of time. As you go from the hobby database to the archival database, things have to become more standardized and more automated, and you also expand the querying capabilities. As I mentioned, and this is a slide you've seen before, the key problem in metabolomics traditionally has been the lack of databases, or servers that access databases, where you could upload spectra and get an answer. Today you were given examples where you could upload spectra into Bayesil, into GC-AutoFit, and into MetaboAnalystR, and get metabolite identifications and/or concentrations. So that problem is slowly being overcome, but all of those servers depend on databases. The challenge with metabolomics is still that most of the data on chemicals is in textbooks or print journals.
There have been hundreds of thousands of papers, journals, and books published over more than a hundred years covering clinical chemistry, classical biochemistry, natural product chemistry, toxicology, and environmental chemistry. The other thing to remember, if you're coming from the field of proteomics or genomics, is that metabolomics lags behind genomics and proteomics. Maybe 20 years is an exaggeration, but it doesn't have the infrastructure that those other fields have. The other thing is that because of the diversity in chemistry, you're often trying to deal with different communities and user needs. Some chemical databases are great for organic chemists but useless to physicians. Some spectral databases are great for NMR spectroscopists but useless to mass spectroscopists. Some databases are structured only for cheminformaticians and don't allow bioinformaticians to work with them. Some are very non-standard, some are very standards compliant. Some are what is called FAIR: findable, accessible, interoperable, reusable. Most of them are not. So this broad collection of communities also makes the creation of databases in metabolomics challenging. I'm going to talk about five different groups or classes of databases for metabolomics: general compound databases, experimental spectral databases where they've got reference spectra, predicted spectral databases, organism-specific databases, and pathway databases. Compound databases usually have large numbers of compounds from many different sources and organisms. They are for level three identification, meaning simple or putative identification, or identification of a compound class. The experimental spectral databases are the reference databases. They allow you to get up to level two identification. The predicted MS/MS spectra potentially allow you to get to level two.
Organism-specific databases greatly narrow down the possibilities and often give you a pretty solid idea of what you're looking at. And pathway databases help with data interpretation. I've mentioned this before and I'll just reiterate it. Level one, level two, level three, and level four are the levels that we use for metabolite identification. You can use them for both NMR and MS. These are part of the Metabolomics Standards Initiative. They're rewriting these levels over the next year, and slightly modified versions will be coming out soon. As I said before, level one requires an authentic compound, an authentic standard, so people very rarely reach that. Most people are identifying compounds at level two. This means you're matching to reference spectra; these are the spectral databases. Level three means you're matching to molecular formulas or mass-to-charge values. And then the vast majority of compounds, especially in untargeted studies, fall into level four. They're simply unknown. They're features: they have a mass, they maybe have a retention time, but that's it. We highlighted the strength of having molecular formulas, using Seven Golden Rules and some of the other software tools that allow you to take high-resolution mass data and get a pretty good estimate of what the actual molecular formula is. Once you have a molecular formula, you can start querying databases to ask what the structure could possibly be. Sometimes that can be very helpful. Two of the largest databases for chemical formulas, and chemicals in general, are ChemSpider and PubChem. PubChem has got almost 100 million compounds. There are probably around 50 million compounds that are, quote, hidden: they're in the database but not accessible to everyone.
There are certain limitations: compounds have to be less than a thousand atoms, which means that they could have molecular weights on the order of 5,000 or possibly 10,000 daltons. So it's not purely small molecules. It's a collection of many, many other databases, so it's a database of databases. They distinguish between substances and compounds. Substances can have duplicates, and they can be impure compounds or pure single compounds. There's a lot of information, and every year PubChem gets richer and richer in terms of the data that it contains. There's a lot of effort to expand on it. It's quite searchable. You can search by name, chemical formula, molecular weight range, structure, and a whole variety of other search capabilities. ChemSpider is a little different. It's maintained in the UK. It's a smaller collection of compounds, about 20 or 30 percent smaller, but it has more data sources and it draws on more data. So it will have spectral data and pharmacological links. It links to MeSH, and it has links to Wikipedia articles along with some descriptions. It's spotty, however, and ChemSpider hasn't been as well maintained as PubChem over the last couple of years, partly because of some changes in staffing. You can do molecular weight searches. You can do formula searches. You can look at ranges. Here I've typed in a search for molecular weights between 89 and 89.099 daltons. I get a list of, in this case, 408 possible compounds in this molecular weight range. And of course you could spend a few hours combing through them to see if any of these seem reasonable or not. As I mentioned before, the vast majority of compounds in PubChem have never left the lab. In fact, some of them may not even exist in reality. They include lots of, I would say, putative compounds that people have listed in chemical libraries, where there's no evidence that those chemical libraries actually contain those compounds.
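The molecular weight range search described above can be sketched in a few lines. This is only an illustration: the `Compound` class, the `search_by_mass_range` function, and the five-entry table are hypothetical stand-ins, not the PubChem or ChemSpider interface. The masses are monoisotopic values in daltons.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    name: str
    formula: str
    mono_mass: float  # monoisotopic mass in daltons


# Hypothetical toy table; a real compound database holds millions of entries.
TOY_COMPOUNDS = [
    Compound("alanine", "C3H7NO2", 89.0477),
    Compound("sarcosine", "C3H7NO2", 89.0477),
    Compound("beta-alanine", "C3H7NO2", 89.0477),
    Compound("lactic acid", "C3H6O3", 90.0317),
    Compound("glycine", "C2H5NO2", 75.0320),
]


def search_by_mass_range(compounds, low, high):
    """Return every compound whose mass lies in the closed range [low, high]."""
    return [c for c in compounds if low <= c.mono_mass <= high]


# Same window as the search in the talk: 89 to 89.099 daltons.
hits = search_by_mass_range(TOY_COMPOUNDS, 89.0, 89.099)
for hit in hits:
    print(hit.name, hit.formula, hit.mono_mass)
```

Even in this toy table, three isomers with the same formula fall inside the window, which is why a mass or formula match on its own only narrows the list of candidates.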
What I want to highlight here picks up on what I was saying about the type of compounds. Trying to match molecular weights, or molecular weights derived from m/z values, to these databases is a mistake. The estimates are that less than 0.01% of these compounds have actually left the lab. If they haven't left the lab, they are not in the environment. They're not going to be found in trees or plants or in your body or in parasites or microbes or waterfowl. So if you have a database where 99% or 99.9% of the compounds cannot, by definition, be in the environment, this is not the way to look for compound matches. What you can do is narrow things down to databases that are more focused on biological compounds. ChEBI is an example. It's a collection of chemicals of biological interest, so it includes plant compounds, insect compounds, mammalian compounds, and even drugs. There's KNApSAcK, which covers a whole range of different plant compounds. There's LIPID MAPS, which covers a whole range of different lipids from a wide variety of organisms. And then there's another one called MyCompoundID, which I'll talk about. ChEBI has around 60,000 compounds, collected from a wide variety of areas. It's been focusing more and more on enhancing its ontology. It focuses a lot on getting proper synonyms, formulas, and structures. And you can search ChEBI by name, formula, and structure. So if you have a biological sample and you're trying to figure out what your molecular formula or your m/z value matches, this may be a better database to search than PubChem or ChemSpider. So David, sorry, quick question. How much overlap is there between these databases? There's no attempt whatsoever to be exclusionary, to be just in one database and not the other? That's right. You will find that most of the compounds in ChEBI are in PubChem or ChemSpider.
But you obviously can't detect that. If you do a mass search or molecular weight search or formula search, PubChem will return all of them. You can't say, just select the ChEBI compounds in PubChem. So you basically have to go to the three databases or four databases or whatever. That's right. Yeah. So ChEBI, as I say, includes many things. If you are looking at, let's say, E. coli, ChEBI doesn't cover all of the E. coli metabolites. It doesn't even cover all the human metabolites. It covers a few of them, it covers a few E. coli metabolites, and it covers a few plant metabolites, and these are chosen more or less at random. I'm not sure why and how they choose them, but it's a collection of different biological compounds. KNApSAcK has a focus on plants, but it's all plants. So if you're trying to look at carrots and bananas or ginseng, it'll have them, but not necessarily always tied to the species, and not necessarily very thoroughly for certain types of plants. LIPID MAPS is a great resource for lipids, but it doesn't cover organic acids or amino acids, and it includes lipids from a variety of different organisms. So not just mammalian lipids; you'll get insect lipids, fish lipids, and plant lipids. I'll talk about MyCompoundID a little later. Well, now I guess I will. This is a very different database. It's one where there are no structures, but there are molecular weights or m/z values. It took a small number of compounds from the Human Metabolome Database almost 10 years ago, about 8,000 of them, and ran them through 76 different metabolic transformations, first through one pass and then through two passes. So roughly 76 times 8,000 gave them about 400,000 feasible structures, and then 76 times 375,000 gave them about 10 million feasible structures from the second pass.
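A minimal sketch of the one-pass and two-pass idea follows. It assumes a hypothetical subset of five transformations rather than the 76 the database uses, and a hypothetical `one_pass` helper; the mass shifts are standard monoisotopic values. Taken literally, 76 × 8,000 is 608,000 and 76 × 375,000 is 28.5 million, so the quoted totals of roughly 400,000 and 10 million are smaller than the raw products. The talk does not say why; one guess, and it is only an assumption, is that not every transformation applies to every compound or that duplicates are merged.

```python
# Hypothetical subset of biotransformations with monoisotopic mass shifts (Da).
# The database described in the talk uses 76 transformations; five are shown here.
TRANSFORMATIONS = {
    "hydroxylation (+O)": 15.9949,
    "methylation (+CH2)": 14.0157,
    "acetylation (+C2H2O)": 42.0106,
    "sulfation (+SO3)": 79.9568,
    "glucuronidation (+C6H8O6)": 176.0321,
}


def one_pass(parents):
    """Apply every transformation once to every (label, mass) pair."""
    return [
        (f"{label} -> {name}", round(mass + shift, 4))
        for label, mass in parents
        for name, shift in TRANSFORMATIONS.items()
    ]


parents = [("caffeine", 194.0804), ("alanine", 89.0477)]
first = one_pass(parents)   # 2 parents x 5 transformations
second = one_pass(first)    # every first-pass product transformed again

print(len(first), len(second))
print(76 * 8_000, 76 * 375_000)  # raw upper bounds for the counts in the talk
```

The output of each pass is only a label and a mass, with no structure attached, which mirrors the point that this kind of database gives hints rather than verified structures.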
As I say, these are not structures, but the key with MyCompoundID, or the revelation, was that many of the unknown compounds that you can't identify are probably metabolites of metabolites. These are [inaudible], or biological transformations of drugs, or biological transformations of food. More and more this seems to be the case: we're finding that these unknown, unidentifiable metabolites are just variations of known metabolites that have gone through phase one, phase two, or microbial biotransformations. MyCompoundID is accessible. It hasn't been updated that recently, but it has helped a number of people suggest or identify possible structures. These are examples of the molecular weights and the transformations. It doesn't generate a structure, and the structures aren't necessarily verifiable, so it gives a hint. It's not as thorough as, say, ChEBI or HMDB or even PubChem, in that those provide structures and references and a whole bunch of other things. But this does allow you to look at possible biotransformations, and for the real compounds it started with, at least those that have an HMDB identifier, it links directly back to HMDB. So I think it's important, if you are using molecular weights or molecular formulas or m/z values to identify compounds, that you choose the right database, that you treat the results with suitable skepticism, that you understand these are only good for level three identification, and that information about the biology or the organism is key. People doing higher quality metabolomic studies work with compound identification via spectral matching.
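Matching an m/z value against a compound list, as discussed above, can be sketched as follows. The sketch assumes positive-mode protonated ions, written [M+H]+, and a 10 ppm tolerance; the `match_mz` and `ppm_error` helpers and the four-entry candidate list are hypothetical and not taken from any particular server.

```python
PROTON = 1.007276  # mass of a proton in daltons


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def match_mz(observed_mz, candidates, tol_ppm=10.0):
    """Return (name, ppm error) for candidates whose [M+H]+ m/z is within tol_ppm."""
    hits = []
    for name, neutral_mass in candidates:
        theoretical_mz = neutral_mass + PROTON  # assumes a singly protonated ion
        err = ppm_error(observed_mz, theoretical_mz)
        if abs(err) <= tol_ppm:
            hits.append((name, round(err, 1)))
    return hits


# Hypothetical candidate list: (name, neutral monoisotopic mass in daltons).
candidates = [
    ("alanine", 89.04768),
    ("sarcosine", 89.04768),
    ("lactic acid", 90.03169),
    ("serine", 105.04259),
]

hits = match_mz(90.0552, candidates)
for name, err in hits:
    print(name, err)
```

Both alanine and sarcosine pass the tolerance check because they share a formula, so a match of this kind supports a level three identification at best; telling the two apart needs spectra or an authentic standard.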
So these next sets of databases are more important, and this is what most of the community has now moved towards, which is to use experimental spectra and experimental spectral matching. This is level two identification, and you can do this for NMR or for mass spec. Level one still requires authentic compounds, so unless you've spent hundreds of thousands of dollars getting a library of authentic compounds, most of us are stuck at level two identification, where we match to spectral databases. In terms of NMR, there are a number of databases specific to NMR: the Spectral Database in Japan, or SDBS, the BioMagResBank, and NMRShiftDB. SDBS is maintained in Japan and is mostly for organic compounds. It has carbon NMR and proton NMR, it even has infrared spectroscopy, and there's MS data as well. It's a bit of a weird database because it limits how many times you can access it and it has various rules. It doesn't have a whole range of frequencies for NMR, so that's a little limiting as well. But there are lots of spectra, 14,000 NMR spectra and 54,000 IR spectra, covering quite a number of compounds. Most of the compounds in this database are not metabolites; they arise from organic synthetic efforts. On the other hand, a database called the BioMagResBank does have compounds that are metabolites. This database was originally established for proteins and protein NMR. My background is in protein NMR, so I was very much aware of this database and helped launch it many years ago. Over time they realized that people also wanted to put metabolites in it, and when they introduced metabolites the popularity of this database shot through the roof. So it does provide a lot of data. About 400 reference compounds have an average of five to six NMR spectra each. They have 1D and 2D spectra, and they have information about the names, synonyms, and SMILES. Initially most of the focus was on plants, but they've now expanded to many mammalian metabolites, and they have
assigned all of the spectra, which is a lot of work. So this is a useful tool not only for compound identification by NMR but also for doing things like prediction. Another database, which is maintained in Germany and was originally started by Chris Steinbeck, who ran MetaboLights and ChEBI for many years, is NMRShiftDB, or in its second version NMRShiftDB2. It also has a large number of spectra and structures. Again, most of these compounds are organic synthetic ones, not metabolites, but it has tools for chemical shift prediction, you can search by structures or chemical shift peaks, and it has a number of chemical shift assignments. So those are the NMR spectral databases. Obviously there's an NMR spectral database in HMDB, there are NMR spectral databases used in both MagMet and Bayesil, and there are NMR spectral databases in the Chenomx software as well. If we jump to mass spec, we've talked a little bit about this before. There's the NIST database, NIST 20, for GC-MS as well as LC-MS. I've mentioned the METLIN database, which is linked to XCMS. And then there are the MassBank databases: MassBank Europe, MassBank Japan, and MassBank of North America. These have large collections of experimental and even theoretical spectra. The METLIN database is getting up to around 850,000 compounds, which is absolutely astonishing, because just a few years ago they only had about four or five thousand compounds. They were apparently able to access large numbers of lipids and natural products from [inaudible] libraries, and they have been collecting literally millions of MS/MS spectra. They've also run at least 80,000 compounds for which they have experimental MS spectra. It has recently been made available through commercial purchase, although we've been inquiring for a couple of months now and they say it's almost ready, and it looks to be about five thousand dollars. Anyone can access the METLIN database, but you have to log in and register. So it's a
remarkable resource, and it's expanded very significantly in the last couple of years. You can do mass searches, you can look in neutral, positive, and negative modes, you can set the tolerance, and you can select which adducts you want. You can click on an entry and see the mass spectrum, but unfortunately you can't download the spectra, so it's still not as accessible as I think the open source community would like. Now, MassBank is another resource, and it is open. You can download the spectra, and it is essentially networked across Europe, Japan, and the US. It covers spectra from many platforms, just like METLIN, though with a much smaller number: about 80,000 spectra from about 14,000 compounds. It's an archival database, whereas METLIN is a curated database. All the spectra in the METLIN database come from Gary Siuzdak's lab, whereas the spectra in MassBank, MoNA, and others come from many, many labs around the world. You can search by keywords, you can select instrument types, you can search by exact mass or by formula, there's a tolerance setting, and you can search by InChIKey or SPLASH, looking for peaks and peak lists. It's a fairly simple interface and, as I say, it's similar in both the European and the Japanese versions. You can enter a list of peaks and then send off a query to see what matches, and based on this set of queries, these are the molecules that matched the spectra we queried. So you can manually enter peak lists, manually query, and get the results for matching molecules. Another version of MassBank is called MassBank of North America. It's maintained by Oliver Fiehn, it includes many other sources, and it is substantially larger than MassBank Japan and MassBank Europe, partly because it includes a lot of predicted MS/MS spectra. I'll get into the idea of predicted spectra a little later. It uses LipidBlast to predict lipid spectra, but it also has experimental spectra. It rates the quality of the spectra, and you can do
standard upload, download, and spectral searches and queries. These are some of the statistics on it and some of the sources it draws from: the Human Metabolome Database, MassBank, GNPS, PRIMe, and data in PubChem. It's used in software like NIST and MS-DIAL and by other vendors, and it follows the FAIR principles. If you look at the number of spectral databases where they have authentic MS/MS spectra, not just MS spectra, it breaks down this way. MoNA has a very large number of spectra, many of which are theoretical, so it's about 12,000 compounds where there are authentic MS/MS spectra. METLIN has about 80,000 compounds with authentic MS/MS spectra. CFM-ID, which I'll talk about, has about 21,000 compounds with authentic MS/MS spectra, but both CFM-ID and MoNA also have predicted spectra and predicted compounds. There are other resources like mzCloud, which has recalculated MS², MS³, and MS⁴ spectra for about 8,000 compounds. The NIST database has around 13,000 to 14,000 compounds, MassBank has 14,000, and so on. There's a fair bit of redundancy, so all the databases together have about 85,000 compounds with MS/MS spectra. So even though the numbers are impressive, like 500,000 or 200,000 compounds, those numbers include compounds with only MS spectra or with theoretical spectra. Certainly the size of MS spectral databases is greater than that of NMR databases, but it is not sufficiently large to help with the identification of many metabolites, because many of these compounds aren't actually metabolites. They're just things that you could pull off the shelf or that were easily accessible. So we have this numbers game. Based on the number of chemical formulas, we can expect there are about 32 billion chemicals, of which about 2.4 billion would be chemically feasible. We know that there are about 160 million compounds that have ever been registered in PubChem or CAS. We know that there are about 1.5 million spectra in LC-MS and GC-MS. We know that
there are about 300,000 chemicals with EI-MS data. There are only 85,000 chemicals with high-resolution MS/MS data and about 800 metabolites with high-resolution NMR. So as we get down this filter, there are relatively few compounds for NMR, not that many for MS/MS, and somewhat more for EI-MS, but only a fraction of the actual chemical universe, or even the metabolomic universe, is covered by these. How many compounds have patents on them? You can include patented compounds, and those are often accessible. Information about them is still accessible; you're just not supposed to synthesize and sell them. So how many compounds have been patented, I guess, is my question. I am not sure. I think it's several hundred thousand. Okay, interesting. Sorry for this. No, that's fine. So given the lack of experimental spectra, which are needed to do compound identification, we're facing a dilemma in the world of metabolomics. We could proceed systematically, just as the Human Genome Project systematically sequenced the human genome and all these other genomes. In the case of metabolomics, we could try to synthesize all the known metabolites for which we have structures, along with the likely ones. We could also modify them, running them through artificial livers and artificial guts, which will generate modified versions. We could purify them from natural products. But this is a massive effort. It cost a billion dollars to sequence the human genome. The estimate right now is that it would take between five and ten billion dollars, at least, to try to generate the estimated five million compounds that we need to complete the metabolome, and it's not just the human metabolome, it's the plant and microbial metabolomes too. That's not going to happen, and it's not going to happen in a short period of time either, because synthesizing, isolating, and purifying chemicals is not fast, not as fast as sequencing. So the feeling is
that the only way that metabolomics can play the same game, and get around all these unknowns or unknown unknowns, what's called the dark matter of the metabolome, is to move towards computational predictions: to computationally predict NMR and MS spectra, and even to predict compounds and their biotransformations. That's a lot cheaper. Instead of billions of dollars, it's measured in hundreds of thousands of dollars, and it's also feasible. That segues into this idea of predicted spectral databases, where we take the compounds that we know about and we generate their likely fragmentation patterns. Believe it or not, the very first artificial intelligence programs, written in the 1960s, were designed to do MS/MS prediction. So the field of artificial intelligence and machine learning was launched with the concept of predicting MS/MS spectra from compounds. Back in the 60s they probably couldn't get very far. Mass spectrometers weren't great and computers weren't great either. But efforts have continued, and one of the more successful efforts is this thing called CFM-ID, for competitive fragmentation modeling. It's a project that grew out of work by some people here at the University of Alberta to reanalyze how we predict MS/MS spectra and to look at how accurately, given a compound, we could predict both the EI-MS and the ESI-MS/MS spectra. It's now evolved into a web server and it's gone through several iterations. It's trained on a large number of spectra, some of them from METLIN and some of them from other resources, at different collision energies. As I mentioned, to get MS/MS spectra you have to have a collision cell, and you have different energies that fragment the molecule. The data is from Q-TOF MS/MS, and it essentially learns by experience. It slowly learns, through a hidden Markov model, where things should fragment and how things should fragment. It's been entered into a number of contests and has won, and has been progressively
getting more and more accurate. Right now it has about 330,000 computationally generated ESI-MS/MS spectra from 100,000 compounds in the HMDB. It also has predicted spectra from KEGG, and it also contains a lot of experimentally collected spectra from about 22,000 compounds. In addition to using machine learning techniques, the latest versions of CFM-ID have also used rule-based methods to predict MS/MS spectra for lipids, acylcarnitines, polyphenols, and others. So it's been getting progressively smarter and progressively faster, and it has been used to generate more and more MS/MS spectra for things that we will probably never get experimental MS/MS spectra for. It has also generated many, many EI-MS spectra, but I won't get into that now. An example of the lipid predictions is shown on the right. This is CFM-ID, where the blue is the experimental spectrum and the red is the predicted spectrum. On the left side is the LipidBlast version for the same molecule; again, blue is observed and red is predicted. We evaluate the quality of a predicted MS/MS spectrum through a Jaccard score. You can also use something called the Dice score. A Jaccard score of one means it's perfect, and a Jaccard score of 0.28 means it's not great. CFM-ID turns out to be quite accurate with both its machine-learned and its rule-based methods. It's gone through several iterations. In 2006 it was able to get about 55 to 60 percent correct on a special test. It did very poorly on lipids, and the Dice or Jaccard scores weren't perfect. So we made improvements for the lipids, and then it went through the same test again, on the same test data set, and the performance jumped to around 72 percent. More recently we've upgraded the machine learning tools and used additional neural nets, and the performance improved across the board in terms of predicted spectra. Now it's up to a little under 80 percent at being able to identify a compound from a mass spectrum, the predicted
mass spectrum. So let's say you've got your MS/MS spectrum and you say, I have no idea what it is; it doesn't match anything in the database, or anything I can find. You can use CFM-ID, and the odds of getting a match or prediction are about 75 or 80 percent.
So with CFM-ID you can do compound identification if you've got spectra. You can also take spectra that you've collected and it will do peak assignments. And you can take structures that you've drawn, or structures that you're thinking of, or structures that people have sent to you, and it will do MS/MS spectral prediction. So there are three tasks that it can do. Most people are interested in the compound identification, but people who really like mass spec like the peak assignment possibilities, and people who are doing de novo compound identification really like the spectral prediction.
If you're trying to do identification, you can say, okay, I want to identify a compound purely from the predicted database, or you can choose from its own collection of spectral databases, and there are up to seven or eight of them. You tell it what type of spectrum, the ion mode, the parent ion mass, and the mass tolerance, and then below that you can paste in the different peaks. So you can put in the 10 electron volt, the 20 electron volt, and the 40 electron volt collision energies. You can do it for one or two or three. Actually, the more spectra you have at different collision energies, the more accurate it is, but most people just submit data for one collision energy. It takes a few seconds, and once it has run it provides a list of the compounds that it matches, the overall score, the similarity, and then links to different databases. It also provides information about the source: whether it's predicted or known, and which database in particular. You can also compare your input spectrum and its peak intensities to the
predicted spectrum, and vice versa. As you can see, it's not perfect. The intensities don't always match, but in the case of mass spectra you never see a perfect match. You can hover over things and it will identify the specific fragments; it is able to figure out what the fragments are.
CFM-ID is, I guess, part of the leading trend towards in silico or reference-free metabolomics. It's basically the community throwing up its hands and saying, we surrender: we're never going to be able to afford, or ever be able to get, all of the reference spectra for all the compounds that we know exist in nature. So we have to move to something that's more computationally based. And by not having to resort to synthesizing or re-synthesizing compounds, of course it's cheaper, it's easier, it's faster.
The other point about CFM-ID is that it does provide provenance. It tells you which source organism things are from. My beef has been that many databases, including ChEBI, PubChem, and LIPID MAPS, and databases like MoNA and METLIN, don't tell you where the compound is from, whether it's E. coli or plant or human or food or contaminant. Without that information you can end up thinking you're seeing licorice products and drug products in E. coli, and they don't do drugs and they don't eat licorice. So this goes to the point about trying to use organism-specific databases, and this is an important thing that the community really needs to move towards.
So, as I answered, the most common mistake was using large databases, thinking that big databases are better but not realizing that most of the compounds in big databases are not biological. But even searching against these really excellent databases like NIST or METLIN or MassBank or MoNA can still lead to lots of false positives, because they mix the compounds or spectra from many, many different organisms. They don't distinguish them. So they throw metabolites from insects, from fungi, from medicinal
plants and microbes, from foods and from pollutants, and they throw them all together and say, you know, take it. They don't have a clear provenance; we don't know where they're from. And when we do metabolomics we usually know which organism we're looking at. We also often structure things so that if it's a lab rat we know exactly what we fed it, if it's a microbe we know exactly how we've grown it, and if it's humans we often put people on defined diets, and we know that there are certain things that humans cannot eat.
So I think when you're trying to choose a database, first of all choose databases that have biological compounds. That's why KNApSAcK and ChEBI are good choices. But then what you really want is not only the biological compounds; you want compound databases that are specific to your organism. So if you're looking at ducks, it would be nice to have a duck database. If you're looking at E. coli, it'd be nice to have an E. coli database. If you're looking at humans, it would be nice to have a human database. As I said, you see all too often, at least when I review papers, people identifying cosmetic compounds in yeast, or drug compounds in bacteria grown in defined media, or rats with licorice derivatives like anise, when that's not possible given the food they eat. So really what you want to do is search against spectral databases, whether experimental or predicted, with real provenance data.
This has been a major effort in my group for many years. At the University of Alberta we've been creating organism-specific or purpose-specific databases for 15 years. The first database we started with was DrugBank. It's actually the most popular database of the suite here; it links drugs to where they're found, to drug metabolites, and also to their targets. The Human Metabolome Database followed shortly after, and it's focused purely on human metabolites; that's called HMDB. There's a yeast metabolome database, that's called YMDB,
an E. coli metabolome database, a food database, a phytochemical database, a contaminant compound database, a toxic exposome database, polyphenol compounds, an exposure database, and then there are several pathway databases which I'll talk about later, PathBank and SMPDB. So these are a different breed of database, but they're designed to be more suitable for metabolomics. ChEBI wasn't designed for metabolomics, PubChem wasn't designed for metabolomics, ChemSpider wasn't designed for metabolomics, but these databases were.
The Human Metabolome Database grew from what was called the Human Metabolome Project, started in 2005. That project obviously didn't get as much publicity as the Human Genome Project, but it did generate a lot of data, and it's been used to create databases like the Human Metabolome Database, DrugBank, and FooDB. It has metabolites and metabolite concentration ranges in urine, cerebrospinal fluid, blood, and tissues. It has detailed descriptions of all those metabolites and a lot of biochemistry. The project itself also focused on developing technologies like Bayesil, GC-AutoFit, and other tools, including even MetaboAnalyst, to help improve metabolome coverage, data analysis, and throughput. So lots has come from the Human Metabolome Project, and a lot of the things you're using today and will use tomorrow originated with it.
When we started the project and were trying to figure out how big the human metabolome was, the only databases we could resort to were KEGG and HumanCyc, and at the time they only had 690 metabolites. When we created the first version of the Human Metabolome Database we tripled that. We got up to 2,180 compounds, and we were pretty excited, thinking that was pretty much the entire human metabolome. And boy, were we wrong. In subsequent updates we've been adding more and more metabolites, and as of 2018 there are 114,000 in the last release. We'll be coming up to a new release in another year, and by then we expect there'll be
about 200,000 compounds, along with close to two and a half million predicted metabolites.
You've seen this picture before. It just breaks things down into the number and types of compounds there are. Within the human metabolome there are food components and drug components and toxic environmental components. There are also literally tens of thousands of environmental contaminants: things that you get from your clothing, from the air, and from what you drink. So the number of compounds just keeps on growing. Some of them are at levels that are too low to detect, but others are actually quite prominent, and we still don't know why or what they are.
I'm going to talk about a few of them in a little more detail: the Human Metabolome Database (HMDB), FooDB, DrugBank, and the toxic exposome database. Francis and others asked how you distinguish between microbial metabolites and human metabolites. It's hard, because there's a lot of overlap, but in HMDB we have clearly distinguished a lot of gut microbial metabolites. We've tracked a lot of the normal and abnormal concentrations. It's linked to many different diseases. There are a lot of NMR spectra and a lot of MS spectra, both ESI-MS and EI-MS. With the HMDB you can also search against protein sequences, because we try to link every metabolite to the appropriate synthetic genes or proteins. You can do spectral searches, MS and NMR. You can browse pathways and use pathway searching tools. You can search by structure. You can search according to specific fluids. You can do all kinds of complex relational queries and standard text searches, and everything's downloadable.
These are examples of some of the pathways, the big blue thing there, and examples of some of the spectral viewing tools, which use JavaScript applets and allow you to browse or hover over spectra, along with the detailed descriptions and the browsing tables. You can open up a given metabolite and you'll see up to a hundred and some data fields
that cover nomenclature and synonyms, formulas and structures, 2D and 3D structures, the taxonomy using ClassyFire, an ontology which is still being developed, all kinds of predicted information, as well as spectral data. All of the spectra can be searched. You can put in peak lists, choose different combinations of adducts and different subsets of the database, and it will give you the matches and the scores from these searches. You can do not only MS spectra but also MS/MS, and that will search against both the known and predicted MS/MS spectra, again giving you a score; it's called the fit or purity. You can do NMR spectral searching: you can put in peak chemical shifts and it will find matches or close matches to those. You can search by structure: there are applets, so you can draw a structure and it will look for similar structures, scoring them by their structural similarity. You can go to various biofluids. There are many different conditions and lots of information about the normal and abnormal metabolites for many different biofluids.
As I mentioned, you find you have to serve different groups. A lot of the work for the biofluids is oriented towards what physicians need, the work on the spectra is geared more towards analytical chemists, and a lot of the information on the molecular biology and biochemistry is oriented to the bioinformaticians who want to mine HMDB.
These are some data tracking the changes in HMDB from 2006 to 2018. You can see that in the early days the numbers of compounds were kind of small, and information about diseases and disease links was kind of minimal. The stuff in red is probably the more relevant stuff: the number of spectra, now up to 430,000, the number of compounds with predicted spectra, the GC-MS spectra, the large number of pathway maps, and a large number of reactions and ontology terms. We've been fixing the structures. Unfortunately, most of the
commercial software that's used to generate lipids just kind of randomly generates them, so we've been re-rendering lipids so that they look a little more pleasant or standard. We've introduced, as I say, the spectral hovering, so people can hover over MS and NMR spectra, or peak spectra for GC-MS, LC-MS, and soon NMR. We've improved the adduct calculations, fixed up some small errors that we noticed, and more than doubled the number of adducts that are calculable. The descriptions have been updated and improved, and we're still continuing that. I spent most of this weekend updating about 100 descriptions in HMDB. So curating requires lots of work, and we actively continue to curate these databases.
DrugBank, as I said, is the most popular resource. It gets around 40 million hits a year, and it's used by most of the world's drug companies. When COVID hit, lots of things started lighting up in DrugBank because everyone was using it. The reason they use it is that it links drugs to drug targets, and a lot of people find that very useful for something called drug repurposing. This is what they're trying to do for COVID, because it will take 10 years to approve a new drug for COVID, but if we can find an existing drug based on its binding properties, you have a potential lead. It includes both small molecule drugs and experimental drugs. It has lots of information on mechanism of action, metabolism, pharmacokinetics, drug metabolites, and drug targets.
The toxic exposome database I briefly mentioned. It's a smaller database, but it's focused on the toxic compounds in drugs, pesticides, herbicides, disruptors, solvents, and carcinogens: things that are nasty and that you don't want to have in your body or in the water or in any other animal. It's modeled very much like DrugBank, so it has a lot of information about the toxic targets, the genes, and the chemicals. It also has a lot of reference spectra.
One of
the more popular databases is called FooDB. This slide is a little outdated; there are now more than 70,000 compounds in FooDB. It's been updated. It still hasn't been published, but in fact it's our third most popular database, behind DrugBank and HMDB. The fact that you can find literally thousands of compounds for anything from an apple to an orange to your cereal is quite intriguing, and it's certainly taken us in new directions to look at what is in your food.
As we've been evolving more towards the microbiome, and also looking at things like wine and beer and bread, yeast plays an important role, and so we've created this Yeast Metabolome Database. It covers a lot of data for wine compounds, yeast metabolites on different growth substrates, a lot of the gene and metabolite associations, many spectra, and now many more pathways, probably several hundred pathways. E. coli is also an important microbe, certainly a model microbe, and this is our effort to also extend to the microbiome metabolome.
So sorry, David, I need to interrupt you. If you go back a slide to the yeast page: I totally agree with your gripe about the integration of many of these databases with their various organisms and so forth. Do you actually link or subclassify these metabolites? Because you can take this to a whole different level if you want, not to the genus and species level but to the strain level with respect to metabolites.
Yeah, so with yeast we're mostly modeling after S. cerevisiae, so we don't have some of the specific strains. And when we started the database there wasn't a whole lot of strain data.
And what about linking it with the folks at SGD, the Saccharomyces Genome Database, or do you already do that?
I think we provide some links to their genome data. I don't think they link back to us.
But yeah, I'm on their SAB, so I could
work on that with them, with you.
Right, yeah. Okay, super, thanks.
And the thing is, I totally agree with you. As you know, I'm also an editor on the Database journal, and I see a lot of databases popping up all over the planet. I'm not challenging you on this front; this is a general comment. There are a lot of people that just reinvent the wheel, and so the ability to actually integrate with existing resources and make each other stronger and better is, I think, really important. It requires extra work, because it requires APIs; it requires these databases to programmatically talk to each other. As you mentioned a few times, the curation is very labor-intensive, and it helps if you can have programs that actually make the links between the various types. SGD does do pathways, and they do keep strain information when it's available. One of the big industries that's interested in this, of course, is the wine industry. The wine industry is using yeast strains which are proprietary, but they obviously do a lot of metabolomic analysis and so forth. So it's a very interesting space to explore, I'd say.
Yeah, I think so, and those are great points. You're right, I think in the world of databases there's a lot of reinvention of databases, or databases of databases. I think one of our challenges, but also one of our strengths, is that the databases we create are, I guess I'll say, original. They are the source for a lot of other databases.
Oh yeah, absolutely. And I think the credit, the mark of such databases is that they get used. Databases that don't
bring anything don't get used, and the ones that do bring new things, new integrations, new perspectives, new link-outs and link-ins and so forth do get used. Having something used millions of times a year is definitely a true sign of success, and it shows you brought something new to the table that didn't exist before. I've worked with many databases myself.
Yeah, for those of you who don't know, Francis cut his teeth at the NCBI and worked on many of the key databases that everyone now uses.
I used to be in charge of GenBank, so you may have heard about it.
Yeah. So we'll just carry on here. I think the point is that the E. coli Metabolome Database is really a very useful segue. There are a lot of people in the class who are doing microbial metabolomics, and this was designed specifically for that. Of course, the number of pathways has increased quite a bit over the last year or two, and it has lots of reference spectra. A lot of people want to use EcoCyc, but that database uses a lot of literature-curated material, which has things like ampicillin as an E. coli metabolite. That's because people put in ampicillin to stop E. coli; it's not an E.
coli metabolite.
So tools and databases like these organism-specific or purpose-specific ones that I've mentioned allow you to produce and annotate those lists of metabolites. We've given you some examples in the lab, and I've shown how some of these databases can also be used to identify metabolites. But what they do is produce lists, and lists are nice, but what we really want in biology is understanding. Pathways are a vehicle to both integration and understanding. Pathways link metabolites to genes and proteins, but they also link them to physiology and to pathology and to many other fields.
So I'm going to talk about pathway databases here, and in the interest of time I'll speed things up a bit. Most of you are aware of the KEGG database, certainly the most popular pathway database in the world. Some of you have probably used BioCyc or MetaCyc; they started about the same time as KEGG, and they're maintained by Peter Karp. Others of you have heard of Reactome, which is done through activities in Toronto and at the EBI, and Francis, I think, has also been associated with Reactome for a while. And then something that most of you have never heard of is called SMPDB, or the Small Molecule Pathway Database. I'm going to talk about all four of them briefly.
They're a great source of biological data. They relate genes to proteins, to metabolites, to diseases, to signaling events and processes. They often cover multiple species. They're very visual, and we are a visual species. KEGG, as I said, is something that everyone knows about, and there are many KEGG wiring diagrams and many KEGG sub-databases.
Sorry, I know I'm going to interrupt you and I'm slowing you down, and the students are saying, where the hell is he stopping David, but I'm going to do it anyway. So, KEGG: KEGG ran into some financial problems.
That's right, yes.
Have they resolved these problems?
Well, it's certainly up and operational. You can't download KEGG like you used to. Up until about five or
six years ago you could download vast quantities of data. I think some people have been able to figure out how to get some of it, but if you want the local database you have to pay money. So it's still viable, it's still doing some great work, but like a number of databases in Japan they've had to move to a sort of fee-for-service model.
Okay, but you can still view it?
It's still publicly accessible. It's just the downloads and the extra bells and whistles that cost money. It covers a huge number of organisms, six thousand organisms, and most of those, about 99 percent, are microbial. It has a sort of standard set of 500 pathways, about a hundred and some pathways for humans and about 85 for microbes, and those are just sort of recycled. They have some disease pathways and some protein signaling pathways. The number of compounds in KEGG is relatively small; it's sort of stuck at around 18,000 or 19,000 compounds. But it covers all species of life: plants, microbes, animals.
It doesn't do a good job separating them, right?
No, it's a bit of a pain. You have to know to select which pathway, and then some things are highlighted. Some of them are kind of dubious: you'll find in a human pathway chlorophyll synthesis, that sort of thing, so you have to realize that, no, this is a generic pathway for all plastids. So yeah, it's limiting. It also doesn't show what's inside the cell, so most people who use KEGG don't know that the TCA cycle takes place in mitochondria, because KEGG doesn't actually show things like mitochondria. Everything just kind of happens in space. It doesn't show transporters. Obviously something that's produced in the cell has to get outside, and stuff that the cell needs has to get inside, but there's almost no information about transport. Likewise, almost all of KEGG's metabolism pathways are on catabolism and anabolism, yet probably 90 percent of what metabolites do is for signaling,
and so we have kind of a warped view of the world in metabolomics, thinking that everything is about catabolism and anabolism, and that's fundamentally wrong. Part of that is because KEGG was developed in a time before metabolomics existed. KEGG was developed specifically for pathway analysis, sort of an obscure field.
Both to complement biochemistry textbooks...
Yeah, basically, you're right. So it had a different purpose, and they still maintain that purpose, but it really wasn't intended to help or enrich metabolomics. It's just that people in the metabolomics world found it and they've made use of it. It's like you found a sailboat and now you're using it to fly down a highway by sticking wheels on it. So we've repurposed KEGG in a way that it really wasn't intended for.
The Reactome database has been developed over the last 15 years, and it's a giant effort at the University of Toronto and also the EBI. Originally it evolved as a pathway database for signaling, but then it expanded to include metabolism. There are about 15 model organisms and about 1,500 pathways in each organism. So it's not as extensive as KEGG with its 6,000 organisms, but it covers disease, metabolism, signaling, and transcription pathways.
Is it fair to say that the organisms are distinct?
Distinct, yeah, they're much more careful. And it reflects the fact that originally Reactome was about protein pathways and protein-protein interactions, and metabolism got added a little after the fact. Obviously each of these databases deserves quite a bit of time, but we don't have that, so I'm just sort of flying through.
BioCyc includes EcoCyc and MetaCyc. This is the Cyc collection, or encyclopedia collection. It covers many different pathways, so it's a very large collection. It's similar in concept to KEGG but has much smaller pathways. It has some pathways that are manually edited and some that are not, so the number is huge but
everything is sort of automatic. The model organism that BioCyc used was E. coli, so it imagines that humans are just larger versions of E. coli, and you get some odd-looking pathways in BioCyc. As I say, most of them are single-reaction pathways rather than large, complex ones like the TCA cycle or the lipid synthesis cycles.
As I said, most pathway databases that have metabolic data in them just show catabolism or anabolism, and that's the idea that metabolites are just bricks and mortar. But as I said at the beginning, metabolites are much more than that. They actually play huge roles in signaling; in immune function (people have heard of immunometabolomics); in inflammation; in homeostatic events; in epigenetics, most of which is driven by methylation processes and is purely chemical; in drug action; and in tissue repair. Cancer is looked on more and more as a metabolic disorder, and almost all the chronic diseases, including even Alzheimer's, appear to have fundamental metabolic and signaling relationships. The roles of metabolites in signaling, immune function, and inflammation have nothing to do with catabolism or anabolism.
I mean, the most important signaling molecule in your body is glucose, and many of you are probably feeling a glucose low right now, some of you drifting off to sleep. If you get something sweet in your mouth or in your stomach, it'll get you working again. But it's not just activating your brain; it's actually driving a whole range of other actions. This is an example of a molecule that's at around five millimolar in your body, and the number of receptors and proteins that glucose activates is amazing. None of it's shown in KEGG, none of it's shown in Reactome, none of it's shown in any database, frankly. So we're missing a lot and we're misinterpreting a lot because of that.
It's partly because of those problems that we started creating this database called the Small Molecule Pathway Database, and we
decided to create a database specifically for metabolomics, to try to capture those things that weren't in traditional databases. This is sort of its layout. Every pathway has a description, every pathway is colorful, and every pathway has some information about where the processes happen within the cell and within the organelles. There are almost 50,000 pathways now. Some of them are metabolic diseases, some are metabolic pathways, some are drug action, and more and more are related to signaling. We try to get information about compartments and organelles and the quaternary structures of proteins, and try to relate not only the metabolites but also proteins and genes to that. So you can take input genes, proteins, or chemical metabolites and upload them to SMPDB, and it'll generate matching pathways or disease diagnoses.
This is an example of a pathway. The name is too small for me to read here, but it shows how this particular thing links to actions in the brain and muscle and other tissues, highlighting where the action takes place in the mitochondria, how some of the metabolites move between peroxisomes and other organelles, and how some mutations have made this particular enzyme non-functional, leading to higher levels of this metabolite, which cause brain and muscle damage. You can see the membrane; ideally it would show how the metabolites were pumped in or brought in or kicked out. All of the metabolites are linkable to HMDB. As I say, every pathway has a detailed description with lots of references.
You can indicate which metabolites need to be highlighted, and they will be colored in red in the pathway; this is through the SMPDB mapping. You can also include concentration data, and that will color the pathways according to different ranges, from yellow to red to green depending on those concentrations. The proteins are all linked to UniProt. As I said, the metabolites are linked to
HMDB. You can re-render the pathways in color or black and white. You can make them look like KEGG if you want. You can convert them to color or printer-friendly versions. They can be in different graphical formats, and they can also be stored in Systems Biology Markup Language (SBML), BioPAX, which is a pathway format, and Systems Biology Graphical Notation (SBGN) formats. All the pathways are downloadable, and all can actually be edited.
If you want to contribute to PathBank and SMPDB, you can use an online tool called PathWhiz that allows you to make these machine-readable pathways. By drawing a pathway through this tool you automatically create a pathway that is fully compatible with BioPAX, SBML, and SBGN. It has a bunch of drawing palettes and allows you to add reactants and enzymes and different biological states. You can render lipids this way or as simple ones. You can drag and drop things and rotate things as necessary. It's all done on the web. There are various icons to illustrate, you know, the liver or the ER. All the metabolites are rendered automatically, so if you provide a name it'll generate the metabolite. There are tools to facilitate the preparation of proteins. This is an example of one that's been generated, in this case lipid biosynthesis, showing things that are happening within organelles, within the ER, or within different parts of the cell, and how all of this primarily takes place in the kidneys in this particular case.
There are videos on YouTube, and there's a video journal article about how to draw pathways with PathWhiz. Also, when you draw one pathway, you can propagate it to other organisms, or you can replicate it when you're making pathways that are very similar, so that saves a lot of time. You can take a pathway for Arabidopsis and then propagate it to grapevines if you have the genome for it. Using PathWhiz and SMPDB we were able to generate a large number of
pathways for 10 model organisms recently, and produced a resource called PathBank. This now has a little over a hundred thousand pathways for humans, fruit flies, yeast, C. elegans, E. coli, mouse, rat, cow, and Arabidopsis. Like SMPDB, it supports SBML, SBGN, BioPAX, and so on. It has a lot of, in this example, plant pathways and plant compounds from Arabidopsis, so it's quite extensive, allowing people now, or hopefully in the near future, to visualize SNPs or gene variants with different colors, and to render pathways in these black and white or color forms.
We compared PathBank to a number of other databases, WikiPathways, BioCarta, Reactome, in terms of their size, scope, type, linking, signaling, drug action, descriptions, summaries, and so on. In some cases PathBank is as good, in some cases it's not as good, and in some cases it's quite a bit better. Each database brings its own strengths. But as I say, our focus is to try to bring in more information about signaling pathways for metabolites, drug actions (because drugs are chemicals), disease pathways, and other things, and to change the way that people think about pathways in metabolomics.
So I've gone through a lot of databases, and I'm trying to do it fairly quickly, but they're really key to compound annotation and data interpretation. Many of the databases that we traditionally use weren't really intended for metabolomics; we just sort of found them and repurposed them. As I say, it's kind of like finding a sailboat, putting some wheels on it, and using it to go down the highway. It's not what a sailboat was intended for. So what we need to do as a community is either redesign them or enhance them. I'm trying to highlight some of these newly emerging databases like PathBank or SMPDB, and some of the other organism-specific databases that many people in metabolomics may not be aware of. The provenance, the origin of metabolites, is really key to reducing false
positives and incorrect interpretations. In most metabolomic studies we know exactly which organism we're looking at, so if you know that, use that information to help you in your data analysis.
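
To make two of the ideas from this talk concrete, here is a minimal sketch, not how CFM-ID or HMDB actually implement anything. It scores a query peak list against candidate spectra with the Jaccard and Dice scores mentioned earlier, and then keeps only candidates whose provenance includes the organism under study. The `Candidate` class, the function names, the greedy peak matching, the 0.01 Da tolerance, and all of the m/z values and compound names are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """A hypothetical database entry: a name, fragment m/z values, and provenance."""
    name: str
    peaks: tuple[float, ...]
    organisms: frozenset[str]


def matched_peak_count(query, reference, tol=0.01):
    """Greedily pair each query peak with one unused reference peak within tol (Da)."""
    remaining = sorted(reference)
    matched = 0
    for mz in sorted(query):
        for i, ref in enumerate(remaining):
            if abs(mz - ref) <= tol:
                matched += 1
                del remaining[i]  # each reference peak may be used only once
                break
    return matched


def jaccard(query, reference, tol=0.01):
    """Matched peaks divided by the number of distinct peaks across both spectra."""
    m = matched_peak_count(query, reference, tol)
    union = len(query) + len(reference) - m
    return m / union if union else 0.0


def dice(query, reference, tol=0.01):
    """Twice the matched peaks divided by the total number of peaks in both spectra."""
    m = matched_peak_count(query, reference, tol)
    total = len(query) + len(reference)
    return 2 * m / total if total else 0.0


def rank_candidates(query, candidates, organism, tol=0.01):
    """Drop candidates with no provenance in `organism`, then sort by Jaccard score."""
    kept = [c for c in candidates if organism in c.organisms]
    scored = [(jaccard(query, c.peaks, tol), c.name) for c in kept]
    return sorted(scored, reverse=True)


# Made-up example data: one candidate reported in E. coli, one reported only in licorice.
query_peaks = [50.0, 75.0, 100.0, 125.0]
candidates = [
    Candidate("compound_a", (50.005, 100.0, 150.0), frozenset({"E. coli", "human"})),
    Candidate("compound_b", (50.0, 75.0, 100.0, 125.0), frozenset({"licorice"})),
]

for score, name in rank_candidates(query_peaks, candidates, "E. coli"):
    print(f"{name}: Jaccard = {score:.2f}")
```

In this toy case the second candidate matches the query perfectly on peaks alone, but it is filtered out because it has no E. coli provenance, which is the kind of false positive the talk warns about. A real comparison would normally also weigh peak intensities, which this sketch ignores.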