So, hi everyone. I'm going to present some new technologies around Elasticsearch to facilitate the mining of human microbiome databases. I'm Michael Leclerc, currently working in the Arnaud Droit laboratory at the CHU de Québec – Université Laval. Before entering the course subject, I'm going to talk about big data in the biology world. Then, to stick with the subject of the summer school, I'll show what resources we have for the microbiome and related fields. And finally, I'm going to present some solutions for mining microbiome data, including Elasticsearch. Can everyone hear me correctly?

So, big data in the biology world. First, big data itself. A few years ago, Eric Schmidt from Google told us that from the dawn of civilization until 2003, humanity generated about five exabytes of data, and that we now produce that much data every two days. That wasn't perfectly accurate: one year later, someone at a company actually checked, and about 23 exabytes of information were recorded and replicated in 2002 alone, so we now record and transfer that much information every seven days. Almost accurate, but not exact. Eric Schmidt was thinking of social networks and internet technologies; a lot of data comes from there. But in science we also have physics, which contributes a lot to big data, for example with the experiments at the Large Hadron Collider. In biology too, and I'm going to talk about that. And the next wave will be the Internet of Things: we're going to get a lot, a lot of data from that; it's the future. And probably within the Internet of Things we'll have biological sensors taking records of your samples, blood samples, whatever.

What is big data? It's defined by the four Vs. The first is volume: we're counting in exabytes now. The second is variety: it comes from video, the internet, science, whatever you want. Another V is veracity, the certainty of the data, which is a major issue for big data technologies. In biology you can't always trust what's in your samples or what comes out of your sequencing technologies, so this is a huge problem: you need to clean the data before going too far in your analysis. The last one is velocity: the same kinds of data keep arriving again and again on different timelines.

So what generates so much data in biology? You already know: sequencing technologies. The cost per genome decreased dramatically after 2008. Twenty years ago, it took on the order of a billion dollars and years of data analysis just to sequence the first free-living bacterium, and now for $200 you can have your personal genome mapped in a few days by 23andMe, which is almost a subsidiary of Google. But at a certain point, next-generation sequencing output started outpacing computational resources, and that's the issue today. You have three curves here. (Oops.) The first is hard disk storage, in megabytes per dollar: it doubles every 14 months. Before NGS, bases per dollar doubled every 19 months, so it was fine: it kept getting cheaper and cheaper to store the data. But with next-generation sequencing, the cost per base is falling much faster, so now we have storage issues: we need terabytes, petabytes, to store all the sequencing information.
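To make that outpacing argument concrete, here is a minimal back-of-the-envelope sketch in Python using the doubling times quoted above; the numbers are illustrative only, not exact figures from any study:

```python
# Compound growth from the doubling times quoted in the talk (illustrative).
def fold_improvement(years: float, doubling_months: float) -> float:
    """How many times cheaper a resource gets over `years`."""
    return 2 ** (years * 12 / doubling_months)

decade = 10
print(f"Disk (doubles ~every 14 months): {fold_improvement(decade, 14):.0f}x cheaper per MB")
print(f"Pre-NGS sequencing (~every 19 months): {fold_improvement(decade, 19):.0f}x cheaper per base")
# Pre-NGS, disk improved faster than sequencing, so storage kept up.
# With NGS, bases per dollar double faster than every 14 months, so
# sequencing output eventually outgrows the storage the same budget buys.
```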
And at a certain point it will probably become cheaper to re-sequence than to store sequencing data. What's the effect on the logistics of sequencing research? Before next-generation sequencing, you spent a lot of time and money on the sequencing itself; this part here is the analysis, and this one is storage plus sample collection and experimental design. (Sorry, this mouse is very sensitive.) Now sequencing is very fast and cheap; sample collection takes a long time, because you can recruit thousands of people, thousands of collections of samples; and the downstream analysis takes most of the time and money.

So these are the challenges of big data: how to store everything efficiently; what volume and what structure; how to link datasets together; how quickly we can explore and search all this data; whether to split our samples; what algorithms we can use; what queries, what software, what architecture to store and analyze everything.

Before getting into all those questions and proposing some solutions, I'm going to look at what we have in microbiome data. You already know that metagenomics data comes from DNA extraction and sequencing: you get your reads. (Do I have a laser somewhere? No. Okay.) Once you do your assembly and search against databases, you want to answer two questions: who is there in your samples, and what can they do? The taxonomic classification is very important: you want to know what populations you have in your sample, and which pathways and genes are involved.

A few years ago, someone mapped the diversity of the human microbiome. (Oh, thank you. The mouse is very sensitive; there's a backup, it's fine.) So this is the diversity of the human microbiome: phylogenetic trees showing all the populations we know today. You also have their presence in various sites, such as the stool, dental plaque, the tongue, the skin and everything else, and the abundance at each site. Someone also attempted a map of the diversity of microbiome genomes and identified the number of species for every genus; this is a clustering of species-specific genomes. That's the complexity of microbiome analysis and the diversity across collections of samples.

Let me see something here. What do we have as microbiome resources? The main database is under the Human Microbiome Project. The objective of this database is to characterize the microbial communities at multiple body sites and to look at the correlation between changes in the microbiome and human health. Its aims, in fact, are to determine whether there is a core set of microbes common to all humans; whether changes in the microbiome can result in a different state of health or the advance of a disease; at the same time, to develop new technologies for studying complex microbial systems within their natural environment; and maybe also to begin to deal with the legal and social complications that may arise from human microbiome research. What's really useful in that database is the reference genomes: you have all the reference genomes, and you can map your samples against them.
This is a very good resource. You have a lot of shotgun sequences and 16S sequences, all organized by study, so you have the clinical data from each study's patients and the genetics from the sequencing experiments on those patients. You can do the exercise of extracting all that information and mapping it to reference genomes to extrapolate the population of each sample. You also have things like functional databases and metabolic reconstructions in there, so it's a very useful database.

The second one is MG-RAST. This is an automated platform for analyzing metagenomes: it quantifies your microbial populations based on the sequence data. It performs quality control, automatic annotation, and comparative analysis, and it's compatible with a lot of metagenomic data. You just give it your samples and where they come from, and it takes care of the quality control, maps them against a database of full genomes, and tries to quantify the population of each bacterium that may be present in your samples. This is just a sample run from the website; you can try it, it's very easy, and you have various ways to represent your data. It's a very good tool.

This is Qiita. It's almost the same as MG-RAST: a study management platform where you can keep track of multiple omics experiments, upload and analyze your own data, and generate visualizations. It's less advanced than MG-RAST, but I think it's older.

You also have the Earth Microbiome Project, which aims to characterize global microbial taxonomic and functional diversity all over the planet. It provides a gene atlas, assembled genomes, a visualization portal, and metabolic reconstructions. This is an interactive world map showing the position of each Earth Microbiome sample on Earth, with each sample connected to its most similar sample within the database, so you can extrapolate which microbial communities are most similar across populations on the planet.

So now you have all your sequences and the populations in your samples. Once you have done all this analysis, you need to dig further and answer various questions: what does this population mean for the body? Is it linked to the disease you are studying? What are the impacts on the health of the patient? What kinds of products can my bacterial population create and deliver to the body? Et cetera. In short, the genetics of the microbiome only tells you who is there; what you do with that comes next. For that, you need to go to all the other related bioinformatics resources.

The first one, of course, is PubMed, which contains more than 26 million documents. You already know that database: all the biomedical literature, electronic journals, books, with links to the full-text content on the publishers' websites. You have a very simple, very basic, efficient search, but it's not the perfect search engine for the literature. Personally, I prefer Google Scholar, because it covers the full-text content, which is very useful.
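As an aside, since PubMed searching will come back later in this talk: you can also query PubMed programmatically through NCBI's public E-utilities API. A minimal sketch in Python (the search term is just an example):

```python
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# esearch: get PubMed IDs matching a free-text query.
r = requests.get(f"{EUTILS}/esearch.fcgi", params={
    "db": "pubmed",
    "term": "gut microbiome dysbiosis",  # example query
    "retmode": "json",
    "retmax": 5,
})
pmids = r.json()["esearchresult"]["idlist"]
print("PMIDs:", pmids)

# efetch: pull the abstracts for those IDs as plain text.
r = requests.get(f"{EUTILS}/efetch.fcgi", params={
    "db": "pubmed",
    "id": ",".join(pmids),
    "rettype": "abstract",
    "retmode": "text",
})
print(r.text[:500])
```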
And you have the PubTator database. Those people did an almost manual curation of the literature: using text-mining technologies, they did a lot of work to annotate all the biological entities and their relationships. So, for example, if you look for a gene, they have already done the work of parsing the full-text documents to associate PubMed IDs with that gene. You can filter by bioconcept (chemical, disease, gene, whatever you want) and it gives you all the papers to read.

Now, if you have bacteria, that means metabolites. The Human Metabolome Database makes it easier to study the metabolites and small molecules that the body and the gut bacteria produce. You have all the information for each small molecule and metabolite found in the human body, and it includes both bacterial products and human products, so you can relate the two. It has a simple search engine: you can browse by metabolite, disease, pathway, whatever you need, depending on the genetics of the subpopulation you are analyzing. And you can search by chemical terms, molecular weight, or any text query; it's very broad and very fast.

Of course, you have KEGG pathways: all the high-level information on functions and activities of biological systems at various levels, such as cells and organisms. If you have your bacterium, you just enter it and it tells you all the pathways known in that bacterium. You can also look by ecosystem, say gut or skin, and you can search by many types: pathways, functional orthologs, genes, molecules, biochemical reactions. There is also a search engine for drugs that can be related to the gene you are studying, or you can go by topic, pathogen, plant, and even bacteria if you want to search by organism. It's not super user-friendly, but it's efficient: you just enter some text and it gives you all the pathways. That's the search by pathway, or you can do it by orthology, by virus, whatever you like.

Then there's SMPDB, the Small Molecule Pathway Database. It's related to KEGG, but it really focuses on small-molecule pathways in humans. It's designed to support pathway elucidation and discovery, metabolic and transcriptomic. You have drug actions and diseases; it's a bit newer than KEGG, and you have all the pathways again, but you can search by drug action, which is very cool if you need to target your bacterial population.

And talking about drugs, you have DrugBank. Suppose you are looking for a way to recover from a dysregulation of your microbiome flora, of the homeostasis in the gut, for example: you will find here all the drugs that could eventually target specific bacterial products, or even pathways, if you want to target their metabolism. DrugBank has, I don't know the exact number, but around 2,000 interactions between drugs and their targets, and you can search by drug, pathway, or gene, and it gives you all the drugs with their effects and secondary effects.

Another cool DB is FooDB, the database of food constituents, chemistry, and biology. If you want to study the impact of the patient's diet on the gut microbiome, for example, FooDB gives you, for each component (this is just a sample of the fields you can find), the physiological effect, presumptive health effects, food source, color, taste, whatever you want, so you can probably associate the diet of the patient with the gut microbiome. It's a very simple browsing window where you have all the foods known for humans.
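Coming back to KEGG for a moment: it also exposes a plain-text REST API, so you don't have to click through the website. A minimal sketch (the organism and pathway codes are examples):

```python
import requests

KEGG = "https://rest.kegg.jp"

# List all pathways known for an organism (eco = E. coli K-12 MG1655).
print(requests.get(f"{KEGG}/list/pathway/eco").text.splitlines()[:5])

# Fetch one pathway entry in KEGG's flat-file format (00010 = glycolysis).
print(requests.get(f"{KEGG}/get/eco00010").text[:300])

# Free-text search, e.g. compounds matching "butyrate".
print(requests.get(f"{KEGG}/find/compound/butyrate").text)
```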
Then you also have T3DB, the toxin database. This is great for estimating the impact of toxic bacterial byproducts on the health of the patient. It combines all the information on toxins and their targets: all the bacterial, food, and fungal toxins contained in the database, with almost 90 fields covering chemical properties, toxicity values, molecular interactions, and medical information. This DB may be very useful for estimating the toxicity of a dysregulated bacterial population in the skin or in the gut. The browsing is very simple again: you can filter by fungal toxins or bacterial toxins, and depending on your knowledge of chemistry it may be useful.

Then, for the more purely chemical databases, you have ChEMBL and ChEBI, which is a dictionary of unique chemical entities. This is at the atomic level, so maybe not for biologists, more for the chemistry people. And of course UniProt for protein sequence and function, depending on what your microbiome samples are secreting.

So now that you have all these great resources: how do you mine them, how do you dig in and relate them to each other? That's the challenge for biologists. There are so many databases, and what I showed you is just a sample; I didn't survey everything, but it's probably just 5% of what exists in biology. You have many databases and many web servers, some of which can explore your data, which is nice. But if you don't know any programming, you need to jump from website to website to analyze your data, and use external tools to create the relations. As a biologist, you can do it manually for each database using online tools, and you end up with a browser with 20 tabs open at the end of your analysis. If you know a bioinformatician, then you go to programming: you download the full dataset of each database, connect them, and manipulate the data with R packages. Bioconductor contains a lot of packages, and if you have a database in mind, there is probably already a tool to extract information from it. And if you are a pure programmer, in Java, Python, or Perl, you have a lot of libraries for exploring biological data.

Or you can go to Elasticsearch. I think that's the future. Elasticsearch is one of the big data tools created over the last years. It's not the only solution, but what's nice about it is that it's an open-source search engine that can index any kind of heterogeneous data, which is exactly what we have in our domain.
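To give an idea of what indexing heterogeneous data looks like in practice, here is a minimal sketch with the official Python client (an 8.x client and a local single-node Elasticsearch are assumed; the index names and fields are invented for illustration):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Two documents with completely different shapes go into two indexes,
# with no schema declared up front.
es.index(index="hmdb", id="HMDB0000122", document={
    "name": "D-Glucose",
    "average_mw": 180.16,
    "origin": ["endogenous", "food", "microbial"],
})
es.index(index="pubmed", id="12345678", document={
    "title": "A study of the gut microbiome",
    "pub_date": "2016-03-01",
    "mesh_headings": ["Gastrointestinal Microbiome"],
})
```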
Elasticsearch is a NoSQL database: you don't need to create any tables or any schema. It stores JSON documents, so it's natively compatible with JavaScript. It's a near-real-time search engine, which makes it very useful for analyzing things like computer logs, data that arrives every second, every minute. Maybe that's not exactly our case yet, but perhaps in the future, when you have electronic sensors on your body. It's very fast, highly resilient, and massively distributed: you distribute the database over many cluster nodes, all integrated at the same time, which makes it fast.

That's only the database itself; tools are also provided with Elasticsearch to query it. Kibana is the visualization interface, and it's very powerful: you create dashboards and visualizations depending on the data you have, and it can produce any kind of graph and analytics, as I'm going to show you. In the lab we use Kibi. Kibi is a fork of Kibana, maintained by a company, which implements relations between the data of the indexes. When I say index, think of it like this: PubMed is an index, FooDB is an index. The problem is that Elasticsearch does not support relations between databases; that's why Kibi is there, to create those relations.

The basic concept, without going into details, is this: you take the source public databases, you start with some data processing, depending on the source format you have in your hands, and you push that into your cluster. You have multiple indexers which take care of ingesting your data into the index storage, and then you have the search front end, which is Kibana or Kibi.

This is the kind of visualization you get from Kibi or Kibana dashboards: heat maps, maps, any kind of chart; everything is in JavaScript. We have curves, histograms, heat maps, relation graphs (this is the relation graph here), metrics, which are just numbers you can pop up like that, word clouds, which are very useful too, and very simple tables for your information. And it's already adapted to real-time data, so it updates itself, which is very cool for monitoring systems, but it can also be adapted to biology.

That's why we are currently creating Kibio.science, a web platform dedicated to life science. It will bring together a lot of bioinformatics data; we won't be able to do all of it, but at least the best known, and we are currently building an engine for frequent updates to match the latest records. It relies on Elasticsearch and Kibi; I'm creating it with Régis, a PhD student, under the supervision of Arnaud. The concept is simple: the goal is to gather the main bioinformatics databases within Kibio.science, create all the relations, and link them together. These are just two examples.

Currently, we have created a program that digests every kind of database, whether it comes as XML or CSV, from an FTP site or from any API. But a lot of bioinformatics repositories now offer JSON directly, so we just take the JSON and send it to Elasticsearch; in that case we have nothing extra to do. Elasticsearch also has something very cool: automatic mapping. Usually, with an Oracle or MySQL database, you need to create a schema describing what kind of element will go into your database. Elasticsearch does everything on its own: it takes the first documents and infers the kinds of data it will digest afterwards, so whether you have strings, integers, or whatever, everything is done automatically.
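That automatic mapping is easy to see for yourself. A minimal sketch, under the same assumptions as the previous snippet (local Elasticsearch, 8.x Python client, invented fields):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index one document with no schema, then ask what Elasticsearch inferred.
es.index(index="demo", id="1", document={
    "gene": "TP53",               # inferred as text (plus a .keyword sub-field)
    "expression": 7.2,            # inferred as float
    "measured_on": "2017-01-15",  # inferred as date
})
print(es.indices.get_mapping(index="demo"))
```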
The problem is when you have a database like PubMed, where the information is very variable. That's why we needed to build an engine to help Elasticsearch do the mapping (I'll come back to a concrete mapping example at the end): some databases are very easy to push into Elasticsearch, others are more complex. And we have Kibi for the visualization, the search, and all the dashboards we have created.

In the future, if we have the computational resources, we would probably like to accept user data. Imagine we already have Kibio.science loaded with all of PubMed, FooDB, and the toxin database, and you arrive with your own data, in a CSV from Excel for example: we push that in and link it with every other database. That's what we want to do in the future.

I showed you the basic visualizations already implemented around Elasticsearch, but we also want to create more, such as expression heat maps for gene expression, even gene networks, maybe a genome browser or something more interactive; but a lot of people are already working on that, so we probably won't do anything there. Survival curves: we have a lot of collaborations with researchers in the clinics, who are very interested in survival curves and biostatistics, so we need to implement tools that do some calculation, and probably some machine learning. For that, we won't reinvent the wheel: all these kinds of visualization are already implemented in JavaScript libraries, and our goal is really to adapt them to Kibana, Kibi, and our Elasticsearch. All of this is already known in the literature.

So I'm going to show you a short video. This hasn't been implemented in Kibio.science directly yet; it's on Campbell, and we are working with the Kibi people, the company's solution, implementing it with them. Let me show you. You have all your indexed activities, and once you go to the dashboard, you have a few tabs, depending on the visualizations you have already created; it updates automatically as you go from tab to tab, because the database is split into various sub-dashboards, and every time you do a search, everything on your dashboard is updated. You have automatic filters; you can also filter by date, or use the query syntax already implemented, which is very easy for creating filters. The blue buttons are the relations: this plugin lets you filter the sub-tabs in other dashboards, and all the dashboards are linked together at some point. So when you do a real search, you can click on everything on the dashboard, and the filters you add appear in green here to refine your search. You have a word cloud; you can click on it.

I'm going to show you a real example. These are the PubMed publications, and I want to retrieve them all here. I have a search bar; this server is on the Compute Canada West Cloud. It's something in testing too; it can be expensive to have a cluster of servers digesting all this data, but it works, so we use it. This is a five-node cluster, so we have five servers. This is PubMed, so more than 26 million documents; it does not contain the full texts, only the abstracts. Let me show you the content of PubMed: here you have all the fields PubMed provides, and for each entry you have the IDs and the article itself.
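Behind the dashboard, pulling up one of these PubMed documents with all its fields is a single lookup. A minimal sketch (same assumptions as before; the ID is an example):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Retrieve one document by ID; _source is the full JSON record,
# i.e. all the PubMed fields shown in the dashboard.
doc = es.get(index="pubmed", id="12345678")
for field, value in doc["_source"].items():
    print(field, ":", value)
```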
So this is all the information we have in PubMed: all the authors, all the fields, the publication date. It's not very user-friendly to look at it like that, and that's why we create visualizations and exploit them in dashboards. So here is the number of publications by date, from 1944 to today in 2017; we haven't updated Kibio.science since January, which is why we don't have a lot of papers from this year. Everything is clickable: if I click on a date, I get an automatic filter that keeps everything published within that year, with the main topics published that year (in 2000, in 1990), the journals with the most publications that year, the number of referenced journals. You can go to any word you are looking for, and everything updates automatically. And these are the details, the content of PubMed. It's not super pretty, because it's under construction, but one day it will be nicer, and you have every detail for each paper you have in hand.

You can of course do any kind of search. For example, if I look for a term, a molecule, whatever: it searches within, yes, PubMed, which is 150 gigabytes of data, and the search takes a few seconds on this very server, so it's very fast. And I can check for a disease in the microbiome too: I pick one year; I don't remember the word. Each time you create a new search, you can follow the relations you have defined with the Kibi relations. Within each PubMed document, I have, for example, a list of genes, or a list of MeSH headings (medical headings), and if an ID in PubMed exists in another database, this relation will tell me. So if I go to PubTator, for example, I click on it, it goes directly to the PubTator dashboard, and it tells me how many gene entries I have from the search I just did in PubMed.

I also loaded the toxin database, T3DB. So if I go to the toxin database... I clicked on two things at the same time; it didn't like that. The Human Metabolome Database, I think, is 180 gigabytes of data, but it shouldn't be slower than PubMed. Go to OMIM: you have all the diseases. You see in just one view (and that's something you don't get when you visit the database's website) how many entries there are, gene IDs, gene symbols; we can create any kind of matrix we want from the database. This is a mapping of all the genes in the database onto their chromosome locations and cytogenetic locations. You can click on everything; OMIM type, that's the entry type. You also need to know what's in a database: I pushed a lot of databases into Elasticsearch without knowing exactly what's in each one. You can search from there, and you can also click through; we have generated links, so if you want to go directly to the source website, I click on the link and I go there, and because it's a Mac, I'll be able to come back.

You can search for specific entries: for example, if I filter on that approved gene symbol, boom, I have everything on that gene specifically. It doesn't have a protein; that's weird, I would have hoped for two proteins. So let me find some gene in OMIM that has one... maybe that one. Okay, go for it. Good, we have a protein here. So I go to HMDB and I take the protein itself; that was very fast, and the word cloud gives you all the metabolites that interact with that protein. Let me try to come back. Good, T3DB works. This is a dashboard I just made yesterday: these are all the targets.
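Everything the dashboard just did, namely a full-text search, a clickable year filter, and the "main topics that year" word cloud, boils down to a single search request. A minimal sketch (same assumptions as before; the field names are invented):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="pubmed",
    # Full-text match plus a date-range filter (the clickable year).
    query={"bool": {
        "must":   [{"match": {"abstract": "microbiome disease"}}],
        "filter": [{"range": {"pub_date": {"gte": "1990-01-01",
                                           "lt":  "1991-01-01"}}}],
    }},
    # A terms aggregation is the "main topics" word cloud for the result set.
    aggs={"topics": {"terms": {"field": "mesh_headings.keyword", "size": 20}}},
    size=5,
)
print(resp["hits"]["total"])
for bucket in resp["aggregations"]["topics"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```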
Actually, I wanted to go to the toxins first. Okay, good. So these are all the toxins in T3DB: in 4 seconds we have all the content of the database. You have the toxins per category (synthetic compounds, pesticides, drugs), by cellular component, by state; you have some metrics we can compute, just to make the database very clear. And if I look for something like vinyl chloride and filter on it, I get something like a Wikipedia page: the whole description, the treatment for that specific toxin, the mechanism of toxicity, the risk level, metabolism, symptoms. When I digested the content of the toxin database, everything was already provided with it, so it was very easy to load. And I have the 42 targets for that specific toxin: if I click on them, the next dashboard is filtered automatically and gives me all the targets of that specific toxin, all the bio-entities, all the genes targeted by it. And you can export the table in CSV if you want to do your own calculations on your own computer. This is Kibio.science right now; it will be brought up to date probably in the next few months.

Okay, so now, in conclusion, I would say that bioinformatics and metagenomics are ready for big data tools, and Elasticsearch may be one of the solutions. You see, these are the best-known databases, Oracle and MySQL at the top, and this is Elasticsearch: it's quite new and it keeps climbing, and I don't know where it will end up, but we hope the technology will continue to be updated for a long time. Kibio.science is still in progress, but we hope to have a public version this fall. And you can do it yourself: if you have a simple computer at home and you want to digest all your expression data, you can install Elasticsearch and upload the data, even as CSV, because tools exist to upload your CSV tables and convert them to JSON format; and if you already have databases in JSON, you can push them into Elasticsearch and it will be easy for you to explore the data. So, this is our lab; thank you for listening today. Any questions?

[Question about the hardware.] Currently we have 40 million documents from the various databases, on just five nodes. I don't know the exact power of these nodes; it's on the cloud, where you just create virtual machines on the fly. We created five, and we don't know exactly where they are located; theoretically you want each VM separated from the others, so they don't share the same I/O on the same hard drive. It should be the case, but we aren't sure; it should be even faster once we can be sure each server is separate.

[Question: if I want to discover something in a database I don't know, OMIM for example, with all those fields, but without the searches you created beforehand, can I just extract the columns I want?] I don't have a plugin to do exactly that yet, but it's on its way from Kibi: you'll get a large table like that and you'll just extract what you need. But seriously, it all works over cURL, POST and GET: you just send JSON as a query and you retrieve all the matching documents in JSON format, and the conversion is very easy. Kibi is just a query tool on top, for visualization; in the back end, a plain JSON query gives you all your documents directly, without using the UI at all.
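Since that point about cURL matters: bypassing Kibi entirely and talking to Elasticsearch over HTTP really is just a JSON body against the REST API. A minimal sketch with Python's requests library (index and field names are examples):

```python
import requests

# Equivalent to: curl -XGET 'http://localhost:9200/pubmed/_search' -d '{...}'
resp = requests.get(
    "http://localhost:9200/pubmed/_search",
    json={"query": {"query_string": {"query": "vinyl chloride"}}, "size": 3},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"].get("title"))
```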
You just provide your field names and you get your documents; there is nothing more to create. As for building the visualizations: this is just the list of visualization types you have. If I make a tag cloud, for example (I already created these tags here), you choose an aggregation of terms, and you have all your fields; so I go to the chemical list, pick the text field, ask for the top 25, and that's how the visualization is created. Boom.

[Comment: the most useful thing would be for a lab with a lot of data, as a way to have it all in one place for everybody in the lab.] Yes, exactly; you don't need to go anywhere else. We just want to pre-create all these things, because with PubMed, for example, it's crazy: if you just take the XMLs and try to send them to Elasticsearch, it's going to crash very fast. For example, you have volumes: at first they're numbers, so the field is inferred as an integer, and suddenly you get a special issue with a letter in it, which is a string, and Elasticsearch won't like that. That's why we needed to build an engine to do the mapping ourselves. And at some point we will provide a big JSON of all of PubMed: you'll download it, push it onto the Elasticsearch in your lab, and it's done. That's one example, but the work of transferring all the well-known public databases into Elasticsearch in JSON format, we will do that work. Good.
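The volume problem just described is exactly what an explicit mapping prevents: you declare the ambiguous field as a string before any document arrives, instead of letting the first record decide. A minimal sketch (same client assumptions as before; the field choices are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Create the index with a fixed mapping BEFORE ingesting anything, so a
# volume of 12 and a volume of "12 Suppl A" both index cleanly as strings.
es.indices.create(index="pubmed", mappings={
    "properties": {
        "volume":   {"type": "keyword"},  # never inferred as integer
        "pub_date": {"type": "date"},
        "title":    {"type": "text"},
    },
})
```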