Okay, so Mark presented an overview; I will give a more technical presentation. We will give two presentations of 30 minutes each, with a coffee break at the end of the first one, so you will have time for questions and then a coffee break. And I will get into more technical details to explain how we really produce the data. I'm checking the chat; I see that I have it open. Okay. Like Mark, it's the first time I give a presentation through Zoom, so it's a bit disorienting, but I'll do my best and check the questions if you have any. Don't hesitate to ask questions during the presentation; I'll try to see them, and otherwise you will also have time at the end of the presentation. So, an important aspect of the data integration in Bgee is that we try to identify where and when, in which conditions, genes are actively expressed. And I would like to first explain why we take this approach and why it is important. Okay, so I'll change my slide. This is an overview of the Bgee pipeline that Mark already showed you, with the different sources of data we use. In this 30-minute presentation, I'm going to speak about the reanalysis of the data to detect active signals of expression, so this specific part of the pipeline. Okay, so we try to detect the conditions where a gene is active, and I would like to show you that it is non-trivial to say where a gene is actively expressed, because there are two sources of noise: first, biological noise in the gene expression process itself, and then technical noise depending on the technique used. For instance, experiments were conducted in 2002 to estimate this biological background noise of the gene expression process. On this graph here, they used a yeast GFP reporter and measured the change in gene expression over time in different individual cells.
And you can see here that across individual cells there is a large variation in the expression of the same gene: within the same cell, expression changes over time, and between cells you can also see large variability. So these authors, in 2002, said that basically there are two sources of noise. Intrinsic noise comes from the chemical reactions themselves: the binding of the RNA polymerase to the promoter, the ribosomes translating mRNA into proteins. This noise, coming from the stochasticity of the chemical processes, is called intrinsic noise. And then there is extrinsic noise, for instance from the availability of nutrients in the cell. In these experiments, they used two genes, two fluorescent proteins, under the control of the same promoter. If the expression of the two genes varies in a correlated way, that is extrinsic noise, for instance from the availability of nutrients, because it affects both reporters in the same way. And if the expression of these two reporter genes under the control of the same promoter varies in an uncorrelated way, that is intrinsic noise, due to the stochasticity of the chemical processes. And basically they showed that both sources of noise are very important. So what I want to convince you of here is that there is biological noise: there is variation in gene expression purely from the stochasticity of chemical processes. And then the question is: at which point can we say that a gene is expressed? There is also another source of noise when you measure gene expression, which comes from the actual technique you use. So here is an example: a pipeline to construct and analyze an RNA-seq library. Basically, at each step of this construction, you can introduce noise.
So you have here a population of cells. You extract RNA from these cells, and then, for instance, you isolate a specific RNA species. In Bgee that would be mRNA with a poly-A tail, so here we select mRNAs by their poly-A tails. Then you fragment your mRNAs, convert them to cDNA, construct your library, do amplification and sequencing. You end up with sequencing reads that you need to map back to the genome or the transcriptome to quantify abundance. And at each of these steps, you can have technical noise. For instance, in RNA-seq, to give just a few examples: you can have non-uniform fragmentation of the mRNAs during library preparation, which is called positional bias; depending on the position along the transcript, you will not have the same fragmentation, so you get a biased estimation of gene expression. You can have sequence bias: the fragmentation and priming strategies lead to sequences that are non-random, so amplification targets some regions of the genome more than others. Then you can have errors in the reads: when you sequence your cDNAs, depending on the technique, you can read one nucleotide when actually it was another, a T maybe. And when you try to map your reads back to the genome, you can have errors in the read mapping because of gene homology, when two sequences are too similar, or because of regions of the genome with short repeats that make them extremely hard to align. So at all these steps, you have many sources of technical noise. You have biological noise and you have technical noise, and the question is: at which point is a gene actually expressed? When you detect, for instance, one single mRNA molecule in a cell, was your gene actually actively expressed?
Or was there just random stochastic noise of the transcriptional machinery that produced one mRNA? So in Bgee, this is what we try to identify: at which point we have an expression value that is really a signal of active expression. As for the technical noise, ideally an experiment would always have technical replicates, but in practice it is often not the case, because of cost limitations or because a technical replicate failed, and more and more often we see experiments with no technical replicates at all. Okay, so what I'm going to present now is how, in Bgee, we detect active signals of expression. What we try to do, basically: here you see a picture from an in situ hybridization, a zebrafish embryo, and the stained areas here are areas of expression of a given gene. What we try to do is convert all our expression data, whether Affymetrix, RNA-seq, or EST, to this kind of signal of expression, where you can say: here the gene is expressed, here the gene is expressed, here the gene is not expressed. So we try to integrate all the data in Bgee in a similar way, transforming the signal of expression into present/absent expression calls. I will go into details for each technique about how we do that, and we start with Affymetrix because I think it clearly highlights the need for dealing with technical noise. In Affymetrix data, you extract the mRNAs from your cells and label them, binding them with a fluorescent dye. Then you hybridize them on a chip where there are probes; these probes hybridize with your mRNAs, you rinse the chip, and you keep only the mRNAs that were hybridized with some probes. And these mRNAs carry the fluorescent label.
So then you can measure a signal intensity, a fluorescence intensity, and from the fluorescence level on your chip you get an idea of the expression level of your gene. The probes are designed so that for each gene you have a set of probes called a probe set, and the different probes in a probe set target different regions of a transcript. For each probe set, you have what are called perfect match probes, which are supposed to bind to your transcript; these are 25-mer oligonucleotide probes. And then you have what they call mismatch probes, which are not supposed to hybridize with your transcript: they take the sequence of the perfect match probe and change the nucleotide in the middle so that it will not hybridize. The idea with the perfect match and mismatch probes is that with the mismatch probes you measure the non-specific binding signal intensity, while with the perfect match probes you measure both the non-specific and the specific binding. On this figure here, what is shown is the signal of expression: they do spike-ins in samples so that they know exactly the concentration of a specific transcript in a sample, and then they measure, in red, the signal intensity of the perfect match probes only. What you can see is that for low concentrations of your transcript, the signal intensity of the perfect match probes does not vary much, so you do not have much specificity at low concentrations of your transcript. But if you subtract the signal of the mismatch probes from the signal of the perfect match probes, then you recover a specific signal: you can see an increase here that is proportional to the concentration of the transcript in your sample.
And it is by subtracting the signal of the mismatch probes from the signal of the perfect match probes that you get this specificity back, because you took into account the technical noise, the non-specific binding. So this is very clear and widely used for Affymetrix, and for me it highlights the need to take technical noise into account; I will also show you the importance of taking biological noise into account. So this is how it is done for Affymetrix: you have these perfect match and mismatch probes, and there is a software widely used for analyzing Affymetrix data called MAS5. MAS5 basically uses the distributions of the perfect match and mismatch probe signals, performs a Wilcoxon test between them, and then gives you an answer about whether your gene is expressed: after this Wilcoxon test, it tells you whether your transcript is present, marginal, or absent. What we do in Bgee is use the same approach, but first with an improved method for normalizing the signal intensity, called GCRMA. It corrects the signal intensity taking the probe sequences into account, because depending on the sequence you have different affinities and you are more or less likely to get a high signal; GCRMA corrects for these differences in sequence affinity. And a better method for estimating the background signal, rather than using the mismatch probes, is to use weakly expressed perfect match probes. So these are just small improvements; basically the algorithm of MAS5 stays the same. Here you have ROC curves comparing the different methods: this ROC curve is with the baseline MAS5 algorithm.
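To make the detection-call idea concrete, here is a minimal sketch in Python. The real MAS5 algorithm uses a one-sided Wilcoxon signed-rank test on the discrimination scores against a small constant tau, with specific p-value thresholds; to keep this sketch dependency-free, I replace the signed-rank test with a plain sign test, and the function name and default thresholds are my own illustrative choices.

```python
import math

def detection_call(pm, mm, tau=0.015, p_present=0.04, p_marginal=0.06):
    # Simplified MAS5-style present/marginal/absent detection call.
    # Discrimination score for each probe pair: (PM - MM) / (PM + MM).
    scores = [(p - m) / (p + m) for p, m in zip(pm, mm)]
    n = len(scores)
    # MAS5 itself tests the scores against tau with a one-sided Wilcoxon
    # signed-rank test; here a sign test stands in: count scores above
    # tau and compute a one-sided binomial p-value.
    above = sum(1 for s in scores if s > tau)
    p_value = sum(math.comb(n, k) for k in range(above, n + 1)) / 2 ** n
    if p_value < p_present:
        return "present"
    if p_value < p_marginal:
        return "marginal"
    return "absent"

# A probe set where PM consistently exceeds MM:
pm = [1200, 1500, 900, 1100, 1300, 1000, 1400, 950, 1250, 1050, 1350]
mm = [300, 280, 310, 290, 305, 295, 300, 285, 290, 310, 300]
print(detection_call(pm, mm))  # present
```

The point of the sketch is the shape of the logic: a per-probe-pair score contrasting specific and non-specific binding, a test of the score distribution, and p-value cutoffs that yield the three-way call.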
And these two ROC curves here are obtained by normalizing with GCRMA and by using weakly expressed perfect match probes to estimate the background signal. So these are just small improvements, and in Bgee we use the best techniques that we can: GCRMA and the weakly expressed perfect match probes. Then, for Affymetrix data (okay, I should maybe speed up), either the raw data are available, the CEL files, and we perform the analysis I just showed you, or we only have the processed MAS5 data, and we treat them differently. MAS5 says present, marginal, or absent: present and marginal we treat as present in Bgee, and absent as absent, but as we could not reanalyze the data, we treat all of these calls as low quality. When the CEL files are available, we re-perform the analysis, and based on the FDR value we say it is present high quality, present low quality, or absent high quality. So we have present/absent expression calls, and for each of these calls a quality level: present high quality, present low quality, absent high quality, absent low quality. This is what we generate at the level of individual experiments; then there is a broader integration in Bgee that I will present later. Okay, I will speed up a little bit and move on to RNA-seq data, which is more interesting because it is the current technique nowadays. For RNA-seq data, this is the library preparation I presented at the beginning: you extract your RNAs, fragment them, convert them to cDNA, sequence them, and align back to your genome. Now, to detect active signals of expression, what we see a lot in papers, even for differential expression analysis, is that most authors put a threshold on the expression value to select the genes to study.
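Those thresholds are usually quoted in RPKM or TPM; as a reminder, here is how these standard units are computed, on a toy library with made-up numbers (the function names are mine):

```python
def rpkm(counts, lengths_bp, total_mapped_reads):
    # Reads Per Kilobase per Million mapped reads:
    # normalise by gene length (in kb), then by sequencing depth (in millions).
    return [c / (l / 1e3) / (total_mapped_reads / 1e6)
            for c, l in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    # Transcripts Per Million: normalise by length first, then rescale
    # so that the values sum to one million within the library.
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [r / total * 1e6 for r in rates]

# Toy library: three genes, 1000 mapped reads in total.
counts, lengths = [100, 500, 400], [1000, 2000, 4000]
print(tpm(counts, lengths))
```

Note that TPM values always sum to one million within a library, while RPKM values do not, which is one reason the same numeric threshold behaves differently across libraries.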
So for instance, you do a differential expression analysis and you only compare genes that have at least two TPM or one RPKM. Also, in a lot of papers, to state which genes are expressed in a sample, the authors say: we consider the genes with at least one RPKM or two TPM. But this threshold varies a lot between studies, because it is arbitrary: some authors use a value of 0.1 RPKM, some others use three TPM. It is very unlikely that one threshold can fit all library preparations, all cell types, cell populations, and species, and there is no consensus on the actual value to use. So in Bgee we defined a method that allows us to have a non-arbitrary cutoff. In this plot, I show you box plots of the percentage of protein-coding genes considered present if you use this cutoff of two TPM, for the 25 species in Bgee. Here you have human, and you can see that with this arbitrary cutoff, in some human libraries you can have almost no genes considered expressed, while in others it goes as high as 75%. Here, for instance, in C. elegans, with this arbitrary cutoff, you have libraries where all genes are considered expressed. And there is large variability between the different species. So obviously this naive cutoff does not work very well. It works most of the time: if you have two libraries, most of the time it will be fine. But when you try to integrate thousands of libraries, it cannot work like that. So what we do comes from a paper from 2011, where the authors studied RNA-seq libraries, and here you get a density plot of the expression values. It was in yeast, I think, but it doesn't matter.
In black, you have the density plot of the expression values over all genes. What you see, and it is very typical of RNA-seq libraries, is a shoulder here on the left. What the authors of this paper say is that this shoulder is actually the sum of two distributions: lowly expressed genes and highly expressed genes. And they say that these lowly expressed genes are actually not expressed, not actively transcribed genes. They looked at histone modifications, at chromatin opening, they checked all of that, and the genes in the lowly expressed distribution really exhibit all the signs of genes that are not expressed. So basically it means that this distribution, which you see in almost all RNA-seq libraries, is the sum of the distribution of not-expressed genes and of actively expressed genes. And what we noticed is that the distribution of the not-expressed, lowly expressed genes matches really closely the signal from intergenic regions. So here, this is the distribution of intergenic regions. From this paper, we got the idea of using the signal of expression of intergenic regions: you should have none, but still you have a signal of expression for intergenic regions, and this is biological and technical noise. So we use the signal of expression of intergenic regions to estimate this noise and to determine a cutoff above which genes are considered actively expressed. Here I show you another library, from our own data, to show you that this is very typical; it comes from a Drosophila melanogaster library. In red, you get the distribution for all genes; you see the shoulder here on the left. The dotted line is protein-coding genes, where this shoulder is even more obvious. And in blue, you get the intergenic regions.
And you can then define a cutoff, for instance the TPM value at which only 5% of the reads above the cutoff map to intergenic regions. In this library, that would put the cutoff at, for instance, one TPM, which is the classical threshold applicable to most libraries, but not all. And you would say: on the right, the genes are expressed; on the left, the genes are not expressed. So here is the plot I showed you with the naive cutoff, and here is the plot with our method, defining the cutoff using intergenic regions. You can see that for human, we recover the libraries where almost no genes were expressed: you no longer have libraries with almost no genes expressed. We also recover the libraries where all genes were expressed in C. elegans; this is no longer the case. You can see that the distribution is tighter and much more consistent between species as well. This is how we can say that the method is actually working: now the comparisons make sense between species and also between different experiments. And we do expect between 60 and 80% of genes to be expressed in any sample, so again, this picture makes much more sense. What I also want to show you is that at first, when we designed this method, we just used the intergenic regions from the genome annotations we were provided with, and these are the results we obtained at first. It was actually working only for model organisms: it was working for C. elegans here, here for Drosophila melanogaster, here for human, here for mouse. It was working mostly for these species, but for all the other species here, it was working poorly. So we were concerned about that, and we realized that in non-model organisms, the genome is really less well annotated.
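The intergenic-based cutoff just described can be sketched as a scan over candidate TPM values; this is a simplified illustration of the idea, not the exact Bgee implementation (the function name and the exact way the intergenic-to-genic ratio is computed are my own assumptions):

```python
def intergenic_cutoff(gene_tpm, intergenic_tpm, target_ratio=0.05):
    # Scan candidate TPM cutoffs (all observed values, ascending) and
    # return the smallest one at which the fraction of intergenic
    # regions still called present, relative to the fraction of
    # protein-coding genes still called present, drops to the target.
    candidates = sorted(set(gene_tpm) | set(intergenic_tpm))
    for cut in candidates:
        genes_above = sum(1 for t in gene_tpm if t >= cut)
        inter_above = sum(1 for t in intergenic_tpm if t >= cut)
        if genes_above == 0:
            break
        ratio = (inter_above / len(intergenic_tpm)) / (genes_above / len(gene_tpm))
        if ratio <= target_ratio:
            return cut
    return None

# Toy data: genes spread over a wide range, intergenic signal mostly low.
genes = [0.5, 1, 2, 4, 8, 16, 32, 64]
intergenic = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 1.2, 2.5]
print(intergenic_cutoff(genes, intergenic))  # 4
```

The key property is that the cutoff adapts to each library's own noise level, instead of applying one fixed TPM value to every library and species.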
So when we plot the density of the expression values, as I showed you before, the intergenic regions in blue were really shifted to the right, while the protein-coding genes, or all genes, were shifted to the left, so the two distributions overlapped. This is just because of the low-quality annotation of the genome: it means that in what is considered intergenic, there are actually lots of non-annotated genes, most likely non-protein-coding genes, but still lots of genes that are non-annotated and counted as intergenic. So what we do is pool all the RNA-seq libraries that we have for a species; here, for Macaca, we have 90 libraries. We pool them all together and plot the distribution of expression, and then we deconvolute the intergenic regions: we try to identify the different Gaussians within the intergenic distribution, which is the sum of several Gaussians. And we keep the Gaussian on the left: the intergenic regions in this Gaussian we consider to be real intergenic regions, regions we are confident are intergenic, with low signal intensity and low expression values. We use only these intergenic regions to estimate our background expression level. And then we get back to this plot: this is how we obtain it for non-model organisms, by trying to identify true intergenic regions. We provide these intergenic regions: you can download them and use them in your own analyses. You will see that this afternoon during the practicals; we have a Bioconductor package available that allows you to perform these analyses on your own libraries, as long as the species is in Bgee, using the intergenic regions that we have defined. Okay, I will skip the details here because I don't have time.
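The deconvolution step just described can be sketched with a small expectation-maximisation fit of a two-component Gaussian mixture on the (log-scale) expression values; this is a self-contained illustration, not the actual Bgee code, which may rely on an existing mixture-model package:

```python
import math
import random

def normal_pdf(v, mu, sd):
    # Density of a 1-D Gaussian at value v.
    return math.exp(-((v - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def em_two_gaussians(x, iters=200):
    # Fit a two-component 1-D Gaussian mixture with EM; returns
    # (weights, means, stds), sorted by mean, so the first component
    # can stand for the low-expression "true intergenic" Gaussian.
    xs = sorted(x)
    n = len(xs)
    mu = [xs[n // 4], xs[3 * n // 4]]   # crude quartile initialisation
    sd = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for v in xs:
            p = [w[k] * normal_pdf(v, mu[k], sd[k]) for k in range(2)]
            s = sum(p) or 1e-300
            resp.append([pk / s for pk in p])
        # M-step: re-estimate weights, means and standard deviations.
        for k in range(2):
            nk = max(sum(r[k] for r in resp), 1e-12)
            w[k] = nk / n
            mu[k] = sum(r[k] * v for r, v in zip(resp, xs)) / nk
            var = sum(r[k] * (v - mu[k]) ** 2 for r, v in zip(resp, xs)) / nk
            sd[k] = max(math.sqrt(var), 1e-6)
    order = sorted(range(2), key=lambda k: mu[k])
    return ([w[k] for k in order], [mu[k] for k in order], [sd[k] for k in order])

# Demo on synthetic log-expression values: a low "true intergenic"
# component around -2 and a higher component around 3.
random.seed(1)
values = ([random.gauss(-2, 0.5) for _ in range(200)]
          + [random.gauss(3, 0.5) for _ in range(200)])
weights, means, stds = em_two_gaussians(values)
print(means)  # the lower mean flags the putative true intergenic set
```

In practice one would then keep only the regions assigned with high responsibility to the left-most component as the confident intergenic set.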
For in situ hybridization, basically what we get is that the authors provide pictures of the staining, areas where gene expression is detected. Then curators manually review these papers and images and capture, using ontologies, the areas where the gene is expressed. So what we get at the end is something like this: they would say that this gene is expressed in the pharyngeal arch neural crest cells at stage 25. We take all these annotations that have been manually made by curators and remap them to the ontologies we use. At the end of the day we get the same thing: where and when, in which sex, in which condition a gene is expressed, captured using ontologies. For EST, okay, it is an old technique now, but basically, depending on the number of ESTs mapping to a gene in a library, we consider the gene as present high quality or present low quality. With EST data, we do not produce calls of absence of expression: we do not consider it reliable enough to actually say that a gene is not expressed. So we use EST data only to say where a gene is expressed, not to say where it is not expressed. So briefly, just to finish this first presentation: we are working right now on integrating single-cell RNA-seq data. As Mark said, we want to integrate it in a consistent way with the other data. In most cases, single-cell RNA-seq data are used to identify cell types: hundreds of new cell types have been identified that we had no idea even existed before. This is done by performing clustering: you do a single-cell analysis, cluster the cells based on gene expression, and you can actually clearly see different cell types. So this is how most single-cell data are used nowadays.
But what we want to do is use single-cell data to say: in this cell type, these genes are expressed. And this is difficult because the signal is much noisier. So here is a density plot of genes depending on the number of cells they are expressed in, and there is a clear bimodality when you look at an individual cell type: there are genes that are expressed in almost all cells, and genes that are expressed in no cell. This bimodality is actually a very good quality control: when you don't have it, most likely you have a mix of different cell types that you have not correctly identified. So what we are trying now is, for instance, to define a cutoff saying that the genes expressed in at least 20% of the cells of a cell type are considered actively expressed in that cell type. We are benchmarking different solutions like that. Here, for instance, is a ROC curve comparing single-cell data to bulk RNA-seq data using the different methods we are benchmarking right now, and it is actually working pretty well: we have an AUC of 0.95 when comparing single-cell data of a given cell type, here epiblast cells, to bulk RNA-seq data. So our methods seem to perform quite well; we are benchmarking them, and this will be integrated in the next release of Bgee in a few months from now, because it is a lot of work. Okay, so to conclude this first part: I introduced the concept of present/absent expression calls in Bgee, and how we generate them from each data type so that the data are comparable. We generate these calls from each sample individually, at the gene level, with two quality levels, low and high. Present expression calls represent where expression of a gene is detected above the biological and technical noise, and we produce them from all data types.
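The candidate single-cell rule mentioned above, calling a gene present in a cell type when it is detected in at least 20% of that cluster's cells, is trivial to express; a small sketch, keeping in mind that 20% is one of the thresholds being benchmarked, not a final choice:

```python
def present_in_cell_type(counts_per_cell, min_fraction=0.2):
    # Call a gene present in a cell type when it is detected
    # (count > 0) in at least `min_fraction` of the cluster's cells.
    # The 20% default is illustrative, one of the candidate cutoffs.
    detected = sum(1 for c in counts_per_cell if c > 0)
    return detected / len(counts_per_cell) >= min_fraction

# Gene detected in 3 of the 10 cells of a cluster -> present.
print(present_in_cell_type([0, 0, 1, 2, 0, 0, 0, 3, 0, 0]))  # True
```

The same bimodality shown on the density plot is what makes such a fraction-based rule plausible: genuinely expressed genes appear in most cells of a well-defined cell type, while noise appears in very few.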
Absent expression calls represent where active expression of a gene is not detected. It is very useful to distinguish this from having no data: when you have no signal of expression in a condition, is it that your gene is not expressed, or that you had no data? It is very important to distinguish between the two. And absent calls are produced only from in situ hybridization, Affymetrix, and RNA-seq data, not from EST data. And then, how we integrate and make sense of all these calls, and the tools that you can use thanks to these calls, will come in the next presentations.