Okay, hi everyone, I think we can start. I'm going to talk about the more technical details: how you submit data to ISMARA, how the different functions work, and I'll also go live over some web pages so you can see how the interface works. I'll cover the genomes and data types ISMARA supports, then the steps you should take to prepare your data for upload. Then I'll show the uploading interface, go over the web pages, show the averaging functionality and the batch effect correction, talk about alternative ways to upload your data to the ISMARA server, either with the ISMARA uploader or with the ISMARA client, and finally give an overview of expert mode. As Eric already mentioned, we currently support five organisms: human, mouse, rat, yeast, and E. coli. Here are some statistics about the genome versions and how many genes, transcripts, motifs, transcription factors, and microRNAs are included in each organism's annotation. For human and mouse we also support the older genome versions hg18 and mm9, but these are based on old transcript and motif annotations. Here is what data types ISMARA supports. First of all, we can analyze microarray data: we support Affymetrix chips, with data expected in Affymetrix CEL format, and these chips are supported for human, mouse, rat, yeast, and E. coli. We also support next-generation sequencing data, which can be either mapped reads in BAM or BED format, or unmapped reads in FASTQ format. FASTQ is supported for human, mouse, rat, and E. coli, but not for yeast. Here is a summary of how we process microarray data: we correct for background and unspecific binding using Bioconductor packages such as oligo and GCRMA, then we filter out non-expressed probes, quantile-normalize the data between samples, and finish with a log transformation.
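As an aside, the normalization steps just listed can be sketched in a few lines of Python. This is an illustration only, not ISMARA's actual code (ISMARA uses Bioconductor packages such as oligo and GCRMA); the toy matrix and the pseudocount are my own assumptions:

```python
import numpy as np

def quantile_normalize(x):
    """Quantile-normalize the columns (samples) of an expression matrix.

    Each sample's values are replaced by the mean of the sorted values
    across all samples at the same rank, so every sample ends up with
    an identical value distribution.
    """
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)   # rank of each value within its sample
    mean_sorted = np.sort(x, axis=0).mean(axis=1)       # reference distribution (mean per rank)
    return mean_sorted[ranks]

# Toy matrix: rows = probes, columns = samples (invented numbers)
expr = np.array([[5.0, 4.0],
                 [2.0, 1.0],
                 [3.0, 4.5],
                 [4.0, 2.0]])

norm = quantile_normalize(expr)
log_expr = np.log2(norm + 1.0)  # pseudocount before the log transformation
```

After normalization both columns contain exactly the same set of values, just in sample-specific order.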
For raw read data in FASTQ format, we apply different strategies for RNA-seq and ChIP-seq. For RNA-seq we map reads with the kallisto algorithm to the transcriptome, as Eric has shown before; for ChIP-seq we map reads with kallisto to promoter regions. So ChIP-seq data is always mapped to predefined regions; we do not infer genome-wide binding peaks for ChIP-seq. For RNA-seq we count reads per transcript, calculate TPM values for every transcript, and then log-transform the data and use this in ISMARA, while for ChIP-seq we count reads and then use these counts to infer the dynamics of read density changes between promoter regions. In the case of mapped reads, we take the existing mapping, again calculate TPM values or read counts, and then follow the same protocol as for raw reads. One important thing here is that we strongly recommend users to submit raw reads instead of mapped data sets. Why is that? From our experience, mapped data sometimes causes problems: for example, it could be mapped to a very different reference genome, or it could be preprocessed in ways that lead to problems when we try to calculate, for example, TPMs per transcript. So if possible, just upload the raw data. As I already said, we support the following file formats: CEL, BAM, BED, and FASTQ. For FASTQ it is also important to have a proper file name ending; I will say more about this a little later, when talking about paired-end support. Files can also be compressed: they can be zipped, gzipped, tarred, or submitted as a tar.gz archive. It is also quite important for a good presentation of the results to properly name the sample files, because ISMARA takes the file names as sample names; here are examples of bad naming and good naming, which I will go over on the next slides. But first, about compressing the files.
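A quick aside on the TPM calculation mentioned above. This minimal sketch (toy counts and lengths invented for illustration; it is not ISMARA's actual code) shows reads-per-kilobase scaling, per-million normalization, and the final log transform:

```python
import math

def tpm(counts, lengths_bp):
    """Convert raw read counts per transcript to TPM values.

    counts     -- reads mapped to each transcript
    lengths_bp -- transcript lengths in base pairs
    """
    rpk = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]  # reads per kilobase
    scale = sum(rpk) / 1e6                                        # per-million scaling factor
    return [r / scale for r in rpk]

# Toy example: three transcripts
counts = [100, 300, 600]
lengths = [1000, 2000, 3000]
values = tpm(counts, lengths)
log_tpm = [math.log2(v + 1) for v in values]  # log-transform as used downstream
```

By construction the TPM values of a sample always sum to one million, which is what makes them comparable between samples.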
It is always a good idea to compress a file before uploading it, and this applies to CEL, BED, and FASTQ files. These files can be compressed efficiently, and this way you can significantly reduce the upload time. Upload time is crucial in some cases, because some users have unstable network connections, and the less time you spend uploading your file, the lower the probability that something goes wrong. For BAM files compression is not necessary, because they are already compressed. There is also no benefit in compressing all your files into one big archive, unless it already exists in such a form; in that case you can of course upload the whole archive, and ISMARA will automatically extract all the content and process it properly. Now about file naming. Here I present two examples with essentially the same data and the same profile, but in one case the files were named in a comprehensible way, saying for example that this is day zero, this is day three, and giving some information about the replicates. Since samples are arranged on these profiles in alphabetical order, the ordering is absolutely clear from the very beginning: you can see the dynamics from day zero, with activity rising towards day three. If instead the files were submitted with identifiers assigned to them by, for example, a sequencing center, you get something like what you see on the right plot, and you will need to spend some time to disentangle what it means. For example, here the day-zero samples are split, and in order to see that all activities are down at day zero, you first need to work out which identifiers correspond to day zero, which to day three, and so on. So if possible, rename your samples, because this is how they will be presented in the results.
The basic rule is to name your files in such a way that, in a directory listing, the samples are ordered the way you would like to see the results. Here you again see examples of good and bad naming. On the left is the good example: comprehensible sample names that will automatically be ordered properly on all the plots. On the right is a bad example with database identifiers, where, to properly understand the results, you need to build a mapping yourself from every identifier to the proper condition. Finally, to enforce a particular file order you can use special prefixes. It is important to use leading zeroes in such prefixes: in an alphanumeric sort, if you write "1" instead of "01", a sample like "10" will not be listed after "2" but before it, as shown in the examples. Now about file naming for paired-end FASTQ files. Paired-end FASTQ files require a special suffix at the end of the file name: it should be _R1 for one end and _R2 for the other, and the sample names of both files must be identical before the suffix. Here is an example: I have a sample, control1, with one file for the first end and another file for the second end. Unfortunately, ISMARA does not yet support multiple file pairs per sample, so you have to merge such files before submission. If you need assistance with that, you are welcome to contact us; we can always advise you how to do this on different platforms. Okay, now about the data set upload, and first of all about the parameters you see on this page. We always try to keep our tools as simple as possible and to ask the user for as few parameters as possible. In this case you have, first of all, two fields: email and project name.
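Before moving on: the two naming conventions above, zero-padded prefixes and the _R1/_R2 paired-end suffix, can be illustrated in a few lines of Python (all file names here are hypothetical):

```python
import re

# Alphanumeric sort: without zero-padding, "10" sorts before "1_" and "2_".
unpadded = ["1_day0.fastq.gz", "2_day3.fastq.gz", "10_day20.fastq.gz"]
padded   = ["01_day0.fastq.gz", "02_day3.fastq.gz", "10_day20.fastq.gz"]

bad_order  = sorted(unpadded)  # day20 incorrectly lands first
good_order = sorted(padded)    # matches the intended time course

def pair_fastq(filenames):
    """Group paired-end FASTQ files by the shared sample name before _R1/_R2."""
    pairs = {}
    for name in filenames:
        m = re.match(r"(.+)_R([12])\.fastq\.gz$", name)
        if m:
            pairs.setdefault(m.group(1), {})["R" + m.group(2)] = name
    return pairs

files = ["control1_R1.fastq.gz", "control1_R2.fastq.gz"]
pairs = pair_fastq(files)  # both ends grouped under the sample name "control1"
```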
Email and project name are optional fields, so you can submit your data anonymously, without any email. But I would recommend that you always specify your email. Why? First of all, if you specify an email, you will get a notification with a link to the results after the job is finished, so you can easily access them, download them if necessary, and so on. Also, if there is a problem in processing the data and you have submitted an email address, we can always contact you to resolve the problem and find the best solution. If you submit data anonymously and something breaks, you have to contact us yourself, because we don't know a contact person. The project name can also be quite useful: when you open the results, it is shown at the top left, and if you have results from multiple data sets you want to compare and start switching between browser tabs, having an identifier for each data set becomes quite important, so you can easily distinguish between them. Also, in case of problems, it is quite easy for us to find a data set by its project name, if you submitted one. Next is the data type. You see we support microarray, RNA-seq, and ChIP-seq. In the case of microarray, the only additional option you can choose is whether to run the analysis with microRNAs included or without them. Here is the email, here is the project name, and here are the options for data type. For microarray, as I said, you don't need to choose anything else, just specify the microarray data type. But in the case of RNA-seq or ChIP-seq you also have to choose the corresponding organism, because usually we cannot extract from the files which organism and which genome version was used for mapping, or which organism a FASTQ data set belongs to, so you always have to specify which organism it is coming from.
It can be human, mouse, rat, or E. coli, and we also have two older versions of the human and mouse genomes. But these older versions work only with mapped data: for these two genome versions we have no support for FASTQ files. Okay, let's say I would like to submit data: FASTQ files, RNA-seq, and let's choose mouse. Now I would like to add some files. Here I have some example data; you see they are named by days, as in the mouse liver data set. I can select them all at once and open them, and now you see the list of files shown here. You can also add files one by one, and the files can be located anywhere on disk, in different directories. Once you have finished adding files, you just click "start upload", and the data start to upload. Here you can see the progress, and there is also a progress bar for every individual file. Once the upload is finished, you are automatically redirected to a so-called status page. It shows you information about the job status: either it is running, or there may be messages about errors that occurred during execution. The URL here is very important: if you have not provided your email address, you need to either keep this tab open until the job is finished or save this URL somewhere to access it later. This page reloads, I think, every three or five minutes, and once the computation is finished you will see your results page, like this one here. So again: if you haven't submitted your email address along with your data set, please don't forget to either keep this tab open the whole time or save this URL somewhere so you can access your results. All right. Now, about running time: ISMARA usually runs from one hour to a few hours, depending on the size of the data set, but it can sometimes take longer if the computational cluster is overloaded.
I would say to set a threshold of 24 hours: if you don't get any notification within 24 hours, please contact us and we will find out if something went wrong. If you do contact us, it helps if you provide the status page URL or, for example, the project name, so we can quickly locate your job on the server and check if everything is okay. Okay, so now I'm going through a results page. As Eric already introduced, this is the developmental dynamics of the mouse liver transcriptome. Here is the link to the original data set; you can download the data and try it yourself. There are 12 time points, I think two replicates per time point, and they are separated into different phases: prenatal and postnatal, with the postnatal period further split into a suckling and a post-weaning phase. And here is an example of the main page. Before I move to the mouse liver data set, a few words about the tabs in the main interface. We have an example results section, which lists ISMARA runs for different data sets. It usually includes the data set itself plus some averaged results: for example, for the Illumina Body Map we have averages over replicates and over young and old samples, and for the GNF SymAtlas there is an averaging of cancer versus non-cancer samples, and so on. At the bottom we have links to this mouse liver data set, in a non-averaged and an averaged version of the results, which you can click and open. Other tabs include some general information: information about usage, the supported Affymetrix chips for the different organisms, and also some data sets which you can download, upload again, run, and compare against a predefined set of results. And here is how you can upload data to the server. Here I would like to point you to the ISMARA client and ISMARA uploader, which I will review later.
But these days, when people work from home, sometimes without a very good internet connection, these utilities, the ISMARA client and the ISMARA uploader, can be quite important and can make uploading data to the ISMARA web server much, much easier. Then there is a frequently asked questions section, example results, terms of use, and contact. Okay, let's go back to the mouse liver data set. As Eric already said, the first thing you see on the results page is a table of motif activities, sorted by z-value. This z-value is actually the average of the squared z-values over all samples, so these values are all positive and express the overall significance of a particular motif in the analysis. In this case the top hits are E2F and related transcription factors, whose activity goes down, indicating that cell division decreases in the postnatal phase. I would actually like to show you the results for HNF4-alpha, but before we go to its results page, a short demonstration of what you can do on this page. First of all, this table is sortable: you can sort it by z-value, for example, to see the highest and lowest values, and you can also sort it by motif name if necessary. But what I think is most useful is that you can search here for motifs. If I search for HNF, you see we have motifs associated with different HNF genes, all listed here, among them the HNF4 genes such as HNF4A, and again I can sort them. For every motif we have a set of links, so you can access different databases to see information about the gene: for example, you can go to Ensembl, to NCBI, to MGI, or to UniProt. Then there is a profile picture; if you click on it, you can access a large version of the activity profile. There is also a logo of the motif itself, which you can also click; let me open this in another tab. This gives you access to a large picture of the weight matrix.
Now, as I mentioned, we group some motifs because we cannot statistically distinguish their activities from each other. For such grouped motifs, like this one, we show only one logo in the table. Some people think this logo represents a motif that is the same for both transcription factors; that's not true. We show only one of the motifs included in the motif group, simply because we don't have much space in the table, and if we included all the logos it would look awful. But if you click on the motif name, you can see that we list the motif logos for all motifs associated with all genes inside the motif group: you see we have a separate motif for E2F2 and a separate motif for E2F5. All right, in addition to the motif table we also have a sample table; let me go down here. It just lists the names of the samples in your data set. This table is also used for specifying the sample selection for averaging, if necessary, or, if you would like to do a batch correction, for specifying the batches in your data set. The sample names are links to the sample pages, which I will review a little later. A few words about the navigation menu on the left. In the page navigation there is the motif significance table, the sample table, then a link to the page of mean activities, and also to a page listing all promoters sorted by the fraction of explained variance, which we will also look at later. Then there are special utilities: you can search for a gene, for example E2F, and see some information about it (I will show later what exactly this means), and you can start the averaging procedure. Finally, there is a download section.
You can download different data: the activity table, the data table, the list of regulatory interactions, the motifs sorted by significance (basically the main table you see on the screen), and also the whole report as a tar.gz archive. So let's now go to the HNF4-alpha page. At the top you see the picture representing the weight matrix; it is clickable, so you can open the full-size picture of the motif. Then there is some information about the gene associated with this motif: its name and the corresponding gene ID. Our gene annotation is based on Ensembl, so we have the corresponding Ensembl ID and, if available, some information about the gene. Again, there are links for this gene to different databases if you need additional information. Next is the activity-expression correlation. Here you again see the gene name, the promoter associated with the gene, and the corresponding Pearson correlation coefficient. We also provide the p-value, and there is a link: if you move your mouse over it, you see a small version of the plot, and you can click it to see the full version. As Eric already mentioned, these plots are quite important and provide a lot of information. They show whether the expression of a transcription factor correlates with its activity, which can indicate whether the factor is an activator or a repressor. But a lack of correlation doesn't indicate that something is wrong: there can be mechanisms of post-transcriptional regulation. For example, HNF4-alpha has been shown to change its activity upon phosphorylation and acetylation, so the transcription factor could be expressed and the protein produced, but simply not active because of missing phosphorylation, acetylation, or something else. Next we have the activity profile of the HNF4-alpha motif.
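An aside on the activity-expression correlation just described: it is a plain Pearson correlation across samples, which can be sketched as follows (the numbers are invented for illustration; this is not ISMARA's code):

```python
import numpy as np

# Hypothetical per-sample values: motif activity vs. expression of the
# associated transcription factor gene (illustrative numbers only)
activity   = np.array([-1.2, -0.8, -0.1, 0.4, 0.9, 1.3])
expression = np.array([ 2.1,  2.4,  3.0, 3.6, 4.1, 4.4])

# Pearson correlation coefficient between the two profiles
r = np.corrcoef(activity, expression)[0, 1]

# r close to +1 suggests an activator, close to -1 a repressor;
# r near 0 is inconclusive (e.g. post-translational regulation).
```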
Unfortunately, if we have very long sample names, they don't fit on this picture, but it is an interactive plot: if you move the mouse over the points, you see at the bottom the name of the sample, and at the top the value of the point with the corresponding error. If you would like to save this picture, there is a special button at the top, so you can download the plot as an image. In addition, you can zoom using these buttons and rescale back if necessary. The next one is a picture of the sorted z-values, which basically indicates in which samples we have the highest z-values, which usually corresponds to the highest activity. In this case we see that the most negative z-values, the lowest activity, correspond to the first time points, in the in utero phase, and the highest z-values, corresponding to the highest activity, belong to the last time points, which correspond to an essentially adult mouse. Next is the STRING database, which Eric has reviewed already. We show the picture, but it is clickable: if you click on it, it takes you to the STRING database, where you see this picture and can click on the different proteins to get information about them, for example this cluster of Cyp genes. You can also get information about the connections between different proteins, to see what evidence supports each connection. In addition, STRING of course provides further tools which you may find interesting, for example clustering analysis, export, and so on. Then we have the first-level regulatory network picture. As Eric said, this picture is interactive in the sense that, first of all, you can get information from it.
If you move your mouse over a motif name, it shows you the corresponding z-value. The motif of interest, in this case HNF4-alpha, is in the center; the central one is always red, which just indicates that it is the central one, and all the motifs surrounding it are blue. Now, if we have evidence that the central transcription factor regulates another transcription factor, we draw a red arrow; if we have evidence that a blue transcription factor regulates the red one, we draw a blue arrow. If they regulate each other, as for example here, we have two overlapping arrows. The color intensity of an arrow indicates the chi-square score; in this case we see that the chi-square score of the promoter corresponding to the NR5A2 gene is 11.2. There is a slider on the left with which you can remove the links and transcription factors with the lowest chi-square scores, and this way we keep only the most significant regulatory interactions. In this case, after moving the slider, I have four genes indicated as regulated by HNF4-alpha, and three of them are well known: HNF1-beta is known to regulate HNF4-alpha and vice versa, and these nuclear receptors are also known to be regulated by HNF4-alpha. But this picture can also provide you with interesting candidates for future investigation: it indicates that HNF4-alpha regulates the promoter of a gene for which I haven't found any experiments in the literature that prove it, so this is potentially an interesting candidate for investigating novel regulation by HNF4-alpha. You can also access an interactive SVG image, which provides basically the same functionality, or download these images in PNG format. Now, the target table.
This table, like almost all tables in ISMARA, provides some interactive functionality. For example, you can search for gene names or gene IDs: if I type Cyp here, it shows me that there are 18 cytochrome P450 genes regulated by HNF4-alpha. What is important about this table is that it doesn't show you all the targets. The target list can be quite long, so in order to keep pages compact and easy to download, we decided to limit the number of targets shown in this table to the top 200 entries. If you need the full list of targets with the corresponding scores, you need to download the file given here as "regulatory interactions"; it contains the list of all targets with their scores. Something else worth mentioning on this page: here we have a link to the promoter. You can click it, and it takes you to the SwissRegulon database, where you can browse the promoter region to see which transcription factors are predicted to have binding sites in the promoter, and which transcripts and genes are around it; you can zoom in and out. In this case, this promoter corresponds to these two transcripts of the Cyp2c29 gene. All right. Next, we provide information about the overrepresentation of the targets in different gene categories. These are also listed in a table: you see here we have the term ID and the description of the term, and then a total log-likelihood and a log-likelihood per target. What is this? In the targets table, every promoter has a score. To calculate the overrepresentation, we take a category and sum the target scores within that category; the sum is shown as the total log-likelihood score, and the average, the total divided by the number of genes in the category, is shown as the log-likelihood per target.
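The overrepresentation calculation just described can be sketched as follows (gene names, scores, and category memberships are invented for illustration; this is my reading of the description, not ISMARA's code):

```python
# Hypothetical target scores (log-likelihood contributions per promoter)
target_scores = {"Cyp2b10": 3.1, "Cyp2c29": 2.8, "Slc22a1": 2.0, "Apoa4": 1.5}

# Hypothetical category memberships
categories = {
    "cytochrome P450 pathway": ["Cyp2b10", "Cyp2c29"],
    "organic cation transport": ["Slc22a1"],
}

def category_scores(categories, target_scores):
    """Return (total log-likelihood, log-likelihood per target) for each category."""
    out = {}
    for term, genes in categories.items():
        hits = [target_scores[g] for g in genes if g in target_scores]
        total = sum(hits)
        out[term] = (total, total / len(hits) if hits else 0.0)
    return out

scores = category_scores(categories, target_scores)
```

Note how a large category can win on the total while a small, specific category wins on the per-target average, which is exactly why both numbers are reported.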
We found that both of these metrics are useful. Categories with the highest total log-likelihood usually are rather general gene categories; in this case it is the cytochrome P450 xenobiotic metabolism pathway, which is not surprising given how many cytochrome genes we've seen in the target list. The log-likelihood per target, the average log-likelihood, usually corresponds to the most specific categories, so you can see which categories are most specific for this target set; in this case it is sodium-independent organic anion transport and so on. We provide this information for the biological process, cellular component, and molecular function categories, which are the standard Gene Ontology collections, and in addition we provide two categories from the Molecular Signatures Database, two sets from its curated gene sets: the canonical pathways and the Reactome pathways. In this case, for HNF4-alpha, we see categories at the top that do make sense: for example, FOXA2 and FOXA3 are known to be regulated by HNF4-alpha, if I remember correctly. The top one in the Reactome category is organic cation transport, which corresponds to another big set of genes among the targets, the SLC transporters; you see we have 12 of these in the top 200 targets. And that's basically all about the motif page. Next, I wanted to show one interesting example which we have here. You see at the top we also have a quite interesting factor called NR2E1, a nuclear receptor. What is interesting about it is that it shows a quite unusual activity profile: while most activity profiles here change rather monotonically from left to right or from right to left, this one has a plateau in the early days and only starts to rise at the weaning stage of mouse development. That was quite interesting.
So we started to investigate what is known about this transcription factor and, to our surprise, we didn't find any information. When we had a closer look, we saw how useful the activity-expression plot is. First of all, it shows basically zero correlation between the expression and the activity of this transcription factor. As I said, that doesn't mean the factor doesn't work; it just says there is no direct dependency between expression and activity. But if we now look at the expression range of this transcription factor, we see extremely low values, which means that this transcription factor is effectively not expressed along the whole time course. So what happens here? We believe that most probably the targets of this motif were selected because they have a binding site for NR2E1, but in fact they are regulated by a different transcription factor, one for which we have no information in our annotation and whose binding sites co-occur with those of NR2E1. Unfortunately, we have no idea what it could be. Another possibility, of course, is that this motif was wrongly assigned to this transcription factor, and in fact a different transcription factor binds to the binding sites of this motif and drives the expression changes of all the targets. I wanted to show this as an example of when you should be careful when checking your results: pay attention to the expression levels, which can be quite informative. Now also a little example with E2F. For E2F2, as I said, there are two different motifs with slightly different binding-site predictions, and in the activity-expression correlation table we actually have the expression correlation both for E2F2 and for E2F5.
Here, the high correlation for E2F2, and the lower correlation for E2F5, might indicate that the targets are in fact regulated by E2F2 rather than E2F5. Moreover, you can see that E2F5 shows much lower variation of expression across the samples, while the expression of E2F2 changes quite a lot, from minus two to eight. So in this case I think we can assume with high confidence that E2F2 is the factor that is active and expressed in this experimental system. All right, now I would like to move to another data set; let's go back to the presentation. For the next steps I would like to introduce the Illumina Body Map 2.0 project, which is also available from the example results page. This data set consists of RNA-seq samples from 16 different human tissues, with two replicates per tissue. Here you can see the names of these tissues; they are quite different organs: brain, liver, muscle, adipose tissue, and so on. Now I move back to the browser, go to example results, and open the Illumina Body Map data set. What I wanted to show here is that ISMARA can be applied not only to time courses. Sometimes people see the presentation with the mouse liver data set and start to think that ISMARA is only applicable to time courses. No, it is applicable to basically any set of samples that shows expression changes. In this case we apply ISMARA to different tissues, and here are the activity profiles for different transcription factors. Let's have a look again at the HNF4 factor: you see that its activity is pretty low across almost all tissues except the kidney, and, not surprisingly, it is highest in the liver.
Basically, if you upload a data set that contains different tissues or different conditions, ISMARA will try to infer the motifs of the transcription factors that best explain the expression differences between these samples. So it is applicable not only to time courses but to basically any experimental system. If you look now, for example, at MEF2D and MEF2A, MEF2 genes known to be involved in regulation of muscle tissue, and we open this picture, you see very well that the highest activity corresponds to the heart, to skeletal muscle, and, surprisingly, to the thyroid gland. I don't know why; that is something interesting which could potentially be investigated. Here I would also like to show the sample pages, because I think they are most informative for this data set. Let's take the liver example again: I go to the sample table, click on the sample name corresponding to the liver, and, once the picture has loaded, what I see at the top are the sorted z-values for this sample: the top 10 motifs with the highest positive z-values and the top 10 motifs with the most negative z-values. From this you immediately see that the highest z-values, meaning the highest activity in this sample, correspond to liver-specific factors. If we take some other tissue that is well characterized, like brain or muscle, we will likewise see at the top the transcription factors most significant for that sample. The same values are presented in the table underneath: it contains the motif name, the corresponding z-value, and then, as we've seen before, the links to the gene information and the corresponding logo. From here you can see the motifs with the most negative or most positive z-values.
By the way, the z-value shown here is not the z-value shown in the table on the main page. On the main page we see the average squared z-value over all samples, while here it is the individual z-value for the corresponding sample. The z-value is the activity divided by its error bar, that is, by its standard deviation. Now let's move to the mean activities page. As Eric already mentioned, mean activities are fitted differently: they are not simply the activities summed and divided by the number of samples, but are fitted in a special way. High mean activities usually represent motifs whose targets are consistently highly, or consistently lowly, expressed across the samples. First of all, the page shows the top negative and top positive mean activities, sorted by absolute value. In this case that is not very informative, because at the top you see activities of some microRNAs which are high but also have very large standard deviations. Much more informative, I think, is the next plot, which shows the mean activities sorted by z-value, that is, by significance. From here we can immediately see that for our data set the most consistently highly expressed targets belong to the NRF1 motif. NRF1 is known to regulate genes responsible for energy metabolism, mitochondrial genes and so on, and these genes are known to be highly expressed in all conditions. At the same time, if we look at the bottom, we see REST, and REST is known to be a repressor; indeed, the REST targets are consistently lowly expressed across all samples in the data set. You can then access the same numbers for all motifs in the table, which again is sortable and searchable: you can sort the mean activities by absolute value or by z-value, and access additional information from there. All right.
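The z-value arithmetic just described can be sketched in a few lines. This is a minimal illustration with made-up numbers, not ISMARA's actual code: the per-sample z-value is the activity divided by its error bar, and the main-page significance is built from the squared z-values averaged over samples.

```python
import numpy as np

# Hypothetical motif activities and their standard errors across four samples.
activity = np.array([0.8, -0.2, 1.5, 0.1])
std_err = np.array([0.2, 0.1, 0.3, 0.2])

# Per-sample z-value, as shown on the individual sample pages.
z = activity / std_err

# The main-page score combines all samples: the average of the squared
# z-values (shown here as its square root, so it is on the z scale).
z_overall = np.sqrt(np.mean(z ** 2))
```

A motif can thus look unremarkable in one sample yet score highly overall, because the averaged squared z-values pick up consistent activity across the whole data set.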
Next I would like to return to the mouse liver data set and quickly show the gene search function, as well as the page that contains all promoters sorted by the fraction of explained variance. Let's go first to that page. This is a really large table: it contains all promoters that are expressed in this data set, about 27,000 entries, and by default they are sorted by the fraction of explained variance. In addition, for every promoter we also have its mean expression over all samples in the data set, the variance of its expression over all samples, and a short annotation: the promoter name and a description of the gene that corresponds to this promoter. Now, if we click either on the promoter link or on the gene link, it takes us to a special page that lets us see how much the different motifs contribute to explaining the observed expression of a given promoter. At the top we see a plot of the expression of the promoter for the gene of interest. In this case we have the gene H2afx; it has only one promoter, and here is its expression, with the highest expression point here and the lowest expression point here. If a gene has multiple promoters, we will see multiple plots here, so you can see how the expression of the different promoters of a gene changes. Then we have a panel for every promoter, which first of all contains a plot of the expression of this promoter, with the predicted expression in orange. By default all predictions are turned off, so this line just corresponds to the mean expression of the promoter. As I said, on the right we have a table that contains all motifs with a positive chi-square score, which means that these motifs do explain some part of the expression of this promoter.
Here we have the chi-square score; the higher the chi-square score, the more significant the contribution of the transcription factor to explaining the observed expression differences. Then for every motif we also have the corresponding site count, which is basically the sum of the posterior probabilities of all transcription factor binding sites for a given motif in this promoter. So you see E2F2: this motif has roughly two high-confidence sites in this promoter. And here is also the corresponding z-value, the overall significance of a given motif. Now, every motif has a checkbox, and if we click it, we see the predicted expression based on only this motif. So I can select this one and see only the contribution from this motif, and I can also combine motifs to see how the predicted expression changes as I add different motifs. In addition, I can turn all the motifs on or off at once. This way you can observe how much every motif contributes to explaining the observed expression of a given promoter. Now, the gene search provides a quick way to access these pages for a certain gene of interest. Whereas before you could select genes on the basis of the fraction of explained variance, the mean expression, or the variation of expression across the samples, here you can simply search for a gene by name. I had an example; wait a second, I'll find it. Yes, here, I think this is a good example: I type "Cyp2". You see, you can also select something from the list that drops down, so it supports autocompletion. But I would like to search for the gene named Cyp2c29, so here it is. It takes me again to the page we have just seen, but now with the information about the expression of the promoter of this gene. Again, this gene has only a single promoter, and here is its expression.
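The checkbox behavior just described, where the predicted expression is rebuilt from whichever motifs are selected, can be sketched as a sum of per-motif contributions on top of the promoter's mean expression. All names and numbers below are made up for illustration; ISMARA's real model and fitted values are more elaborate.

```python
import numpy as np

# One promoter, three samples. Site counts (N) and motif activities (A)
# are hypothetical numbers, not real ISMARA output.
site_counts = {"E2F2": 2.0, "HNF4A": 0.5}            # sites per motif in this promoter
activities = {"E2F2": np.array([0.3, -0.1, 0.4]),    # activity per sample
              "HNF4A": np.array([1.0, 0.8, -0.2])}
mean_expr = 5.0                                       # mean (log) expression of the promoter

def predict(selected):
    """Predicted expression: the mean plus, for each selected motif,
    its site count times its activity profile."""
    pred = np.full(3, mean_expr)
    for motif in selected:
        pred = pred + site_counts[motif] * activities[motif]
    return pred

baseline = predict([])                 # all checkboxes off: flat line at the mean
combined = predict(["E2F2", "HNF4A"])  # both motifs on
```

Ticking a single checkbox corresponds to calling `predict(["E2F2"])`; adding motifs one by one shows how each improves the fit to the observed profile.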
Now we can see which transcription factors contribute most to explaining the expression of this promoter. In this case you can see just from the chi-square score that HNF4 alpha contributes most of all, and this is not surprising, because this gene is known to be regulated by it. Then there is a bunch of other motifs that also contribute to explaining the expression of this promoter. Okay, a few words about ISMARA downloads before I move on to the averaging and everything else. As I already said, we provide downloads of the activity table, the data table, the regulatory interactions and so on. Most of these tables are given in a simple text format, as tab-separated values, and look something like this. They can either be read directly by different analysis software, or easily parsed by the scripts you prepare for data analysis. Here is an example of the regulatory interactions file, which basically represents the data given in the top-targets table: we have the promoter name, we have the chi-square score, we have the name of the motif that regulates this promoter, and then some annotation of the promoter, such as the transcript name, the gene name and so on. You can also download the whole report archive, which contains almost everything you have seen so far: the main page, all tables, all pictures, all motif pages and so on. What is missing from the downloaded report are the pages for individual genes. These are not stored anywhere; they are generated on the fly, so this functionality is available only on the server. You also don't get the page with the promoters sorted by the fraction of explained variance, because that is also generated on the server. And if you download the report for offline browsing, you don't have interactive features like the averaging functionality, which works only on the server.
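Since the downloads are plain tab-separated text, they are easy to parse with a few lines of standard-library code. The snippet below reads a tiny mock of the regulatory interactions file with the column order just described (promoter, chi-square score, motif, annotation); the actual file names and annotation fields on the server may differ.

```python
import csv
import io

# A small mock of the tab-separated regulatory interactions download.
# Identifiers here are invented; real files follow the same column layout.
mock = ("chr1_1234\t12.5\tHNF4A.p2\tNM_000001|Alb\n"
        "chr2_5678\t3.1\tE2F2.p2\tNM_000002|H2afx\n")

interactions = []
for row in csv.reader(io.StringIO(mock), delimiter="\t"):
    promoter, chi2, motif, annotation = row
    interactions.append({"promoter": promoter,
                         "score": float(chi2),   # chi-square score
                         "motif": motif,
                         "annotation": annotation})
```

For a real download you would replace `io.StringIO(mock)` with `open(path)`; the same four-column unpacking applies per line.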
Okay, now I would like to talk about averaging. This scheme roughly represents what happens: we select some groups of samples, and inside each group we average the activities, so the group basically collapses into a single point, and we obtain a new profile of averaged activities. This was applied to the Illumina Body Map, including the liver samples. Here is a comparison: in this case we have two replicates for every sample, and after averaging we have one. For the Illumina Body Map this is probably not so crucial, because there is very good reproducibility between the replicates, but in data sets with a certain amount of noise between replicates, averaging the replicates within each sample can be quite crucial in order to obtain clear results. So I would like to quickly show how you select things when you want to do averaging, and for that I'm going to use another data set of ours. This is a microarray experiment about, if I remember correctly, the epithelial-mesenchymal transition of human cells. Here is the data set; let's wait until it loads. Okay, and now, in order to average the replicates, and we have two replicates per sample, I simply need to click "perform sample averaging". Once I click it, a new menu appears in the sample table. You can specify an email address here; for your own data, your email address will be shown by default, since it just takes the original email address you submitted with, but you can specify a new one if you like. You can also specify a project name; by default this is the old project name with an "averaged" prefix, and it is editable. You can also submit anonymously if you like. Now we need to specify the groups, and for that we have dropdown menus here. So what I'm going to do: in this case I have these two samples.
So this is condition one, with replicate one and replicate two. And now I can also select, ah, sorry, this is not good: this will be condition two, and these will be condition three and condition four. Now I can rename the conditions as I like, for example "sample one", and you see it changes automatically in all corresponding fields, and "sample two". If I now run the averaging, how many groups will I have? I will have four groups, and it will calculate four points for me, corresponding to the averages over the replicates. I could also do it differently and specify that all of these cells should be averaged into one point. You see, I can do it like this: in this example I have two replicates for this point, and all the others underneath will be merged into one group. This way you can build very different contrasts within a data set, in order to obtain the contrast you would like to see between different conditions. And finally, once I'm finished, I simply click "submit data for averaging" and I'm taken to the same status page as when you submit ISMARA data. Now I would also like to say a few words about the batch effect correction, which is also available here in the advanced options. First, a couple of slides about this. What is the batch effect correction? Here is an example: let's say we have three samples again, and in sample two you see that the difference between the replicates is actually higher than the difference between the samples if we average them. If we plot it like this, the data looks noisy and not reproducible. But this experiment was done in two different batches, and this is kind of a real example: if we plot the data differently, with every batch in a different color,
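The grouping just shown, collapsing replicates into one averaged point per condition, amounts to a simple per-group mean over the activities. The sketch below uses invented sample names and values to illustrate the bookkeeping, not ISMARA's internal code.

```python
import numpy as np

# Hypothetical per-sample motif activities for two conditions,
# each with two replicates.
activities = {"cond1_rep1": 0.9, "cond1_rep2": 1.1,
              "cond2_rep1": -0.4, "cond2_rep2": -0.6}

# The groups defined via the dropdown menus: group name -> member samples.
groups = {"sample1": ["cond1_rep1", "cond1_rep2"],
          "sample2": ["cond2_rep1", "cond2_rep2"]}

# Each group collapses into a single averaged activity point.
averaged = {name: float(np.mean([activities[s] for s in members]))
            for name, members in groups.items()}
```

Merging everything "underneath" into one group, as shown in the interface, just means listing more samples under a single group name; the mean is then taken over all of them.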
you see that the profiles of expression changes are exactly the same; they are only shifted and rescaled relative to each other. In order to remove such effects from the data, we apply a rather simple method: standardization of the data. What this means is that we recalculate the activities by subtracting the mean inside the batch and dividing by the standard deviation inside the batch. We basically take our data as shown here, apply this procedure, and get the shifted and rescaled profiles, which are now quite easy to compare. And if we then average over the replicates, we get the result we expect. Before I go and show how you select the different options for this batch effect correction, I would like to say that we usually do not recommend applying batch effect correction without a clear understanding of what you want to do, because batch effect correction does change the data, and if you start applying it to everything, especially to a system that does not have a batch effect, you risk getting results that could be wrongly interpreted. Batch effect correction also has a limitation: it requires that you have the same number of points in every batch. So, as I said, in order to run the batch correction you need to click these advanced options. And now, okay, let's change back: this was condition two, that is condition three and condition four, and now we have two, three, four. Now I just say that the first replicates were produced in one experiment and the second in another, so I have two batches, and I suspect there is some effect to be corrected. First of all, I select the groups for averaging as before, and now I need to specify the batches. In this case the first batch will contain all the first replicates, like this.
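The standardization step described above, subtracting the batch mean and dividing by the batch standard deviation, is a one-liner per batch. Here is a minimal sketch with two synthetic batches that share the same profile but are shifted and rescaled relative to each other, exactly the situation the correction is meant to fix.

```python
import numpy as np

# Two hypothetical batches measuring the same underlying profile;
# batch2 is shifted and rescaled relative to batch1.
batch1 = np.array([1.0, 2.0, 3.0, 4.0])
batch2 = np.array([10.0, 12.0, 14.0, 16.0])

def standardize(x):
    # Subtract the mean inside the batch, divide by the standard
    # deviation inside the batch.
    return (x - x.mean()) / x.std()

z1 = standardize(batch1)
z2 = standardize(batch2)
# After standardization the two profiles coincide and can be averaged.
```

Note how this also makes the caveats concrete: the procedure genuinely changes the data (means and scales are discarded), and averaging the standardized batches point by point is only well defined when every batch has the same number of points.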
And now I select the batches for the second set of replicates. So you see, here is the corresponding selection of groups to average, and here the corresponding selection of batches to normalize. Basically I now just need to click submit, and it will take me to the status page and run everything. All right. Before I move on, I would also like to give an example of averaging between different conditions. If you go to the examples page, we have example results for the GNF Atlas, which includes 79 tissues and cell lines, plus cancer cell lines. You can do a lot with this data set, but what we tried to do was simply apply the averaging to see whether there are any transcription factors that clearly separate cancer cells from non-cancer cells. If you look at the data set itself, you can see that it is quite noisy, but applying the averaging provides quite nice results, which are summarized on this slide. Here are the top motifs with rather significant z-values after averaging. The left point in each profile corresponds to the non-cancer cells and the right point to the cancer cells. You see, for example, that IBX has high activity in cancer cells, and these transcription factors are indeed known to have high activity in cancer cells. We have a few more examples like this, MIX or ZNF143, but we also have examples of motifs with low activity in cancer cells, which is also known to be true, for example Sivronip or Aerofix, whose activity drops in cancer cells. So this is an example of how you can apply the averaging not only to replicates, but also to different conditions. At the end, I would like to mention alternative ways to upload data to the server. Sometimes users experience problems uploading data over the web interface.
Most of the time, my impression is that this is caused by technical problems in between: not really on the user's side, not really on our side, but in the various filters, firewalls and proxies that sit between us and the user's computer. In order to provide a more robust way to upload the data, we have developed a so-called upload client for ISMARA. This upload client is simply a Python script that you can run on any machine; it is basically cross-platform, but the idea behind the script is that you run it directly on the machine that stores your data. I'm not sure how it works in other places, but in our case, for example, we have a big Linux cluster with a direct connection to the storage system, so the data is directly available there. On this Linux cluster I can simply run the uploader from the terminal in the background, and it will upload my data to the ISMARA server. Here is a summary of how it works. It is a simple program with a few parameters to specify, and it provides exactly the same functionality as the web interface: you can specify the email, project, data type and genome, and select whether you want to run with miRNAs or not. The only difference is that you need to provide a file list, which is just a text file containing the paths of the files to upload, one path per line. You call the script, provide the file list, hit enter, and it runs. The only thing to watch out for is to make sure you run it properly in the terminal, so that the script is not terminated once your session ends or breaks; for that you can use the nohup command, or the screen or tmux utilities. Okay, here is an example of a file containing the paths. Finally, I also want to mention the ISMARA client. The ISMARA client is a standalone application developed to run on Linux and Mac machines, and it is available from the, I think, "how to download data" section of the ISMARA web server.
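The file list the uploader expects is just a text file with one path per line, so it is easy to generate programmatically. The sketch below builds such a list for a made-up directory of FASTQ files; the directory layout and file names are invented, and only the one-path-per-line format comes from the description above.

```python
import tempfile
from pathlib import Path

# A throwaway directory standing in for the storage location of your data.
data_dir = Path(tempfile.mkdtemp())
for name in ["liver_rep1.fastq.gz", "liver_rep2.fastq.gz"]:
    (data_dir / name).touch()

# Collect absolute paths and write them out, one per line, as the
# uploader's file list requires.
paths = sorted(p.resolve() for p in data_dir.glob("*.fastq.gz"))
file_list = data_dir / "file_list.txt"
file_list.write_text("\n".join(str(p) for p in paths) + "\n")
```

Using absolute paths keeps the list valid no matter which directory the uploader script is launched from, which matters when it runs detached under nohup, screen or tmux.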
What is special about this program is that it runs some pre-processing steps on your local machine: it takes your data, runs the pre-processing, obtains a much smaller summary table, and sends this summary table to the ISMARA server instead of all your big FASTQ or BAM files. The client supports almost everything the web interface does, I think with the only limitation that it does not support FASTQ processing; it can process BED, BAM and CEL files. It provides some history information about which jobs you have run, and also access to the links to your results. All right, so here is some information about these things, and okay, this is not relevant. In the last minute I would also like to mention expert mode. Expert mode basically allows you to run ISMARA on anything you like. You need to supply a table that contains expression, or some other signal, for example ChIP-seq signal, for a set of genomic features; originally ISMARA uses promoters. Then you also need to supply a site-count table, which summarizes how many transcription factor binding sites you have for every genomic feature. Once you supply both of these tables, you can run ISMARA on anything. So for people who want to run an unsupported organism, I suggest you contact me, and I can give you some directions on how to test your data in expert mode. We have seen some successful examples; some people ran ISMARA analysis on the salmon genome, for instance. Of course, you need to invest some time in this, but it is doable. Okay, I think I will finish now to be on time, and now you are free to ask questions.
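The two tables that expert mode requires fit together in a simple linear picture: the signal table (features by samples) is modeled as the site-count table (features by motifs) times a matrix of motif activities (motifs by samples). The sketch below fits such activities with a plain ridge-regularized least-squares solve on synthetic data; ISMARA's actual inference is more elaborate, and every number here is made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic site-count table N: 200 genomic features x 5 motifs.
N = rng.poisson(1.0, size=(200, 5)).astype(float)

# Hidden "true" activities A: 5 motifs x 4 samples, used only to
# generate a toy signal table E with a little noise.
A_true = rng.normal(size=(5, 4))
E = N @ A_true + rng.normal(scale=0.1, size=(200, 4))

# Ridge-regularized least squares: solve (N'N + lam I) A = N'E.
lam = 1.0
A_hat = np.linalg.solve(N.T @ N + lam * np.eye(5), N.T @ E)
```

With real data, `E` would be your expression or ChIP-seq signal table and `N` your site-count table; this toy fit just shows why both tables are all ISMARA needs to infer activities for an otherwise unsupported organism.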