Welcome to this short introduction to the core concepts of transcriptomics from a computational perspective. To make sure we're all on the same page, let's start with the basic biology. Your genome is made of DNA, and it contains genes which are transcribed. Transcription produces RNAs, some of which are called messenger RNAs, and these are later used to produce proteins. Since your genome contains many genes, there are many different RNAs, and these RNAs are expressed at varying levels in different tissues under different conditions. The systematic measurement of all RNAs, or at least many RNAs, in a sample is what is known as transcriptomics.

In this presentation I'll go over the main technologies, specifically DNA microarrays and the more modern RNA sequencing methods. I'll talk about some important sample preprocessing steps, specifically to deal with ribosomal RNAs and other non-coding RNAs. I'll talk about data normalization and about the statistical testing that is needed to find the differentially expressed genes or transcripts.

Let's start with DNA microarrays. The first style of arrays to come around were the cDNA arrays. These are also known as two-channel arrays, because the way they are used is that you do competitive hybridization of two labeled samples to the same array. The readout looks like this: you have a glass slide, each spot corresponds to a different cDNA, and the amounts of red and green tell you the amount of mRNA in each of the two samples for that gene. The readout is therefore effectively fold-change values. The major problem with this style of array is cross-hybridization: since you're using full-length cDNAs, mRNAs of paralogous genes can easily cross-hybridize to the wrong spots. The Affymetrix GeneChip, by contrast, contains short DNA probes that have been designed not to cross-hybridize. On top of that, you have multiple probes per gene, giving you multiple measurements.
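The two-channel readout described above boils down to a log-ratio of the red and green intensities for each spot; a minimal Python sketch (the intensity values are invented for illustration):

```python
import math

def log2_fold_change(red: float, green: float) -> float:
    """Log2 ratio of the two channel intensities for one spot.

    Positive values mean the red-labeled sample has more of that
    gene's mRNA; negative values mean the green-labeled sample does.
    """
    return math.log2(red / green)

# A spot twice as bright in red as in green gives a log-ratio of 1.
print(log2_fold_change(2000.0, 1000.0))  # -> 1.0
```

Working in log2 space makes up- and down-regulation symmetric around zero, which is why two-channel data are almost always reported as log-ratios.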
These are known as one-channel arrays because you hybridize only one labeled sample to a chip, and the readout is the probe intensities, based on which you can get an overall estimate of the level of each transcript.

A major challenge in analyzing microarray data is that the values can be hard to compare. It's very hard to compare between transcripts and reliably say if one transcript is more highly expressed than another, because of different hybridization efficiencies. But it can even be difficult to compare the same transcript between samples. To solve the latter problem, you need to use nonlinear normalization first. On the left you see an example of some two-channel array data, and you can clearly see that the samples labeled green are shifted systematically toward higher intensities than those labeled red. Nonlinear normalization takes the distributions and superimposes them on top of each other, as you see on the right, giving you more comparable values, which is a better starting point for identifying differentially expressed genes.

The other technology I want to talk about is RNA sequencing, which by now has largely replaced microarrays for doing transcriptomics. The idea is simple. You first do short-read sequencing, map the reads to a reference genome using tools like HISAT2 or STAR, then assemble novel transcripts using tools like StringTie, and estimate expression levels using tools like HTSeq. Whichever pipeline you choose to use, and there are many tools, you end up with a read count matrix, where each row is a transcript, each column is a sample, and the number tells you how many reads you saw for that transcript in that sample. However, there are three factors that affect how many reads you see for a transcript in a sample. One is what we want, the transcript level, but there are also two unwanted factors. The first is transcript length: if a transcript is twice as long, you should expect to see twice as many reads at the same expression level.
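The nonlinear normalization mentioned above is commonly implemented as quantile normalization, which forces every sample to share the same intensity distribution; the talk doesn't name a specific method, so this is one illustrative choice. A minimal sketch, assuming a small genes-by-samples matrix of raw intensities (real pipelines would use established packages such as limma):

```python
def quantile_normalize(matrix):
    """Quantile-normalize a genes-by-samples intensity matrix.

    matrix: list of rows (genes); each row holds one value per sample.
    Every column ends up with the same distribution of values.
    Ties are broken by sort order in this simplified sketch.
    """
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    # Sort each sample's intensities, then average across samples at each rank.
    sorted_cols = [sorted(row[j] for row in matrix) for j in range(n_samples)]
    rank_means = [sum(col[i] for col in sorted_cols) / n_samples
                  for i in range(n_genes)]
    # Replace each value by the mean intensity of its rank within its sample.
    result = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        order = sorted(range(n_genes), key=lambda i: matrix[i][j])
        for rank, i in enumerate(order):
            result[i][j] = rank_means[rank]
    return result

normalized = quantile_normalize([[5.0, 4.0], [2.0, 1.0], [3.0, 4.0]])
```

After this step, every sample's intensity distribution is identical, which is exactly the "superimposed distributions" picture described above.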
The other is sequencing depth: if you produce twice as many reads for a sample, you would expect to see twice as many reads for every transcript. For this reason, you normally work with normalized metrics when dealing with RNA-seq data. One such metric is RPKM: the reads that you saw for the transcript, per kilobase of transcript, thereby normalizing for the length of the transcript, per million mapped reads, thereby normalizing for the amount of sequencing you did.

One thing that we as bioinformaticians tend to forget to think about is sample preprocessing, that is, what was done to a sample before measuring it. Specifically, I want to talk about the problem that most RNA in cells is ribosomal, and when doing transcriptomics, we therefore somehow need to either select the wanted RNAs to measure or deplete the unwanted RNAs. This is trivial for microarrays, for the simple reason that each spot on the array selects for a certain transcript. However, when doing RNA sequencing, if you were to just sequence a sample, you would be wasting approximately 80% of all reads on sequencing ribosomal RNA over and over and over again. The most popular approach to solve this is poly(A) selection. It relies on the fact that messenger RNAs are polyadenylated and can therefore be pulled down with a poly(T) probe. This eliminates the ribosomal RNAs. However, you also lose most other non-coding RNAs in the process, since they too are not polyadenylated. The alternative, and the method of choice if you're interested in non-coding RNAs, is to do ribosomal RNA depletion. This eliminates the ribosomal RNAs but allows you to keep the other non-coding RNAs. It's therefore known as total RNA-seq, since you're not only measuring the mRNAs. However, this approach gives you worse coverage, and the only way to deal with that is to do more sequencing on each sample; in other words, throw more money at the problem.

This gets me to the last topic, statistical testing.
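The RPKM definition above translates directly into a one-line formula; a minimal Python sketch (the numbers in the usage example are invented):

```python
def rpkm(read_count: int, transcript_length_bp: int,
         total_mapped_reads: int) -> float:
    """Reads Per Kilobase of transcript per Million mapped reads.

    Divides the raw read count by the transcript length (in kilobases,
    correcting for longer transcripts attracting more reads) and by the
    library size (in millions of mapped reads, correcting for depth).
    """
    length_kb = transcript_length_bp / 1000.0
    millions_mapped = total_mapped_reads / 1_000_000.0
    return read_count / length_kb / millions_mapped

# 500 reads on a 2 kb transcript in a library of 10 million mapped reads:
print(rpkm(500, 2000, 10_000_000))  # -> 25.0
```

Note how doubling the transcript length halves the RPKM while the raw count stays the same, which is exactly the length correction described above.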
How do we identify what is differentially expressed? If you work with microarray data, you will typically be working with either log ratios from two-channel arrays or log intensities from one-channel arrays. The reason for log-transforming is that it gives you numbers that approximately follow a normal distribution, allowing you to subsequently use t-tests and ANOVA to identify what is differentially expressed. In the case of RNA-seq data, it would therefore be very tempting to calculate something like RPKM, log-transform it, and use the same tests. But that is a very bad idea. If you think about what actually comes directly out of RNA-seq, it's counts, and the way to analyze them is therefore to use the appropriate statistics for dealing with counts. In this case, that's the negative binomial distribution, and the easiest way to deal with this is to use dedicated tools like DESeq2, which directly take the read count matrix, apply the appropriate tests, and give you the list of differentially expressed genes.

That's all I have to say about transcriptomics this time. If you want to learn more about what you can do once you've arrived at a list of differentially expressed genes, I suggest you take a look at this presentation next. Thanks for your attention.
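As a closing illustration, the microarray-style testing described above (log-transform, then t-test) can be sketched as follows. The intensity values are invented for illustration, and the t statistic here is Welch's variant, with the p-value left to a t-distribution lookup; remember that for RNA-seq counts you would instead hand the raw count matrix to a tool like DESeq2:

```python
# Sketch: log2-transform intensities for one gene, then compute a
# two-sample Welch t statistic. Purely illustrative data; real array
# analyses use dedicated packages, and counts need DESeq2 instead.
import math
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic (p-values come from the t distribution)."""
    var_a, var_b = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / math.sqrt(var_a / len(a) + var_b / len(b))

control = [1050.0, 980.0, 1120.0]    # hypothetical raw intensities, one gene
treated = [4100.0, 3900.0, 4400.0]

log_control = [math.log2(x) for x in control]
log_treated = [math.log2(x) for x in treated]

t = welch_t(log_treated, log_control)
log2_fc = mean(log_treated) - mean(log_control)  # ~2, i.e. about 4-fold up
```

The log transform is what makes the difference of means interpretable as a log2 fold change and the t-test's normality assumption roughly hold.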