Okay, first and foremost, everybody welcome. You are all in for such a huge treat, not specifically from me, because my topic of conversation is probably the least sexy of all the different modules that are available here for CBW. However, I'm hoping to make this transition a little bit easier for people who are relatively new to bioinformatics. But more importantly, I'm hopefully going to make things a little bit easier because you're going to be drinking from the fire hose throughout this week. And if anyone's ever done that as a child, because I know I have on hot summer days, it's not the most fun thing to do, but boy is it entertaining to watch on YouTube. Okay, so let's get a couple of particulars out of the way. Creative Commons: the slides that I have, available both in PDF format and in Keynote, are on the website. You guys all hopefully have the link; if not, I will link you up to them. Use it, take it, abuse it, do whatever. Please don't put it up on TikTok as a meme and make fun of me; that's just hurtful. I'm more than happy for people to take some of the stuff that we've created, share it among other people, make it your own, have some fun with it, and go out there and start teaching as well. Okay, so the topic at hand here is data formats and databases. Again, the least sexy of all the modules. Most people want to start working with somatic mutations, gene expression profiles, and things like that almost immediately, but hopefully for the next hour and a bit I'll be able to hold your hand and get you familiar with some of the files that you're going to be working with, because there's going to be a lot of them. Okay, so I've been doing this for about 13 years now. I think on my first day I actually talked to Francis, and boy was that fun.
Along the way I've joined several other labs, and it never fails: on the first day, someone just drops a whole crapload of hard drives right into my lap and says, do something with this. This was my first day when I was over at the Hospital for Sick Children back in 2014. And when I went over to Princess Margaret Hospital, they just pointed me to the cluster and a folder and said, here, work with this. So a lot of the time you're just going to get a whole lot of data, and it's not necessarily going to be organized, or at least organized in a way that you're familiar with or comfortable with. You're going to flip through the files and see all these different things, where you'll have character-based files all over the place and binary files that need to get converted over to character-based files. It's just going to be somewhat of a mess at times, because no one ever organizes things in the exact same manner from institute to institute, or even from person to person sometimes. So what are we going to do today? Today we're actually going to focus in on a few data formats, specifically ones that I deal with pretty much on an everyday basis. By the way, everyone, if you guys have questions, I have my Slack channel open. If you don't feel comfortable saying it out loud, put it up on Slack (I'm watching that), put it in the chat on Zoom, or just turn your mic on and ask a question; no problem whatsoever with being interrupted. So that's the data formats we're going to deal with. We're also going to touch a little bit near the end on some cancer databases, because there's so much information and data available right now for you guys. It's absolutely crazy to me, knowing that in the short span of about 13 years, how much this whole community has gotten together and said, I'm going to make this data available. You might have to dig for it a little bit.
But you guys have it available so that you can either replicate some of the work I do, or you can actually extend your own work by looking at more different cancer subtypes, or even just expanding the population of samples that you deal with. So let's go right ahead and dig into our first file format, which is going to be the FASTA file. Oh, sorry, I didn't ask: can everyone hear me okay? Cool, I see some nodding heads, works for me. I always forget to ask that question. Okay, how many people here have seen a FASTA file? Raise your hands; I can see everybody. Okay, a good portion of you guys have seen it. It's a boring file format: you get a header, and you get some data, and that's really all it is. Worth noting, though, is the header that you get. It's going to start off with a greater-than symbol, then there's going to be some text, followed by some sequence: A's, T's, C's, G's, and N's. Now, case in point, that N is actually just a placeholder; you can actually substitute a whole bunch of IUPAC codes. For those of you who don't know what IUPAC codes are, or don't recall what some of them mean, there's a link down at the bottom of the slide there, so you can quickly check to interpret what an R is, or what a Y is, in terms of, is this a C or a T? You know, there's some ambiguity behind that. Now, on top of this, in a FASTA file, I'm going to be focusing much more on nucleotide bases as opposed to anything else, but keep in mind that you can actually keep amino acids in there. So there's protein information that can be retained in this as well. There's some flexibility in there, but just realize that at the base of it all, it's just a header and some data. Okay, so can anyone tell me what the sequence is for mitochondrial DNA in humans, base by base? I can't remember how big it is, but if anyone can actually recite that from memory, I'm going to be really impressed.
I can't do that. And luckily we all have the capability of storing these things as files. And here is a FASTA file that has actual mitochondrial DNA in all its glory. This particular one is downloaded from NCBI; I have links at the very end of my slide deck so that you guys can see that if you want to play. But again, a lot of you guys have FASTA files that you have seen in the past. Now, along with this, that "some text" I was telling you guys about: here they actually put information about what this is, some accession numbers so that you can find out where it comes from, and in general it just gives you a bit of narrative on the sequence that we're dealing with. This is great. I absolutely love this. Here you know exactly, for the sequence below, what it is and where it comes from. And if I need to reference it from some online database of some sort, well, it's right there for me to do as well: I can look it up using the accession numbers here. Again, we're talking about sequence data, so again, this could be substituted for protein. But right now we're dealing with the sequence of the mitochondria: A's, T's, C's, G's, and there are some N's in there somewhere, because they don't know exactly what that base is. Bless you. But more than anything else, you're just going to get this long series of sequence. Chromosome one freaked me out the first time I ever cracked it open. I don't know how many people actually open up any of the FASTA files that they look at, but this was downloaded from UCSC. So this is chromosome one, and you'll notice the text for the header slightly changes. Now, to be perfectly honest, having a greater-than symbol is good enough; you don't need to put anything else. But you will forget a week later what that sequence represents, so I highly recommend putting something in there. But that's just me. This particular one happens to be the human reference; I believe this is GRCh38.
So it's reference 38, and the thing that freaked me out the most is: why are there N's all in the beginning of this sequence? Anyone know? Feel free to just yell it out or put it on Zoom if you guys want. Oh, I saw some people have hands up. Anybody? No? Well, you learn, okay. You'll see this on a lot of the different chromosomes: there's a whole bunch of N's at the beginning and the end, and even in the middle. So really, the telomeres and centromeres are a whole bunch of N's. The first time you see this you're going to freak out, and hopefully you guys can freak out in front of me and in front of your peers here, and not when you're working with your PI side by side, asking what in the world is going on. These N's are just saying, we don't know what's going on here. But somewhere down the line you're actually going to get yourself some bases that look a little more familiar. Is this familiar to anybody? I know I've read some of the introductions that you guys have had on the Slack channel, and a lot of people were saying that they work in cancer, and some were talking about some of the different cancer types that they were looking at. But I don't know if this looks familiar to anyone. If I said that the reverse complement of that is TTAGGG, some people might think, okay, well, that sounds a little more familiar to me: telomere sites. So this is part of the repetitive telomeric sequence that goes on. And the only reason I'm bringing this up is that the FASTA file you're going to encounter is unidirectional; it's only in reference to one direction. It's not mixing and matching in between the different strands of DNA that you're working with. So this is just to keep that in mind, and not really to freak you out if you didn't necessarily know what your telomeric sites look like. So yes, keep in mind that they're unidirectional.
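Since strand direction trips people up, here's a tiny sketch of how you can reverse-complement a sequence with standard Unix tools (the repeat string is just the telomeric example from the slide):

```shell
# Reverse-complement the telomeric repeat: reverse the string,
# then swap each base for its complement (A<->T, C<->G).
repeat='CCCTAA'
rc=$(printf '%s' "$repeat" | rev | tr 'ACGT' 'TGCA')
echo "$rc"    # TTAGGG
```

So the CCCTAA repeats on the forward strand are exactly the TTAGGG telomere repeats read from the other strand.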
And again, towards the end here, what I was telling you: you have a bunch of N's at the back end of chromosome one. So this hopefully won't freak you out too much. I actually chopped this down; there are pages upon pages of rows of N's at the very beginning and very end, because keep in mind that chromosome one is about 250 megabases long, so it's a pretty lengthy one. Okay, so you saw the mitochondrial DNA, and you saw chromosome one in humans; you know there are 22 autosomes plus X and Y. Some people stick the mitochondrial in there, others don't; I usually do, because we do some studies with mitochondrial DNA. But we have all these files here scattered about, so there's one file for each chromosome represented. And again, these are all things you can just download; I can get that reference genome easily, no matter what. One thing, though: working with a whole bunch of files kind of sucks. So a lot of the time what we'll do is merge them, because we only want to deal with one full genome. So this is the next topic on the FASTA file itself: you don't actually have to save just one sequence in there. You can take one file and retain the entire human reference in there if you so choose. You can do the same, for whoever's working with xenograft data, with the mouse reference genome. If you're doing an analysis that actually requires you to use both human and mouse, you can merge both human and mouse into one genome. Keep in mind there are, you know, some similarities between those sequences in small batches, but keep in mind that that is a possibility. So you can merge as many of these individual FASTA files as you want. And so, I don't know, did all you guys do the Unix tutorial as well? Thank you very much. Yeah, I can see some heads nodding.
I'm going to be giving you guys some commands along with everything that I do here, so whenever you want to revert back, you can see how it is that I did certain things. These are all executed on the command line; it's not live, it's recorded (I did some really tricky Keynote stuff in order to showcase it). You can see here I can do an ls of a directory, and it just happens to have all the different FASTA files in there. Note that for FASTA files themselves, the standard usually is to use either .fa or .fasta as the extension. Oh, no, you do not need to log into AWS; the hands-on exercises will actually be held later. This is all just lecture notes to get you guys a little more familiar with some of the files that we're working with. Okay. So how do you merge all these files together? Well, cat is literally the command that concatenates files together, and these happen to all be text files, which is great: you can just merge them all and redirect them into a file with, hopefully, some useful name on it, in our case GRCh38.fa. I don't know if anyone's used the grep command, but once I have that large FASTA file, since I know that the header starts with a greater-than symbol, I can find out immediately how many sequences there are, and what the actual sequence names are in that file, strictly by doing a grep statement. So I'm going to grep (in this particular case I'm using single quotes) with that caret, which tells the grep statement that the line has to start with this, and "this" is going to be the greater-than symbol. So: search that entire GRCh38.fa file for lines that start with a greater-than symbol. And you can see here, in order, we go chromosome one, two, three, four, five, six, seven, all the way through 22, and then X and Y. So that's great. Now people tell me, okay, well, we have all these files; anyone doing more advanced Unix?
Okay, well, I've got a whole directory, I'm just going to do this: I'm going to cat chr*.fa and redirect it into a different file name. But here's the catch. You'll notice that when I grep it, I'm not guaranteed that this thing is going to be in the order that I expected. A lot of the time in the Unix-y world, it's going to sort things either numerically or alphanumerically. And in this case, because the files start with c-h-r, it treats them as alphanumeric, so it kind of looks at the alphabet itself. It starts with chr1, then goes to chr10, 11, 12, 13, and so on and so forth, all the way down. This just sends my obsessive-compulsiveness to the stratosphere; this just drives me nuts. So I go a little bit crazy on this one. This is actually going to impact you considerably in the next module, when you start doing alignment, and then also in the modules after that, when you start looking at variants. So keep in mind that the order in which you actually merge all your individual FASTA files together matters. But we'll get to that again in the subsequent modules, so let's just stop that right now. Let's go back and actually do this correctly, because I like doing things for reinforcement purposes. It's a little more of a pain, but concatenate your files on a more manual basis, and you will get the order that you're expecting. Okay. So, lucky for us, most FASTA files are just plain text files. There's nothing special about it; it's just plain text with some ASCII characters in there. We can store sequences, either DNA, RNA, or amino acids, in there. Most of the focus is going to be on DNA and RNA: DNA for this particular section here, and RNA when you guys are working with RNA after the fact. It consists of two parts: that header, which always starts with a greater-than symbol, and some sequence alpha characters.
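Since the merge order matters downstream, here's a tiny runnable sketch of that gotcha (the chrN.fa names and sequences are made-up stand-ins; any small FASTA files behave the same way):

```shell
# Toy demonstration of the FASTA merge-order gotcha.
export LC_ALL=C    # make glob sort order deterministic

mkdir -p fasta_demo
printf '>chr1\nNNNNACGT\n'  > fasta_demo/chr1.fa
printf '>chr2\nACGTACGT\n'  > fasta_demo/chr2.fa
printf '>chr10\nGGGGCCCC\n' > fasta_demo/chr10.fa

# Naive glob: chr10 sorts before chr2 alphanumerically.
cat fasta_demo/chr*.fa > fasta_demo/wrong_order.fa

# Explicit list: concatenate in the karyotypic order you actually want.
cat fasta_demo/chr1.fa fasta_demo/chr2.fa fasta_demo/chr10.fa \
    > fasta_demo/GRCh38_demo.fa

# Headers (lines starting with '>') reveal the sequence order in each file.
grep '^>' fasta_demo/wrong_order.fa   # >chr1  >chr10  >chr2
grep '^>' fasta_demo/GRCh38_demo.fa   # >chr1  >chr2   >chr10
```

The explicit list is more typing, but it's the only way to guarantee the order the aligners in the later modules will expect.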
Again, subsequent to the greater-than symbol, you can put whatever characters you want there. I highly recommend, if you guys are creating your own FASTA files, to give it a name of meaning, because you will forget what that sequence you were working with is. Multiple sequences can be stored in a single FASTA. And here's the kicker: pretty much all the reference genomes that you're going to be using are going to be in a FASTA format of some sort. I don't want to say it's only going to be FASTA, but it's going to be a large portion of it. So keep in mind that this FASTA file is kind of the first thing that you'll deal with when working with some of the analyses you're doing in the bioinformatics world. Okay, so, who's got some questions for me? "May I ask, are there carriage returns at the end of each line? Or does it just only print a certain number of nucleotides on the screen?" That is a fantastic question. And yes, there are carriage returns. There's a catch, however: there doesn't have to be. You can actually take that string and just keep going on forever and ever, and just having word wrap on your terminal will let you see it in a nice viewable manner, such as in the previous slides I showed you. However (again, there's no fixed standard), most people will actually use either 60 or 70 characters as the line width and then add a carriage return after that. But since it's not a standard, it is really dependent on the institute from which you're downloading that reference genome. genome.ucsc.edu has always, at least for the last three references that I've used, been 60 characters wide with a carriage return after that. Great question, though. Anybody else? Okay, cool. By the way, if I'm going too fast or too slow, just say, hey, moron, move it. Or slow down, moron. I don't mind; I've got no ego. Okay, let's go on to a file format called FASTQ.
The concept of it really boils down to smaller segments of sequences that you want to do queries on. You're going to get this particular length of sequence, and you're like, I don't know what this thing is, I don't know what to do with it. But it's really just storage, similar to the FASTA file format, in that it's just some sequence of some sort. Here is the catch, though. When you guys get sequences from someone, do you not wonder how confident we are that the sequence actually is what it should be? Like, you have an A, a T, a C, and a G. Is that T supposed to be represented? That T, is it correct? Is there some form of qualitative measure that we can get? Can we assign some form of value to find out what the quality of this actual base is? The answer is the FASTQ file. So I'm just going to throw you guys in the deep end and say, okay, well, I'm going to do a head of a FASTQ file itself. This is just a small output from what the file is. But let's focus in on one of these sequences. Now, keep in mind, with the FASTQ file, four lines represent one sequence of interest. You'll see here that if I look at these four lines (and let's just isolate them for now so we can have a deeper look at it), you're always going to be dealing with two parts, as we talked about even with the FASTA file: you're going to have a header and some data, followed up by a header and some data. You're probably wondering why the second header only has a plus; we'll get to that momentarily. Let's take a look first at this header here. The header for the sequence portion of a FASTQ file is always going to start with the at symbol. That is always going to be the first character in that first position, followed by some string of characters that should represent what's in this particular case.
This is from the Short Read Archive, where I downloaded it from, and that is the accession number for this particular, I think this is a cell line, this is NA12878 that I'm working with. And then you're going to get your sequence; no big surprises here. Then you're going to get this plus symbol. I rarely, rarely, rarely ever see this plus symbol accompanied by anything else adjacent to it. That is just me having looked at various things and not having seen it before; this isn't to say that you can't have something with it. So, for example, this was actually supposed to be the header for the quality, and the header itself should have been the same as the header up top. So I just matched it up here, saying SR15083.111.1. The history is that once upon a time, sequences were kept in separate files from quality values, and so they needed a way to match those things up; that's why they kept the separate headers. Nowadays, I don't think I have ever seen a single instrument segregate those into two different files. I also rarely see anyone actually use that field. So we're just going to go back, and this is what you're mostly going to encounter, where it's just a plus sign. That's why I was saying before that the four lines consist of one sequence of interest. And then we get over here to the actual quality of the data itself. Sorry, before I go on: does that make sense to people? Or did I just confuse things a whole lot more? Okay, getting some nods there, so it looks like we're good. Okay, so here is the accompanying quality. Now, I don't know about you, but looking at something saying I have a T, and the score that I have for that one is a C, that just looks a little bit messed up to me. So we're going to change these characters that you see and give them a numeric value. And the way we do that is with an ASCII table.
If anyone's ever looked at ASCII tables, anyone having done computer science or any programming courses, you may have encountered this one. I just grabbed this off the web; you guys can take a look. So let's take a look at that first pair. The sequence base that we had was a T, and its accompanying quality was a C. Let's go ahead and take that C. What we're going to do is look it up in the table and find out that it's a 67. Okay, straight lookup. So let's just write that down: 67. And then what we're going to do is subtract 33 from that. This is important, and this is the standard set by Sanger, where they said, okay, well, the Sanger FASTQ files are always going to be offset by 33. You may be wondering: why would you offset by 33? Well, if we actually take a look at what 33 is in the decimal values, you see, over in the column to the left, 33 is an exclamation mark, and 32 is a space. If you look at anything before that, those are all just non-printable characters. So we're trying to avoid all those and stick to the more alpha-character-based things that we're familiar with. Like, how do you identify a null character without describing it as NUL? That's a really difficult thing to do. Or a tab, without putting in all these additional spaces, or a backspace, or a delete key. These are just really difficult characters to understand; they're not even characters, actually, they're just key presses. So we keep it in the printable range, saying, okay, we'll subtract 33 from that one. This gives us a 34. Okay, so the score that we have for our sequenced T is the ASCII character C, which works out to a decimal value of 34. Okay, I get that, I'm willing to accept that. And we can actually write that out for every single one of these quality scores that we have; you can literally just write them all out, for every single one of them.
Let's go back and bring up our sequence. We have a T, we have a C here, and we literally do it for all of them. And so we know what these numerical values are for each of the bases in the sequence that we're dealing with. While I think that's absolutely fantastic, the question really becomes: okay, what does this actually mean? The only reason they store it this way is for simplicity: we can assign one character, which is a value, to one base. Because if, instead of a single character, we wrote 34 under there along that long string of characters, we'd have to pad on some spaces to separate everything nicely, which really sucks. Someone came up with a great little equation, though, that says the quality score Q is equal to negative 10 log base 10 of the probability that the base call is in error. So I'm hoping that probability is a very, very, very small number. If we isolate for that probability, rearranging some of our terms, P is equal to 10 to the negative Q over 10. Okay, so it's a relatively straightforward formula, relatively simple to do; perfectly fine with that. If we take that first character, where we have our nucleotide as a T with a quality character of C, we know that C means Q is equal to 34. Great, we've already done that previously. We plug that into the equation, and all of a sudden we get about 3.98 times 10 to the negative 4 as the probability that this thing is in error. How many people feel confident that the call of a T is correct? I see some people nodding. I have some faith that that is actually a correct call. I mean, the fact that there's a 0.000398 probability that it's an error, I have some faith in that one. Okay. I just taught you how to calculate it; now forget all that. That's just to get the numbers and the formulas out of the way.
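If you want to sanity-check those numbers yourself, here's a little command-line sketch of the whole chain for the quality character 'C' from the example (plain POSIX tools, nothing fancy):

```shell
# Phred math for one quality character (Sanger encoding):
# character -> ASCII code -> Q = code - 33 -> P = 10^(-Q/10).
qual_char='C'

# printf '%d' "'X" prints the ASCII code of character X (67 for 'C').
ascii=$(printf '%d' "'$qual_char")
q=$((ascii - 33))

# awk handles the floating-point error probability.
p=$(awk -v q="$q" 'BEGIN { printf "%.3g", 10^(-q/10) }')

echo "char=$qual_char ascii=$ascii Q=$q P=$p"
# char=C ascii=67 Q=34 P=0.000398
```

Swap in any character from your quality string to get its Q score and error probability the same way.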
This table, once you've got it kind of locked into your head, means you'll start looking at various Phred scores, and because you know what character is what, you'll read them almost like reading a language. You'll be able to do this kind of stuff without even thinking anymore. So when you look at a Phred quality score of 10, that's a 1 in 10 chance of error; sure, it's an okay quality score, but not that much faith. Go to the extreme and look at something at 60 (and I've never seen anything with a Phred quality score higher than 60): that's 1 in a million. So I don't know about you guys, but I am pretty darn confident that that is the correct base in that case. So what's really good to do is just remember general ranges and the scales these things are on, and for the most part, you should be okay. Okay, let's go back to looking at that FASTQ. So we store sequences, and we store quality scores associated with those sequences. Each record consists of four parts: the read header, the sequence, the quality score header, and, again, those quality scores. Every read header is always going to start with an at symbol. Does anyone actually know the proper name for it? Because I call it an at symbol just because that's what I've always called it with email; I don't actually know what the proper name is. I guess a Google search would fix that, but if anyone knows, let me know. The sequence itself in this particular case has A, C, T, G, and N, and the quality score header (again, I'm just going to reinforce this) has a plus sign, followed by some text which usually matches the read header. But again, I've rarely ever seen places these days use that, so it's almost always just going to be a plus sign on its own, because those four lines are always adjacent to one another. And lastly, the quality scores are all ASCII characters. You guys have that lookup table, so you can actually do those calculations if you so choose.
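That strict four-line structure also makes FASTQ easy to slice with standard tools. A toy sketch (the read names and sequences here are made up):

```shell
# A FASTQ record is always four lines: @header, sequence, '+', qualities.
# Make a tiny two-read file and slice it with awk.
printf '@read1\nACGTN\n+\nCCCC!\n@read2\nTTTGG\n+\nIIIII\n' > demo.fastq

# Number of reads = number of lines / 4.
reads=$(( $(wc -l < demo.fastq) / 4 ))
echo "$reads reads"                 # 2 reads

awk 'NR % 4 == 2' demo.fastq        # sequence lines: ACGTN, TTTGG
awk 'NR % 4 == 0' demo.fastq        # quality lines:  CCCC!, IIIII
```

Because line position defines meaning, `NR % 4` picks out the same field from every record; that's the trick most quick FASTQ one-liners are built on.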
But again, familiarity is going to take over a lot of this stuff, and you'll be able to look at these things and say, okay, well, that C is really a 34. And that 34 sits between a Q30 and a Q40, so you can interpolate on that one and say the error probability is anywhere between 1 in 1,000 and 1 in 10,000. Okay. So here's one little gotcha, I guess it is kind of a gotcha. One random FASTQ file that I had available to me is 4.9 gigs. 4.9 gigs! My laptop has 500 gigs of storage on it. Yeah, I'm not really going to put five gigs of that, which I could be using for either a movie or some music, toward one FASTQ file. So what you'll typically see here is... hey, someone put it in the chat. Thank you for letting us know about the at symbol. Thanks, Arana. One of the things you'll notice here is some of the files you deal with will end in .gz. GZ just means this thing has been compressed. We don't want you to store something that's five gigs when we can actually reduce it in size and still not lose any information. So you can see we went from 4.9 gigs to 1.5 gigs, about a three-and-a-bit-times compression ratio. Technically, I can store three compressed files in the space where I could have stored just one uncompressed one. That is great, because a lot of the tools you'll be using in the following days (actually, this afternoon, when we start looking at module three) use that .gz file directly. You don't have to uncompress anything; you don't really have to do anything with it. But what happens if you actually just want to take a quick peek at your file? Do you have to uncompress the whole thing back to the 4.9 gigs, take a peek using the cat statement or less statement, and then go back to compressing it? Well, no. I mean, if I just do a less on the FASTQ.gz file, there's a small problem, because it's a binary file. "Do you want to see it anyway?"
And you'll get just a whole bunch of what seem to be random characters. If anyone can interpret that, kudos to you; you're in the Matrix, because I have no idea what any of this is, staring at it this way. Luckily (a little tidbit for anyone who's going to be doing this regularly), there is a command called zless, which says, okay, I understand compressed files, and if you just type zless and the name of the file you want to look at, it will show you the actual character values that are stored in there. So now this should be looking a little more familiar, starting with the at symbol. My apologies for the fact that it line wraps, but I wanted to make it all large enough for you guys to see. You can see that four lines represent one single sequence, and this looks a lot more familiar to us. Again, one of those little tricks: I've seen so many of my former students actually uncompress the file, look at it, then recompress it. And uncompressing 1.5 gigs of a file to get back to 4.9 gigs actually takes some time, it really does, even on a current laptop. Okay, we have just finished FASTQ files, and I don't have a sense of where you guys are at. So I want to open this up to both sections, FASTA and FASTQ, if you guys have any questions for me. Okay. Either it's too early in the morning, you guys are just particularly smart, or I did a poor job. Oh, absolutely, what's your question? "I'm really wondering, it just came to my mind, why is it called FASTA and FASTQ? Where's the name from?" I remember looking this up one time, and I didn't actually get a good reasoning for the actual name. I thought it was an acronym for something, but I was never able to find out what that acronym was. I'm just doing a quick search again. The FASTQ portion, that's another one I have no idea about.
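Going back to the compressed-file peek for a second, the trick can be sketched end to end with a toy file (hypothetical read name; zless itself is interactive, so the streaming equivalent is shown runnable):

```shell
# Peek at a gzipped FASTQ without uncompressing it on disk.
printf '@read1\nACGT\n+\nIIII\n' > peek.fastq
gzip -f peek.fastq                  # leaves only peek.fastq.gz

# gzip -cd streams the decompressed text to stdout; head stops after
# four lines, so even a multi-gigabyte file answers instantly.
gzip -cd peek.fastq.gz | head -4

# Interactively, zless does the same job as a pager:
#   zless peek.fastq.gz
```

Either way, nothing is ever written uncompressed to disk, which is the whole point of the trick.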
I would have assumed it's for quality, but it's never been defined; they just said that it's a FASTA file but with some quality scores. Now, when Sanger originally intended it, I don't know if it was built with the Human Genome Project in mind, but they were doing Sanger-based sequencing, and FASTA files were really the only means by which they were storing the sequences. For the FASTQ file, I would have assumed as well that the Q stands for quality, but then I would have thought, okay, is it called FASTAQ? Who knows? I'm willing to step back if anyone has any other guesses as to what those possibly are or why they're called the way they are; otherwise, if someone out there knows, I don't even know who created this, I'd have to ask someone at Sanger. "I think there was a software program called FASTA, way back when, and it defined the file type, which came to be called a FASTA file, and then the software kind of fell out of use, but the name stuck around for the file type." So why was the software called FASTA? I'm looking at the Wikipedia page. Who made that? Was that done at Sanger? David Lipman and William Pearson; it doesn't say Sanger. Okay: FASTA is pronounced "fast-A" and stands for "FAST-All," because it works with any alphabet. I guess originally there were FASTP and FASTN for protein and nucleotide. Which, I mean, those are still being used, less so. But remember how before I told you guys these can actually store either nucleotides or amino acids? That's where FASTP came into mind. FASTN, I remember, but it was one of those discussion points in a book, and everyone said ignore it; you're going to be dealing with FASTA. Okay, or "fast-A." Excuse me. "I have a question. For the FASTQ file, which range of quality scores is accepted? I mean, does it depend on the experiment that you're doing, or is it something fixed?" So this is actually dependent.
The FastQ files that you're typically going to encounter are machine driven. So we talk a lot about sequencing. For example, the Illumina sequencers used to start at, I think it was the character B, and move up from there, using an offset of 64. But that went against what Sanger was doing with their sequencing, where they were using an offset of 33, and that B, ASCII 66, was throwing things off. And so people were just like, okay, well, what is it that I can do? Well, Illumina finally came in line and went to that same 33 offset. So if I take a look at what that is... oops, it did not like that. Sorry, I'm just looking back here to see what the character for zero is. Oh yeah, it's the exclamation mark; sorry, I even said it. So we can go down to as low as an exclamation mark. And then the high end, really, I don't know what the upper limit is on that one. So I'm just sharing my screen again. I don't know what the upper limit is; I've never seen anything with a quality score higher than 60. So if we take a look at that 60 and add 33 to it, then we're dealing with a square bracket, the right square bracket. So that's probably the highest that you'll encounter on that. It does not like me going back and forth on sharing and doing the slides. Okay. Is there like a rule of thumb that the letters are good quality, but the numbers and characters are bad quality, or something like that? You know, it's funny you mentioned that. I asked that once when someone was calibrating our machines. So you're asking, who monitors the quality of the qualities? Is that your question? Yeah, I guess just looking at that list, it's hard to figure out where the cutoff of a decent quality is. But it seems to me that the letters are in the range of good qualities.
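To make the offset concrete: with the Sanger encoding, a quality character is just its ASCII value minus 33, and a Phred score Q corresponds to an error probability of 10^(-Q/10). A small sketch of both conversions (the function names are mine):

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to numeric Phred scores (Sanger offset 33)."""
    return [ord(char) - offset for char in quality_string]

def error_probability(q):
    """A Phred score Q means an error probability of 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)
```

So "!" decodes to 0, the right square bracket "]" decodes to 60, and a Q of 60 is that one-in-a-million error chance.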
But down at the bottom, unless I'm reversing it, the numbers and the other characters are not-so-good qualities. Oh, sorry, I see. So you're talking about from the ASCII table itself. Yes. Okay, I understand the question. So yeah, I mean, if you see a comma, for example, you know how far down the list that is? It's pretty far down there. So when you take a look at something like that, yeah, you can associate it that way. It's funny, I don't even see them as actual legible characters anymore; I just see numbers whenever I look at these things. But it's definitely an interesting way of looking at it. So yes, you can take a look at it that way. Most of the time, if you see anything in the range of an A or a B, you'd be like, is that really good enough? Because again, the highest value that I usually see in most cases is going to be a 60, which means a one-in-a-million error rate. So I guess that gives you the spectrum of how far up that's going to be. So I'll just chime in to say, usually when you take a peek at the data, you want to look at your data and get a sense of whether it's good quality or not, and you want to look for capital letters, right? If you see a bunch of capital letters, it's good. Depending how long your reads are, your quality can drop off at the end, or you might have a little bit of bad quality at the beginning. So you kind of want to see capital letters in the middle chunk. And if you see a whole bunch of commas, something went wrong, or those reads are terrible for some reason. But you really want to do some of the QC metrics that are going to come up in module three to look at how good your reads are. Lots of the algorithms for calling mutations will consider the quality of the base that supports the mutation when getting confidence of whether it's a real mutation or not. So they make a real big difference, but usually it'll be an algorithm that you use that will be looking at that quality.
Although it is nice to look at your own data and see lots of capital letters. So yeah, that's a good observation. By the way, there are so few people that look at raw data. I highly recommend just opening it up and taking a peek. It costs you a couple of seconds just to review it, and it also gives you some familiarity. So there's that whole idea of using the quality scores when you're doing your base calling as well as when you're doing your variant calling. Not all algorithms actually look at it, which is the sad part. Some of them do, some of them don't. They'll give you a variant call and you'll be like, well, I made some primers across this region and it's not showing up, what's going on here? It never hurts to look back and find out: was that base actually correctly called, or was it marginally called? So these are things you should take a look at. Okay. Any other questions? I had a question. Between the FastQ and FastA files, will the FastQ files also have those IUPAC code names in them? And will those also have quality scores assigned? So I have never seen any IUPAC codes in a FastQ file. I've even taken a look at the standard itself, and it doesn't say explicitly that they can't be in there, but I've never seen one. Good question. However, most of the time, you're always going to just see A, C, T, G and an N, at least from what I've gathered. I have a question about the FastA files. Yes, please. So are those updated regularly as more sequencing is done, and some of those Ns get discovered, or some areas of the genome get uncovered as sequencing techniques get better? Do they update these FastA files? That is such a great question. The answer to that is: it depends. Worst answer I could possibly give, but yes, it depends. Do they update it? Yes. Sometimes you'll actually see version suffixes on a reference genome, dot one or dot two.
And that tells you that they're progressing: they're actually figuring out more bases that were once upon a time N, or they're fixing positions, things like that. So are they updated regularly? Not every day, not even necessarily every month, but they do have a release cycle, don't get me wrong. GRCh37 was, what, February 2009. GRCh38, does anyone remember when GRCh38 came out? 2013, I think. Was it 2013? So you can see that there's that four-year gap where there are updates in between. Sure. Here's the problem. If you do all your analysis on one version, say dot one, and they come out with dot two, dot three, dot four on, say, even a quarterly basis, are you going to redo all your files on that? You're going to have to reanalyze everything. This has plagued literally all of us for years. And so while they do update them, I can't even say it's on a regular basis. I think Heather actually sent out a Slack message to our lab once upon a time talking about how the whole entire human genome reference is complete. I'm not about to go back and redo the analysis on two years' worth of pancreatic cancer data and reanalyze those. Maybe I'll take a look at it as part of phase two. But even if they do update those on a semi-regular basis, it's kind of hard to keep up with that analysis, just going over and over and over again. I highly, highly recommend: pick a reference from the beginning. Keep your eyes open, because if there's a glaring problem, say you're looking mainly at TP53 and there was an error that got fixed in TP53, whether it's a shift or something, then by all means go for it, because it's directly impacting your work.
However, pick yourself a reference, pick yourself a fixed version of that reference, and then stick to it, at least for the duration, unless there's a definitive reason why you shouldn't. I have to admit, I don't even know when the next reference genome is going to be made available, or even what's in the works at the Genome Reference Consortium. So I have one last question. Sure, sure. Which site would you recommend that maintains a pretty good up-to-date FastA file? Like UCSC, or Ensembl, or where? You're about to start a rumble. That is such a lab-specific thing; some people have their preferences. I'll be honest with you, I have always been using UCSC for no other reason than it was the one that was forced down my throat when I first started. Okay, that's it. There have been derivations of that particular reference genome from other people where they'll give you, you know, chromosomes 1 to 22, X, Y, M, and then they'll give you additional sequences. Some will be called junk sequences; for example, they'll throw in a couple of common contamination sequences that are on some benches so that you can eliminate those ones. There'll be those random fragments of DNA that they really couldn't figure out where they go, but they know they're part of the human genome, and so they'll stick those in there as well. But UCSC has always been my go-to any time I start at a new place where we're dealing with data. But keep in mind that this is actually going to require some collaboration with whoever it is, other labs that you're dealing with. If they have a fixed standard for which reference they use, then you may not have a choice but to use it. But for me personally, I've always been a big UCSC fan. Thank you. Okay, let's move on. Before we do the next file format, though, I've got to give you guys just a little bit of background on the idea of sequencing in general.
Okay, so let's just say this is human DNA; I'm just going to stick with that for now. If I get a piece of DNA, we don't sequence it from one end to the other. Number one, even just looking at chromosome one, that's 250 megabases; it's a difficult thing to do in one shot if you're doing it sequentially. Number two, just keeping up with reagents through the sequencing is going to require some labor there. And number three, really, it's time: it takes such a long time to just run through that. So what they actually do is shear the DNA so that we have fragments of some length. I've always found the whole shearing process fascinating, because I think the first month that I worked at OICR, they let me into the wet lab and they let me shear some DNA. And really, I pressed a button, and that was it. That was just the most anticlimactic thing I've ever encountered. But we'll get different fragments of DNA, and if we sequence all of these in parallel, well, we can get some information about them, and that should reduce our timeframes a lot. But the question is, if I just take that little piece there, how does the sequencing portion actually work? I'm not going to get into details about the specifics behind Illumina sequencing, nanopore sequencing, or anything like that. I'm just going to say that we have a fragment of DNA (and again, I'll be dealing with DNA here), we have some mechanism, the DNA just runs through it, and what we'll get is a FASTQ file. I haven't talked anything about read lengths or anything along those lines; just the fact that we're going to get a FASTQ file for that little segment of DNA, which we're going to call a read. That's kind of all you need to know.
We know that four lines make up the record in that FASTQ file, where we have the actual sequence and every corresponding base in that sequence has a quality score of some sort. Okay. So that's one approach: we could just read through that entire DNA fragment. That's great. There is another technology that is probably one of the most common ones used. This isn't to take away from any of the other technologies, but this is the one that's been used for the better part of a decade and seems to have taken over most things. Again, we have that same fragment of DNA. We're going to attach some mechanism at one end of it, and we are going to sequence up to a distance. So we sequence a certain number of bases and then we're going to stop. We're going to attach another mechanism on the other end, and we're going to sequence from the other direction, sequencing in. So from that one fragment of DNA we are going to get just the ends of it. This one we're going to call read_R1.fastq, and the other side we're going to call read_R2.fastq. And really what we're doing is we're taking a pair of sequences from that fragment, and you'll hear this term a lot, maybe some of you have already heard of it: it's called paired-end sequencing. Some people may wonder, well, what happens to the middle portion that you actually didn't sequence? That's a really good question. Here's the caveat. We're not just dealing with one cell; we're dealing with a bulk of cells. And so we're actually sequencing a whole bunch of these things, and if you look at things in totality, you're probably going to cover that middle portion to some degree. You'll hear the term coverage, or depth of coverage, a lot, and that's actually what this means: for any given position in your reference, how many reads or little fragments do we have piling up there that we can take a look at as evidence for what we're looking at?
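Since R1 and R2 land in two separate files, a common first task is walking the two files in lockstep to get each mate pair together. A minimal sketch, assuming gzipped files whose reads appear in the same order (which is how sequencers emit them); paths and function name are mine:

```python
import gzip
from itertools import islice

def paired_reads(r1_path, r2_path):
    """Yield (r1_record, r2_record) pairs from matched _R1/_R2 FASTQ files.

    Assumes both files list mates in the same order; each record is the
    usual four FASTQ lines.
    """
    with gzip.open(r1_path, "rt") as r1, gzip.open(r2_path, "rt") as r2:
        while True:
            rec1 = [line.rstrip("\n") for line in islice(r1, 4)]
            rec2 = [line.rstrip("\n") for line in islice(r2, 4)]
            if len(rec1) < 4 or len(rec2) < 4:  # one file exhausted
                break
            yield rec1, rec2
```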
Okay, so let's go back and look at it from just a single cell here, or sorry, a single piece of DNA. And what we'll do is we'll actually add some labels to these ones. So remember, we were talking before about how we're going to have one end of A, and we're going to have the other end of A. So this would be A_R1.fastq and A_R2.fastq. Keep in mind, though, that when you sequence something, it's not going to be this nice sequential thing; it's going to be all over the place. So you won't actually know the order of things. You won't necessarily know where these fragments are supposed to occur within that reference genome. So in this particular case, we were talking about chromosome one: D is on chromosome one somewhere, but I have no idea where, and the same for these others. I've got to figure out where each of these matches up to, which is going to lead us to the SAM/BAM format. I may just call it the SAM file from here on, and we'll get to the reasons why I'm not really talking about the BAM portion, even though it exists. Okay. So again, I like throwing you guys into the deep end, or straight to the sharks. Here's what the actual file looks like when looking at the characters themselves. And again, as with almost all the files you're dealing with, there's a header to give us information, and then there is a whole bunch of data, which gives us other information. So let's take a peek at what's going on here. The first line actually starts with an HD, and it's probably hard to see, so I just blew it up here. This tells us, first off, that all headers in a SAM file start with that at symbol again. So that at symbol tells us that it's a header line, and the HD tells us specific information about the file itself.
So the VN says that this is according to version 1.5 of the specification. The SAM/BAM formats actually have specifications associated with them. Someone came up with the standard and said: all of you, if you're going to be using the SAM format to convey what alignments are, you're going to use this. And I have to tell you, before the SAM format came about, it was a crapshoot. Your data could be in so many different types of alignment format. There was the MAQ one; there were others, and SOLiD came up with their own. It was absolutely crazy at times. So, sorry, back to my slides: it's a header line with the version of the actual specification used. GO is interesting. I never use this; I don't think I've seen it set to anything other than none. This is just my experience with it, but I always leave it here because it's usually put in there. It's supposed to represent the grouping of alignments, but it's kind of funny because I've never actually worked with that. The SO tag, however, is very, very, very, very important. It's one of those things where you can actually take a file, and we'll get to this momentarily, and you can sort it; but how many different ways can you sort it? This actually tells you that it's sorted by genomic position. So it's sorted based on the chromosome as well as the position itself. And we'll get to what all this means in a moment. @SQ is the next section in the header. And again, it was a little hard to see because it was so small, so I'm just going to copy it here. You're going to have in that header section a whole bunch of these @SQs, and what they represent is all the different reference names that were used. So you actually take a read and you align it to a reference genome of some sort. Remember how we had that FASTA file, the multi-sequence FASTA file?
And they all started with that greater-than symbol. The SN is the text, quote unquote, that was used beside that greater-than symbol. So it tells you what the actual sequence is; in this particular case, that first line's SN is 1, chromosome one. How do I know that? Well, I actually aligned it and did that. And we also know the length of that particular sequence, which is approximately 250 megabases. So yes, the order of this is all important. Let me just get you back to the FASTA file that we were dealing with previously. If you remember this slide, we did a quick grep of the FASTA file that we were dealing with. We actually use these FASTA files in all our alignments, and you'll be dealing with this a lot more in module three. We deal with these 'chr' prefixes, and you notice the corresponding values for those SNs are a little bit different. That just tells us that the FASTA file that they used actually had 1, 2, 3, 4, 5, 6, 7 instead. Nothing really special. Just keep in mind that, again, because there's no fixed naming convention, this is really institute dependent. UCSC will always use 'chr' in front of theirs. NCBI, I believe, only uses the numeric value, with the exception of X, Y, and MT. So they just skip the 'chr', which is fine; I mean, you're really just carrying around three extra characters for no reason when you know what this is. So keep in mind that the order in which you do your alignment is going to be represented in this file as well, which is why we say keep everything in that numeric order. Read group. If I told you that you're not just going to be sequencing one sample, but you're going to be sequencing a lot of samples, would anyone here really be surprised? No. I mean, I'd like to hope that you guys are doing a whole bunch of samples.
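All of these header lines (@HD, @SQ, and the @RG and @PG ones coming up) share the same shape: a record type, then tab-separated TAG:value fields, which makes them easy to pull apart. A rough sketch (free-text @CO comment lines would need special-casing, which I skip here):

```python
def parse_header_line(line):
    """Split one @-prefixed SAM header line into (record_type, {tag: value})."""
    fields = line.rstrip("\n").split("\t")
    record_type = fields[0].lstrip("@")  # e.g. HD, SQ, RG, PG
    # split on the first colon only, so values containing colons survive
    tags = dict(field.split(":", 1) for field in fields[1:])
    return record_type, tags
```

Splitting on only the first colon matters for things like the CL command-line tag later, whose value can itself contain colons.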
And specifically, since it's cancer related, you're going to be looking at tumor-normal pairs; you're going to be looking at multiple samples. The read group itself, I'm going to break out to make it a little bit easier. A lot of the time you'll see it as a single row of data; I'm going to break it down into columns just to make it easy for us to view, but it has the same data in there. This read group section tells us some information about the sample that you just sequenced. So any time you see @RG at the beginning of a line in that header section, it's going to be a read group. Okay. The important things are the tags that come after it. ID: you've got to have a unique identifier for this particular sample and library. It's just one of those things where you need to identify it among all the craziness in there. SM is the actual sample name that it belongs to. This is a bit of a debatable thing. A sample name can be a person's name and their MRN if you so choose. Please don't do that, because that's illegal for privacy reasons. Most of the time people come up with their own naming conventions, and we literally use sample one, sample two, sample three. I'll be honest, I've done it in the past. We actually came up with a more standardized means of naming our samples, but keep in mind that you can easily do it that way. LB is the library. It's funny: with libraries, some people will actually use the library name and the sample name as being the exact same thing. What's a library compared to a sample? Does anyone know? Actually, I didn't even ask this question originally. How many people have actually started looking at sequencing data? Raise your hands. Instructors, please do not raise your hands. Okay, TAs, don't raise your hands; I know you guys. Okay, so some of you have done it, which is absolutely fantastic. You get a sample, then what do you do with it? Someone shout out an answer, it's fine. You prep your library?
Prep the library, exactly. Well, I mean, some people actually give you tissue, and then you have to extract your analyte, whether it's DNA or RNA, whatever the case may be. And then from that, you're going to have to prepare a library. And that library is very specific to the type of sequencing that you're going to do on it. Here's the thing, though. How many libraries can you pull out of a single sample? Really, you're only limited by how much sample you have. That's all it is. You can re-prep that library over and over and over again. And you want to know the different libraries that were created for each of those samples. Why? What happens if you have one particular library from that one sample that has all sorts of fusion genes all over the place, just absolutely crazy, and then one other library that has absolutely none of those? If you mix and match and don't define what those libraries are, you may be thinking to yourself, oh man, what's going on here? So it's really a way of separating these things. The next one we have is PU, the platform unit. Now, I'm going to tie this into the platform itself, which is a little bit further down the line. But for the platform unit, you need some form of identifier to know exactly what this was sequenced with. Saying it was sequenced with an Illumina machine doesn't tell you anything about the time or the flow cell that was used. So that's what we actually use: we'll stick the flow cell information in there. Others will just choose a run name, whatever the sequencer actually spits out; with Illumina sequencing, it actually spits out the flow cell as part of the run name. So that's one of the ones that we use. CN says, okay, where was this sequenced? Give me some information about the center. And then PL, the platform, just says what platform was used. This could be Illumina.
This could be Oxford Nanopore. This could be 10x Genomics. It could be any one of those. You can have a whole bunch of RGs in one file. I don't put multiple samples into a single BAM file, because it confuses me when I have to make decisions when I call variants and things like that, so I'd rather not. What I do is keep each individual sample separate. Now, I can merge multiple libraries into one, and that's something that we do very commonly. So that's the only instance where I've ever had multiple RGs in a file. If anyone has any other stories of where they use multiple RGs, any of the other instructors or TAs, please jump in. But that's just, again, my experience here. Excuse me. Yes. Can you explain library more? Because I haven't been in a wet lab ever, and I didn't understand the library itself. Okay. Anyone here a wet lab person? No instructors, students only. Because I'm about to butcher this, so if anyone wants to jump in. A lot of the time, what's going to happen is you're going to receive a vial of, let's stick with DNA, it makes things a little bit easier. We get ourselves a vial of DNA, which was prepared by someone, extracted by someone, and we want to get some information about the DNA that's in there. For us to get the sequence information out of there, we need to prepare what's called a library from it. And a library is, well, there's a set of procedures that are followed in order to take your DNA and prepare it to actually be used inside the sequencer. I'm skipping some steps here, trying to make it really simplistic. When we prepare that library, it is very, very, very specific to the instrument at hand. And once it actually gets put into that sequencer, we'll get the information out that we want. Here's the catch, though. What happens if someone prepared that library really, really badly? We're probably going to get really bad data.
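Putting the read-group tags together, here's a sketch of assembling an @RG line the way I've been describing. The tag names (ID, SM, LB, PU, PL, CN) are from the SAM spec, but the sample values in the usage note are made up for illustration.

```python
def make_read_group(rg_id, sample, library, platform_unit, platform, center=None):
    """Build an @RG header line from the tags discussed above (SAM spec names)."""
    tags = [("ID", rg_id), ("SM", sample), ("LB", library),
            ("PU", platform_unit), ("PL", platform)]
    if center is not None:
        tags.append(("CN", center))
    return "@RG\t" + "\t".join(f"{tag}:{value}" for tag, value in tags)
```

Something like `make_read_group("sample1.lib1", "sample1", "lib1", "FLOWCELL1.LANE1", "ILLUMINA")` gives you one tab-separated header line; sequencing the same library on a second flow cell would mean a second @RG with the same SM and LB but a different ID and PU.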
Well, do you still want to sequence that original library that was made? Probably not. The only caveat is, if you have very, very, very small DNA amounts that they gave you, you might be stuck. In that case, good luck; something really bad happened and there are much bigger problems. If, however, someone gave you, say, five mils, that's great, I can work with that. Technically, I only need nanograms or picograms, depending on what I want to actually sequence. But now I have the opportunity to say, okay, I'm going to go back to the original DNA that I had, I'm going to re-prepare it, because the first library really didn't pass any of my quality metrics. So someone's doing it quite literally from scratch again. And then I'm actually going to take that and sequence it, and hopefully that one's much, much better. So hopefully no one screwed up and you get much better data. If you still get bad data, well, you start wondering what's going on with the DNA itself. Does that make sense? Can I just add a little bit to that, which was great. Please. When you think about libraries, like Richard said, they're for a specific purpose. So if you're doing whole genome sequencing, your DNA is going to get fragmented into 350 base pair chunks, and then you put adapters on the ends, and those attach to the primers on the Illumina flow cell. You can't take that preparation, or that library, and do long-read sequencing on it. You have to prepare your tissue DNA in a different way, where you're now extracting super, super long pieces, and you have different adapters, and you can sequence that library on a PacBio machine. So libraries are made for specific sequencers and for specific assays. And they can be DNA libraries or RNA libraries, et cetera. It's just to get your material in the right form for sequencing. That's a great point, Sarah.
Thank you so much for that, because I neglected to talk about the different platforms. So can you use the same library for different experiments? Does it matter what experiment you're doing? Yeah. So if you had that library that you made for whole genome sequencing, where you're going to do paired-end read sequencing, and you sequenced it and you realize, oh, I actually need more reads from this library because I'm looking for some rare things, you could just sequence more of that library. So if you have a good library, you can keep going back to that library and sequencing more of it. If you want to now do long-read sequencing to complement that short-read sequencing, you go back to your tissue, you make another library, and then you sequence that long library on a different instrument or a different platform. Okay. Thank you. So I have a question. When you sequence the same library for a second time, do you give a different LB name for that library? If you're doing it from the same library, no, that library name stays the same. There is a catch, though: you're most likely going to run it on a different flow cell, maybe even a different instrument, and so you'll have different information based on that. But for all intents and purposes, that library name is exactly the same as it once was, because you're literally just taking an aliquot of it and sticking it on a flow cell, again going through that process of putting on adapters and so forth. All right. Thank you. Does everyone understand, from what I mentioned, why you would actually sequence that same library again? It's very important: the whole idea is that the more information you get, the more data you get, the better statistical power you're going to have when making decisions. And so we always talk about, how deep do we go? I don't know about most people.
Jump in if anyone wants to say anything about this specifically, but for Illumina data, I've always done a 60-30 rule. So I'll always go, at any one given location, 60 reads piling up in one particular spot for the tumor, and then, looking at the normal tissue, 30 reads piling up at any one particular spot. Okay, I'm going to keep moving on here. So we had that PG line, which again is difficult for people to see, but I broke it out again, and I'm going to break it out this way. The PG is really one of those interesting things. If I told you you can get a historical view of every program or tool that was used to process a file, that would be great: I've got an audit trail. I just realized, after saying that, that my view of "great" is probably different from most people's, but I get interested in it because I can go forwards and backwards in terms of looking at data and figure out exactly how someone processed it. So that's where the @PG comes in. There can be multiple of them in a single SAM or BAM file, and the reason is you can do a lot of things in post-processing. You can take a SAM file, you can convert it to a BAM, you can take that BAM and sort it, and then re-sort it. So you get this audit trail of exactly what that file has gone through. Let's go through some of the tags here. ID is a unique program identifier. I don't know about most people, but I usually just give it the name of the actual tool and then add a suffix to it, .1, .2, depending on how many times I've used that thing. But that's just me. PN, the program name, is the actual program used, in this particular case here SAMtools. VN lets you retain the version of the program used, and I cannot emphasize enough how important versions are when you're dealing with the data that you are. If you start changing versions in the middle of it, you may see some really weird things.
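One nice property of that audit trail: in the SAM spec, each @PG record can point at the record before it through a PP (previous program) tag, so you can reconstruct the order in which tools touched the file. A sketch, taking the @PG records as already-parsed dicts of tags (the example IDs in the test are hypothetical):

```python
def pg_chain(pg_records):
    """Order parsed @PG records (dicts of tags) from earliest tool to latest.

    Uses the SAM spec's PP tag: the earliest record has no PP, and every
    later record's PP names its predecessor's ID.
    """
    by_previous = {record.get("PP"): record for record in pg_records}
    chain = []
    current = by_previous.get(None)  # the earliest record carries no PP tag
    while current is not None:
        chain.append(current)
        current = by_previous.get(current["ID"])
    return chain
```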
So for example, if you have a thousand samples that you've processed, and half of them were handled with SAMtools version, say, 0.9 and the other half were done on SAMtools 1.3, and you see some huge discrepancy, like there's some batch effect happening there, look back at the actual versions that were used, because a lot of the time you're going to notice that was actually the culprit. The best part about this, for anyone looking again at the audit trail, is the CL tag: you can actually see the exact command that was used to process the file that you're working with. So this is great. Again, my definition of "great" really must be different from most people's. Okay, so now that we've got the header portion out of the way, let's take a look at the data, because this is where the real fun stuff happens. You're going to be going over this in a whole lot more detail in module three, so I'm going to go over it just to give you some familiarity. This is a long file. It's a tab-separated file that has a lot of information, so let's break down some of the more important pieces that you'll be dealing with most of the time. The first column that you work with is the QNAME, or query name. The next one is going to be a flag. It's funny, there were a whole bunch of these things that I was writing out last week, and then I realized, as much as I like having this information here, you guys can just read up on this. Let's actually apply this and figure out what's going on with each of these fields themselves. So I'm going to take this one, the first line from a SAM file. We take this first line, and does it make sense to anyone? For anyone who's actually looked, or not even looked at the data, it can be a little daunting. But let's break it down. Again, I just took the information and put it into a more columnar format for us to look at. So that first tab stop has this: it's a query name.
The HiSeq part of the name tells me that it's probably a HiSeq machine, unless you're just playing with me. There are some numbers; I don't know exactly what those are. And then there are some characters at the very end of it, and I wonder if that's a molecular barcode. It gives me some information, and this is actually going to be unique for this particular bit of sequence. The next one is the flag: flag 129, which quite literally means nothing to most people. So I put this on a separate slide here: this is probably the most important site you will ever encounter when looking at SAM/BAM files. The flag is this really weird way of reducing a whole lot of information into a single number. This is what that site looks like, and I can take that flag, 129, pop it into the text box and hit explain, and what it does is put little check marks beside all the properties of this particular sequence. And this is fantastic. The first thing it lists for me is the fact that this is a paired read. Okay, it's a paired read: I'm most likely dealing with an Illumina paired-end run. That's some great information. Notice the other one that's check-marked there is "second in pair". So this most likely comes from that _R2.fastq, that other end that we were dealing with in the picture previously. Okay, this is great: I've got a little bit of information about the sequence I'm looking at. I can actually do the reverse, too; this is just a little side thing. You can just randomly start clicking on things and it'll update the SAM flag for you and tell you what that number is. So if something's unmapped, it just gives you a four. If it's also paired, that turns into a five, and so on and so forth. Sorry.
For anyone who's into the math of this sort of thing, really it's just binary: bits that get turned on or off depending on what's there, then converted to a decimal value for your ease of use. Okay, so this is just some of the information about what's there. I'll be honest with you: while it's great to look at absolutely all of them, these are the ones I always focus in on. Is something paired? Is something mapped or not mapped? If you're looking at paired-end data, is the corresponding other end of that fragment mapped or unmapped? Does this thing align to multiple locations on the reference genome? If it's multi-mapped, that's always one of those questionable things that we look at. One other thing: during library preparation, when this thing is PCR'd, did we happen to sequence the exact same fragment, so this is just one of the PCR duplicates? You have the ability to flag that as well. So those are some of the things that I look at personally. Okay, the next one, RNAME, really is just the reference name. Remember the @SQ lines, where we had a listing of all the chromosomes our sequences were aligned to? This just tells you that this read aligns to chromosome 1. Subsequently, POS gives us the position on chromosome 1. The next one here is the MAPQ, the mapping quality. This is no different from your Phred score. Keeping in mind the Phred scores from your base qualities, can anyone tell me what a zero means? That it's probably incorrect: we have zero-to-low confidence that the alignment listed, chromosome 1 position 138,017, is the correct one. It's a bit of a guesstimate. And I know that I'm going through this a little quickly; the only reason being that you actually cover it in some good detail in the next module.
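That bits-turned-on-or-off idea is easy to sketch: the flag names below are the standard ones from the SAM specification, and decoding is one bitwise AND per property, which is all the "Explain SAM Flags" site is doing.

```python
# Sketch: decoding a SAM FLAG. Each property is one bit; the decimal flag
# is just the sum of the bits that are set (129 = 0x1 + 0x80).
FLAG_BITS = {
    0x1:   "read paired",
    0x2:   "read mapped in proper pair",
    0x4:   "read unmapped",
    0x8:   "mate unmapped",
    0x10:  "read reverse strand",
    0x20:  "mate reverse strand",
    0x40:  "first in pair",
    0x80:  "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails platform/vendor quality checks",
    0x400: "read is PCR or optical duplicate",
    0x800: "supplementary alignment",
}

def explain_flag(flag):
    """Return the list of properties whose bit is set in this flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

print(explain_flag(129))  # -> ['read paired', 'second in pair']
```

This reproduces the examples from the talk: flag 4 is just "read unmapped", and adding the paired bit makes it 5.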
I just wanted to give you a small introduction so that when you see it in more detail, it gets reinforced for you. [Question:] For the MAPQ, is it the average across all of the base qualities of that sequence? No, this is the mapping quality, so it's not. It says: if I have a sequence of, say, 100 bases, and I align it to chromosome 1 position 138,017, does it actually match up well, yes or no? And it goes through some beautiful math to say, okay, this really doesn't align here well, but it also doesn't align anywhere else well. The CIGAR string is probably one of my favorite things, which again you'll go over in greater detail. This particular one says: okay, I have 98 bases here in the sequence, and those 98 can either match or mismatch. It won't tell you any more than that; you have to look at the sequence itself to determine that. But it gives you a quick synopsis of what length you're dealing with and what the contents potentially are. Before we get to the next two: if I take this QNAME, this sequence, and this quality, does anyone wonder what's going on with these three? If I isolate them — jumping ahead of myself a bit, but I didn't want to forget this — put an @ symbol there, take the sequence, put a plus sign here, and then take the quality scores: what does this look like? Anyone? Yeah, it's a FASTQ. It's a FASTQ. It's great. This actually comes directly from the FASTQ information, and the SAM just tags on some more details. Just be aware that you can go back and forth from a BAM to a FASTQ. It's much easier to go from the BAM or SAM to the FASTQ, because you just need to slough off some columns and then reorganize them.
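That column-sloughing can be sketched in a few lines. The SAM line below is a toy example I made up; a real converter would also reverse-complement any read whose reverse-strand bit (0x10) is set, which this sketch skips.

```python
# Sketch: rebuilding a FASTQ record from the QNAME (col 1), SEQ (col 10),
# and QUAL (col 11) of a SAM data line. Toy line, not real data.
sam_line = "read1\t129\tchr1\t138017\t30\t4M\t=\t138100\t100\tACGT\tIIII"

def sam_to_fastq(line):
    fields = line.split("\t")
    qname, seq, qual = fields[0], fields[9], fields[10]
    # @name / sequence / + / quality string: the four FASTQ lines
    return f"@{qname}\n{seq}\n+\n{qual}\n"

print(sam_to_fastq(sam_line), end="")
```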
Going from FASTQ to a SAM file requires a separate tool, an aligner. I just wanted to bring that up to you guys. SAM actually stands for Sequence Alignment/Map, which is something I didn't tell you at the very beginning. The format gives you the information about where the sequences you've thrown at it align on the reference genome. It contains two sections: a header section and a data section. Lines starting with an @ symbol represent the header, and again we have the @HD, @SQ, @RG, and @PG lines. Then you get the data itself. The data contains the information about where those sequences align, and we've gone through some of the different fields, at least quickly, to describe what information is found in there. Also, there are paired ends and there are single fragments; just be aware that those are the different types. Now, similar in nature to that .gz we talked about, a SAM file can be huge: it can be on the order of a terabyte for a single tumor, from what I've encountered. When it's compressed, it goes down to about 200 gigs, so it's still a pretty big file. A BAM file, again, is just that compressed version. Keep your BAM files, get rid of your SAM files; rule of thumb. You'll be doing a lot more with the BAM. But you can't just zless a BAM file; you'll have to use specific tools designed for it, and again, you'll be using those tools in module 3. One of them specifically is SAMtools. I'm just throwing it down here because, depending on how much time you have, you'll actually be doing some lab work with it. If anyone's interested, you can take a look at the specification, which is found online; I put the link there for you. Does anyone have any questions about SAM/BAM files? We have three more file formats, one called the variant call format.
And two other ones which we'll go through really, really quickly, because I'm kind of running out of time here. Actually, a question for you, Rashad: is Trevor going to be taking over my previous slot, or can I just continue? "I think you should just continue." Okay, cool. Sorry, everyone, my timing is being thrown off here. Okay. So, questions? Yes? Anyone? There's a lot of information; I threw a lot at you in hopes of, again, just an introduction, and you will get to look at this in a lot more detail in module 3, where you'll actually be working hands-on with this data. Great question, sure. [Question:] "I think I got some answer in the Slack channel, but when you create these SAM files, if you sequence the same library twice and you want to distinguish the two using the read name, do you need to do that yourself, or is there a way that when the SAM file is created, it does it for you?" So this is probably getting into a little bit of module 3, but I'll answer your question now. Unfortunately, the tool that you're using has no idea about the library or anything; you have to fill that information in yourself. When you're executing the command, for example with a tool called BWA, you can fill in that read group information: there's a parameter for it, and you'll literally write LB: and then add whatever name you choose. The thing is, that has to be done manually by you. Could you write a script to do it for you? By all means, go nuts, but be aware that the aligner itself doesn't necessarily know anything about the libraries.
The onus is on you to label those, which kind of makes sense: would you feel comfortable just giving it a file without giving it any information about what the library is? No, you probably want to tell it. [Question:] "So the sequencing center would not give me any of that information, right?" Sorry, the sequencing center, that's a different story. If, for example, you give tissue, or even a vial of DNA, to a sequencing center and say, okay, I need this thing sequenced: when they return it, most likely they'll have their own internal naming for libraries, they will identify that, and you should be able to see it inside your SAM header section. But then again, a lot of the time, someone will tell me: Richard, I need this thing sequenced to 60X coverage. Okay, the first pass I do, I only get, like, 20X. I've got to run that library again. When I run the library a second time, well, it really is still just that same 20X; there are really no differences, and the library itself is considered not that complex. I'll have to make another library. When I make that new library, I will give it a different name, because it is a different library. And then I will return that information to whoever asked us to sequence it, saying: okay, we did this, there are multiple libraries in there for that one sample, here you go. [Question:] "Okay, so if they only give me the FASTQ files, that information is not there, right?" If they only give you FASTQ, that is correct: it is not there at all. [Question:] "Hi, Richard. I have a question along those lines. I had a problem when I was marking duplicates using the Picard tool: it wanted the read group information, and I had to go into my FASTQ file to look for it, and not all of this information was available. So what would you recommend in that situation?"
Oh, this becomes a street fight for a lot of people, because it's happened to me in the past: I've had things sequenced, they've given me the raw FASTQ files, and they didn't tell me that the sequencing was done from two different libraries. If they don't tell you that, what you'll end up with is the following. If there are two different libraries, you don't know they're two different libraries, and there happens to be overlap of the exact beginning and end of a fragment, you're going to collapse that down, so it's going to be considered a PCR duplicate. If they give you the FASTQ files and they don't tell you it's multiple libraries, you have no other choice but to assume that it's only one library. If you find a problem — and again, this may be the point: did you ask them to set a specific minimum coverage cutoff, like say 40X? If you didn't do that, you might want to talk to them and tell them: well, this is what I was actually aiming for; what coverage did you aim for? Sometimes they won't do the alignment at all; they'll just say, okay, I'm just going to do the sequencing and generate FASTQ files, and then you go deal with it. But at that point, they need to tell you whether they've done it as multiple libraries or not. So that goes back to them, and that's pretty much the only advice I can give you: discuss it with them, because otherwise it really is a crapshoot. I definitely feel your pain, because I've encountered it as well. [Question:] "Thank you. My next question is about the @PG group. You mentioned the VN, which is the version of the program used, and that it can result in some batch effects. So how do you recommend addressing that? Do we address it in the analysis part, or do we address it at all?" So there are two parts to this one; again, it depends on what you've been given.
There are some groups set up so that they'll just give you the BAM file: we did the analysis for you, here you go. And you can see that there are different versions being used. Right at that point, I would ask: give me the raw data, let me do this myself. That's just me; it just happens to be the way I work. If not, you could always go back to them. This also happened at one of the production labs I worked at, where they had to reprocess everything using the exact same version. And usually that's a discussion point: what version do you want us to use? You have to tell us what program, what version. Those are usually set up in the agreement portion of things. So if it wasn't set up originally, get yourself the original raw FASTQs and just do it yourself — assuming, you know, you have the computational power to do it. If not, you're a little bit at the mercy of whatever analysis group is inside that center. "Thank you, Richard. Thank you so much." "I'll just add to that. Richard, you showed that the FASTQ info is actually in the BAM file, right? So if you can't get the original FASTQs, often you can use a program called bam2fastq to re-extract all the reads and then do your own alignment. And there's only one caveat, which you can see in the @PG: sometimes the parameters used for alignment are such that they don't report, let's say, unaligned reads, and then the FASTQ you extract would not match the original FASTQ. So you just have to be aware of exactly what was done to generate that file. But you can usually regenerate the FASTQ." Thank you. Awesome, that's actually a really good point. Okay, I'm going to move on to VCF, which again is one of those other important files. And as usual, let's just jump in at the deep end and show you what a VCF file looks like. Hey, big surprise.
We have a header and we have some data. You're probably going to notice this is a recurring theme; sorry about it, that's just the way it works. Let's go ahead and isolate some of these. I'm actually going to start with the data this time, because this is where things get a little more interesting; you kind of have to work in reverse order for this particular format. So the data portion is a tab-separated section, and I'm going to break down that first line, which is the header for the data section. It starts with a single — do people still say pound? Because I do. Most people these days, especially the grad students, keep calling it a hashtag, and it freaks me out. But if you hear me say pound, that's what I mean. "You may have just dated yourself." It's possible. Come on, Brandon. So I still say pound. "You're old too." Yes, okay. So here's a little aside: you'll all get there at some point. When I first started at OICR — and Francis will remember this — I was one of the youngest people in the lab. Now that I've come back, I'm the oldest person there. It's freaking me out. Okay, that's an aside. So for this particular record we're going to look at: the chromosome — this happens to be chromosome 9. The position happens to be at about 130 megabases from the start. The ID is just a dot; a lot of the time you don't necessarily have an ID for a particular variant. And I guess I should have told you beforehand — I have it in my notes, but I didn't say it — this is the variant call format, so the purpose of this file is to store variant information. That means you've done the alignment: you've taken your reads, you've aligned them to the reference genome, and for the most part a lot of it is going to align exactly.
That is probably the most boring thing you will ever deal with: things that match exactly have no interest to me, at least when I'm looking at it from a cancer perspective. "That's why you work in cancer: because you like variants." That's exactly what it is. I like things that differ; things that differ are what's interesting. Because if it wasn't different, I probably wouldn't care so much. So this is going to tell us, for the particular base we're looking at: what chromosome is it on, and what position is it located at? There isn't necessarily going to be an ID; I have to be perfectly honest, I've almost always seen it as a dot, but that's not necessarily the case. Then it tells us, from the reference genome, what the reference base is at that position: it's a C. And it tells us that in your reads, however, that position is a G. Okay, that just got a little bit sexier. We also get information about the quality of that variant. You'll always see a quality column, but the quality itself isn't necessarily filled in, depending on the tool you're using. This particular file, I believe, comes from a tool called VarScan. Then you get a filter, and this one says: well, it passes all of our internal checks. Here's where things get kind of sucky: you get this INFO string, which has a whole bunch of characters in it, you get this FORMAT string, and then you get this normal and tumor thing. We'll get to those in a bit, but let's start by going back to the original file. Remember how I said there's a header? Let's extract that header and take a peek. Here's the data; let's go back to the INFO string. Whenever you look at the header, you're going to notice that some lines start with INFO — so, pound, pound, INFO (##INFO).
And those are a description of the information in the INFO string. So, for example, let's take that DP=83. If I look inside the header, I'm going to notice an ##INFO line with ID=DP, and then it gives us a description: the total depth of quality bases. That's a little misleading; I wouldn't necessarily use the word "quality", I would just say bases. This is the total number of bases that pile up at that one position. So you basically have 83 pieces of evidence for that particular variant you're looking at. If this thing happens to be a somatic mutation, sometimes you'll see the word SOMATIC thrown in there; that's if the tool is calling somatic mutations. Otherwise, you won't necessarily even see that line, because maybe you're just calling against the reference genome, and only the reference genome, without caring about tumor and normal values. Here you get some more information: for example, with VarScan you get the somatic status. You'll notice that in our particular case SS is equal to 3, and if we look that up, that's a loss of heterozygosity. Okay, I'll keep that in mind as I go. The next portion gives you some kind of quality score for this particular variant. And again, we know what the Phred scale is; this is the third time we're seeing the Phred scale, but applied to something different. Its calculation is still pretty much the same. Depending on the tool you're using, they'll also give you some stats: in this particular case, comparing the alignments for the tumor and the normal, you get a p-value associated with that. This, again, is going to be very, very tool specific.
And again, it's going to be covered a lot more — I believe it's module 4, variant analysis. I'm just explaining that if you do get yourself a VCF, don't be worried about these INFO strings, because you can look the information up, at least for what it means, if not necessarily how it's calculated. The last portion here just tells us it's another Fisher test comparing tumor and normal, so you can do that comparison. Again, I'm not trying to be specific here, because you may not necessarily get a GPV, SPV, or even a DP or SS in your VCF; this is very, very tool specific. But when you see that INFO string, you can always go back to the header and find out exactly what those components are. Okay. This is where stuff gets really fun: the FORMAT string. I pulled up the FORMAT and you see it at the top now. This is going to tell you information about the variant calls themselves. Technically speaking, you know that the reference call there was a C and the alternate call was a G, but that's only relative to the reference genome. In a VCF file — because the majority of us are doing cancer — we can actually compare the tumor to the normal, and that's where things get fun. So in this particular case, you can look up what GT is: GT is the genotype. You have the genotype quality, you have a read depth, you have the reference depth and the alternate depth; all that information is there, and I leave it to you to look it up. Let's take them one by one, along with the tumor and the normal. For the genotype, most people are used to the AA, AB, BB notation; anyone working with arrays is probably more familiar with that. We're going to change things up a little and give them numeric values instead, where 0 is typically going to represent the reference base.
1 is going to be the alternate. So in the normal you have both alleles: one that's a C and one that's a G (0/1). In the tumor, you have 1/1, which means that you have both alleles being G and G. Okay, that's interesting: right away you can tell there's probably a loss of heterozygosity. Is that a safe assumption, everyone? Probably a safe assumption. Okay, so if we shift along to the quality — eh, VarScan didn't give us the quality score for this one — we can keep moving on to the other values. Next is the depth of coverage that we get in the tumor and the normal: we have 42 and 41. That's some pretty good evidence, depending on what your thresholds are. Now, keep in mind, make sure that you do have a threshold cutoff. Whenever you go and do your sequencing, you need to tell them what your expectations are. My expectation, a lot of the time, is 60X read depth. I'm only getting 41 here; is that sufficient for me? No. Am I willing to accept it? Maybe. Can they prepare another library and sequence it? Yeah, okay, let's see if we can get that — and hopefully they don't charge me, which is why you set those expectations up front. You also get the number of reads that support the reference, as well as the alternate, so you can break those down. These next ones are the variant allele frequency, which so many people calculate manually on their own. Some tools will just calculate it for you — not all tools, unfortunately, but a good portion nowadays do. And this one here seems to be very VarScan specific, or at least tool dependent: you can get counts of the directionality of some of the calls you're getting — whether it's on the reverse strand or the forward strand, and whether it's the reference allele or the alt allele. Okay. "Richard, I have a question here."
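The FORMAT walkthrough above can be sketched in a few lines: zip the FORMAT keys against a sample column, then compute the variant allele frequency from the reference and alternate depths. The field names here follow this VarScan-style example; the exact keys vary by tool, and the values are the ones from the slide.

```python
# Sketch: pairing the FORMAT string with one sample column and computing
# the variant allele frequency (VAF). Keys/values mirror the tumor column
# in the example: GT 1/1, no GQ, DP 41, with 10 ref and 30 alt reads.
format_str = "GT:GQ:DP:RD:AD"
tumor_col  = "1/1:.:41:10:30"

def parse_sample(format_str, sample_col):
    """Map each FORMAT key to the matching value in a sample column."""
    return dict(zip(format_str.split(":"), sample_col.split(":")))

tumor = parse_sample(format_str, tumor_col)
rd, ad = int(tumor["RD"]), int(tumor["AD"])
vaf = ad / (rd + ad)          # 30 / 40 = 0.75, i.e. the 75% from the slide
print(tumor["GT"], round(vaf, 2))
```

Note that RD + AD here is 40, not the DP of 41; that one-read gap is exactly the filtered-versus-unfiltered depth issue that comes up in the question below.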
"So, when we're looking at the frequency, does that mean, like, 75% of the sample is the tumor variant?" Yes. "Since it's a mixed sample, so then 45%... but why isn't it adding — oh, it is adding up. How does the tool know which one is normal and which one is tumor?" You tell it. You have to tell it which is tumor and which is normal. "All right, thank you." Yep, okay. Very cool format. Again, this is just a summary; just be aware that you're going to cover this in a lot more detail in module 4, which is great because you'll be doing it more hands-on. There is, of course, a specification for this one as well; I'll put the link there so you can take a look at the VCF specification. These ones stay relatively fixed for longer periods of time; the early days were a lot more chaotic. I highly recommend, if you want more detail on the VCF — because I actually only went through a portion of it — take a look at that specification. It's not exactly the most fun thing to read, but it is something you can do. Questions? Okay, two more file formats I'm going to go through really quickly, and then we'll do some database work. "There's a question in Slack for you, Richard: do the RD and AD numbers add up to DP for tumor samples?" The reason I'm going to look back and say yes or no is because it depends. I hate saying that, but it really does. In our particular case, you'll see RD and AD here. What was the depth, 42? Here it adds to 40; there's one missing. Here's the caveat to all of this. The read depth itself is usually just the total number of reads that support that particular location. When you get to the RD and AD, here's the thing: the tools actually go and do a filter check for you. They say: I'm going to take a look at all these bases. For example, let's take the normal and its 42 — actually, let's take the tumor.
Sorry — there are 41 pieces of information here, and only 40 of them pass the tool's quality filters. It's an internal thing. They say: okay, we're only going to give you the counts supporting RD and AD for those that pass our quality filters. And they don't tell you this; you won't necessarily see it until you look at the data and ask, why don't these things add up? Then you go to the documentation and realize: okay, that 41 is unfiltered, and the 10 and the 30 are filtered. How do I know this? I broke my head against the table trying to figure it out when GATK first came out, because they did that; they were so notorious for it. This isn't to slam them — I love the Broad Institute — but boy, you did not make that clear in your documentation. So be aware that those numbers won't necessarily add up; always take a look at your data and then go back into the documentation for your variant calls. "Yeah, and I'll just add to that. It's buried in the default parameters somewhere. So when you run VarScan or Mutect2 or whatever, they have default parameters and they're not obvious; you have to look at the documentation. You can change them if you want, but usually these reads are filtered to exclude the reads that have a mapping quality of zero, right? The aligner isn't really sure that read goes there, so why would you use it for variant calling? So it gets thrown out. If you have a read where the base quality is really bad, so you're not sure it's actually that base, you don't want to consider that base for mutation calling, so it gets thrown out. So you have your read depth, minus all the reads that didn't pass those filters, and you can change the parameters to control what you're comfortable with including or excluding. You typically end up with much less. I've sometimes seen it be half at certain positions, right? Only half the reads are used. It really depends where you are in the genome."
Yeah, I can say I've encountered that, where it's down to half or even a quarter of what you're expecting. "Did I answer your question?" "Can I ask a separate quick question? So you see normal reference reads in the tumor sample, but it's called LOH. So presumably those are just from normal stromal cells in the sample, or there's some threshold?" We're going to talk about this in detail in module 4. "Okay." [Co-instructor:] "Let me go over loss of heterozygosity. So this position in the genome is a variant position where you inherit one allele from your dad and one from your mom, right? You've got a C and a G. So you should have, in your DNA, 50% of reads having the C and 50% of reads having the G, which is roughly what you see in the normal sample, from this person's blood, probably: about a 45-55 ratio between the two alleles. And in the tumor, you see a skew away from that: there's way more of one than the other. So something has happened — there's probably a copy number event or something. You've lost the heterozygosity, which just means that 50-50 ratio. We'll talk a lot about that in module 4." Sounds good, thanks. Okay, let me move on to BED and GFF, and then quickly go over the databases. How am I doing on time? Because I'm really lost right now; I was only supposed to talk till 12:30, but I started at 10:45, so I'm going way over that. "You have time to finish this." Okay, so I'll finish this section and then we'll go from there. Okay, the BED file. This one is very, very important. I actually grabbed this from an exome capture file from Illumina. It's such a simple thing, because really all it is is: give me a chromosome, give me a start position, and give me an end position. So if I'm ever looking at a file and I want to record regions of interest to me, I can just store them in a BED file.
There are a lot more columns that can be associated with this format — some of them for visual purposes, others to identify, for example, what strand it's on. For this one here, Illumina gave us a name to say what region it's targeting. Some people will store gene information — genes of interest, or even segments of genes. So it really just boils down to: what regions of interest do I have? Nothing all that special. It stands for Browser Extensible Data, and it's funny, because I remember when this format first came out we were using it for GBrowse, at least when I was first using it, just putting some visuals up for people to be able to see their data. It's a tab-separated text file, and each line represents one single region. Again, I only look at those three mandatory fields: the chromosome, the start, and the end position. For the others, there are many additional columns — go crazy on which ones you want to use. I've given some information here, and I'll have links describing exactly what those are. Sorry, just to go back here before we get to GFF: the only reason I'm showcasing this is that it becomes very useful. There's a tool you'll use in one of the modules called BEDTools, where you can take your BAM file, and if there are only certain areas of interest to you, you can go ahead and grab just those areas using BEDTools. It's a lot more fun than you think it is — maybe, again, my definition of fun is odd to some people. On the flip side, when I told you before that this one's from an exome capture: because you're only sequencing certain segments of DNA, having all the rest is great, but does it really give you any information? So you'll want to look at the regions that were actually specifically captured.
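The sort of region filtering BEDTools does boils down to interval checks on those three mandatory columns. A minimal sketch, with made-up target regions; one gotcha worth showing is that BED coordinates are half-open (0-based start, exclusive end).

```python
# Sketch: "is this position inside one of my regions of interest?"
# using the three mandatory BED columns. The target lines are made up.
bed_lines = [
    "chr1\t100\t200\ttargetA",
    "chr1\t500\t650\ttargetB",
]

def regions(lines):
    """Yield (chrom, start, end) from the first three BED columns."""
    for line in lines:
        chrom, start, end = line.split("\t")[:3]
        yield chrom, int(start), int(end)

def covered(chrom, pos, bed):
    # BED is half-open: start is included, end is excluded.
    return any(c == chrom and s <= pos < e for c, s, e in regions(bed))

print(covered("chr1", 150, bed_lines), covered("chr1", 300, bed_lines))
```

Position 150 falls inside targetA; 300 falls in the gap between targets, which is exactly the off-target material you'd usually drop from an exome capture.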
I have to say that in the non-specific captures you get, there may be something interesting in there, but most of the time you'll just be dealing with those regions that they give you. GFF: how many people here are interested in RNA? Raise your hand. This format is a typical format used for annotation, which is one of those great things that you're going to be dealing with. So I grabbed this, let me see where I got this from, I think I got it from GENCODE. I'm going to isolate my favorite gene, which is TP53. Again, it's just a tab-delimited format. So let's just take this and break it down a little bit. So I've just columnized it again. Again, it has to be on chromosome 17. The source of it is HAVANA, which I found out stands for Human and Vertebrate Analysis and Annotation. I was wondering what that was, but there you go. So that's the actual source for that. The feature at this particular location I'm looking at is an exon. That's the start position, that's the end position. Score, I don't really use that. Strand: this is on the negative strand, which makes sense because TP53 is on the negative strand. Where we get some more interesting information is the annotations that I was telling you about. So let's take a look first here at that group section, and we'll take a look at that ID. It's an exon, so it's actually going to give you some transcript information about the source. It's going to give you a parent transcript if there is a parent to it. And sorry, the transcript ID and this particular gene ID, these are from Ensembl. So not many people really like that kind of gene naming convention, but you can always go back and forth just by going onto Ensembl's database and looking it up. You can get the exact transcript ID for this one and get some more information. And this is where most people feel a little more comfortable: it is TP53, so you get the HUGO gene name in there as well.
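The column walk-through above can be sketched as a tiny GFF3 parser. The example line below is modeled on a GENCODE-style TP53 exon record; the coordinates are approximate and the attribute values are illustrative of the format rather than copied from the actual file:

```python
def parse_gff_line(line):
    """Split one GFF3 line into its nine fields and explode the
    attribute column (key=value pairs separated by ';')."""
    cols = line.rstrip("\n").split("\t")
    fields = dict(zip(
        ["seqid", "source", "type", "start", "end",
         "score", "strand", "phase"], cols[:8]))
    fields["attributes"] = dict(
        kv.split("=", 1) for kv in cols[8].split(";") if kv)
    return fields

# Illustrative GENCODE-style exon record for TP53 (negative strand, chr17)
line = ("chr17\tHAVANA\texon\t7687377\t7687550\t.\t-\t.\t"
        "ID=exon:ENST00000269305.9:1;Parent=ENST00000269305.9;"
        "gene_id=ENSG00000141510.17;gene_name=TP53")
rec = parse_gff_line(line)
print(rec["strand"])                    # -
print(rec["attributes"]["gene_name"])   # TP53
```

This is exactly the lookup an annotation tool does for every variant call: find the overlapping feature, then pull the transcript and gene identifiers out of that ninth column.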
And again, this is something that I just downloaded, which you will be using for a lot of your annotations, not necessarily just for gene expression, but also for your variant calls themselves, to find out whether they're located in a gene or not. General Feature Format: it's a tab-separated text file that identifies features. Okay, you'll be using it a lot. Are we good on that? Sorry, I ran through those two really quickly just because I'm trying to finish this off. But the ones that I really, really, really wanted you to get more of an introduction to were the FASTA, which not many people are really going to be talking about, and the FASTQ, and then again the SAM and the VCF, which are the real key important ones. Okay, let's talk databases. I am such a big fan of the ICGC, and not just because I put the first pancreatic cancer data up there. I did, so that's my claim to fame. And also the first prostate data, I think, I put up there. But their website is phenomenal. I really wish I would have had this 13 years ago. So I can actually go up here, and it's a repository where people are now comfortable uploading their data and making it completely available for people to use. What's more important for us, from a cancer perspective: there are privacy issues concerning germline data, and I completely agree with and understand what's going on there. But just give us the somatic information. Just tell us what's different between the tumor and the normal. A couple of caveats. How many people are doing somatic analysis? Just raise your hands. The one thing that I always ask people when you do your somatic analysis: the majority of us are using blood for our normals. Be aware, and always ask what the source is, because I've had some adjacent normals where the margins were so small that there was so much tumor contaminating the normal that it just threw off all my analysis.
That was just a little tip: always confirm what your normal is ahead of time. Again, I'm going to throw in my favorite gene, TP53. Who doesn't like TP53, the tumor suppressor? And I can actually get a lot of information about it. Again, someone did all this work for you, and you can just look all this stuff up. It literally gives you everything from where it's located, descriptions of it, all the different external references, the fact that they give you Ensembl, Entrez, everything. And then on top of that, you find out how many mutations are in there, and which are clinically significant. And then to top it all off, they actually give you pathway analysis, and I mean, Robin, I know he's done quite a lot of work with Reactome, and this is just given to you. Someone had the foresight to say: we're tired of all the researchers having problems looking up their data in a cohesive manner; we're just going to do that for you. So now you just have to take whatever gene or variant of choice, plug it in, and see how many other different cancer types are affected. So why don't you just click on... yes? I was just going to say, Entrez is actually pronounced the French way; it's a French word. And that data is actually from the great work of the folks at OICR, who basically do have APIs to all these databases that you mentioned. That's how they populate their database. And just a comment, since I see someone raising their hand: please, we are a little bit short on time, so if you have any questions, write them on Slack, and we will try to finish the content. If anything, we will answer you on Slack. And then Francis will stop interrupting Richard as well, at least for the French pronunciation of things. Okay, anyway, go ahead. Okay. So what's great now is there's a mutations tab.
Sorry, there's a mutations tab that you can click on, and now you can actually start going a little more fine-grained and taking a look at the different mutations. Again, look, I can go here and pick a primary site, pick a primary type, you know, go nuts. Now, it doesn't necessarily have all the different subtypes in here, and if you do want to look at particular subtypes, you'll probably need to take a look at the projects themselves, which will define those. So that takes a little more digging. But the sheer fact that I can just go on and click on brain, and then see the glioblastoma multiformes, this is just going to enhance some of the analysis that you're going to be doing. And again, this all boils down to looking at your data. You have tools now available to you to find out, okay, across how many different cancer types, for brain cancers, do I see a particular mutation of interest? And then on top of that, I can actually see the number of patients, or donors, that we have with that particular mutation, and then how many mutations there are. These are things now that people are just giving to you for free. You don't have to pay a single cent. And I really have no idea how OICR sometimes makes its money, because they're not charging for this, but man, this is great. Okay, COSMIC. This is another one that I use really regularly. I don't necessarily use the web interface as much, because I'm a command line kind of guy. But the Cancer Gene Census is a set of curated cancer genes, and the great part about it is that it's wonderful: you can actually go down and take a peek at the somatic tumor types associated with particular mutations. You can see the role in cancer. And if there's a particular fusion, you can find out what the corresponding fusion partner is. Again, they're literally just giving this to you. cBioPortal is another one.
And Trevor was kind of one of the developers for this; I did some work with him in accessing the cBioPortal API. This is another one where I can just pull information. What I really use this for is when I take a look at studies. There's a great chart you can see here: they have pan-cancer studies, and you can actually just get a total number of samples associated with them. But on top of that, you can do the same thing where you can break down the different cancer types that you want to take a look at, click on any one of those, and just explore the studies and get some general information about the particular studies of interest to you, at least for the cancer types that you're looking at. Again, these things are just... I love open source tools. I just absolutely love them. And don't get me wrong, all this data is downloadable for you as well. You could literally just go ahead and grab this information if you so choose. For some of them, you can do germline analysis, but you will have to log in and get approval as a collaborator for those. That's a whole other story. GDC. I've got to be honest: one of the people that works with us at OICR was in charge of the GDC over in Chicago, and they actually used, I think, the back-end APIs from OICR's ICGC. So a lot of the back-end tools will look a little more familiar, just because they were using the same back-end. But it's the same type of thing. This was kind of the next generation of The Cancer Genome Atlas, and again, it's really just an aggregation of different projects from the US. So you'll still get the same type of information there. So I'll just list all the databases we just looked at; you're more than welcome to review those. And really, this is all I wanted to convey on the database side of things. If you want a little more detail on the construction and things like that, Francis actually used to teach this databases section.
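As a rough sketch of pulling study information programmatically, the public cBioPortal REST API exposes a `studies` endpoint. This minimal Python example assumes that endpoint shape; treat the base URL and field names as assumptions to verify against the current cBioPortal API documentation, and note that the actual fetch needs network access:

```python
import json
import urllib.request

# Assumed base URL of the public cBioPortal REST API
BASE = "https://www.cbioportal.org/api"

def endpoint_url(resource="studies"):
    """Build the URL for a cBioPortal API resource."""
    return f"{BASE}/{resource}"

def fetch_studies():
    """Download the list of public studies; each record should carry
    fields such as studyId and name (requires network access)."""
    with urllib.request.urlopen(endpoint_url("studies")) as resp:
        return json.load(resp)

print(endpoint_url("studies"))  # https://www.cbioportal.org/api/studies
```

The same pattern works for the other portals mentioned here: most of them expose a documented REST API alongside the web interface, so anything you can click on, you can usually script.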
But here a lot more emphasis is put on just the data types, to get you more familiar with those. Feel free to take a peek at these databases and play around with them, because again, the opportunities are there for you.