So, we're going to be doing a lot with SAM/BAM files over the next couple of days, and it's useful to dig into them a little bit and have some familiarity with what they are. I think for a lot of people BAM files are really a black box, right? They get a BAM file, maybe they feed it to another piece of software, but they don't necessarily understand how to read it or what's inside it. Knowing that can be useful for things like basic QC, or as soon as you need to do something that the next tool doesn't quite do, there may be a time when you need to interact directly with the BAM file and understand how it works. So there is a specification, which we link there. I talked about how SAM is just the uncompressed, plain-text version of the BAM file. BAM files are usually indexed. We keep hearing about indexing: you build an index to go along with your BAM file so that it can be read by things like IGV.
This is what a BAM file looks like. It has two parts. It has a header section that has information about things like the sample that was aligned and the program that aligned it. And then it has the alignment section, where each row of the file has information about a read and how it was aligned. It's a little bit hard to make out, but you have a read name that's very similar to what you saw when you were looking at your FASTQ files. There's a flag; we're going to talk about what that means. Next up is the reference record that you're aligning to and the position in it, so this is chromosome 1 and the position in chromosome 1. Then I think this is the mapping quality, then the CIGAR string. Then the reference sequence of the mate. When there's just an equals sign it means it's the same, so in this case the mate is also mapping to chromosome 1. And then here's its position. And I think this is the length or size of the fragment. And then the actual sequence itself and then the quality string.
So that's just from memory. We'll look in a minute and see if I'm right about the order of things in there. This is just reminding you about the header section. The header has things like the sort order of the alignments. Sometimes it can be useful to just take a BAM file and look at the header. If you're trying to understand what this data is and where it came from, you can find information in the read group area about the library or the sample. You can see what program was run on it; sometimes this will give you information about the type of aligner that was run, or the reference genome that you were aligning against might be specified here. It can be really useful sometimes to dig into the SAM/BAM header section. This happens to us all the time, especially if someone gives you data or you download alignment data in BAM format. There's a kind of paper trail in here, a record of how we got to this BAM file. And you might look at it and decide, oh, I don't really like the way they did this, I want to get the FASTQ and redo the alignments or something.
Let's see. We already talked about this, how it's divided into two sections. This is just looking at some example data in an example header section. Here you've got things like the read group ID, so you're getting the sample ID and sample name, and here's an example of a program record where it shows you that BWA was run, and it can actually have details of the specific command that was run, which can be helpful. And then this is just a bit of a zoom in on the alignment section, which I should have used when I was pointing out all of the parts to it. That's just an example of what 10 alignments look like, and we're going to break some of these sections down. So this is the order of those entries. Each is separated by a tab, and these are the 11 mandatory fields that are in each row, each alignment record.
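To make that tab-separated layout concrete, here is a minimal Python sketch of splitting one SAM alignment line into its 11 mandatory fields. This is an illustration only, not something shown in the lecture: the `SamRecord` and `parse_sam_line` names and the example read are made up, and the short field names follow the SAM specification. In practice you would normally let samtools or a library do this for you.

```python
from typing import NamedTuple, Tuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, in file order, plus any optional tags."""
    qname: str   # query (read) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name, e.g. chr1
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate ("=" means same as rname)
    pnext: int   # position of the mate
    tlen: int    # template (fragment) length
    seq: str     # read sequence
    qual: str    # base quality string
    tags: Tuple[str, ...]  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError(f"expected at least 11 fields, got {len(fields)}")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# A made-up record, for illustration only.
example = "\t".join([
    "read_001", "99", "chr1", "12345", "60", "100M",
    "=", "12500", "255", "A" * 100, "I" * 100, "NM:i:0",
])
record = parse_sam_line(example)
print(record.qname, record.flag, record.rname, record.pos, record.cigar)
```

Keeping the optional tags as raw strings is a deliberate simplification; the point is only that the first 11 columns always come in the same order.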
So again you have the query name, which is usually the name of your read. There's this flag, which we'll talk about, the reference name, the position of your alignment, the mapping quality score. And then the CIGAR string, which we'll talk about, and then information about the read pair: where the mate aligns, the length, the sequence and the quality. And these are just examples of what those look like. So, any questions about the contents of a BAM file or the records? It's an attempt to demystify, because I think when most people see this they're like, what is this gobbledygook, right? But if you break it into its parts it's actually not that scary. It's not that much more complicated than a FASTQ file: you've got your read name, where it's aligning to, where its mate is aligning to. The actual sequence is there, so that can be helpful if you need to work your way backwards; you can actually revert from this back to a FASTQ file if you need to. And that's possible because the sequence and the quality string are still there in the alignments, right? So you can get back to a simpler version of your data if you, for example, want to realign it but you don't have the raw FASTQ data. That happens to us all the time too. And a lot of places actually don't keep the FASTQ files, they just keep BAM files, because they know that they can revert to the FASTQ file from the BAM file. So, just to save on disk.
Okay, so we talked about this. There's just two elements of this that we're going to talk about a little bit more: there's this thing called the flag, and there's this thing called the CIGAR string. So this flag is basically a way to encode a whole bunch of information in a really efficient way. This is a series of so-called bitwise flags represented as a single number. You're essentially storing a binary string of, and this keeps changing over time as they add more, but I think it's still currently 12 bits.
What you're doing is representing the binary state of 12 questions, each with the value zero or one. Zero you can think of as no, one you can think of as yes, and each value in the string of zeros and ones represents a property. So for example, is the alignment a duplicate? That's represented by this second-to-last value here. If it's a zero it's not a duplicate. If it's a one, it is a duplicate. Okay, so it turns out that there are 2 to the 12 possible combinations, right? You've got zero zero zero zero zero zero, you've got one zero zero zero zero zero, one one zero, you get what I'm saying, right? So that's 2 to the 12, which is 4096 possible combinations. That means you can represent all of this information, the status of whether it's a duplicate or not, whether it's a supplementary alignment or not, with a single number between 0 and 4095. That's what is in that field of the BAM file.
So if we go back to one of our examples, this particular flag is 99. So 99 corresponds to one set of values. And there's a handy tool, which we're going to play around with in the practical, where you can go to this website and just enter the number 99 and it'll tell you which of those properties that particular alignment has. It'll just go through the combinations, yes this, no that, and so forth. Any questions about these so-called bitwise flags? This is important because if you want to filter your BAM file for some reason, let's say someone asks you, I don't know, how many reads were there in your example that were properly paired duplicates or something like that, you can go to this website and check the boxes for the properties you care about. It'll tell you that number, and you can run a single command that says, give me all the alignments with this value. And that will give you all the alignments that have the properties that you're looking for.
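What that website does can be sketched in a few lines of Python. This is a rough illustration rather than the tool itself: the bit values are the 12 defined in the SAM specification, while the plain-language descriptions and the function names `decode_flag` and `has_all_bits` are my own.

```python
from typing import List

# The 12 flag bits from the SAM specification, lowest bit first.
# The descriptions are informal paraphrases, not the spec's wording.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def decode_flag(flag: int) -> List[str]:
    """Return the description of every bit that is set in the flag."""
    if not 0 <= flag < 4096:
        raise ValueError("flag must be between 0 and 4095")
    return [name for bit, name in FLAG_BITS if flag & bit]


def has_all_bits(flag: int, required: int) -> bool:
    """True if every bit set in `required` is also set in `flag`."""
    return (flag & required) == required


# 99 = 1 + 2 + 32 + 64
print(decode_flag(99))

# "Properly paired duplicates": combine the two bits you care about.
wanted = 0x2 | 0x400
print(wanted)  # 1026
print(has_all_bits(99, wanted))         # False, 99 is not a duplicate
print(has_all_bits(99 | 0x400, wanted))  # True
```

Note that the filter is "all of these bits are set", not "the flag equals this number". As far as I know this is the behaviour of the `-f` option of `samtools view`, so something like `samtools view -f 1026` would be the single command mentioned above.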
So that's kind of simple; we're going to play around with it later and hopefully get more of a sense of it. The other kind of confusing thing is the CIGAR string. The CIGAR string is one of those other columns of information. And this is basically another space-saving innovation, which is a really efficient way to describe an alignment. The CIGAR string is a sequence of base lengths and operations, indicating which bases are aligned to the reference, and it doesn't actually distinguish between match and mismatch, necessarily. And it can represent deletions, insertions, introns. There's all these different kinds of operators. The most common or obvious one is match, so there's a series of positions that align to the reference. There can be insertions relative to the reference, there can be deletions, you can represent soft clipping and so on.
And the easiest way to understand this is with an example. So this is an example CIGAR string: 81M, 859N, and then 19M. What this is describing in a pretty efficient way is a 100 base pair read consisting of 81 bases that are alignment matches, 859 bases that are skipped, and then another 19 that are matched. So it can represent a kind of split read, exactly like the ones we looked at in the last lecture, right, where you've got part of the read aligning to an exon, then there's an intron, and then the rest of the alignment picks up in the next exon. So it's kind of like, if you've ever done a BLAST search, you know when you have your read and it shows you that simple graphical depiction with all the little lines showing you where it matches, and then there's a gap and then another series of matches. That's a nice visual way to look at an alignment but it takes a lot of characters, right? It fills pages. This is a really efficient way to convey that same idea.
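As a small worked example, here is a Python sketch that reads the example CIGAR string from above, written compactly as 81M859N19M, and works out how many read bases and how many reference bases it covers. The function names are made up for illustration; the operator letters and which of them consume read or reference bases follow the SAM specification.

```python
import re
from typing import List, Tuple

# One CIGAR operation is a length followed by a single operator letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
CIGAR_FULL = re.compile(r"(?:\d+[MIDNSHP=X])+")

# Operators that use up bases of the read (query) or of the reference.
CONSUMES_QUERY = set("MIS=X")
CONSUMES_REFERENCE = set("MDN=X")


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Turn '81M859N19M' into [(81, 'M'), (859, 'N'), (19, 'M')]."""
    if cigar == "*":  # "*" means the CIGAR is unavailable
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_OP.findall(cigar)]


def query_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_QUERY)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered, including skipped regions."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REFERENCE)


print(parse_cigar("81M859N19M"))
print(query_length("81M859N19M"))    # 81 + 19 = 100 read bases
print(reference_span("81M859N19M"))  # 81 + 859 + 19 = 959 reference bases
```

So a 100 base read ends up spanning 959 bases of the reference, which is the spliced alignment across an intron described above.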
And so software programs can look through the BAM file and use that information in various ways.
So just really quickly, CRAM files. We've talked about BAM files; CRAM files are just that next innovation in compression that gives you even smaller files, something like 30 to 60% smaller than the BAM file. The BAM file is basically like a gzipped SAM file, it's just using regular compression technologies. The CRAM file is using knowledge of the reference genome. So when you create a CRAM file you supply the reference genome, and by knowing the reference genome you can compress that alignment even more. It's almost like the CIGAR string: you can say this hundred base pair read, this hundred base pair span, matches the reference from position here to here. It's a really efficient way of storing what that alignment is, or I guess even what the sequence is. You only need to say the sequence we're talking about here is the hundred base pair reference sequence from position A to position B, and you can represent that much more efficiently than actually storing the sequence itself in a file, even when you compress that sequence.
Yeah. Yes. And there's a couple of different things. One is that I think some of these compression standards involve compressing the quality string, the quality representation, to a simpler format, so you can lose some specificity of the qualities depending on the settings. And then, in a sense, you can lose things if you lose the reference that you used. This actually happens, where you download CRAM files from SRA but they did not properly document the reference file they used to compress the CRAM file.
And then you have a problem where you can't uncompress it, or you're kind of guessing how to correctly uncompress it, because maybe they used some janky custom version of the reference that they use in house and it's not obvious or available. I think now SRA is much better about making sure that when you submit CRAM files you actually submit the reference you used, but there's still a bit of faith there. I don't know if they actually go through a full decompress and recompress to make sure it works, you know.
Yeah. So that's a good question. This basically stores something like a diff from the reference genome, which allows it to be really efficient.
I think you guys have seen BED format so far? No? Maybe. The BED format is just another commonly used file format, and it's much simpler than the BAM format. It's just a tab-separated plain text file that has minimally the chromosome name, the start and the end, and then other optional columns. There's a number of different flavors, colloquially referred to as BED3, BED4, BED6, BED8. It always starts with just chromosome, start, stop, like this, but then it can have up to nine additional fields, including a name, a score, the strand, and then these thick start, thick end and so forth. Those are columns that you use if you're planning to upload your BED file to a tool like the UCSC Genome Browser and you want to stipulate how the features in your BED file look: are they the thick bars in UCSC, or are they thick and then thin, and different colors and so forth. But the most common is actually just the first three, or maybe the first five or six. And then, I think I've already introduced you to tools like bedtools, which you can use to manipulate BED files. Similarly, there are things like samtools, bamtools, Picard. We are going to use these extensively, especially samtools and Picard, to manipulate our BAM files and SAM files.
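Going back to the BED layout for a second, here is a minimal Python sketch of reading one BED line with three to six columns. The names `BedInterval`, `parse_bed_line` and `to_one_based` and the example lines are made up for illustration. One fact worth building in from the start is that BED starts are zero-based and ends are exclusive, so the sketch also shows the conversion to the one-based, inclusive coordinates that formats like SAM use.

```python
from typing import NamedTuple, Optional, Tuple


class BedInterval(NamedTuple):
    chrom: str
    start: int                    # 0-based, first base included
    end: int                      # 0-based, first base NOT included
    name: Optional[str] = None    # optional column 4
    score: Optional[str] = None   # optional column 5, kept as text here
    strand: Optional[str] = None  # optional column 6: "+", "-" or "."


def parse_bed_line(line: str) -> BedInterval:
    """Parse the first three to six tab-separated BED columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")
    optional = fields[3:6]  # any further columns are ignored in this sketch
    return BedInterval(fields[0], int(fields[1]), int(fields[2]), *optional)


def to_one_based(interval: BedInterval) -> Tuple[str, int, int]:
    """Convert to 1-based, inclusive coordinates: only the start shifts."""
    return (interval.chrom, interval.start + 1, interval.end)


# The very first base of chr1 is the interval 0..1 in BED.
first_base = parse_bed_line("chr1\t0\t1\tfirst_base")
print(to_one_based(first_base))            # ('chr1', 1, 1)
print(first_base.end - first_base.start)   # length is simply end - start

exon = parse_bed_line("chr1\t11868\t12227\texon_1\t0\t+")
print(to_one_based(exon), exon.strand)
```

The half-open convention is why interval length is just `end - start` with no plus one, which is part of the computational convenience of counting from zero.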
And common sources of confusion. These come up in basically all genome-related bioinformatics. The most common source of confusion is the coordinate system. Have you guys heard of one-based and zero-based coordinate systems before? It turns out that the bioinformatics and genomics world could never agree on whether you should represent your genome coordinates with a zero-based or a one-based system. So what does that mean? Basically, one-based is where you count the bases directly. So if this is chromosome 1 and this is literally the first nucleotide in your chromosome, and it happens to be a T, you would number that T one, and so forth. It seems logical, right? The first, the second, the third. I feel like this is maybe the more human-interpretable or intuitive one. But it turns out that for various mathematical and computational reasons there's a lot of efficiencies to actually numbering it starting with zero. In which case you number the positions between the bases, so zero is the position before the first base, one is the position between the first and the second, and so on. These two systems have their pluses and minuses, but the consequence is that representing something like a single nucleotide looks slightly different depending on whether you're using a one-based or a zero-based system. And the whole world, as I said, is kind of split evenly between them, so certain file formats are one-based, like SAM, and some file formats are zero-based, like BAM. Which is kind of funny. And BED is maybe the most famous zero-based file format. So that's just something to be aware of. It's a super common gotcha; you just have to deal with it.
Genome builds: just make sure you're using the right genome build. This will probably happen at some point in the course, that we'll go to load some data in IGV that was aligned to hg38 but we've selected build 37 (hg19). And the data doesn't make sense.
That's because your data is mismatched with your reference genome.
And then there's this idea of variant shifting and parsimony. I think this is the last one. We don't do a lot with variants in this course so I won't spend a lot of time on this, but the idea is that there's more than one way to represent a variant. In this case we have a deletion of a CA. So if you look at this reference sequence, which is GGG CA CA CA GG, there's a CA missing. Normally, unless you do some very complicated molecular biology, you really have no way of knowing which CA was deleted, right? You know the reference has five CAs, you know your sequence has four, but it could have been any of those five CAs that was deleted, and for the most part you really don't care, right? It's going to have the same consequence on the protein. But the problem is, again, we can't all agree on which way we should represent it. So there are various different ways, and these are all valid or correct, in a sense, ways to represent this. But some of them are longer, some of them are shorter, some of them have the variant at the right, some of them have the variant at the left. And so if you're getting into variant representation there are certain formats or standards and conventions that say you should left shift, meaning have your variant be as close to the left as possible. And you should have it be parsimonious, which is that the alleles of the variant should be as short as possible, as long as they're longer than zero. So, long story short, this bottom one is a good representation because it is short and it is, I guess, as far to the left as possible. So that would be a more standard way of representing it.
Sorting your BAM file: we're going to come across this. Before you index a BAM file or use it, you often need to sort it. The most common thing to do is sort by position.
So in this course I think we're going to sort mostly, if not always, by position, but occasionally you may find that you need to sort your BAM file by read name. That's usually done when we need to easily identify both reads of a pair. One common example is fusion detection, where you're really focusing on where the two reads are in relation to each other. And so you might sort by read name to be more efficient for that kind of application. So those are the lectures, and we're going to start with the practical tomorrow on alignments. Thank you guys.