Dear students, in this module, I am going to introduce you to the topics of similarity and identity. Why do we need to study these topics? As you may remember, we are trying to compare two biological sequences. By biological sequences, I mean DNA sequences, RNA sequences or protein sequences. When you compare two sequences of any one of these types, you will want to evaluate how good the match between them is. For that, we need to know what similarity is and what identity is.

To begin with, I will talk about identity, or sequence identity as it is called. Sequence identity is the number of nucleotides, in the case of DNA or RNA, or amino acids, in the case of proteins, that match exactly between two sequences. If we compare 10 nucleotides from one DNA or RNA sample with 10 nucleotides from another, we can count how many nucleotides there are in each one, look at which types of nucleotides they contain, and look at their arrangement.

So let's take a look at an example. Here you have two nucleotide sequences. Sequence number one is CATGCTT and sequence number two is CATGC. I would like to compare these two sequences and evaluate their identity. So how can I proceed? The first thing I need to do is count how many matches exist between these two sequences. As you can see, the first nucleotide, C, matches. The second nucleotide, A, also matches. Similarly, the next three nucleotides, T, G and C, also match. So in total, you have five nucleotides that match exactly between these two sequences. Next, I have to determine which sequence has the smaller length. You can simply count the number of nucleotides in sequence number one and sequence number two.
In sequence number one, you have seven nucleotides, and in sequence number two, you have five nucleotides. You are supposed to pick the smaller length, so you select five. Next, you apply a simple formula: identity equals the number of matches divided by the smaller length. The number of matches is five, and the smaller length is also five, because sequence number one has a length of seven and sequence number two has a length of five. So now you have five over five, and you have to convert it into a percentage. Five over five is one, and one multiplied by 100 gives 100%. So, in terms of identity, these two sequences are 100% identical.

There are two points to remember here. Number one, gaps are not counted. If you have a gap in the sequence, it is not considered a match and is left out. Number two, the identity measurement is always made on the shorter sequence. As we just saw in the previous example, the shorter sequence was five nucleotides long, so we take five as the smaller length. These two points are very important when computing the identity between sequences.

Next, similarity. So what is similarity? Similarity is essentially the result of matching and transforming one sequence into another by finding the smallest number of edit operations. Let's say you have sequence one and sequence two. You compare them by sliding them against each other, so that the number of modifications you have to make to one of them for the rest of the nucleotides to match is as small as possible. In other words, you want the minimum number of deletions, insertions and substitutions. These are what are called edit operations. We will look at insertions, deletions and substitutions later, but what you need to remember is that you need to account for them.
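To make the idea of "the smallest number of edit operations" concrete, here is a minimal sketch in Python. The lecture does not name an algorithm, so this uses the classic dynamic-programming edit distance (Levenshtein distance) as an assumption, with every insertion, deletion and substitution costing one. The function name `edit_distance` is only illustrative.

```python
def edit_distance(seq1: str, seq2: str) -> int:
    """Smallest number of insertions, deletions and substitutions needed
    to turn seq1 into seq2. Unit cost per operation is an assumption."""
    # previous[j] = distance between the letters of seq1 processed so far
    # (before the current one) and the first j letters of seq2.
    previous = list(range(len(seq2) + 1))
    for i, a in enumerate(seq1, start=1):
        # Turning the first i letters of seq1 into nothing takes i deletions.
        current = [i]
        for j, b in enumerate(seq2, start=1):
            substitution = previous[j - 1] + (a != b)  # free if letters match
            deletion = previous[j] + 1                 # drop a from seq1
            insertion = current[j - 1] + 1             # add b to seq1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


# The two sequences from the identity example: deleting the trailing
# "TT" from sequence one gives sequence two, so two edits are enough.
print(edit_distance("CATGCTT", "CATGC"))  # 2
```

A smaller number of edit operations means the two sequences are closer to each other. Real alignment tools usually go further and weight different substitutions and gaps differently, which this sketch does not attempt.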
So, to compute sequence similarity, you first need to align the two sequences in order to obtain the gaps and mismatches. This can, of course, be done using pairwise sequence alignment.

In conclusion, identity is the count of nucleotides or amino acids that match exactly between the two sequences, with gaps excluded. Similarity, by contrast, is a different measure, in which you compare the two sequences after aligning them with each other.
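As a recap of the identity rule from this module, the following Python sketch counts exact matches, leaves gaps out, and divides by the length of the shorter sequence. It assumes the two sequences are already aligned to the same length and that gaps are written as `-`. Both of these are assumptions for illustration, as is the function name `percent_identity`.

```python
def percent_identity(aligned1: str, aligned2: str, gap: str = "-") -> float:
    """Percent identity = exact matches / length of the shorter sequence * 100.
    Gap positions never count as matches."""
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have the same length")
    # Count positions where both letters are equal and are not gaps.
    matches = sum(1 for a, b in zip(aligned1, aligned2) if a == b and a != gap)
    # The length of each sequence is measured without its gap characters.
    shorter = min(len(aligned1.replace(gap, "")), len(aligned2.replace(gap, "")))
    if shorter == 0:
        raise ValueError("sequences must contain at least one residue")
    return 100.0 * matches / shorter


# The example from the lecture: CATGCTT against CATGC, with the shorter
# sequence padded by two gaps so that both strings have the same length.
print(percent_identity("CATGCTT", "CATGC--"))  # 100.0
```

Here five nucleotides match and the shorter sequence has five nucleotides, so the result is 5 / 5 × 100 = 100%, the same value worked out by hand earlier.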