Sequences are stored in different formats in databases, and since different software packages require sequences in a specific format, it is good to have an idea of what the major formats are. We will go over a few of them.

FASTA is the most widely recognized and distributed format. In FASTA, a record starts with a greater-than sign followed by a description, and the actual sequence follows on the lines below. Each line of sequence should be shorter than 80 characters; generally it is around 60 characters.

Here is the FASTA format for a DNA sequence. At the top we see it starts with the greater-than sign, and the protein record down below starts the same way. Then we have GI, which stands for GenInfo Identifier; this is the ID of this gene. Then there is something written as "c", which means the sequence comes from the complementary strand, and the base positions between which this gene is located are given here. Then there is a short description: it comes from Homo sapiens (human) chromosome 17, and this is the primary assembly. Assembly is where we take short sequence reads, or small sequences, and put them together into a longer sequence such as a gene. Then the actual sequence starts, with around 60 bases on each line, and since this sequence was very long I put dashes at the end.

We can also see a protein sequence down below. In the same way we have the ID, and then it is marked as a reference sequence. Reference sequences are the curated ones. There is a subsection of NCBI called RefSeq that maintains these standard reference sequences in order to avoid redundancy. So we can say these are the primary or main sequences; we might have other alternative splice variants, but the reference sequences are the true representatives of the class. Then we have the ID again right here, the protein ID, and then we
have its description, cellular tumor antigen p53 isoform, and then its sequence. I excluded part of it, so those dashes are there. Sometimes FASTA files end with a star sign and sometimes they don't, so it is important that the software knows what the star means here.

GenBank format is the format used in the GenBank database, so it is kind of a standard format, and the other formats are pretty similar to it. In GenBank, records start with the word LOCUS, followed by some description lines. The sequence starts after the word ORIGIN, as we have seen in previous sections, and the record ends with a double slash (//).

Here is the GenBank format. We start with LOCUS and then we have its ID. It is 237 base pairs long, and there are some short descriptions: it is DNA, it is a primary sequence, and it was submitted on this date. Then we have a DEFINITION line, where we can have some description or explanation of this gene. Again we have the accession number, and then it provides the base count: how many A's, how many C's, how many G's, and how many T's there are in it. Then the word ORIGIN tells us that the actual sequence is coming. Each line holds 60 bases split into chunks of 10, which is standard practice, and the sequence ends with the slashes.

The EMBL format is similar to the GenBank format. We have ID, we have the accession, we have descriptions, represented as DE, and the sequence starts where the code SQ appears. Then we have lines pretty similar to those in the previous example, and the sequence ends with a double slash, just like GenBank.

Swiss-Prot is similar to EMBL, except that there are more descriptions and more line codes. There are plenty of them, and you can go and look into them. For example, we have RN, the reference number, and RP, the reference position, and then we have the authors who
have submitted it. In the same way, the sequence starts with SQ, and the termination line is the double slash.

So far, the sequences we have seen are all submitted in fairly similar formats. XML is a more modern practice in which we put sequences in a kind of machine-readable language. XML stands for Extensible Markup Language, and the format is similar to HTML, the language used for web pages. The good part is that it sits in between machine-readable and human-readable, so it is fairly easy to write code against. It may seem pretty weird, but not for people with a computer science background. So it looks like this.

Let me go over NBRF as well. Its format is pretty similar to simple FASTA, but in addition to the sequence it gives us a checksum value. For a checksum, we take the nucleotides and, since in a computer every character has an ASCII value, we can add those values together and come up with a number called the checksum. That is good to include: once somebody downloads the sequence, they can compute the checksum again on their own computer. If the two values are equal, the sequence was downloaded correctly; otherwise there must have been some issue with the download.

GCG stands for Genetics Computer Group. Basically it is a group of scientists who were helping the biological community by developing software and training programs for biological sequence analysis problems. They also came up with a sequence format that is similar to the previous one, NBRF: we have checksums, we don't have the greater-than sign used in FASTA, it also gives the length of the sequence, and there can be multiple sequences in one file.

Sometimes we need to convert between different formats, so you can
come up with your own scripts or code, but there are also programs already available. One of them is Readseq, which was developed by D.G. Gilbert at Indiana University. Basically it recognizes DNA or protein sequence files and interconverts between different formats.

What we can conclude from this section is that databases store sequences in different formats, and since we need to move between formats, we might write our own code or take help from already available programs like Readseq.
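
To give a rough idea of what writing our own code might look like, here is a minimal Python sketch. It reads FASTA text, computes a checksum by adding up ASCII values as described above, and prints the sequence GenBank-style, with 60 bases per line in chunks of 10. The function names and the demo record are made up for illustration, and the checksum is only the simple sum described in this section; it should not be taken as the exact algorithm used by NBRF or GCG tools.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record begins; store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Some FASTA files end a sequence with '*'; drop it here.
            chunks.append(line.rstrip("*"))
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def simple_checksum(sequence):
    """Add up the ASCII codes of all characters (illustrative only)."""
    return sum(ord(ch) for ch in sequence)


def genbank_origin(sequence, per_line=60, chunk=10):
    """Lay out a sequence like a GenBank ORIGIN block, ending with '//'."""
    seq = sequence.lower()
    lines = ["ORIGIN"]
    for start in range(0, len(seq), per_line):
        row = seq[start:start + per_line]
        groups = [row[i:i + chunk] for i in range(0, len(row), chunk)]
        # Each line starts with the 1-based position of its first base.
        lines.append(f"{start + 1:>9} " + " ".join(groups))
    lines.append("//")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical demo record, not taken from any database.
    demo = ">demo_seq made-up example record\nACGTACGTACGT\nACGT\n"
    for header, seq in parse_fasta(demo):
        print(header, len(seq), simple_checksum(seq))
        print(genbank_origin(seq))
```

For real work, an existing program such as Readseq is usually the safer choice, because the real formats carry many more fields than this sketch handles.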