So this is the introductory lecture. I'm going to start off by giving everybody the basics about the next-generation sequencing platforms. This lecture, like all the other CBW lectures, is protected under a Creative Commons license. At the end of the workshop we're going to put the PowerPoints from this year's full series up on the Bioinformatics.ca website, and this year we're experimenting with a video version, a voiceover PowerPoint, so the presentations will also be available for download. If there's something you forgot, or you want to show your colleagues what this year's workshop was about, we'll have those files available a few weeks after the workshop. The variety of Creative Commons license we're using is the kind that allows you to share: you can take this file, download it, and share it with your friends. You're also allowed to remix, which for a PowerPoint means you can take one slide out and put it in your own presentation, for example, but you have to acknowledge where it came from. And here's the tricky part: it behaves a bit like a virus. Because you've taken one of my slides, your PowerPoint now has to be shared too. It's a share-alike license: if you take some of our slides, then you have to share your slides as well. So if you don't want to share, don't take them. As I mentioned, this is going to be an introductory lecture on next-gen sequencing, and I'll go over some of the technologies, some more in depth than others.
There is definitely not time to do everything: we could spend two days on 454, two days on Solexa, two days on SOLiD, and two days just on SNP detection. So we've made a strategic choice, based on the faculty we have, but also on covering the hot areas in the field. The field is next-generation sequencing; people call it next-gen, or next-next-gen. I have a slide in my talk where I call it the now-gen, because it really is the current generation of DNA sequencing. There are obviously many of the old capillary Sanger sequencing machines still around in departments worldwide, but more and more places are now purchasing the new machines. Even between last year, when we first offered this workshop, and this year a lot has changed; I think it was just announced that Illumina had record sales in the last quarter, and it goes on and on. SOLiD, Illumina, and 454 are really fighting it out in a very active landscape. What we're going to cover in this lecture is what kind of sequencing you can do and how it actually works, trying to stay away from vendor-specific challenges, although there are some things unique to particular vendors' data output. All the current platforms produce relatively short reads, 454 a bit longer; the biggest difference between the vendors is probably that with SOLiD data you're working in color space, so I'll spend a bit more time talking about color space. As for the challenges downstream of that, over the following two days we'll deal with data sets from both Solexa and SOLiD.
Just to set things in context: those of you still in your twenties may not remember this far back, but I remember that when I was in graduate school it was obligatory to show your autorads in any paper you published, to prove you had actually performed the sequencing, with all the bands visible on the autorad. Obviously this has changed quite a bit, and so has the throughput, and the tools that come with it. If you go back just 50 years, a throughput of one nucleotide per person per year was all that could be obtained. Going from that to hundreds of billions of nucleotides per person per year is what next-gen has brought, and even between last year and this year we've roughly doubled or tripled the throughput of the next-gen sequencing machines. We've gone from being able to sequence full organisms only when those organisms were small viruses to sequencing large humans, so it has definitely evolved quite a bit. Before the next generation, in the Sanger world of sequencing, whether capillary or gel-based, I think we had a simplistic idea of what a genome was, given how much information we could get: we figured we'd get all the pieces, all the parts, and then we'd figure out the organism. Unfortunately, even since the first microbial genome, Haemophilus influenzae, it was quite clear that there were a lot of parts whose function we didn't know, and we didn't understand the full organism. That said, with sequencing we did DNA, we did some RNA, and from that we deduced proteins, and we deduced population organization.
We had sampling experiments where we would look at various members of populations, so you could have some idea of what was present, but you didn't really have the full grasp. You were able to compute sampling averages and consensus sequences, and that definitely led to a lot of information, but it also led to problems, because averages and consensus blur the information and our understanding of the underlying biology. With the next-gen sequencing platforms, I think we're still reductionists, but we're better reductionists, in the sense that we have a better idea of the parts, and we now have a much better idea of the variation. We always knew about human variation, but now, because we've done deep sequencing, we know how complicated it is. The other very big thing the next-gen platforms have brought, although they're not the only way of getting that information, is all the structural variants in our genomes: how they differ between disease genomes and normal genomes, and between normal individuals. That is still cheaper to do by high-density genotyping array technology, but I think it will soon be done entirely by next-gen sequencing. So what the next-gen platforms offer are ways to interrogate a genome, be it of a bacterium or a population of bacteria, of a human or a population of humans, in ways that are unprecedented. The other big advantage of the next-gen platforms is that there are fewer cloning steps, and in some of the technologies fewer PCR steps, and so fewer introduced artifacts.
That matters when we're trying to detect allele frequencies that are very close to the error frequency of the technology: you want smaller and smaller errors from the various lab manipulations, from technical handling of the data, and from handling of the DNA itself, and the field is definitely moving there. There are also a couple of vendors doing single-molecule sequencing, which is obviously going to generate very interesting data; it will make it much easier to assemble the phasing of alleles along a chromosome, and that kind of information is going to lead to new insights into biological questions we haven't been able to answer before. So, quickly, old-school sequencing: you clone the DNA, you generate a ladder of fluorescently labeled molecules of different sizes, you separate by size, eluting through a capillary system or a gel, you detect the fluorochrome on each molecule, and you get sequences 500 to 1,000 letters long. The problem is that even with the latest capillary sequencers you can only do 96-well plates of samples at a time. The parallelism of the system only allowed about 100 samples at once, and it took a few hours, which was definitely one of the bottlenecks of the old technology. That said, the old technology is how the first and later drafts of the human genome were done: all capillary electrophoresis and Sanger sequencing. So that's Sanger, the old-gen sequencing; now for the now-gen sequencing. For whole genomes, we did the early drafts of the human genome, model organisms, bacteria, viruses, mitochondria, chloroplasts, and a lot of things at low coverage; the first dog genome was 2x coverage.
A lot of organisms were done at low coverage: the cow genome at around 5x, and so forth. For RNA, cDNA clones and ESTs were all the rage in the 90s. At first they were thought of almost as a contaminant of GenBank, but then they were set aside in a separate pile and people started making sense of the ESTs; the technology and sequencing approaches actually led to new discoveries and new tools. For community assessment there was environmental sampling, 16S rRNA sequencing and so forth; Craig Venter's first ocean sampling of ocean viruses and bacteria was all done on the old technology. Now, with next-gen sequencing, we're doing some of the same projects, but we're doing them better. We can now do a human genome fully: both copies, all the chromosomes, not just half the chromosomes, and not an average of both chromosomes. There's the 1000 Genomes Project, and the ICGC cancer initiative, which wants to do 25,000 human genomes as matched normal and tumor pairs, so that's going to be 50,000 genome projects in the next few years. We're also going to do Neanderthal man, and the mammoth, and all these strange organisms frozen somewhere since the ice age from which we only have a bit of DNA; next-gen sequencing will make those samples possible. RNA-seq, the digitization of the transcriptome, is revolutionizing things; I think it's going to kill the Affy industry and totally replace the way gene expression is monitored and done right now. It's still a bit more expensive than Affy, so I think Affy has one or two years of life left, but that kind of high-throughput expression analysis is going to become much cheaper on the next-gen sequencing platforms.
One of the challenges, of course, is alternative splicing. We can monitor alternative splicing events, since we can get tags or reads that span a splice junction, but linking all the splice variants together into full transcripts is still a challenge. That's not to say there aren't a lot of people working on this problem, and people are doing really great things there. As far as communities are concerned, there is now much deeper environmental sampling. A very interesting environment is the human microbiome project, where they're exploring 16 or 17 sites on the human body, looking at the normal microbial community living at each site, to be able to use that as a baseline against which to look at disease states. Crohn's and other diseases have been thought to be associated with an imbalance, something going wrong in the microbiome, and there are probably a bunch of other diseases nobody has even thought about that might be associated with it. So that's going to be very interesting, although we're not yet very good at describing communities; we can sequence whole communities and a lot of things, but describing them is harder. A lot of these technologies are going to lead software development: the software tools need people to identify where the problems are, and then you get some really good people working on them. Just this week there's a paper from Toronto on Bar-seq. Another seq; I should start collecting all these seq names.
Bar-seq has nothing to do with drinking; it's a barcoding system. In yeast, for example, the 6,000 genes have been barcoded, so you can have 6,000 different yeast strains living in one test tube, throw different treatments at them, and see which ones survive and which ones die: you sample the DNA and see which strains are still alive by sequencing the barcodes on a next-gen machine. They used to do this with a hybridization model, a chip, but they found that Bar-seq is much more sensitive and gives much more depth in what you can do. A number of people here have talked about epigenome rearrangements, ChIP-seq, and all these assays; there were ChIP-chip equivalents and other ways of looking at the epigenome before, but with the new sequencing technology it has become quite feasible, and very interesting new data is coming through. We're going to talk about a few technologies; there are a lot of different underlying nanotechnologies, and the products differ in image-analysis resolution, chemistry and enzymology, signal to noise, software, image size, pipelines, and cost. Those are the important factors when you're trying to answer the question of whether to use this platform or that one. Here is a comparison between the ABI 3730, which is the top model in capillary sequencing, the 454, and the Illumina. People talk about Solexa and Illumina: Illumina bought Solexa, so Solexa is the old name and Illumina is the new name; if I say Solexa, I mean Illumina. And AB's SOLiD is quite similar in kind to the Illumina, not identical, and I'll get back to that a bit later.
So, read lengths and coverage. If you want to sequence a human genome, you'd say it's three gigabases, but actually it's six gigabases, because it's a diploid genome and you want to sequence both copies fully. For coverage, 6x on the 3730 would probably work; on the 454, 12x is probably good, maybe 15x a bit better; on Illumina and SOLiD, anywhere between 30x and 40x is probably good. To give you an idea of read lengths in base pairs: you can get 600 to 700 on a 3730; 400 is a really good run on a 454; SOLiD is more like 2 x 50, and Illumina now has 2 x 75, so you can get 150 nucleotides per paired-end read. How many reads do you get per run? As I mentioned, you only get 96 reads off a Sanger machine, about half a million from a 454, and 100 million from a Solexa or AB machine. Base pairs per run: you're looking at 57,000 nucleotides, versus half a gigabase, versus 15 gigabases, so you're starting to see the numbers change. How many runs can you do a day? On the 3730 you could probably do four a day if your people work at night, but two or three a day is realistically the maximum. On the 454 a run takes 10 to 12 hours, so you do about one a day, maybe more with really good people management. On the Illumina, a run of that length takes about 10 days, so that's 0.1 runs a day, which takes a lot longer. So, machine-days per genome: on the Sanger it would be about 312,000 days, meaning it would take you roughly 850 years to do a human genome. Fortunately, they had a lot of these machines when they did the first genome, so it didn't take them that long. On the 454 it would take you 144 days, and on the Illumina about 120 days.
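The back-of-the-envelope arithmetic above can be reproduced in a few lines. This is only a sketch: the figures are the approximate ones quoted in the lecture, not vendor specifications.

```python
# Rough "machine-days per diploid human genome" arithmetic from the lecture.
# All numbers are the approximate ones quoted in the talk, not vendor specs.

GENOME = 6e9  # diploid human genome, ~6 Gb (both copies)

platforms = {
    #  name: (coverage needed, bases per run, runs per day)
    "Sanger 3730": (6,  96 * 600, 2),    # 96 reads x ~600 bp, ~2 runs/day
    "454":         (12, 0.5e9,    1),    # ~0.5 Gb/run, ~1 run/day
    "Illumina":    (30, 15e9,     0.1),  # ~15 Gb/run, ~10 days/run
}

for name, (cov, bases_per_run, runs_per_day) in platforms.items():
    runs = GENOME * cov / bases_per_run     # runs needed for that coverage
    days = runs / runs_per_day              # machine-days per genome
    print(f"{name}: {runs:.0f} runs, {days:.0f} machine-days (~{days / 365:.0f} years)")
```

Running this reproduces the numbers quoted above: about 312,000 machine-days (roughly 850 years) for the Sanger, 144 days for the 454, and 120 days for the Illumina.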
So if a place has 10 or 12 of these machines, it would take them a week or two. Cost per run on the Sanger is about $40, so each run is much cheaper, but for a whole human genome that means, in today's dollars, not at the time they did it (the human genome, as we know, cost about a billion dollars), about $15 million if we did it today. On the 454 it would be about a million dollars, and on the Illumina about $100,000. These numbers are changing: last year the Illumina figure was half a million, so in a year it has gone down five-fold; I know because I was updating my slide. The 454 figure came down about two and a half fold. Costs are obviously going down, but actually the cost per run is staying quite similar; what's changing is how many reads you get per run, and throughput is where the companies have really made inroads. [Audience: is that just the reagent cost?] There are no bioinformatician salaries in there, and no labor costs; it's not advertised exactly what it covers. Obviously labor is an important factor, and it makes the Sanger figure even higher, because it's spread over 800-plus years. [Audience: why is the required coverage different for each platform?] That's a good question. The logic is that with the longer reads of the 454, and the even longer reads of the Sanger, you don't need as deep coverage to get full coverage of the genome you're interested in.
Because of gaps and repeats, longer reads are easier to align, so you can align and assemble the pieces more readily. If the genome pieces were all 20 kb long, you'd need a lot less: you would not need 40x with 20 kb reads, and that's what some of the companies are promising for the future. If we can get longer reads, then we can start requiring less coverage. But currently, with the 35 base-pair reads, you really need 40x; 30x is okay for the 75 base-pair reads, though other issues come up with those longer reads. On SOLiD right now, I think the longest read is 50 base pairs, so they're not quite up there either. So I think that's the main reason: with longer reads there are more chances of overlap and fewer gaps. If you assume a Poisson distribution of where your pieces land and how well they overlap each other, then with a lot of short reads the chance of any one read hitting another is much smaller, and that's why you need more of them. Now, this slide you actually don't have in your binder, and neither the next one: I borrowed it from John and hadn't asked him about it yet, which is why I didn't want to print it. Oh, he took it from you? Then I should have included it; I'll have them printed, and I'll have a word with John about not acknowledging you. Basically, this is a plot that shows read length against bases per machine run. AB and Solexa, or Illumina, are right now at around 10 gigabases per run with 50 to 100 base pairs per read: they generate 100 million reads, so 100 million times 50 to 100 bases, over four to eight days.
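The Poisson argument above can be sketched numerically. This is the textbook Lander-Waterman-style approximation, a simplification that ignores repeats, which is precisely where longer reads help beyond what raw coverage captures:

```python
import math

# Poisson-coverage sketch: if reads land uniformly at random, each base is
# covered a Poisson(c)-distributed number of times at mean coverage c, so
# roughly exp(-c) of the genome receives no read at all. This ignores
# repeats and alignment ambiguity, which is where longer reads also help.

def uncovered_fraction(c):
    """Approximate fraction of bases left uncovered at c-fold coverage."""
    return math.exp(-c)

for c in (6, 12, 30, 40):
    frac = uncovered_fraction(c)
    print(f"{c}x coverage: ~{frac:.1e} of bases uncovered "
          f"(~{frac * 6e9:.0f} bp of a 6 Gb diploid genome)")
```

The point of the model is the shape, not the exact numbers: the uncovered fraction falls exponentially with coverage, and short reads pay an extra penalty in alignability that this formula does not capture, which is why the short-read platforms ask for 30-40x rather than 6x.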
Obviously, the longer the run, the more time it takes. The 454 gives longer reads but much less data, and Sanger even longer reads but much, much less data. And here is where the companies hope, or promise, to be next year: something like 120 gigabases for AB and 90 for Illumina, jockeying for position at the top end, and 454 is also going to generate a lot more data at that roughly 400 base-pair read length. There's no more development on the Sanger platform, so that one isn't changing. A lot of the figures I'm using, and I'm putting this acknowledgement slide up front, are from a review paper by Elaine Mardis, in Annual Reviews, which is not an open-access publisher. You have a copy of the article in your binder because I have written permission from Elaine to distribute it in my class; just in case you were wondering whether I did something illegal, I did not. I have permission from the author, who said that is sufficient to allow me to use the paper in my lectures. So I'm going to talk a little bit about Solexa first; this is what the machine looks like, and I think this is one of the older models. Quickly, the way this works is that you start with your DNA, you add adapters, and with these adapters you attach your DNA to a solid support. Once you have that, you use the adapters to do bridge amplification, where you essentially do PCR of your DNA so that you create islands that are all the same DNA, because each new copy attaches to the surface right next to where the original attached. So you have this bundle, you do the PCR amplification, you detach the strands, and you end up with a bunch of identical molecules. Maybe this is the right time for a question.
[Audience: while I was reading the paper I had this doubt. I don't quite understand what bridge amplification is all about, and why you do it. Does it have something to do with the paired ends and the bridge as well?] Let's take the bridge amplification part. You've got a single molecule here, and you want to create an island of thousands of those molecules, right there at the same place. If the strand bridges over and attaches right next to where it started, then that will be the location. The trick is to have the right density of attachment sites on your solid surface; once you do this amplification, everything happens locally, right there on the surface. As for the paired ends, that's a different thing; bridge amplification has nothing to do with that. So then what you have is clusters of identical sequences, and that is what you do your sequencing reaction from. The sequencing reaction uses fluorochrome-labeled terminated nucleotides. In the same way Sanger sequencing worked, synthesis stops where one of the modified bases is incorporated; here all the bases added in a cycle are modified so that every strand grows by exactly one. The incorporated fluorochrome-carrying nucleotide does not allow the addition of another base, so every molecule gets extended by one nucleotide per cycle. Then you have a very powerful microscope, because you're now looking at millions of dots on your slide, taking pictures at every stage, multiple pictures throughout the whole experiment.
What happens is that every nucleotide addition shows up as a different color in that little cluster of identical molecules. The first step of the image analysis is to look at these pictures and track the changes in color, and because one color corresponds to one nucleotide, that tells you which nucleotide is being added at each position. [Audience: my understanding is that the fragments attach randomly across the chip, so how do you...?] I'm not sure I understand; can you rephrase? [Audience: here on the chip, you have random fragments of sequence.] Yes, but within one cluster they are all the same: it's one fragment that amplified at that one location, so all the molecules in a cluster are exactly the same, though this cluster is different from that cluster. Beyond that, the placement is random. [Audience: so how do you control the positions of the clusters?] That I'm actually not sure about, someone else may know, but I think it's basically a matter of solution density, and of the adapters and how they're placed on the surface. On the 454, which we'll get to, you have a solid support with a very well-defined surface where the beads can only go in fixed positions; this one is much more random. What Illumina is doing right now is trying to make the clusters higher density, packing them closer together to get more reads per run; that's one of the ways they're increasing their throughput. Then there's the paired-end read, which is a modification of this method that I'm not going to go through in detail.
Paired-end means reading both ends of the same molecule. So you know you have two 75-nucleotide reads, and that they are, say, 200 or 300 base pairs apart; you know that exactly from the library preparation. So when you map one read in the genome, you know the other one should map a few hundred base pairs away. That becomes a useful reagent for detecting translocations and alternative splicing, because you notice when the two ends are farther apart or closer than you expect, and for copy number variation as well. [Audience: is each cluster read one nucleotide per cycle?] Yes. [Audience: are there problems with homopolymers?] Yes, but not as bad as on the 454; homopolymer detection is definitely one of the drawbacks of 454. Michael is probably better placed to answer that: on this platform there are in general very few insertions and deletions at homopolymers, and many more on the 454. Keeping clusters apart is mostly a software problem; the software has to be able to tell the spots in the instrument's images apart. I have to speed up a little bit, looking at the time. So basically, this is Illumina data. You see the spots and the different colors, and the image is a multi-layer TIFF, so the progression of color changes over time is layered into one image.
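The paired-end logic described above can be sketched in a few lines. This is illustrative only: the function name, thresholds, and the fixed expected insert size are assumptions for the sketch, not part of any real pipeline, which would estimate the insert-size distribution from the data.

```python
# Sketch of how paired-end mapping flags structural variation: if the two
# ends of a fragment map much farther apart (or closer) than the expected
# insert size, or onto different chromosomes, something lies between them.
# EXPECTED_INSERT and TOLERANCE are made-up illustrative values.

EXPECTED_INSERT = 300  # expected mate separation in bp (library-dependent)
TOLERANCE = 100

def flag_pair(pos1, pos2, chrom1="chr1", chrom2="chr1"):
    if chrom1 != chrom2:
        return "translocation?"
    span = abs(pos2 - pos1)
    if span > EXPECTED_INSERT + TOLERANCE:
        return "deletion?"     # sample is missing sequence the reference has
    if span < EXPECTED_INSERT - TOLERANCE:
        return "insertion?"    # sample has extra sequence vs. the reference
    return "concordant"

print(flag_pair(1000, 1290))                  # concordant
print(flag_pair(1000, 5000))                  # deletion?
print(flag_pair(1000, 1290, "chr1", "chr7"))  # translocation?
```

The same discordance idea applies to RNA: a pair spanning farther than expected on the genome can indicate an intron, i.e. a splice event, between the two reads.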
You see the change in color, from green to blue and so on, and so you can read the sequence off; software does that, though obviously you could read it off yourself. That is the first step of the pipeline for Solexa, or Illumina, data. The same here: another cluster of sequences, and here you have a run of T's, and so forth. Now let me quickly go over the SOLiD sample preparation and amplification. One thing I didn't mention with the Solexa, or Illumina, and it's even more true for the SOLiD, is that you have to anticipate a lot of storage for that pipeline. The laptop you're using today could store maybe one experiment, but most people don't have terabytes on their laptops, and that has definitely been one of the big concerns. It's not so much a storage issue as a bandwidth issue within your institution's network: between the sequencing machine and the server farm you need high-bandwidth, fast transfer rates. Now, the SOLiD platform. This is Michael Brudno, one of the instructors; he's not a late student, he's a late instructor, and he actually works on SOLiD, so it's a very timely appearance: he'll answer your questions. These are images from the paper, which are in turn images from the SOLiD website. Basically, what we have here are single-stranded molecules attached to a glass slide, again at somewhat random but roughly uniform distances, and the chemistry is ligation-based. There are still four colors, but they work in a totally different space. You have dye-labeled probes in the different colors, attached at a unique spot, and you have your primer, so you know exactly the P1 adapter.
You actually know the first nucleotides of your adapter, because you set that, and then you have your template sequence, which comes from the sample you're looking at. A probe can only ligate when its two interrogation nucleotides match the template, so you excite the fluorophore and get a color, you remove the fluorescent end with a cleaving reagent, and then the whole cycle starts over: you ligate the next probe, and you repeat that step. At each step you're really only interrogating two nucleotides; the other positions of the probe have to anneal, but only two of the nucleotides you've interrogated are informative. Then you remove the whole extended molecule and start over with a primer shifted by one, so you shift the frame and do the experiment again, taking pictures at every step as before, again interrogating two nucleotides out of each probe. You end up doing this five times, so that every position of the 35-mer, since you're interrogating with 5-mer probes and reads are 35, or now 50, nucleotides, has been covered by two informative positions that tell you what's at that position. So you end up with a 35-nucleotide read in which every position across the molecule has been interrogated twice. Now, the colors: what each color tells you is which pair of nucleotides is there. So these are the four different dyes, and they're telling you you have an A, A, C, and so on. Actually, let me rephrase that.
If you have this color, then you have one of these four pairs; if you have this color, you have one of these pairs, and so forth. So one color gives you four possibilities, okay? So if we have this sequence, I'm going to try to go through the exercise of translating color space into nucleotide space. You start at the five-prime end: the first dinucleotide, A then T, at this position corresponds to code 3 from the table, and the table, which I'll show you on the next slide, tells you which color that is. So you put down the T, call it a 3, and the code tells you what the next letter is. You sort of go like that throughout the whole molecule. There are some rules here that are useful to know, but basically, let me just jump across to this. So in this case, the first base is an A, and it's followed by a red dye; A with red means the next base is a T, so I put a T down. The next one is a green dye: T with green means the next one is a G, so I put a G down. Next, the G is followed by a blue dye, and G with blue means the next one is a G. All the homopolymers are blue, so AA, CC, GG, TT, those are always blue; GG is blue, and you have three of them here. Then you have a G followed by a yellow dye, so G with yellow, and the next one is an A. And you go like that throughout the whole molecule. So SOLiD data, one part of it, looks like this: you have the first nucleotide, because that comes from the primer, and then you have the color codes. I had anticipated, as part of the analysis pipeline, actually going through this with all of you right now and getting you to do this.
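The decoding exercise just described can be sketched in Python. The transition table below is the standard SOLiD two-base encoding as walked through in the lecture (code 0/blue for the homopolymer pairs AA, CC, GG, TT; code 3/red for AT; and so on); the function name is ours.

```python
# Standard SOLiD two-base encoding, as walked through above:
#   code 0 (blue):   AA, CC, GG, TT    code 1 (green): AC, CA, GT, TG
#   code 2 (yellow): AG, GA, CT, TC    code 3 (red):   AT, TA, CG, GC
# NEXT[base][code] gives the following base.
NEXT = {
    "A": "ACGT",
    "C": "CATG",
    "G": "GTAC",
    "T": "TGCA",
}

def decode_colorspace(read):
    """Decode a color-space read like 'A3102' into nucleotide space.

    The first character is the known primer base; each digit encodes
    the transition from the previous base to the next one.
    """
    base = read[0]
    seq = []
    for color in read[1:]:
        base = NEXT[base][int(color)]
        seq.append(base)
    return "".join(seq)

# The worked example from the slide: A + red(3) -> T, T + green(1) -> G,
# G + blue(0) -> G (homopolymers are always blue), G + yellow(2) -> A.
print(decode_colorspace("A3102"))  # -> TGGA
```

Note that a single wrong digit early in the read corrupts every decoded base after it, which is exactly why the analysis software prefers to stay in color space rather than translate first.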
We won't have time to do this right now, but maybe we'll do it tonight at the lab: take one of these reads, translate it into nucleotide space, see what the nucleotide sequence is, and try to understand what the software is doing. What the software is actually doing is using these numbers; it's not translating to nucleotide space. It can work directly in color space, and that actually makes it easier to see errors, for example, or to see SNPs. SNPs will appear as two adjacent color changes: if the colors match, then two colors differ, and then the colors re-sync again, that corresponds to a SNP. When you only have one isolated color change, it's telling you there's an error in your detection or base calling or something. And then deletions and insertions have their own patterns, so these become diagnostic of the types of changes you're seeing in your DNA, which has been very useful. Michael Brudno is going to talk quite a bit more about this. 454: a much smaller number of runs, sorry, number of reads per run. Technology-wise, it's basically single-stranded; you have adapters. And what you do, as we were just talking about, is you have these micelles, and within a single micelle you have clones of one molecule, so one micelle carries copies from one single molecule. Then you put these micelles on a sort of microtiter, or nanotiter, type plate, and you capture the images generated from light being emitted every time a nucleotide is added. And this is not a microscope; it's not as high resolution as the SOLiD or Solexa, so you don't need as finely tuned a camera, more a CCD-type camera, to detect these and so forth. Smaller file sizes, and so an easier process. But it's actually more complicated than that: you have quality score files, you have mismatches.
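The color-space diagnostic described above, where an isolated color mismatch suggests a sequencing error while two adjacent mismatches suggest a true SNP, can be sketched as a hypothetical helper that compares a read's colors against the colors expected from the reference:

```python
def classify_color_mismatches(ref_colors, read_colors):
    """Classify runs of color mismatches between a read and its reference.

    Hypothetical helper illustrating the diagnostic described above:
    one isolated mismatching color suggests a sequencing error, while
    two adjacent mismatching colors are consistent with a true SNP.
    """
    mismatches = [i for i, (a, b) in enumerate(zip(ref_colors, read_colors))
                  if a != b]
    # Group mismatch positions into runs of consecutive indices.
    runs, current = [], []
    for i in mismatches:
        if current and i == current[-1] + 1:
            current.append(i)
        else:
            if current:
                runs.append(current)
            current = [i]
    if current:
        runs.append(current)
    return [("SNP" if len(r) == 2 else "error" if len(r) == 1 else "complex", r)
            for r in runs]

print(classify_color_mismatches("0123012", "0120012"))  # isolated -> error
print(classify_color_mismatches("0123012", "0131012"))  # adjacent -> SNP
```

Real color-space SNP callers also check that the two adjacent colors form a valid substitution pair, but the run-length idea above is the core of the diagnostic.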
You often need to align to a reference genome. So there are the challenges of doing a de novo sequencing project versus a re-sequencing project. The short-read next-gen platforms are good for, let's say, re-sequencing the human genome, because we already think we have a template of the human genome. For a bacterial genome that's never been sequenced before, you're much better off going with the 454 technology, because you're getting longer reads and you have to assemble de novo. Those are some of the choices you have to make among the platforms. Also, a big new player last year: PacBio. That was their website; this year they actually have a real website, with a cute little video on how the technology works. Like Helicos, this is single-molecule sequencing. It looks promising, although they've published papers, and of course when you start publishing papers, you start showing the world, and the reviewers, your dirty laundry. This is sort of a chromatogram off a PacBio machine, and there are definitely some of these, what do you call them, dark bases? Is that what you call them? Dark bases, where things don't show up where they should. And it's not uniform over time, because you're dependent on the DNA polymerase going through, sequencing this one molecule, and because of homopolymers and the structure of the DNA, the polymerase may slow down or speed up. So it's not as uniform as some of the other technologies. That said, it's quite promising, and we're still expecting a machine to be rolled out from this company; early 2010 is the latest estimate. The thing to keep in mind is that everybody is learning, so you can obviously learn a lot from each other, from the people in the room, and from the people in your institute.
The technology is changing. This workshop this year is different from the same workshop, on the same topic, last year. And as I mentioned, we can only do so many things in two days, and I hope you'll be able to see some of the things you need to learn. The other thing to keep in mind is that it's not so much about the specifics of the technology and the specifics of the way things are being done, but more the process by which it's being done, because the read lengths are changing, and the software and everything is changing. What you learn this week is going to be different from what you'd be learning, and having to do, in a few months. But hopefully you'll be in tune with the way of thinking about this kind of data and so forth. Obviously, the cost of the machines is an important factor. That's sort of stagnating for the two big ones at about half a million. Helicos, I think, is advertised to go for over a million. PacBio: unknown, maybe several million, but I have no idea. But if it can sequence a genome in four minutes, how many millions are you willing to pay for that? All these things are going to change. Whether you're a department with one machine, say a 454, or a genome centre with 15 machines, is going to very much influence the type of infrastructure you need and the types of projects you embark on. So all these things are quite important. There was a recent tweet I read somewhere that at the Beijing genome centre they have 22 machines and 250 bioinformaticians, so roughly 10 bioinformaticians per machine. So if you're getting one machine, think about how many bioinformaticians you'll need to recruit. And of course, the software is changing all the time. This is just a quick snapshot of what we have here at the OICR. We have 14 machines: seven SOLiDs and seven Solexas. And this is the amount of hardware we have.
We have about 1.2 petabytes of storage, a cluster of about 1,600 cores, and about 120 servers of different types. The genome centre is not the only purpose of the OICR, but it's the main reason we have this kind of infrastructure. Obviously 14 machines is a lot of machines, but that's what we have. This is also what's coming up next, another tweet: "you can text me your blood sample over my cell phone line." So we'll be texting each other our genomes pretty soon, and that should be quite interesting. So we have a coffee break here, which I've shortcut a little bit, but we'll make do. And then we'll have Michael Brudno's lecture. Okay, thank you very much.