What unsolved problem are you trying to answer in this study?

The challenge is pretty simple to state: the human genome is still incomplete. We've all heard of the Human Genome Project and its completion, but there were still unknown regions of the genome that were difficult to finish for technical reasons. Those gaps, as they're called, regions of unknown sequence in the genome, have gradually been filled in over the years as technologies improved. But as of last year, there were still about a hundred or so of these regions that were incomplete. So we set out to finish those regions using new sequencing technology. It's clearly a challenge, because it hasn't been done since the initiation of the Human Genome Project 30 years ago. It has taken a combination of sequencing technology improvements and computer algorithm improvements to stitch these regions together for the first time.

What kind of sequencing technology made it possible for your team to work on completing the human genome sequence?

Yeah, so the sequencing technologies that have been key to this are sometimes called long-read sequencing technologies, where the read length is how many letters of DNA you can read across the sequence at a time. In the initial stages of the Human Genome Project, we could read about 500 bases at a time, 500 letters per sequence. In the mid-2000s, we actually reduced the amount of DNA we could read at a time, but increased the accuracy and the throughput of the technology. With those technologies, we could only read about 100 to 200 bases at a time. Around 2010, new technology started to come on the market that could read 1,000, 10,000, and now most recently 100,000 or more bases at a time. It was the emergence of those long-read technologies that allowed us to get more information about these unknown regions.

How did you and your team find a way to answer this question?

The technique that we used was very simple.
It was to collect as much of this long-read data as we could for a single cell line of interest. We chose a very unusual cell line that has two copies of every chromosome, just like a normal cell, but the two copies are identical to one another. So rather than having to resolve two versions of the genome, we only had a single version to worry about. You can grow those cell lines clonally, so you don't have any variation within them, and then sequence them on these instruments. In this case, we used nanopore sequencing technology that was capable of reading 100,000 to a million bases at a time. We employed that on these samples and, over the course of about six months or so, collected as much data as we could from the longest reads we could possibly get, and then fed those into algorithms that my group developed for stitching these puzzle pieces back together again.

What did your research find?

Well, we found the unknown. The real fun part of this project is that you're exploring what hasn't been explored previously. We had some idea of what these gaps contained, but we didn't know it down to the individual base pair. On the X chromosome, which we targeted specifically, we were able to fill in the remaining sequence of the centromere. This is a large repetitive bit of sequence that's centered in the middle of the chromosome, as its name might suggest. A number of other amplified gene arrays on the X chromosome were also completed. These are tandemly arrayed units of genes, where you might have 10 or 20 copies of the same gene arranged one after another in a large repetitive structure. Because of that repetition, it's been hard to pin down what each of those copies individually looks like. But by having single reads that can now span that whole array, we're able to find the variation within those copies. So in some ways we didn't find any big surprises. We found what we expected to find.
So it's a little reassuring that we didn't find, for instance, any clearly novel genes or things like that. What was key was the amount of variation that we found within these repetitive arrays and within the centromere. Going forward, that's what will be useful: our ability to characterize the variation in these repeats and possibly associate it with known diseases that currently have an unknown genetic basis.

Is the X chromosome special in some way? Why did you pick it for this study?

Yeah, there are a number of small reasons why we gravitated towards that chromosome. In particular, its centromere was well studied, because males, who are XY, are so-called haploid for the X, meaning they only have one copy of the X chromosome. For that reason, the satellite array structure in the centromere of the X was well studied and characterized, so we had an idea of what we would find. Also, because the X is present in a single copy in males, it's disproportionately linked to inherited diseases, because as a male you don't have a backup copy. If you have a defective gene on your X, you don't have a second one that you could switch over to. So there was both a disease justification and more prior knowledge. With that, we knew we would have a greater chance of success, because we knew what we were getting ourselves into. There are a number of other chromosomes that might be easier. There are a bunch that we know will be harder, and we'll take a look at those next.

Which chromosomes will you be looking at next?

Currently, we're looking at chromosomes 6 and 8, for instance. There are other chromosomes, such as 1 and 9, that we know will be exponentially more difficult than these because of the specific repeat structures that they contain. Ultimately, our hope is that we'll improve the technology and the algorithms enough that we can just get the whole genome rather than a chromosome at a time.
But in the near term, we'll focus on what we can do while we develop algorithms for the regions that are more difficult.

Why is this research important and why now?

Well, the second question is easy. Why now is because we have the technology to do it now. If you go back and read the initial Human Genome Project paper, they kind of write off the centromeres and these highly repetitive regions of the DNA as regions that we may never know or may never finish. At the time, it was because you couldn't actually clone that DNA into a bacterial cell and replicate it, which is how they generated DNA to sequence for the project. Because you couldn't clone it, they figured, well, we can't ever access it. Now we can sequence individual molecules of DNA without amplification and without cloning, so the technology allows us to do it. For the first question, why, it's because we don't know what we don't know. We've never seen these regions, and there may be things in there that we don't expect. And I'm a bit of a perfectionist. I like to see everything done to completion, so I would like to map out that remaining unknown sequence.

What message about this work would you like to share with the public?

Well, the first is just the notion that people have heard about the Human Genome Project and might consider it finished, and think that maybe we don't need continued investment in this. But with the new technologies, we're able to reach regions of the genome that were previously inaccessible. So the take-home message is that it's not finished and there's a lot more work to do. The second message is that every genome is unique, which is kind of self-evident, since we are all individually unique. Our genomes themselves are unique and contain unique structures, unique sequences, unique variations. What we hope to do from here on, now that we have a single complete structure of an X chromosome, is look at the variety of X chromosomes that exist in the human population.
Once you can pin down all of those sequences and their variation, you can start to make the link between different phenotypes, meaning what diseases a person might inherit, how they look, et cetera, and the underlying genetic basis.

What is the value in cataloging the kinds of genomic variation in a large number of people?

Once we know the sequence for a single genome, it's much easier to explore the variations in other genomes, because now we know what we're missing and we know what we're looking for. When we say variations, we're talking about individual bases that might be changed, but also very large structural changes. You can have a piece of sequence that's inverted, for instance, or moved, or present in multiple copies. It's well known that these types of variation are linked to genetic disease and other genetic factors. By being able to measure those variations in multiple individuals, we can make a better catalog of how prevalent they are and what their effect is on human health.

How do you think the study will influence healthcare down the line?

Yeah, I mean, it's one of these projects that has a fundamental value to the genomics community, because the human reference genome is used by essentially anybody who does genomics. Anytime a person's genome is checked for disease, that's possible because we have a reference genome, representing a so-called normal or unaffected state, that we can compare against. Improving the quality of that reference improves the quality of all of those types of tests. One of the unintended consequences of an incomplete genome is that those gaps of unknown sequence can cause artifacts in the analysis. If I were to sequence your genome, for instance, it would generate sequences from those gaps, and there would be nowhere to map them to in the reference genome.
So they end up piling up in the wrong position in the genome, and this can cause artifactual variants in the analysis. Having a more accurate reference sequence prevents these types of artifacts. Secondly, the technology improvement is key. The nanopore sequencing methods that we've been pushing forward are reducing the cost of sequencing and improving the analyses. This goes beyond sequencing individual human genomes. You can use it for outbreak tracking in a public health scenario. You can use it for antibiotic resistance testing. You can use it for testing produce and meats for contamination and pathogens. Anywhere there's DNA, there's an opportunity for DNA sequencing to be used, and these fundamental methods are key to all of those uses.

Could having a complete genome sequence also impact how genetic testing companies test for diseases?

Yeah, it will open up these previously unknown regions of the genome to this type of testing. If there are variants within those regions that are linked to disease, you'll now be able to test for them and see them. Those types of kit companies have often used probe-based technologies, where you can really only assay variants that you know in advance, because you have to design a sequence that captures those variants. Now you could design probes and assays that would capture these previously unknown regions. But adoption is always a very slow process as the reference genome gets improved. For instance, we recently went from version 37 to version 38, and it's taken 5 to 10 years for some tests and research labs to make that transition. The more something gets used, the harder it is to change to an updated version, and that's only going to get harder as we go forward. So one of the challenges of the field is how we balance the validation of the existing assays that we have against improving them.
And I think showing improved accuracy, and proving that to the community, is how you push the field forward.

Will there be a perfect human genome sequence or sequences, or is it a moving target?

Yeah, it's a bit of a loaded question, because what is a perfect genome? Your body has millions of cells, and each of them is slightly different from the others, which is so-called mosaicism within your genome, depending on which cells in your body you're looking at. Those mutations accumulate over time as you age. So even when I sequence my genome, it's just going to be a consensus view, a kind of average of all of those cells taken together. You definitely have to think of it as a dynamic system. Some of the regions that we're completing now are the most dynamic regions of the genome. They're the most prone to change, to copy number changes and variation. From individual to individual, if we're talking about the centromere, for instance, there can be up to a megabase, a million base pairs, of difference in the amount of sequence in a given centromere. So at some point you have to settle for good enough: a reference that gives you at least a coordinate system for talking about which genes we mean and where on the genome they are. But it's very dynamic.

What's the ideal end goal of your research?

I think, if we take it to the extreme, the eventuality is that we'll have everyone's genome perfectly completed for both haplotypes of each diploid genome. That's the ultimate goal of my lab and of the sequencing technology and assembly methods that we're developing: we can be given a human sample, and the output is a perfect genome for both of the haplotypes.
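
The point made earlier in the interview, that single reads spanning a whole tandem array make it possible to resolve what short reads cannot, can be sketched with a toy example. This is only a minimal illustration under invented assumptions, not the assembly algorithms the group developed: the sequences, the names (`build_region`, `reads_of_length`, `spanning_reads`, `copies_in_read`), and the sizes are all hypothetical, and the reads are error-free.

```python
# Toy illustration only: all sequences, names, and sizes are invented.

LEFT_FLANK = "ACGTTGCA"    # unique sequence before the array (hypothetical)
RIGHT_FLANK = "TTGACCGA"   # unique sequence after the array (hypothetical)
REPEAT_UNIT = "GATTACA"    # one copy of the tandemly repeated unit (hypothetical)


def build_region(copies: int) -> str:
    """Return flank + tandem array + flank for a given copy number."""
    return LEFT_FLANK + REPEAT_UNIT * copies + RIGHT_FLANK


def reads_of_length(region: str, read_len: int) -> list[str]:
    """Every error-free read of a fixed length, one per start position."""
    return [region[i:i + read_len] for i in range(len(region) - read_len + 1)]


def spanning_reads(region: str, read_len: int) -> list[str]:
    """Reads that contain both unique flanks, and so cover the whole array."""
    return [r for r in reads_of_length(region, read_len)
            if LEFT_FLANK in r and RIGHT_FLANK in r]


def copies_in_read(read: str) -> int:
    """Count repeat units between the two flanks of a spanning read."""
    start = read.index(LEFT_FLANK) + len(LEFT_FLANK)
    end = read.index(RIGHT_FLANK, start)
    return (end - start) // len(REPEAT_UNIT)


if __name__ == "__main__":
    ten, twenty = build_region(10), build_region(20)

    # Short reads: the distinct reads from 10 copies and from 20 copies are
    # exactly the same, so content alone cannot tell the two arrays apart.
    print(set(reads_of_length(ten, 20)) == set(reads_of_length(twenty, 20)))

    # A read as long as the whole region spans the array, so the copy
    # number can be counted directly.
    long_read = spanning_reads(ten, len(ten))[0]
    print(copies_in_read(long_read))
```

In this sketch the short reads from the 10-copy and 20-copy arrays are indistinguishable by content, while a single spanning read pins the copy number down. Real data adds sequencing errors and variation between copies, which this toy deliberately ignores.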