Okay, ladies and gentlemen, it gives me great pleasure to introduce the last speaker of the day. I just want to thank you all for attending Genetic Genealogy Ireland 2018. It is great to have such a large audience attending these lectures, and I'd just like to give a final thanks to our sponsors, Family Tree DNA, who have been so good as to sponsor us for the last six years and allow us to indulge in our hobby. So if you can all take your seats, especially those of you at the back, we will continue with the last lecture of the day, and it gives me pleasure to introduce John Cleary. Now, John is a lecturer in linguistics at Heriot-Watt University in... Is it Edinburgh? Edinburgh. And in his spare time he runs a variety of different surname projects, including the Kemp and Cummings surname DNA projects. John is going to talk to us about Y SNP testing, which he has done extensively in his projects, and the future of whole genome sequencing. So please give a warm welcome to John Cleary. Thank you, Maurice, and thank you all very much for your endurance in staying to the very end. I hope it will not be the bitter end; it will be a very sweet end, in fact, to three days which have been fantastic here at GGI 2018, as ever. I'm going to warn you: I am going to talk about Y NGS testing on the assumption that most of you, well, all of you, I hope, are familiar with it. So I'm not going to be talking at beginner's level, explaining the basic concepts of what a SNP is and how we test with it. Just to warn you in advance: if you're not familiar with SNP testing, you might find this disappearing into the jargon quite quickly. Can I ask, how many of you have taken Y SNP tests for yourself or for a male relative? Yeah, so actually I think you're all quite familiar with this.
Of course, we had a very good talk just now from John Brasile, which dealt with a very similar topic. So, to try and wrap things up for this year, I want to do two things today. I want to do a bit of a review of some of the new tools which are being offered by Family Tree DNA for their Big Y test on their site. I'm only going to look at native tools today, not any third-party tools. The main reason is that, as most of you know, a year ago Family Tree DNA began the big process of converting their Big Y database to hg38, the current human reference sequence. This also led to a big revision of what they offered, eventually leading in the spring to their extending the test into the Big Y-500, offering, in addition to the SNP reads, a panel of STRs taking the total to 500, which was offered from April or May this year. So the Big Y became something quite different in the past year, and Family Tree DNA took the opportunity to improve the on-site tools, and I'll look at these in the course of the talk. And then, as I said on the title slide, I'm not going to make any declarations or predictions about the future, but I'm going to have a little glimpse into where the future could go in terms of Y testing, and where I think it is going. So we are talking really about the present that's shaping up into one version of the future. In the first part, then, I'll begin by looking at the Big Y test. As I'm sure most of you now know, it's an advanced SNP test. It uses next-generation sequencing, and the aim is to sequence actual stretches of the Y chromosome. So it is not reading the microsatellites, the STRs, and counting the number of repeats in those regions; it actually aims to read long stretches of the Y chromosome. It does not aim to read the whole Y chromosome, because that cannot be done: at least half of the Y chromosome has, even today, not been sequenced for the reference sequence.
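To put rough numbers on that coverage point, using figures quoted later in this talk (12 to 14 million positions read, out of an estimated 25 to 26 million readable positions on the Y), the arithmetic works out something like this. This is a back-of-envelope sketch; the figures are estimates, not official specifications, and vary between test runs.

```python
# Rough coverage arithmetic for the Big Y, using estimates quoted in this talk.
covered = 12_000_000    # positions typically read in an hg38 Big Y BAM file
readable = 26_000_000   # estimated "readable" positions on the Y chromosome
fraction = covered / readable
print(f"Big Y targets roughly {fraction:.0%} of the readable Y")
# prints: Big Y targets roughly 46% of the readable Y
```

Which is close to the "about 45%" figure mentioned later in the talk.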
And there are other repetitive areas, and areas that are deemed to be less useful for the needs of genetic genealogy. So what Family Tree DNA did, very wisely, was offer a slightly cheaper version of next-generation sequencing that read important parts of the Y chromosome, but smaller regions than other tests, and could turn those tests around much more quickly, concentrating on what was believed to be the areas most beneficial for genetic genealogists to acquire data from. So sometimes it's compared, unfairly, with a test by a competitor, but the two tests do very different things. However, we are moving into a period, I think, where we need to start asking whether the parts of the Y chromosome not read by the Big Y test are those parts we now need to be looking at. The Big Y and other NGS tests are discovery tests. The aim is to discover new SNPs that were not known before, or to find out whether you may have certain newly discovered SNPs which may belong in your branch of the tree. It's not about testing whether you have certain SNPs in a panel at a fairly historic, prehistoric or ancient level; it's about finding SNPs that are new, or very recent in terms of history, and especially genealogical history. In particular, the aim is to discover those branch-defining SNPs, or to find ways to split apart big blocks of SNPs that can't be differentiated, splitting them into branches by finding people who are positive for some and negative for others, which allows you to make a branching. So it's all about discovery. It's about tree-building, in the sense that where STR testing was about asking questions about similarity and difference between testers, which allowed you to infer how closely related they may be, SNP testing is about building the actual tree, by working out how the SNPs relate phylogenetically and building the tree downwards from earlier ancestors.
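That block-splitting idea can be sketched in a few lines of Python. This is a toy illustration with invented SNP names and calls, not any company's actual algorithm: a new tester who is positive for some SNPs in an undivided block and negative for others splits the block into an older, shared branch and a newer branch below it.

```python
# Toy sketch: splitting an undifferentiated SNP block into two branches once
# a new tester is positive for some of its SNPs and negative for others.
# SNP names and calls below are invented for illustration.

def split_block(block, calls):
    """block: SNP names currently equivalent (one undivided branch).
    calls: dict of SNP name -> True/False for the new tester.
    Returns (shared, below): SNPs above and below the new branch point."""
    shared = [s for s in block if calls.get(s)]          # the new tester has these
    below = [s for s in block if calls.get(s) is False]  # the new tester lacks these
    return shared, below

block = ["BY1001", "BY1002", "BY1003", "BY1004"]
new_tester = {"BY1001": True, "BY1002": True, "BY1003": False, "BY1004": False}
shared, below = split_block(block, new_tester)
print("older shared branch:", shared)   # ['BY1001', 'BY1002']
print("newer branch below:", below)     # ['BY1003', 'BY1004']
```

The real work, of course, is in making the underlying positive/negative calls reliably from the sequencing reads; the partition itself is the easy part.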
A few little facts, first of all, about the development of the Big Y. It was announced in October 2013, so it's five years old, and the first results began coming in during the spring that followed. By October 2017, a year ago, when I was in correspondence with Family Tree DNA about this, they had about 20,000 Big Y results in the database. Of these, about 18,500 were customer tests and about another 1,500 were academic studies which had been commissioned by Family Tree DNA, probably as part of the background work towards developing the Big Y 500-STR panel. That's a year ago; there are certain to be several thousand more now, I would guess probably another 3,000 or 4,000 at least, as the rate of growth seems to be continuing at the same pace. During last winter we had the conversion to the hg38 human reference sequence, as I mentioned, and the launch of the 500-STR panel. So Family Tree DNA now offer 500 STRs, of which 111 are the well-known existing five panels, which are tested by a separate method, not part of the Big Y test itself, but are included within it if it's ordered from scratch. And then an additional minimum of 389 STRs, possibly more, are read from the results of the NGS test itself. I have some data here also from some third-party analysts. YFull have at least 15,000 results in their database. They also have some academically sourced Big Ys, and they have some tests from other companies; they have a large number of Big Ys, but clearly only a subset of all the Big Ys which have been done. So Family Tree DNA have the Big Y database; YFull have a subset of that, but they're mixing it in with tests from other sources. And this may become important as we begin to move into a period where there may be more diversity in the kinds of tests being taken.
Alex Williamson is very well known for the Big Tree, which is very important for all people who are researching within the R1b haplogroup, including myself, as I'm a member of that, and who use it as an amazing reference source, with very detailed information about newly discovered SNPs and about private SNPs of testers who have submitted their data to this third-party site. Again, there are also some academically sourced studies on there as well. These are statistics from Alex Williamson's site. He constantly counts the number of kits that he's processed and put onto his tree, and these are the latest figures; I updated and checked these this morning. They've probably changed since then, because it seems to be constantly growing. This is a cumulative count down the side, and you can see how many kits per year are being uploaded. The three colours represent P312: R-L21, actually the most common SNP in Ireland, and then the rest of P312 in red; so these together are all P312. And then in 2015-16 Alex began to upload R-U106, the other big R1b haplogroup, to the site as well. I think probably most of the attention has been given to the P312 data, which has been growing fairly steadily. I'll just go back to that page. Growing fairly steadily across the past few years. And although there seems to be, I thought, maybe a slowdown this year here, because we still have two and a half months to go I suspect that by the end of the year we'll see fairly constant growth in the L21. The other seems to have backed off a bit; there's obviously a bit of catch-up here, as a lot of kits were sent and processed by the Big Tree. The R-U106 tree hasn't developed quite as fast as the other two. These are all linked to the YDNA Warehouse, which is offering a service for all haplogroup R people, and I believe in time they'll be offering an even better system than this, so we'll watch out for that. But again, I've indicated here a growth.
And of course, as you are well aware, every Christmas and New Year, and during the summer, Family Tree DNA have big sale periods, which are now looked forward to eagerly by project administrators, who are usually ready with tests to order as soon as the price drops down to an affordable level. Of course, the thing about NGS testing is that per SNP discovered it is very cheap, but the cost of an individual test is very high. So the decision to take any kind of NGS test, whether the Big Y or the ones I'll talk about later on, involves investment, and weighing up the gain against the cost. So I'll talk now about three tools introduced by Family Tree DNA to the website. I did a mini review last year when they were brand new, and what I'm going to do now is look at them in the cold light of a year of using them. I think generally they're good tools, though they do have some shortcomings, and there's definitely some room for improvement. The first one I'll look at is the haplotree. Here's the familiar entrance page. I haven't blanked out the name, because it's me; I'm letting my data all hang out here, and I don't care who sees this. That's my address there; I don't care, send me letters, I don't mind. You can click on the buttons here, and you have the familiar button for the results and matches, and the Y-STR results with the new 500 panel. There's also now a new button which links to the public haplotrees, which I'll look at shortly. So these are the two main tools we'll look at, as well as the haplotree: the stepped matching system derived from the haplotree, and the Y chromosome browser, which Family Tree DNA offer. First of all, then, the haplotree. Down here we have the view of the old haplotree, which still exists, of course, and until September this was contained only within the accounts of the testers.
So you had to be a tester or a project admin to be able to see it. It became very detailed in the past two years, as Family Tree DNA began to invest in the resources needed to update it, first updating it manually by receiving submissions from project administrators. That also brought a bit of quality control into the process, because in the older version of the Big Y there were a lot of SNPs called which actually weren't that good, for various reasons. So QC was applied to put the ones which were worth putting on the tree into the correct positions. And then by the time we moved to the Big Y-500, they began to use auto-calling. So there's now a more automated process, certainly in calling SNPs and putting names on newly discovered SNPs, so that now virtually every private SNP discovered in a Big Y test of quality is very quickly named by Family Tree DNA with a catalogue number beginning with BY and then a string of numbers. There are six digits in these names now; I think it was up to around BY180000 when I last checked in the summer, and it's probably even higher now. That gives an indication of how many SNPs have been discovered by Family Tree DNA. A small number of those will be duplicates, but a lot of them will be unique SNPs discovered much more recently. The auto-calling is meant to speed up the building of the tree. I think it's still being vetted manually; I don't think it's being produced fully automatically, but it is now a very detailed haplotree which contains every SNP shared by at least two testers. So if a SNP is private, it doesn't go on until a second tester comes along and shares it in a meaningful relationship, so that you can identify a branch of the tree, and Family Tree DNA then extend the haplotree. They pleased everybody in late September when they put the haplotree onto public view. And this is the public view of the new Y haplotree; a little more recently they've also put the mitochondrial haplotree online as well.
And these can now be consulted by anybody. This, of course, is not a total tree of all SNPs discovered, because there are other companies who run tests and have discovered SNPs that may not yet have made it onto this tree, particularly as there's now no longer a route for project admins to approach Family Tree DNA to have SNPs not discovered through the Big Y added to the tree. However, it is a very comprehensive tree of SNPs discovered in the Big Y, and I want to just go and look at it. I have it here somewhere. Safari. Thank you very much. What you see on the front page is essentially just the top of it. So you'll see some branches of A, and some from A-L1000 here at the top, and you can expand these branches by clicking on the... So we'll bring it back. In principle you can expand the branches by the plus and minus signs, wasn't it? Yes it was. There's a plus there. So click on that and we'll expand the branch down below it. Of course, if you're an R person like me, you've got a long way down the alphabet, so the pages that have my SNPs on, for example, will be much further down. But I can go up to the search box here, search by branch name, type R-FGC5494, which is the big branch SNP that I belong under within L21, and I can then search. And in theory, yes, there we are, it'll bring us to there. Here it is. At the moment we're seeing here the country view. So we have our people arranged, or the data rather categorised, by origin. We've got some Irish, lots of Irish there, but quite a few English, Scots, and Brits who are neither English nor Scottish, perhaps some of those too, and Swiss down there I think, and a Canadian, and so on. It depends how useful that information is; of course, that's telling us where the modern-day testers live.
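As a brief aside, the rule mentioned a moment ago, that a private SNP only joins the haplotree once a second tester shares it in a meaningful relationship, amounts to something like the sketch below. This is purely illustrative, with invented positions and kit names; it is not Family Tree DNA's actual pipeline, which also involves quality vetting.

```python
# Illustrative only: a variant becomes eligible for a named tree branch once
# at least two kits share it. Positions and kit names here are invented.

def promotable(variants, min_kits=2):
    """variants: dict of position -> set of kit ids carrying the variant.
    Returns the positions seen in at least min_kits kits."""
    return {pos for pos, kits in variants.items() if len(kits) >= min_kits}

observed = {
    2688401: {"kit_A"},            # private: stays off the tree for now
    7173143: {"kit_A", "kit_B"},   # shared: eligible for a branch name
}
print(promotable(observed))   # {7173143}
```

The point of the rule is quality control: a variant seen in only one test could still be noise, whereas two independent observations in related testers define a real branch.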
We can also go up here and choose a different view; we can arrange it by surname. And here I'm not quite sure how they're deriving the surnames, because most of the data disappears and there are only a few surnames left: I've got a Warren here, I've got a Griffin, a McFarland, a Bush, a Tyre or Thion, a McFarland down here, a few more surnames down here. I'm not sure how they are putting the surname data onto this; it may be something to do with whether or not the tester has allowed public access to their data within projects, but I'm not quite sure what the criteria are here. That's maybe something that could be clarified in the future. And you can also choose a SNP view, which allows you to see all the other SNPs which are equivalents at that level. So here, for example, going back to my own branch, if I click it open, obviously we've got 144 branches here. I'll scoot down a little bit further to find where I am. That's me: Y196, Y1991. Click again, there's only three here now, and that's me there; that little branch is actually moi. These SNPs here are the equivalents in that block, so A7633 and all of these are SNPs that I have acquired since the last branching point, and there's no one yet who can form further branches with them. But one goal of Y testing in the future could be to find other people who will split that branch by having some of those SNPs but not the others, and then we can create a new branch on the tree. Anyway, there are quite a few of these trees for Y SNPs around. This is perhaps the most powerful, if only because it has the results from every Big Y test in it, and there are some SNPs certainly from tests from other companies. I imagine that in the future Family Tree DNA will accept more submissions from project administrators to make this tree as comprehensive as possible, so this may well evolve to be the most comprehensive collection of SNP branches that there is out there. This doesn't necessarily mean it will replace older trees. You may be familiar
with the ISOGG tree, for example. The ISOGG tree serves a purpose, because the SNPs on that tree are supposed to be academically verified, to a very high standard and under very strict criteria, so you can be sure the SNPs logged on that tree are of better quality, or more verified, than some of those appearing here, where this is more of an experimental tree. But I imagine this is the kind of tree that most project admins and testers will come to refer to, because of the sheer quantity of data that it holds. So that's the tree, and I'd say the tree is a big success for Family Tree DNA: big tick, 9 out of 10 for the public tree, only 9 because I think there are questions over the way they're recording the surnames; I'd like to see a bit more information about that. But a very useful resource, I think. Going on, then, the next two tools are on the Family Tree DNA site itself, as opposed to third-party tools on other sites. Here we have the stepped matches list, which is derived from the haplotree, and the intention is to show testers how many matches they have at the four levels on the tree above where they sit. So this person apparently sits within this subclade called YP983, and they have three matches within that, and then there are some further branches above that, with steadily increasing numbers of matches. So there's potentially a very useful system here, because one of the things we want SNPs to do is to sort matches, and here we can see the matches arranged in order of distance according to the branch on the tree; and presumably, as you further split some of these blocks, you'll then get more and more branches further down. But the problem comes, and I do think there is a problem with this otherwise very useful way of displaying information. This is my matches page. Here again is the SNP we just looked at, and here's me down here, and here's one of the matches, some of the Clearys somewhere, and we see that we have two non-matching variants, which are a BY SNP and one that hasn't got a name
yet. This one hasn't got a name; it's in a slightly suspect area, but it's in me, so that's a SNP that I've acquired (whoops, it's going to keep falling off) since splitting from this person. The other one is not in me, so it must be in the other Clearys. So we each have a little private SNP there, which we can of course use for further research. The problem with this, I think, is that it works for some people and not for others, and although it does seem to be a very neat way of classifying your matches by distance, it also seems to be limiting the few matches that we have. Just for example, this person, who's a member of one of the projects I work on, has no matches at all. How can that be? Well, Family Tree DNA have set a 30-variant limit on those non-matching variants: if you have more than 30 variants that you don't match with somebody else, then you will not see them among your matches. That's rather like the STR matching system, where you only see people who match you, say, within 4 out of 37 or 7 out of 67. Here it's been determined that if you have 31 or more non-matching variants with somebody, and it could be 17 for you and 14 for them, or whatever, then you won't see them as a match. This person seems to be so distant from any of the potential matches, even within these subclades. And I think that's a slight problem, because they're sharing these branch subclades with other testers whom they cannot see. This is a slightly better case here, and this one even better, so this person can see a few more people higher up. But the point is that each of these three kits actually descends from this, so this is seen as the major branch marker, and this person, rather arbitrarily, can see up to 13 people at that level, whereas these people can see far fewer, or no matches at all, despite belonging to the same subclade. Now, there is an argument that there's a case for limiting the number of matches you see, so that you're not bombarding people who are not related to
you in any meaningful historical period with emails for information. But I think that's a very weak argument, because part of what these tests are about is understanding descent over far more than just the period of genealogical surname use, and people are interested in deeper origins, and do want to build diagrams and maps of surname clusters and of how different surnames may relate to each other from early in the use of surnames. So it does seem to be a bit limiting, and I would call for the 30-variant limit to be simply removed, to allow people to see the matches they have within the selected branches. There's also an issue, given that all three of these come from the same subclade: shouldn't they be seeing the same standardised sets of branch markers, rather than ones chosen simply and arbitrarily from immediately above where they sit on the tree? To illustrate this on a slightly different tree, here's one which we've developed within the project; it's the same diagram, by the way, with YP355 at the top. Here we have some divisions, and the person who has no matches at all is down here. Because of all the SNPs they have, plus all the SNPs other people may have, including people in slightly higher subclades, plus ones that may not be declared on the tree, including private SNPs, they're deemed to be too far away to see any of the other matches. Just to illustrate: this person down here is able to see up to this point, this person here can see up to this point, this person here is able to see up to here, and this person, because there's less definition here in the tree, is able to see much farther, up to the R-L448 marker, which means they can see a lot of people in a parallel sub-branch down here. So again, it does seem to be a little bit unbalanced in what's on offer, and I think that while this stepped sorting of matches is potentially useful, the way it's implemented actually puts barriers up for building connections. Then the other big tool
that Family Tree DNA developed last year is the Y chromosome browser, and this is potentially very useful. It allows us to see such things as the length of the Y chromosome that's covered in the test, and we can also see all the individual reads stacking up at different locations on the Y chromosome where a SNP may be, with a SNP apparently appearing here, since the reference sequence at the top shows a T and here every read is showing a C; so again there's a fairly firm call of a SNP there. And a few more views here: if you hover your mouse over a column, you can get some information about it, telling you for example that this position is derived, because again the reference sequence has a T and here's an A almost all the way down. (That one's not an A... my ears must be too narrow, I think it's falling off; that's even louder than it was. Okay, so I'll just hold it like this, like a folk singer, and sing to you.) So here we have useful information, even given with a confidence rating. But again there are some limitations here, because you may be familiar with the genome browser IGV, which allows you to upload a BAM file; you get a similar interface to this, showing the mapped-out reads, and giving you more useful information, I think, than this version does. For example, if you want to know how many As there are down this column, and how many non-derived reads there are, you run down counting each one and then work out your own ratios. It would be quite useful, I think, if they could use the little information box here so that, if you clicked on the reference sequence, as you can in IGV, you could then get summary information telling you about this position without having to go down counting yourself. So again, I think there's some scope for improvement here. The confidence information they give here applies to each individual read; it would be quite useful to have more information about the confidence of the
overall call, so again this summary information could be provided up here. But it is potentially a very useful way of visualising results, while not being especially powerful, when you're checking the positions called in your test, in other words checking your non-matching variants or checking your known SNPs. You can see, for example, that this one has got far fewer reads than some others; I think there's fewer than 10 here, aren't there? 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11. So with 11 reads we'd presumably call this as a SNP, but if there were fewer than 10 reads, chances are it wouldn't be. So here you can find such calls and evaluate them for yourself. You can also find calls like this, where the SNP may not be called; you can pull out positions from your VCF file and try to find them, which does take a bit of ingenuity, and then you can find out why something that appears to be a SNP may actually not be a SNP. Here we see that these faded reads are actually poor quality, so the good-quality reads are calling the SNP and the poor-quality ones are not; that probably will not be called from the good-quality reads alone, but some third-party analysts will call that as a SNP, and it's useful to be able to look at this visually and get further information about it. I'll just skip on to the summaries here. It's pleasantly designed, it looks very pleasant on the eye, but the data is very, very limited, because we have to do the physical counting of the information ourselves, and the quality rating is also limited. Then, if you discover you have an additional variant not given in your list, there's a means here to investigate it. However, you can't navigate to any position on this, or at least you can't easily navigate to any position; you can only click on the list of SNPs that it gives you in your results file. However, if you know how to write HTML, you can go into the HTML code, set any position you want to look at,
and you can then take yourself there. It's kind of a hack, and it's a bit inelegant, but it's possible to force this to take you to any position you may want to look at. But again, it'd be easier if Family Tree DNA would just create a search box and allow you to go to any position you want to look at across the Y chromosome, see what the result is there, and find out if there are any other positions you may want to look at. And this goes back to the panel discussion we just had: there's no comparison ability with other testers, as there is in the brand-new version of the autosomal chromosome browser which they've just introduced. Would that be desirable? I don't know; I'll leave you to think about that one. So that's a quick run through these new tools, and because it's the final talk of the day, I'm going to push things on quickly. I was asked to talk about the possibilities for the future, because I think we're moving into a future where we're likely to see demand for more coverage of the Y chromosome in tests, and in particular we're all, I think, beginning to hear more and more about the possibilities of whole genome sequencing. So I've got a few pointers here, but no real answers. I've been trying to put together what information I can from those who are in the know about these things, but also from the very first test results, which are being posted and discussed in discussion groups. There are some limitations, then, with the Big Y test as it is. As I say again, I think it's done tremendously well, because it's relatively inexpensive for NGS testing; it's limited in the amount of coverage it has of the Y chromosome, but it does that job very well. However, there are many parts of the Y chromosome not sequenced, and of course other tests which cover more of the Y chromosome are finding additional variants there, which can give us, for example, more precise age calculations, if you're interested in that kind of thing, and can also find SNPs that may be
able to break apart some of the stubborn blocks that won't be split. There's also a limitation with the length of the reads: the current NGS tests read fragments, or parts of fragments, up to 150 base pairs in length, and these are then reassembled into the read of the Y chromosome. Again, there are some limitations in what they can do, particularly in the case of trying to extract STRs. If you want to try to get more STRs from the Y chromosome through NGS testing, then once again the 150-base-pair reads are a problem. So other methods might be developed that can give better coverage, that is, more length and deeper reads of the Y chromosome: higher-quality SNP calls, more Y SNPs from other parts of the Y chromosome, increasing the resolution of these trees within historical time, and more reliable calling of STRs. The costs will remain an issue, but we're at the point where some of the costs of whole genome sequencing tests are beginning to come down, so they're not actually that far away from the costs paid for specialist Y tests. I'm going to come back to this slide, because of the first thing I'm going to look at: I've got three potential ways in which we could move into the future. One potential way, and some of you may have seen a question I posted on Facebook in the last few days to test out people's ideas about this: is there any scope for moving towards a kind of dedicated Y SNP chip? That is, taking the microarray technology used by autosomal testing to develop a Y SNP chip that would contain all the SNPs discovered by Big Y testing, and if possible by other testing companies as well. There are about 118,000 SNPs on the tree at the moment, which means there's plenty of room on a chip; the autosomal chips contain 600,000 genome-wide SNPs and upwards, so there's plenty of room on them for those and more besides. There are probably another 100,000 or more private SNPs currently in circulation which haven't yet been
paired with a second tester. And of course this could be a way in which you begin to find, quite cheaply, people who may match others on those private SNPs: rather than commissioning one or several NGS tests just to get someone to match on one or two private SNPs, if they were put onto a microarray, then maybe a cheap test could be taken to find people who may match the testers who discovered those private SNPs. So the question is: would it be economical to do this? Could any company take on the design of such a chip and the marketing of it? Of course, once it was designed it would be very hard to change, so as new discovery tests went on, those new SNPs wouldn't be on this chip. But maybe it's possible to reach a stage where there's a certain saturation of SNP discovery that may allow a pause, in which a fairly cheap test could be developed to allow other people to find out more cheaply whether they actually match the existing SNPs. Discovery testing would continue, of course, but maybe this could be an additional way forward. The disadvantages are that it could distract from discovery testing: you may get people saying, I've done that test, I've had my Y read, I'm happy with that, that's all I need, and therefore there may be a slowdown in the discovery of new private SNPs. But as I said, it could be a potentially cheaper route for developing Y testing in future. Now I'm going to go back to the slide I skipped, because I want to put a couple of definitions here. As I move on to talking about the other kinds of NGS tests that we may move into, a lot of the comparison between them depends on numbers, and there are lots of numbers coming now, but particular kinds of numbers, assessing how much coverage these new tests will give us. So, for example, what kind of length coverage: if we want more of the Y chromosome to be read, then there need to be more positions on the Y read. The Big Y at the moment can read something like 12 to 14 million; I think the current
hg38 BAM files are usually producing reads of about 12 million positions, yet it's said that the readable space on the Y chromosome is about 25 or 26 million positions. So the Big Y is targeting about 45% of the readable Y, and there's more of it which could be read by other tests. Then we have depth coverage: the number of times a particular position is read in a test. A test may aim to read the sequence 15 times, 20 times, 30 times, 50 times, and the number of times the reads are done will improve the resolution of the test. So how many times is necessary to achieve a satisfactory quality? Of course, the more times a sequence is read, the more expensive the test is going to be, so there's a trade-off between cost and read depth. The question is: would whole genome sequencing tests offering, say, 15x (15 times) read depth be able to compete with the kind of results we're getting from the two big existing Y tests of today? And then, of course, the reads are the fragments. Here we see the reads piling up, and here we see a certain read depth; the number of times this position here has been read looks to be about 20 times. Other positions are read more frequently: once a test is done, every position on the Y will have been read a different number of times. It may be read just once, it may not be read at all, or it may be read two or three hundred times in certain pile-up regions which seem to accumulate more reads. So once a test is done, you can compare the median or the average read depth that's been achieved, and this can be an indicator of the degree of quality achieved. And as I said earlier, at the moment NGS tests are based on reading fragments of 150 base pairs; is it possible to increase that? Is there a possibility of seeing longer reads being conducted in the future? With that question, let's scoot forward again. The second test I'm going
to look at involves long-read technology. We've all heard about things like the nanopore sequencers offering potentially long reads, which may be anything from 10,000 bases up to 100,000 bases long, depending on the technology. These are actually in use, but they haven't been applied to genealogy yet, and I don't know how long it may be before they are, because they will be expensive for the kind of applications we would put them to. But if we can get longer read lengths, we can do things like start to bridge those long repetitive sequences which aren't read in the Y tests at the moment, for example the DYZ19 region, where a lot of SNPs have been called but some people ask whether they are reliable, because the degree of repetition in that sequence means we can't accurately sequence them or be absolutely certain where those SNPs actually sit. So longer read lengths may allow more of the Y to be read. They would also allow more calling of STRs: STRs are short, but they may repeat for longer than 150 base pairs, and if an STR is longer than the length of the reads in the test, you're trying to read it by stitching together two or more reads, which leaves some scope for mistakes, for error. So if your reads are longer, you may capture more STRs within the test. It also opens the possibility of de novo sequencing, of compiling a sequence of DNA without mapping it directly onto a reference sequence. We do have the first kind of long-read test currently in progress. These have been eagerly awaited, I think, for about two years; they were first discussed a couple of years ago, and I believe the first results are coming in from a technology known as synthetic long-read sequencing, which is being offered by Full Genomes. They are very expensive, about $2,900 at the moment, because they are at the absolute peak of Y testing, but there have been some initial orders made to test out what they can offer. And essentially
they're not, strictly speaking, long reads, because they still use the 150 base pair reads that we're familiar with, but there is a prior stage where the sequence is fragmented into 10,000-base strands, which are then segregated and labelled; then the fragmentation and reading of the short fragments is carried out, and because those strands were separated and labelled beforehand, it's possible to reconstruct them into what are effectively 10,000 base pair reads. I've no idea how accurate these may be, but it does seem to be a way of achieving a kind of pseudo-long-read effect. And I have a little information here, because the first tests are now being analysed by the project administrators who ordered them. It seems that they're definitely extending the stretch of the Y chromosome that can be read to at least 16 million base pairs, with possibly usable data on up to 20 million base pairs, so we're looking at potentially 80% of the readable Y. The read depth apparently is not very high, so they're not sure this is going to be terribly useful for STR recovery at the moment, but it is increasing the rate of SNP discovery, because more SNPs are being discovered in those other regions. In particular, there's a suggestion from Alex Williamson that it may allow us to identify the true position of the palindromic SNPs. If any of you have a SNP with a ZZ number, as I do, it sits on one of the palindromes, which are these double strands, and at the moment you can't be sure which palindromic arm one of these SNPs sits on; but apparently Alex Williamson believes this long-read technology will allow those to be sequenced and those SNPs possibly to be identified in their true position. So a comment here from Iain McDonald, who kindly informed me about the tests that are coming in: he thinks that if these projections are correct, the long-read tests have the power to make a measurable difference to the trees I know about, including my own, and also a measurable
difference to the accuracy of the ages that we can estimate statistically from the results, which may be 20% more accurate than the current projections. He also goes on to remind us that at the current cost level it's not likely to be a first choice for most testers. However, it's happening, and so there's one new application of long-read technology. So, whole genome sequencing. The headline cost, shall we say, of a whole genome sequencing test, or WGS, is not that far off the cost of the current Y capture tests. We've probably become familiar this year with the potential of WGS testing, particularly through cases like the Buck Skin Girl case and the ability of people to create spoof files which mirror the selection of SNPs on the microarrays; this allows data pulled from a whole genome test to be compared directly with the data derived from the conventional Family Finder, Ancestry or 23andMe autosomal tests. WGS generates massive amounts of data, and the results are often distributed to testers on hard drives, on physical media, or by a very, very big download. And of course there are no WGS databases, as there are with the Big Y, allowing genealogists to compare their results with other WGS tests. But you can spoof the files, and in particular yfull.com are now generating Y chromosome BAM files from WGS data, which can then be compared with the Y BAM files they have on their database, submitted from Big Y and FGC tests. Apparently, once they pull the Y BAM data, they discard the non-Y data. This does raise some questions about submitting a whole genome sequencing file to a third-party supplier, and again you'll have to decide your own risk tolerance for things like that, but YFull do say they discard the non-Y data and generate a comparable Y BAM file from the Y data. So here's where I throw some numbers at you, because if you try to find out about WGS testing, you'll get a multiplicity of cost levels. I'm not
going to talk about the companies and the costs, that's beyond the scope of this particular talk, but there are various levels: tests offered at 15x reads, 30x reads, 50x reads. There are some smaller ones, though I think they're disappearing from the market, as there's a general recognition that 15x is probably the minimum you'd want to order if you want the kind of quality of Y reads to compete with the current Y tests. What I have here are some data about coverage and read depth pulled from the current existing NGS Y tests. I don't have very much information about the Y Elite; what I have here I've taken from the ISOGG Y SNP comparison chart, which I think is now in need of some updating. The two Big Y versions I pulled from files available to me from projects, which are anonymised. Here we have the old Big Y, prior to the hg38 changeover, and it's interesting that before hg38 they apparently had higher length coverage than they do now, and higher median read depths. But it's clear there was also a very wide range of reads across the individual positions in the old tests, ranging from a single read in some positions up to 8,000 in some cases, and there were also far more actual reads in the old Big Y tests than there are now. So it seems that one of the effects, or factors, within the conversion to hg38 was to filter out a lot of reads that probably shouldn't have been in there, that may belong on other chromosomes. What we now have is a slimmer test that may be better targeted on the Y chromosome. So the Big Y at the moment is achieving read depths of around about 20x, covering about 45% of the readable Y, and is estimated, by those who do these kinds of estimations, to be generating a new SNP on average roughly every 125 years. Now, some of you may be sceptical about age calculations on small samples of recent historical data, but of course these averages are calculated going back over several thousands of years, and they are a kind
of an index of the number of SNPs being generated from a test. So let's compare these, then, with some of the WGS tests available. These are coming from tests produced by Dante Labs and YSEQ: Dante are offering 30x WGS tests at the moment, YSEQ are offering 15x, 30x and 50x reads, and additionally the Full Genomes Corporation are also offering tests at 15x and 30x, though I don't have any data on those. Recently some people have taken these tests and posted their basic statistics. (I'll just turn myself off, yes, the battery is gone; I'm delighted to get this thing off my ear before it finally falls away, thank you very much.) I'm trying to put together what may be indexes to give a certain indication of where these are going, and I do have some data, which arrived just this afternoon, on the length coverage of the two YSEQ tests, so I'll just go over the photograph I took earlier. Along the top row we're seeing that Dante Labs are, in two tests, producing around about 25 to 26 million base pairs in their reads. And the 102%? Don't be misled by that, it looks a bit bizarre, but that's what YFull declared in their analysis of this BAM file, presumably based on an assumption that the readable Y is about 25.5 million base pairs long, which this test is actually now exceeding in length. And the other two tests are producing figures very similar to this: the YSEQ 15x and the YSEQ 50x are both producing about 26 million, so rather the same. The median read depth is about 20x for the Dante Labs test, aiming for 30x and achieving 21x; the YSEQ 15x, aiming for 15x, achieves 14x; and the one test I have here of the YSEQ 50x, aiming for 50x, achieves 41x reads. Now, the people taking these tests are suggesting that what the 50x test gains is not necessarily much-improved SNP calling; that seems to be good enough at the 30x level at least, maybe even at 15x. What it is doing is improving the quality of the STR calls. Along the bottom, in the middle rather,
here we see the number of reliable STR calls being made: by Dante Labs, just under 500; by YSEQ, about 520 (YSEQ do offer a SNP verification package as part of their test); and the YSEQ 50x achieving about 550 reliable STR calls. The people who ordered these tests did so because they wanted those STRs: the particular question they were trying to elaborate depended not just upon SNP calls but also upon establishing a very rich bank of STRs, to try to force new branches onto their tree. And along here, again, as I showed you, the Big Y is achieving roughly 125 years per SNP. It seems the Dante Labs 30x is getting that down to about 85, which I believe is comparable to the FGC Y Elite test, though not quite as good as the 50x, because there are probably more single-read or low-read positions in a 30x test which a 50x test would eliminate; and the YSEQ 50x is achieving about 78 years per SNP, again comparable to what Iain thinks the synthetic long-read technology can achieve. So, just to move on away from this, here are some more qualitative thoughts about these tests from the people who shared their data with us. I'll just let you glance over these, particularly from Tim down here. He's suggesting that the SNP results of the 30x are about as good as the 50x, but the 30x finds about 30 fewer reliable STRs than the 50x. So if STRs are what you're trying to pull from these tests, you'd want the extra resolution; if you're just expanding the regions where SNPs are being read, then the 15x or the 30x may offer enough. Why is this important?
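The coverage, read depth and years-per-SNP figures above are connected by a simple back-of-envelope model: the more confidently covered bases a test reads, the more new SNPs it can catch, and the expected years per SNP is roughly 1 / (mutation rate × covered bases). Here is a minimal sketch of that arithmetic; the per-base mutation rate and the depth values are illustrative assumptions only, not the calibrated figures used by YFull or by Iain McDonald's estimations:

```python
from statistics import median

# Assumed, illustrative Y-SNP mutation rate per base pair per year.
# Published calibrations vary; this value is for demonstration only.
MUTATION_RATE = 8.2e-10

def coverage_stats(depths, min_depth=4):
    """Summarise a per-position read-depth list (one entry per mapped position)."""
    covered = [d for d in depths if d >= min_depth]  # confidently covered positions
    return {
        "covered_bp": len(covered),
        "median_depth": median(covered) if covered else 0,
    }

def years_per_snp(covered_bp, rate=MUTATION_RATE):
    """Expected years between new SNPs appearing in the covered region."""
    return 1.0 / (rate * covered_bp)

# Hypothetical tests: ~12 Mbp covered (capture-style) vs ~20 Mbp (WGS-style).
for name, covered_bp in [("capture-like", 12_000_000), ("wgs-like", 20_000_000)]:
    print(name, round(years_per_snp(covered_bp)), "years per SNP")
```

The point of the sketch is the shape of the relationship, not the exact numbers: widening the covered region shortens the average wait between SNPs, which is why the WGS and long-read tests discussed above report fewer years per SNP than the capture tests.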
Well, essentially it's all down to the decision you make about how much money you want to spend. So 50x WGS tests can now be ordered for as little as $700 or $650, and I think the Dante Labs 30x is even cheaper. But of course, if you want more quality, if you want a 50x test, there are decisions to be made initially about what degree of resolution you go for before you decide to order one of these tests, and therefore it may be wise, I think, to wait for the market to settle a bit before you go in, if you're not quite sure what it is you're trying to get from them. But it seems likely, as we move into the future, that we are going to see more and more BAM files being generated from these tests, adding to the bank of SNPs in the places that the Big Y can't reach. I want to finish by just talking about the Big Y 500, or the Panel 6 STRs: the new STRs, from number 389 upwards, added with the Big Y 500. The initial reaction, I think, from most people who've received these results is that they seem to be very stable, very slow-moving, hardly moving at all, so a lot of the data seem very, very samey across projects and across branches. However, there are some thoughts here from Dave Vance, who's been doing a study of these STRs, and he suggests that these Panel 6 STRs are mutating so slowly that they can almost be seen as being rather SNP-like. He seems to see them as almost pseudo-SNPs, and there we are, he gives one example here. In other words, they provide one more source of slow-moving mutation: once you see a mutation, it will probably be stable for long enough for it to be treated as a de facto historical or genealogical branch marker. So the jury is still out, I think, on how useful these new STRs are going to be, and there's still some work to be done on understanding them and finding what can be pulled from them in the future. So that's just a few thoughts about where the future may go. If you're confused, well, actually, so am I.
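The pseudo-SNP idea just described can be made concrete: a very slow-moving STR behaves like a branch marker when its repeat count is uniform inside a candidate branch but unseen among testers outside it. A minimal sketch, using hypothetical marker names and repeat counts rather than real project data:

```python
def pseudo_snp_markers(branch, others):
    """Find STRs that are uniform within `branch` and absent outside it.

    `branch` and `others` are lists of {marker_name: repeat_count} dicts,
    one dict per tester; all testers are assumed to report the same markers.
    Markers returned can be treated as de facto branch markers.
    """
    results = {}
    for marker in branch[0]:
        values = {tester[marker] for tester in branch}
        if len(values) != 1:
            continue  # varies inside the branch: not stable enough
        value = values.pop()
        if all(tester[marker] != value for tester in others):
            results[marker] = value  # uniform inside, unseen outside
    return results

# Hypothetical example: "DYS9999" = 14 marks the candidate branch,
# while "DYS8888" varies within the branch and so is rejected.
branch = [{"DYS9999": 14, "DYS8888": 10}, {"DYS9999": 14, "DYS8888": 11}]
others = [{"DYS9999": 13, "DYS8888": 10}, {"DYS9999": 13, "DYS8888": 11}]
print(pseudo_snp_markers(branch, others))
```

In practice you would also want the branch to be confirmed by conventional SNPs before trusting such a marker, since even a slow STR can back-mutate; the sketch only captures the "uniform inside, different outside" pattern Dave Vance describes.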
I've been watching the prices of the WGS tests for a year now, wondering: should I do it, should I go in now, should I wait? If I do it, which one would I order? What do I really want to know? I think we're still at that kind of stage, but some clarity is now beginning to emerge from the people who are very kindly posting up their statistics and their test results. We're now beginning to see what is possible, what can be done with these tests, and I think they're fairly shortly going to become part of the scenery of Y NGS testing. I don't think they will replace the existing NGS tests for the time being, but they will definitely pose a challenge to them, and of course prompt a rethink for all of us: if we are going to spend money on these kinds of tests, do we want to do it in such a way that we close off data that we need to have? Maybe we should spend the money to get the best test we can, to achieve the most data we can, in a single one-off test. So thank you very much, that's all for today. Thank you very much, John. Are there questions for John? We have about five or ten minutes left for questions. Darryl is saying no, but we can be kicked out. Okay, have we packed up everything yet? James, speak into my microphone. John, on the Big Y and Big Y 500: I saw your slide had 125 years average for the second model, with a question mark. Is that question mark because we don't know, or is it because the initial findings are that it is the same? No, it's because I think we can't assume that the rates will be the same as the old Big Y; I think it's going to be quite different, much better. Does anybody find any use for these extra STRs?
Well, I think it's just what Dave Vance is saying here. In the discussion this thread came from, he's actually pointing to a particular case which has been useful to someone: one case where someone has a SNP-like STR that can be used as a branch marker. Great. Okay, well, we'll have to call it an end there. I just want to say a big thank you to John Cleary for a wonderful presentation, and to all of you for attending. Thank you.