Hi, everyone. My name is Carl, and I'm sure you're all just as tired as I am. I've been in conversations with multiple storage and compute vendors for the last eight hours, so I feel great about that. And I know you've all been here in a room, although the air conditioning in here is much better than the room I was in, so I'm a little jealous of that. Anyway, you've actually met at least one of the people from my group: Greg, over there, one of the TAs. My group is located at UHN, and we service all of the genomics data that comes in at UHN, which includes both research sequencing and clinical sequencing. We have a group of bioinformaticians, Greg being one of our leads, who run the clinical bioinformatics workflows and do all the validation and testing.

Today we're not really going to do a hands-on session. Instead, I want to create some awareness of the issues you run into when you start trying to lock things down for clinical use as opposed to just research use. That's the topic for today. If anybody has any questions, please just interrupt as I go along.

The learning objectives of this module are: to gain insight into the complexities of clinical accreditation; to understand how the goals of research genomics differ from those of clinical testing; to appreciate the importance of validating not only the results you get, but also the software and the infrastructure the software runs on — which versions of UNIX or Linux you're running, all the different libraries involved — and making sure all of it is documented, because it has to go into your documentation for a clinical test; and to understand some of the problems and pitfalls you can run into with panel-based testing. Today we're mostly going to talk about cancer, because most of our clinical testing is done in the context of either clinical research trials, which are primarily cancer-focused, or general standard-of-care testing done in the clinical labs.

Here's a short list of some of the factors we have to consider when we take a test from the research domain and turn it into a clinical test. Archiving: once you have the data, how many copies do you keep, how do you protect it, are you using techniques like erasure coding, what's the lifespan of the data, how long do you have to retain it, and how do you set the policies for retention and for where it's stored? Communication between the lab and the bioinformatics staff: we've set up a ticketing system specifically for communication between us and the clinical lab so we can log everything, and some of those logs are actually used as part of our validation when we go back to our accreditation agencies. Security is obviously a big one, because the data we deal with is not de-identified at all: the files we get often contain patient names and MRNs, the medical record numbers assigned to patients when they're admitted to the hospital. So we have to enforce strong security on access to those files. How do we audit that security?
How do we audit access to those files, and all of those kinds of things? And then there's the fact that we use a really vast array of panels for different types of tests. We might use Illumina amplicon-style panels for certain tests, but when you get a pathology specimen with a very small amount of tissue — a core from an FFPE block, say — the Illumina assay just doesn't work at that low a sample input, so you have to use something like the Ion Proton instead. That means we then have to come up with a protocol, an SOP, and a validation document for running the test on the Ion Proton as well. And then we look for ways to standardize things to make our lives a little easier, especially around the pipelines, so that we're not constantly writing new pipelines and can take a more modular approach. Those are some of the factors we have to consider.

Now for the goals and considerations. In research, reproducibility is obviously a huge factor, along with sample quality and cost. Research can be very exploratory in nature; it's not always hypothesis-driven, and that's typically the realm of discovery-type applications. When we move over to clinical standard of care — say somebody comes in with a cancer and needs BRAF mutation testing, or BRCA1/BRCA2 mutation testing — the real goal is actionability: finding mutations for which there's an actionable case to treat the patient a certain way, or to trigger some kind of screening mechanism. Similarly, when we run clinical trials, bringing people in for new drugs coming through different development pipelines, all of that falls under the clinical envelope.

And we have to meet a whole set of requirements. For example, CAP/CLIA and OLA: CAP is the College of American Pathologists and CLIA is the Clinical Laboratory Improvement Amendments, both American accreditations, which hospitals need if they're accepting samples from the United States. In Ontario we usually fall under OLA, the Ontario Laboratory Accreditation program. As I said before, we go through a big process of validating our infrastructure, software, and bioinformatics workflows; we have to make sure everything is tracked, our pipelines are versioned properly, and we have SOPs for everything. PHIPA, Ontario's health-privacy legislation, is one of the major privacy envelopes we have to follow as well, and that falls under the security realm I was talking about before.

We also have to think about how we interact with other labs. Often, when you're developing tests in a clinical lab, you're swapping samples from one lab to another. In fact, for CAP/CLIA inspections, they'll actually send us bioinformatics data — FASTQ files — and we have to run them through our pipeline and send back a matrix of our results. They then compare those against what they know to be the gold-standard truth and score us on it. That's voluntary at the moment, but it probably won't be voluntary for much longer. And really, what we're trying to nail down here is a couple of things.
The main one is to make the test as black and white as possible, as opposed to the fuzzier research side of things, and also to drive down the cost and the turnaround time from when the sample arrives in the lab to when it finishes the bioinformatics pipeline. We make sure we have a backup system to run the pipeline on in case one goes down, and all those kinds of things, so that we can enforce a strict turnaround time on sample processing. Are we good, everybody? Any questions at all?

So, as I said before, clinical accreditation means OLA in Ontario, and CAP and CLIA for U.S. samples. One of the things we have to do is create a lot of documentation, all of which has to be approved, signed, and dated by the laboratory director. That sounds trivial, but our validation documents run to hundreds of pages for some of the panels we've validated, and even a simple SOP for running a small panel can be 30, 40, 50 pages long — I'll show you an example. The documentation includes all the pipeline details, of course, the standard operating procedures, and the procedures for updating the panel. If you take a component out — say you're using a version of SAMtools or IndelRealigner and you want to swap in a new version while everything else stays the same — you have to have a process in place for documenting that change and how it goes into your pipeline, so that it's completely transparent which version any particular sample was run with. And then of course there are the data security policies, and we keep lots and lots of logs of everything, so we know everything that's happened to a sample from the minute it comes into our hands to when we deliver it back to the molecular pathologists, who then sign out the case on the clinical side.

Going into the logs, for example: today I think you've been running the command lines from scratch, one step at a time. What we do is take all of those command lines and wrap them up into bigger scripts, written in Python, with some in Bash, to spread the work out across our cluster so jobs can compute at the same time and we can optimize our turnaround time. There are a lot of little tricks in that regard. And because we're running a lot of samples at the same time, we have to keep track of the exit status of each job. So when a job ran — say your SNP caller — you need to know that it exited cleanly on that particular sample, and that all steps for that sample exited cleanly with no errors, and you have to check at the end that every sample met those requirements. If they didn't, you have to go back, trace through, figure out what happened, rerun the pipeline, and document that you had to do so. That happens occasionally, and for a variety of reasons: a problem with the sample file itself, or a node on the compute system just dies, or gets overloaded, or runs out of memory, the job dies there, and we have to go back and make sure it's all cleaned up. Something like the sketch below gives the flavour of that kind of check.
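To make that concrete, here's a minimal sketch of that kind of end-of-run check, assuming the wrapper scripts write one row per cluster job to a CSV log with sample, step, and exit code. The log format, names, and data are hypothetical stand-ins, not our actual pipeline.

```python
"""A minimal sketch (not the actual pipeline) of the end-of-run check
described above: confirm that every step of every sample exited cleanly
before the run is signed off. The one-CSV-row-per-job log format is a
hypothetical stand-in for real job logs."""

import csv
import io

# Stand-in for a log the wrapper scripts would append to as each
# cluster job finishes: sample, pipeline step, and the job's exit code.
FAKE_LOG = """sample,step,exit_code
S001,align,0
S001,snp_call,0
S002,align,0
S002,snp_call,1
"""

def find_failures(log_handle):
    """Return (sample, step, exit_code) for every job that did not exit 0."""
    return [(r["sample"], r["step"], r["exit_code"])
            for r in csv.DictReader(log_handle)
            if int(r["exit_code"]) != 0]

failed = find_failures(io.StringIO(FAKE_LOG))
if failed:
    for sample, step, code in failed:
        # These samples must be traced, rerun, and the rerun documented.
        print(f"FAIL sample={sample} step={step} exit={code}")
else:
    print("All samples passed all steps; run can be signed off.")
```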
So there are a lot of details, and that's what Greg gets to do — he's laughing over there because I'm making it sound like five minutes of work, but it's more like six months of work to do a panel like this, basically. What's that? Push the button and make it go? I wish, yeah.

Here's an example of the level of detail we have in one of our SOPs. This is one part of the document, and it's very detailed: notification of bioinformatics staff upon the completion of a NextSeq or MiSeq run. The lab staff responsible for the run have to notify our staff using the ticketing system; here's the email address the ticket has to go to; the email has to include the following information; then proceed to the next step. It's really dumbed down — I could probably get my mother to run this — because that's the point: anybody should be able to sit there and blindly follow the SOP, with no idea what they're doing, and still be able to launch that pipeline and have it all documented at each step. That's the purpose of the SOP.

Now, for the panels: here's an example of a typical workflow we use, and we have different workflows for the different types of samples that come into the lab. If a paired tumour/normal sample set comes in — which is obviously preferred, because then we can remove the common variation present in the normal portion of the person's genome, as compared to the cancer portion — we have a particular pipeline we run those samples through, and I'll go into that in a little more detail in a second. But you can see here that we don't just run MuTect, for example, which is the standard tool for calling variants in cancer samples when you have a paired tumour/normal set. We actually run VarScan for the indels, because we've found through experience that VarScan does better on indels; we use MuTect to look for the somatic changes; and for the germline SNPs we use VarScan as well. So we use a combination of tools to try to get at what we think is the truth for that particular sample. The same goes if we have a single tumour or blood sample: we have a workflow specifically for that.

And you might be wondering, looking at the paired tumour/normal processing on the left side there, why we actually run three different variant callers on the same pair of samples. The reason is on this slide. In the clinical lab we typically have a threshold for germline calls — a variant found in the blood at 20 percent minor allele frequency — and we use filters to remove common variants found in 1000 Genomes, ESP, ExAC, and the like. A rough sketch of that kind of germline filter follows.
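As an illustration of that rule, here's a hedged sketch of a germline filter with a 20 percent allele-frequency threshold and a common-variant cutoff. The 1 percent definition of "common", the function names, and the input values are assumptions made for the example, not validated parameters from our pipeline.

```python
"""Illustrative sketch of the germline filtering rule above: keep a
blood (normal) variant only if its allele frequency meets the 20 percent
threshold and it is not a common population variant."""

GERMLINE_MAF_THRESHOLD = 0.20  # the stated germline reporting cutoff
COMMON_POP_FREQ_CUTOFF = 0.01  # assumed definition of "common" in 1000G/ESP/ExAC

def passes_germline_filter(variant_af, population_af):
    """variant_af: allele fraction observed in the normal sample.
    population_af: highest frequency across 1000 Genomes / ESP / ExAC,
    or None if the variant is absent from all three."""
    if variant_af < GERMLINE_MAF_THRESHOLD:
        return False  # below the reporting threshold
    if population_af is not None and population_af >= COMMON_POP_FREQ_CUTOFF:
        return False  # common polymorphism, filtered out
    return True

# A 45% AF variant absent from the databases is kept; a 50% AF variant
# seen at 12% frequency in ExAC is filtered as a common variant.
assert passes_germline_filter(0.45, None)
assert not passes_germline_filter(0.50, 0.12)
```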
The problem is that if you have a very low-frequency minor allele in the normal, MuTect is so sensitive that it won't call a somatic variant if it sees even a very low-frequency SNP at that same location — even if that's probably just a machine error, just background noise. And this can be a major problem. Take something like BRCA1: at the end of the day, in a cancer specimen — say the person is on a PARP-inhibitor trial or something like that — we don't really care whether the mutation is germline or somatic; we just want to know they have a BRCA mutation that's predicted to be deleterious. But if you ran only MuTect on that tumour specimen, you wouldn't see anything — that variant wouldn't come up at that location — because of a very, very low-frequency error in the germline. So, in order to pick those up, we actually have to run multiple different pipelines on each specimen, separately, both in paired mode and in single mode, to try to get at the truth of what's actually there in the specimen.

[Audience: So you don't call it if all the callers agree?] Well, actually, in this case they don't agree, right? Or rather, they agree, but one of them is wrong. When you're running the germline caller, an allele in BRCA1 at a minor allele frequency of five or six percent is almost certainly a mistake — all the ones we've looked at have been mistakes; when re-verified in the lab by running the sample on an orthogonal technology, they've turned out to be sampling or sequencing artifacts. But because the artifact is there, it causes MuTect to make a mistake in calling, or rather not calling, that location. [Do you want to repeat the original question?] His question was whether we're just looking for the consensus of all three callers. It would be great if it were just consensus, but actually we're looking for consensus where the consensus may not be true: the germline call is actually a false positive, and that's what we're trying to weed out. A toy sketch of the paired-versus-single cross-check is below.
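Here's that toy sketch, under the assumption that you simply compare the variant keys produced by a paired-mode run against a tumour-only run and send the discrepancies for manual review. The data and helper names are invented for illustration; the real comparison involves far more bookkeeping.

```python
"""Toy sketch of the paired-versus-single-mode cross-check described
above. Variants are keyed by (chrom, pos, ref, alt); the call sets stand
in for the output of a paired run (e.g. MuTect) and a tumour-only run.
All coordinates here are invented."""

def flag_suppressed_calls(paired_calls, tumour_only_calls):
    """Sites called in tumour-only mode but absent from the paired output:
    candidates where a low-level artifact in the normal may have caused
    the paired caller to reject a real mutation."""
    return sorted(set(tumour_only_calls) - set(paired_calls))

paired = {("17", 41245466, "G", "A")}
tumour_only = {("17", 41245466, "G", "A"),
               ("17", 41244936, "GA", "G")}  # seen only in single mode

for site in flag_suppressed_calls(paired, tumour_only):
    print("flag for manual review:", site)
```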
Going back a little, this is another example of the types of things I was talking about before that we need to document. This is the very detailed informatics flow of the data — how we actually take data and move it from one place to another. It's all really kind of boring stuff, but it has to be documented very well so you can produce it for an inspection; when they do these snap inspections, they'll sometimes ask you for this documentation.

Let me blast ahead here, right into a case study. We run a panel at UHN that we designed in-house, called the Hi-Five panel: 555 genes related to cancer, used for screening across a variety of indications. For most of these we're looking at mutations that may be suitable targets for specific drugs, or that can bin patients into populations of responders versus non-responders, and things like that.

So when we were tasked with validating this particular panel — we'd obviously run it many times on the research side, but then we had to bring it over to the clinical side — we had a strategy. First, we have data from an orthogonal panel: the 48-gene TruSeq Amplicon Cancer Panel, run on the Illumina, where the same genes are covered, and those 48 genes are part of our 555. Many of those genes are also part of the Comprehensive Cancer Panel, which is an Ion Torrent product from Life Technologies. Then we have public data — for example, a Coriell cell line that we sequence in-house, for which there's a truth set of what a bunch of labs around the world have agreed are the true indels and SNPs in that cell line. And then we also use synthetic data, which we can create using tools like BAMSurgeon: we spike mutations into different places along different genes and measure our recall rate when we run it back through our entire pipeline. So we use a combination of approaches to try to prove that we have the best of breed for calling the SNPs and indels from a particular sample.

The problem is that it's never perfect. Even looking at some of the data today, you'll see — or maybe you've already seen — lots of weird little edge cases, and when you start comparing multiple panels across different technologies, there are lots of edge cases, false positives, and false negatives that you have to weed through. This is what we spend months and months agonizing and arguing about, trying to figure out how to deal with them. Often, when we have something where we don't really know the truth, we'll go back to the lab and have them re-sequence it, maybe on another technology, maybe even with Sanger sequencing, to see if it can be picked up by another method.

One of the problems with comparing something like the TruSeq amplicon panel and the Comprehensive Cancer Panel is that, because they're two different technologies, they're run through two different pipelines, and when you do that you often hit these weird little gotchas. If you're comparing all the variants found with one technology against another, and you haven't done what we're talking about here — left justification, or left alignment — you may think you're looking at different variants when you're actually looking at the same variant. For example, take this sequence here with CAG, where the variant is the deletion of the A. Different aligners and different callers can give you a VCF that reports that as a different type of variant at different locations: it could be reported as CA going to just C, as AG going to just G, or as an A with a deletion. So what we have to do — and there are different tools out there for this — is take all of those variants in the VCFs and left-align them so they all come out as CA to C at that same location, independent of which technology produced them. If you don't do that, you're going to end up with a lot of headaches. A minimal sketch of the left-alignment idea follows.
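For the curious, here's a minimal sketch of the normalization loop itself: trim matching right-hand bases, extend both alleles leftward from the reference when one empties, then trim shared leading bases. The reference string and coordinates are made up for illustration; real tools operate on whole VCFs against a genome FASTA and handle many more cases.

```python
"""A minimal sketch of indel left-alignment, assuming a 0-based position
into a reference string. This only shows the core normalization loop."""

def left_align(pos, ref, alt, reference):
    """Shift a variant to its left-most equivalent representation."""
    while True:
        # Trim a shared right-most base, e.g. AG>G becomes A>"".
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            continue
        # If an allele emptied, extend both left from the reference.
        if (not ref or not alt) and pos > 0:
            pos -= 1
            ref, alt = reference[pos] + ref, reference[pos] + alt
            continue
        break
    # Trim shared leading bases, keeping at least one anchor base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt

REF = "TTCAGGT"  # hypothetical context containing ...C A G...
# Two reports of the same deletion of the A normalize identically:
print(left_align(2, "CA", "C", REF))  # -> (2, 'CA', 'C')
print(left_align(3, "AG", "G", REF))  # -> (2, 'CA', 'C')
```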
Here's another example — Greg loves this one; he found it — and it's a place where you can really get lost in the weeds. You can see there's a large deletion on the Illumina, up top, at one location. The Ion Torrent reports a deletion in another area just to the left of that, with variants just to the right of it. In fact it's the same deletion on both panels: if you look at the detailed sequence of ACGTs down below, there's a larger repeat in that area, a portion of it has been lost, and the way the aligners have worked, you end up with two apparently separate deletions even though they're actually the same thing. You can really get lost in some of these edge cases, and unfortunately, when we're talking about clinical samples, if it's in an important gene, in an important area of that gene that could have clinical relevance, we have to do the deep dive to sort that sort of thing out.

Then there are machine-specific artifacts, which I'll show you a bit more of in a second. Here, for example, is the same sample run on two different technologies, one on the Illumina and one on the Ion Torrent. The Ion Torrent calls a variant in that region, but there's nothing on the Illumina, and when we go back and validate using Sanger sequencing or another sequencing run, the Illumina turns out to be the truth. But if you were running only the Ion Torrent, you might think there's a variant there. That's another one of those gotchas — we call these sequencing-machine-related artifacts — and we came up with a little way to visualize them.

What you're looking at here is a panel of about 100 samples. On the x-axis is the call percentage — the fraction of samples in which a particular variant was called — and the minor allele frequency is on the y-axis. The degree of shading of each spot represents the depth of coverage for that particular variant. You can see that right around 0.5 there are a lot of calls, which makes sense. But then look at the corners: in the top right and bottom right there seem to be little clusters. We can separate that out into point mutations versus indels — indels in green, SNVs in pink — and then remove the variants commonly found in the population using databases like ExAC, 1000 Genomes, and dbSNP. You can see that nicely removes all the things that were probably just common SNPs in the population. But what's left, especially in that bottom right-hand corner, is a bunch of mutations at low minor allele frequency that are found in almost every single one of our samples — and those are machine artifacts. So as part of our clinical pipeline we keep track of a lot of these artifacts and use them as a further filtering step, to make sure they don't bleed through into something that might get called out on the molecular pathology side. A sketch of that recurrence screen is below.
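A hedged sketch of that screen might look like the following: count, across all samples on a run, how often each variant appears and at what allele frequency, and flag anything that shows up at low frequency in nearly every sample. The 90 percent call-rate and 10 percent allele-frequency cutoffs are placeholders, not our validated thresholds, and the data is invented.

```python
"""A hedged sketch of the recurrence screen: variants seen at low allele
frequency in nearly every sample on a run are flagged as likely machine
artifacts. Cutoffs and the input structure are placeholders."""

from collections import defaultdict

CALL_RATE_CUTOFF = 0.90  # assumed: called in >=90% of samples
LOW_AF_CUTOFF = 0.10     # assumed: never above 10% allele frequency

def build_artifact_list(sample_calls):
    """sample_calls: {sample_id: [((chrom, pos, ref, alt), af), ...]}.
    Returns variant keys to add to the run's artifact filter."""
    n_samples = len(sample_calls)
    afs_by_variant = defaultdict(list)
    for calls in sample_calls.values():
        for key, af in calls:
            afs_by_variant[key].append(af)
    return [key for key, afs in afs_by_variant.items()
            if len(afs) / n_samples >= CALL_RATE_CUTOFF
            and max(afs) <= LOW_AF_CUTOFF]

# Tiny made-up example: an artifact-like variant in 3 of 3 samples at ~5% AF.
calls = {
    "S1": [(("7", 140453136, "A", "T"), 0.48), (("3", 1001, "G", "C"), 0.05)],
    "S2": [(("3", 1001, "G", "C"), 0.06)],
    "S3": [(("3", 1001, "G", "C"), 0.04)],
}
print(build_artifact_list(calls))  # -> [('3', 1001, 'G', 'C')]
```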
It also helps with the original example I showed you before. These artifacts are at a low minor allele frequency, so we may never have actually called them in a clinical setting; but if there's a real SNP at that position in the actual cancer sample and the artifact is in the germline, they'd mess you up — you wouldn't detect the real variant if you were using only MuTect in paired tumour/normal calling mode.

So that's that, and this is basically it. The summary is: it's a lot of pain to translate something from the research setting, where it's fun to play, into a clinical test that can be accredited, that you can document, and that has everything in place. It can be painful to do. Any questions at all?

[Question: Do the accreditation agencies require things like synthetic data for validation?] They require to see your validation document, but they don't specifically say you need to use synthetic data or the Coriell cell lines. They want to see the methodology you used to show that your pipeline works in your hands, basically. That can be done through sample swaps with other places. I think at UHN we actually go a little overboard, and we've detected a lot of these problems that a lot of smaller clinical labs — which just don't have bioinformaticians on staff — would never find. We kind of take it to the next level. But fundamentally they just require a lot of documentation, and because the field is so in flux, they're still not entirely sure themselves; the accreditation documents they require keep changing on us as we move forward, all the time, in this particular area.

[Question: Are all the clinical labs reinventing the wheel on their own, or is there coordination?] I see the pathologist in the room over there having a good smile. That's a good question, and I think a lot of them are reinventing the wheel, and a lot of them are basically just trying not to look out the windows while they're driving — they just don't want to know, so they just run these things through. And I don't want to say anything bad about our lab, but when we initially started getting samples from them and looked at the pipeline they were running — a turnkey piece of software, which I won't name, where you just put your sample in and get a result at the end — we found so many mistakes in the way it was calling variants. And that piece of software is commonly used in clinical labs. Now, does that mean it's a problem? It does in a lot of edge cases, especially at a clinical research hospital like UHN, where it can make a big difference, because we're looking at more cutting-edge things, at variants in places other people may not look. But if you're looking for a standard germline hereditary BRCA mutation, those tools probably work very well for 99 to 99.5 percent of the population being tested. So it's a murky field right now.

[Question about turnaround time.] Yeah, it doesn't take very long. Our turnaround time, Greg, is what for most panels?
About two to three days, and the smaller panels within a day, because the smaller panels that only have something like 24 genes on them you can turn around in a day, no problem. Now, that doesn't account for the interpretation side of things; our group doesn't handle interpretation. What we deliver back to pathology is a VCF, which they load into a piece of software they use for annotation, and then an annotation team goes through all of those variants and filters them — they have a big flowchart for how they filter all the variants found, to be able to make a final call on that particular sample. That process is more manual and labour-intensive, so it takes them a little longer. But on the pure data-processing side, our turnaround time is very fast.

[Question.] So in a clinical testing situation it definitely becomes tricky, especially for cancer samples, where you're dealing with lots of variation. I think when you're dealing with well-known locations that have been validated — commonly known in the literature, present in all of the different online databases, and visible at a good minor allele frequency, especially germline at a minor allele frequency of 50 percent — there's no problem going to that level. It's when you start looking in the unknown areas, and especially in cancers, that it gets to be more problematic. In fact, although we run the 555-gene panel in production right now, the actual number of genes signed out by the pathologists is a very small subset of those 555. But I think it's important that it can be modified, too — yes, you find all sorts of stuff, and for sure we are. There are lots of cases, as you were mentioning last week when we had that conversation about cytogenetics, of a gene — I forget which one; it wasn't ROS1, I can't remember — that's well known, but is now known not to be a very good indicator of one particular cancer. So it is a muddy area, and we just try to follow best practice as well as we can, do as comprehensive a job as we can, and really chase down a lot of the edge cases to try to sort the wheat from the chaff.